This repository is a collection of functions designed to deal with values measured over an interval, typically as a time-integrated average.
Some context: One approach to measuring particulate matter such as PM2.5 would be to actively collect PM2.5 on a filter (at a constant flow rate) over a two-week period. Information about instantaneous PM2.5 concentration over that two-week period is not retreivable from the filter, but a very accurate measurement of the filter’s mass can be used to calculated an accurate 2-week time-integrated average. At a single location, we might measure PM2.5 over several repeated two-week periods. This measurement approach is much more accurate than non-gravimetric approaches, but has the obvious limitation that the measurement is over an extended time period. If the measurement period (in this case, 2 weeks) is small relative to the period over which we’re interested in PM2.5 concentrations (typically, months to years), then the time-integrated average issue becomes negligible.
However, if filters at different locations are measuring concentration for intervals of variable length or on non-aligned schedules, the resulting dataset would need to be processed to align measurements into a consistent set of intervals (e.g., adjacent, non-overlapping, recurring intervals of fixed length).
The following R code is designed to address this data processing step. It is generalizable to any measures taken over periods (not just air pollution data).
interval_weighted_avg_f is a function which takes values recorded over non-overlapping intervals and averages them to defined intervals, possibly within individuals/monitors/locations.
this function uses the data.table package.
Arguments:
- x: a data.table object containing measuresments over intervals which must be completely non-overlapping (i.e periods can’t even be partially overlapping) within groups defined by group_vars. if group_vars is specified (non-NULL), BOTH x and y must contain columns specified in group_vars.
- y: a data.table object containing intervals over which you’d like averages of x-measures computed. y intervals, unlike x intervals, may be overlapping. if group_vars is specified (non-NULL), y must contains those group_vars column names, and this would allow different averaging period for each subject/monitor/location.
- interval_vars: a length-2 character vector of column names in both x and y. these columns in x and y should be all numeric or all Dates.
- value_vars: a character vector of column names in x, to be averaged
- group_vars: a character vector of colum names in x and in y (note that this is a change from a previous version) specifying subjects/monitors/locations within which to take averages. can by NULL, in which case averages are taken over the entire x dataset for each y period.
- required_percentage: the percentage of non-missing, measured x-observations in periods defined by y for the resulting measure in the return to be nonmissing. by default, 100 (any missing observations will result in an NA).
- skip_overlap_check: by default, FALSE. setting this to TRUE will skip internal checks to make sure x intervals are non-overlapping within groups defined by group_vars. intervals in x must be non-overlapping, but you may want to skip this check if you’ve already checked this because it is computationally intensive for large datasets.
returns a data.table object. Rows correspond to intervals from y, separately for groups defined by group_vars. nrow of the return will be nrow of y because the function is designed to behave like a right inner join on intervals (and optional additional grouping variables) with an average. Rows of y where an average cannot be calculated (either due to the measurements being present in X but NA or the measurements not being in x at all) are still returned with value variable columns set to NA (see the required_percentage argument for more details). Columns of the returned data.table are as follows
- grouping variables as specified in group_vars
- interval columns corresponding to intervals in y. columns are named the same as they were in x and y.
- value variable columns from x, averaged to periods in y. named the same as they were in x
- yduration: the length of the interval (ie as a count) specified in y
- xduration: the total length of the intervals (ie as a count) from x that fall into this interval from y. this will be equal to yduration if x is comprehensive for (ie, fully covers) this interval from y.
- nobs_<value_vars>: for each value_var specified, this is the count of non-missing values from x that fall into this interval from y. this will be equal to xduration if the value_var contains no NA values over the y interval. If there are NAs in value variables, then nobs_<value_vars> will be different from xduration and won’t necessarily be all the same for each value_var.
- xminstart: the minimum of the start intervals in x used in averaging returned y intervals, within groups. If the start of the earliest x interval is less than the start of the y interval, the minumum of the y interval is returned. Note, this is the minimum start time whether or not value_vars were missing or not for that start time. If you really need non-missing minimum start times, you can remove missing intervals from x prior to calling interval_weighted_avg_f (calling this separately for each value_var).
- xmaxend: the maximum of the end intervals in x used in averaging returned y intervals, within groups. Again, like for xminstart, this does not pay attention to whether the interval in x had non-missing value_vars.
Additional notes:
- When required_percentage is less than 100, xminstart and xmaxend may be useful to determine whether an average meets specified coverage requirements in terms of span (range) of the y interval.
- Warning, this function will reorder the rows of x and y. This occurs because data.table objects to funtions are pass by reference for efficiency (R is normally pass by value). If this is a concern, a couple of quick edits would maintain the order (by reordering at function completion).
library(data.table)
source("timeperiod_functions.R")
First let’s start with some measurement data collected on a regular schedule where each interval is 5 units in length. Let’s say we want to average this measurement data to a long interval containing measured intervals (0-30),as well as to a specific set of intervals corresponding to a schedule where each interval is 7 units in length.
(x <- data.table(value1=c(1,2,3,2,1),start=c(1L,6L,11L,16L,21L),end=c(5L,10L,15L,20L,25L)))
## value1 start end
## 1: 1 1 5
## 2: 2 6 10
## 3: 3 11 15
## 4: 2 16 20
## 5: 1 21 25
(y <- data.table(start=c(0L,0L,7L,14L,21L),end=c(30L,6L,13L,20L,27L)))
## start end
## 1: 0 30
## 2: 0 6
## 3: 7 13
## 4: 14 20
## 5: 21 27
interval_weighted_avg_f(x,y,interval_vars=c("start","end"),value_vars=c("value1"))
## [1] "2020-02-23 19:48:12 passed errorcheck: x is non-overlapping."
## start end value1 yduration xduration nobs_value1 xminstart xmaxend
## 1: 0 6 NA 7 6 6 1 6
## 2: 0 30 NA 31 25 25 1 25
## 3: 7 13 2.428571 7 7 7 7 13
## 4: 14 20 2.285714 7 7 7 14 20
## 5: 21 27 NA 7 5 5 21 25
Note that average value is missing (NA) for periods which are not completely observed in x. By using looser competeness requirements we can get values (although note that these are not truly the average if we consider value1 to be an unmeasured random variable at non-measured timepoints):
interval_weighted_avg_f(x,y,interval_vars=c("start","end"),value_vars=c("value1"),required_percentage=.8)
## [1] "2020-02-23 19:48:12 passed errorcheck: x is non-overlapping."
## start end value1 yduration xduration nobs_value1 xminstart xmaxend
## 1: 0 6 1.166667 7 6 6 1 6
## 2: 0 30 1.800000 31 25 25 1 25
## 3: 7 13 2.428571 7 7 7 7 13
## 4: 14 20 2.285714 7 7 7 14 20
## 5: 21 27 1.000000 7 5 5 21 25
It would also be possible to calculate multiple averages simulataneously or calculate averages for different y-intervals for different sets of observations defined by a group_var.
See tests.R for more examples.
This function merges periods in y to periods in x by interval variables and possible grouping variables using foverlaps. The goal is to calculate averages for values in y, so y is actually passed to the first argument of foverlaps. This confusion is due to inconsistency in how x[y] syntax is a right join whereas foverlaps(x,y) is a left join (this is probably due to foverlaps replicating an existing bioconductor function). Anyway, I chose to have interval_weighted_avg_f(x,y) be a right join to match x[y] rather than foverlaps. foverlaps is called internally even though there’s an equivalent non-equi x[y, on=] join because foverlaps is slightly faster at producing all the columns needed to do the calculates.
Specifically, in order to keep track of counts of observe time (which become weights in the weighted average function), I need to calculate the duration of overlap within a joined interval. It’s surprising that foverlaps doesn’t include this, but it needs to be calculated explicitly in my function using pmin on both of the end intervals (which becomes inteval_end) and pmax on both of the start intervals (which becomes interval_start). The differerence (interval_end-interval_start+1) is the duration (“dur”) of an overlapping joined interval. The sum of these durations across a y interval is the total count of unique units of time present in x that overlap with that y interval (note that x intervals need to be non-overlapping within groupings). Therefore, these units of time (“dur”) can be used as weights in a weighted average.
The value variables in the resulting data.table are averaged in a group-by statement (by periods defined in y as well by optional grouping variables). Specifically, the average is a weighted average that takes into account the duration of an observed measurement in a desired averaging period (y interval). The function is written to make use of data.table’s Gforce capabilities. data.table’s gforce efficeintly does group-by operations directly in C rather than in R. Since data.table’s ability to detect an optimized group-by function call is limited, using gforce for a weighted mean (and even just for an abritrary number of value_vars) requires use of some nonstandard evaluation. In this context, the non-standard evaluation is constructing a data.table call via paste and calling it with eval(parse(text=<>)).
An additional consideration is that since the function is written to accept data.tables with arbitrary columns, and in the joined data.table z there could be naming conflicts between user-supplied columns and constructed columns, all user-supplied columns are renamed to placeholder names until the end of the function. This avoids potential issues with naming conflicts with temporary internal columns that are not included in the output. However, output columns (xduration, yduration, xminstart, xmaxend) are reserved column names and can’t exist as columns that need to end up in z (specifically, they can’t be in group_vars, interval_vars, or value_vars). Since only group_vars,interval_vars, and value_vars go into the foverlaps merge, columns named xduration, yduration, xminstart, and xmaxend are allowed to be other columns of x/y as long as they’re not group_vars, value_vars, or interval_vars.
Memory: The headline here is that if you run into memory issues you can always do a for loop iterating over group vars. There’s an example of this in tests.R.
In greater detail: This function unfortunately does create a large allocation (the intermediate joined table z). Ideally the function would be written to avoid this large allocation by doing an immediate by=.EACHI (using an x[y] non-equi join). Unfortunately I’ve tested this approach and while it avoids the large memory allocation, it’s much slower due to pmin/pmax needing to be calculated within groups defined by .EACHI. If foverlaps directly produced the equivalents of the interval_start/interval_end variables, and if foverlaps supported immediate group-by operations then this function could be memory-efficient and relatively fast. That approach might still be slower than the current approach since GFORCE isn’t supported for by-without-by/.EACHI joins. Conceptually by-without-by creates each group and immediately summarizes that group before iterating to the next group. I think that .EACHI is probably conceptually at odds with the idea of GFORCE since GFORCE saves time by doing group-by operations in C on a bunch of groups at once. Still, there might be an eventual possibility for trading some speed for avoing the large memory allocation.
This is a more algorithmicly straightforward but more memory intensive version of interval_weighted_avg_f. Specifically, it expands all individual measurements over an interval into repeated observations of that measurement for each discrete time increment in that interval. Since the resulting data is now observed on time-increments which are equal in length, weighted averaging is no longer necessary–a simple average can be used. Since this approach expands the data, it is not appropriate for replacing sql-pipilines for large data in R and will result in out-of-memory errors. Due to the mission-critical work that the interval_weighted_avg_f function will be performing, I wrote interval_weighted_avg_slow_f as a secondary testing algorithm. Numerous tests invovling different data scenarios comparing interval_weighted_avg_f and interval_weighted_avg_slow_f return identical results.
The remove_overlaps function takes measures over intervals which may be partially-overlapping and breaks these intervals into non-overlapping intervals and exactly overlapping intervals. This would allow you to then average values over exactly-overlapping intervals. useful if you have overlapping monitoring data at a single site.
- timeperiod_functions.R contains the source for the functions.
- The example file generates some example data and demonstrates how to use the interval_weighted_avg_f function.
- the tests file contains testing to ensure the interval_weighted_avg_f function is working and contains some more complicated examples, particularly when dealing with recorded data that is overlapping.