This repository was archived by the owner on Jul 18, 2020. It is now read-only.

equivalence of foverlaps and non-equijoins together with by=.EACHI suggests the time weighted average could be done in a single pass #21

Open
myoung3 opened this issue Feb 22, 2020 · 1 comment


myoung3 commented Feb 22, 2020

...but only if a GForce-optimized weighted.mean gets implemented.

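library(data.table)

# Two sets of closed intervals; val1 and val2 are the values attached to each interval
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
setkey(x, start, end)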

x[,startx1:=start]
x[,startx2:=start]
x[,endx1:=end]
x[,endx2:=end]

y[,starty1:=start]
y[,starty2:=start]
y[,endy1:=end]
y[,endy2:=end]

x[,start:=NULL]
y[,start:=NULL]
x[,end:=NULL]
y[,end:=NULL]

setkey(y, starty1, endy1)
setkey(x, startx1, endx1)

Note the equivalence of foverlaps and the non-equi joins:

(The complicated column renaming is needed because of the confusing way that data.table renames columns during non-equi joins.)

foverlaps(x, y)
y[x, on=list(endy1>=startx1,starty1<=endx1)]

foverlaps(y,x)
x[y, on=list(endx1>=starty1,startx1<=endy1)]
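
The equivalence claimed above can be checked programmatically. The following is a minimal sketch, not the code from this issue: it uses simplified, hypothetical column names (startx/endx, starty/endy) so that no renaming is needed, and it only compares which (val1, val2) pairs each approach matches, with the default closed-interval `type="any"` overlap.

```r
library(data.table)

# Same toy intervals as above, with distinct column names per table (assumed names)
x <- data.table(startx = c(5, 31, 22, 16), endx = c(8, 50, 25, 18), val2 = 7:10)
y <- data.table(starty = c(10, 20, 30), endy = c(15, 35, 45), val1 = 1:3)
setkey(y, starty, endy)  # foverlaps requires y to be keyed on its interval columns

# Overlap join via foverlaps; unmatched rows of x get NA for val1
fo <- foverlaps(x, y, by.x = c("startx", "endx"), by.y = c("starty", "endy"))[, .(val1, val2)]

# Same overlap condition written as a non-equi join
ne <- y[x, .(val1, val2), on = .(endy >= startx, starty <= endx)]

# Compare the matched pairs irrespective of row order
fo <- fo[order(val2, val1)]
ne <- ne[order(val2, val1)]
same <- isTRUE(all.equal(fo, ne, check.attributes = FALSE))
```

Only non-join columns are selected in `j`, which sidesteps the question of which table's values the join columns carry in the result.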

Would this work? How would this count days?
y[x, weighted.mean(???),on=list(endy1>=startx1,starty1<=endx1),by=.EACHI]
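
One possible way to fill in the `???` is sketched below. It is an assumption-laden illustration rather than a settled answer: the column names are simplified and hypothetical, the intervals are treated as closed with integer day bounds (hence the `+ 1` when counting days), and separate copies of the bounds (sx/ex, sy/ey) serve as join columns so the original bounds stay unambiguous inside `j`.

```r
library(data.table)

# Toy intervals with distinct, assumed column names
x <- data.table(startx = c(5, 31, 22, 16), endx = c(8, 50, 25, 18), val2 = 7:10)
y <- data.table(starty = c(10, 20, 30), endy = c(15, 35, 45), val1 = 1:3)

# Join-only copies of the bounds; the originals are used inside j
x[, `:=`(sx = startx, ex = endx)]
y[, `:=`(sy = starty, ey = endy)]

# For each row of x, average val1 over the overlapping y intervals,
# weighted by the number of overlapping days (closed intervals assumed)
twa <- y[x,
         .(val2 = i.val2,
           twa = weighted.mean(val1,
                               pmin(endy, i.endx) - pmax(starty, i.startx) + 1)),
         on = .(ey >= sx, sy <= ex),
         by = .EACHI]
```

Under these assumptions the x interval (31, 50) overlaps (20, 35) for 5 days and (30, 45) for 15 days, so its weighted average is (2*5 + 3*15)/20 = 2.75; x intervals with no overlap get NA. Note that `j` is evaluated once per row of x here, which is where a GForce-optimized weighted.mean would matter.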


myoung3 commented Feb 23, 2020

pmax/pmin seem much slower when done within group.

It's possible that foverlaps could compute the pmin/pmax (i.e. the overlapping intervals) directly in C, but this isn't implemented.
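
For comparison, a two-step alternative that keeps pmin/pmax out of the per-group code might look like the sketch below. It is not code from this issue: the column names and the `xid` identifier are hypothetical, closed intervals with inclusive day counts are assumed, and the weighted mean is rewritten as a ratio of two plain sums so that the grouped step contains only simple `sum(column)` calls.

```r
library(data.table)

# Toy intervals; xid is an assumed row identifier for x
x <- data.table(xid = 1:4, startx = c(5, 31, 22, 16), endx = c(8, 50, 25, 18))
y <- data.table(starty = c(10, 20, 30), endy = c(15, 35, 45), val1 = 1:3)
x[, `:=`(sx = startx, ex = endx)]  # join-only copies of the bounds
y[, `:=`(sy = starty, ey = endy)]

# Step 1: materialise all overlapping pairs (inner join)
pairs <- y[x,
           .(xid = i.xid, val1, starty, endy, startx = i.startx, endx = i.endx),
           on = .(ey >= sx, sy <= ex), nomatch = NULL]

# Step 2: overlap length in days, computed once over the whole table
pairs[, days := pmin(endy, endx) - pmax(starty, startx) + 1]
pairs[, wv := val1 * days]

# Step 3: grouped sums, then the ratio gives the time-weighted average
agg <- pairs[, .(swv = sum(wv), sw = sum(days)), by = xid]
agg[, twa := swv / sw]
```

This trades the single pass for an intermediate table of matched pairs, so it uses more memory; whether it is actually faster than the `by=.EACHI` form would need benchmarking on realistic data.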

Additionally, I think I remember Arun stating somewhere on the GitHub issue tracker that immediate grouping with a j expression (à la by=.EACHI) would be possible to implement for foverlaps (though that statement might predate the introduction of non-equi joins).

So it might be worth creating two issues on the data.table page:

1. Why is foverlaps faster than the equivalent non-equi join?
2. Is there a way to more quickly calculate the matching interval from a foverlaps join (and the equivalent non-equi join) without doing pmin/pmax in R?
