Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

forderv orders NA's different than base::order by default #2594

Closed
MarkusBonsch opened this issue Jan 28, 2018 · 3 comments · Fixed by #4427
Closed

forderv orders NA's different than base::order by default #2594

MarkusBonsch opened this issue Jan 28, 2018 · 3 comments · Fixed by #4427
Assignees
Milestone

Comments

@MarkusBonsch
Copy link
Contributor

Probably, this is known and intended:

x <- c(3,NA,6,NaN, 2, 1)
x[data.table:::forderv(x)]
# [1]  NA NaN   1   2   3   6
x[order(x)]
#  1   2   3   6  NA NaN

The reason is the different default for the na.last argument.
Is there a reason for this?
The result is that merge.data.frame and merge.data.table have different row orders since the setkey used for sorting has the default NA ordering of data.table:

df1 <- data.frame(x = c(2,NA,1, NaN), y = 1:4)
df2 <- data.frame(x = c(NA,1), y = 5:8)
merge(df1, df2, by = "x", sort = TRUE)
# x y.x y.y
# 1  1   3   6
# 2  1   3   8
# 3 NA   2   5
# 4 NA   2   7
merge(setDT(df1), setDT(df2), by = "x", sort = TRUE)
# x y.x y.y
# 1: NA   2   5
# 2: NA   2   7
# 3:  1   3   6
# 4:  1   3   8
@mattdowle
Copy link
Member

Yes that default difference is deliberate and long standing.

  1. NA is internally INT_MIN in R. Keys and indexes are always increasing order so if NAs go first, nothing special is needed in many data.table internals. This is deliberately not optional for speed and simplicity of internals at C level.
  2. If there are NAs in a key or in data then that's something I like to be in my face right up front. So I like to see NAs (the problems) first not hidden at the end.

@MarkusBonsch
Copy link
Contributor Author

@mattdowle Great, I completely agree, thanks for explaining. We should update the documentation of merge.data.table to reflect this. I will create a pull request.

@jangorecki
Copy link
Member

jangorecki commented Apr 22, 2020

It is worth to note that we have forder which behaves like order in regards to NAs. So it is perfectly fine to divert from that in forderv.
If we would keep also NA count as a result of forderv computation, we could eventually easily shuffle that result for na.last=T/F without the need for recomputation of order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants