Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for index in setkey #3582

Merged
merged 21 commits into from
May 28, 2019
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,8 @@ Authors@R: c(
person("Michael","Quinn", role="ctb"),
person("@javrucebo","", role="ctb"),
person("@marc-outins","", role="ctb"),
person("Roy","Storey", role="ctb"))
person("Roy","Storey", role="ctb"),
person("Manish","Saraswat", role="ctb"))
Depends: R (>= 3.1.0)
Imports: methods
Suggests: bit64, curl, R.utils, knitr, xts, nanotime, zoo, yaml
Expand Down
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,8 @@

16. `rbind` and `rbindlist` of an item containing an ordered factor with levels containing an `NA` (as opposed to an NA integer) could segfault, [#3601](https://github.com/Rdatatable/data.table/issues/3601). This was a a regression in v1.12.2. Thanks to Damian Betebenner for reporting.

17. `setkey` now checks for existing index before doing `forder` [#3582](https://github.com/Rdatatable/data.table/pull/3582/). Executes `forder` only if the index does not exist. Thanks to @MichaelChirico for reporting and @saraswatmks for the PR.

#### NOTES

1. `rbindlist`'s `use.names="check"` now emits its message for automatic column names (`"V[0-9]+"`) too, [#3484](https://github.com/Rdatatable/data.table/pull/3484). See news item 5 of v1.12.2 below.
Expand Down
25 changes: 20 additions & 5 deletions R/setkey.R
Original file line number Diff line number Diff line change
Expand Up @@ -86,12 +86,27 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU
if (!typeof(.xi) %chin% c("integer","logical","character","double")) stop("Column '",i,"' is type '",typeof(.xi),"' which is not supported as a key column type, currently.")
}
if (!is.character(cols) || length(cols)<1L) stop("Internal error. 'cols' should be character at this point in setkey; please report.") # nocov
if (verbose) {
tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=FALSE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R
# suppress needed for tests 644 and 645 in verbose mode
cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n")

# get existing index name if any
newkey = paste0(cols, collapse="__")

# forder only if index is not present
if (!any(indices(x) == newkey)){
if (verbose) {
tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=FALSE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R
# suppress needed for tests 644 and 645 in verbose mode
cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n")
} else {
o <- forderv(x, cols, sort=TRUE, retGrp=FALSE)
}
} else {
o = forderv(x, cols, sort=TRUE, retGrp=FALSE)
# find the name of matching index
if (verbose){
cat("using existing index for", newkey, "\n")
o <- attr(attributes(x)$index, which=newkey, exact = TRUE)
} else {
o <- attr(attributes(x)$index, which=newkey, exact = TRUE)
}
}
if (!physical) {
if (is.null(attr(x,"index",exact=TRUE))) setattr(x, "index", integer())
Expand Down
31 changes: 28 additions & 3 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -5668,13 +5668,13 @@ test(1376.06, indices(DT), c("b","a")) # 2 secondary keys of single columns
test(1376.07, DT[a==7L,verbose=TRUE], DT[7L], output="Optimized subsetting with index 'a'")
setkey(DT,b)
test(1376.08, indices(DT), NULL)
test(1376.09, list(DT[a==2L], indices(DT)), list(DT[9L],"a")) # create indices for next test
test(1376.09, list(DT[a==2L], indices(DT)), list(DT[2L],"a")) # create indices for next test
setindex(DT,NULL)
test(1376.10, list(key(DT), indices(DT)), list("b", NULL))
options(datatable.auto.index = FALSE)
test(1376.11, list(DT[a==2L], indices(DT)), list(DT[9L],NULL))
test(1376.11, list(DT[a==2L], indices(DT)), list(DT[2L],NULL))
options(datatable.auto.index = TRUE)
test(1376.12, list(DT[a==2L], indices(DT)), list(DT[9L],"a"))
test(1376.12, list(DT[a==2L], indices(DT)), list(DT[2L],"a"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 tests changes from 9L to 2L were strange that they were needed. Anyway, the latest commit leaves these tests unchanged which seems more correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep: those 3 tests should not have been changed to 9L in this PR. The setkey(DT,b) inbetween tests 1376.07 and 1376.08 had broken under this PR and wasn't changing the row order properly.


# When i is FALSE and a column is being added by reference, for consistency with cases when i is not FALSE
# we should still add the column. But we need to know what type it should be, so the user supplied RHS of :=
Expand Down Expand Up @@ -6031,6 +6031,31 @@ thisDT <- copy(DT)[, c("aaa", "b") := 2]
test(1419.58, indices(thisDT), c("a", "ab"))
test(1419.59, allIndicesValid(thisDT), TRUE)

## setkey on same col as index before
DT <- data.table(a =c(4,1,3,9,2,1,8,7,6,5),
aaa = c(1,1,2,2,2,1,1,2,2,2))
setindex(DT, a)
test(1419.60, allIndicesValid(DT), TRUE)
setindex(DT, NULL)
setkey(DT, a)
test(1419.61, DT$a, c(1,1,2,3,4,5,6,7,8,9))
setkey(DT, NULL)
setindex(DT, a)
test(1419.62, setkey(DT, a, verbose=TRUE), output="using existing index for a")

# check setkey incase of existing multiple indexes
DT <- data.table(a = c(3,3,4,4,5,6,1,1,7,2,2),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a has 11 items here but aaa and bbb have 10 so there's a warning when running test.data.table() runs in dev and when users would run test.data.table(). It's a lot easier and faster to run locally first before pushing. See the cc.R script in the root of the project : that's what I use in dev.

aaa = c(1,1,2,2,2,1,1,2,2,2),
bbb = c(1,1,2,0,1,1,1,0,1,1))
setkey(DT, a)
test(1419.63, DT$a, c(1,1,2,2,3,3,4,4,5,6,7))
setkey(DT, NULL)
test(1419.64, setkey(DT, a, verbose=TRUE), output="forder took")
setkey(DT, NULL)
setindex(DT, a)
setindex(DT, aaa)
test(1419.65, allIndicesValid(DT), TRUE)
test(1419.66, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't check the result of setkey is correct. Please see previous comments where that was requested : #3582 (comment) and #3582 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for chipping in. I wasn't fully aware of test function parameters. Now I understand it better. Also, I will try to fix my local environment (I still don't know how am I gonna do that) to avoid stupid errors in the PR.


# setnames updates secondary key
DT = data.table(a=1:5,b=10:6)
Expand Down