Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auto-names throwing warnings+error with if/else list #4274

Closed
robitalec opened this issue Mar 2, 2020 · 10 comments · Fixed by #4275
Closed

auto-names throwing warnings+error with if/else list #4274

robitalec opened this issue Mar 2, 2020 · 10 comments · Fixed by #4275

Comments

@robitalec
Copy link

robitalec commented Mar 2, 2020

Description

My package spatsoc v0.1.11 was returning two warnings and one error from data.table recently. I narrowed it down to version 1.12.6 vs 1.12.8, potentially the new auto-name j (#3802). These are internal warnings so not something I'm too confident diagnosing, but I did find a simple workaround that seems to make them disappear - so that might be informative for you.

The warnings:

longer object length is not a multiple of shorter object length

Different branches of j expression produced different auto-named columns: [NA!=distance]; using the most "last" names

and the error:

Internal error: jvnames is length 2 but ans is 4 and bynames is 1

The fix

The fix: assign the if/else lists and explicitly return it. Here is a commit that makes that change.

The warnings and error are present in v1.12.8 but not in 1.12.6 and disappear completely with the above fix.

Minimal reproducible example

This will require an install of a couple packages, as I'm not sure how to reproduce it outside of my situation but the warnings and error can be reproduced using one of spatsoc's function's examples from the man.

# Install spatsoc v0.1.11
remotes::install_github('ropensci/[email protected]')

# Load packages
library(data.table)
library(spatsoc)

# Read example data
DT <- fread(system.file("extdata", "DT.csv", package = "spatsoc"))

# Cast the character column to POSIXct
DT[, datetime := as.POSIXct(datetime, tz = 'UTC')]

# Temporal grouping
group_times(DT, datetime = 'datetime', threshold = '20 minutes')

# Edge list generation
# NOTE: this returns the two warnings
edge_nn(DT, id = 'ID', coords = c('X', 'Y'),
        timegroup = 'timegroup')

# Edge list generation, returning distance between nearest neighbours
# NOTE: this returns the two warnings and the error
edge_nn(DT, id = 'ID', coords = c('X', 'Y'),
        timegroup = 'timegroup', threshold = 100,
        returnDist = TRUE)

Output of sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.8.so
LAPACK: /usr/lib/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_CA.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8   
 [6] LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils    
[5] datasets  methods   base     

other attached packages:
[1] spatsoc_0.1.11    data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2   
[3] packrat_0.5.0 
@MichaelChirico
Copy link
Member

@robitalec thanks so much for the thorough report. Unfortunately your last line gave me an error:

Error in edge_nn(DT, id = "ID", coords = c("X", "Y"), timegroup = "timegroup", :
unused argument (returnDist = TRUE)

@MichaelChirico
Copy link
Member

CRAN shows me 0.1.9 but your sessionInfo shows 0.1.11. Your dev version is currently at 0.1.12 & doesn't give me any warnings

@robitalec
Copy link
Author

robitalec commented Mar 3, 2020

Sorry about that - I should've clarified which version of spatsoc gives the warnings+error.

v0.1.12 has the fix (so don't use that one)
v0.1.11 adds the returnDist argument
v0.1.9 doesn't have the returnDist argument (this is the version on CRAN), which is why you're getting that error.

I'll update above - the warnings+error can be reproduced with spatsoc v0.1.11.

remotes::install_github('ropensci/[email protected]')

@MichaelChirico
Copy link
Member

MichaelChirico commented Mar 3, 2020

Thanks @robitalec I see the issue:

    DT[, {
        distMatrix <- as.matrix(stats::dist(.SD[, 2:3], method = "euclidean"))
        diag(distMatrix) <- NA
        if (is.null(threshold)) {
            wm <- apply(distMatrix, MARGIN = 2, which.min)
        }
        else {
            distMatrix[distMatrix > threshold] <- NA
            wm <- apply(distMatrix, MARGIN = 2, function(x) ifelse(sum(!is.na(x)) > 
                0, which.min(x), NA))
        }
        if (returnDist) {
            w <- wm + (length(wm) * (as.numeric(names(wm)) - 
                1))
            list(ID = .SD[[1]][as.numeric(names(wm))], NN = .SD[[1]][wm], 
                distance = distMatrix[w])
        }
        else {
            list(ID = .SD[[1]][as.numeric(names(wm))], NN = .SD[[1]][wm])
        }
    }, by = splitBy, .SDcols = c(id, coords)]

if (returnDistance), you return a different number of columns.

I can see three options:

(1) explicitly add distance=NA in the else branch, then add [ , distance := NULL] to remove the column if (!returnDistance)

(2) Branch your code before j: if (returnDistance) DT[ , {}] else DT[ , {}]. I would prefer this option personally -- returnDistance is globally true, so it is inefficient to evaluate it for every group of by within the data.table context

(3) use suppressWarnings or tryCatch to anticipate this warning

I'm not sure there's any easy way to get around this from data.table perspective -- generally, it's an error to return a different number of columns by branch, e.g. with your DT:

DT[ , if (.BY$ID > 'A') .(1) else .(1, 2), by = ID]

Error in [.data.table(DT, , if (.BY$ID > "A") .(1) else .(1, 2), by = ID) :
j doesn't evaluate to the same number of columns for each group

IINM it's only possible for code like yours to succeed if the if() condition doesn't depend on the data itself (and hence the if() could/should be evaluated in the parent environment)

That said, I see a logic error that can help to have made a better error message for diagnosis, at least

@robitalec
Copy link
Author

Thanks for looking into this so quickly @MichaelChirico.

So data.table is looking at the if else to pick up the names, but not at what is returned for each by?
Because like you say in (2), returnDist is a variable set once, so for each call to, for example spatsoc::edge_dist, we are only returning either 2 columns or 3 columns consistently for all bys.
Is that why assigning the list and returning it after the if else block, gets rid of the error+warnings?

I'll think about just simplifying this argument or evaluating it outside of the j.

@MichaelChirico
Copy link
Member

Yes & yes. the analysis of your j "query" is done before evaluating it (evaluation might take a very long time & there's no way to know in advance that it's trivially fast, like in your case). Your workaround of returning an object assigned to the output of your if/else instead of returning directly I think is perfectly fine, with again the caveat that this could be inefficient.

if your input has 1M groups, each group will run if(returnDistance). that's 999,999 times too many & I'm not sure whether we benefit from branch prediction in such a complex program

@robitalec
Copy link
Author

Ok! Neat.

And yes - for sure. I'll do some tests and try to simplify things. Thanks for the advice.

@jangorecki
Copy link
Member

@robitalec please close this issue if you will fix it in your package, and ideally you could link a commit in your package that fixed it, so anyone who will face similar issue can look at your solution. Thanks

@robitalec
Copy link
Author

Will do @jangorecki, I wasn't sure if it would be closed by #4275.
The fix is here: ropensci/spatsoc@81a4e15.

@MichaelChirico by the way, there wasn't a noticeable difference in run time for the current version and a modified version of this function that splits the logic if(returnDistance) out of the j query.

Thank you both!

@jangorecki
Copy link
Member

@robitalec Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants