Pattern matching for character columns compared to Factor columns #4748
For problems with many duplicated strings, matching the regular expression on the unique levels and then mapping back is much faster, since the number of unique levels is fairly small. Under such circumstances, I suggest using the factor column. (That said, converting characters to factors takes time.) In addition, the reason DT2 is much faster than DT3 is that your implementation subsets the result directly, which saves the time spent coercing the factor values to integers and matching again. Lines 5 to 7 in 7bf5748
From ?
I think we should improve this case based on your findings.
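The idea in the comment above (match the pattern once per unique level, then map the result back to every row) can be sketched in base R. This is only an illustration under assumed data, not the code from the PR or from `like.R`; the vector names, the pattern, and the sizes are all hypothetical.

```r
# Hypothetical data: 1e5 strings drawn from only 100 distinct values.
set.seed(1)
vals  <- sprintf("sym_%03d", 1:100)
x_chr <- sample(vals, 1e5, replace = TRUE)
x_fct <- factor(x_chr)

pattern <- "^sym_00"  # assumed pattern, matches sym_001 .. sym_009

# Direct approach: the regex runs on every element.
hit_direct <- grepl(pattern, x_chr)

# Factor approach: the regex runs once per level,
# then the integer codes index into the per-level result.
lvl_hit    <- grepl(pattern, levels(x_fct))
hit_factor <- lvl_hit[as.integer(x_fct)]

# Same idea for a character vector, without building a factor:
# match on the unique values and map back with match().
u          <- unique(x_chr)
hit_unique <- grepl(pattern, u)[match(x_chr, u)]

# All three approaches agree on this data (which has no NA values).
stopifnot(identical(hit_direct, hit_factor),
          identical(hit_direct, hit_unique))
```

One caveat with this sketch: `grepl()` returns `FALSE` for `NA` input, whereas indexing by an `NA` code yields `NA`, so the approaches differ on missing values unless those are handled explicitly.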
Hello, it seems that you beat me to the PR :\ Thanks for your answer. Can I digress a bit on this, please? Let's run the same test as above, but without the data.table subsetting
You can see that
In my work we typically have tables of a few tens of millions of rows, and character columns usually hold "symbols" (i.e. strings) with low cardinality (hundreds of unique values). The industry standard is to use some factor-like object. I have read in several places (typically this well-known SO question) that I should prefer character.
Could you point me to this? I would like to reword it, as I don't think this is the case. Especially with all the internal parallelism in
That's not in a vignette, that's on
I see you actually amended this SO post. I'd like to have a clearer picture of factor vs. character performance for cases where cardinality is small (which would be typical in, say, finance). Is there such a document somewhere? If not, I can create an Rmd testing J(), ==, pattern matching, and merging, if you think it could be useful (in that case I can commit the document somewhere).
I do think that would be quite useful. I'm not sure if a vignette is the right medium for it (it would be our first such vignette) -- would a blog post be more appropriate?
I came here by chance from the NEWS.md 1.14.7 file, where this is still referenced as a new feature in development. It looks like this was merged quite some time ago. I hope it is fine for me to close it.
Actually, I am confused: I checked https://github.com/Rdatatable/data.table/pull/4750/files#diff-88384aa2504222157297b908ad7dc25684c172e11a9da29bcf848283e05290d6 and, although I understand the PR was merged a long time ago, it does not seem to be in the latest CRAN version (I checked like.R in the package source available on CRAN).
This is something that confused me as well. Some fixes that were merged more than a year ago are not on CRAN yet, as stated in #5538 (comment).
Hello, I would like to know the appropriate way of matching patterns in character columns.
In the vignettes it is clearly explained that character columns should be preferred over factor columns.
I know of the like function in the package, but its performance seems fairly low in the problem below, where I try to detect a pattern in a character column and in a factor column.
Is there a recommended way to do pattern matching on character columns in data.table?