Documentation on how OpenMP is used is lacking #2687
I read ?setDTthreads with the expectation of finding some indication of how openMP is used for parallelization, so I could get some idea about what kinds of data would result in speedups using more than one thread. The closest I found to an answer was:
There are links to these functions, so I expected to find more details in the linked man pages, but I did not find a sufficiently detailed description of how openMP is used / what kinds of data/operations would result in speedups when using multiple threads.
Can someone please add details about how multi-threading is used, and when speedups should be expected?
The linked blog post has some benchmarking of fread and fwrite (computation times for some particular numbers of rows and columns), but it would be useful to have some description like this on the man pages.
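For reference, the thread settings themselves can at least be inspected from R. This is a minimal sketch, assuming a data.table build with OpenMP support; the variable names are only illustrative, and it shows how many threads are in use, not which operations actually use them:

```r
library(data.table)

# Print the thread settings data.table is currently using
# (verbose = TRUE writes the OpenMP-related details to the console).
getDTthreads(verbose = TRUE)

# Restrict data.table to one thread; setDTthreads() returns the
# previous setting invisibly, so it can be stored and restored later.
old.threads <- setDTthreads(1L)
one.thread <- getDTthreads()

# Restore the previous setting.
setDTthreads(old.threads)
```

Storing and restoring the old value keeps the change local to the section of code being timed.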
I wrote a blog that compares time and memory usage of CSV read/write functions, https://tdhock.github.io/blog/2023/compare-read-write/
Reading character columns needs to populate the global string cache. That is always single-threaded. Try another CSV.
hey @jangorecki thanks for the feedback. So you think that we should expect speedups with multiple threads, when we are using fread, as long as there is no bottleneck from the global string cache?
chr_mat <- function(N.rows, N.cols){
  data.vec <- paste0("'quoted", c(" ", "_"), "data'")
  matrix(data.vec, N.rows, N.cols)
}

The figure below shows that same benchmark run on a different machine with up to 64 threads, which shows very little difference between single and multiple threads. I also tried another CSV, with random real numbers.

random_real <- function(N.rows, N.cols){
  set.seed(1)
  matrix(rnorm(N.rows*N.cols), N.rows, N.cols)
}

For real numbers I observed qualitatively similar results in the figure below (no big speedups when using multiple threads). Overall, I have not observed any big speedups when using multiple threads (with fread or any other data.table function), so I wonder if you know of any examples that I could run to observe that?
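For anyone who wants to repeat this kind of comparison, here is a minimal sketch of timing fread on one file with several thread counts. The helper name `time_fread`, the data size, and the thread counts are assumptions chosen for illustration, not the setup behind the figures; timings will vary by machine, so no expected result is stated.

```r
library(data.table)

# Hypothetical helper: elapsed seconds for fread() on one CSV,
# once per requested thread count (passed via fread's nThread argument).
time_fread <- function(csv.path, thread.counts = c(1L, 2L, 4L)) {
  elapsed <- sapply(thread.counts, function(n.threads) {
    timing <- system.time(
      fread(csv.path, nThread = n.threads, showProgress = FALSE)
    )
    timing[["elapsed"]]
  })
  setNames(elapsed, paste0("threads=", thread.counts))
}

# Write a numeric-only CSV, so no character columns are involved.
set.seed(1)
N.rows <- 1e5
N.cols <- 10
real.dt <- as.data.table(matrix(rnorm(N.rows * N.cols), N.rows, N.cols))
csv.path <- tempfile(fileext = ".csv")
fwrite(real.dt, csv.path)

elapsed <- time_fread(csv.path)
print(elapsed)
```

A single run per thread count is noisy; repeating each timing several times and comparing medians would give a fairer picture.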
Which version are you trying out?
The only docs I could find on how openMP is actually used are from the "old news" file, under Changes in v1.9.8 (on CRAN 25 Nov 2016), which reads:
It would be great if this could be in the FAQ, with a little info on how much this helps and for what sorts of operations.
Or am I missing something?
edit: I just found a little detail of the sort I'm seeking: reading/writing biggish data, revisited