Documentation on how OpenMP is used is lacking #2687

Open
malcook opened this issue Mar 18, 2018 · 6 comments

Comments

@malcook

malcook commented Mar 18, 2018

The only docs I could find on how OpenMP is actually used are in the "old news" file, under Changes in v1.9.8 (on CRAN 25 Nov 2016), which reads:

Added setDTthreads() and getDTthreads() to control the threads used in data.table functions that are now parallelized with OpenMP on all architectures including Windows (fwrite(), fsort() and subsetting).
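For reference, those two functions are exported by data.table and can be tried directly. A minimal sketch (the thread counts below are just illustrative):

library(data.table)

getDTthreads(verbose = TRUE)  # report how many threads data.table will use, with details on how the count was chosen
setDTthreads(1)               # restrict data.table to a single thread
setDTthreads(4)               # example: allow up to 4 threads
setDTthreads(0)               # 0 requests all logical CPUs (the default policy may differ by version)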

It would be great if this could be in the FAQ, with a little info on how much this helps and for what sorts of operations.

Or am I missing something?

Edit: I just found a bit of the detail I'm seeking: reading/writing biggish data, revisited

@tdhock
Member

tdhock commented Dec 22, 2022

I read ?setDTthreads expecting to find some indication of how OpenMP is used for parallelization, so that I could get some idea of what kinds of data would see speedups from using more than one thread. The closest thing to an answer I found was:

     Internally parallelized code is used in the following places:
        • ‘between.c’ - between()
        • ‘cj.c’ - CJ()
        • ‘coalesce.c’ - fcoalesce()
        • ‘fifelse.c’ - fifelse()
        • ‘fread.c’ - fread()
        • ‘forder.c’, ‘fsort.c’, and ‘reorder.c’ - forder() and related
        • ‘froll.c’, ‘frolladaptive.c’, and ‘frollR.c’ - froll() and family
        • ‘fwrite.c’ - fwrite()
        • ‘gsumm.c’ - GForce in various places, see GForce
        • ‘nafill.c’ - nafill()
        • ‘subset.c’ - Used in ‘[.data.table’ subsetting
        • ‘types.c’ - Internal testing usage

There are links to these functions, so I expected to find more details in the linked man pages, but I did not find a sufficiently detailed description of how OpenMP is used, or of what kinds of data/operations would see speedups when using multiple threads.
For example, on ?fread the only mention of threads is:

 nThread: The number of threads to use. Experiment to see what works
          best for your data on your hardware.

Can someone please add details about how multi-threading is used, and when speedups should be expected?
For example, something like the following:

 nThread: The number of threads to use in the for loop over columns. 
          (THIS IS JUST AN EXAMPLE, I DO NOT KNOW IF THIS IS TRUE)
          Speedups should be expected when there are a large number of columns. 
          Experiment to see what works best for your data on your hardware.

The linked blog post has some benchmarking of fread and fwrite (computation times for some particular numbers of rows and columns), but it would be useful to have some description like this on the man pages.
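To illustrate the kind of experiment behind this question, here is a minimal sketch that times one of the internally parallelized operations listed above, fcoalesce(), at several thread counts; the data size and thread counts are arbitrary, and no particular speedup is being claimed:

library(data.table)

# Two long numeric vectors; half of x is missing, so fcoalesce() has real work to do.
n <- 1e7
x <- replace(rnorm(n), sample(n, n/2), NA_real_)
y <- rnorm(n)

for (nthr in c(1L, 2L, 4L)) {
  setDTthreads(nthr)
  cat(nthr, "thread(s):", system.time(fcoalesce(x, y))[["elapsed"]], "seconds\n")
}
setDTthreads(0)  # restore: 0 requests all logical CPUs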

@tdhock
Member

tdhock commented Mar 27, 2023

I wrote a blog post that compares time and memory usage of CSV read/write functions, https://tdhock.github.io/blog/2023/compare-read-write/
I did not observe any big speed differences when using 1 thread vs multiple threads; is this expected?

@jangorecki
Member

Reading character columns needs to populate the global string cache. That is always single-threaded. Try another CSV.
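A numeric-only test along those lines might look like the sketch below (file path, sizes, and thread counts are arbitrary); with no character columns there is no global string cache to populate, so any multi-threading benefit in fread() should be easier to see:

library(data.table)

# Write a CSV containing only random real numbers.
csv <- tempfile(fileext = ".csv")
fwrite(as.data.table(matrix(rnorm(1e7), ncol = 10)), csv)

# Time fread() on the same file with different thread counts.
for (nthr in c(1L, 2L, 4L)) {
  cat(nthr, "thread(s):",
      system.time(fread(csv, nThread = nthr))[["elapsed"]], "seconds\n")
}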

@tdhock
Member

tdhock commented Mar 28, 2023

Hey @jangorecki, thanks for the feedback. So you think that we should expect speedups with multiple threads when using fread, as long as there is no bottleneck from the global string cache?
My examples seem to contradict that expectation.

[figure: macbook-read-char-vary-rows]
In the example above, I don't think the global string cache is an issue, because the same couple of strings repeat on every row and column. And in fact we see a small speedup when using two threads instead of one in this case. This code was used to generate the data:

# Build a character matrix of quoted strings; the two values in data.vec
# are recycled to fill the whole N.rows x N.cols matrix.
chr_mat <- function(N.rows, N.cols){
  data.vec <- paste0("'quoted", c(" ", "_"), "data'")
  matrix(data.vec, N.rows, N.cols)
}

The figure below shows the same benchmark run on a different machine with up to 64 threads; there is very little difference between single and multiple threads.
[figure: cluster-read-char-vary-rows]

I also did try another CSV, with random real numbers.

# Build a numeric matrix of random real numbers, seeded for reproducibility.
random_real <- function(N.rows, N.cols){
  set.seed(1)
  matrix(rnorm(N.rows*N.cols), N.rows, N.cols)
}

For real numbers I observed qualitatively similar results, shown in the figure below (no big speedups when using multiple threads).
[figure: cluster-read-real-vary-rows]

Overall I have not observed any big speedups when using multiple threads (with fread or any other data.table function), so I wonder if you know of any examples that I could run to observe that?

@jangorecki
Member

Which version are you trying out?

@tdhock
Member

tdhock commented Mar 30, 2023

I'm not sure which version of data.table was used for those previous figures, but I just re-ran some of them, using a max of 4 CPUs instead of 64 and data.table 1.14.8, and I observe the results below.
[figure: character data]
[figure: real number data]
Some speedups are apparent on large real-number data sets (4 threads faster than 2, which are in turn faster than 1). Is that the extent of the speedups you would expect for fread?
