data.table doesn't seem to be using multiple cores on Slurm cluster. How to troubleshoot? #5573

Open
billytcl opened this issue Dec 24, 2022 · 6 comments

@billytcl

I'm using data.table on a SLURM cluster and for some reason it's having trouble using multiple cores on something as simple as fread, even though it's detecting them when loading the library. The file is a 46GB tab-delimited file in 4-column long format.

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(data.table)
data.table 1.14.2 using 3 threads (see ?getDTthreads).  Latest news: r-datatable.com
> options(datatable.verbose = TRUE)          
> tmp <- fread("[..REDACTED..]")
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            6
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          6
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                6
  RestoreAfterFork               true
  data.table is using 3 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 3 threads (omp_get_max_threads()=6, nth=3)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file [..REDACTED..]
  File opened, size = 45.65GB (49011968734 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<chr1 10468   10469   0.0     P7740_237>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=0x9  with 100 lines of 5 fields using quote rule 0
  Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<chr1      10468   10469   0.0     P7740_237>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (49011968733 bytes from row 1 to eof) / (2 * 5114 jump0size) == 4791940
  Type codes (jump 000)    : C557C  Quote rule 0
  Type codes (jump 100)    : C557C  Quote rule 0
  'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
  =====
  Sampled 10040 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 49011968685
  Line length: mean=57.97 sd=2.32 min=48 max=69
  Estimated number of rows: 49011968685 / 57.97 = 845441913
  Initial alloc = 929986104 rows (845441913 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : C557C
[10] Allocate memory for the datatable
  Allocating 5 column slots (5 - 0 dropped) with 929986104 rows
[11] Read the data
  jumps=[0..46743), chunk_size=1048541, total_size=49011968733
|--------------------------------------------------|
|===================

When I ssh into the node, it's not even using all of the CPUs:

top - 12:30:20 up 3 days, 15:38,  1 user,  load average: 8.58, 8.33, 7.57
Tasks:   4 total,   1 running,   3 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.8 us,  0.5 sy,  0.0 ni, 72.9 id,  0.9 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26357425+total, 19523254+free, 41698536 used, 26643172 buff/cache
KiB Swap:  4194300 total,  4194300 free,        0 used. 21920500+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                      
32464 billylau  20   0   74.4g  36.1g  17.4g S  31.6 14.4   1:28.06 R                                                                                                                                                            
32189 billylau  20   0  116440   3760   1772 S   0.0  0.0   0:00.05 bash                                                                                                                                                         
32509 billylau  20   0  116212   3492   1724 S   0.0  0.0   0:00.03 bash                                                                                                                                                         
32697 billylau  20   0  161244   2108   1512 R   0.0  0.0   0:00.05 top                                                                                                                                                          

I can verify that when I use it on our personal workstations it is using multiple threads. How should I go about troubleshooting this? My guess is that SLURM/R/data.table are having some kind of weird interaction that is not provisioning the CPUs properly.

# Output of sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /share/software/user/open/openblas/0.3.10/lib/libopenblas_haswellp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.1.2
@ben-schwen
Member

By default, data.table uses 50% of the available logical cores. You can raise this limit, e.g., by setting Sys.setenv(R_DATATABLE_NUM_PROCS_PERCENT="90").
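As a minimal sketch of the same idea from inside an R session, the thread budget can also be raised with `setDTthreads(percent = ...)` and checked with `getDTthreads(verbose = TRUE)`. The value 90 is only an example, and the `requireNamespace()` guard is there so the snippet does nothing where data.table is not installed.

```r
# Sketch: raise data.table's thread budget above the 50% default.
# The 90 is an illustrative value, not a recommendation.
if (requireNamespace("data.table", quietly = TRUE)) {
  # setDTthreads() invisibly returns the previous thread count,
  # so it can be restored afterwards.
  old_threads <- data.table::setDTthreads(percent = 90)

  # Prints the same diagnostic block seen in the verbose fread output.
  data.table::getDTthreads(verbose = TRUE)

  # Restore the previous setting.
  data.table::setDTthreads(old_threads)
}
```

In the log above, `omp_get_num_procs()` reports 6, which is why the 50% default resolves to 3 threads.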

@jangorecki
Member

jangorecki commented Dec 25, 2022

There may be several steps in fread that are not parallelized. For example, if your file has character columns, a lot of time will be spent single-threaded. I suggest running forder (or frollmean with algo="exact") in a loop and then watching top.
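A rough sketch of that experiment, using the `frollmean(..., algo = "exact")` variant. The helper name `burn_cpu` and its default sizes are hypothetical, chosen only to keep the CPUs busy long enough to watch `top` on the node; adjust them to the machine.

```r
# Sketch: keep data.table's parallel rolling mean busy so CPU usage
# can be observed in `top`. `burn_cpu` and its defaults are hypothetical.
burn_cpu <- function(n = 2e6, window = 1000L, reps = 20L) {
  x <- rnorm(n)
  started <- Sys.time()
  for (i in seq_len(reps)) {
    # algo = "exact" selects the slower, more accurate algorithm.
    invisible(data.table::frollmean(x, window, algo = "exact"))
  }
  # Return the elapsed wall-clock time in seconds.
  invisible(as.numeric(difftime(Sys.time(), started, units = "secs")))
}

# Usage (run inside the Slurm job, then watch `top` from an ssh session):
# burn_cpu()
```

If `%CPU` for the R process climbs well above 100 during this loop, multithreading itself works on the node, and the low usage during fread is more likely down to non-parallel steps.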

@tdhock
Member

tdhock commented Dec 27, 2022

From the output, it looks like data.table is using 3 of the 6 threads on that cluster node, so I'm not sure this is a problem with data.table, and you may consider closing the issue.
When using SLURM, you can tell data.table to use all SLURM CPUs via

data.table::setDTthreads(as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE", "1")))

@tdhock
Member

tdhock commented Dec 27, 2022

When using 3 threads, you would get at best a 3x speedup relative to a single thread, and only in the ideal case. Related to #2687: we should add some docs to clarify exactly how OpenMP is used, so people can have realistic expectations of when speedups should happen.
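One way to put numbers on "at best 3x" is Amdahl's law, which is a general model and not something taken from data.table's documentation. The parallel fractions below are hypothetical, not measurements of fread.

```r
# Sketch: Amdahl's law. `p` is the fraction of run time that can be
# parallelized, `n` is the number of threads.
amdahl_speedup <- function(p, n) {
  stopifnot(p >= 0, p <= 1, n >= 1)
  1 / ((1 - p) + p / n)
}

# Ideal case: everything runs in parallel, so 3 threads give about 3x.
amdahl_speedup(p = 1, n = 3)

# If only half of the work were parallel (hypothetical), 3 threads
# would give about 1.5x.
amdahl_speedup(p = 0.5, n = 3)
```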

@tdhock added the omp label Dec 27, 2022
@tdhock
Member

tdhock commented Dec 27, 2022

In fread.c, the only instance of pragma omp for I see is

    #pragma omp for ordered schedule(dynamic) reduction(+:thRead,thPush)
    for (int jump = jump0; jump < nJumps; jump++) {

But I am not an expert on fread, so I am not sure what exactly happens in this for loop, or whether using several threads in it should result in big speedups.

@HenrikBengtsson

HenrikBengtsson commented Mar 27, 2023

> When using SLURM you can tell data.table to use all SLURM CPUs via
>
> data.table::setDTthreads(as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE", "1")))

Note that SLURM_JOB_CPUS_PER_NODE may hold multi-host values, e.g. 4,8 and 10,2(x3). This depends on what parallel resources the Slurm job requested. If you're interested in the number of CPUs allotted on the current machine, I think you want to use SLURM_CPUS_ON_NODE instead, which holds an integer scalar.
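To illustrate why `as.integer()` on the raw value is fragile, here is a minimal base-R sketch that expands the multi-host format into one count per node, plus a small reader for `SLURM_CPUS_ON_NODE`. Both helper names (`parse_cpus_per_node`, `slurm_cpus_here`) are hypothetical, and the parser assumes only the two shapes shown above (`N` and `N(xM)`).

```r
# Sketch: expand a SLURM_JOB_CPUS_PER_NODE-style string such as
# "4,8" or "10,2(x3)" into one integer per node.
parse_cpus_per_node <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  unlist(lapply(parts, function(p) {
    m <- regmatches(p, regexec("^([0-9]+)(\\(x([0-9]+)\\))?$", p))[[1]]
    if (length(m) == 0L) stop("unrecognised entry: ", p)
    count <- as.integer(m[2])
    # An unmatched optional group comes back as "", meaning no repeat.
    reps <- if (nzchar(m[4])) as.integer(m[4]) else 1L
    rep(count, reps)
  }))
}

# Sketch: CPUs allotted on the current node, falling back to 1 when the
# variable is unset or not a positive integer (e.g. outside Slurm).
slurm_cpus_here <- function() {
  n <- suppressWarnings(as.integer(Sys.getenv("SLURM_CPUS_ON_NODE", "1")))
  if (is.na(n) || n < 1L) 1L else n
}

# Usage inside a Slurm job:
# data.table::setDTthreads(slurm_cpus_here())
```

With these helpers, `parse_cpus_per_node("10,2(x3)")` yields four entries (10, 2, 2, 2), whereas `as.integer("10,2(x3)")` would give `NA` with a warning.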
