DistributeFilesDataset _num_shards issue #1678

Open
Judyxujj opened this issue Jan 20, 2025 · 4 comments

Judyxujj commented Jan 20, 2025

With the latest RETURNN, when I use DistributeFilesDataset, I get this error:

File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/basic.py", line 227, in Dataset._create_from_reduce                     
  line: ds = cls(**kwargs)                                                                                                                     
  locals:                                                                                                                              
   ds = <not found>                                                                                                                        
   cls = <local> <class 'returnn.datasets.distrib_files.DistributeFilesDataset'>                                                                                          
   kwargs = <local> {'files': ['/ssd/jxu/nas/data/speech/FR_FR/16kHz/EPPS/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-25/EPPS-batch.1.v1.hdf.15', '/ssd/jxu/nas/data/speech/EN_US/16kHz/NEWS.HQ/corpus/batch.2.NPR.v3/hdf-raw_wav.16kHz.split-261/NEWS.HQ-batch.2.NP
R.v3.hdf.7', '/ssd/jxu/nas/data/speech/IT_IT/16kHz/IT.parli..., len = 25                                                                                               
 File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/distrib_files.py", line 171, in DistributeFilesDataset.__init__               
  line: assert self._num_shards == 1 and self._shard_index == 0, ( # ensure defaults are set                                                                                    
       f"{self}: Cannot use both dataset-sharding via properties _num_shards and _shard index "                                                                                
       f"and {self.__class__.__name__}'s own sharding implementation based on the trainings rank and size."                                                                          
     )                                                                                                                              
  locals:                                                                                                                              
   self = <local> <DistributeFilesDataset 'train' epoch=None>                                                                                                   
   self._num_shards = <local> 8                                                                                                                  
   self._shard_index = <local> 6 

DistributeFilesDataset inherits from CachedDataset2, which in turn inherits from Dataset, so _num_shards should be set to 1 in the __init__ function. I am not sure how self._num_shards got changed to the number of GPUs in my case.
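
For context, the inheritance chain and defaults described above boil down to roughly this (a simplified sketch with stand-in classes, not the real RETURNN code; the actual classes live under returnn/datasets/):

```python
# Simplified stand-ins modeling only the sharding defaults from Dataset.__init__.
class Dataset:
    def __init__(self, _num_shards=1, _shard_index=0):
        self._num_shards = _num_shards
        self._shard_index = _shard_index

class CachedDataset2(Dataset):
    pass

class DistributeFilesDataset(CachedDataset2):
    pass

# Constructed normally, the defaults hold and the assertion above would pass:
ds = DistributeFilesDataset()
assert ds._num_shards == 1 and ds._shard_index == 0
```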

(cc @NeoLegends, @michelwi)

albertz commented Jan 20, 2025

As you told me before, you used a RETURNN version from 2024-07, where it was working fine.

albertz commented Jan 20, 2025

What is the dataset config? What is the training config (distributed setting)?

Judyxujj commented Jan 20, 2025

> What is the dataset config? What is the training config (distributed setting)?

The dataset config looks like this:

```python
train = {
    "buffer_size": 100,
    "class": "DistributeFilesDataset",
    "distrib_shard_files": True,
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    "partition_epoch": 200,
    "seq_ordering": "random",
}
```
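
(get_sub_epoch_dataset itself is not shown in the issue. In DistributeFilesDataset it maps the file subset of one sub-epoch to a dataset dict. The following is a hypothetical sketch only; the HDFDataset choice is an assumption based on the .hdf files in the traceback:)

```python
from typing import Any, Dict, List

# Hypothetical example; the real function is not part of this issue.
def get_sub_epoch_dataset(files_subepoch: List[str]) -> Dict[str, Any]:
    # Build the dataset covering the file shard assigned to one sub-epoch.
    return {
        "class": "HDFDataset",
        "files": files_subepoch,
    }
```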

NeoLegends commented Feb 4, 2025

The issue was likely introduced in #1630, which added _num_shards and _shard_index as __init__ parameters to Dataset. Parameters with a leading underscore are pickled by Dataset.__reduce__, which means the values were carried over into subprocesses. There, the assertion in DistributeFilesDataset.__init__ is triggered, since __init__ cannot tell that it is being called as part of unpickling. I believe these properties simply shouldn't be pickled for DistributeFilesDataset. #1676 is going to be a fix, but I can also ship a smaller fix before #1676 lands.
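
The failure mode can be illustrated with a minimal sketch (simplified stand-ins again, not the real RETURNN code): once the pickled kwargs carry the parent process's shard values into the subprocess, re-running __init__ trips the assertion.

```python
class Dataset:
    def __init__(self, _num_shards=1, _shard_index=0):
        self._num_shards = _num_shards
        self._shard_index = _shard_index

class DistributeFilesDataset(Dataset):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Mirrors the assertion at distrib_files.py line 171 in the traceback.
        assert self._num_shards == 1 and self._shard_index == 0, (
            f"{self.__class__.__name__}: cannot combine generic dataset "
            f"sharding with its own rank/size-based sharding"
        )

# Direct construction in the parent process is fine:
DistributeFilesDataset()

# But when __reduce__ includes the underscore-prefixed init parameters,
# the subprocess re-creates the dataset with the runtime shard values:
kwargs = {"_num_shards": 8, "_shard_index": 6}  # values from the traceback
DistributeFilesDataset(**kwargs)  # raises AssertionError, as in the report
```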
