DistributeFilesDataset _num_shards issue #1678

Open
Judyxujj opened this issue Jan 20, 2025 · 4 comments

Judyxujj commented Jan 20, 2025

With the latest RETURNN, when I use DistributeFilesDataset, I get this error:

File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/basic.py", line 227, in Dataset._create_from_reduce                     
  line: ds = cls(**kwargs)                                                                                                                     
  locals:                                                                                                                              
   ds = <not found>                                                                                                                        
   cls = <local> <class 'returnn.datasets.distrib_files.DistributeFilesDataset'>                                                                                          
   kwargs = <local> {'files': ['/ssd/jxu/nas/data/speech/FR_FR/16kHz/EPPS/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-25/EPPS-batch.1.v1.hdf.15', '/ssd/jxu/nas/data/speech/EN_US/16kHz/NEWS.HQ/corpus/batch.2.NPR.v3/hdf-raw_wav.16kHz.split-261/NEWS.HQ-batch.2.NP
R.v3.hdf.7', '/ssd/jxu/nas/data/speech/IT_IT/16kHz/IT.parli..., len = 25                                                                                               
 File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/distrib_files.py", line 171, in DistributeFilesDataset.__init__               
  line: assert self._num_shards == 1 and self._shard_index == 0, ( # ensure defaults are set                                                                                    
       f"{self}: Cannot use both dataset-sharding via properties _num_shards and _shard index "                                                                                
       f"and {self.__class__.__name__}'s own sharding implementation based on the trainings rank and size."                                                                          
     )                                                                                                                              
  locals:                                                                                                                              
   self = <local> <DistributeFilesDataset 'train' epoch=None>                                                                                                   
   self._num_shards = <local> 8                                                                                                                  
   self._shard_index = <local> 6 

DistributeFilesDataset inherits from CachedDataset2, which in turn inherits from Dataset, so _num_shards should be set to 1 in the __init__ function. I am not sure how self._num_shards got changed to the number of GPUs in my case.
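
For context, the inheritance chain and defaults described above boil down to roughly this (a simplified sketch with stand-in classes, not the real RETURNN code; the actual classes live under returnn/datasets/):

```python
# Simplified stand-ins modeling only the sharding defaults from Dataset.__init__.
class Dataset:
    def __init__(self, _num_shards=1, _shard_index=0):
        self._num_shards = _num_shards
        self._shard_index = _shard_index

class CachedDataset2(Dataset):
    pass

class DistributeFilesDataset(CachedDataset2):
    pass

# Constructed normally, the defaults hold and the assertion above would pass:
ds = DistributeFilesDataset()
assert ds._num_shards == 1 and ds._shard_index == 0
```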

(cc @NeoLegends, @michelwi)

albertz commented Jan 20, 2025

As you told me before, you used a RETURNN version from 2024-07, where it was working fine.

albertz commented Jan 20, 2025

What is the dataset config? What is the training config (distributed setting)?

Judyxujj commented Jan 20, 2025

> What is the dataset config? What is the training config (distributed setting)?

The dataset config looks like this:

```python
train = {
    "buffer_size": 100,
    "class": "DistributeFilesDataset",
    "distrib_shard_files": True,
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    "partition_epoch": 200,
    "seq_ordering": "random",
}
```
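
(get_sub_epoch_dataset itself is not shown in the issue. In DistributeFilesDataset it maps the file subset of one sub-epoch to a dataset dict. The following is a hypothetical sketch only; the HDFDataset choice is an assumption based on the .hdf files in the traceback:)

```python
from typing import Any, Dict, List

# Hypothetical example; the real function is not part of this issue.
def get_sub_epoch_dataset(files_subepoch: List[str]) -> Dict[str, Any]:
    # Build the dataset covering the file shard assigned to one sub-epoch.
    return {
        "class": "HDFDataset",
        "files": files_subepoch,
    }
```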

NeoLegends commented Feb 4, 2025

The issue was likely introduced in #1630, which added _num_shards and _shard_index as __init__ parameters to Dataset. Parameters with a leading underscore are pickled by Dataset.__reduce__, which means the values were carried over into subprocesses. There, the assertion in DistributeFilesDataset.__init__ is triggered, since __init__ cannot tell that it is being called as part of unpickling. I believe these properties simply shouldn't be pickled for DistributeFilesDataset. #1676 is going to be a fix, but I can also ship a smaller fix before #1676 lands.
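
The failure mode can be illustrated with a minimal sketch (simplified stand-ins again, not the real RETURNN code): once the pickled kwargs carry the parent process's shard values into the subprocess, re-running __init__ trips the assertion.

```python
class Dataset:
    def __init__(self, _num_shards=1, _shard_index=0):
        self._num_shards = _num_shards
        self._shard_index = _shard_index

class DistributeFilesDataset(Dataset):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Mirrors the assertion at distrib_files.py line 171 in the traceback.
        assert self._num_shards == 1 and self._shard_index == 0, (
            f"{self.__class__.__name__}: cannot combine generic dataset "
            f"sharding with its own rank/size-based sharding"
        )

# Direct construction in the parent process is fine:
DistributeFilesDataset()

# But when __reduce__ includes the underscore-prefixed init parameters,
# the subprocess re-creates the dataset with the runtime shard values:
kwargs = {"_num_shards": 8, "_shard_index": 6}  # values from the traceback
DistributeFilesDataset(**kwargs)  # raises AssertionError, as in the report
```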
