Slow Performance of tokenize_midi_dataset Function (#147)

Comments
Hi @Kinyugo, thank you for the report! :)
Thanks @Natooz. You are right, miditok v3.0 is much faster; I have seen about a 10x improvement.
If the tokenization is handled by the data loader, have you compared its speed against pre-tokenizing the dataset?
I didn't, so I can't answer with numbers. However, as the collator has multiple workers running on CPU and as tokenizing is pretty fast, I can't guarantee it, but we can expect little to no speed-up from pre-tokenizing.
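For illustration, here is a minimal sketch of the on-the-fly setup under discussion: tokenization happens inside a PyTorch `Dataset`, so the `DataLoader`'s CPU worker processes do the work. This is not code from the thread; the dataset path, sequence length, and placeholder collator are assumptions.

```python
from pathlib import Path

from miditok import REMI, TokenizerConfig
from symusic import Score
from torch.utils.data import DataLoader, Dataset


class OnTheFlyMidiDataset(Dataset):
    """Tokenizes each MIDI when it is fetched, i.e. in a DataLoader worker."""

    def __init__(self, midi_paths: list[Path], tokenizer: REMI, max_seq_len: int):
        self.midi_paths = midi_paths
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self) -> int:
        return len(self.midi_paths)

    def __getitem__(self, idx: int) -> list[int]:
        # Tokenization happens here, in a DataLoader worker process.
        tokens = self.tokenizer(Score(str(self.midi_paths[idx])))
        return tokens.ids[: self.max_seq_len]


tokenizer = REMI(TokenizerConfig(use_programs=True))  # single token stream
paths = list(Path("dataset").glob("**/*.mid"))        # illustrative path
loader = DataLoader(
    OnTheFlyMidiDataset(paths, tokenizer, 1024),
    batch_size=16,
    num_workers=4,     # the CPU workers do the tokenizing
    collate_fn=list,   # placeholder collator; a real one would pad
)
```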
Interesting. If I get some time I'll test this and share the results with you. That would help reduce complexity a lot. But I assume that for BPE you will want to pre-train the tokenizer first. Do you have an idea of how one might extract the time, say for the start and the end of a segment? I think this would be quite possible when tokenizing on the fly, as we have access to the MIDI file itself. BTW, awesome job with miditok; it has been quite instrumental in my work.
I just realised that I mixed up collator and data loader in my last comment. 😅
That would be awesome! They could be integrated into the docs!
Yes indeed. Note that BPE training is also done by tokenizing MIDIs on the fly.
I'm not sure I understand exactly what you mean by start and end of segment. Do you refer to the segments of tokens created by the dataset, and to time in beats?
Thank you! Such comments are what motivate one the most to contribute to open-source! 🫶
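A hedged sketch of what training the tokenizer (BPE) straight from MIDI files looks like, with no pre-tokenization pass, as stated above. The method was exposed as `learn_bpe` in the miditok versions current at the time of this thread (later versions rename it `train`); the vocabulary size and paths are illustrative.

```python
from pathlib import Path

from miditok import REMI

tokenizer = REMI()  # default config, for illustration
midi_paths = list(Path("dataset").glob("**/*.mid"))  # illustrative path

# The MIDIs are tokenized on the fly during BPE training; no JSON dump needed.
tokenizer.learn_bpe(vocab_size=20000, files_paths=midi_paths)
tokenizer.save_params("tokenizer.json")
```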
As I thought about how to implement a good solution, it appeared that the right way would be to have one MIDI --> one token sequence. The data flow would be something like: …
The MIDI splitting could be achieved in several ways: …
It's probably best to consider all of these options, as they could cover the various cases of users having different hardware configurations and usages. What do you think of this?
We could also do the MIDI splitting in the …
Hello. Thanks for your apt replies. I have a few questions though.
Do you mean that I won't have to pre-train the tokenizer before starting training?
Thanks so much for the insights on splitting MIDIs. I have been splitting the tokens directly, as we do in language modelling tasks, and haven't taken the MIDI structure into account. I'll definitely try your approach.
I appreciate all the good work you are doing with miditok. I'll be experimenting with these over the weekend and hopefully share my insights too.
No, I just meant that when training the tokenizer, the training data (MIDIs) is tokenized on the fly; there is no need to pre-tokenize it. :)
Thank you! The drawback however is that if the number of tokens/beat has a great variance, we will either lose a portion of the training tokens (cut by the maximum token sequence length) or train with potentially short sequences (that would be padded), increasing the overall training time. I assume a quick analysis of the tokens/beat of the training data is necessary in order to split it the best way.
🫶
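A quick tokens/beat analysis along the lines suggested above could look like this. This is a hedged sketch, assuming miditok v3 with the symusic backend; the "dataset" path is illustrative and beats are approximated as quarter notes.

```python
from pathlib import Path
from statistics import mean, stdev

from miditok import REMI, TokenizerConfig
from symusic import Score

tokenizer = REMI(TokenizerConfig(use_programs=True))  # single token stream
ratios = []
for path in Path("dataset").glob("**/*.mid"):
    score = Score(str(path))
    # Approximate the number of beats from the score duration in ticks.
    num_beats = score.end() / score.ticks_per_quarter
    if num_beats > 0:
        ratios.append(len(tokenizer(score).ids) / num_beats)

print(f"tokens/beat: mean={mean(ratios):.2f}, stdev={stdev(ratios):.2f}")
```

A high standard deviation here would indicate that a fixed split size will produce both truncated and heavily padded sequences, which is the trade-off described above.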
Nice. I had misunderstood that.
I am also not sure how we will teach the model to generate full samples. Previously, even with splitting, we could just append BOS and EOS tokens. How does that work when splitting at the MIDI level? Would we have to split into overlapping sequences?
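For reference, a hypothetical sketch of the token-level splitting described above (language-modelling style): chunk one long sequence of token ids and wrap each chunk with BOS/EOS. All names here are illustrative, not miditok API.

```python
def split_ids(
    ids: list[int], max_len: int, bos_id: int, eos_id: int
) -> list[list[int]]:
    # Leave room for the BOS and EOS tokens in each chunk.
    step = max_len - 2
    return [
        [bos_id, *ids[i : i + step], eos_id] for i in range(0, len(ids), step)
    ]
```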
About full samples: I am currently experimenting with a …
The MIDIs themselves are split in chunks of …
Nice! Looking forward to your findings.
I now understand why splitting at the MIDI level makes sense. In that case, it might make sense to split dynamically during training; that way we can also easily figure out where and when to add the BOS and EOS tokens. Also, we can split at different points, reducing the need for padding/trimming.
Absolutely. We will have to find a way to design such a dynamic trimming method.
No problem. If I do get time, maybe I can contribute that. Currently held up as well.
This issue is stale because it has been open for 30 days with no activity.
Hi @Kinyugo 👋 I ended up making a "dynamic" splitting solution based on the note density of each bar, in order to reduce padding. You can take a look at #148. I still have a few things to adjust, but that part is almost done! 🙌
Hello @Natooz. Thanks for looking into the issue. I am currently running tests for this. However, I do notice an IndexError. Here is my tokenizer config:

```python
from miditok import REMI, TokenizerConfig

tokenizer = REMI(
    TokenizerConfig(
        use_tempos=True,
        use_programs=True,
        use_time_signatures=True,
        use_chords=True,
        use_rests=True,
        one_token_stream_for_programs=True,
        special_tokens=["PAD", "BOS", "EOS"],
    )
)
```
Thanks for taking the time to test it, and for reporting this bug!
Just pushed it, I think it should do it 🙌
Thanks for the quick response. In a typical sequence modelling scenario, wouldn't it make sense to have overlapping sequences? Or what's your recommended approach?
By overlapping sequences, are you referring to chunks of music that would cover several portions of the same bars?
Not necessarily an N-1 overlap, but just a way to control the overlap between the sequences. I reckon this would help the model learn continuation better. Also, do you pass the type of chunk for each sequence, or how would one go about adding BOS and EOS tokens?
That's a good idea, I'll add an argument for that. For BOS and EOS tokens, that's actually the last thing remaining to do. Right now BOS and EOS tokens are added for each chunk, but I intend to add markers in each chunk to indicate their "number" in the original MIDI, and add a BOS only to the first chunk and an EOS only to the last.
That will be awesome. Do you have an idea of how you will go about adding the markers, such that someone using a custom dataset/dataloader can replicate this functionality easily?
I think a good way is to add a MIDI marker at the first tick of each chunk.
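A minimal sketch of tagging a chunk the way the detection code shared below expects: a marker at tick 0 whose text encodes "chunk i/last". It assumes symusic as the MIDI backend; the file names and chunk numbers are illustrative.

```python
from symusic import Score, TextMeta

chunk = Score("chunk_0.mid")    # hypothetical chunk file
chunk_id, chunk_id_last = 0, 3  # e.g. the first of four chunks

# Marker at tick 0, in the format matched by marker.text.startswith(...) below.
chunk.markers.append(TextMeta(0, f"miditok: chunk {chunk_id}/{chunk_id_last}"))
chunk.dump_midi("chunk_0.mid")
```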
Could you link the exact lines for how they are detected? FYI, I have verified that the fix for the IndexError works. Thanks for fixing that.
It's in …:

```python
# If this file is a chunk (split_midis_for_training), determine its id.
# By default, we add BOS and EOS tokens following the values of
# self.bos_token_id and self.eos_token_id (that may be None), except when the
# file is identified as a chunk.
add_bos_token = add_eos_token = True
for marker in midi.markers:
    if marker.time != 0:
        break
    if marker.text.startswith("miditok: chunk"):
        chunk_id, chunk_id_last = map(
            int, marker.text.split(" ")[-1].split("/")
        )
        add_bos_token = chunk_id == 0
        add_eos_token = chunk_id == chunk_id_last
# Adds BOS/EOS if necessary...
```
Chunk overlap is pushed. I'll have a few tests to do, then the docs to modify, and we'll be good to merge! 🙌
That's a nice way to implement it. You should consider adding it somewhere easier to access in the docs/tutorials. Thanks for the updates. I can't test them atm, but will let you know once I do.
The new version with the new …
I have noticed a significant performance gap between two different scripts I am using to tokenize my dataset. The first script, which filters MIDI files and handles saving/loading manually, operates at approximately 300 iter/s. In comparison, the second script, which uses the `tokenize_midi_dataset` function, runs about 15x slower (around 20 iter/s) than the manual implementation. I think the `tokenize_midi_dataset` function doesn't take advantage of all cores. Note that the filter-MIDI script saves MIDI files while `tokenize_midi_dataset` saves JSON. Also, the filter-MIDI script utilizes all the available cores.

Filter Midi Script
…

Tokenize Midi Script
…
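The two scripts referenced above were not captured in this archive. As a clearly hypothetical stand-in, here is a rough sketch of tokenizing a dataset across all cores with `ProcessPoolExecutor`, which is the behaviour the report attributes to the manual script; the paths and default config are illustrative, assuming miditok v3.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from miditok import REMI
from symusic import Score

tokenizer = REMI()  # default config, for illustration


def tokenize_one(midi_path: Path) -> None:
    # Tokenize a single MIDI and save the tokens as JSON next to it.
    tokens = tokenizer(Score(str(midi_path)))
    tokenizer.save_tokens(tokens, midi_path.with_suffix(".json"))


if __name__ == "__main__":
    midi_paths = list(Path("dataset").glob("**/*.mid"))  # illustrative path
    with ProcessPoolExecutor() as pool:  # one worker per available core
        list(pool.map(tokenize_one, midi_paths))
```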