Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

Closed
oiabtt opened this issue Jan 24, 2024 · 11 comments
Closed

Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

oiabtt opened this issue Jan 24, 2024 · 11 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@oiabtt
Copy link

oiabtt commented Jan 24, 2024

When I try to convert a BPE-encoded JSON file to MIDI, I keep encountering errors.
I suspect there might be some errors in the BPE encoding/decoding code, so I have run the code provided in the documentation(https://miditok.readthedocs.io/en/latest/bpe.html), but exceptions are still being thrown.

from copy import deepcopy
from miditok import TokenizerConfig, Structured, TokSequence
from miditoolkit import MidiFile
from pathlib import Path

# create json
config = TokenizerConfig()
tokenizer = Structured(config)
midi_path =  "/maestro-v3.0.0/2004/MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.midi"
tokens = tokenizer(midi)
tokenizer.save_tokens(tokens, "./test.json")

# load json
tokens = tokenizer.load_tokens(out_path)
tokens = TokSequence(ids=tokens)

# load bpe tokenizer
# bpe tokenizer already trained like this: 
# config = TokenizerConfig()
# tokenizer = Structured(config)
# tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_aug_paths)
tokenizer = Structured(params=Path("tokenizer.json"))

# Error
tokens_with_bpe = tokenizer.apply_bpe(deepcopy(tokens)) 

errors here:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
     20 tokenizer = Structured(params=Path("tokenizer.json"))
     22 # Error
---> 23 tokens_with_bpe = tokenizer.apply_bpe(deepcopy(tokens))

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:2386, in MIDITokenizer.apply_bpe(self, seq)
   2383         seq_.ids_bpe_encoded = True
   2385 else:
-> 2386     self.complete_sequence(seq)
   2387     encoded_tokens = self._bpe_model.encode([seq.bytes], is_pretokenized=True)
   2388     seq.ids = encoded_tokens.ids

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1334, in MIDITokenizer.complete_sequence(self, seq)
   1332     seq.tokens = self._events_to_tokens(seq.events)
   1333 elif seq.ids is not None:
-> 1334     seq.tokens = self._ids_to_tokens(seq.ids)
   1335 elif seq.bytes is not None:
   1336     seq.tokens = self._bytes_to_tokens(seq.bytes)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1379, in MIDITokenizer._ids_to_tokens(self, ids, as_str)
   1377 if len(ids) == 0:
   1378     return tokens
-> 1379 if isinstance(ids[0], list):  # multiple vocabularies
   1380     for (
   1381         multi_ids
   1382     ) in ids:  # cannot use recursion here because of the vocabulary type id
   1383         multi_event = []

KeyError: 0

Would you like me to provide a runnable code snippet to convert BPE-encoded JSON to MIDI?

@Natooz
Copy link
Owner

Natooz commented Jan 24, 2024

Hi, this is due to a i/o format mismatch.

Here is how you can fix it:

tokens = tokenizer(midi)  # list of TokSequence
tokenizer.save_tokens(tokens, "./test.json")

# load json
tokens = tokenizer.load_tokens(out_path)  # dictionary
# here tokens["ids"]'s value is a list of list of integers (first dim is track, second tokens)

# Create the TokSequence from the ids of the first sequence in the dictionary
tokens = TokSequence(ids=tokens["ids"][0])

But now that I think of it, it would be better indeed if all of this could be handled directly within tokenizer.load_token.
We can leave this issue open, and when I'll have more time I'll implement this so that it automatically returns a (list of) TokSequence depending on the tokenizer's configuration.

@Natooz Natooz added enhancement New feature or request help wanted Extra attention is needed labels Jan 24, 2024
@oiabtt
Copy link
Author

oiabtt commented Jan 24, 2024

Thanks, another problem left. Now I want to convert BPE-encoded JSON to MIDI, but I'm unable to do so successfully no matter what I try.

# bpe json
bpe_tokens = tokenizer.load_tokens("MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json")
bpe_tokens_seq = TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
# tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq) # Error 2 (show below)

Error 1
If I run this line:

----> 1 tokenizer.complete_sequence(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1334, in MIDITokenizer.complete_sequence(self, seq)
   1332     seq.tokens = self._events_to_tokens(seq.events)
   1333 elif seq.ids is not None:
-> 1334     seq.tokens = self._ids_to_tokens(seq.ids)
   1335 elif seq.bytes is not None:
   1336     seq.tokens = self._bytes_to_tokens(seq.bytes)
...
   3068     voc = (
   3069         self.__vocab_base_inv[vocab_id]
   3070         if self.is_multi_voc
   3071         else self.__vocab_base_inv
   3072     )
-> 3073 return voc[item]

KeyError: 453

Error 2
If I comment out this line(# tokenizer.complete_sequence(bpe_tokens_seq))

---> 1 bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1590, in MIDITokenizer.tokens_to_midi(self, tokens, programs, output_path)
   1582 if not isinstance(tokens, (TokSequence, list)) or (
   1583     isinstance(tokens, list)
   1584     and any(not isinstance(seq, TokSequence) for seq in tokens)
   1585 ):
   1586     tokens = self._convert_sequence_to_tokseq(
   1587         tokens, complete_seq=True, decode_bpe=True
   1588     )
-> 1590 midi = self._tokens_to_midi(tokens, programs)
...
    234     tokens = [tokens]
    235 for i in range(len(tokens)):
--> 236     tokens[i] = tokens[i].tokens
    237 midi = Score(self.time_division)
    239 # RESULTS

AttributeError: 'int' object has no attribute 'tokens'

@oiabtt
Copy link
Author

oiabtt commented Jan 24, 2024

Hi, this is due to a i/o format mismatch.

Here is how you can fix it:

tokens = tokenizer(midi)  # list of TokSequence
tokenizer.save_tokens(tokens, "./test.json")

# load json
tokens = tokenizer.load_tokens(out_path)  # dictionary
# here tokens["ids"]'s value is a list of list of integers (first dim is track, second tokens)

# Create the TokSequence from the ids of the first sequence in the dictionary
tokens = TokSequence(ids=tokens["ids"][0])

But now that I think of it, it would be better indeed if all of this could be handled directly within tokenizer.load_token. We can leave this issue open, and when I'll have more time I'll implement this so that it automatically returns a (list of) TokSequence depending on the tokenizer's configuration.

Thanks, another problem left. Now I want to convert BPE-encoded JSON to MIDI, but I'm unable to do so successfully no matter what I try.

# bpe json
bpe_tokens = tokenizer.load_tokens("MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json")
bpe_tokens_seq = TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
# tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq) # Error 2 (show below)

Error 1
If I run this line:

----> 1 tokenizer.complete_sequence(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1334, in MIDITokenizer.complete_sequence(self, seq)
   1332     seq.tokens = self._events_to_tokens(seq.events)
   1333 elif seq.ids is not None:
-> 1334     seq.tokens = self._ids_to_tokens(seq.ids)
   1335 elif seq.bytes is not None:
   1336     seq.tokens = self._bytes_to_tokens(seq.bytes)
...
   3068     voc = (
   3069         self.__vocab_base_inv[vocab_id]
   3070         if self.is_multi_voc
   3071         else self.__vocab_base_inv
   3072     )
-> 3073 return voc[item]

KeyError: 453

Error 2
If I comment out this line(# tokenizer.complete_sequence(bpe_tokens_seq))

---> 1 bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1590, in MIDITokenizer.tokens_to_midi(self, tokens, programs, output_path)
   1582 if not isinstance(tokens, (TokSequence, list)) or (
   1583     isinstance(tokens, list)
   1584     and any(not isinstance(seq, TokSequence) for seq in tokens)
   1585 ):
   1586     tokens = self._convert_sequence_to_tokseq(
   1587         tokens, complete_seq=True, decode_bpe=True
   1588     )
-> 1590 midi = self._tokens_to_midi(tokens, programs)
...
    234     tokens = [tokens]
    235 for i in range(len(tokens)):
--> 236     tokens[i] = tokens[i].tokens
    237 midi = Score(self.time_division)
    239 # RESULTS

AttributeError: 'int' object has no attribute 'tokens'

@Natooz
Copy link
Owner

Natooz commented Jan 24, 2024

Could you provide the tokenizer config and tokens files?

@oiabtt
Copy link
Author

oiabtt commented Jan 24, 2024

Could you provide the tokenizer config and tokens files?

bpe json is created by

tokenizer.tokenize_midi_dataset(        # 2 velocity and 1 duration values
    midi_aug_paths,
    tokens_dir,
    # midi_valid,
)

MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json

tokenizer is created by

augment_midi_dataset(
    dataset_dir,
    pitch_offsets=[-12, 12],
    velocity_offsets=[-4, 5],
    duration_offsets=[-0.5, 1],
    out_path=midi_aug_path,
)
tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_aug_paths)

tokenizer.json

@Natooz
Copy link
Owner

Natooz commented Jan 24, 2024

i/o format mismatch again.

bpe_tokens_seq = miditok.TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

Here the tokenizer has a (I,T) i/o format. It means that is expects inputs as a lists of TokSequences, one for each track of the MIDI (to tokenizer or retokenize).

@oiabtt
Copy link
Author

oiabtt commented Jan 24, 2024

```python
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])
TypeError                                 Traceback (most recent call last)
Cell In[22], line 2
      1 # tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
----> 2 bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1590, in MIDITokenizer.tokens_to_midi(self, tokens, programs, output_path)
   1582 if not isinstance(tokens, (TokSequence, list)) or (
   1583     isinstance(tokens, list)
   1584     and any(not isinstance(seq, TokSequence) for seq in tokens)
   1585 ):
   1586     tokens = self._convert_sequence_to_tokseq(
   1587         tokens, complete_seq=True, decode_bpe=True
   1588     )
-> 1590 midi = self._tokens_to_midi(tokens, programs)
   1592 # Create controls for pedals
   1593 # This is required so that they are saved when the MIDI is dumped, as symusic
   1594 # will only write the control messages.
   1595 if self.config.use_sustain_pedals:

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/tokenizations/structured.py:270, in Structured._tokens_to_midi(self, tokens, programs)
    261     current_instrument = Track(
    262         program=current_program,
    263         is_drum=is_drum,
   (...)
    266         else MIDI_INSTRUMENTS[current_program]["name"],
    267     )
    269 # Decode tokens
--> 270 for ti, token in enumerate(seq):
    271     token_type, token_val = token.split("_")
    272     if token_type == "TimeShift" and token_val != "0.0.1":

TypeError: 'NoneType' object is not iterable

@Natooz
Copy link
Owner

Natooz commented Jan 24, 2024

This should do it:

bpe_tokens_seq = miditok.TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
tokenizer.decode_bpe(bpe_tokens_seq)
tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

Note for myself: automatic BPE decoding we preprocessing TokSequences in tokens_to_midi

@Natooz
Copy link
Owner

Natooz commented Jan 24, 2024

What would have work out of the box:

bpe_tokens = tokenizer.load_tokens(
    "MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json"
)
bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens["ids"])

tokens_to_midi already automatically converts ids as lists of ints, arrays and tensors to TokSequence(s) internally. This is the simplest way to use the method. But it's good that you tried this way as I didn't spotted the cases were the user already gives a TokSequence, which needs to be preprocessed.

This should be fixed in the PR referencing this issue.

@oiabtt
Copy link
Author

oiabtt commented Jan 24, 2024

This should do it:

bpe_tokens_seq = miditok.TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
tokenizer.decode_bpe(bpe_tokens_seq)
tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

Note for myself: automatic BPE decoding we preprocessing TokSequences in tokens_to_midi

Thanks, it's ok now

@Natooz
Copy link
Owner

Natooz commented Jan 24, 2024

The two i/o automations mentioned here have been merge, closing this!
You can get them by installing from git. I'll wait a few days again before releasing the next version, waiting for potential other bugs/things to fix as we released the v3 less than a week.
Thanks again for the feedback!

@Natooz Natooz closed this as completed Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants