Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

oiabtt · 2024-01-24T10:32:14Z

When I try to convert a BPE-encoded JSON file to MIDI, I keep encountering errors.
I suspect there might be some errors in the BPE encoding/decoding code, so I have run the code provided in the documentation(https://miditok.readthedocs.io/en/latest/bpe.html), but exceptions are still being thrown.

from copy import deepcopy
from miditok import TokenizerConfig, Structured, TokSequence
from miditoolkit import MidiFile
from pathlib import Path

# create json
config = TokenizerConfig()
tokenizer = Structured(config)
midi_path =  "/maestro-v3.0.0/2004/MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.midi"
tokens = tokenizer(midi)
tokenizer.save_tokens(tokens, "./test.json")

# load json
tokens = tokenizer.load_tokens(out_path)
tokens = TokSequence(ids=tokens)

# load bpe tokenizer
# bpe tokenizer already trained like this: 
# config = TokenizerConfig()
# tokenizer = Structured(config)
# tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_aug_paths)
tokenizer = Structured(params=Path("tokenizer.json"))

# Error
tokens_with_bpe = tokenizer.apply_bpe(deepcopy(tokens))

errors here:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
     20 tokenizer = Structured(params=Path("tokenizer.json"))
     22 # Error
---> 23 tokens_with_bpe = tokenizer.apply_bpe(deepcopy(tokens))

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:2386, in MIDITokenizer.apply_bpe(self, seq)
   2383         seq_.ids_bpe_encoded = True
   2385 else:
-> 2386     self.complete_sequence(seq)
   2387     encoded_tokens = self._bpe_model.encode([seq.bytes], is_pretokenized=True)
   2388     seq.ids = encoded_tokens.ids

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1334, in MIDITokenizer.complete_sequence(self, seq)
   1332     seq.tokens = self._events_to_tokens(seq.events)
   1333 elif seq.ids is not None:
-> 1334     seq.tokens = self._ids_to_tokens(seq.ids)
   1335 elif seq.bytes is not None:
   1336     seq.tokens = self._bytes_to_tokens(seq.bytes)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1379, in MIDITokenizer._ids_to_tokens(self, ids, as_str)
   1377 if len(ids) == 0:
   1378     return tokens
-> 1379 if isinstance(ids[0], list):  # multiple vocabularies
   1380     for (
   1381         multi_ids
   1382     ) in ids:  # cannot use recursion here because of the vocabulary type id
   1383         multi_event = []

KeyError: 0

Would you like me to provide a runnable code snippet to convert BPE-encoded JSON to MIDI?

The text was updated successfully, but these errors were encountered:

Natooz · 2024-01-24T10:47:34Z

Hi, this is due to a i/o format mismatch.

Here is how you can fix it:

tokens = tokenizer(midi)  # list of TokSequence
tokenizer.save_tokens(tokens, "./test.json")

# load json
tokens = tokenizer.load_tokens(out_path)  # dictionary
# here tokens["ids"]'s value is a list of list of integers (first dim is track, second tokens)

# Create the TokSequence from the ids of the first sequence in the dictionary
tokens = TokSequence(ids=tokens["ids"][0])

But now that I think of it, it would be better indeed if all of this could be handled directly within tokenizer.load_token.
We can leave this issue open, and when I'll have more time I'll implement this so that it automatically returns a (list of) TokSequence depending on the tokenizer's configuration.

oiabtt · 2024-01-24T11:09:37Z

Thanks, another problem left. Now I want to convert BPE-encoded JSON to MIDI, but I'm unable to do so successfully no matter what I try.

# bpe json
bpe_tokens = tokenizer.load_tokens("MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json")
bpe_tokens_seq = TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
# tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq) # Error 2 (show below)

Error 1
If I run this line:

----> 1 tokenizer.complete_sequence(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1334, in MIDITokenizer.complete_sequence(self, seq)
   1332     seq.tokens = self._events_to_tokens(seq.events)
   1333 elif seq.ids is not None:
-> 1334     seq.tokens = self._ids_to_tokens(seq.ids)
   1335 elif seq.bytes is not None:
   1336     seq.tokens = self._bytes_to_tokens(seq.bytes)
...
   3068     voc = (
   3069         self.__vocab_base_inv[vocab_id]
   3070         if self.is_multi_voc
   3071         else self.__vocab_base_inv
   3072     )
-> 3073 return voc[item]

KeyError: 453

Error 2
If I comment out this line(# tokenizer.complete_sequence(bpe_tokens_seq))

---> 1 bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1590, in MIDITokenizer.tokens_to_midi(self, tokens, programs, output_path)
   1582 if not isinstance(tokens, (TokSequence, list)) or (
   1583     isinstance(tokens, list)
   1584     and any(not isinstance(seq, TokSequence) for seq in tokens)
   1585 ):
   1586     tokens = self._convert_sequence_to_tokseq(
   1587         tokens, complete_seq=True, decode_bpe=True
   1588     )
-> 1590 midi = self._tokens_to_midi(tokens, programs)
...
    234     tokens = [tokens]
    235 for i in range(len(tokens)):
--> 236     tokens[i] = tokens[i].tokens
    237 midi = Score(self.time_division)
    239 # RESULTS

AttributeError: 'int' object has no attribute 'tokens'

oiabtt · 2024-01-24T11:10:51Z

Hi, this is due to a i/o format mismatch.

Here is how you can fix it:
tokens = tokenizer(midi)  # list of TokSequence
tokenizer.save_tokens(tokens, "./test.json")

# load json
tokens = tokenizer.load_tokens(out_path)  # dictionary
# here tokens["ids"]'s value is a list of list of integers (first dim is track, second tokens)

# Create the TokSequence from the ids of the first sequence in the dictionary
tokens = TokSequence(ids=tokens["ids"][0])
But now that I think of it, it would be better indeed if all of this could be handled directly within tokenizer.load_token. We can leave this issue open, and when I'll have more time I'll implement this so that it automatically returns a (list of) TokSequence depending on the tokenizer's configuration.

Thanks, another problem left. Now I want to convert BPE-encoded JSON to MIDI, but I'm unable to do so successfully no matter what I try.

# bpe json
bpe_tokens = tokenizer.load_tokens("MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json")
bpe_tokens_seq = TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
# tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq) # Error 2 (show below)

Error 1
If I run this line:

----> 1 tokenizer.complete_sequence(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1334, in MIDITokenizer.complete_sequence(self, seq)
   1332     seq.tokens = self._events_to_tokens(seq.events)
   1333 elif seq.ids is not None:
-> 1334     seq.tokens = self._ids_to_tokens(seq.ids)
   1335 elif seq.bytes is not None:
   1336     seq.tokens = self._bytes_to_tokens(seq.bytes)
...
   3068     voc = (
   3069         self.__vocab_base_inv[vocab_id]
   3070         if self.is_multi_voc
   3071         else self.__vocab_base_inv
   3072     )
-> 3073 return voc[item]

KeyError: 453

Error 2
If I comment out this line(# tokenizer.complete_sequence(bpe_tokens_seq))

---> 1 bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens_seq)

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1590, in MIDITokenizer.tokens_to_midi(self, tokens, programs, output_path)
   1582 if not isinstance(tokens, (TokSequence, list)) or (
   1583     isinstance(tokens, list)
   1584     and any(not isinstance(seq, TokSequence) for seq in tokens)
   1585 ):
   1586     tokens = self._convert_sequence_to_tokseq(
   1587         tokens, complete_seq=True, decode_bpe=True
   1588     )
-> 1590 midi = self._tokens_to_midi(tokens, programs)
...
    234     tokens = [tokens]
    235 for i in range(len(tokens)):
--> 236     tokens[i] = tokens[i].tokens
    237 midi = Score(self.time_division)
    239 # RESULTS

AttributeError: 'int' object has no attribute 'tokens'

Natooz · 2024-01-24T11:18:04Z

Could you provide the tokenizer config and tokens files?

oiabtt · 2024-01-24T11:29:01Z

Could you provide the tokenizer config and tokens files?

bpe json is created by

tokenizer.tokenize_midi_dataset(        # 2 velocity and 1 duration values
    midi_aug_paths,
    tokens_dir,
    # midi_valid,
)

MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json

tokenizer is created by

augment_midi_dataset(
    dataset_dir,
    pitch_offsets=[-12, 12],
    velocity_offsets=[-4, 5],
    duration_offsets=[-0.5, 1],
    out_path=midi_aug_path,
)
tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_aug_paths)

tokenizer.json

Natooz · 2024-01-24T11:39:28Z

i/o format mismatch again.

bpe_tokens_seq = miditok.TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

Here the tokenizer has a (I,T) i/o format. It means that is expects inputs as a lists of TokSequences, one for each track of the MIDI (to tokenizer or retokenize).

oiabtt · 2024-01-24T11:46:00Z

```python
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

TypeError                                 Traceback (most recent call last)
Cell In[22], line 2
      1 # tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
----> 2 bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/midi_tokenizer.py:1590, in MIDITokenizer.tokens_to_midi(self, tokens, programs, output_path)
   1582 if not isinstance(tokens, (TokSequence, list)) or (
   1583     isinstance(tokens, list)
   1584     and any(not isinstance(seq, TokSequence) for seq in tokens)
   1585 ):
   1586     tokens = self._convert_sequence_to_tokseq(
   1587         tokens, complete_seq=True, decode_bpe=True
   1588     )
-> 1590 midi = self._tokens_to_midi(tokens, programs)
   1592 # Create controls for pedals
   1593 # This is required so that they are saved when the MIDI is dumped, as symusic
   1594 # will only write the control messages.
   1595 if self.config.use_sustain_pedals:

File ~/anaconda3/envs/midi/lib/python3.10/site-packages/miditok/tokenizations/structured.py:270, in Structured._tokens_to_midi(self, tokens, programs)
    261     current_instrument = Track(
    262         program=current_program,
    263         is_drum=is_drum,
   (...)
    266         else MIDI_INSTRUMENTS[current_program]["name"],
    267     )
    269 # Decode tokens
--> 270 for ti, token in enumerate(seq):
    271     token_type, token_val = token.split("_")
    272     if token_type == "TimeShift" and token_val != "0.0.1":

TypeError: 'NoneType' object is not iterable

Natooz · 2024-01-24T11:56:58Z

This should do it:

bpe_tokens_seq = miditok.TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
tokenizer.decode_bpe(bpe_tokens_seq)
tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

Note for myself: automatic BPE decoding we preprocessing TokSequences in tokens_to_midi

Natooz · 2024-01-24T12:23:49Z

What would have work out of the box:

bpe_tokens = tokenizer.load_tokens(
    "MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_05_Track05_wav.json"
)
bpe_tokens_midi = tokenizer.tokens_to_midi(bpe_tokens["ids"])

tokens_to_midi already automatically converts ids as lists of ints, arrays and tensors to TokSequence(s) internally. This is the simplest way to use the method. But it's good that you tried this way as I didn't spotted the cases were the user already gives a TokSequence, which needs to be preprocessed.

This should be fixed in the PR referencing this issue.

oiabtt · 2024-01-24T12:29:00Z

This should do it:

bpe_tokens_seq = miditok.TokSequence(ids=bpe_tokens["ids"][0], ids_bpe_encoded=True)
tokenizer.decode_bpe(bpe_tokens_seq)
tokenizer.complete_sequence(bpe_tokens_seq)  # Error 1 (show below)
bpe_tokens_midi = tokenizer.tokens_to_midi([bpe_tokens_seq])

Note for myself: automatic BPE decoding we preprocessing TokSequences in tokens_to_midi

Thanks, it's ok now

Natooz · 2024-01-24T16:02:15Z

The two i/o automations mentioned here have been merge, closing this!
You can get them by installing from git. I'll wait a few days again before releasing the next version, waiting for potential other bugs/things to fix as we released the v3 less than a week.
Thanks again for the feedback!

Natooz added enhancement New feature or request help wanted Extra attention is needed labels Jan 24, 2024

Natooz mentioned this issue Jan 24, 2024

decoding: automatically making sure to decode BPE then completing tokens #138

Merged

Natooz mentioned this issue Jan 24, 2024

load_tokens now returning TokSequence #139

Merged

Natooz closed this as completed Jan 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

oiabtt commented Jan 24, 2024 •

edited

Loading

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024

oiabtt commented Jan 24, 2024

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024 •

edited

Loading

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024

Natooz commented Jan 24, 2024

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024

Natooz commented Jan 24, 2024

Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

Unable to apply BPE or convert BPE-encoded JSON to MIDI #137

Comments

oiabtt commented Jan 24, 2024 • edited Loading

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024

oiabtt commented Jan 24, 2024

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024 • edited Loading

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024

Natooz commented Jan 24, 2024

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024

Natooz commented Jan 24, 2024

oiabtt commented Jan 24, 2024 •

edited

Loading

oiabtt commented Jan 24, 2024 •

edited

Loading