Comparing changes

base repository: Natooz/MidiTok
base: v3.0.4
head repository: Natooz/MidiTok
compare: v3.0.5
  • 18 commits
  • 51 files changed
  • 7 contributors

Commits on Sep 25, 2024

  1. updating .gitignore

    Natooz committed Sep 25, 2024

    ced3331

Commits on Sep 26, 2024

  1. fix import HfHubHTTPError with latest hf hub package update (#199)

    Natooz authored Sep 26, 2024

    c161900

Commits on Sep 27, 2024

  1. using --dist worksteal for tests distribution among workers (#201)

    Natooz authored Sep 27, 2024

    597c18f

Commits on Oct 1, 2024

  1. gitignore update: removing macos specific files

    Natooz committed Oct 1, 2024

    b53111a

Commits on Oct 7, 2024

  1. Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#202)

    Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4.5.0 to 4.6.0.
    - [Release notes](https://github.com/codecov/codecov-action/releases)
    - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
    - [Commits](codecov/codecov-action@v4.5.0...v4.6.0)
    
    ---
    updated-dependencies:
    - dependency-name: codecov/codecov-action
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Oct 7, 2024

    1045aed
  2. Adding PerTok to README.md

    Natooz authored Oct 7, 2024

    c3800fb

Commits on Nov 12, 2024

  1. MDTK_200 : implemented add_trailing_bars (#204)

    * MDTK_200 : implemented add_trailing_bars
    
    * MDTK_200 : moved add_trailing_bars to REMI additional_params; fixed pytest python-version up to 3.12
    
    * Update miditok/tokenizations/remi.py
    
    Co-authored-by: Nathan Fradet <[email protected]>
    
    * Apply suggestions from code review
    
    Co-authored-by: Nathan Fradet <[email protected]>
    
    ---------
    
    Co-authored-by: Mintas <[email protected]>
    Co-authored-by: Nathan Fradet <[email protected]>
    3 people authored Nov 12, 2024

    415126f

Commits on Nov 13, 2024

  1. Remove refs to split_midis_for_training in doc (#205)

    Instead, put split_files_for_training
    Zaka authored Nov 13, 2024

    76bc990

Commits on Nov 18, 2024

  1. Bump codecov/codecov-action from 4.6.0 to 5.0.2 (#207)

    Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4.6.0 to 5.0.2.
    - [Release notes](https://github.com/codecov/codecov-action/releases)
    - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
    - [Commits](codecov/codecov-action@v4.6.0...v5.0.2)
    
    ---
    updated-dependencies:
    - dependency-name: codecov/codecov-action
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Nov 18, 2024

    5654535

Commits on Nov 25, 2024

  1. Bump codecov/codecov-action from 5.0.2 to 5.0.7 (#209)

    Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.0.2 to 5.0.7.
    - [Release notes](https://github.com/codecov/codecov-action/releases)
    - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
    - [Commits](codecov/codecov-action@v5.0.2...v5.0.7)
    
    ---
    updated-dependencies:
    - dependency-name: codecov/codecov-action
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Nov 25, 2024

    f050072

Commits on Nov 30, 2024

  1. Catching exception when decoding velocity values in MIDILike (#210)

    * catching exception when decoding velocity values in MIDILike
    
    * switching to `src` directory structure, dropping support Python 3.8
    
    * using python 3.12 for tests
    
    * adding codecov token to tests action
    Natooz authored Nov 30, 2024

    0f7c273

Commits on Dec 9, 2024

  1. Bump codecov/codecov-action from 5.0.7 to 5.1.1 (#213)

    Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.0.7 to 5.1.1.
    - [Release notes](https://github.com/codecov/codecov-action/releases)
    - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
    - [Commits](codecov/codecov-action@v5.0.7...v5.1.1)
    
    ---
    updated-dependencies:
    - dependency-name: codecov/codecov-action
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Dec 9, 2024

    370ffbe

Commits on Dec 23, 2024

  1. Bump codecov/codecov-action from 5.1.1 to 5.1.2 (#215)

    Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.1.1 to 5.1.2.
    - [Release notes](https://github.com/codecov/codecov-action/releases)
    - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
    - [Commits](codecov/codecov-action@v5.1.1...v5.1.2)
    
    ---
    updated-dependencies:
    - dependency-name: codecov/codecov-action
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Dec 23, 2024

    c63ce0b
  2. Update example notebook reference (#216)

    Signed-off-by: Emmanuel Ferdman <[email protected]>
    emmanuel-ferdman authored Dec 23, 2024

    515ed50

Commits on Jan 27, 2025

  1. Bump actions/stale from 9.0.0 to 9.1.0 (#218)

    Bumps [actions/stale](https://github.com/actions/stale) from 9.0.0 to 9.1.0.
    - [Release notes](https://github.com/actions/stale/releases)
    - [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
    - [Commits](actions/stale@v9.0.0...v9.1.0)
    
    ---
    updated-dependencies:
    - dependency-name: actions/stale
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Jan 27, 2025

    53f868b
  2. Bump codecov/codecov-action from 5.1.2 to 5.3.1 (#219)

    Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.1.2 to 5.3.1.
    - [Release notes](https://github.com/codecov/codecov-action/releases)
    - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
    - [Commits](codecov/codecov-action@v5.1.2...v5.3.1)
    
    ---
    updated-dependencies:
    - dependency-name: codecov/codecov-action
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...
    
    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Jan 27, 2025

    a2290f2

Commits on Feb 8, 2025

  1. bugfix training initial alphabet (#220)

    Co-authored-by: Nathan Fradet <@>
    Natooz authored Feb 8, 2025

    2982218

Commits on Feb 14, 2025

  1. Add a parameter augment_copy to the augment_score function (#221)

    * added new parameter augment_copy to the augment_score function
    
    * fixed linter warning
    pstrepetov authored Feb 14, 2025

    e94589d
Showing with 381 additions and 229 deletions.
  1. +1 −1 .github/codecov.yml
  2. +1 −1 .github/workflows/close-stale-issues.yml
  3. +10 −4 .github/workflows/pytest.yml
  4. +83 −18 .gitignore
  5. +3 −2 README.md
  6. +2 −2 benchmarks/miditok_preprocess_file/benchmark_preprocess.py
  7. +2 −2 benchmarks/miditok_tokenize/benchmark_tokenize.py
  8. +2 −2 benchmarks/tokenizer_training/benchmark_training.py
  9. +1 −1 docs/conf.py
  10. +3 −3 docs/pytorch_data.rst
  11. +13 −12 pyproject.toml
  12. 0 { → src}/miditok/__init__.py
  13. 0 { → src}/miditok/attribute_controls/__init__.py
  14. 0 { → src}/miditok/attribute_controls/bar_attribute_controls.py
  15. 0 { → src}/miditok/attribute_controls/classes.py
  16. 0 { → src}/miditok/attribute_controls/track_attribute_controls.py
  17. +13 −0 { → src}/miditok/classes.py
  18. +1 −0 { → src}/miditok/constants.py
  19. 0 { → src}/miditok/data_augmentation/__init__.py
  20. +5 −1 { → src}/miditok/data_augmentation/data_augmentation.py
  21. +3 −3 { → src}/miditok/midi_tokenizer.py
  22. 0 { → src}/miditok/pytorch_data/__init__.py
  23. 0 { → src}/miditok/pytorch_data/collators.py
  24. 0 { → src}/miditok/pytorch_data/datasets.py
  25. 0 { → src}/miditok/tokenizations/__init__.py
  26. 0 { → src}/miditok/tokenizations/cp_word.py
  27. +4 −1 { → src}/miditok/tokenizations/midi_like.py
  28. 0 { → src}/miditok/tokenizations/mmm.py
  29. 0 { → src}/miditok/tokenizations/mumidi.py
  30. 0 { → src}/miditok/tokenizations/octuple.py
  31. 0 { → src}/miditok/tokenizations/pertok.py
  32. +193 −120 { → src}/miditok/tokenizations/remi.py
  33. 0 { → src}/miditok/tokenizations/structured.py
  34. 0 { → src}/miditok/tokenizations/tsd.py
  35. 0 { → src}/miditok/tokenizer_training_iterator.py
  36. 0 { → src}/miditok/utils/__init__.py
  37. +4 −5 { → src}/miditok/utils/split.py
  38. 0 { → src}/miditok/utils/utils.py
  39. +3 −4 tests/test_attribute_controls.py
  40. +2 −3 tests/test_data_augmentation.py
  41. +2 −3 tests/test_hf_hub.py
  42. +1 −2 tests/test_io_formats.py
  43. +2 −3 tests/test_methods.py
  44. +1 −2 tests/test_preprocess.py
  45. +1 −2 tests/test_pytorch_data_loading.py
  46. +1 −2 tests/test_saving_loading_config.py
  47. +2 −3 tests/test_tokenize.py
  48. +0 −1 tests/test_toksequence.py
  49. +2 −3 tests/test_train.py
  50. +10 −11 tests/test_utils.py
  51. +10 −12 tests/utils_tests.py
2 changes: 1 addition & 1 deletion .github/codecov.yml
@@ -10,7 +10,7 @@ coverage:
  target: 70%
  source:
  paths:
- - "miditok/"
+ - "src/miditok/"
  target: 75%
  threshold: 0.5%
  patch:
2 changes: 1 addition & 1 deletion .github/workflows/close-stale-issues.yml
@@ -13,7 +13,7 @@ jobs:
  issues: write
  pull-requests: write
  steps:
- - uses: actions/stale@v9.0.0
+ - uses: actions/stale@v9.1.0
  with:
  days-before-issue-stale: 21
  days-before-issue-close: 7
14 changes: 10 additions & 4 deletions .github/workflows/pytest.yml
@@ -13,8 +13,8 @@ jobs:
  runs-on: ${{ matrix.os }}
  strategy:
  matrix:
- python-version: ["3.8", "3.9", "3.x"]
- os: [ ubuntu-latest, macos-latest ] # windows-latest is extremely slow
+ python-version: ["3.9", "3.10", "3.12"]
+ os: [ ubuntu-latest, macos-latest, windows-latest ]

  steps:
  - uses: actions/checkout@v4
@@ -32,13 +32,19 @@ jobs:
  python -m pip install --upgrade pip
  pip install -e ".[tests]"
+ # Tokenizer training tests are significantly slower than others.
+ # So that xdist don't assign chunks of training tests to the same worker, we use
+ # the `--dist worksteal` distribution mode to dynamically reassign queued tests to
+ # free workers.
  - name: Test with pytest
- run: pytest --cov=./ --cov-report=xml -n logical --durations=0 -v tests
+ run: python -m pytest --cov=./ --cov-report=xml -n logical --dist worksteal --durations=0 -v tests
  env:
  HF_TOKEN_HUB_TESTS: ${{ secrets.HF_TOKEN_HUB_TESTS }}

  - name: Codecov
- uses: codecov/codecov-action@v4.5.0
+ uses: codecov/codecov-action@v5.3.1
+ with:
+ token: ${{ secrets.CODECOV_TOKEN }}

  build:
  runs-on: ubuntu-latest
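
The comment added to this workflow explains the reason for `--dist worksteal`: tokenizer training tests are much slower than the others, so a fixed up-front split can pile them onto a single worker. The toy simulation below illustrates that effect under simplifying assumptions. It is not pytest-xdist's scheduler: the test durations, the contiguous-chunk split, and the helpers `static_makespan` and `dynamic_makespan` are all hypothetical, and dynamic assignment from a shared queue only approximates what work stealing achieves.

```python
import heapq


def static_makespan(durations, num_workers):
    """Split tests into contiguous chunks up front; return the busiest worker's total."""
    chunk = -(-len(durations) // num_workers)  # ceiling division
    loads = [
        sum(durations[i : i + chunk]) for i in range(0, len(durations), chunk)
    ]
    return max(loads)


def dynamic_makespan(durations, num_workers):
    """Hand the next queued test to whichever worker becomes free first."""
    finish_times = [0.0] * num_workers
    heapq.heapify(finish_times)
    for duration in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


# Hypothetical suite: four slow "training" tests listed together, then 12 fast tests.
durations = [30.0] * 4 + [1.0] * 12

print(static_makespan(durations, 4))   # 120.0
print(dynamic_makespan(durations, 4))  # 33.0
```

In this made-up suite the static split leaves one worker with all four slow tests, while dynamic assignment spreads them across the four workers.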
101 changes: 83 additions & 18 deletions .gitignore
@@ -1,20 +1,3 @@
- # macOS DS_STORE files
- **/*.DS_STORE
- # Python precompiled files
- *.pyc
- # Builds dir
- dist/
- # PyCharm config files
- .idea/
- # PyCharm Virtual Environment
- venv/
-
- # personal test file
- test.py
-
- # Dataset directory
- # data/*
-
  # Generated files in test
  tests/configs
  tests/Multitrack_tokens
@@ -131,8 +114,10 @@ ipython_config.py
  #pdm.lock
  # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
  # in version control.
- # https://pdm.fming.dev/#use-with-ide
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
  .pdm.toml
+ .pdm-python
+ .pdm-build/

  # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
  __pypackages__/
@@ -183,3 +168,83 @@ cython_debug/
  # and can be added to the global gitignore or merged into this file. For a more nuclear
  # option (not recommended) you can uncomment the following to ignore the entire idea folder.
  #.idea/
+ # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
+ # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
+
+ # User-specific stuff
+ .idea/**/workspace.xml
+ .idea/**/tasks.xml
+ .idea/**/usage.statistics.xml
+ .idea/**/dictionaries
+ .idea/**/shelf
+
+ # AWS User-specific
+ .idea/**/aws.xml
+
+ # Generated files
+ .idea/**/contentModel.xml
+
+ # Sensitive or high-churn files
+ .idea/**/dataSources/
+ .idea/**/dataSources.ids
+ .idea/**/dataSources.local.xml
+ .idea/**/sqlDataSources.xml
+ .idea/**/dynamic.xml
+ .idea/**/uiDesigner.xml
+ .idea/**/dbnavigator.xml
+
+ # Gradle
+ .idea/**/gradle.xml
+ .idea/**/libraries
+
+ # Gradle and Maven with auto-import
+ # When using Gradle or Maven with auto-import, you should exclude module files,
+ # since they will be recreated, and may cause churn. Uncomment if using
+ # auto-import.
+ # .idea/artifacts
+ # .idea/compiler.xml
+ # .idea/jarRepositories.xml
+ # .idea/modules.xml
+ # .idea/*.iml
+ # .idea/modules
+ # *.iml
+ # *.ipr
+
+ # CMake
+ cmake-build-*/
+
+ # Mongo Explorer plugin
+ .idea/**/mongoSettings.xml
+
+ # File-based project format
+ *.iws
+
+ # IntelliJ
+ out/
+
+ # mpeltonen/sbt-idea plugin
+ .idea_modules/
+
+ # JIRA plugin
+ atlassian-ide-plugin.xml
+
+ # Cursive Clojure plugin
+ .idea/replstate.xml
+
+ # SonarLint plugin
+ .idea/sonarlint/
+
+ # Crashlytics plugin (for Android Studio and IntelliJ)
+ com_crashlytics_export_strings.xml
+ crashlytics.properties
+ crashlytics-build.properties
+ fabric.properties
+
+ # Editor-based Rest Client
+ .idea/httpRequests
+
+ # Android studio 3.1+ serialized cache file
+ .idea/caches/build_file_checksums.ser
+
+ # Aider cache directory
+ .aider*
5 changes: 3 additions & 2 deletions README.md
@@ -5,7 +5,7 @@ Python package to tokenize music files, introduced at the ISMIR 2021 LBDs.
  ![MidiTok Logo](docs/assets/miditok_logo_stroke.png?raw=true "")

  [![PyPI version fury.io](https://badge.fury.io/py/miditok.svg)](https://pypi.python.org/pypi/miditok/)
- [![Python 3.8](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/release/)
+ [![Python 3.9](https://img.shields.io/badge/python-≥3.9-blue.svg)](https://www.python.org/downloads/release/)
  [![Documentation Status](https://readthedocs.org/projects/miditok/badge/?version=latest)](https://miditok.readthedocs.io/en/latest/?badge=latest)
  [![GitHub CI](https://github.com/Natooz/MidiTok/actions/workflows/pytest.yml/badge.svg)](https://github.com/Natooz/MidiTok/actions/workflows/pytest.yml)
  [![Codecov](https://img.shields.io/codecov/c/github/Natooz/MidiTok)](https://codecov.io/gh/Natooz/MidiTok)
@@ -45,7 +45,7 @@ tokens = tokenizer(midi) # calling the tokenizer will automatically detect MIDI
  converted_back_midi = tokenizer(tokens) # PyTorch, Tensorflow and Numpy tensors are supported
  ```

- Here is a complete yet concise example of how you can use MidiTok to train any PyTorch model. And [here](colab-notebooks/Full_Example_HuggingFace_GPT2_Transformer.ipynb) is a simple notebook example showing how to use Hugging Face models to generate music, with MidiTok taking care of tokenizing music files.
+ Here is a complete yet concise example of how you can use MidiTok to train any PyTorch model. And [here](colab-notebooks/Example_HuggingFace_Mistral_Transformer.ipynb) is a simple notebook example showing how to use Hugging Face models to generate music, with MidiTok taking care of tokenizing music files.

  ```python
  from miditok import REMI, TokenizerConfig
@@ -102,6 +102,7 @@ MidiTok implements the tokenizations: (links to original papers)
  * [Octuple](https://aclanthology.org/2021.findings-acl.70)
  * [MuMIDI](https://dl.acm.org/doi/10.1145/3394171.3413721)
  * [MMM](https://arxiv.org/abs/2008.06048)
+ * [PerTok](https://www.arxiv.org/abs/2410.02060)

  You can find short presentations in the [documentation](https://miditok.readthedocs.io/en/latest/tokenizations.html).

4 changes: 2 additions & 2 deletions benchmarks/miditok_preprocess_file/benchmark_preprocess.py
@@ -8,14 +8,14 @@
  from pathlib import Path
  from time import time

+ import miditok
  import numpy as np
+ from miditok.constants import SCORE_LOADING_EXCEPTION
  from pandas import DataFrame, read_csv
  from symusic import Score
  from tqdm import tqdm

- import miditok
  from benchmarks.utils import mean_std_str
- from miditok.constants import SCORE_LOADING_EXCEPTION

  TOKENIZER_CONFIG_KWARGS = {
  "use_tempos": True,
4 changes: 2 additions & 2 deletions benchmarks/miditok_tokenize/benchmark_tokenize.py
@@ -7,14 +7,14 @@
  from pathlib import Path
  from time import time

+ import miditok
  import numpy as np
+ from miditok.constants import SCORE_LOADING_EXCEPTION
  from pandas import DataFrame, read_csv
  from symusic import Score
  from tqdm import tqdm

- import miditok
  from benchmarks import mean_std_str
- from miditok.constants import SCORE_LOADING_EXCEPTION

  TOKENIZER_CONFIG_KWARGS = {
  "use_tempos": True,
4 changes: 2 additions & 2 deletions benchmarks/tokenizer_training/benchmark_training.py
Original file line number Diff line number Diff line change
@@ -9,14 +9,14 @@
  from time import time
  from typing import Literal

+ import miditok
  import numpy as np
+ from miditok.constants import SCORE_LOADING_EXCEPTION
  from pandas import DataFrame, read_csv # requires tabulate package
  from symusic import Score
  from tqdm import tqdm

- import miditok
  from benchmarks.utils import mean_std_str
- from miditok.constants import SCORE_LOADING_EXCEPTION

  # Tokenizer
  TOKENIZER_PARAMS = {
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -9,7 +9,7 @@
  import tomllib
  from pathlib import Path

- sys.path.insert(0, str(Path("..").resolve()))
+ sys.path.insert(0, str(Path("..").resolve() / "src"))

  # -- Project information -----------------------------------------------------
  # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
6 changes: 3 additions & 3 deletions docs/pytorch_data.rst
@@ -19,7 +19,7 @@ Preparing data

  When training a model, you will likely want to limit the possible token sequence length in order to not run out of memory. The dataset classes handle such case and can trim the token sequences. However, **it is not uncommon for a single MIDI to be tokenized into sequences that can contain several thousands tokens, depending on its duration and number of notes. In such case, using only the first portion of the token sequence would considerably reduce the amount of data used to train and test a model.**

- To handle such case, MidiTok provides the :py:func:`miditok.pytorch_data.split_midis_for_training` method to dynamically split MIDI files into chunks that should be tokenized in approximately the number of tokens you want.
+ To handle such case, MidiTok provides the :py:func:`miditok.pytorch_data.split_files_for_training` method to dynamically split MIDI files into chunks that should be tokenized in approximately the number of tokens you want.
  If you cannot fit most of your MIDIs into single usable token sequences, we recommend to split your dataset with this method.

  Data loading example
@@ -31,7 +31,7 @@ Here is a complete example showing how to use this module to train any model.
  .. code-block:: python
  from miditok import REMI, TokenizerConfig
- from miditok.pytorch_data import DatasetMIDI, DataCollator, split_midis_for_training
+ from miditok.pytorch_data import DatasetMIDI, DataCollator, split_files_for_training
  from torch.utils.data import DataLoader
  from pathlib import Path
@@ -48,7 +48,7 @@ Here is a complete example showing how to use this module to train any model.
  # Split MIDIs into smaller chunks for training
  dataset_chunks_dir = Path("path", "to", "midi_chunks")
- split_midis_for_training(
+ split_files_for_training(
  files_paths=midi_paths,
  tokenizer=tokenizer,
  save_dir=dataset_chunks_dir,
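
The documentation paragraph in this diff argues that keeping only the first portion of a long token sequence discards most of the data. The sketch below illustrates that point on a plain list of token ids. It is an illustration under stated assumptions, not how `split_files_for_training` works (per the documentation, that function splits the music files themselves); the helper `chunk_token_ids` and the numbers used are hypothetical.

```python
def chunk_token_ids(token_ids, max_seq_len):
    """Split a token id sequence into consecutive chunks of at most max_seq_len."""
    return [
        token_ids[i : i + max_seq_len]
        for i in range(0, len(token_ids), max_seq_len)
    ]


# Hypothetical sequence of 5,000 token ids obtained from one long file.
token_ids = list(range(5000))
max_seq_len = 1024

truncated = token_ids[:max_seq_len]  # keep only the first portion
chunks = chunk_token_ids(token_ids, max_seq_len)  # keep everything, in pieces

print(len(truncated) / len(token_ids))           # 0.2048
print(len(chunks), sum(len(c) for c in chunks))  # 5 5000
```

With these made-up numbers, truncation keeps about a fifth of the tokens, while chunking keeps all of them in five sequences that each fit the length limit.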
25 changes: 13 additions & 12 deletions pyproject.toml
@@ -4,11 +4,11 @@ build-backend = "hatchling.build"

  [project]
  name = "miditok"
- version = "3.0.4"
+ version = "3.0.5"
  description = "MIDI / symbolic music tokenizers for Deep Learning models."
  readme = {file = "README.md", content-type = "text/markdown"}
  license = {file = "LICENSE"}
- requires-python = ">=3.8.0"
+ requires-python = ">=3.9"
  authors = [
  { name = "Nathan Fradet" },
  ]
@@ -29,11 +29,11 @@ classifiers = [
  "License :: OSI Approved :: MIT License",
  "Programming Language :: Python",
  "Programming Language :: Python :: 3 :: Only",
- "Programming Language :: Python :: 3.8",
  "Programming Language :: Python :: 3.9",
  "Programming Language :: Python :: 3.10",
  "Programming Language :: Python :: 3.11",
  "Programming Language :: Python :: 3.12",
+ "Programming Language :: Python :: 3.13",
  "Operating System :: OS Independent",
  ]
  dependencies = [
@@ -65,13 +65,8 @@ Repository = "https://github.com/Natooz/MidiTok.git"
  Documentation = "https://miditok.readthedocs.io"
  Issues = "https://github.com/Natooz/MidiTok/issues"

- [tool.hatch.version]
- path = "miditok/__init__.py"
-
- [tool.hatch.build.targets.sdist]
- include = [
- "/miditok",
- ]
+ [tool.hatch.build.targets.wheel]
+ packages = ["src/treeval"]

  [mypy]
  warn_return_any = "True"
@@ -82,6 +77,12 @@ exclude = [
  ".venv",
  ]

+ [tool.pytest.ini_options]
+ pythonpath = "src"
+ addopts = [
+ "--import-mode=importlib",
+ ]
+
  [tool.coverage.report]
  exclude_also = [
  "def __repr__",
@@ -92,7 +93,7 @@ omit = [
  ]

  [tool.ruff]
- target-version = "py312"
+ target-version = "py313"

  [tool.ruff.lint]
  extend-select = [
@@ -211,7 +212,7 @@ ignore = [
  # we don't use passwords in MidiTok, only HF token for the interactions with the hub.
  # However we have a lot of variables with "token"(s) in their name, which would yield a
  # lot of lint errors or require a lot of noqa exceptions.
- "miditok/**" = [
+ "src/miditok/**" = [
  "S105",
  ]
  "tests/**" = [
File renamed without changes.
File renamed without changes.
File renamed without changes.
13 changes: 13 additions & 0 deletions miditok/classes.py → src/miditok/classes.py
@@ -832,6 +832,19 @@ def __init__(
  # Additional params
  self.additional_params = kwargs

+ # Using dataclass overly complicates all the checks performed after init and reduces
+ # the types flexibility (sequence etc...).
+ # Freezing the class could be done, but special cases (MMM) should be handled.
+ """def __setattr__(self, name, value):
+ if getattr(self, "_is_frozen", False) and name != "_is_frozen":
+ raise AttributeError(
+ f"Cannot modify frozen instance of {self.__class__.__name__}"
+ )
+ super().__setattr__(name, value)
+ def freeze(self):
+ object.__setattr__(self, "_is_frozen", True)"""
+
  @property
  def max_num_pos_per_beat(self) -> int:
  """
1 change: 1 addition & 0 deletions miditok/constants.py → src/miditok/constants.py
@@ -138,6 +138,7 @@
  # Tokenizers specific parameters
  MMM_COMPATIBLE_TOKENIZERS = {"TSD", "REMI", "MIDILike"}
  USE_BAR_END_TOKENS = False # REMI
+ ADD_TRAILING_BARS = False # REMI

  # Defaults values when writing new files
  TEMPO = 120
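
The commented-out block added to `src/miditok/classes.py` in this comparison outlines a way to freeze a configuration object once it has been initialized. A minimal standalone illustration of that pattern follows, with indentation restored. `FrozenConfig` is a hypothetical stand-in class, not `TokenizerConfig`, and the MMM special cases mentioned in the source comment are not handled here.

```python
class FrozenConfig:
    """Hypothetical config class that rejects attribute changes once frozen."""

    def __init__(self, **kwargs):
        # Attributes can be set freely until freeze() is called.
        self.additional_params = kwargs

    def __setattr__(self, name, value):
        # Block every write except the one that sets the freeze flag itself.
        if getattr(self, "_is_frozen", False) and name != "_is_frozen":
            raise AttributeError(
                f"Cannot modify frozen instance of {self.__class__.__name__}"
            )
        super().__setattr__(name, value)

    def freeze(self):
        object.__setattr__(self, "_is_frozen", True)


config = FrozenConfig(add_trailing_bars=True)
config.freeze()
try:
    config.additional_params = {}
except AttributeError as err:
    print(err)  # Cannot modify frozen instance of FrozenConfig
```

This guard only prevents rebinding attributes; a mutable value such as the `additional_params` dict can still be changed in place after freezing.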
File renamed without changes.