Skip to content

A tool for automatic English to Katakana conversion

License

Notifications You must be signed in to change notification settings

Patchethium/e2k

Repository files navigation

English to Katakana Translator

PyPI version

e2k is a Python library that translates English to Katakana. It's based on a RNN model trained on a dictionary extracted from Wikitionary and JMdict / EDICT. It only requires numpy as a dependency.

We also provide a English to Katakana dictionary in the releases (not available in the PyPI package).

Usage

e2k is available on PyPI.

pip install e2k

Usage:

2 types of models are provided, one converts phoneme to Katakana and one that converts character to Katakana. Choose the one that fits your use case.

from e2k import P2K, C2K
from g2p_en import G2p # any g2p library with CMUdict will work

# cmudict phoneme to katakana
p2k = P2K()

g2p = G2p()

word = "word"

word = word.lower()

# you can check the accepted symbols using
in_table = c2k.in_table # `c2k` accepts lowercase characters, space and apostrophe
in_table = p2k.in_table # `p2k` accepts the phonemes from the CMUdict and space
# for output symbols
out_table = c2k.out_table
out_table = p2k.out_table

katakana = p2k(g2p(word))

print(katakana) # "ワード"

# characters directly to katakana
c2k = C2K()

katakana = c2k(word)

print(katakana) # "ワード"

# decode strategy
# greedy by default, top_k and top_p are available
# I tried implementing beam search but it barely differs from greedy
c2k.set_decode_strategy("top_k", top_k=2) # low quality results, not recommended
c2k.set_decode_strategy("top_p", top_p=0.6, t=0.8)
# for further information, see
# https://huggingface.co/docs/transformers/en/generation_strategies
print(katakana) # "ワード"

Warning

For any symbols not in the in_table, the model will ignore them and may produce unexpected results.

Performance

The BLEU score is calculated on a random subset with size of 10% of the dataset.

Model BLEU Score ↑
P2K 0.89
C2K 0.92

Dictionary

We train the model on a dictionary extracted from Wikitionary and JMdict / EDICT. The dictionary contains 30k entries, you can also find it in the releases.

Note

The dictionary is not included in the PyPI package. Either download it from the releases or create it yourself following the instructions below.

Dependencies

The extraction script has zero dependencies, as long as you have a Python 3 interpreter it should work.

However, it's not included in the PyPI package, you need to clone this repository to use it.

git clone https://github.com/Patchethium/e2k.git

Download data

Wikitionary

Download the raw dump of the Japanese Wikitionary from https://kaikki.org/dictionary/rawdata.html, they kindly provide the parsed data in a JSONL format.

Look for the Japanese ja-extract.jsonl.gz (compressed 37.5MB) entry and download it. If you prefer command line, use

curl -O https://kaikki.org/dictionary/downloads/ja/ja-extract.jsonl.gz

JMdict / EDICT

Download the JMdict and EDICT from https://www.edrdg.org/wiki/index.php/JMdict-EDICT_Dictionary_Project.

Look for the edict2.gz and download it. Or in command line:

curl -O http://ftp.edrdg.org/pub/Nihongo/edict2.gz

Extract both files into /vendor folder.

On Linux, you can use

gzip -d ja-extract.jsonl.gz
gzip -d edict2.gz

Run the extraction

python extract.py
# if you have another name for the file
python extract.py --path /path/to/your_file.jsonl

By default, a katakana_dict.jsonl file will be created in the vendor folder.

Development

Install the dependencies

I use uv to manage the dependencies and publish the package.

uv sync

Then activate the virtual environment with source .venv/bin/activate or add uv run before the commands.

Benchmark

The scores in Performance are obtained using the eval.py script.

# --p2k for phoneme to katakana, if not provided, it will be character to katakana
python eval.py --data ./vendor/katakana_dict.jsonl --model /path/to/your/model.pth --p2k

Train

After installing the dependencies, torch will be added as a development dependency. You can train the model using

python train.py --data ./vendor/katakana_dict.jsonl

It takes around 10 minutes on a desktop CPU. The model will be saved as vendor/model-{p2k/c2k}-e{epoch}.pth.

Also, you'll need to either download the katakana_dict.jsonl from the releases or create it yourself using the extract.py script.

CUDA

Although uv claims to be able to control torch version by extras, it doesn't work out in practice. For now you need to manually edit the pyproject.toml file for CUDA support.

Change

[tool.uv.sources]
torch = [{ index = "pytorch-cpu" }]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

To:

[tool.uv.sources]
torch = [{ index = "pytorch-cu121" }]

[[tool.uv.index]]
name = "pytorch-cu121"
url = "https://download.pytorch.org/whl/cu121"
explicit = true

And run uv lock && uv sync to update the dependencies.

Be noted not to commit the changes to the pyproject.toml and uv.lock files if you made the change.

Note

Yes it sucks, especially when you need to add other dependencies. I'm always looking for a better solution, PRs are welcome.

Export

The model should be exported to numpy format for production use.

# --p2k for phoneme to katakana, if not provided, it will be character to katakana
# --fp32 for double precision, by default we use fp16 to save space
# --output to specify the output file, in this project it's `model-{p2k/c2k}.npz`
# --safetenors to use safe tensors, it's for easier binding in some languages
python export.py --model /path/to/your/model.pth --p2k --output /path/to/your/model.npz

Note

The pretrained weights are not included in the Git registry, you can find them in the releases.

License

  • The code is released under WTFPL, you can do WTF you want with it.
  • The dictionary follows the Wikimedia's license and the JMdict / EDICT's Copyright license.
    • In short, they both fall into CC-BY-SA.
    • The model weights are trained using the dictionary. I am not a lawyer, whether the machine learning weights is considered as a derivative work is up to you.

Credits

About

A tool for automatic English to Katakana conversion

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages