Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. Its primary purpose is to facilitate the reproduction of our experiments on neural machine translation with subword units (see PUBLICATIONS below for the reference).

USAGE INSTRUCTIONS

Check the individual files for usage instructions.

To apply byte pair encoding (BPE) to word segmentation, first learn the merge operations from the training data, then apply them to the text to be segmented:

./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
./apply_bpe.py -c {codes_file} < {test_file}
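
For readers who want to see what learning the merge operations involves, the following is a minimal sketch of the BPE learning loop in the spirit of the toy algorithm from the paper cited under PUBLICATIONS. It is not the code of learn_bpe.py: the function names, the `</w>` end-of-word marker and the toy vocabulary are assumptions made for illustration, and the real script may differ in tie-breaking, file format and efficiency.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the given symbol pair into one symbol."""
    # Match the pair only at symbol boundaries (whitespace or string edge).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(vocab, num_operations):
    """Return the list of merges, most frequent pair first at each step."""
    merges = []
    for _ in range(num_operations):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Hypothetical toy vocabulary: words split into characters, with counts.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges = learn_bpe(toy_vocab, 3)
print(merges)
```

In this sketch, `num_operations` plays the role of the `-s {num_operations}` argument above: each iteration adds one merged symbol to the inventory, so the number of operations controls the size of the subword vocabulary.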

To segment rare words into character n-grams, first extract a vocabulary from the training data, then segment the test data:

./get_vocab.py < {train_file} > {vocab_file}
./segment-char-ngrams.py --vocab {vocab_file} -n {order} --shortlist {size} < {test_file}
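
The idea behind this segmentation can be sketched as follows. This is an illustration under stated assumptions, not the code of segment-char-ngrams.py: it assumes the shortlist holds the `size` most frequent training words, that shortlisted words are left intact, and that every other word is split into consecutive chunks of `n` characters, with "@@" marking each non-final chunk. The function names are hypothetical.

```python
import collections


def build_shortlist(training_lines, size):
    """Return the `size` most frequent whitespace-separated words."""
    counts = collections.Counter()
    for line in training_lines:
        counts.update(line.split())
    return {word for word, _ in counts.most_common(size)}


def segment_word(word, n, shortlist, separator="@@"):
    """Keep shortlisted or short words whole; split the rest into n-grams."""
    if word in shortlist or len(word) <= n:
        return [word]
    pieces = [word[i:i + n] for i in range(0, len(word), n)]
    # Mark every piece except the last so the split can be undone later.
    return [piece + separator for piece in pieces[:-1]] + [pieces[-1]]


def segment_line(line, n, shortlist):
    """Segment each word of a line and rejoin with single spaces."""
    out = []
    for word in line.split():
        out.extend(segment_word(word, n, shortlist))
    return " ".join(out)


# Hypothetical toy data: "the" is frequent, "unbelievable" is rare.
training = ["the cat", "the dog", "the unbelievable"]
shortlist = build_shortlist(training, 1)
segmented = segment_line("the unbelievable", 2, shortlist)
print(segmented)
```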

The original segmentation can be restored with a simple replacement:

sed "s/@@ //g"
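
When post-processing inside a Python pipeline instead of the shell, the same replacement can be written as a one-line helper. The function name `restore` is an assumption for illustration. Like the sed command, it removes every occurrence of the separator followed by a space.

```python
def restore(line, separator="@@"):
    """Undo subword segmentation by deleting each 'separator + space'."""
    return line.replace(separator + " ", "")


restored = restore("the un@@ be@@ li@@ ev@@ ab@@ le")
print(restored)
```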

PUBLICATIONS

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
