`stopes`: A library for preparing data for machine translation research

As part of the FAIR No Language Left Behind (NLLB) (Paper, Website, Blog) project to drive inclusion through machine translation, a large amount of data was processed to create training data. We provide the libraries and tools we used to:

create clean monolingual data from web data
mine bitext
easily write scalable pipelines for processing data for machine translation

Full documentation on https://facebookresearch.github.io/stopes

Examples

Check out the demo directory for example usage with the data from the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages.

Requirements

stopes relies on:

submitit to schedule jobs when run on clusters
hydra-core version >= 1.2.0 for configuration
fairseq to use LASER encoders
PyTorch version >= 1.5.0
Python version >= 3.8
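
To see whether an existing environment already meets these minimums, a small check along the following lines can help. It is a minimal sketch, not part of stopes: the helper names are hypothetical, the distribution names (`hydra-core`, and `torch` for PyTorch) are the PyPI names assumed here, and the version parsing is deliberately simplistic.

```python
import sys
from importlib import metadata

# Minimum versions as listed in the requirements above; None means no
# minimum is stated there. Keys are assumed PyPI distribution names
# (PyTorch is published as "torch").
MINIMUMS = {
    "submitit": None,
    "hydra-core": (1, 2, 0),
    "fairseq": None,
    "torch": (1, 5, 0),
}


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(minimum, installed):
    """Return 'ok', 'missing', or 'too old' for one requirement."""
    if installed is None:
        return "missing"
    if minimum is None:
        return "ok"
    parsed = parse_version(installed)
    # Pad with zeros so that '1.2' compares equal to (1, 2, 0).
    parsed += (0,) * (len(minimum) - len(parsed))
    return "ok" if parsed >= minimum else "too old"


def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report():
    """Return (name, status) pairs for Python and each listed package."""
    rows = [("python", "ok" if sys.version_info >= (3, 8) else "too old")]
    for name, minimum in MINIMUMS.items():
        rows.append((name, check(minimum, installed_version(name))))
    return rows


if __name__ == "__main__":
    for name, status in report():
        print(f"{name:12s} {status}")
```

Packages reported as missing are expected before the installation steps below have been run.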

Installing stopes

stopes uses flit to manage its setup, so you will need a recent version of pip for the install to work. We recommend that you first upgrade pip: python -m pip install --upgrade pip

The mining pipeline relies on fairseq to run LASER encoders. Because of competing dependency versions, you will first have to install fairseq separately with pip:

pip install fairseq==0.12.1

You can then install stopes with pip:

cd ..
git clone https://github.com/facebookresearch/stopes.git
cd stopes
pip install -e '.[dev,mono,mining]'

You can choose what to install. If you are only interested in mining, you do not need to install dev and mono. If you are interested in the distillation pipeline, you will need to install at least mono. The mining extra installs the CPU version of the mining dependencies; if you want to mine on GPU and your system is compatible, you can install [mining,mining-gpu].

fairseq and stopes currently require different versions of hydra, so pip might output some warnings. Do not worry about them: we want hydra>=1.1.

If you plan to train many NMT models, you will also want to set up apex for faster training.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

How `stopes` works

stopes is made of a few different parts:

core provides a library to write readable pipelines
modules provides a set of modules using the core library and implementing common steps in our mining and evaluation pipelines
pipelines provides implementations of the data pipelines we use in NLLB:

monolingual to preprocess and clean single language data
bitext to run the "global mining" pipeline and extract aligned sentences from two monolingual datasets (inspired by CCMatrix)
distillation to run our sequence-level knowledge distillation pipeline, which trains a small student model from a large pre-trained teacher model (approach based on https://arxiv.org/abs/1606.07947)
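
The layering described above (a small core that runs steps, reusable modules, and pipelines that compose them) can be pictured with a toy example. This is only an illustration of the idea in plain Python: none of the names below come from stopes, and the real library's interfaces are described in the full documentation.

```python
from typing import Callable, Iterable, Iterator, List

# A "step" here is just a function from lines to lines. It stands in for a
# reusable module; it is NOT the stopes module interface.
Step = Callable[[Iterable[str]], Iterable[str]]


def strip_whitespace(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical module: trim leading and trailing whitespace."""
    for line in lines:
        yield line.strip()


def drop_empty(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical module: discard empty lines."""
    for line in lines:
        if line:
            yield line


def deduplicate(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical module: keep only the first copy of each line."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def run_pipeline(steps: List[Step], lines: Iterable[str]) -> List[str]:
    """Hypothetical 'core': feed the output of each step into the next."""
    for step in steps:
        lines = step(lines)
    return list(lines)


# Hypothetical 'pipeline': an ordered composition of modules.
MONOLINGUAL_CLEANING: List[Step] = [strip_whitespace, drop_empty, deduplicate]

raw = ["  Hello world. ", "", "Hello world.", "Bonjour."]
cleaned = run_pipeline(MONOLINGUAL_CLEANING, raw)
print(cleaned)  # ['Hello world.', 'Bonjour.']
```

Keeping each step independent of the others is what lets the same module be reused across several pipelines.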

Full documentation: see https://facebookresearch.github.io/stopes or the website/docs folder.

Contributing

See the CONTRIBUTING file for how to help out.

Contributors

(in alphabetical order)

Citation

If you use stopes in your work, or any models, datasets, or artifacts published in NLLB, please cite:

@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Mejia-Gonzalez, Gabriel and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
  year={2022}
}

License

stopes is MIT licensed, as found in the LICENSE file.
