immorthon

An LLM-based definition generator

Dataset

The training data is a collection of word/definition pairs. Several word corpora have been tried.

The dataset is a CSV file with the following columns:

word: the word
definition: the definition of the word
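
To make the two-column layout concrete, here is a minimal sketch of how one word/definition pair could be serialized as a CSV row. The helper names `csv_escape` and `csv_row` are hypothetical and not taken from main.cpp, and the quoting rule (RFC 4180 style, where fields containing commas, quotes, or newlines are wrapped in double quotes) is an assumption, since the exact escaping used by the scraper is not described here.

```cpp
#include <string>

// Hypothetical helper: quote a field only when it contains a character
// that would otherwise break the CSV row (comma, quote, line break).
std::string csv_escape(const std::string& field) {
    bool needs_quotes = field.find_first_of(",\"\n\r") != std::string::npos;
    if (!needs_quotes) {
        return field;
    }
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') {
            out += '"';  // an inner quote is written as two quotes
        }
        out += c;
    }
    out += '"';
    return out;
}

// Hypothetical helper: build one "word,definition" row, newline-terminated.
std::string csv_row(const std::string& word, const std::string& definition) {
    return csv_escape(word) + "," + csv_escape(definition) + "\n";
}
```

Definitions often contain commas, so some form of quoting is likely needed for the file to stay parseable as two columns.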

The definitions are scraped from the Oxford Learner's Dictionaries.

Scraping dataset

The scraping script is located in the main.cpp file, which can be compiled and run with the following command:

make && ./main <corpus_file> <output_file>

where <corpus_file> is the path to a file containing a list of words and <output_file> is the path of the file to create, which will contain the scraped definitions.

If a given word is not available in the dictionary, no definition is written to <output_file> for it, and a message is printed to the console.

Parallel Computing

The code uses OpenMP to parallelize the scraping process. You can disable parallel computing by building the code with

make nooomp

This is significantly slower, but can be useful if a word causes you trouble and you need to debug it.
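
As a rough picture of how an OpenMP-parallel scraping loop can be structured, here is a minimal sketch. It is not the loop from main.cpp: `process_word` is a hypothetical stand-in for the per-word scraping work, and `process_all` is an illustrative name. With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` flag; without it, the pragma is ignored and the same loop runs serially, which is one common way to provide a non-parallel build.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for fetching and parsing one word's definition.
std::string process_word(const std::string& word) {
    return word + ": <definition>";
}

std::vector<std::string> process_all(const std::vector<std::string>& words) {
    std::vector<std::string> results(words.size());
    const long n = static_cast<long>(words.size());
    // Each iteration writes only to its own slot, so no locking is needed
    // and the output order matches the input order.
    // schedule(dynamic) hands out iterations as threads become free, which
    // helps when some words take longer to fetch than others.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        results[i] = process_word(words[i]);
    }
    return results;
}
```

Running the serial build makes console messages appear in corpus order, which is what makes a problematic word easier to track down.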
