An LLM-based definition generator
The training data is a collection of word/definition pairs. Several word corpora have been tried.
The dataset is a CSV file with the following columns:
word
: the word
definition
: the definition of the word
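As a rough illustration of this layout, the sketch below builds one word/definition row. The helper names `csv_escape` and `csv_row` are hypothetical and are not taken from main.cpp. The quoting rule (RFC 4180 style double quotes) is also an assumption, since nothing above says how commas inside a definition are handled.

```cpp
#include <string>

// Hypothetical helper: quote a field if it contains a comma, a double quote,
// or a line break (RFC 4180 style). The escaping actually used by the project
// is not specified, so treat this as an assumption.
std::string csv_escape(const std::string& field) {
    if (field.find_first_of(",\"\n\r") == std::string::npos) {
        return field;  // nothing to escape
    }
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') {
            out += "\"\"";  // embedded quotes are doubled
        } else {
            out += c;
        }
    }
    out += '"';
    return out;
}

// Hypothetical helper: build one "word,definition" row, newline included.
std::string csv_row(const std::string& word, const std::string& definition) {
    return csv_escape(word) + "," + csv_escape(definition) + "\n";
}
```

Quoting matters here because dictionary definitions frequently contain commas, which would otherwise be read as extra columns.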
The definitions are scraped from the Oxford Learner's Dictionaries.
The scraping script is located in the main.cpp
file, which can be compiled and run with the following command:
make && ./main <corpus_file> <output_file>
where <corpus_file>
is the path to a file containing a list of words and
<output_file>
is the path to the output file, which will contain the scraped definitions.
If a given word is not available in the dictionary, its definition will not be added to <output_file>
and a message will be printed to the console.
The code uses OpenMP to parallelize the scraping process. You can disable parallel computing by building the code with
make nooomp
This is significantly slower, but can be useful if a word causes you trouble and you need to debug it.
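For readers unfamiliar with OpenMP, here is a minimal sketch of how a per-word loop can be parallelized. It is not the code from main.cpp: `fetch_definition` is a hypothetical stand-in for the real network request, and `fetch_all` is an illustrative name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the real scraping request.
std::string fetch_definition(const std::string& word) {
    return "definition of " + word;
}

// Fetches every word's definition. Each iteration writes only to its own
// slot in `results`, so no locking is needed.
std::vector<std::string> fetch_all(const std::vector<std::string>& words) {
    std::vector<std::string> results(words.size());
    const long n = static_cast<long>(words.size());
    // When OpenMP is not enabled at compile time, this pragma is ignored
    // and the loop runs sequentially.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        const auto idx = static_cast<std::size_t>(i);
        results[idx] = fetch_definition(words[idx]);
    }
    return results;
}
```

With GCC or Clang, the loop runs in parallel when the file is compiled with `-fopenmp` and sequentially without it. Whether the project's Makefile toggles parallelism this way is an assumption. A sequential run processes words in input order, which makes it easier to tell which word triggered a failure.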