DiMo. Distributional Models Evaluation

DiMo is a collection of scripts for my bachelor's thesis, "Comparison and Evaluation of Models for Distributional Semantics".

See the notebooks on the thesis website for examples of how these scripts can be used:

nlp.fi.muni.cz/projekty/dimo

Note: parts of the code require Sketch Engine's internal packages manatee and wmap.

Other required packages are:

numpy
scipy
gensim
sklearn

The code runs on Python 2.7.

Models

Sketch Engine Thesaurus (SkEThes)

Unlike the original implementation, the one in this project operates directly on a co-occurrence matrix.

If you have a corpus with compiled word sketches (say it is called bnc2), use the wm2thes.py script to create such a matrix:

python wm2thes.py bnc2 bnc2-matrix

This creates 4 files representing a sparse word x (relation, word) matrix:

- bnc2-matrix-target2i.pickle  # dictionary: words to indices
- bnc2-matrix-rows.npy         # row indices
- bnc2-matrix-cols.npy         # col indices
- bnc2-matrix-vals.npy         # values
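
The three .npy files appear to store the matrix in coordinate form: entry k sits at (rows[k], cols[k]) and has the value vals[k]. The sketch below shows how such triples can be regrouped into one sparse row per word. It uses plain Python lists and a made-up two-word vocabulary in place of the real files; the helper name triples_to_rows and the toy data are assumptions for illustration, not part of the project.

```python
from collections import defaultdict


def triples_to_rows(rows, cols, vals):
    """Group coordinate triples into {row_index: {col_index: value}}."""
    matrix = defaultdict(dict)
    for r, c, v in zip(rows, cols, vals):
        # Sum duplicate (row, col) pairs, the usual coordinate-format convention.
        matrix[r][c] = matrix[r].get(c, 0.0) + v
    return dict(matrix)


# Hypothetical toy data standing in for the contents of the four files.
target2i = {"cat": 0, "dog": 1}
rows = [0, 0, 1, 1]
cols = [0, 2, 0, 1]
vals = [2.0, 1.0, 3.0, 4.0]

matrix = triples_to_rows(rows, cols, vals)
cat_row = matrix[target2i["cat"]]  # sparse context vector of "cat"
```

With the real files, the three lists would come from numpy.load and the dictionary from pickle.load.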

With the matrix in place, choose which similarity measure to use:

from models import SkEThesSKE, SkEThesCOS

model_ske = SkEThesSKE("bnc2-matrix")
model_cos = SkEThesCOS("bnc2-matrix")
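
Judging by its name, SkEThesCOS presumably ranks words by cosine similarity of their matrix rows. As a rough illustration of that measure only (not the project's code), here is cosine similarity over sparse rows stored as {column: value} dictionaries; the function name and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {column: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(val * v[col] for col, val in u.items() if col in v)
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical rows: "dog" is a scaled copy of "cat", "fish" shares no column.
cat = {0: 1.0, 1: 2.0}
dog = {0: 2.0, 1: 4.0}
fish = {2: 5.0}
```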

Both models provide methods such as similarity, similarities and most_similar, as well as eval_analogy, which evaluates a model on a dataset of analogy queries.

There is also a wrapper for the original implementation in oskethes.py. Its interface is a bit different, because the original thesaurus is just a collection of word similarities: the co-occurrence matrix is not available and similarities below 0.05 are not stored.

Word-Word Co-Occurrence Matrix

If you have a corpus in a text file (one sentence per line), you can create a similar model with linear contexts (a weighted symmetric context window):

python coocs.py plain-bnc.txt plain-bnc-matrix 20 5

20 is the minimum word frequency
5 is the context window size
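
To make the two parameters concrete, here is a minimal sketch of window-based counting. It is not the code of coocs.py: the 1/distance weighting and the choice to drop rare words before applying the window are assumptions, and count_cooccurrences is a hypothetical name.

```python
from collections import Counter


def count_cooccurrences(sentences, min_freq, window):
    """Count weighted co-occurrences in a symmetric window around each word."""
    freq = Counter(word for sent in sentences for word in sent)
    counts = Counter()
    for sent in sentences:
        # Assumption: rare words are removed before the window is applied.
        words = [w for w in sent if freq[w] >= min_freq]
        for i, target in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Assumed weighting: closer context words count more.
                    counts[(target, words[j])] += 1.0 / abs(i - j)
    return counts


# One sentence per line, as in the input file format described above.
sentences = [line.split() for line in ["a b c", "a b"]]
counts = count_cooccurrences(sentences, min_freq=2, window=2)
```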

The matrix contains raw co-occurrence counts, so you may want to apply a weighting:

from models import SkEThesSKE
from weightings import ppmi

model_ske = SkEThesSKE("plain-bnc-matrix", weighting=ppmi)
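
The name ppmi presumably refers to positive pointwise mutual information, which replaces each count with max(0, log(P(w, c) / (P(w) P(c)))). A minimal sketch of that formula on a dictionary of counts follows; ppmi_sketch and the toy counts are hypothetical, and the project's own ppmi may differ in details such as smoothing.

```python
import math
from collections import Counter


def ppmi_sketch(counts):
    """Turn {(word, context): count} into positive PMI weights."""
    total = float(sum(counts.values()))
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    weighted = {}
    for (word, context), n in counts.items():
        # PMI = log( P(w, c) / (P(w) * P(c)) ), written with raw counts.
        pmi = math.log(n * total / (word_totals[word] * context_totals[context]))
        if pmi > 0.0:
            weighted[(word, context)] = pmi  # non-positive values are dropped
    return weighted


# Hypothetical counts: "the" is frequent everywhere, "purr" is specific to "cat".
toy_counts = {("cat", "purr"): 4, ("cat", "the"): 4, ("dog", "the"): 8}
weights = ppmi_sketch(toy_counts)
```

In this toy example the uninformative pair ("cat", "the") is dropped, while ("cat", "purr") keeps a positive weight.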

Word2Vec

For Word2Vec models, this project wraps the gensim package. Any model that you can open with:

from gensim.models import Word2Vec

model = Word2Vec.load(model_name)

... you can also open with:

from models import Word2Vec

model = Word2Vec(model_name)

The interface and the evaluation script are the same as for the SkEThesXXX models.

Evaluation

evaluation = model.eval_analogy(dataset)

A dataset is a dictionary that maps each category to a list of queries. Each query is a tuple of three words followed by a set of accepted answers, for example:

("paris", "france", "london", {"england", "britain", "uk"})
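
As a rough picture of how such a dataset is scored, the sketch below computes per-category accuracy for any function that answers a query. It is not the project's eval_analogy: accuracy_by_category, the second query and the lookup table that stands in for a model are all hypothetical.

```python
def accuracy_by_category(dataset, answer):
    """Return {category: fraction of queries whose answer is in the gold set}."""
    result = {}
    for category, queries in dataset.items():
        hits = sum(1 for a, b, c, gold in queries if answer(a, b, c) in gold)
        result[category] = hits / float(len(queries)) if queries else 0.0
    return result


dataset = {
    "capitals": [
        ("paris", "france", "london", {"england", "britain", "uk"}),
        ("paris", "france", "berlin", {"germany"}),
    ],
}

# Hypothetical stand-in for a model: a fixed lookup keyed by the third word.
lookup = {"london": "uk", "berlin": "austria"}
scores = accuracy_by_category(dataset, lambda a, b, c: lookup[c])
```

Because the accepted answers form a set, any one of several valid answers counts as a hit.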

You may configure the evaluation in various ways:

from formulas import mul

my_mul = lambda a, b, aa: mul(a, b, aa, coeff=0.05)
evaluation = model.eval_analogy(dataset, topn=5, exclusion_trick=False, formula=my_mul)
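
The parameters topn and exclusion_trick are not explained here. In analogy evaluation, the "exclusion trick" usually means removing the three query words from the candidate answers, and topn is the number of top-ranked candidates considered. The sketch below illustrates that reading with hypothetical scores; it is an assumption about the parameters' meaning, not the project's implementation.

```python
def top_candidates(scores, query_words, topn=1, exclusion_trick=True):
    """Rank candidates by score, optionally dropping the query words."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if exclusion_trick:
        ranked = [w for w in ranked if w not in query_words]
    return ranked[:topn]


# Hypothetical candidate scores for the query paris : france :: london : ?
scores = {"london": 0.9, "england": 0.8, "france": 0.7, "uk": 0.6}
query = ("paris", "france", "london")

with_trick = top_candidates(scores, query, topn=2)
without_trick = top_candidates(scores, query, topn=2, exclusion_trick=False)
```

Without the trick, a query word such as "london" can occupy the top position and push correct answers down the list.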

And see the results:

evaluation[category]["acc"]  # 0.0--1.0
evaluation[category]["acc_top1"]  # 0.0--1.0
evaluation[category]["oov"]  # number of queries containing an out-of-vocabulary (OOV) word
evaluation[category]["oovs"]  # set of OOV words
evaluation[category]["queries"]  # list of queries and their candidate answers (excluding queries with OOV words)
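
For a quick overview across categories, the keys above can be formatted into one line per category. The sketch assumes only the keys listed above; summarize and the sample values are hypothetical and do not come from a real run.

```python
def summarize(evaluation):
    """Return one formatted line per category, sorted by category name."""
    lines = []
    for category in sorted(evaluation):
        result = evaluation[category]
        lines.append("%-12s acc=%.3f  top1=%.3f  oov=%d" % (
            category, result["acc"], result["acc_top1"], result["oov"]))
    return lines


# Hypothetical result in the shape described above (values are made up).
sample = {
    "capitals": {"acc": 0.5, "acc_top1": 0.25, "oov": 1,
                 "oovs": {"ouagadougou"}, "queries": []},
}
summary = summarize(sample)
```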
