nimcho/dimo

DiMo. Distributional Models Evaluation

DiMo is a collection of scripts for my bachelor's thesis, "Comparison and Evaluation of Models for Distributional Semantics".

Take a look at the notebooks on the thesis's official website to see how these scripts can be used.

Note: parts of the code require Sketch Engine's internal packages manatee and wmap.

Other required packages are:

  • numpy
  • scipy
  • gensim
  • sklearn

The code runs on Python 2.7.

Models

Sketch Engine Thesaurus (SkEThes)

Unlike the original implementation, the one in this project operates directly on a co-occurrence matrix.

If you have a corpus with compiled word sketches (let's say it is called bnc2), use the wm2thes.py script to create such a matrix:

python wm2thes.py bnc2 bnc2-matrix

This creates four files that together represent a sparse word × (relation, word) matrix:

- bnc2-matrix-target2i.pickle  # dictionary: words to indices
- bnc2-matrix-rows.npy         # row indices
- bnc2-matrix-cols.npy         # col indices
- bnc2-matrix-vals.npy         # values
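
To inspect such a matrix outside the project's model classes, the four files can be combined into a SciPy sparse matrix. The sketch below assumes that the rows, cols and vals arrays are parallel and in coordinate (COO) format, which the file names suggest but the project does not state here. The helper name load_cooc_matrix is hypothetical and not part of DiMo.

```python
import pickle

import numpy as np
from scipy.sparse import coo_matrix


def load_cooc_matrix(prefix):
    """Load the four files written under `prefix` into a CSR matrix.

    Assumption: rows/cols/vals are parallel arrays in COO format.
    """
    with open(prefix + "-target2i.pickle", "rb") as f:
        # A pickle written by Python 2 may need
        # pickle.load(f, encoding="latin1") when read from Python 3.
        target2i = pickle.load(f)
    rows = np.load(prefix + "-rows.npy")
    cols = np.load(prefix + "-cols.npy")
    vals = np.load(prefix + "-vals.npy")

    # Size the matrix so that every indexed word gets a row,
    # even if it has no stored values.
    max_row = int(rows.max()) if rows.size else -1
    max_word = max(target2i.values()) if target2i else -1
    n_rows = max(max_row, max_word) + 1
    n_cols = int(cols.max()) + 1 if cols.size else 0

    matrix = coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))
    return target2i, matrix.tocsr()
```

With the example above, `target2i, matrix = load_cooc_matrix("bnc2-matrix")` would give a dictionary for looking up a word's row and a matrix that supports fast row slicing.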

Now that you have the matrix, you may decide which similarity measure to use.

from models import SkEThesSKE, SkEThesCOS

model_ske = SkEThesSKE("bnc2-matrix")
model_cos = SkEThesCOS("bnc2-matrix")

Both models expose methods such as similarity, similarities and most_similar, as well as eval_analogy, which evaluates a model on datasets of analogy queries.
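
For intuition, here is a minimal sketch of what a cosine-based most_similar could compute on a sparse co-occurrence matrix. It is not the code behind SkEThesCOS, whose implementation may differ. The function names and signatures are hypothetical, and target2i is assumed to map words to row indices as described above.

```python
import numpy as np
from scipy.sparse import csr_matrix


def cosine_similarities(matrix, row):
    """Cosine similarity between one row of a sparse matrix and every row."""
    matrix = csr_matrix(matrix, dtype=np.float64)
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    dots = np.asarray(matrix.dot(matrix[row].T).todense()).ravel()
    return dots / (norms * norms[row])


def most_similar(matrix, target2i, word, topn=10):
    """Return the `topn` (word, similarity) pairs closest to `word`."""
    i2target = {i: w for w, i in target2i.items()}
    query = target2i[word]
    sims = cosine_similarities(matrix, query)
    order = np.argsort(-sims)  # highest similarity first
    ranked = [(i2target[int(i)], float(sims[i])) for i in order if int(i) != query]
    return ranked[:topn]
```

The query word itself is dropped from the ranking, since it would always come first with similarity 1.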

There is also a wrapper for the original implementation in oskethes.py, but its interface is a bit different: it is just a collection of word similarities, so the co-occurrence matrix is not available and similarities below 0.05 are not included.

Word-Word Co-Occurrence Matrix

If you have a corpus in a plain-text file (one sentence per line), you may create a similar model with linear contexts (a weighted symmetric context window):

python coocs.py plain-bnc.txt plain-bnc-matrix 20 5
  • 20 is the minimum word frequency
  • 5 is the context window size
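
The idea of counting within a weighted symmetric window can be sketched in a few lines. This is an illustration only, not the code in coocs.py. It assumes whitespace tokenization and a 1/distance weight for each pair, and it leaves rare words in place when measuring distances. None of these choices is confirmed by the project, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def count_cooccurrences(sentences, min_freq=1, window=5):
    """Count weighted co-occurrences within a symmetric context window.

    Assumption: a pair at distance d contributes 1/d to the count.
    """
    tokenized = [s.split() for s in sentences]
    freq = Counter(w for sent in tokenized for w in sent)
    vocab = {w for w, c in freq.items() if c >= min_freq}

    counts = defaultdict(float)
    for sent in tokenized:
        for i, target in enumerate(sent):
            if target not in vocab:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                # Skip the target itself and words below the frequency cut-off.
                if j != i and sent[j] in vocab:
                    counts[(target, sent[j])] += 1.0 / abs(i - j)
    return counts
```

Because the window is symmetric, every pair is counted in both directions, so the resulting matrix is symmetric as well.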

The matrix contains raw co-occurrence counts, so you may want to apply a weighting scheme such as PPMI (positive pointwise mutual information) when building a model:

from models import SkEThesSKE
from weightings import ppmi

model_ske = SkEThesSKE("plain-bnc-matrix", weighting=ppmi)
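
As a rough illustration of what PPMI weighting does, the sketch below replaces each non-zero count with max(0, log(P(w, c) / (P(w) P(c)))), with probabilities estimated from the counts. It follows the textbook definition and is not necessarily identical to weightings.ppmi, which may add smoothing or other options. The function name ppmi_sketch is hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def ppmi_sketch(counts):
    """Return a PPMI-weighted copy of a sparse count matrix."""
    # Going through CSR merges duplicate entries before weighting.
    counts = csr_matrix(counts, dtype=np.float64).tocoo()
    total = counts.sum()
    row_sums = np.asarray(counts.sum(axis=1)).ravel()
    col_sums = np.asarray(counts.sum(axis=0)).ravel()

    # PMI is computed only for stored cells; zero counts stay zero.
    expected = row_sums[counts.row] * col_sums[counts.col]
    pmi = np.log(counts.data * total / expected)

    weighted = coo_matrix(
        (np.maximum(pmi, 0.0), (counts.row, counts.col)), shape=counts.shape
    ).tocsr()
    weighted.eliminate_zeros()  # drop cells whose PMI was clipped to 0
    return weighted
```

Clipping negative values to zero keeps the matrix sparse, which is the usual reason for preferring PPMI over plain PMI.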

Word2Vec

For Word2Vec models, this project wraps the gensim package. Everything that you can open with:

from gensim.models import Word2Vec

model = Word2Vec.load(model_name)

... you can open also with:

from models import Word2Vec

model = Word2Vec(model_name)

The interface and the evaluation script stay the same as for the SkEThesXXX models.

Evaluation

evaluation = model.eval_analogy(dataset)

The dataset is a dictionary that maps each category to a list of queries. Each query should be a tuple of three words followed by a set of accepted answers, like:

("paris", "france", "london", {"england", "britain", "uk"})
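
Putting this together, a complete dataset might look like the sketch below. The category name "capital-country" and the validate_dataset helper are hypothetical and only illustrate the expected shape. Only the query itself comes from the example above.

```python
# A dataset maps each category name to a list of 4-tuples:
# (a, b, aa, set_of_accepted_answers).
dataset = {
    "capital-country": [  # hypothetical category name
        ("paris", "france", "london", {"england", "britain", "uk"}),
    ],
}


def validate_dataset(dataset):
    """Check the shape of every query and return the total number of queries."""
    total = 0
    for category, queries in dataset.items():
        for query in queries:
            if len(query) != 4:
                raise ValueError("%s: expected 4 elements, got %r" % (category, query))
            a, b, aa, answers = query
            if not isinstance(answers, (set, frozenset)) or not answers:
                raise ValueError("%s: answers must be a non-empty set: %r" % (category, query))
            total += 1
    return total
```

Running a check like this before evaluation helps catch malformed queries early, for example an answer given as a plain string instead of a set.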

You may configure the evaluation in various ways:

from formulas import mul

my_mul = lambda a, b, aa: mul(a, b, aa, coeff=0.05)
evaluation = model.eval_analogy(dataset, topn=5, exclusion_trick=False, formula=my_mul)

And see the results:

evaluation[category]["acc"]  # 0.0--1.0
evaluation[category]["acc_top1"]  # 0.0--1.0
evaluation[category]["oov"]  # number of queries containing an OOV (out-of-vocabulary) word
evaluation[category]["oovs"]  # set of OOV words
evaluation[category]["queries"]  # list of queries and their candidate answers (excluding queries with OOV words)
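
To compare categories at a glance, the per-category results can be collected into a small text report. The sketch below relies only on the "acc", "acc_top1" and "oov" keys listed above. The summarize helper is hypothetical and not part of DiMo.

```python
def summarize(evaluation):
    """Format one line per category with accuracy and OOV counts."""
    lines = []
    for category in sorted(evaluation):
        result = evaluation[category]
        lines.append(
            "%-20s acc=%.3f acc_top1=%.3f oov=%d"
            % (category, result["acc"], result["acc_top1"], result["oov"])
        )
    return "\n".join(lines)
```

Calling `print(summarize(evaluation))` after eval_analogy prints one line per category, sorted by category name.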
