
Improved accuracy for small documents #100

@fabiospampinato

Description

I'd like to play with patching franc, or building some alternative to it, that can detect the language of small documents much more accurately.

First of all, is this something that could be interesting to merge into franc itself?

Secondly, I'm almost clueless about language classification; would trying the following things make sense?

  1. Storing more than 300 trigrams, maybe 400 or so.
  2. Using quadgrams or bigrams rather than trigrams.
  3. Extracting the trigrams from a longer and more diverse document than the UDHR.
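For context on what points 1 and 2 would be tuning, the classic rank-based approach (Cavnar & Trenkle style, which franc's trigram tables follow in spirit) builds a frequency-ranked n-gram profile per language and scores a document by an "out-of-place" distance. A minimal sketch, with toy profiles rather than franc's real internals:

```javascript
// Sketch of rank-based n-gram language identification
// (Cavnar & Trenkle style). Profiles here are toy data,
// not franc's actual trigram tables.

// Extract character trigrams from text, padded with spaces.
function trigrams(text) {
  const padded = ` ${text.toLowerCase().replace(/[^a-z\s]/g, '')} `;
  const grams = [];
  for (let i = 0; i < padded.length - 2; i++) {
    grams.push(padded.slice(i, i + 3));
  }
  return grams;
}

// Build a rank profile: the most frequent trigram gets rank 0,
// the next rank 1, and so on, capped at `max` entries
// (this cap is exactly what point 1 above proposes raising).
function profile(text, max = 300) {
  const counts = new Map();
  for (const g of trigrams(text)) {
    counts.set(g, (counts.get(g) || 0) + 1);
  }
  return new Map(
    [...counts.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, max)
      .map(([g], rank) => [g, rank])
  );
}

// Out-of-place distance: sum of rank differences; trigrams
// missing from the language profile get the maximum penalty.
function distance(docProfile, langProfile, max = 300) {
  let sum = 0;
  for (const [gram, rank] of docProfile) {
    const langRank = langProfile.has(gram) ? langProfile.get(gram) : max;
    sum += Math.abs(rank - langRank);
  }
  return sum;
}
```

Classification is then just picking the language whose profile has the smallest distance to the document's profile.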

From a shallow reading of this paper on n-grams, it sounds to me like n-grams may be fundamentally unsuited to short documents, simply because a short text doesn't contain enough data to reliably reconstruct its top 300 (or however many) n-grams, maybe 🤔.
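That intuition is easy to quantify: a short input can only contribute roughly as many trigrams as it has characters, so its ranking barely overlaps a 300-entry profile. A quick illustration (a plain trigram counter, not franc's internals):

```javascript
// Count distinct character trigrams in a text. A short input
// yields roughly length-many trigrams, so comparing its ranking
// against a 300-trigram language profile is mostly noise.
function distinctTrigrams(text) {
  const padded = ` ${text.toLowerCase()} `;
  const seen = new Set();
  for (let i = 0; i < padded.length - 2; i++) {
    seen.add(padded.slice(i, i + 3));
  }
  return seen.size;
}

console.log(distinctTrigrams('hello world')); // 11 — nowhere near 300
```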

CLD3 seems to feed unigrams, bigrams, and trigrams to some neural network, and that somehow works much better for smaller texts; I'm not sure how or why, but maybe that's the way to go.
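One plausible reason the mixed-order approach degrades more gracefully: unigrams and bigrams are much denser in short texts, so they still carry signal when trigrams are sparse. A rough sketch of hashing all three n-gram orders into one fixed-size feature vector that a small classifier could consume; the bucket count and hash function are illustrative assumptions, not CLD3's actual parameters:

```javascript
// Sketch: hash unigram, bigram, and trigram frequencies into a
// single fixed-size feature vector, suitable as classifier input.
// BUCKETS and hashGram are illustrative choices, not CLD3's.
const BUCKETS = 64;

function hashGram(gram) {
  let h = 0;
  for (const ch of gram) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0;
  }
  return h % BUCKETS;
}

function features(text) {
  const vec = new Float64Array(BUCKETS);
  let total = 0;
  for (const n of [1, 2, 3]) { // unigrams, bigrams, trigrams
    for (let i = 0; i + n <= text.length; i++) {
      vec[hashGram(text.slice(i, i + n))] += 1;
      total += 1;
    }
  }
  // Normalize so short and long inputs are comparable.
  for (let i = 0; i < BUCKETS; i++) vec[i] /= total || 1;
  return vec;
}
```

Note that even a two-character input contributes three features here (two unigrams, one bigram) while contributing no trigrams at all, which is one way to see why lower-order n-grams help on short text.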

Any other ideas that I should try?

