I'd like to play with patching franc, or making some alternative to it, that can detect the language of small documents much more accurately.
First of all, is this something that could be interesting to merge into franc itself?
Secondly, I'm almost clueless about language classification; would trying the following things make sense?
- Storing more than 300 trigrams, maybe 400 or so.
- Using quadgrams or bigrams rather than trigrams.
- Extracting the trigrams from a longer and more diverse document than the UDHR.
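For context on what tweaking those knobs would affect, here is a minimal sketch of the rank-based trigram approach (Cavnar & Trenkle style) that franc builds on. The function names and the 300-trigram profile size are illustrative, not franc's actual API; the point is that both the profile size and the n-gram order are just parameters of this scheme.

```javascript
// Extract overlapping trigrams from normalized text (padding with spaces).
function trigrams(text) {
  const clean = ' ' + text.toLowerCase().replace(/[^a-zà-ÿ]+/g, ' ').trim() + ' ';
  const grams = [];
  for (let i = 0; i < clean.length - 2; i++) grams.push(clean.slice(i, i + 3));
  return grams;
}

// Build a rank profile: the `size` most frequent trigrams, mapped to their rank.
// Raising `size` from 300 to 400 is just changing this parameter.
function profile(text, size = 300) {
  const counts = new Map();
  for (const g of trigrams(text)) counts.set(g, (counts.get(g) || 0) + 1);
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, size);
  const ranks = new Map();
  ranked.forEach(([g], i) => ranks.set(g, i));
  return ranks;
}

// "Out-of-place" distance: sum of rank differences, with a fixed penalty
// for trigrams missing from the language profile. Lower means more similar.
function outOfPlace(docProfile, langProfile, maxPenalty = 300) {
  let dist = 0;
  for (const [g, r] of docProfile) {
    dist += langProfile.has(g) ? Math.abs(langProfile.get(g) - r) : maxPenalty;
  }
  return dist;
}
```

For a very short document the `docProfile` contains only a handful of trigrams, which is the core of the problem: a few trigrams give a noisy rank estimate, so the distance to the wrong language can easily come out smaller.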
From a shallow reading of this paper on n-grams, it sounds like n-grams may be fundamentally ill-suited to short documents: there just isn't enough data in a short text to reliably reconstruct its top 300 (or however many) n-grams, maybe 🤔.
CLD3 seems to feed unigrams, bigrams, and trigrams into a small neural network, and that apparently works much better on short texts; I'm not sure how or why, but maybe that's the way to go.
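To make the CLD3 idea concrete, here is a very loose sketch of that style of feature extraction: for each n-gram order, compute each n-gram's fraction of the text and hash it into a small fixed-size vector. The real CLD3 feeds such fractions through learned embeddings into an MLP; the hashing scheme and vector size below are illustrative assumptions, not CLD3's actual internals.

```javascript
// Fractions of each distinct n-gram in the text (counts normalized to sum to 1).
function ngramFractions(text, n) {
  const counts = new Map();
  let total = 0;
  for (let i = 0; i + n <= text.length; i++) {
    const g = text.slice(i, i + n);
    counts.set(g, (counts.get(g) || 0) + 1);
    total++;
  }
  for (const [g, c] of counts) counts.set(g, c / total);
  return counts;
}

// Combine unigram, bigram, and trigram fractions into one fixed-size vector
// via a simple string hash (illustrative stand-in for learned embeddings).
function featureVector(text, dim = 64) {
  const vec = new Array(dim).fill(0);
  for (const n of [1, 2, 3]) {
    for (const [g, frac] of ngramFractions(text.toLowerCase(), n)) {
      let h = 0;
      for (const ch of g) h = (h * 31 + ch.codePointAt(0)) >>> 0;
      vec[h % dim] += frac;
    }
  }
  return vec;
}
```

One plausible reason this helps on short texts: fractions over unigrams and bigrams are much less sparse than a top-300 trigram ranking, so even a 20-character input yields a dense, comparable feature vector instead of a mostly empty profile.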
Any other ideas that I should try?