I'd like to play with patching franc, or making some alternative to it, that can detect the language of small documents much more accurately.
First of all, is this something that could be interesting to merge into franc itself?
Secondly, I'm almost clueless about language classification; would trying the following things make sense?
- Storing more than 300 trigrams, maybe 400 or so.
- Using quadgrams or bigrams rather than trigrams.
- Extracting the trigrams from a longer and more diverse document than the UDHR.
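For context on what tweaking those knobs would affect, here is a minimal sketch of the rank-based trigram approach (Cavnar & Trenkle style) that franc builds on. The function names and the 300-trigram profile size are illustrative, not franc's actual API; the point is that both the profile size and the n-gram order are just parameters of this scheme.

```javascript
// Extract overlapping trigrams from normalized text (padding with spaces).
function trigrams(text) {
  const clean = ' ' + text.toLowerCase().replace(/[^a-zà-ÿ]+/g, ' ').trim() + ' ';
  const grams = [];
  for (let i = 0; i < clean.length - 2; i++) grams.push(clean.slice(i, i + 3));
  return grams;
}

// Build a rank profile: the `size` most frequent trigrams, mapped to their rank.
// Raising `size` from 300 to 400 is just changing this parameter.
function profile(text, size = 300) {
  const counts = new Map();
  for (const g of trigrams(text)) counts.set(g, (counts.get(g) || 0) + 1);
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, size);
  const ranks = new Map();
  ranked.forEach(([g], i) => ranks.set(g, i));
  return ranks;
}

// "Out-of-place" distance: sum of rank differences, with a fixed penalty
// for trigrams missing from the language profile. Lower means more similar.
function outOfPlace(docProfile, langProfile, maxPenalty = 300) {
  let dist = 0;
  for (const [g, r] of docProfile) {
    dist += langProfile.has(g) ? Math.abs(langProfile.get(g) - r) : maxPenalty;
  }
  return dist;
}
```

For a very short document the `docProfile` contains only a handful of trigrams, which is the core of the problem: a few trigrams give a noisy rank estimate, so the distance to the wrong language can easily come out smaller.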
From a shallow reading of this paper on n-grams, it sounds like n-grams may be fundamentally ill-suited to short documents: there just isn't enough data in a short text to reliably reconstruct its top 300 (or however many) n-grams, maybe 🤔.
CLD3 seems to feed unigrams, bigrams, and trigrams into a small neural network, and that apparently works much better on short texts; I'm not sure how or why, but maybe that's the way to go.
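To make the CLD3 idea concrete, here is a very loose sketch of that style of feature extraction: for each n-gram order, compute each n-gram's fraction of the text and hash it into a small fixed-size vector. The real CLD3 feeds such fractions through learned embeddings into an MLP; the hashing scheme and vector size below are illustrative assumptions, not CLD3's actual internals.

```javascript
// Fractions of each distinct n-gram in the text (counts normalized to sum to 1).
function ngramFractions(text, n) {
  const counts = new Map();
  let total = 0;
  for (let i = 0; i + n <= text.length; i++) {
    const g = text.slice(i, i + n);
    counts.set(g, (counts.get(g) || 0) + 1);
    total++;
  }
  for (const [g, c] of counts) counts.set(g, c / total);
  return counts;
}

// Combine unigram, bigram, and trigram fractions into one fixed-size vector
// via a simple string hash (illustrative stand-in for learned embeddings).
function featureVector(text, dim = 64) {
  const vec = new Array(dim).fill(0);
  for (const n of [1, 2, 3]) {
    for (const [g, frac] of ngramFractions(text.toLowerCase(), n)) {
      let h = 0;
      for (const ch of g) h = (h * 31 + ch.codePointAt(0)) >>> 0;
      vec[h % dim] += frac;
    }
  }
  return vec;
}
```

One plausible reason this helps on short texts: fractions over unigrams and bigrams are much less sparse than a top-300 trigram ranking, so even a 20-character input yields a dense, comparable feature vector instead of a mostly empty profile.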
Any other ideas that I should try?