Use languages' alphabets to make detection more accurate #83

`Что это за язык?` is a Russian sentence, but it is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither the Bulgarian nor the Macedonian alphabet contains the letters э and ы.

The same happens with `Чекаю цієї хвилини.`, which is Ukrainian but is detected as Northern Uzbek with probability 1, whereas Ukrainian gets only 0.33999999999999997. Yet the letters є and ї are used only in Ukrainian, and the Uzbek Cyrillic alphabet lacks as many as five letters from this sentence, namely ю, ц, і, є and ї.

I know that Franc is not expected to do well on short input strings, but taking alphabets into account seems like a promising way to improve accuracy.
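To make the idea concrete, here is a minimal sketch in Python of how an alphabet check could be applied on top of a ranked candidate list. It is not Franc's code or API: the `ALPHABETS` table and the `filter_by_alphabet` helper are hypothetical names made up for illustration, and the table covers only the three languages from the first example.

```python
# Hypothetical alphabet table (lowercase letters only), limited to the
# three languages that appear in the first example.
ALPHABETS = {
    "rus": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    "bul": set("абвгдежзийклмнопрстуфхцчшщъьюя"),
    "mkd": set("абвгдѓежзѕијклљмнњопрстќуфхцчџш"),
}


def filter_by_alphabet(text, candidates):
    """Drop candidates whose alphabet lacks a letter that occurs in text.

    candidates is a list of (language code, score) pairs, best first.
    Languages missing from ALPHABETS are kept, since nothing is known
    about them. If every candidate would be dropped, the original list
    is returned unchanged.
    """
    letters = {ch for ch in text.lower() if ch.isalpha()}
    kept = [
        (lang, score)
        for lang, score in candidates
        if lang not in ALPHABETS or letters <= ALPHABETS[lang]
    ]
    return kept or list(candidates)


# Scores copied from the first example above.
candidates = [
    ("bul", 1),
    ("rus", 0.938953488372093),
    ("mkd", 0.9353197674418605),
]
result = filter_by_alphabet("Что это за язык?", candidates)
```

With these inputs only `rus` survives, because э and ы are missing from both the Bulgarian and the Macedonian sets. A hard filter like this would misfire on text that contains foreign names or loanwords, so lowering the score of a mismatching language instead of removing it may be the safer variant.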
