-
Notifications
You must be signed in to change notification settings - Fork 37
Description
tl;dr Using a sequence alignment algorithm can improve the fuzzy matching quality, as what fzf does. But this approach is much simpler without sacrifice of quality. See the following example:
//////////// Ranks(pattern, {best_candidate, ........, worst_candidate})
// case
CHECK(Ranks("monad", {"monad", "Monad", "mONAD"}));
// initials
CHECK(Ranks("ab", {"ab", "aoo_boo", "acb"}));
CHECK(Ranks("CC", {"CamelCase", "camelCase", "camelcase"}));
CHECK(Ranks("cC", {"camelCase", "CamelCase", "camelcase"}));
CHECK(Ranks("c c", {"camel case", "camelCase", "CamelCase", "camelcase",
"camel ace"}));
CHECK(Ranks("Da.Te",
{"Data.Text", "Data.Text.Lazy", "Data.Aeson.Encoding.text"}));
CHECK(Ranks("foo bar.h", {"foo/bar.h", "foobar.h"}));
// prefix
CHECK(Ranks("is", {"isIEEE", "inSuf"}));
// shorter
CHECK(Ranks("ma", {"map", "many", "maximum"}));
CHECK(Ranks("print", {"printf", "sprintf"}));
// score(PRINT) = kMinScore
CHECK(Ranks("ast", {"ast", "AST", "INT_FAST16_MAX"}));
// score(PRINT) > kMinScore
CHECK(Ranks("Int", {"int", "INT", "PRINT"}));
What helm-M-x
returns for the pattern c m
:
0. c++-mode ; deserved champion
1. customize
2. eldoc-mode
4. company-mode ;;this should be the runner-up, because `c m` are Head-position matching
counsel-M-x
:
0. customize-group ; m is Tail-position matching, this should be given lower weight
1. chmod ; this is short so it comes first, but `m` is a Tail position
4. c++-mode ; our champion shouldn't be here
5. css-mode
..........
I have read many fuzzy matching algorithms, including the one used by fzf. It is too complex and also messes about heuristics.
I adapted and simplified the fuzzy matching algorithm used in clangd and created https://github.com/MaskRay/cquery/blob/fuzzy/src/fuzzy_match.cc
It takes many factors into account:
- case matching.
- whether pattern is case folding search
- The initial position (Head) of each word should be given more weight
- consecutive or prefix matching is preferred
- Break ties by length
While keeps it conceptually simple in the main loop:
dp[0][0][0] = dp[0][0][1] = 0;
for (int j = 0; j < n; j++) {
dp[0][j + 1][0] = dp[0][j][0] + MissScore(j, false);
dp[0][j + 1][1] = kMinScore * 2;
}
for (int i = 0; i < int(pat.size()); i++) {
int(*pre)[2] = dp[i & 1];
int(*cur)[2] = dp[(i + 1) & 1];
cur[i][0] = cur[i][1] = kMinScore;
for (int j = i; j < n; j++) {
cur[j + 1][0] = std::max(cur[j][0] + MissScore(j, false),
cur[j][1] + MissScore(j, true));
// For the first char of pattern, apply extra restriction to filter bad
// candidates (e.g. |int| in |PRINT|)
if (low_pat[i] == low_text[j] &&
(i || text_role[j] != Tail || pat[i] == text[j])) {
cur[j + 1][1] = std::max(pre[j][0] + MatchScore(i, j, false),
pre[j][1] + MatchScore(i, j, true));
} else
cur[j + 1][1] = kMinScore * 2;
}
}
The heuristics are gathered and coded in one place FuzzyMatcher::MatchScore
, instead of scattering all over the code (which is unfortunately the case for many other fuzzy matching algorithms). I have also done something similar for rofi which you can experience by appending -sort -matching fuzzy
to its command line.
I'll explain briefly about the current heuristics. Similar rules can be added.
int FuzzyMatcher::MatchScore(int i, int j, bool last) {
int s = 0;
// Bonus for case match
if (pat[i] == text[j]) {
s++;
// Bonus for prefix match or case match when the pattern contains upper-case letters
if ((pat_set & 1 << Upper) || i == j)
s++;
}
// For initial positions of pattern words
if (pat_role[i] == Head) {
// Bonus if it is matched to an initial position of some text word
if (text_role[j] == Head)
s += 30;
// Penalty for non-initial positions
else if (text_role[j] == Tail)
s -= 10;
}
if (text_role[j] == Tail && i && !last)
s -= 30;
// The first character of pattern is matched to a non-initial position in the text
if (i == 0 && text_role[j] == Tail)
s -= 40;
return s;
}
One caveat is that for space complexity I use compressed two-row dp[2][n][2]
instead of full dp[m][n][2]
. This way, we lose the ability to reconstruct the optimal path (matching positions).
Given pattern cm
, the sequence alignment algorithm will prefer [c]ompany-[m]ode
over [c]o[m]pany-mode
, as the initial positions of words are given more weight. However, I do not store the full sequence alignment table dp[m][n][2]
so it is not possible to reconstruct the optimal path. But this should only cause a minor visual glitch.
I'm not that good at elisp (implying that I cannot implement this ....) but I believe it is a huge improvement. Appreciated if someone could implement this..........
If you speak Chinese, you may read my complain in https://emacs-china.org/t/topic/5368 ............
ivy integration
- Use a subsequence filter because applying this O(n*m) algorithm on every candidate will be slow. And the algorithm will return an awful value for non-matching candidates anyway
- Calculate the score for each candidate and use it to sort candidates
- This can only be used for
ivy--regex-fuzzy
, as other modes do not allow subsequence filtering. Also we cannot use arbitrary regular expressions in patterns, because subsequence filter and this sequence alignment algorithm does not understand them. This is not a limitation, if the algorithm is smart enough to assign proper weights to consecutive matching, initial-character matching, .... You will find very rare case where regex is helpful - I don't read the ivy code carefully, because
ivy--flx-sort
and other regular expression constructions seem unnecessary.