Use O(n*m) sequence alignment algorithm

tl;dr Using a sequence alignment algorithm can improve the fuzzy matching quality, as what fzf does. But this approach is much simpler without sacrifice of quality. See the following example:

```
    //////////// Ranks(pattern,  {best_candidate, ........, worst_candidate})
    // case
    CHECK(Ranks("monad", {"monad", "Monad", "mONAD"}));
    // initials
    CHECK(Ranks("ab", {"ab", "aoo_boo", "acb"}));
    CHECK(Ranks("CC", {"CamelCase", "camelCase", "camelcase"}));
    CHECK(Ranks("cC", {"camelCase", "CamelCase", "camelcase"}));
    CHECK(Ranks("c c", {"camel case", "camelCase", "CamelCase", "camelcase",
                        "camel ace"}));
    CHECK(Ranks("Da.Te",
                {"Data.Text", "Data.Text.Lazy", "Data.Aeson.Encoding.text"}));
    CHECK(Ranks("foo bar.h", {"foo/bar.h", "foobar.h"}));
    // prefix
    CHECK(Ranks("is", {"isIEEE", "inSuf"}));
    // shorter
    CHECK(Ranks("ma", {"map", "many", "maximum"}));
    CHECK(Ranks("print", {"printf", "sprintf"}));
    // score(PRINT) = kMinScore
    CHECK(Ranks("ast", {"ast", "AST", "INT_FAST16_MAX"}));
    // score(PRINT) > kMinScore
    CHECK(Ranks("Int", {"int", "INT", "PRINT"}));
```

What `helm-M-x` returns for the pattern `c m`:
```
0. c++-mode          ; deserved champion
1. customize
2. eldoc-mode
4. company-mode   ;;this should be the runner-up, because `c m` are Head-position matching
```

`counsel-M-x`:
```
0. customize-group   ; m is Tail-position matching, this should be given lower weight
1. chmod                ; this is short so it comes first, but `m` is a Tail position
4. c++-mode        ; our champion shouldn't be here
5. css-mode
..........
```

I have read many fuzzy matching algorithms, including the one used by [fzf](https://github.com/junegunn/fzf). It is too complex and also messes about heuristics.

I adapted and simplified the fuzzy matching algorithm used in [clangd](https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/FuzzyMatch.cpp) and created https://github.com/MaskRay/cquery/blob/fuzzy/src/fuzzy_match.cc

It takes many factors into account:

* case matching.
* whether pattern is case folding search
* The initial position (Head) of each word should be given more weight
* consecutive or prefix matching is preferred
* Break ties by length

While keeps it conceptually simple in the main loop:
```c++
  dp[0][0][0] = dp[0][0][1] = 0;
  for (int j = 0; j < n; j++) {
    dp[0][j + 1][0] = dp[0][j][0] + MissScore(j, false);
    dp[0][j + 1][1] = kMinScore * 2;
  }
  for (int i = 0; i < int(pat.size()); i++) {
    int(*pre)[2] = dp[i & 1];
    int(*cur)[2] = dp[(i + 1) & 1];
    cur[i][0] = cur[i][1] = kMinScore;
    for (int j = i; j < n; j++) {
      cur[j + 1][0] = std::max(cur[j][0] + MissScore(j, false),
                               cur[j][1] + MissScore(j, true));
      // For the first char of pattern, apply extra restriction to filter bad
      // candidates (e.g. |int| in |PRINT|)
      if (low_pat[i] == low_text[j] &&
          (i || text_role[j] != Tail || pat[i] == text[j])) {
        cur[j + 1][1] = std::max(pre[j][0] + MatchScore(i, j, false),
                                 pre[j][1] + MatchScore(i, j, true));
      } else
        cur[j + 1][1] = kMinScore * 2;
    }
  }
```

The heuristics are gathered and coded in **one place** `FuzzyMatcher::MatchScore`, instead of scattering all over the code (which is unfortunately the case for many other fuzzy matching algorithms). I have also done something similar for rofi which you can experience by appending `-sort -matching fuzzy` to its command line.

I'll explain briefly about the current heuristics. Similar rules can be added.
```c++
int FuzzyMatcher::MatchScore(int i, int j, bool last) {
  int s = 0;
  // Bonus for case match
  if (pat[i] == text[j]) {
    s++;
    // Bonus for prefix match or case match when the pattern contains upper-case letters
    if ((pat_set & 1 << Upper) || i == j)
      s++;
  }
  // For initial positions of pattern words
  if (pat_role[i] == Head) {
    // Bonus if it is matched to an initial position of some text word
    if (text_role[j] == Head)
      s += 30;
    // Penalty for non-initial positions
    else if (text_role[j] == Tail)
      s -= 10;
  }
  if (text_role[j] == Tail && i && !last)
    s -= 30;
  // The first character of pattern is matched to a non-initial position in the text
  if (i == 0 && text_role[j] == Tail)
    s -= 40;
  return s;
}
```

One caveat is that for space complexity I use compressed two-row `dp[2][n][2]` instead of full `dp[m][n][2]`. This way, we lose the ability to reconstruct the optimal path (matching positions).

Given pattern `cm`, the sequence alignment algorithm will prefer `[c]ompany-[m]ode` over `[c]o[m]pany-mode`, as the initial positions of words are given more weight. However, I do not store the full sequence alignment table `dp[m][n][2]` so it is not possible to reconstruct the optimal path. But this should only cause a minor visual glitch.

I'm not that good at elisp (implying that I cannot implement this ....) but I believe it is a huge improvement. Appreciated if someone could implement this..........

If you speak Chinese, you may read my complain in https://emacs-china.org/t/topic/5368 ............

### ivy integration

* Use a subsequence filter because applying this O(n*m) algorithm on every candidate will be slow. And the algorithm will return an awful value for non-matching candidates anyway
* Calculate the score for each candidate and use it to sort candidates
* This can only be used for `ivy--regex-fuzzy`, as other modes do not allow subsequence filtering. Also we cannot use arbitrary regular expressions in patterns, because subsequence filter and this sequence alignment algorithm does not understand them. This is not a limitation, if the algorithm is smart enough to assign proper weights to consecutive matching, initial-character matching, .... You will find very rare case where regex is helpful
* I don't read the ivy code carefully, because `ivy--flx-sort` and other regular expression constructions seem unnecessary.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use O(n*m) sequence alignment algorithm #98

ivy integration

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Use O(n*m) sequence alignment algorithm #98

Description

ivy integration

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions