这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@keegancsmith
Copy link
Member

@keegancsmith keegancsmith commented Jan 30, 2023

A common search Zoekt gets from Sourcegraph is "\bLITERAL\b". With this PR we avoid the regex engine for these type of queries and provide something faster.

Local benchmarks show that the new code runs 4.8x faster for select queries.

Co-authored-by: @stefanhengl

Still need to add tests for this
@keegancsmith
Copy link
Member Author

@stefanhengl I couldn't help myself and wrote a little more code which should make this not break anything.

@keegancsmith
Copy link
Member Author

@stefanhengl Thinking about next steps here. I believe this is working. What isn't known if this actually makes a difference.

  • Do a simple test with something like hyperfine on a decent corpus
  • Eyeball pprof
  • Improve implementation. Maybe we should make a wrapper around regex so this is easier to use in things like searcher/etc? Should we somehow implement this optimization in upstream regex?
  • Is it worth shipping a custom zoekt to the customer that inspired this optimization?

@stefanhengl
Copy link
Member

stefanhengl commented Feb 6, 2023

I ran a couple of tests and it looks like we are finding too many results. EG a search for \bFind\b case:yes will return FindID while the main branch doesn't. I will look into it.

EDIT:
We were treating aA or Aa as word boundary which is not what \bA\b does, hence the many results.

@stefanhengl
Copy link
Member

Performance looks promissing (4.8x faster).

→  hyperfine -w 1 "./zoekt_main --index_dir ~/corpus/index \"regex:\bFind\b case:yes\"" "./zoekt_word --index_
dir ~/corpus/index \"regex:\bFind\b case:yes\""
Benchmark 1: ./zoekt_main --index_dir ~/corpus/index "regex:\bFind\b case:yes"
  Time (mean ± σ):     532.7 ms ±   9.0 ms    [User: 1629.3 ms, System: 145.6 ms]
  Range (min … max):   516.2 ms … 548.9 ms    10 runs
 
Benchmark 2: ./zoekt_word --index_dir ~/corpus/index "regex:\bFind\b case:yes"
  Time (mean ± σ):     109.8 ms ±   5.9 ms    [User: 344.9 ms, System: 146.6 ms]
  Range (min … max):   103.3 ms … 127.4 ms    26 runs
 
Summary
  './zoekt_word --index_dir ~/corpus/index "regex:\bFind\b case:yes"' ran
    4.85 ± 0.28 times faster than './zoekt_main --index_dir ~/corpus/index "regex:\bFind\b case:yes"'

@keegancsmith
Copy link
Member Author

\o/

@stefanhengl
Copy link
Member

@keegancsmith Shall we roll this out? I think this can be live while we polish it.

@stefanhengl stefanhengl marked this pull request as ready for review February 7, 2023 13:31
@stefanhengl stefanhengl changed the title WIP special case word search matchtree: special case word search Feb 7, 2023
@keegancsmith
Copy link
Member Author

keegancsmith commented Feb 7, 2023

@keegancsmith Shall we roll this out? I think this can be live while we polish it.

I think writing a test is a minumum before landing. You wanna do that? Also I wonder if we should record running this code path as another Stat (instead of regexpsconsidered). But regexpsconsidered is probably fine.

Edit: I do believe this change is safe though, so making it in before branch cut would be awesome.

@stefanhengl
Copy link
Member

I think writing a test is a minumum before landing. You wanna do that?

👍

@stefanhengl
Copy link
Member

Also I wonder if we should record running this code path as another Stat (instead of regexpsconsidered). But regexpsconsidered is probably fine.

Not sure. Given the docstring we probably shouldn't count evaluations of the wordMatchTree as regexp.

// Number of times regexp was called on files that we evaluated.
RegexpsConsidered int

I removed the counter for wordMatchTree. Looking at stats I am not sure counting evaluations of wordMatchTree separately makes sense. The granularity seems too fine compared to the other stats. Additionally we already skip the regexp engine in other places (see regexpToMatchTreeRecursive) and we don't count that either. WDYT @keegancsmith ?

Copy link
Member Author

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhengl I can't approve, but LGTM. Ship it :)

@stefanhengl stefanhengl merged commit ea5ebff into main Feb 9, 2023
@stefanhengl stefanhengl deleted the k/word-search branch February 9, 2023 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants