matchtree: special case word search #526

keegancsmith · 2023-01-30T09:20:18Z

A common search Zoekt gets from Sourcegraph is "\bLITERAL\b". With this PR we avoid the regex engine for this type of query and provide something faster.

Local benchmarks show that the new code runs 4.8x faster for select queries.
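
For readers following along, here is a minimal sketch of how a pattern of the shape \bLITERAL\b could be recognized with the standard regexp/syntax package. The function name wordLiteral is hypothetical and this is not the code in matchtree.go; it only assumes that the parser yields a three-element concatenation (word boundary, literal, word boundary) for such patterns and that case-insensitive literals are left to the regex engine.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// wordLiteral reports whether pattern has the exact shape \bLITERAL\b and,
// if so, returns LITERAL. Hypothetical helper for illustration only.
func wordLiteral(pattern string) (string, bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return "", false
	}
	// Expect exactly: word boundary, literal, word boundary.
	if re.Op != syntax.OpConcat || len(re.Sub) != 3 {
		return "", false
	}
	if re.Sub[0].Op != syntax.OpWordBoundary || re.Sub[2].Op != syntax.OpWordBoundary {
		return "", false
	}
	lit := re.Sub[1]
	// Assumption: case-insensitive literals stay on the regex path.
	if lit.Op != syntax.OpLiteral || lit.Flags&syntax.FoldCase != 0 {
		return "", false
	}
	return string(lit.Rune), true
}

func main() {
	for _, p := range []string{`\bFind\b`, `Find`, `\bF.nd\b`, `(?i)\bFind\b`} {
		lit, ok := wordLiteral(p)
		fmt.Printf("%q -> %q %v\n", p, lit, ok)
	}
}
```

Anything that does not match this shape would fall back to the existing regex path, so the check can afford to be conservative.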

Co-authored-by: @stefanhengl

Still need to add tests for this

keegancsmith · 2023-01-30T14:33:00Z

@stefanhengl I couldn't help myself and wrote a little more code which should make this not break anything.

keegancsmith · 2023-02-05T21:34:42Z

@stefanhengl Thinking about next steps here. I believe this is working. What isn't known is whether this actually makes a difference.

- Do a simple test with something like hyperfine on a decent corpus
- Eyeball pprof
- Improve implementation. Maybe we should make a wrapper around regex so this is easier to use in things like searcher/etc? Should we somehow implement this optimization in upstream regex?
- Is it worth shipping a custom zoekt to the customer that inspired this optimization?

stefanhengl · 2023-02-06T12:10:42Z

I ran a couple of tests and it looks like we are finding too many results. E.g. a search for \bFind\b case:yes returns FindID, while the main branch doesn't. I will look into it.

EDIT:
We were treating aA or Aa as a word boundary, which is not what \bA\b does, hence the extra results.
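
To make the boundary rule concrete, here is a minimal sketch of a literal search that only accepts a hit when each side is a non-word byte or the start/end of the content, which is how Go's regexp documents \b (ASCII only). The names isWordByte and findWord are hypothetical and this is not the wordMatchTree implementation. It assumes the literal itself starts and ends with a word character.

```go
package main

import (
	"bytes"
	"fmt"
)

// isWordByte reports whether b is an ASCII word character, i.e. what \w
// matches in Go's regexp syntax: [0-9A-Za-z_].
func isWordByte(b byte) bool {
	return b == '_' ||
		('0' <= b && b <= '9') ||
		('a' <= b && b <= 'z') ||
		('A' <= b && b <= 'Z')
}

// findWord returns the byte offsets of every occurrence of word in content
// that has a non-word byte, or the start/end of content, on both sides.
// A change of letter case (aA or Aa) is NOT a boundary.
func findWord(content, word []byte) []int {
	if len(word) == 0 {
		return nil
	}
	var offsets []int
	for start := 0; start < len(content); {
		i := bytes.Index(content[start:], word)
		if i < 0 {
			break
		}
		off := start + i
		end := off + len(word)
		leftOK := off == 0 || !isWordByte(content[off-1])
		rightOK := end == len(content) || !isWordByte(content[end])
		if leftOK && rightOK {
			offsets = append(offsets, off)
		}
		start = off + 1
	}
	return offsets
}

func main() {
	content := []byte("Find FindID aFind Find")
	fmt.Println(findWord(content, []byte("Find"))) // [0 18]
}
```

With this rule FindID and aFind are rejected, which matches what the regex engine does on the main branch for \bFind\b case:yes.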

stefanhengl · 2023-02-06T13:25:01Z

Performance looks promising (4.8x faster).

→  hyperfine -w 1 "./zoekt_main --index_dir ~/corpus/index \"regex:\bFind\b case:yes\"" "./zoekt_word --index_dir ~/corpus/index \"regex:\bFind\b case:yes\""
Benchmark 1: ./zoekt_main --index_dir ~/corpus/index "regex:\bFind\b case:yes"
  Time (mean ± σ):     532.7 ms ±   9.0 ms    [User: 1629.3 ms, System: 145.6 ms]
  Range (min … max):   516.2 ms … 548.9 ms    10 runs
 
Benchmark 2: ./zoekt_word --index_dir ~/corpus/index "regex:\bFind\b case:yes"
  Time (mean ± σ):     109.8 ms ±   5.9 ms    [User: 344.9 ms, System: 146.6 ms]
  Range (min … max):   103.3 ms … 127.4 ms    26 runs
 
Summary
  './zoekt_word --index_dir ~/corpus/index "regex:\bFind\b case:yes"' ran
    4.85 ± 0.28 times faster than './zoekt_main --index_dir ~/corpus/index "regex:\bFind\b case:yes"'

keegancsmith · 2023-02-06T13:58:03Z

\o/

stefanhengl · 2023-02-07T13:31:09Z

@keegancsmith Shall we roll this out? I think this can be live while we polish it.

keegancsmith · 2023-02-07T15:03:40Z

> @keegancsmith Shall we roll this out? I think this can be live while we polish it.

I think writing a test is a minimum before landing. You wanna do that? Also I wonder if we should record running this code path as another Stat (instead of regexpsconsidered). But regexpsconsidered is probably fine.

Edit: I do believe this change is safe though, so making it in before branch cut would be awesome.

stefanhengl · 2023-02-07T16:20:41Z

> I think writing a test is a minimum before landing. You wanna do that?

👍

stefanhengl · 2023-02-08T08:35:14Z

> Also I wonder if we should record running this code path as another Stat (instead of regexpsconsidered). But regexpsconsidered is probably fine.

Not sure. Given the docstring we probably shouldn't count evaluations of the wordMatchTree as regexp.

// Number of times regexp was called on files that we evaluated.
RegexpsConsidered int

I removed the counter for wordMatchTree. Looking at stats, I am not sure counting evaluations of wordMatchTree separately makes sense. The granularity seems too fine compared to the other stats. Additionally, we already skip the regexp engine in other places (see regexpToMatchTreeRecursive) and we don't count that either. WDYT @keegancsmith ?

keegancsmith

@stefanhengl I can't approve, but LGTM. Ship it :)

matchtree.go

WIP special case word search

45f7de0

keegancsmith assigned stefanhengl Jan 30, 2023

correctly decide to use word search

4c9389a

Still need to add tests for this

mimic behavior of "\W, \A or \z"

ee49343

set fileName: bool correctly

0d7c679

stefanhengl marked this pull request as ready for review February 7, 2023 13:31

stefanhengl changed the title ~~WIP special case word search~~ matchtree: special case word search Feb 7, 2023

stefanhengl added 2 commits February 8, 2023 08:56

add tests

4e937f8

don't count regexpConsidered for wordMatchTree

536bf68

ChunkMatches

83a47dd

keegancsmith commented Feb 8, 2023

stefanhengl approved these changes Feb 9, 2023

typo

975d210

stefanhengl merged commit ea5ebff into main Feb 9, 2023

stefanhengl deleted the k/word-search branch February 9, 2023 09:45
