Adding a cache to speed up SearXNG #3386
Replies: 4 comments 6 replies
-
Such a "simple" caching method will break SearXNG's functionality, because parameters of a query such as the language, the active engines, and more are not taken into account. If we think about the variation of query parameters and the freshness of the results, we come to the conclusion that a simple caching method is not practicable; a few reasons for this have already been given here in the thread by me and others. I don't want to stifle the discussion in principle, but it should take place at a higher level of abstraction and take all relevant aspects into consideration. The solution proposed here should not find any imitators, which is why I unfortunately have to close this discussion now.
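To illustrate the point about query parameters: a cache keyed on the raw query string alone would return the same entry for requests that differ in language, engine selection, or safe-search level. A minimal sketch of a composite cache key (field names here are illustrative assumptions, not SearXNG internals):

```python
# Hypothetical sketch: build a cache key from all result-affecting
# parameters, not just the query string. Field names are assumptions.
import hashlib
import json

def cache_key(query: str, lang: str, engines: list, safesearch: int) -> str:
    # Serialize deterministically so equivalent requests hash identically.
    payload = json.dumps(
        {"q": query, "lang": lang, "engines": sorted(engines), "safe": safesearch},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this, the same query in different languages yields different keys, while engine order does not matter.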
-
My implementation was super simple: name the cache file after the search term and check whether the file exists; if so, display it, otherwise perform the search as normal. I recently updated the code. LLMs like ChatGPT are utilizing search engines now, and caching could speed this up substantially since the LLM would not have to go out to the web; it would have to be a distributed database, or alternatively the nearest fast proxy instance. It could be opt-in; if privacy is paramount, the user can opt out. If you want to experiment with caching results locally, the updated code is here (no affiliation with searx or SearXNG, just for testing): https://www.imtcoin.com/kb.php?page=Caching+SearXNG&redirect=no
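The approach described above can be sketched in a few lines. This is a minimal illustration, not the linked code: the cache directory and function names are assumptions, and the query is hashed so arbitrary search terms become safe file names.

```python
# Hedged sketch of "name the cache file after the search term, serve it
# if it exists, otherwise search and store". Not the author's code.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("/usr/local/searxng/cache")  # assumed location

def cache_path(query: str) -> Path:
    # Hash the query so any search term maps to a safe file name.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return CACHE_DIR / digest

def cached_search(query: str, do_search) -> dict:
    path = cache_path(query)
    if path.exists():
        # Cache hit: return the stored results without searching.
        return json.loads(path.read_text(encoding="utf-8"))
    results = do_search(query)  # cache miss: perform a normal search
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results), encoding="utf-8")
    return results
```

Note this sketch inherits the problem the maintainer describes: it keys only on the query string, ignoring language and engine parameters.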
-
For reference, the fork from the /e/ foundation implements a cache: https://gitlab.e.foundation/e/infra/spot. However, I can't find it in the code. Public instance: https://spot.murena.io/
-
This seems like a reasonable feature to add, and it could be beneficial to public server operators.

I don't see a TTL in this: we will need to cache results for some period of time and purge them from storage afterwards, both to limit storage growth and to avoid stale results. This should also be disabled by default in the settings, at least for the initial implementation. We may want to ask specific public-instance maintainers whether this feature is useful for them by checking whether it reduces the number of outbound IO requests. I also think we will need metrics that track how many requests were served from the cache versus IO-bound, so we can measure how useful it is.

Further, I definitely think it makes sense to implement a generic cache API layer that can connect to different backends (file storage, Redis-like, SQLite, etc.) for generic key-value storage. This would let the feature work with whatever backend is available given the maintainer's server constraints (e.g. lots of file storage but low memory, or lots of memory to spare).

With this, would you mind if I tried to continue this idea further? I'd also like to hear an opinion from @return42 on this, as they have a much better understanding of the current state of SearXNG's internals than I do. I searched the SearXNG docs and the default settings.yml and didn't see anything for a feature like this.
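The generic cache API layer with TTL described above could look something like this. This is a hedged sketch under stated assumptions: the `CacheBackend` interface and `MemoryBackend` class are invented names for illustration, not existing SearXNG APIs.

```python
# Sketch of a pluggable key-value cache layer with per-entry TTL.
# Interface and class names are assumptions, not SearXNG internals.
import time
from abc import ABC, abstractmethod
from typing import Optional

class CacheBackend(ABC):
    """Minimal contract a backend (file, Redis-like, SQLite) could implement."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the cached value, or None if missing or expired."""

    @abstractmethod
    def set(self, key: str, value: bytes, ttl: int) -> None:
        """Store a value that expires after ttl seconds."""

class MemoryBackend(CacheBackend):
    """In-memory reference backend; stale entries are purged lazily on read."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: purge and report a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (time.monotonic() + ttl, value)
```

A file-storage or Redis-backed class would implement the same two methods, so the search path only ever talks to `CacheBackend` and the operator picks the backend in configuration.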
-
Update from the maintainer: please take note of #3386 (comment)
I added a cache; here is how I did it. First, create the cache shard directories:
/usr/local/searxng/searxng-src/searx/cache/0
...
/usr/local/searxng/searxng-src/searx/cache/9
/usr/local/searxng/searxng-src/searx/cache/a
...
/usr/local/searxng/searxng-src/searx/cache/z
Edit the file __init__.py in /usr/local/searxng/searxng-src/searx/search/__init__.py, in class Search:
```
class Search:
```
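The code block above appears truncated, so here is a hedged sketch of the approach it describes: cache files sharded into the per-character directories (cache/0 through cache/z) and a cache check in the search entry point. This is not the author's original patch; everything except the `Search` class name is an assumption for illustration.

```python
# Illustrative sketch only: shard cached results into cache/<char>/
# directories and short-circuit the search on a cache hit.
import hashlib
import json
from pathlib import Path

CACHE_ROOT = Path("/usr/local/searxng/searxng-src/searx/cache")  # assumed

def shard_path(query: str) -> Path:
    # Pick the shard directory from the query's first character
    # (matching the 0-9/a-z layout above), with "0" as a fallback.
    first = query[0].lower() if query and query[0].isalnum() else "0"
    safe_name = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return CACHE_ROOT / first / safe_name

class Search:
    def __init__(self, search_query):
        self.search_query = search_query

    def search(self):
        path = shard_path(self.search_query)
        if path.exists():
            # Cache hit: serve stored results without querying engines.
            return json.loads(path.read_text(encoding="utf-8"))
        results = self._real_search()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(results), encoding="utf-8")
        return results

    def _real_search(self):
        # Placeholder for SearXNG's actual engine dispatch.
        return {"query": self.search_query, "results": []}
```

As noted in the maintainer's closing comment, keying on the query string alone ignores language and engine settings, so treat this strictly as an experiment.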
Simple as that. If you have any questions, improvements, or suggestions, let me know how it goes.