refactor `searx.network` #2685

dalf · 2023-08-26T17:23:28Z

What does this PR do?

See https://github.com/dalf/searxng/wiki/Branch-searx_network_rework to get an overview of the architecture change.

In few words:

remove async code (so we can drop uvloop)
there is better separation of concern
it keeps one HTTP client during the execution of one user query in one engine (this is optional). This might help in some scenarios ; if for example an engine returns a cookie linked to one IP address.
there are different retry mechanisms.

For each network, there is a new parameter retry_strategy which describes what to do when an outgoing HTTP request fail and retries > 1:

DIFFERENT_HTTP_CLIENT (default value): this is the current implementation in the master branch. If a request fails, then another (proxy, IP address) is tried. If an engine sends two HTTP requests, the first one goes through a proxy, the second request goes through another proxy (if two proxies or more are configured).
ENGINE: the (proxy, IP address) is kept during the process of one user query. If a HTTP request fails, the engine restart the process of the query with another (proxy, IP address).
SAME_HTTP_CLIENT: If a HTTP request fails, the request is sent again with the same (proxy, IP address).

All these strategies respect the user query timeout.

Why is this change important?

This is unfinished, but I'm not sure if this should be merged once done.

How to test this PR locally?

To be written, non-exhaustive list :

check outgoing connection with one proxy
check outgoing connection with multiple proxies
check outgoing connection with multiple IP addresses
check outgoing connection with retries > 1 and failing request, to check retries works as expected
check the image proxy does not have a memory leak

Author's checklist

About async code: it code was introduced when the target was to fully switch to async (async code in the engine, use quart, etc...). However, async won't fit the use case here: async is cooperative multitasking, and between the very small and fast function calls to deal with HTTP/2 and the very long pause to parse the engine result, I'm afraid it will provide slow down the response time (or it will break as soon we are not very careful).

The first commit brings NetworkContext but keep async code. The second commit removes async code.

The only point of async is to send multiple requests at the same time from one engine:

the bing engine was using it to resolve the redirection URL in the results. [mod] bing: resolve redirect without additional requests #2686 fixes that.
the baidu engine: it is not possible to make it work after a few request.

This PR brings 1401 new lines of code, but there is far more documentation and tests than before.

def initialize(settings_engines=None, settings_outgoing=None):
    ...
    NETWORKS.clear()
    NETWORKS[DEFAULT_NAME] = new_network({}, logger_name='default')
    NETWORKS['ipv4'] = new_network({'local_addresses': '0.0.0.0'}, logger_name='ipv4')
    NETWORKS['ipv6'] = new_network({'local_addresses': '::'}, logger_name='ipv6')

and for the management of the networks we have some lose functions like:

def get_network(name=None):
    return NETWORKS[name or DEFAULT_NAME]
...
def check_network_configuration():
    exception_count = 0
    for network in NETWORKS.values():
        ...

IMO we should have a class like NetworkManager from which a instance NETWORKS can be build:

the initialization of the manager is done in the class constructor
global functions like get_network(name=None) are implemented in methods like NetworkManager.get(name=None)

dalf · 2023-09-01T07:46:11Z

The line below makes code outside the webapp works out of the box : no need to do any kind of initialization, just use searx.network.get(...) same as requests.get(...) :

NETWORKS[DEFAULT_NAME] = Network()

For example, the traits are fetched outside the webapp:

searxng/searx/engines/google.py

Line 428 in 9fce459

resp = get('https://www.google.com/preferences')

The code can use httpx directly in some places, but not everywhere:

searxng/searxng_extra/update/update_currencies.py

Line 98 in 9fce459

    
           result = wikidata.send_wikidata_query(request.replace('%LANGUAGES_SPARQL%', LANGUAGES_SPARQL))

searxng/searx/engines/wikidata.py

Lines 153 to 159 in 9fce459

    
           def send_wikidata_query(query, method='GET'): 
        
               if method == 'GET': 
        
                   # query will be cached by wikidata 
        
                   http_response = get(SPARQL_ENDPOINT_URL + '?' + urlencode({'query': query}), headers=get_headers()) 
        
               else: 
        
                   # query won't be cached by wikidata 
        
                   http_response = post(SPARQL_ENDPOINT_URL, data={'query': query}, headers=get_headers())

Here get and post from searx.network need a Network (either the dedicate one in the webapp, either a default one in the standalone scripts).

Instead of using httpx directly, the standalone scripts can initialize searx.network with a dedicate initialization (a parameter in the constructor of NetworkManager, or whatever), so this line can be removed (if you see it as an issue):

searxng/searx/network/network.py

Line 427 in 9fce459

NETWORKS[DEFAULT_NAME] = Network()

What do you think?

check_network_configuration(...) : yes the call should be done from initialize.

NetworkManager.get(name=None) : who creates and own the NetworkManager singleton?

return42 · 2023-09-01T07:52:54Z

so this line can be removed (if you see it as an issue):

the line is not a big deal .. I just want suggest to have a manager for the task "networks management" in SXNG's network stack.

dalf · 2023-09-01T07:58:02Z

Why not a network management, but I can't answer who creates the NetworkManager, and then how to access the singleton.

return42 · 2023-09-01T08:22:04Z

Why not a network management

was what I suggested / in code-speak ..

class NetworkManager:

  def __init__(self):
      self.networks = {}

  def new_network(self, name):
      ...
  def check_network_configuration(self):
      ...
  def get_network(self, name = None):
      ret_val = self.networks.get(name or DEFAULT_NAME)
      if ret_val:
          return ret_val
      self.networks[name] = self.new_network(name)
      return self.networks[name]

NETWORKS = NetworkManager()

dalf · 2023-09-01T09:27:53Z

I've added a NetworkManager as you described, a lot of thing can be improved, but that's a start.

dalf · 2023-09-08T19:54:42Z

the last commit removes the async code.
Perhaps it's worth trying to enable HTTP/2 with the image proxy.

searx/network/context.py

searx/network/__init__.py

searx/network/network.py

searx/webapp.py

searx/network/context.py

searxng_extra/update/update_engine_descriptions.py

searx/network/__init__.py

searx/network/client.py

dalf · 2023-09-10T08:51:41Z

searx/network/__init__.py

+# pylint: disable=global-statement
+"""Provide HTTP clients following the user configuration
+"""
 import threading


~~Describe the API: the implementation uses httpx, but the API is mostly compatible with requests.~~

mostly because, may be there are some missing parts.
~~TODO : check https://www.python-httpx.org/compatibility/~~

See the cookie section --> should we drop the requests API emulation and keep only httpx?

…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>

Austin-Olacsi · 2023-11-29T19:50:31Z

Maybe we could add something for:
#2772

To set the correct header? I'm just not sure what the best approach is.

return42 · 2023-11-30T06:00:52Z

Maybe we could add something for: #2772

@Hackurei I think setting HTTP headers for image load of special engines is not in the scope of the network stack / BTW it would not help when there is no image proxy and the image needs to be load by the web client itself ..

TL;DR: #2772 is very special to https://www.pixiv.net/ and its WEB-Clients/WEB-Side communication and cannot be solved sustainably at the level of the network stack of this PR.

…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>

searx/network/context.py

return42 · 2023-12-03T15:18:22Z

searx/network/context.py

+    @final
+    def _get_http_client(self) -> ABCHTTPClient:
+        """Return the HTTP client to use for this context."""
+        if self._http_client is None:
+            raise ValueError("HTTP client has not been set")
+        return self._http_client
+
+    @final
+    def _set_http_client(self):
+        """Ask the NetworkContext to use another HTTP client using the factory.
+
+        Use the method _get_new_client_from_factory() to call the factory,
+        so the NetworkContext implementations can wrap the ABCHTTPClient.
+        """
+        self._http_client = self._get_new_client_from_factory()
+
+    @final
+    def _reset_http_client(self):
+        self._http_client = None
+
+    def _get_new_client_from_factory(self):
+        return self._http_client_factory()


Nit picking: do we really need such a lot of indirections .. where private methods get/set private members? By example to set a new http client in self._http_client

we call self._set_http_client() which does nothing more than

calling a method self._get_new_client_from_factory() which does nothing more than

calling self._http_client_factory()

BTW: the reader needs to know that a private method / a method with an underline prefix (_get_new_client_from_factory) is overwritten in the inheritance .. this needs really a lot of code reading to understand whats going on

I mean (haven't tested), this could be simplified to a property (which can be overwritten in the inheritance:

class NetworkContext: _http_client = None @property def http_client(self): if self._http_client is None: self._http_client = self._http_client_factory() return self._http_client @final def reset_http_client(self): self._http_client = None

do we really need such a lot of indirections .. where private methods get/set private members?

I don't remember exactly the last audio call, may be I said it was private methods ?
I should have said protected methods rather than private methods.

the reader needs to know that a private method / a method with an underline prefix

it seems to be a common practice:

a single _ as prefix for protected methods

a double __ as prefix for private methods

See:

https://softwareengineering.stackexchange.com/questions/228216/what-is-the-historical-reason-why-python-uses-the-double-underscore-for-class-pr

https://discuss.python.org/t/private-protected-modifier-and-notation/22356

https://stackoverflow.com/questions/6930144/underscore-vs-double-underscore-with-variables-and-methods

httpx uses that convention

pylint message : http://pylint-messages.wikidot.com/messages:w0212

also: Allow accessing protected and private attributes in __new__ pylint-dev/pylint#1802 (comment)

About indirection: I can't say I'm happy with them, however _get_new_client_from_factory is override by the different implementations:

searxng/searx/network/context.py

Lines 267 to 268 in ffcff58

def _get_new_client_from_factory(self):

return _RetryFunctionHTTPClient(super()._get_new_client_from_factory(), self)

searxng/searx/network/context.py

Lines 315 to 316 in ffcff58

def _get_new_client_from_factory(self):

return _RetrySameHTTPClient(super()._get_new_client_from_factory(), self)

searxng/searx/network/context.py

Lines 369 to 370 in ffcff58

def _get_new_client_from_factory(self):

return _RetryDifferentHTTPClient(self)

Reading the code now, I think:

the different implementations of NetworkContext can set the self._http_client directly in _set_http_client (so this method is no more final).

_get_new_client_from_factory can be removed.

For example:

class NetworkContextRetryFunction(NetworkContext): def call(self, func: Callable[P, R], *args: P.args, **kwargs: P.kwargs) -> R: ... def _set_http_clientself): self._http_client = _RetryFunctionHTTPClient(super()._get_new_client_from_factory(), self)

I don't remember exactly the last audio call, may be I said it was private methods ?
I should have said protected methods rather than private methods.

No you didn't said its private, its just me who mixes both terms .. in python the concept of __private is just crap and should not been used (as David Goodger recommends / _Foo__private). As a replacement, it has commonly been agreed that it is better to use _protected.

.. what I wanted to point out is: I think its not good to mark names as _protected and then overwriting these names in inheritance .. we can do it, but personally I think its not a good practice.

the different implementations of NetworkContext can set the self._http_client directly in ..

Furthermore, I think that .getter() and .setter() can often be realized by a @property (as shown above) / which is late-evaluated as a positive side effect (no need to explicitly call a .setter()).

But you should leave everything as it is for now, I will make suggestions in my branch, which we can look at after I have finished my review

…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>

dalf · 2023-12-04T07:29:35Z

I will review your additional commits most probably on Friday ( 5871c78 and d82d06e ).

return42 · 2023-12-04T08:03:42Z

I will review your additional commits most probably on Friday ( 5871c78 and d82d06e ).

~~No, please not! .. its just half of the code .. and may I will drop it later .. give me some more time to finish my review ..~~

dalf · 2024-01-05T21:06:49Z

The tests are flaky, this is related to #2988 not to this PR

The socket leak might be fixed (at least I've tried).

* deploying searxng/searxng#2685 for integration testing => this is temporary and will be reverted to searxng/searxng:main, when the upstream PR is merged * second try of testing branch, after memory leak in first integration testing

return42 · 2024-03-09T07:20:47Z

@dalf @mrpaulblack sorry, I lost the overview, what the state of this PR, the network stack?

dalf · 2024-03-09T08:50:29Z

The broken part is the streaming for the image proxy: the sockets are not always closed.
Everything else works.
I have an idea how to fix that, just need motivation to implement it...

czaky · 2024-06-07T13:25:34Z

Sorry, for intruding on this almost 1y discussion.

Context: I have recently experimented with SearXNG through unreliable proxy networks (particularly TOR)
(reference setup: https://github.com/czaky/turbotor). This resulted in PR: #3491 .

How could we support following features (using a non-asyncio approach)?

Parallel fan out of requests through multiple proxies (at least in initial phase)?
Keep alive / connectivity- / latency-determination for existing HTTP connections (in the background)?

Parallel requests through multiple proxies.

As implemented in #3491, doing the "retries" in parallel (instead of in series) makes this approach usable over networks with increased latency (tail >10s) and high rate of client rejection. This was rather straightforward to implement using asyncio and it minimized issues related to using SearXNG over TOR.

(Motivation: theoretically and practically, waiting for multiple engines increases the probability of hitting the tail latency exponentially. So if the common recommendation for TOR is a timeout of 30s, it is, unfortunately, how long SearXNG takes to answer queries.)

Keep alive / connectivity- / latency-determination and background connection pools.

In case of TOR each proxy changes the circuits every 10 minutes. Thus, the set of "good" proxies changes rather fast. A background task could proactively determine a smaller subset of proxies that provide the optimal connectivity for each engine.
This would reduce client rejection rates by the target-point and reduce the overall noise.

(Note, I am more interested in making SearXNG usable over TOR then advocating for any approach.)

dalf · 2024-07-10T14:19:00Z

Regardless of the decision about async vs. sync, this PR is ready with one exception: some sockets are leaking, likely due to response streaming.

Response streaming is required for only one feature: the image proxy.

Therefore, an alternative approach is to:

Remove response streaming from the code
Provide a much simpler alternative

But first, we should verify whether the socket leak is indeed caused by response streaming.

dalf added the area: network searx.network label Aug 26, 2023

dalf force-pushed the searx_network_rework branch from 854b914 to 23a50a6 Compare August 26, 2023 17:25

dalf commented Aug 26, 2023

View reviewed changes

searx/network/network.py Outdated Show resolved Hide resolved

dalf force-pushed the searx_network_rework branch from 23a50a6 to 82840bd Compare September 1, 2023 06:06

dalf force-pushed the searx_network_rework branch 6 times, most recently from dd7140d to 55db35d Compare September 8, 2023 19:48

dalf commented Sep 8, 2023

View reviewed changes

searx/network/context.py Outdated Show resolved Hide resolved

dalf force-pushed the searx_network_rework branch 3 times, most recently from 6addb2d to c6fdf1b Compare September 9, 2023 16:15