这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@dalf
Copy link
Contributor

@dalf dalf commented Aug 26, 2023

What does this PR do?

See https://github.com/dalf/searxng/wiki/Branch-searx_network_rework to get an overview of the architecture change.

In few words:

  • remove async code (so we can drop uvloop)
  • there is better separation of concern
  • it keeps one HTTP client during the execution of one user query in one engine (this is optional). This might help in some scenarios ; if for example an engine returns a cookie linked to one IP address.
  • there are different retry mechanisms.

For each network, there is a new parameter retry_strategy which describes what to do when an outgoing HTTP request fail and retries > 1:

  • DIFFERENT_HTTP_CLIENT (default value): this is the current implementation in the master branch. If a request fails, then another (proxy, IP address) is tried. If an engine sends two HTTP requests, the first one goes through a proxy, the second request goes through another proxy (if two proxies or more are configured).
  • ENGINE: the (proxy, IP address) is kept during the process of one user query. If a HTTP request fails, the engine restart the process of the query with another (proxy, IP address).
  • SAME_HTTP_CLIENT: If a HTTP request fails, the request is sent again with the same (proxy, IP address).

All these strategies respect the user query timeout.

Why is this change important?

This is unfinished, but I'm not sure if this should be merged once done.

How to test this PR locally?

To be written, non-exhaustive list :

  • check outgoing connection with one proxy
  • check outgoing connection with multiple proxies
  • check outgoing connection with multiple IP addresses
  • check outgoing connection with retries > 1 and failing request, to check retries works as expected
  • check the image proxy does not have a memory leak

Author's checklist

About async code: it code was introduced when the target was to fully switch to async (async code in the engine, use quart, etc...). However, async won't fit the use case here: async is cooperative multitasking, and between the very small and fast function calls to deal with HTTP/2 and the very long pause to parse the engine result, I'm afraid it will provide slow down the response time (or it will break as soon we are not very careful).

The first commit brings NetworkContext but keep async code. The second commit removes async code.

The only point of async is to send multiple requests at the same time from one engine:


This PR brings 1401 new lines of code, but there is far more documentation and tests than before.

Related

@dalf dalf added the area: network searx.network label Aug 26, 2023
@dalf dalf force-pushed the searx_network_rework branch from 854b914 to 23a50a6 Compare August 26, 2023 17:25
@dalf dalf force-pushed the searx_network_rework branch from 23a50a6 to 82840bd Compare September 1, 2023 06:06
@return42
Copy link
Member

return42 commented Sep 1, 2023

I would suggest to implement a class for the networks management / what we have is a global dict:

NETWORKS: Dict[str, 'Network'] = {}

this dict is initialized with default when the module is imported:

NETWORKS[DEFAULT_NAME] = Network()

some parts of a "network management constructor" is done in the initialization ..

def initialize(settings_engines=None, settings_outgoing=None):
    ...
    NETWORKS.clear()
    NETWORKS[DEFAULT_NAME] = new_network({}, logger_name='default')
    NETWORKS['ipv4'] = new_network({'local_addresses': '0.0.0.0'}, logger_name='ipv4')
    NETWORKS['ipv6'] = new_network({'local_addresses': '::'}, logger_name='ipv6')

and for the management of the networks we have some lose functions like:

def get_network(name=None):
    return NETWORKS[name or DEFAULT_NAME]
...
def check_network_configuration():
    exception_count = 0
    for network in NETWORKS.values():
        ...

IMO we should have a class like NetworkManager from which a instance NETWORKS can be build:

  • the initialization of the manager is done in the class constructor
  • global functions like get_network(name=None) are implemented in methods like NetworkManager.get(name=None)

@dalf
Copy link
Contributor Author

dalf commented Sep 1, 2023

The line below makes code outside the webapp works out of the box : no need to do any kind of initialization, just use searx.network.get(...) same as requests.get(...) :

NETWORKS[DEFAULT_NAME] = Network()

For example, the traits are fetched outside the webapp:

resp = get('https://www.google.com/preferences')

The code can use httpx directly in some places, but not everywhere:

result = wikidata.send_wikidata_query(request.replace('%LANGUAGES_SPARQL%', LANGUAGES_SPARQL))

def send_wikidata_query(query, method='GET'):
if method == 'GET':
# query will be cached by wikidata
http_response = get(SPARQL_ENDPOINT_URL + '?' + urlencode({'query': query}), headers=get_headers())
else:
# query won't be cached by wikidata
http_response = post(SPARQL_ENDPOINT_URL, data={'query': query}, headers=get_headers())

Here get and post from searx.network need a Network (either the dedicate one in the webapp, either a default one in the standalone scripts).

Instead of using httpx directly, the standalone scripts can initialize searx.network with a dedicate initialization (a parameter in the constructor of NetworkManager, or whatever), so this line can be removed (if you see it as an issue):

NETWORKS[DEFAULT_NAME] = Network()

What do you think?


check_network_configuration(...) : yes the call should be done from initialize.

NetworkManager.get(name=None) : who creates and own the NetworkManager singleton?

@return42
Copy link
Member

return42 commented Sep 1, 2023

so this line can be removed (if you see it as an issue):

the line is not a big deal .. I just want suggest to have a manager for the task "networks management" in SXNG's network stack.

@dalf
Copy link
Contributor Author

dalf commented Sep 1, 2023

Why not a network management, but I can't answer who creates the NetworkManager, and then how to access the singleton.

@return42
Copy link
Member

return42 commented Sep 1, 2023

Why not a network management

was what I suggested / in code-speak ..

class NetworkManager:

  def __init__(self):
      self.networks = {}

  def new_network(self, name):
      ...
  def check_network_configuration(self):
      ...
  def get_network(self, name = None):
      ret_val = self.networks.get(name or DEFAULT_NAME)
      if ret_val:
          return ret_val
      self.networks[name] = self.new_network(name)
      return self.networks[name]

NETWORKS = NetworkManager()

@dalf
Copy link
Contributor Author

dalf commented Sep 1, 2023

I've added a NetworkManager as you described, a lot of thing can be improved, but that's a start.

@dalf dalf force-pushed the searx_network_rework branch 6 times, most recently from dd7140d to 55db35d Compare September 8, 2023 19:48
@dalf
Copy link
Contributor Author

dalf commented Sep 8, 2023

the last commit removes the async code.
Perhaps it's worth trying to enable HTTP/2 with the image proxy.

@dalf dalf force-pushed the searx_network_rework branch 3 times, most recently from 6addb2d to c6fdf1b Compare September 9, 2023 16:15
# pylint: disable=global-statement
"""Provide HTTP clients following the user configuration
"""
import threading
Copy link
Contributor Author

@dalf dalf Sep 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe the API: the implementation uses httpx, but the API is mostly compatible with requests.

mostly because, may be there are some missing parts.
TODO : check https://www.python-httpx.org/compatibility/

See the cookie section --> should we drop the requests API emulation and keep only httpx?

return42 added a commit to return42/searxng that referenced this pull request Nov 29, 2023
…ients)

In the bing engines, reuse of http client and SSL context leads to difficult to
assess side effects. [1][2]

- [1] searxng#2977
- [2] searxng#2685 (comment)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
@Austin-Olacsi
Copy link
Contributor

Maybe we could add something for:
#2772

To set the correct header? I'm just not sure what the best approach is.

@return42
Copy link
Member

Maybe we could add something for: #2772

@Hackurei I think setting HTTP headers for image load of special engines is not in the scope of the network stack / BTW it would not help when there is no image proxy and the image needs to be load by the web client itself ..

TL;DR: #2772 is very special to https://www.pixiv.net/ and its WEB-Clients/WEB-Side communication and cannot be solved sustainably at the level of the network stack of this PR.

return42 added a commit to return42/searxng that referenced this pull request Dec 2, 2023
…ients)

In the bing engines, reuse of http client and SSL context leads to difficult to
assess side effects. [1][2]

- [1] searxng#2977
- [2] searxng#2685 (comment)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
return42 added a commit to return42/searxng that referenced this pull request Dec 3, 2023
…ients)

In the bing engines, reuse of http client and SSL context leads to difficult to
assess side effects. [1][2]

- [1] searxng#2977
- [2] searxng#2685 (comment)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Comment on lines +135 to +162
@final
def _get_http_client(self) -> ABCHTTPClient:
"""Return the HTTP client to use for this context."""
if self._http_client is None:
raise ValueError("HTTP client has not been set")
return self._http_client

@final
def _set_http_client(self):
"""Ask the NetworkContext to use another HTTP client using the factory.
Use the method _get_new_client_from_factory() to call the factory,
so the NetworkContext implementations can wrap the ABCHTTPClient.
"""
self._http_client = self._get_new_client_from_factory()

@final
def _reset_http_client(self):
self._http_client = None

def _get_new_client_from_factory(self):
return self._http_client_factory()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit picking: do we really need such a lot of indirections .. where private methods get/set private members? By example to set a new http client in self._http_client

  • we call self._set_http_client() which does nothing more than
  • calling a method self._get_new_client_from_factory() which does nothing more than
  • calling self._http_client_factory()

BTW: the reader needs to know that a private method / a method with an underline prefix (_get_new_client_from_factory) is overwritten in the inheritance .. this needs really a lot of code reading to understand whats going on

I mean (haven't tested), this could be simplified to a property (which can be overwritten in the inheritance:

class NetworkContext:

    _http_client = None

    @property
    def http_client(self):
        if self._http_client is None:
            self._http_client = self._http_client_factory()
        return self._http_client

    @final
    def reset_http_client(self):
        self._http_client = None

Copy link
Contributor Author

@dalf dalf Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need such a lot of indirections .. where private methods get/set private members?

I don't remember exactly the last audio call, may be I said it was private methods ?
I should have said protected methods rather than private methods.

the reader needs to know that a private method / a method with an underline prefix

it seems to be a common practice:

  • a single _ as prefix for protected methods
  • a double __ as prefix for private methods

See:


About indirection: I can't say I'm happy with them, however _get_new_client_from_factory is override by the different implementations:

def _get_new_client_from_factory(self):
return _RetryFunctionHTTPClient(super()._get_new_client_from_factory(), self)

def _get_new_client_from_factory(self):
return _RetrySameHTTPClient(super()._get_new_client_from_factory(), self)

def _get_new_client_from_factory(self):
return _RetryDifferentHTTPClient(self)

Reading the code now, I think:

  • the different implementations of NetworkContext can set the self._http_client directly in _set_http_client (so this method is no more final).
  • _get_new_client_from_factory can be removed.

For example:

class NetworkContextRetryFunction(NetworkContext):
    def call(self, func: Callable[P, R], *args: P.args, **kwargs: P.kwargs) -> R:
        ...

    def _set_http_clientself):
        self._http_client = _RetryFunctionHTTPClient(super()._get_new_client_from_factory(), self)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly the last audio call, may be I said it was private methods ?
I should have said protected methods rather than private methods.

No you didn't said its private, its just me who mixes both terms .. in python the concept of __private is just crap and should not been used (as David Goodger recommends / _Foo__private). As a replacement, it has commonly been agreed that it is better to use _protected.

.. what I wanted to point out is: I think its not good to mark names as _protected and then overwriting these names in inheritance .. we can do it, but personally I think its not a good practice.

the different implementations of NetworkContext can set the self._http_client directly in ..

Furthermore, I think that .getter() and .setter() can often be realized by a @property (as shown above) / which is late-evaluated as a positive side effect (no need to explicitly call a .setter()).

But you should leave everything as it is for now, I will make suggestions in my branch, which we can look at after I have finished my review

return42 added a commit to return42/searxng that referenced this pull request Dec 3, 2023
…ients)

In the bing engines, reuse of http client and SSL context leads to difficult to
assess side effects. [1][2]

- [1] searxng#2977
- [2] searxng#2685 (comment)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
return42 added a commit to return42/searxng that referenced this pull request Dec 3, 2023
…ients)

In the bing engines, reuse of http client and SSL context leads to difficult to
assess side effects. [1][2]

- [1] searxng#2977
- [2] searxng#2685 (comment)

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
@dalf
Copy link
Contributor Author

dalf commented Dec 4, 2023

I will review your additional commits most probably on Friday ( 5871c78 and d82d06e ).

@return42
Copy link
Member

return42 commented Dec 4, 2023

I will review your additional commits most probably on Friday ( 5871c78 and d82d06e ).

No, please not! .. its just half of the code .. and may I will drop it later .. give me some more time to finish my review ..

@dalf dalf force-pushed the searx_network_rework branch from ffcff58 to 4e6f81b Compare January 5, 2024 20:51
@dalf
Copy link
Contributor Author

dalf commented Jan 5, 2024

The tests are flaky, this is related to #2988 not to this PR

The socket leak might be fixed (at least I've tried).

@dalf dalf force-pushed the searx_network_rework branch from 436ff7f to b5e9de8 Compare January 5, 2024 21:13
@dalf dalf force-pushed the searx_network_rework branch from b5e9de8 to 96fb44c Compare January 27, 2024 10:55
mrpaulblack added a commit to paulgoio/searxng that referenced this pull request Jan 29, 2024
* deploying searxng/searxng#2685 for integration testing
=> this is temporary and will be reverted to searxng/searxng:main, when the upstream PR is merged
* second try of testing branch, after memory leak in first integration testing
@dalf dalf mentioned this pull request Mar 3, 2024
@return42
Copy link
Member

return42 commented Mar 9, 2024

@dalf @mrpaulblack sorry, I lost the overview, what the state of this PR, the network stack?

@dalf
Copy link
Contributor Author

dalf commented Mar 9, 2024

The broken part is the streaming for the image proxy: the sockets are not always closed.
Everything else works.
I have an idea how to fix that, just need motivation to implement it...

@dalf dalf force-pushed the searx_network_rework branch from 96fb44c to 45c217f Compare March 16, 2024 17:00
@czaky
Copy link

czaky commented Jun 7, 2024

Sorry, for intruding on this almost 1y discussion.

Context: I have recently experimented with SearXNG through unreliable proxy networks (particularly TOR)
(reference setup: https://github.com/czaky/turbotor). This resulted in PR: #3491 .

How could we support following features (using a non-asyncio approach)?

  • Parallel fan out of requests through multiple proxies (at least in initial phase)?
  • Keep alive / connectivity- / latency-determination for existing HTTP connections (in the background)?

Parallel requests through multiple proxies.

As implemented in #3491, doing the "retries" in parallel (instead of in series) makes this approach usable over networks with increased latency (tail >10s) and high rate of client rejection. This was rather straightforward to implement using asyncio and it minimized issues related to using SearXNG over TOR.

(Motivation: theoretically and practically, waiting for multiple engines increases the probability of hitting the tail latency exponentially. So if the common recommendation for TOR is a timeout of 30s, it is, unfortunately, how long SearXNG takes to answer queries.)

Keep alive / connectivity- / latency-determination and background connection pools.

In case of TOR each proxy changes the circuits every 10 minutes. Thus, the set of "good" proxies changes rather fast. A background task could proactively determine a smaller subset of proxies that provide the optimal connectivity for each engine.
This would reduce client rejection rates by the target-point and reduce the overall noise.

(Note, I am more interested in making SearXNG usable over TOR then advocating for any approach.)

@dalf
Copy link
Contributor Author

dalf commented Jul 10, 2024

Regardless of the decision about async vs. sync, this PR is ready with one exception: some sockets are leaking, likely due to response streaming.

Response streaming is required for only one feature: the image proxy.

Therefore, an alternative approach is to:

  • Remove response streaming from the code
  • Provide a much simpler alternative

But first, we should verify whether the socket leak is indeed caused by response streaming.

@dalf dalf mentioned this pull request Apr 26, 2025
9 tasks
@dalf dalf closed this Apr 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: network searx.network preview-environment-commented Label for tracking if preview-environment comment was sent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants