-
Notifications
You must be signed in to change notification settings - Fork 2.3k
refactor searx.network
#2685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor searx.network
#2685
Conversation
854b914 to
23a50a6
Compare
23a50a6 to
82840bd
Compare
|
I would suggest to implement a class for the networks management / what we have is a global dict: NETWORKS: Dict[str, 'Network'] = {}this dict is initialized with default when the module is imported: NETWORKS[DEFAULT_NAME] = Network()some parts of a "network management constructor" is done in the initialization .. def initialize(settings_engines=None, settings_outgoing=None):
...
NETWORKS.clear()
NETWORKS[DEFAULT_NAME] = new_network({}, logger_name='default')
NETWORKS['ipv4'] = new_network({'local_addresses': '0.0.0.0'}, logger_name='ipv4')
NETWORKS['ipv6'] = new_network({'local_addresses': '::'}, logger_name='ipv6')and for the management of the networks we have some lose functions like: def get_network(name=None):
return NETWORKS[name or DEFAULT_NAME]
...
def check_network_configuration():
exception_count = 0
for network in NETWORKS.values():
...IMO we should have a class like NetworkManager from which a instance
|
|
The line below makes code outside the webapp works out of the box : no need to do any kind of initialization, just use NETWORKS[DEFAULT_NAME] = Network()For example, the traits are fetched outside the webapp: searxng/searx/engines/google.py Line 428 in 9fce459
The code can use httpx directly in some places, but not everywhere:
searxng/searx/engines/wikidata.py Lines 153 to 159 in 9fce459
Here Instead of using httpx directly, the standalone scripts can initialize searxng/searx/network/network.py Line 427 in 9fce459
What do you think?
|
the line is not a big deal .. I just want suggest to have a manager for the task "networks management" in SXNG's network stack. |
|
Why not a network management, but I can't answer who creates the NetworkManager, and then how to access the singleton. |
was what I suggested / in code-speak .. class NetworkManager:
def __init__(self):
self.networks = {}
def new_network(self, name):
...
def check_network_configuration(self):
...
def get_network(self, name = None):
ret_val = self.networks.get(name or DEFAULT_NAME)
if ret_val:
return ret_val
self.networks[name] = self.new_network(name)
return self.networks[name]
NETWORKS = NetworkManager() |
|
I've added a NetworkManager as you described, a lot of thing can be improved, but that's a start. |
dd7140d to
55db35d
Compare
|
the last commit removes the async code. |
6addb2d to
c6fdf1b
Compare
| # pylint: disable=global-statement | ||
| """Provide HTTP clients following the user configuration | ||
| """ | ||
| import threading |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe the API: the implementation uses httpx, but the API is mostly compatible with requests.
mostly because, may be there are some missing parts.
TODO : check https://www.python-httpx.org/compatibility/
See the cookie section --> should we drop the requests API emulation and keep only httpx?
…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
|
Maybe we could add something for: To set the correct header? I'm just not sure what the best approach is. |
@Hackurei I think setting HTTP headers for image load of special engines is not in the scope of the network stack / BTW it would not help when there is no image proxy and the image needs to be load by the web client itself .. TL;DR: #2772 is very special to https://www.pixiv.net/ and its WEB-Clients/WEB-Side communication and cannot be solved sustainably at the level of the network stack of this PR. |
…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
| @final | ||
| def _get_http_client(self) -> ABCHTTPClient: | ||
| """Return the HTTP client to use for this context.""" | ||
| if self._http_client is None: | ||
| raise ValueError("HTTP client has not been set") | ||
| return self._http_client | ||
|
|
||
| @final | ||
| def _set_http_client(self): | ||
| """Ask the NetworkContext to use another HTTP client using the factory. | ||
| Use the method _get_new_client_from_factory() to call the factory, | ||
| so the NetworkContext implementations can wrap the ABCHTTPClient. | ||
| """ | ||
| self._http_client = self._get_new_client_from_factory() | ||
|
|
||
| @final | ||
| def _reset_http_client(self): | ||
| self._http_client = None | ||
|
|
||
| def _get_new_client_from_factory(self): | ||
| return self._http_client_factory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit picking: do we really need such a lot of indirections .. where private methods get/set private members? By example to set a new http client in self._http_client
- we call
self._set_http_client()which does nothing more than - calling a method
self._get_new_client_from_factory()which does nothing more than - calling
self._http_client_factory()
BTW: the reader needs to know that a private method / a method with an underline prefix (_get_new_client_from_factory) is overwritten in the inheritance .. this needs really a lot of code reading to understand whats going on
I mean (haven't tested), this could be simplified to a property (which can be overwritten in the inheritance:
class NetworkContext:
_http_client = None
@property
def http_client(self):
if self._http_client is None:
self._http_client = self._http_client_factory()
return self._http_client
@final
def reset_http_client(self):
self._http_client = NoneThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need such a lot of indirections .. where private methods get/set private members?
I don't remember exactly the last audio call, may be I said it was private methods ?
I should have said protected methods rather than private methods.
the reader needs to know that a private method / a method with an underline prefix
it seems to be a common practice:
- a single
_as prefix for protected methods - a double
__as prefix for private methods
See:
- https://softwareengineering.stackexchange.com/questions/228216/what-is-the-historical-reason-why-python-uses-the-double-underscore-for-class-pr
- https://discuss.python.org/t/private-protected-modifier-and-notation/22356
- https://stackoverflow.com/questions/6930144/underscore-vs-double-underscore-with-variables-and-methods
- httpx uses that convention
- pylint message : http://pylint-messages.wikidot.com/messages:w0212
- also: Allow accessing protected and private attributes in __new__ pylint-dev/pylint#1802 (comment)
About indirection: I can't say I'm happy with them, however _get_new_client_from_factory is override by the different implementations:
searxng/searx/network/context.py
Lines 267 to 268 in ffcff58
| def _get_new_client_from_factory(self): | |
| return _RetryFunctionHTTPClient(super()._get_new_client_from_factory(), self) |
searxng/searx/network/context.py
Lines 315 to 316 in ffcff58
| def _get_new_client_from_factory(self): | |
| return _RetrySameHTTPClient(super()._get_new_client_from_factory(), self) |
searxng/searx/network/context.py
Lines 369 to 370 in ffcff58
| def _get_new_client_from_factory(self): | |
| return _RetryDifferentHTTPClient(self) |
Reading the code now, I think:
- the different implementations of
NetworkContextcan set theself._http_clientdirectly in_set_http_client(so this method is no more final). _get_new_client_from_factorycan be removed.
For example:
class NetworkContextRetryFunction(NetworkContext):
def call(self, func: Callable[P, R], *args: P.args, **kwargs: P.kwargs) -> R:
...
def _set_http_clientself):
self._http_client = _RetryFunctionHTTPClient(super()._get_new_client_from_factory(), self)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember exactly the last audio call, may be I said it was private methods ?
I should have said protected methods rather than private methods.
No you didn't said its private, its just me who mixes both terms .. in python the concept of __private is just crap and should not been used (as David Goodger recommends / _Foo__private). As a replacement, it has commonly been agreed that it is better to use _protected.
.. what I wanted to point out is: I think its not good to mark names as _protected and then overwriting these names in inheritance .. we can do it, but personally I think its not a good practice.
the different implementations of NetworkContext can set the
self._http_client directlyin ..
Furthermore, I think that .getter() and .setter() can often be realized by a @property (as shown above) / which is late-evaluated as a positive side effect (no need to explicitly call a .setter()).
But you should leave everything as it is for now, I will make suggestions in my branch, which we can look at after I have finished my review
…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
…ients) In the bing engines, reuse of http client and SSL context leads to difficult to assess side effects. [1][2] - [1] searxng#2977 - [2] searxng#2685 (comment) Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
ffcff58 to
4e6f81b
Compare
|
The tests are flaky, this is related to #2988 not to this PR The socket leak might be fixed (at least I've tried). |
436ff7f to
b5e9de8
Compare
b5e9de8 to
96fb44c
Compare
* deploying searxng/searxng#2685 for integration testing => this is temporary and will be reverted to searxng/searxng:main, when the upstream PR is merged * second try of testing branch, after memory leak in first integration testing
|
@dalf @mrpaulblack sorry, I lost the overview, what the state of this PR, the network stack? |
|
The broken part is the streaming for the image proxy: the sockets are not always closed. |
96fb44c to
45c217f
Compare
|
Sorry, for intruding on this almost 1y discussion. Context: I have recently experimented with SearXNG through unreliable proxy networks (particularly TOR) How could we support following features (using a non-asyncio approach)?
Parallel requests through multiple proxies.As implemented in #3491, doing the "retries" in parallel (instead of in series) makes this approach usable over networks with increased latency (tail >10s) and high rate of client rejection. This was rather straightforward to implement using asyncio and it minimized issues related to using SearXNG over TOR. (Motivation: theoretically and practically, waiting for multiple engines increases the probability of hitting the tail latency exponentially. So if the common recommendation for TOR is a timeout of 30s, it is, unfortunately, how long SearXNG takes to answer queries.) Keep alive / connectivity- / latency-determination and background connection pools.In case of TOR each proxy changes the circuits every 10 minutes. Thus, the set of "good" proxies changes rather fast. A background task could proactively determine a smaller subset of proxies that provide the optimal connectivity for each engine. (Note, I am more interested in making SearXNG usable over TOR then advocating for any approach.) |
|
Regardless of the decision about async vs. sync, this PR is ready with one exception: some sockets are leaking, likely due to response streaming. Response streaming is required for only one feature: the image proxy. Therefore, an alternative approach is to:
But first, we should verify whether the socket leak is indeed caused by response streaming. |
What does this PR do?
See https://github.com/dalf/searxng/wiki/Branch-searx_network_rework to get an overview of the architecture change.
In few words:
For each network, there is a new parameter
retry_strategywhich describes what to do when an outgoing HTTP request fail andretries > 1:DIFFERENT_HTTP_CLIENT(default value): this is the current implementation in the master branch. If a request fails, then another (proxy, IP address) is tried. If an engine sends two HTTP requests, the first one goes through a proxy, the second request goes through another proxy (if two proxies or more are configured).ENGINE: the (proxy, IP address) is kept during the process of one user query. If a HTTP request fails, the engine restart the process of the query with another (proxy, IP address).SAME_HTTP_CLIENT: If a HTTP request fails, the request is sent again with the same (proxy, IP address).All these strategies respect the user query timeout.
Why is this change important?
This is unfinished, but I'm not sure if this should be merged once done.
How to test this PR locally?
To be written, non-exhaustive list :
Author's checklist
About async code: it code was introduced when the target was to fully switch to async (async code in the engine, use quart, etc...). However, async won't fit the use case here: async is cooperative multitasking, and between the very small and fast function calls to deal with HTTP/2 and the very long pause to parse the engine result, I'm afraid it will provide slow down the response time (or it will break as soon we are not very careful).
The first commit brings NetworkContext but keep async code. The second commit removes async code.
The only point of async is to send multiple requests at the same time from one engine:
This PR brings 1401 new lines of code, but there is far more documentation and tests than before.
Related