[mod] botdetection: HTTP Fetch Metadata Request Headers #3965
Conversation
That's interesting. Are there really a lot of bots that send these incorrect headers?
I detected one bot that passes all tests .. all others are blocked, mainly by this PR. BTW, Sec-Fetch-Mode is one of the methods DDG uses to block bots .. it was the first place I had seen Sec-Fetch-Mode, and I thought we could add this method to our botdetection :-)
In the current state of this PR, any browser that has not implemented Sec-Fetch-* will get blocked, and so will users of a browser that has implemented it but who are running an older version: all of these visitors will get blocked. You can see a compatibility matrix for these headers here: https://caniuse.com/mdn-http_headers_sec-fetch-dest

In its current state this PR would block about 10% of all legitimate users on the internet. A big example is Safari, which only implemented these headers last March (2023); that's quite recent. I'm not fond of this change. Maybe you can implement it so that the headers are only validated when they are sent: if they are present, their values have to be correct for the traffic to be allowed. But outright blocking any browser that does not send these headers is a bad idea. Or you can make this configurable, but having this behavior by default is not a good idea.
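For illustration, a minimal sketch of that "only validate what is actually sent" idea, assuming a werkzeug/Flask-style `request.headers` mapping; the function name and the exact set of accepted values are assumptions for the example, not the actual SearXNG code:

```python
# Sketch: reject a request only when Fetch Metadata headers are present
# *and* carry values that do not fit a normal top-level page navigation.
def sec_fetch_ok(headers) -> bool:
    mode = headers.get("Sec-Fetch-Mode")
    site = headers.get("Sec-Fetch-Site")
    dest = headers.get("Sec-Fetch-Dest")

    # Browser (or old browser version) does not implement Fetch Metadata:
    # stay permissive instead of blocking it outright.
    if mode is None and site is None and dest is None:
        return True

    if mode is not None and mode != "navigate":
        return False  # the search page should be opened via a navigation
    if site is not None and site not in ("none", "same-origin", "same-site"):
        return False  # cross-site requests to the search form are suspicious
    if dest is not None and dest != "document":
        return False  # a navigation should target a document
    return True
```

With a permissive fallback like this, browsers that predate Fetch Metadata are not locked out, while bots that send inconsistent values are still caught.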
Bnyro left a comment:
The limiter is also in use if people use different formats (e.g. json) for the results, right?
Nobody using the search API via the json format would consider sending the Sec-* headers, so I agree that we should only validate them if they were sent.
I hope not: the limiter checks for some CSS resources, something that is not requested when using the JSON API.
Thanks for the feedback ..

this is what DDG does ... but we can make the filter methods optional --> I will modify this PR to add an option to deactivate filter methods.

JSON and similar API requests are per se bot requests ;-) .. Just some thoughts of mine: We have to filter out bot requests because otherwise our engines degrade ... that is, by the way, the only reason we filter. For the WEB applications we activate the limiter, disable formats such as JSON and thus prefer real WEB browsers. The link token method loads CSS, for example, and thus already excludes exotic WEB browsers such as lynx.

If Sec-Fetch-* is supported by the major WEB browsers, it makes sense to use it by default for bot defense (as other search engines do). Optionally, instance operators can deactivate the filter method to also support browsers without Sec-Fetch-* support ... however, the operator must then be aware that their engines will degrade more quickly ... just as you could offer JSON on the WEB, but would then have to assume that all bots will rush to that instance ... and I can only hope that there won't be too many bug reports for the engines from such bot-friendly setups ;-).
I agree, it sounds reasonable to enable this filter rule by default and allow instance admins to disable it if there's a large demand from their users. People running such old browsers (older than a year) are a security risk anyway and should update their browsers, so I don't see a requirement to support these old browser versions by default.
I'm only saying that because I saw an Android app (I don't remember the name) that used the JSON API of SearXNG to search the web and let users quickly navigate to the results; that app, for example, would be affected by this.
Interesting .. 🤔 .. I would like to know which instance such an app connects to .. all instances on searx.space have bot protection enabled ...
Found it: It's called "Gugal" and one must set the instance url manually, so it's probably only used with self-hosted instances. |
You don't know the inner workings of DDG; you don't know how their anti-bot measures fully work. They may have exemptions for specific web browsers. I don't think we all agreed to target the same anti-bot level as DDG; if that were the case, we would already have JavaScript bot detection and so on in the core. I already explained that it's a lost cause to combat all the nefarious bots. You can't combat something when the secret sauce to beat it is literally found in public source code, and filtering based on these headers is no exception. I have already said what the only solutions to reduce the issues caused by bots are.
Making it easier to deploy SearXNG would also help users get the best experience by running a local instance. Right now that is quite complicated for a beginner in IT, because the documentation is not very user-friendly (it requires some existing knowledge about hosting services at home), but that's another topic...
I agree, but winning such a fight isn't my intention .. The intention is to protect the engines from degradation. Bot requests do no harm as long as they are not detected by the search service to which our engines forward them --> detecting “all” bots is a moving target, but we don't have to achieve that. It should be enough for us to filter out the harmful requests.

There are bots that are elaborately implemented; these are so good that they are detected neither by us nor by the search engine to which we pass their queries ... these bots are not a problem for us and we do not have to fight them. What is harmful are the lovelessly implemented bots that script kiddies send to our instances ... these cause considerable damage. Serious bot developers who are able to track our developments are less likely to use SearXNG instances than to access Google directly, for example (which is easier than bypassing our bot detection these days).
I can currently see that 99% of the bot requests on my server send no Sec-Fetch-* headers. The Sec-Fetch headers are a good method of defense at the moment ... if the carelessly implemented bots get better one day, we will have to revise our methods again. But for the moment, these headers still provide a good signal to rely on for bot detection. It's the lovelessly implemented bots roaming around the network that represent the significant majority .. it should be enough for us to fend those off.
A public instance sends its requests from a single IP address. It just takes a small number of bots to ruin the experience for everyone, because the engine will decide that there are already too many requests coming in compared to normal traffic from a single IP address. That's common sense! I'm speaking from my experience of dealing with public services that are attacked every day by bots (yewtu.be, searx.be and xcancel.com). The only approach that truly works is hiding how you combat them and using advanced methods like JavaScript checks, but that's certainly not what we want by default in SearXNG. This is why I'm repeating myself: you have to start exploring other paths that do not involve combating bots. Anyway, my review of this new change is that I do not endorse filtering based on these headers.
I don't want to contradict that, but it's not trivial, especially if we want to protect privacy. I try to achieve a maximum of our goals with simple means in the time available to me ... I would be happy to receive PRs with alternatives.
@unixfox would it be OK for you if this method (Sec-Fetch-*) is optional? .. We could first activate it by default and wait for feedback from the user community ... if there are frequent problems, we could deactivate the method in the defaults. I have had good experiences with this method on my own instance, but my instance is not representative --> I would like to gain experience of how the method works in practice, on a broad basis of instances. If the Sec-Fetch-* method causes more damage than it prevents, then it should not be used.
The people who have a bad experience because of this new filter will never bother reporting it back to us, that's a fact. They will move on and never try SearXNG again. Hence why I'm highly hesitant to enable it by default: I know this will cause harm, and I have explained above that it filters out quite a lot of legitimate traffic: #3965 (comment)

But you can do it this way: make it optional. Then create a GitHub discussion in https://github.com/searxng/searx-instances/discussions explaining the drawbacks and the improvements, and send that link to the instance maintainers. This way, it's up to each maintainer to take the final decision.
@return42 In my opinion, to be safe, you should check both the browser version sent through the user-agent and whether Sec-Fetch-Site exists or not. It will be a bit more time-consuming to build this list, but at least you won't block browsers that legitimately do not support those headers.
The extra time is negligible .. but as a consequence we would have to let all … pass. I patched this PR into my darmarit.org branch and have used it in production since I opened this PR .. it has blocked thousands and thousands of bot requests since, without hurting my users. If there are really doubts about the usefulness, then it would certainly be easiest to leave the decision to the maintainers ... in other words, we make this feature optional.

But instead of making individual block methods optional, we should rebuild botdetection so that the URL paths and the filter methods applied to them are configurable, including the method proposed in this PR. I had already started to extend botdetection with such a configuration ... but I didn't finish it because I had to take care of other things again :-o
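Purely to illustrate that last idea, a hypothetical sketch of what a per-path filter configuration could look like; none of these names or structures exist in SearXNG's botdetection, they are made up for the example:

```python
# Hypothetical sketch only: these names and this layout do not exist in
# SearXNG's botdetection; they just illustrate a per-path filter config.
from typing import Callable, Dict, List

Filter = Callable[[dict], bool]   # takes request headers, True = let it pass

def ip_limit(headers: dict) -> bool:
    return True                   # placeholder for an IP-based rate limit

def http_sec_fetch(headers: dict) -> bool:
    return headers.get("Sec-Fetch-Mode", "navigate") == "navigate"  # toy check

FILTERS_BY_PATH: Dict[str, List[Filter]] = {
    "/search": [ip_limit, http_sec_fetch],  # strict on the search endpoint
    "/healthz": [],                         # no filtering on health checks
}

def allowed(path: str, headers: dict) -> bool:
    """Run every filter configured for `path`; unknown paths get `ip_limit` only."""
    return all(f(headers) for f in FILTERS_BY_PATH.get(path, [ip_limit]))
```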
I can assure you, from my experience of managing xcancel.com with many, many bots hitting my instance every day: 98% of bots disguise themselves as Chrome or Firefox and nothing else. So if you check just for the version of Firefox or Chrome, you will already have very good coverage without spending too much time.
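A rough sketch of combining that user-agent version check with the Sec-Fetch check, along the lines discussed here. The version cutoffs (Chrome 80, Firefox 90) are approximations of the caniuse data linked above and should be verified; the naive regex parsing and the function names are assumptions for the example:

```python
import re

# Approximate first versions that send Sec-Fetch-* headers; double-check the
# caniuse matrix before relying on these numbers (Safari gained support with
# 16.4 in March 2023 and is intentionally left out of this naive sketch).
SEC_FETCH_SINCE = {"Chrome": 80, "Firefox": 90}

def must_send_sec_fetch(user_agent: str) -> bool:
    """Require Sec-Fetch-* only from browsers that claim to be new enough."""
    match = re.search(r"(Chrome|Firefox)/(\d+)", user_agent)
    if not match:
        return False  # unknown or older browser: don't demand the headers
    name, version = match.group(1), int(match.group(2))
    return version >= SEC_FETCH_SINCE[name]

def looks_like_bot(user_agent: str, headers) -> bool:
    # Block only when the UA claims a browser that must send Fetch Metadata
    # but the headers are missing or wrong for a top-level navigation.
    if not must_send_sec_fetch(user_agent):
        return False
    return headers.get("Sec-Fetch-Mode") != "navigate"
```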
For sure, I think that's the best outcome. Let the maintainer decide if they want to block more bots at the cost of blocking legitimate users on old browsers.
Force-pushed 3634757 to 51c6370
FYI: the last force push was just a rebase on master. BTW, I have been running this patch on my instance for 2 months now without any issues (and with a good blocking experience).
Hello, my comments still apply. The same goes for the recommendation regarding the community.
Force-pushed 51c6370 to 5417abf
@return42 please review my PR #4696, especially the commit where I added a specific check for the browser version before checking the Sec-Fetch headers. This is in line with what I said in #3965 (comment), and I'm fine with blocking requests like this by default.
HTTP Fetch Metadata Request Headers [1][2] are used to detect bot requests. Bots with invalid *Fetch Metadata* will be redirected to the intro (`index`) page.

[1] https://www.w3.org/TR/fetch-metadata/
[2] https://developer.mozilla.org/en-US/docs/Glossary/Fetch_metadata_request_header

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
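As a reading aid, a minimal Flask-flavored sketch of the behavior described in that message: requests with invalid Fetch Metadata are redirected to the intro page. This is an illustration only, not the code that was merged; the strict handling of missing headers mirrors the trade-off debated above:

```python
# Illustrative sketch only -- not the merged SearXNG code.
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

def fetch_metadata_valid(headers) -> bool:
    # A legitimate visit to the search page is a top-level navigation to a
    # document from the same site (or a direct visit, Sec-Fetch-Site: none).
    # Missing headers count as invalid here, which is exactly the trade-off
    # debated in this thread.
    return (
        headers.get("Sec-Fetch-Mode") == "navigate"
        and headers.get("Sec-Fetch-Dest") == "document"
        and headers.get("Sec-Fetch-Site") in ("none", "same-origin", "same-site")
    )

@app.route("/")
def index():
    return "intro page"

@app.route("/search")
def search():
    if not fetch_metadata_valid(request.headers):
        # bots (or browsers without Fetch Metadata) land on the intro page
        return redirect(url_for("index"))
    return "results ..."
```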
PR is now merged .. go ahead with PR #4696
The first commit of #4696 was cherry-picked from here (fe08bb1), and I can't merge what I don't approve of .. the decision for #4696 is now entirely yours.
I would have merged my own PR and closed this one 🙄. I think I'm still maintaining this project, even if it's from a distance, so you aren't the only one who can merge things on this project.
FYI: PR is tested on my instance https://darmarit.org/searx