[mod] botdetection: HTTP Fetch Metadata Request Headers #3965
Conversation
That's interesting. Are there really a lot of bots that send these incorrect headers?
I detected one bot that passes all tests .. all others are blocked, mainly by this PR. BTW, Sec-Fetch-Mode is one of the methods DDG uses to block bots .. it was the first place I had seen Sec-Fetch-Mode, and I thought we could add this method to our botdetection :-)
In the current state of this PR, any browser that has not implemented Sec-Fetch-* will get blocked, and so will users of a browser that has implemented it but who are running an older version: all of these visitors will get blocked. You can see a compatibility matrix for these headers here: https://caniuse.com/mdn-http_headers_sec-fetch-dest

In its current state this PR would block about 10% of all legitimate users on the internet. A big example is Safari, which only implemented these headers last March (2023); that's quite recent. I'm not fond of this change. Maybe you can implement it so that the headers are only validated when they are sent: if they are present, their values have to be correct for the traffic to be allowed. But outright blocking any browser that does not send these headers is a bad idea. Or you can make this configurable, but having this behavior by default is not a good idea.
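For illustration, a minimal sketch of that "only validate what is actually sent" idea, assuming a werkzeug/Flask-style `request.headers` mapping; the function name and the exact set of accepted values are assumptions for the example, not the actual SearXNG code:

```python
# Sketch: reject a request only when Fetch Metadata headers are present
# *and* carry values that do not fit a normal top-level page navigation.
def sec_fetch_ok(headers) -> bool:
    mode = headers.get("Sec-Fetch-Mode")
    site = headers.get("Sec-Fetch-Site")
    dest = headers.get("Sec-Fetch-Dest")

    # Browser (or old browser version) does not implement Fetch Metadata:
    # stay permissive instead of blocking it outright.
    if mode is None and site is None and dest is None:
        return True

    if mode is not None and mode != "navigate":
        return False  # the search page should be opened via a navigation
    if site is not None and site not in ("none", "same-origin", "same-site"):
        return False  # cross-site requests to the search form are suspicious
    if dest is not None and dest != "document":
        return False  # a navigation should target a document
    return True
```

With a permissive fallback like this, browsers that predate Fetch Metadata are not locked out, while bots that send inconsistent values are still caught.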
Bnyro left a comment:
The limiter is also in use if people use different formats (e.g. json) for the results, right?
Nobody using the search API via the json format would consider sending the Sec-* headers, so I agree that we should only validate them if they were sent.
I hope not: the limiter checks for some CSS resources, something that is not requested when using the JSON API.
Thanks for the feedback ..

this is what DDG does ... but we can make the filter methods optional --> I will modify this PR to add an option to deactivate filter methods.

JSON and similar API requests are per se bot requests ;-) .. Just some thoughts of mine: We have to filter out bot requests because otherwise our engines degrade ... that is, by the way, the only reason we filter. For the WEB applications we activate the limiter, disable formats such as JSON and thus prefer real WEB browsers. The link token method loads CSS, for example, and thus already excludes exotic WEB browsers such as lynx.

If Sec-Fetch-* is supported by the major WEB browsers, it makes sense to use it by default for bot defense (as other search engines do). Optionally, instance operators can deactivate the filter method to also support browsers without Sec-Fetch-* support ... however, the operator must then be aware that their engines will degrade more quickly ... just as you could offer JSON on the WEB, but would then have to assume that all bots will rush to that instance ... and I can only hope that there won't be too many bug reports for the engines from such bot-friendly setups ;-).
I agree, it sounds reasonable to enable this filter rule by default and allow instance admins to disable it if there's a large demand from their users. People running such old browsers (older than a year) are a security risk anyway and should update their browsers, so I don't see a requirement to support these old browser versions by default.
I'm only saying that because I saw an Android app (I don't remember the name) that used the JSON API of SearXNG to search the web and let users quickly navigate to the results; that app, for example, would be affected by this.
Interesting .. 🤔 .. I would like to know which instance such an app connects to .. all instances on searx.space have bot protection enabled ...
Found it: It's called "Gugal" and one must set the instance url manually, so it's probably only used with self-hosted instances. |
You don't know the inner workings of DDG; you don't know how their anti-bot measures fully work. They may have exemptions for specific web browsers. I don't think we all agreed to target the same anti-bot level as DDG; if that were the case, we would already have JavaScript bot detection and so on in the core. I already explained that it's a lost cause to combat all the nefarious bots. You can't combat something when the secret sauce to beat it is literally found in public source code, and filtering based on these headers is no exception. I have already said what the only solutions to reduce the issues caused by bots are.
Making it easier to deploy SearXNG would also help users get the best experience by running a local instance. Right now that is quite complicated for a beginner in IT, because the documentation is not very user-friendly (it requires some existing knowledge about hosting services at home), but that's another topic...
I agree, but winning such a fight isn't my intention .. The intention is to protect the engines from degradation. Bot requests do no harm as long as they are not detected by the search service to which our engines forward them --> detecting “all” bots is a moving target, but we don't have to achieve that. It should be enough for us to filter out the harmful requests.

There are bots that are elaborately implemented; these are so good that they are detected neither by us nor by the search engine to which we pass their queries ... these bots are not a problem for us and we do not have to fight them. What is harmful are the lovelessly implemented bots that script kiddies send to our instances ... these cause considerable damage. Serious bot developers who are able to track our developments are less likely to use SearXNG instances than to access Google directly, for example (which is easier than bypassing our bot detection these days).
I can currently see that 99% of the bot requests on my server send no Sec-Fetch-* headers. The Sec-Fetch headers are a good method of defense at the moment ... if the carelessly implemented bots get better one day, we will have to revise our methods again. But for the moment, these headers still provide a good signal to rely on for bot detection. It's the lovelessly implemented bots roaming around the network that represent the significant majority .. it should be enough for us to fend those off.
A public instance sends its requests from a single IP address. It just takes a small number of bots to ruin the experience for everyone, because the engine will decide that there are already too many requests coming in compared to normal traffic from a single IP address. That's common sense! I'm speaking from my experience of dealing with public services that are attacked every day by bots (yewtu.be, searx.be and xcancel.com). The only approach that truly works is hiding how you combat them and using advanced methods like JavaScript checks, but that's certainly not what we want by default in SearXNG. This is why I'm repeating myself: you have to start exploring other paths that do not involve combating bots. Anyway, my review of this new change is that I do not endorse filtering based on these headers.
I don't want to contradict that, but it's not trivial, especially if we want to protect privacy. I try to achieve a maximum of our goals with simple means in the time available to me ... I would be happy to receive PRs with alternatives.
@unixfox would it be OK for you if this method (Sec-Fetch-*) is optional? .. We could first activate it by default and wait for feedback from the user community ... if there are frequent problems, we could deactivate the method in the defaults. I have had good experiences with this method on my own instance, but my instance is not representative --> I would like to gain experience of how the method works in practice, on a broad basis of instances. If the Sec-Fetch-* method causes more damage than it prevents, then it should not be used.
The people who have a bad experience because of this new filter will never bother reporting it back to us, that's a fact. They will move on and never try SearXNG again. Hence why I'm highly hesitant to enable it by default: I know this will cause harm, and I have explained above that it filters out quite a lot of legitimate traffic: #3965 (comment)

But you can do it this way: make it optional. Then create a GitHub discussion in https://github.com/searxng/searx-instances/discussions explaining the drawbacks and the improvements, and send that link to the instance maintainers. This way, it's up to each maintainer to take the final decision.
@return42 In my opinion, to be safe, you should check both the browser version sent through the user-agent and whether Sec-Fetch-Site exists or not. It will be a bit more time-consuming to build this list, but at least you won't block browsers that legitimately do not support those headers.
The extra time is negligible .. but as a consequence we would have to let all … pass. I patched this PR into my darmarit.org branch and have used it in production since I opened this PR .. it has blocked thousands and thousands of bot requests since, without hurting my users. If there are really doubts about the usefulness, then it would certainly be easiest to leave the decision to the maintainers ... in other words, we make this feature optional.

But instead of making individual block methods optional, we should rebuild botdetection so that the URL paths and the filter methods applied to them are configurable, including the method proposed in this PR. I had already started to extend botdetection with such a configuration ... but I didn't finish it because I had to take care of other things again :-o
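Purely to illustrate that last idea, a hypothetical sketch of what a per-path filter configuration could look like; none of these names or structures exist in SearXNG's botdetection, they are made up for the example:

```python
# Hypothetical sketch only: these names and this layout do not exist in
# SearXNG's botdetection; they just illustrate a per-path filter config.
from typing import Callable, Dict, List

Filter = Callable[[dict], bool]   # takes request headers, True = let it pass

def ip_limit(headers: dict) -> bool:
    return True                   # placeholder for an IP-based rate limit

def http_sec_fetch(headers: dict) -> bool:
    return headers.get("Sec-Fetch-Mode", "navigate") == "navigate"  # toy check

FILTERS_BY_PATH: Dict[str, List[Filter]] = {
    "/search": [ip_limit, http_sec_fetch],  # strict on the search endpoint
    "/healthz": [],                         # no filtering on health checks
}

def allowed(path: str, headers: dict) -> bool:
    """Run every filter configured for `path`; unknown paths get `ip_limit` only."""
    return all(f(headers) for f in FILTERS_BY_PATH.get(path, [ip_limit]))
```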
I can assure you, from my experience of managing xcancel.com with many, many bots hitting my instance every day: 98% of bots disguise themselves as Chrome or Firefox and nothing else. So if you check just for the version of Firefox or Chrome, you will already have very good coverage without spending too much time.
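A rough sketch of combining that user-agent version check with the Sec-Fetch check, along the lines discussed here. The version cutoffs (Chrome 80, Firefox 90) are approximations of the caniuse data linked above and should be verified; the naive regex parsing and the function names are assumptions for the example:

```python
import re

# Approximate first versions that send Sec-Fetch-* headers; double-check the
# caniuse matrix before relying on these numbers (Safari gained support with
# 16.4 in March 2023 and is intentionally left out of this naive sketch).
SEC_FETCH_SINCE = {"Chrome": 80, "Firefox": 90}

def must_send_sec_fetch(user_agent: str) -> bool:
    """Require Sec-Fetch-* only from browsers that claim to be new enough."""
    match = re.search(r"(Chrome|Firefox)/(\d+)", user_agent)
    if not match:
        return False  # unknown or older browser: don't demand the headers
    name, version = match.group(1), int(match.group(2))
    return version >= SEC_FETCH_SINCE[name]

def looks_like_bot(user_agent: str, headers) -> bool:
    # Block only when the UA claims a browser that must send Fetch Metadata
    # but the headers are missing or wrong for a top-level navigation.
    if not must_send_sec_fetch(user_agent):
        return False
    return headers.get("Sec-Fetch-Mode") != "navigate"
```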
For sure, I think that's the best outcome. Let the maintainer decide if they want to block more bots at the cost of blocking legitimate users on old browsers.
Force-pushed 3634757 to 51c6370
FYI: the last force push was just a rebase on master. BTW, I have been running this patch on my instance for 2 months now without any issues (and with a good blocking experience).
Hello, my comments still apply. The same goes for the recommendation regarding the community.
Force-pushed 51c6370 to 5417abf
@return42 please review my PR #4696, especially the commit where I added a specific check for the browser version before checking the Sec-Fetch headers. This is in line with what I said in #3965 (comment), and I'm fine with blocking requests like this by default.
HTTP Fetch Metadata Request Headers [1][2] are used to detect bot requests. Bots with invalid *Fetch Metadata* will be redirected to the intro (`index`) page.

[1] https://www.w3.org/TR/fetch-metadata/
[2] https://developer.mozilla.org/en-US/docs/Glossary/Fetch_metadata_request_header

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
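As a reading aid, a minimal Flask-flavored sketch of the behavior described in that message: requests with invalid Fetch Metadata are redirected to the intro page. This is an illustration only, not the code that was merged; the strict handling of missing headers mirrors the trade-off debated above:

```python
# Illustrative sketch only -- not the merged SearXNG code.
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

def fetch_metadata_valid(headers) -> bool:
    # A legitimate visit to the search page is a top-level navigation to a
    # document from the same site (or a direct visit, Sec-Fetch-Site: none).
    # Missing headers count as invalid here, which is exactly the trade-off
    # debated in this thread.
    return (
        headers.get("Sec-Fetch-Mode") == "navigate"
        and headers.get("Sec-Fetch-Dest") == "document"
        and headers.get("Sec-Fetch-Site") in ("none", "same-origin", "same-site")
    )

@app.route("/")
def index():
    return "intro page"

@app.route("/search")
def search():
    if not fetch_metadata_valid(request.headers):
        # bots (or browsers without Fetch Metadata) land on the intro page
        return redirect(url_for("index"))
    return "results ..."
```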
PR is now merged .. go ahead with PR #4696
The first commit of #4696 was cherry-picked from here (fe08bb1), and I can't merge what I don't approve of .. the decision for #4696 is now entirely yours.
I would have merged my own PR and closed this one 🙄. I think I'm still maintaining this project, even if it's from a distance, so you aren't the only one who can merge things on this project.
FYI: PR is tested on my instance https://darmarit.org/searx