@jazzzooo
Contributor

What does this PR do?

  • Updates the User-Agent to a correct one, see here. This was getting us detected as a scraper by Bing.
  • Now we get actual Bing results, not the single-page results that Bing sends to suspected bots and scrapers.
  • Pagination is fixed.
  • Filtering by date works.
  • SafeSearch is removed, as it is handled by the Bing backend, no cookies or parameters can change it.
  • Region/language select works somewhat.

Not fixed:

  • The '(Web)' icon getting included in the content of the result. I tried fixing it but the content kept getting deleted sometimes and I didn't figure out why. Maybe someone better at xpaths can fix it.
  • The regions and languages. I will leave this for a future PR. It's a mess and we don't always send the right region and language. To test this in the future, try all different languages for Switzerland. Right now some don't give any results and we send en-US for some.

Why is this change important?

All my changes to the cookie logic were intentional. It is the "least" broken solution for now without redoing the region/language traits logic; I'll save that for another PR. The new User-Agent should be an improvement across the board, but it is still good to check that nothing broke. The previous User-Agent was not one that appeared in the wild, so maybe some other engines will work better now too? Also, let's not use uuid1 in the future: uuid4 is faster and doesn't leak the MAC address of the server.
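
To illustrate the uuid remark, here is a small sketch that uses only Python's standard `uuid` module. A version-1 UUID embeds a 48-bit node id (normally derived from the host's MAC address) in its last group, while a version-4 UUID is built from random bits. The node value `FAKE_MAC` is a made-up number, passed explicitly only so that the example is reproducible.

```python
import uuid

# Hypothetical MAC address, passed explicitly so the output is reproducible.
FAKE_MAC = 0x001122334455

u1 = uuid.uuid1(node=FAKE_MAC)  # time-based, embeds the node id
u4 = uuid.uuid4()               # random-based, no host information

# The last 12 hex digits of a version-1 UUID are the node id.
leaked = str(u1).rsplit("-", 1)[-1]
print(leaked)
print(u1.version, u4.version)
```

Anyone who sees the version-1 value can read the node id back out of it, which is the leak referred to above.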

How to test this PR locally?

!bing Coca-Cola (the most international term)
!bin Coca-Cola
!biv Coca-Cola

  • Make sure most regions give regional results for these^.
  • Test that page 2 works.
  • Test that filtering by date works.
  • Test that other engines don't break from the new User-Agent.
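
The manual checks above can be enumerated with a few lines of Python. This is only a sketch: the region list is an arbitrary sample, the helper name `search_terms` is made up, and the `:region` suffix is the SearXNG search syntax for selecting a locale that is used later in this thread.

```python
from itertools import product

BANGS = ["!bing", "!bin", "!biv"]               # the three bangs listed above
REGIONS = ["de-CH", "fr-CH", "en-GB", "en-US"]  # arbitrary sample of regions


def search_terms(query, bangs=BANGS, regions=REGIONS):
    """Return one SearXNG search term per (bang, region) combination."""
    return [f"{bang} {query} :{region}" for bang, region in product(bangs, regions)]


for term in search_terms("Coca-Cola"):
    print(term)
```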

Related issues

Closes #2698
Could fix #2388 but I cannot reproduce it; it might be related to #2641, see my comment on that. We will know it's fixed if instances stop reporting it after this goes live tho.

@jazzzooo jazzzooo changed the title from "[fix] correct useragent" to "[fix] engine - bing fix search, pagination, remove safesearch" Sep 20, 2023
@Bnyro
Member

Bnyro commented Sep 20, 2023

Awesome job, thank you for looking into this one!

I can confirm that this fixes pagination and time range search 🎉

I'll look deeper into the changes soon when I have time for it.

@return42
Member

@jazzzooo:

The regions and languages. I will leave this for a future PR. It's a mess and we don't always send the right region and language. To test this in the future, try all different languages for Switzerland. Right now some don't give any results and we send en-US for some.

Not sure if you know: we fetch the languages and regions ..

"bing": {
"all_locale": null,
"custom": {},
"data_type": "traits_v1",
"languages": {
"ar": "ar",
"bg": "bg",
"bn": "bn",
"ca": "ca",
"cs": "cs",
"da": "da",
"de": "de",
"en": "en",
"es": "es",
"et": "et",
"eu": "eu",
"fi": "fi",
"fr": "fr",
"gl": "gl",
"gu": "gu",
"he": "he",
"hi": "hi",
"hr": "hr",
"hu": "hu",
"is": "is",
"it": "it",
"ja": "jp",
"kn": "kn",
"ko": "ko",
"lt": "lt",
"lv": "lv",
"ml": "ml",
"mr": "mr",
"ms": "ms",
"nb": "nb",
"nl": "nl",
"pa": "pa",
"pl": "pl",
"pt": "pt-pt",
"ro": "ro",
"ru": "ru",
"sk": "sk",
"sl": "sl",
"sr": "sr",
"sv": "sv",
"ta": "ta",
"te": "te",
"th": "th",
"tr": "tr",
"uk": "uk",
"vi": "vi",
"zh": "zh-hans",
"zh_Hans": "zh-hans",
"zh_Hant": "zh-hant"
},
"regions": {
"da-DK": "da-DK",
"de-AT": "de-AT",
"de-CH": "de-CH",
"de-DE": "de-DE",
"en-AU": "en-AU",
"en-CA": "en-CA",
"en-GB": "en-GB",
"en-IN": "en-IN",
"en-MY": "en-MY",
"en-NZ": "en-NZ",
"en-PH": "en-PH",
"en-US": "en-US",
"en-ZA": "en-ZA",
"es-AR": "es-AR",
"es-CL": "es-CL",
"es-ES": "es-ES",
"es-MX": "es-MX",
"es-US": "es-US",
"fi-FI": "fi-FI",
"fr-BE": "fr-BE",
"fr-CA": "fr-CA",
"fr-CH": "fr-CH",
"fr-FR": "fr-FR",
"id-ID": "en-ID",
"it-IT": "it-IT",
"ja-JP": "ja-JP",
"ko-KR": "ko-KR",
"nb-NO": "no-NO",
"nl-BE": "nl-BE",
"nl-NL": "nl-NL",
"pl-PL": "pl-PL",
"pt-BR": "pt-BR",
"ru-RU": "ru-RU",
"sv-SE": "sv-SE",
"tr-TR": "tr-TR",
"zh-CN": "zh-CN",
"zh-HK": "zh-HK",
"zh-TW": "zh-TW"
}
},

From the API description: https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes

But note, bing has different market codes for WEB, Videos, Images .. News

There we describe the URLs / for instance under the names searx.engines.bing.bing_traits_url (WEB), searx.engines.bing_images.bing_traits_url (Images) .. and so forth.
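
As a rough illustration of how such a traits table can be used, the sketch below maps a SearXNG locale tag to a Bing market code and language code. The dictionaries are a small subset of the traits quoted above; the helper `resolve_locale` and its en-US fallback are hypothetical and are not SearXNG's actual lookup code.

```python
# Subset of the traits quoted above: SearXNG tag -> Bing code.
BING_REGIONS = {
    "de-CH": "de-CH",
    "fr-CH": "fr-CH",
    "id-ID": "en-ID",
    "nb-NO": "no-NO",
    "en-US": "en-US",
}
BING_LANGUAGES = {"de": "de", "fr": "fr", "it": "it", "ja": "jp", "nb": "nb", "pt": "pt-pt"}


def resolve_locale(tag, default_region="en-US"):
    """Hypothetical helper: return (market_code, language_code) for a tag like 'fr-CH'."""
    language = tag.split("-")[0]
    # A region that is missing from the table falls back to the default market.
    market = BING_REGIONS.get(tag, BING_REGIONS[default_region])
    # A language that is missing from the table is passed through unchanged (an assumption).
    return market, BING_LANGUAGES.get(language, language)


print(resolve_locale("nb-NO"))
print(resolve_locale("it-CH"))
```

With this table, `it-CH` has no region entry and ends up on the default market, which is the kind of behaviour the Switzerland test mentioned in the PR description would surface.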

About Switzerland: If you look there Bing WEB has only fr-CH and fr-DE

I wrote this just to let you know .. no need to change anything in your PR .. in my review I will have a look about languages / regions .. seems bing has changed a lot of things since I have implemented the engine.


One question I have about your initial comment:

The '(Web)' icon getting included in the content of the result. I tried fixing it but the content kept getting deleted sometimes and I didn't figure out why. Maybe someone better at xpaths can fix it.

What '(Web)' icon do you mean .. I haven't seen anything in the content that I didn't expect.

@jazzzooo
Contributor Author

jazzzooo commented Sep 20, 2023

@return42 yep I saw we fetch those. If you noticed I changed the default for the language to en-us tho because that's what the cookie was set to. The different cookies might be different in how locale is encoded, one cookie is enough tho it seems.

For Switzerland I'm sure you meant fr-CH and de-CH, but yes those work now, I could've sworn French wasn't working during testing... No worries, not the focus of this PR anyways. (Also on Bing you can set region to "Switzerland - German", there isn't even a "Switzerland - French" option in the UI... and then set the language to Italian, and the results are pretty good. But that might be too much work for what it's worth.)

Here are screenshots of the Web icons, I tested them with other countries and IP's and they still appear.
web2
web

@return42
Member

@jazzzooo FYI: I just rebased your branch to get #2826 which has already been merged upstream .. I'm still in review with this PR ..

@return42
Member

return42 commented Sep 27, 2023

Pew .. bing itself is somewhat broken. Here are the results when I search, with region Australia and language English:

image

.. that's because I tested different languages & regions before .. to get this result:

image

I had to empty the browser cache and delete the cookies from bing. But instead of a link to https://www.bmw-dubai.com, what I would expect for that query is a link to the branch office in Australia:

image

M$ is broken by design.

@return42
Member

it turns out that the old market codes don't work anymore ... I had to rework the fetch_traits() and now read from https://www.bing.com/account/general

I didn't manage more than bing-WEB today, I will have a look at the other bing engines in the next days.

Maybe I have to review fetch_traits() of the other engines as well .. I remember that bing-News had special market codes .. not sure if this has changed in the meantime .. I will have a look .. more coming soon 👍

@jazzzooo
Contributor Author

@return42 what was your IP for the BMW Australia search? I cannot reproduce it even with a Hong Kong IP:
aus

@jazzzooo
Contributor Author

Also, I wasn't able to observe _EDGE_S change anything when _EDGE_CD was set. Did you observe any change?

@return42
Member

return42 commented Sep 28, 2023

Also, I wasn't able to observe _EDGE_S change anything when _EDGE_CD was set. Did you observe any change?

Yes, the market code (mkt={engine_region}) is needed .. you won't notice it in a simple search, but if you do enough tests in different languages (across different pages, etc), you will eventually notice that the results are better. It's very hard to describe, because bing sometimes doesn't behave deterministically when you switch wildly between languages and regions (see above).

In your patched version the paging didn't work for me, I had to add the argument pq ... but this error, too, is only noticed in longer tests where you look at the pages closely and notice that they sometimes repeat (who expects that on page 4 the content of page one comes back) / see my comment in the code:

    # if arg 'pq' is missing, sometimes on page 4 we get results from page 1,
    # don't ask why it is only sometimes / it's M$ and they have never been
    # deterministic ;)

Furthermore, in my tests for China, Japan and others I noticed that I should never send the argument 'first' on page 1.
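
The two paging observations above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the helper name `bing_query_string` is made up, the value of 'pq' is assumed to be a copy of the query, and the offset formula assumes 10 results per page.

```python
from urllib.parse import urlencode


def bing_query_string(query, pageno=1, page_size=10):
    """Build the query string for a result page (hypothetical helper)."""
    params = {
        "q": query,
        # Assumption: 'pq' repeats the query; without it some pages repeat.
        "pq": query,
    }
    # 'first' is left out on page 1 and only sent from page 2 on.
    if pageno > 1:
        # Assumption: 1-based offset of the first result on the page.
        params["first"] = (pageno - 1) * page_size + 1
    return urlencode(params)


print(bing_query_string("Coca-Cola"))
print(bing_query_string("Coca-Cola", pageno=4))
```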

TL;DR

all settings somehow have side effects on the other settings, it is a big jumble of options and nothing behaves predictably ... only sufficient testing can give you at least some hope that it somehow works .. today .. and tomorrow everything changes --> as it always is with M$ products, I experience this every day at work 😢

@return42 return42 force-pushed the bing-fix branch 2 times, most recently from 829ad6d to a731f40 Compare September 28, 2023 17:52
@return42
Member

@jazzzooo I have invested another day -- bing is so crude -- ... I think now we have a state where the issues are fixed and everything works as far as possible. We'll notice over time that there are still quirks in some languages and regions (especially at the bing-news) ... we'll have to sort them out in subsequent PRs.

If you could do another final test, just to make sure I haven't missed any major bugs, then we could merge this PR for now (I'd like to release the bug fixes, the fine tuning can be done afterwards).

@return42
Member

@BernieHuang2008 can you have a look at the traditional and simplified Chinese in the bing engines we patched in this PR .. especially bing-web and bing-news are of interest in the regions (:zh-CN, :zh-TW, :zh-HK) where the Chinese language (:zh) is spoken.

Comment on lines +136 to +159
# In bing the market code 'zh-cn' exists, but there is no 'news' category in
# bing for this market. Alternatively we use the market code from Hong
# Kong. Even if this is not correct, it is better than having no hits at
# all, or sending false queries to bing that could raise the suspicion of a
# bot.

_fetch_traits(engine_traits, bing_traits_url, xpath_language_codes, xpath_market_codes)
# HINT: 'en-hk' is the region code it does not indicate the language en!!
engine_traits.regions['zh-CN'] = 'en-hk'
Member

@BernieHuang2008 this will alias the bing-news for :zh and :zh-CN to :zh-HK .. not sure if it is a good or bad idea of mine / maybe you know a better alternative?

Contributor

btw, what do you mean "no :zh-cn"?
image
if I'm correct, it's obvious that the second results are different.

Member

@return42 return42 Sep 29, 2023

In SearXNG we have a Search Syntax / with this syntax you can for instance select a language/region .. as you can do in the drop-down menu in the upper left:

image

When I write down a search term here, I use this syntax / for instance :zh-TW !bing bmw is a search term that will use the bing engine with language Chinese in region Taiwan to query the word "bmw".

Note: You can't use this syntax on bing.com as you have done above in your screenshot.
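
The search syntax described above can be sketched with a tiny tokenizer. This is a hypothetical illustration, not SearXNG's query parser: it only understands a `:locale` token and `!engine` bangs and treats everything else as the query.

```python
def parse_search_term(term):
    """Split a term like ':zh-TW !bing bmw' into (locale, bangs, query)."""
    locale, bangs, words = None, [], []
    for token in term.split():
        if token.startswith(":") and len(token) > 1:
            locale = token[1:]       # e.g. 'zh-TW'
        elif token.startswith("!") and len(token) > 1:
            bangs.append(token[1:])  # e.g. 'bing'
        else:
            words.append(token)      # part of the actual query
    return locale, bangs, " ".join(words)


print(parse_search_term(":zh-TW !bing bmw"))
```

The order of the tokens does not matter in this sketch, so `!bing election :en-gb` yields the locale `en-gb`, the bang `bing` and the query `election`.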

@BernieHuang2008
Contributor

@BernieHuang2008 can you have a look at the traditional and simplified Chinese in the bing engines we patched in this PR .. especially bing-web and bing-news are of interest in the regions (:zh-CN, :zh-TW, :zh-HK) where the Chinese language (:zh) is spoken.

Sure, but it may take a while because I need to understand what you're doing first

@BernieHuang2008
Contributor

I've found these bugs:

image

You can see that there are two quotation marks (“ ”) in the result title, and the content inside seems to have disappeared.

Both in Traditional/Simplified Chinese, disappearing keywords are all over the page.

Before this PR (in public instances), it worked fine. In Bing News, it also shows these results in their entirety.

@return42
Member

return42 commented Sep 29, 2023

Sure, but it may take a while because I need to understand what you're doing first

I think it's hard to understand (or explain) all the details of the bing engines .. most of what we have done here is reverse engineering of the bing services by debugging what bing does when you use it in your WEB browser .. and bing is very complicated .. there is no clear API, nor is there, for instance, a clear list of the market codes bing uses.

Bing has two settings about languages and regions:

  1. the language of the UI
  2. the market code --> the region in which the search is done

This is more or less similar for all bing engines: bing (WEB), bing_images, bing_videos and bing_news.

There is one thing to notice: bing_news does not exist for all the market codes (aka regions) .. for instance there is no bing-news category for the region China (PRC) .. I can't properly evaluate the search results of the Chinese speaking regions because I can't even tell if a result is simplified or traditional Chinese :o

It would be enough for me, if you do a simple test as a normal user and give us feedback if the results are in the expected language (and script). And if they fit to the region you have selected as user.

For instance, here is how I tested the regions:

  • :zh-CN BMW !bing --> here I would expect a link in the upper ranking to the BMW branch in China (PRC)
  • :zh-TW BMW !bing --> .. BMW branch in Taiwan
  • :zh-HK BMW !bing --> .. BMW branch Hong Kong

as you may notice from my test above, I can't really test for these regions due to lack of language skills and ignorance of regionally preferred results ... I can't do that for many other regions either, but in the Chinese language area there is also the fact that in the regions partly simplified, partly traditional script is preferred.

@return42
Member

Oops, we posted at the same time .. :)

@BernieHuang2008
Contributor

... it doesn't matter ...

the results satisfied me except the problem above.

@return42
Member

I've found these bugs:

👍

You can see that there are two quotation marks (“ ”) in the result title, and the content inside seems to have disappeared.
Both in Traditional/Simplified Chinese, disappearing keywords are all over the page.

How did you test it / can you give me your search term ...

@BernieHuang2008
Contributor

sure, i use

!bin 遥遥领先

@return42
Member

!bin 遥遥领先

OK, I will have a look 👍


.. but this brings me to another question / and I'm sorry for asking dumb questions: the search term seems to be simplified Chinese .. right? .. when I use this term, the language recognition (in SearXNG) switches to zh-HK ..

It is a little off topic here in the thread where we test the bing engines, but what comes into my mind:

  • Is it intended for the language recognition to switch to zh-TW while the user may prefer zh-CN?
  • As far as I know, in TW the official script is traditional.. or am I wrong?

It also happens in other language areas that a language is not recognized correctly. You then have to explicitly select the desired language or region.. but what is it like here in the Chinese language area.. here the language recognition may also set the wrong region .

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
@return42
Member

I've found these bugs:
You can see that there are two quotation marks (“ ”) in the result title, and the content inside seems to have disappeared.

This issue should now be fixed / I also added the:

@jazzzooo
Contributor Author

jazzzooo commented Oct 1, 2023

comp
Looks to me like there has been a regression in the region search :/ I've attached a comparison of commit 820ab68 to the left and bc4c32b to the right

@return42
Member

return42 commented Oct 1, 2023

Looks to me like there has been a regression in the region search :/

Not sure where you see a regression, when you go to bing.com and search in en-GB in their forms, you will get the results that are also shown in SearXNG in this branch (:en-GB). When you search in SearXNG and choose language :en you will get more/other results.

@return42 return42 merged commit 32a4ea3 into searxng:master Oct 1, 2023
@jazzzooo
Contributor Author

jazzzooo commented Oct 1, 2023

Well from my testing we are not getting the region results. Are you saying the right side of my previous screenshot is appropriate? Or that you cannot reproduce. These tests were with a US IP-address. Here I did the same test on bing.com, and we clearly get regional results when the UK region is selected. The cookies in the UK screenshot are _EDGE_CD=m=en-gb, _EDGE_S=F=1&SID=...&mkt=en-gb, SRCHHPGUSR=SRCHLANG=en&...
bingcomp
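
For reference, the cookie values listed above can be written down as a small sketch. The helper names `bing_cookies` and `cookie_header` are made up for illustration; only the three cookie names and the `m=`, `mkt=` and `SRCHLANG=` fields come from this comment, and the fields that are elided above (`F`, `SID`, ...) are simply left out.

```python
def bing_cookies(market, ui_lang):
    """Return the region/language cookies for one request (hypothetical helper)."""
    market = market.lower()  # the cookies above use lower-case codes like 'en-gb'
    return {
        "_EDGE_CD": f"m={market}",
        "_EDGE_S": f"mkt={market}",           # elided fields are omitted here
        "SRCHHPGUSR": f"SRCHLANG={ui_lang}",  # elided fields are omitted here
    }


def cookie_header(cookies):
    """Render a cookie dict as the value of a 'Cookie' request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


print(cookie_header(bing_cookies("en-GB", "en")))
```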

@jazzzooo
Contributor Author

jazzzooo commented Oct 2, 2023

@Bnyro if you have time maybe you could verify, as you're knowledgeable about bing... Try my test with a US IP if you have access to one. See if commit 820ab68 and bc4c32b differ in regional search. Maybe I did something wrong in my test :)

@return42
Member

return42 commented Oct 3, 2023

Well from my testing we are not getting the region results. Are you saying the right side of my previous screenshot is appropriate?

Sorry / no, the left side of #2822 (comment) is what I would expect for regional search.

With current master branch / From an IP located in DE, when I search for ..

  • election :en-gb --> https://www.parliament.uk and other .uk
  • election :en-us --> https://www.usa.gov and other US related

.. links on top of the result list.

These tests were with a US IP-address.

Do I understand you right, from this US IP you don't get:

  • election :en-gb --> https://www.parliament.uk and other .uk

.. links on top of the result list?

@jazzzooo
Contributor Author

jazzzooo commented Oct 3, 2023

@return42

Do I understand you right, from this US IP you don't get...

yep

I redid my testing again once with US IP and once with a German IP, and I restarted searx between every IP change. But it would be good if someone else can also test it.

820ab68:

US IP:

!bing election :en-us -> cnn, usa.gov
!bing election :en-gb -> bbc, parliament.uk

German IP:

!bing election :en-us -> cnn, usa.gov
!bing election :en-gb -> bbc, parliament.uk

bc4c32b

It seems a bit more nuanced than I thought, whatever locale you search with first, will stick until you restart the instance...

US IP:

!bing election :en-gb -> bbc, parliament.uk
!bing election :en-us -> bbc, parliament.uk

German IP:

!bing election :en-us -> cnn, usa.gov
!bing election :en-gb -> cnn, usa.gov

I don't observe any difference between US and German IP, so you should be able to reproduce this. I have no redis, I'm running with make run and default settings

@jazzzooo
Contributor Author

@return42 ping, this issue still exists and breaks region search on all public instances. I checked, and my version still works :) If you'd like, I can open an issue on this, but the code in this pr is relevant.

@return42 return42 requested a review from unixfox October 20, 2023 05:26
@return42
Member

@return42

Do I understand you right, from this US IP you don't get...

yep

I redid my testing again once with US IP and once with a German IP, and I restarted searx between every IP change. But it would be good if someone else can also test it.

Yes! .. I asked @unixfox and @dalf for a review.

To sum up the long thread .. we compare two patches bc4c32b..820ab68 both commits are in the branch of this PR.

The bing engine needed a review again and this PR was welcome to me, but the commit 820ab68 has in my opinion too much expansion (but maybe I am totally wrong / I looked too long at the bing engines to have a clear view anymore).

I then did a review of this PR and had to include a few things back into the implementation (albeit in a slightly different form) -> bc4c32b.

At the moment the focus in the discussion is on the languages/regions of the bing WEB engine ... but we must not lose sight of the other bing engines (news, videos, images). I have tested bc4c32b in all possible languages/regions (please also test zh-..) and got good results (from my German IPs).

However, @jazzzooo and I end up with very different ratings. This can have many reasons, which do not necessarily lie in the implementation; bing does not behave deterministically! In my opinion, the bing client is already broken (see #2822 (comment)) ... the market codes and languages are used at several parameters and cookies --> and M$ seems to have lost the overview itself.

In my opinion, there is no "absolutely right" or "absolutely wrong" ... we will only come to a conclusion that suits us best from experience .. this also requires a thorough reverse engineering of the bing client.

Here now a comparison of the results of bing-WEB for UK and US ... but I also point out again that we must not neglect languages like zh and also the other bing engines have to be tested and we have to test the paging support --> an evaluation should cover all these scenarios.

bc4c32b

It seems a bit more nuanced than I thought, whatever locale you search with first, will stick until you restart the instance...

Yeah, that's one big problem in testing (and running) my patch ..

Maybe it's best we remove my patch from this PR

but I would like to leave the decision to others ... @dalf and @unixfox have asked for more reviewers on PRs ..

US IP:

!bing election :en-gb -> bbc, parliament.uk
!bing election :en-us -> bbc, parliament.uk

German IP:

!bing election :en-us -> cnn, usa.gov
!bing election :en-gb -> cnn, usa.gov

I don't observe any difference between US and German IP, so you should be able to reproduce this. I have no redis, I'm running with make run and default settings

Not sure what you mean by "I don't observe any difference between US and German IP" .. ? .. your examples show that you got only UK-results on your US IP and only US-results on your german-IP

@return42 return42 requested a review from dalf October 20, 2023 08:34
@return42
Member

@return42 ping, this issue still exists and breaks region search on all public instances. I checked, and my version still works :) If you'd like, I can open an issue on this, but the code in this pr is relevant.

Maybe you would like to open a PR that reverts my bc4c32b

@return42
Member

@return42 ping, this issue still exists and breaks region search on all public instances. I checked, and my version still works :) If you'd like, I can open an issue on this, but the code in this pr is relevant.

@jazzzooo now I understand why we have such different experiences :) .. read & let's continue in:


Development

Successfully merging this pull request may close these issues.

bing engine does no longer support time-range safe-search and pagination Bug: Bing Engine
