这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@sjrl
Copy link
Contributor

@sjrl sjrl commented Jul 29, 2022

Related Issue(s): Issue #2920

Proposed changes:

  • Update minimum requirements in setup.cfg
  • Change how the error is raised when running the crawler in google colab so the trace is more obvious

Pre-flight checklist

@sjrl sjrl requested a review from ZanSara July 29, 2022 08:55
@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

@ZanSara I did notice that with these changes that the following install on Google Colab

!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git@issue_crawler_colab#egg=farm-haystack[colab,crawler]

looks to hang perpetually in trying to resolve dependencies (this was the original issue brought up in discussion #2919). I'm still trying to pin down what is causing the problem. I haven't been able to reproduce this locally on my machine.

And without the changes in this PR the install is successful but the selenium version becomes 3.141.0 which is incompatible with the Crawler in Haystack

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@sjrl and @ZanSara I'm sorry to get into this PR.

What happens here is that when you get the colab option, you are sticking grpcio version. This makes selenium unable to get installed because it depends on newer version.

This chat on Slack mentioned Colab issues with Selenium issues: https://haystack-community.slack.com/archives/C01J3TZM9HT/p1656228551083839

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

What happens here is that when you get the colab option, you are sticking grpcio version. This makes selenium unable to get installed because it depends of newer version.

I think it must be a secondary dependency since I can't find it in their requirements file: https://github.com/SeleniumHQ/selenium/blob/trunk/py/requirements.txt @ZanSara Do you know if we could potentially bump the grpcio version in farm-haystack[colab]?

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

This chat on Slack mentioned Colab issues with Selenium issues: https://haystack-community.slack.com/archives/C01J3TZM9HT/p1656228551083839

Right, this issue was caused by the Selenium version 3 as you identified hence the attempted fix with the minimum version requirement. But it looks like we might also need to bump the grpcio version, is that right @danielbichuetti ?

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

Yes, the main issue is that is not a direct dependency issue. It's an indirect. When grpcio is set fixed, other dependencies that Selenium relies on, also get limited by the grpcio.

thinc 8.1.0 requires typing-extensions<4.2.0,>=3.7.4.1; python_version < "3.8", but you have typing-extensions 4.3.0 which is incompatible.
spacy 3.4.1 requires typing-extensions<4.2.0,>=3.7.4; python_version < "3.8", but you have typing-extensions 4.3.0 which is incompatible.
requests 2.23.0 requires idna<3,>=2.5, but you have idna 3.3 which is incompatible.
requests 2.23.0 requires urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you have urllib3 1.26.11 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

When using Colab I forced the installation of 4.3.0. Apparently no behavior was affected, but need further investigation, I suppose. @sjrl

@ZanSara
Copy link
Contributor

ZanSara commented Jul 29, 2022

Hey @sjrl and @danielbichuetti! The [colab] option was introduced due to some special features of Colab's environment. The original issue was that grpcio would be pre-installed, and if anyone tried to install a different version, a runtime restart was required.

To avoid the runtime restart, we pinned grpcio in the [colab] extra to match Colab's version. So if Colab is now giving issues because of that pin, please update it!

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@ZanSara @sjrl The dependency issue is this:

selenium 4.2.0 requires urllib3[secure,socks]~=1.26, but you have urllib3 1.25.11 which is incompatible.
And then:
requests 2.23.0 requires urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you have urllib3 1.26.11 which is incompatible.
And then:
google-colab 1.0.0 requires requests~=2.23.0

It's not the grpcio indeed, indeed the google-colab is. While investigating this, I found grpcio can be forced to 1.48. No Colab issues.

datascience package can also be forced to 0.17.5 safely on Colab.

On Colab repo we have this issue: googlecolab/colabtools#2604 where they discuss it. @craigcitro advises they are only warnings and not errors.

Maybe pin requests to a newer version. What do you think?

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

To avoid the runtime restart, we pinned grpcio in the [colab] extra to match Colab's version. So if Colab is now giving issues because of that pin, please update it!

The current version of grpcio in google colab is 1.47.0. So it sounds like it should be okay to pin it to 1.47.0. What do you think @danielbichuetti? And thanks for investigating for any potential issues.

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@sjrl The differences from grpcio 1.48 to 1.47 regarding gRPC on Python are small, but two bug fixes. I think updating it to 1.48.0 doesn't need a restart. At least, I did an upgrade, and it worked well without it.

I just found one issue on Google API Python client repo where they discuss a similar case. The lack of updates on the google-colab makes this kind of "issues" happen as stated there. So, my point of view is that requests could be forced to a newer version. I just built an updated environment and nothing gets broken. Crawler, fecthing urls, and so on.

colab =
    grpcio==1.48.0
    requests>=2.26.0

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

@danielbichuetti I'm unsure about forcing the requests to a new version if the original concern was that we do not want to require the user to restart the runtime after installing farm-haystack. What do you think @ZanSara?

And as @danielbichuetti points out the dependency issue is requests and not grpcio so updating the pin of grpcio to 1.47.0 doesn't resolve the hanging install.

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@sjrl Installing it manually !pip install --no-deps requests>=2.26.0 didn't require a restart. I'm unsure about how setuptools will handle this.

Another solution that would be possible:

  1. On Colab force Selenium to be lower than 4.0.0
  2. Change code to use old style parameters on Selenium, it's not yet deprecated on new versions, just tagged for it.

But, it is much complex then pinning requests.

@ZanSara
Copy link
Contributor

ZanSara commented Jul 29, 2022

I'm not sure I get what's the issue of pinning requests in [colab]. Does that solve the issue? Does it require a restart if added to setp.cfg? If not, that's the way to go imho

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

I'm not sure I get what's the issue of pinning requests in [colab]. Does that solve the issue? Does it require a restart if added to setp.cfg? If not, that's the way to go IMHO

The base problem now is that selenium>=4.0.0 requires urllib3~=1.26 but google-colab requires requests~=2.23.0 which requires urllib3<1.26 as I believe @danielbichuetti pointed out. So pinning requests in setup.cfg to a higher version like 2.26 throws an error when trying to pip install farm-haystack[colab].

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

base problem now is that selenium>=4.0.0 requires urllib3~=1.26 but google-colab requires requests~=2.23.0 which requires urllib3<1.26 as I believe @danielbichuetti pointed out. So pinning requests in setup.cfg to a higher version like 2.26 throws an error when trying to pip install farm-haystack[colab].

Indeed, setuptools can't handle this dependencies issue. At least I didn't find a way. Manually, it will work because of the no-deps flag.

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@sjrl After further investigation, I found one Selenium version which install without dependencies issues: 4.1.3

So, this should resolve the issue

crawler =
    selenium >=4.0.0,<=4.1.3, !=4.1.4 
    webdriver-manager

This remembered me why I "love" so much complex dependencies chains.
🥲

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

@danielbichuetti Thanks for looking into this. If version 4.1.3 is the only version that works I would suggest we pin that version exactly. What do you think @ZanSara ?

This remembered me why I "love" so much complex dependencies chains.

And yes this also reminds why I find google-colab to be frustratingly out of date when it comes to staying up to date on packages.

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@sjrl 😃

If version 4.1.3 is the only version that works I would suggest we pin that version exactly.

It's not the only version, it's the most updated one. The dependency on the new urllib3 was introduced on 4.1.4. It was a bad explanation, but the code is the correct scenario. So >=4.0.0 for the new parameter and <=4.1.3 because of the change in deps.

@sjrl
Copy link
Contributor Author

sjrl commented Jul 29, 2022

It's not the only version, it's the most updated one. The dependency on the new urllib3 was introduced on 4.1.4. It was a bad explanation, but the code is the correct scenario. So >=4.0.0 for the new parameter and <=4.1.3 because of the change in deps.

It looks like in version 4.0.0 that it requires urllib3~=1.26 https://github.com/SeleniumHQ/selenium/blob/3a2181467904af9043828cede13e5dc866c1af12/py/requirements.txt#L23

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@sjrl Indeed, I did a huge mistake when pasting versions. I was collecting some version issues that were possible with haystack. On 4.1.4 it was introduced one incompatibility with 3.7, but not on urllib3. The urllib3 pin to 1.26 is not present in Selenium 4.0.0-alpha-7

@danielbichuetti
Copy link
Contributor

@sjrl @ZanSara After that huge mistake, I segmented environments and started a direct test on Colab.

Currently, I could pin and make it work with Colab. Despise being an outdated version of Selenium, it's far better than v3 and won't throw errors:

The dependency pinned code is:

colab =
    grpcio==1.47.0
    selenium == 4.0.0.a7

Here is the working notebook.

I really apologize for that version issue.

@ZanSara
Copy link
Contributor

ZanSara commented Jul 29, 2022

Thanks to you for all the work!! 🙌 Let's apply these two pins for now, I think it's the best idea. @sjrl What do you think? It's a pin to an alpha version, but if it's the only one that works I think we can't do any better. This might be temporary anyway. I'll let you approve this PR 👍

@craigcitro
Copy link

Thanks for tagging me in!

Just catching up, but to ask: could we make anyone's life easier by loosening the requests version specifier in the google-colab package?

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

Hey @craigcitro! That's outstanding you joined us. 🎉

If you could do it, it would be great! Newer versions of urllib3 aren't compatible with 2.23.0. Many packages rely on more updated requests version, due to many bug fixes and improvements. I think even one Google API client had issues with this.

Indeed, this PR is mainly because of requests 2.23.0 requirement. Which forces Selenium back to version 3.0.

Despise being possible to update manually, setuptools see issues since it's a required dependency from google-colab. Would it be possible? The whole community would really appreciate that. 🥺

@craigcitro
Copy link

Done! I'm making the fix on our side, though it'll take a few days to go live in Colab.

In general, feel free to ask on https://github.com/googlecolab/colabtools/issues when you hit trouble! Always happy to help our users when we can. 😄

@danielbichuetti
Copy link
Contributor

danielbichuetti commented Jul 29, 2022

@craigcitro Thank you very much! I have no words for this 🥳

@sjrl
Copy link
Contributor Author

sjrl commented Aug 1, 2022

Just tried it out this morning and it does not look like google-colab has updated yet with the relaxed requirements for requests. I think we should wait until the changes take affect in colab before merging this PR, so I'll try periodically throughout the coming days.

@craigcitro
Copy link

Ah, I should have clarified that I meant business days -- our releases only happen on days we're there in case something goes sideways.

Change is live now.

@sjrl sjrl merged commit bde3261 into master Aug 3, 2022
@sjrl sjrl deleted the issue_crawler_colab branch August 3, 2022 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants