[bug] URL crawler ERROR - hostname scope restriction on link follows #136

@bibi-b

Hi there, I am running anything-llm in Docker and am having issues with the URL crawler on several websites.

For example, I tried to scan http://get-nord.de and got the errors below. The line
Working on http://get-nord.dehttps://www.linkedin.com/showcase/get-nord/?originalSubdomain=de...
looks wrong to me: an external absolute URL appears to have been appended directly to the base URL.

...

Working on http://get-nord.de...
Working on http://get-nord.de/besuchen/newsletter...
Working on http://get-nord.dehttps://www.linkedin.com/showcase/get-nord/?originalSubdomain=de...
Traceback (most recent call last):
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.10/http/client.py", line 1282, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
self.send(msg)
File "/usr/lib/python3.10/http/client.py", line 975, in send
self.connect()
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f59ad809810>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/collector/v-env/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "/app/collector/v-env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='get-nord.dehttps', port=80): Max retries exceeded with url: //www.linkedin.com/showcase/get-nord/?originalSubdomain=de (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f59ad809810>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/app/collector/main.py", line 80, in <module>
main()
File "/app/collector/main.py", line 56, in main
crawler()
File "/app/collector/scripts/link.py", line 92, in crawler
parse_links(links)
File "/app/collector/scripts/link.py", line 123, in parse_links
req = session.get(link, timeout=20)
File "/app/collector/v-env/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
File "/app/collector/v-env/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/app/collector/v-env/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/app/collector/v-env/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='get-nord.dehttps', port=80): Max retries exceeded with url: //www.linkedin.com/showcase/get-nord/?originalSubdomain=de (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f59ad809810>: Failed to establish a new connection: [Errno -2] Name or service not known'))
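
The host `get-nord.dehttps` in the error suggests that the crawler prefixes the base URL onto every `href` it finds, including ones that are already absolute. That is an assumption, since the code of `link.py` is not shown here. A minimal sketch of one possible fix, using the standard library's `urllib.parse.urljoin` to resolve links and an optional same-hostname filter, could look like this. The function name `resolve_links` and the `same_host_only` parameter are hypothetical and not part of anything-llm.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs, same_host_only=True):
    """Resolve raw href values against base_url.

    Hypothetical helper, not taken from anything-llm's link.py.
    """
    base_host = urlparse(base_url).hostname
    resolved = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        # Skip mailto:, javascript:, tel: and other non-HTTP schemes
        if parsed.scheme not in ("http", "https"):
            continue
        # Optionally restrict the crawl to the starting hostname
        if same_host_only and parsed.hostname != base_host:
            continue
        resolved.append(absolute)
    return resolved


if __name__ == "__main__":
    hrefs = [
        "/besuchen/newsletter",
        "https://www.linkedin.com/showcase/get-nord/?originalSubdomain=de",
    ]
    for url in resolve_links("http://get-nord.de", hrefs):
        print("Working on " + url + "...")
```

With `same_host_only=True` the LinkedIn link is skipped; with `same_host_only=False` it is kept as a valid absolute URL instead of being glued onto the base URL.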
