Adding url uploads to document picker #375
Conversation
collector/scripts/link.py

In the def process_single_link(url):

if not url:
    return False, "Invalid URL!", None
session = HTMLSession()
req = session.get(url)
if not req.ok:
    return False, "Could not reach this URL.", None
# Render the page so JavaScript-generated content ends up in the HTML.
req.html.render()
with tempfile.NamedTemporaryFile(mode="w") as tmp:
    # Write the rendered HTML to a temp file so UnstructuredHTMLLoader can parse it from disk.
    tmp.write(req.html.html)
    tmp.seek(0)
    loader = UnstructuredHTMLLoader(tmp.name)
    data = loader.load()[0]
full_text = data.page_content
if not full_text:
    return False, "Could not parse any meaningful data from this URL.", None
link_meta = append_meta(req, full_text, True)
token_count = len(tokenize(full_text))
link_meta['pageContent'] = full_text
link_meta['token_count_estimate'] = token_count
return True, None, link_meta

In the def process_single_link(url):
if not url:
    return False, "Invalid URL!", None
session = HTMLSession()
req = session.get(url)
if not req.ok:
    return False, "Could not reach this URL.", None
req.html.render()
# Capture the rendered HTML before opening the temp file.
html_content = req.html.html
with tempfile.NamedTemporaryFile(mode="w") as tmp:
    tmp.write(html_content)
    tmp.seek(0)
    loader = UnstructuredHTMLLoader(tmp.name)
    data = loader.load()[0]
full_text = data.page_content
if not full_text:
    return False, "Could not parse any meaningful data from this URL.", None
link_meta = append_meta(req, full_text, True)
token_count = len(tokenize(full_text))
link_meta['pageContent'] = full_text
link_meta['token_count_estimate'] = token_count
return True, None, link_meta
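For context, a minimal sketch of how the (success, reason, metadata) tuple returned above could be consumed by a calling script. The save_link_as_document helper, output directory, and file name are assumptions for illustration, not the PR's actual manual upload script.

import json
import os

def save_link_as_document(url, out_dir="collector/outputs/custom-documents"):
    # Hypothetical helper: the output path and naming here are assumptions.
    success, reason, link_meta = process_single_link(url)
    if not success:
        print(f"Skipping {url}: {reason}")
        return None
    os.makedirs(out_dir, exist_ok=True)
    # link_meta already carries 'pageContent' and 'token_count_estimate',
    # set at the end of process_single_link above.
    out_path = os.path.join(out_dir, "scraped-link.json")
    with open(out_path, "w") as f:
        json.dump(link_meta, f, ensure_ascii=False, indent=2)
    return out_path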
no autocomplete on URLs
* WIP adding url uploads to document picker
* fix manual script for uploading url to custom-documents
* fix metadata for url scraping
* wip url parsing
* update how async link scraping works
* docker-compose defaults added

no autocomplete on URLs

Co-authored-by: timothycarambat <rambat1010@gmail.com>
resolves #363
Adds an input where users can paste links to scrape and have them inserted into their documents.
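One of the commit messages above mentions updating how async link scraping works. As a rough sketch only, here is one way multiple submitted links could be fanned out over process_single_link; the thread-pool approach and the process_links name are assumptions, not the PR's actual implementation.

from concurrent.futures import ThreadPoolExecutor

def process_links(urls, max_workers=4):
    # Hypothetical batch helper: fans each URL out to process_single_link
    # and keeps only the successfully scraped metadata.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for success, reason, link_meta in pool.map(process_single_link, urls):
            if success:
                results.append(link_meta)
            else:
                print(f"Link skipped: {reason}")
    return results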