Add YouTube Transcript Pulling to `scrapeGenericUrl` & Improve Web Scraper Tool Introspection #4537

angelplusultra · 2025-10-13T21:24:58Z

Pull Request Type

Relevant Issues

resolves #4508

What is in this change?

This PR introduces two main changes:

YouTube Transcript Support — Adds the ability to pull YouTube video transcripts using the scrapeGenericUrl function.
Improved Introspection Logging — Refactors the scrape function in the web scraper agent tool to provide more specific introspection logs, detailing exactly what the scraper is doing for a given resource.

What These Changes Enable

YouTube Transcript Support

With these updates, an LLM invoked with @agent can automatically fetch a YouTube transcript through the web_scraper tool when given a video URL.

Example usage:

@agent Please summarize this video https://www.youtube.com/watch?v=B_H1DxOI6Xs

Additionally, users can now pass a YouTube video URL directly into the URL input field within the RAG document modal to create a document from that video, effectively bypassing the need for the dedicated YouTube data connector.
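Both entry points depend on recognizing that a URL points at a YouTube video. As a rough illustration only, a check along these lines could do that. The name `looksLikeYouTubeVideoUrl`, the host list, and the 11-character video ID pattern are assumptions made for this sketch, not the utility added in this PR.

```javascript
// Hypothetical helper for illustration; not the PR's actual validation utility.
const YOUTUBE_HOSTS = new Set([
  "youtube.com",
  "www.youtube.com",
  "m.youtube.com",
  "youtu.be",
]);

// Assumption: video IDs follow the commonly seen 11-character format.
const VIDEO_ID = /^[A-Za-z0-9_-]{11}$/;

function looksLikeYouTubeVideoUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (!["http:", "https:"].includes(url.protocol)) return false;
  if (!YOUTUBE_HOSTS.has(url.hostname)) return false;

  // Short links: https://youtu.be/<id>
  if (url.hostname === "youtu.be") {
    return VIDEO_ID.test(url.pathname.slice(1));
  }
  // Standard watch links: https://www.youtube.com/watch?v=<id>
  if (url.pathname === "/watch") {
    return VIDEO_ID.test(url.searchParams.get("v") ?? "");
  }
  // Embed and Shorts links: /embed/<id> and /shorts/<id>
  const match = url.pathname.match(/^\/(embed|shorts)\/([^/]+)\/?$/);
  return match !== null && VIDEO_ID.test(match[2]);
}
```

Matching on the parsed hostname and path, rather than on a substring of the raw URL, keeps channel pages, the YouTube home page, and look-alike domains from being treated as videos.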

Improved Introspection Logging

When @agent calls the web_scraper tool and passes in a URL, the tool first determines what kind of resource it is by analyzing the URL itself and making a HEAD request to retrieve its Content-Type header. Based on this information, the introspection logs inform the user whether the tool will:

Pull the transcript and metadata for the YouTube video (if the user provides a YouTube video URL)
Read the content of the file (if the URL responds with a non-HTML content type)
Scrape the content of the web page (if the URL responds with HTML)
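The decision flow above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `chooseStrategy`, `classifyResource`, the strategy labels, the message strings, and the 5-second default timeout are all hypothetical, and treating a missing Content-Type as HTML is an assumption of the sketch. It relies on the global `fetch` and `AbortSignal.timeout` available in Node 18 and later.

```javascript
// Hypothetical sketch; names, labels, and messages are illustrative only.

// Pure decision step: map the URL check and Content-Type to a strategy.
function chooseStrategy({ isYouTubeVideo, contentType }) {
  if (isYouTubeVideo) return "youtube-transcript";
  // Strip parameters such as "; charset=utf-8" before comparing.
  const mime = (contentType ?? "").split(";")[0].trim().toLowerCase();
  // Assumption: a missing Content-Type is treated like HTML.
  if (mime === "" || mime === "text/html" || mime === "application/xhtml+xml") {
    return "scrape-page";
  }
  return "read-file";
}

// Illustrative introspection messages, one per strategy.
const INTROSPECTION = {
  "youtube-transcript": "Pulling the transcript and metadata for the YouTube video",
  "read-file": "Reading the content of the file",
  "scrape-page": "Scraping the content of the web page",
};

async function classifyResource(url, { isYouTubeVideo = false, timeoutMs = 5000 } = {}) {
  let contentType = null;
  if (!isYouTubeVideo) {
    try {
      // HEAD avoids downloading the body; the timeout bounds slow hosts.
      const res = await fetch(url, {
        method: "HEAD",
        signal: AbortSignal.timeout(timeoutMs),
      });
      contentType = res.headers.get("content-type");
    } catch {
      // HEAD failed or timed out; fall through with no Content-Type.
    }
  }
  const strategy = chooseStrategy({ isYouTubeVideo, contentType });
  return { strategy, message: INTROSPECTION[strategy] };
}
```

Keeping the decision in a pure function separate from the network call makes the branching easy to unit test without issuing real requests.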

Additional Information

Developer Validations

I ran yarn lint from the root of the repo & committed changes
Relevant documentation has been updated
I have tested my code functionality
Docker build succeeds locally

- Introduced functionality to handle YouTube URLs by validating them and fetching video transcripts.
- Updated the `processVia` logic to include a new option for processing YouTube video transcripts.
- Enhanced the scraping function to format and return transcript content as a document if required.
- Added a utility function to validate YouTube URLs.

angelplusultra added 5 commits October 10, 2025 11:41

Refactor agent introspection logs and

1be21f4

Add timeout to head request in scrape func

8823bc1

remove log

858faa5

Add more robust youtube url validation logic to both scrape fn and collector

c17121d

angelplusultra linked an issue Oct 13, 2025 that may be closed by this pull request

[FEAT]: @agent YouTube Transcript Analysis #4508

Closed

angelplusultra requested a review from timothycarambat October 13, 2025 21:26

angelplusultra assigned timothycarambat Oct 13, 2025

angelplusultra added 4 commits October 14, 2025 09:15

Fix bug in path matching

75ec5a6

add tests for isYouTubeUrl

09df8e4

Rename isYouTubeUrl to isYouTubeVideoUrl for clarity and update related references in the generic URL scraper.

5bcecb4

Update tests to reflect renaming of isYouTubeUrl to isYouTubeVideoUrl, ensuring all references are consistent in the URL validation tests.

dfa79f1

angelplusultra added the PR:needs review Needs review by core team label Oct 15, 2025

timothycarambat mentioned this pull request Oct 15, 2025

Add ability to auto-handle YT video URLs in uploader & chat #4547

Merged

10 tasks

timothycarambat closed this in #4547 Oct 15, 2025
