这是indexloc提供的服务,不要输入任何密码
Skip to content

Ray Video Reader Enhancement #848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: ray-api
Choose a base branch
from
Open

Conversation

suiyoubi
Copy link
Contributor

@suiyoubi suiyoubi commented Jul 23, 2025

Description

This PR addresses the following issues:

  • VideoDownloadStage → VideoReaderStage
  • VideoReaderStage -> Removed and use existing FilePartitioningStage
  • VideoDownloadReaderStage -> VideoLoadingStage
  • remove video from tasks/__init__.py
  • remove SplitPipeTask
  • Add a limit attribute in FilePartitioningStage
  • Add Unit Test for FilePartitioningStage

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

suiyoubi added 9 commits July 23, 2025 07:21
…age and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality.

Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…ing convention

- Changed the internal name of VideoListStage from "video_reader" to "video_list".
- Updated assertions in the test for VideoListStage to match the new name.
- Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader".

This ensures consistency across the codebase following the recent refactor.

Signed-off-by: Ao Tang <aot@nvidia.com>
…" instead of "video_reader"

Signed-off-by: Ao Tang <aot@nvidia.com>
…eoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase.

Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…posite stage that combines VideoListStage and VideoReaderStage.

Signed-off-by: Ao Tang <aot@nvidia.com>
…video_loading, video_reader, and related test files to use the new video module structure.

Signed-off-by: Ao Tang <aot@nvidia.com>
Copy link

copy-pr-bot bot commented Jul 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Ao Tang <aot@nvidia.com>
@suiyoubi
Copy link
Contributor Author

/ok to test 7aa2ca4

…ng files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness.

Signed-off-by: Ao Tang <aot@nvidia.com>
@suiyoubi
Copy link
Contributor Author

/ok to test b0e2439



@dataclass
class VideoReaderDownloadStage(CompositeStage[_EmptyTask, VideoTask]):
class VideoLoadingStage(CompositeStage[_EmptyTask, VideoTask]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what convention is right, and if we want to change it in future, but take a look here https://github.com/NVIDIA-NeMo/Curator/blob/ray-api/ray-curator/ray_curator/stages/io/reader/jsonl.py

We have a stage called JsonlReaderStage(ProcessingStage[FileGroupTask, DocumentBatch]) and then the composite stage is called JsonlReader(CompositeStage[_EmptyTask, DocumentBatch]). Do we want to match that, and call it VideoReaderStage and VideoReader respectively?

@@ -0,0 +1,233 @@
"""Tests for FilePartitioningStage."""

from pathlib import Path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding these :)

"""
reader_stage = VideoReaderStage(
input_video_path=self.input_video_path,
video_limit=self.video_limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit should we also have video_limit be an int | None similar to how you have it in FilePartitioningStage

Comment on lines +92 to +221

def test_get_description_limited(self) -> None:
"""Test get_description method with limited videos."""
stage = VideoLoadingStage(
input_video_path="/test/videos",
video_limit=25
)

description = stage.get_description()
expected = (
"Reads video files from '/test/videos' "
"(limit: 25) "
"and downloads/processes them with metadata extraction"
)
assert description == expected

def test_get_description_zero_limit(self) -> None:
"""Test get_description method with zero limit."""
stage = VideoLoadingStage(
input_video_path="/test/videos",
video_limit=0
)

description = stage.get_description()
expected = (
"Reads video files from '/test/videos' "
"(limit: unlimited) "
"and downloads/processes them with metadata extraction"
)
assert description == expected

def test_inputs_outputs_delegation(self) -> None:
"""Test that inputs/outputs are properly delegated to constituent stages."""
stage = VideoLoadingStage(input_video_path="/test/videos")

# Should delegate to first stage for inputs
inputs = stage.inputs()
# FilePartitioningStage inputs should be empty
assert inputs == ([], [])

# Should delegate to last stage for outputs
outputs = stage.outputs()
# VideoReaderStage outputs
assert outputs == (["data"], ["source_bytes", "metadata"])

def test_post_init_calls_super(self) -> None:
"""Test that __post_init__ properly calls parent initialization."""
with patch("ray_curator.stages.base.CompositeStage.__init__") as mock_super_init:
VideoLoadingStage(input_video_path="/test/videos")

# Should have called parent __init__
mock_super_init.assert_called_once()

def test_decompose_stage_independence(self) -> None:
"""Test that each call to decompose returns independent stage instances."""
stage = VideoLoadingStage(
input_video_path="/test/videos",
video_limit=10,
verbose=True
)

# Get two decompositions
stages1 = stage.decompose()
stages2 = stage.decompose()

# Should be different instances
assert stages1[0] is not stages2[0]
assert stages1[1] is not stages2[1]

# But should have same configuration
assert stages1[0].file_paths == stages2[0].file_paths
assert stages1[0].limit == stages2[0].limit
assert stages1[1].verbose == stages2[1].verbose

def test_decompose_preserves_parameters(self) -> None:
"""Test that decompose preserves all input parameters correctly."""
stage = VideoLoadingStage(
input_video_path="/complex/path/with spaces",
video_limit=999,
verbose=True
)

stages = stage.decompose()
file_stage, reader_stage = stages

# Ensure all parameters are correctly passed through
assert file_stage.file_paths == "/complex/path/with spaces"
assert file_stage.files_per_partition == 1
assert file_stage.limit == 999
assert set(file_stage.file_extensions) == {".mp4", ".mov", ".avi", ".mkv", ".webm"}

assert reader_stage.verbose is True

@pytest.mark.parametrize(("video_limit", "expected_limit"), [
(-1, -1),
(0, 0),
(1, 1),
(100, 100),
(999999, 999999)
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these tests feel very LLM-y to me, with no value add, unless I'm overlooking something here?

I think your test above of test_decompose_basic covers most of the cases right, that we we decompose and that we indeed pass the correct values? Having a test for each parameter seems like an overkill.


# Additional comprehensive tests below:

def test_download_video_bytes_permission_error(self) -> None:
Copy link
Contributor

@praateekmahajan praateekmahajan Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same in this test file, feels very bloated. Can we see if theres stuff we don't need..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants