Ray Video Reader Enhancement #848

suiyoubi · 2025-07-23T15:07:32Z

Description

This PR addresses the following issues:

VideoDownloadStage → VideoReaderStage
VideoReaderStage -> Removed and use existing FilePartitioningStage
VideoDownloadReaderStage -> VideoLoadingStage
remove video from tasks/__init__.py
remove SplitPipeTask
Add a limit attribute in FilePartitioningStage
Add Unit Test for FilePartitioningStage

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

…age and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality. Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

…ing convention - Changed the internal name of VideoListStage from "video_reader" to "video_list". - Updated assertions in the test for VideoListStage to match the new name. - Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader". This ensures consistency across the codebase following the recent refactor. Signed-off-by: Ao Tang <aot@nvidia.com>

…" instead of "video_reader" Signed-off-by: Ao Tang <aot@nvidia.com>

…eoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase. Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

…posite stage that combines VideoListStage and VideoReaderStage. Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

…video_loading, video_reader, and related test files to use the new video module structure. Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot · 2025-07-23T15:07:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Ao Tang <aot@nvidia.com>

suiyoubi · 2025-07-23T15:14:32Z

/ok to test 7aa2ca4

…ng files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness. Signed-off-by: Ao Tang <aot@nvidia.com>

suiyoubi · 2025-07-23T17:10:27Z

/ok to test b0e2439

praateekmahajan · 2025-07-24T16:52:47Z

ray-curator/ray_curator/stages/video/io/video_loading.py



 @dataclass
-class VideoReaderDownloadStage(CompositeStage[_EmptyTask, VideoTask]):
+class VideoLoadingStage(CompositeStage[_EmptyTask, VideoTask]):


I don't know what convention is right, and if we want to change it in future, but take a look here https://github.com/NVIDIA-NeMo/Curator/blob/ray-api/ray-curator/ray_curator/stages/io/reader/jsonl.py

We have a stage called JsonlReaderStage(ProcessingStage[FileGroupTask, DocumentBatch]) and then the composite stage is called JsonlReader(CompositeStage[_EmptyTask, DocumentBatch]). Do we want to match that, and call it VideoReaderStage and VideoReader respectively?

praateekmahajan · 2025-07-24T16:54:13Z

ray-curator/tests/stages/io/reader/test_file_partitioning.py

@@ -0,0 +1,233 @@
+"""Tests for FilePartitioningStage."""
+
+from pathlib import Path


Thank you for adding these :)

praateekmahajan · 2025-07-24T16:55:59Z

ray-curator/ray_curator/stages/video/io/video_loading.py

        """
-        reader_stage = VideoReaderStage(
-            input_video_path=self.input_video_path,
-            video_limit=self.video_limit


Super nit should we also have video_limit be an int | None similar to how you have it in FilePartitioningStage

praateekmahajan · 2025-07-24T16:59:06Z

ray-curator/tests/stages/video/io/test_video_loading.py

+
+    def test_get_description_limited(self) -> None:
+        """Test get_description method with limited videos."""
+        stage = VideoLoadingStage(
+            input_video_path="/test/videos",
+            video_limit=25
+        )
+
+        description = stage.get_description()
+        expected = (
+            "Reads video files from '/test/videos' "
+            "(limit: 25) "
+            "and downloads/processes them with metadata extraction"
+        )
+        assert description == expected
+
+    def test_get_description_zero_limit(self) -> None:
+        """Test get_description method with zero limit."""
+        stage = VideoLoadingStage(
+            input_video_path="/test/videos",
+            video_limit=0
+        )
+
+        description = stage.get_description()
+        expected = (
+            "Reads video files from '/test/videos' "
+            "(limit: unlimited) "
+            "and downloads/processes them with metadata extraction"
+        )
+        assert description == expected
+
+    def test_inputs_outputs_delegation(self) -> None:
+        """Test that inputs/outputs are properly delegated to constituent stages."""
+        stage = VideoLoadingStage(input_video_path="/test/videos")
+
+        # Should delegate to first stage for inputs
+        inputs = stage.inputs()
+        # FilePartitioningStage inputs should be empty
+        assert inputs == ([], [])
+
+        # Should delegate to last stage for outputs
+        outputs = stage.outputs()
+        # VideoReaderStage outputs
+        assert outputs == (["data"], ["source_bytes", "metadata"])
+
+    def test_post_init_calls_super(self) -> None:
+        """Test that __post_init__ properly calls parent initialization."""
+        with patch("ray_curator.stages.base.CompositeStage.__init__") as mock_super_init:
+            VideoLoadingStage(input_video_path="/test/videos")
+
+            # Should have called parent __init__
+            mock_super_init.assert_called_once()
+
+    def test_decompose_stage_independence(self) -> None:
+        """Test that each call to decompose returns independent stage instances."""
+        stage = VideoLoadingStage(
+            input_video_path="/test/videos",
+            video_limit=10,
+            verbose=True
+        )
+
+        # Get two decompositions
+        stages1 = stage.decompose()
+        stages2 = stage.decompose()
+
+        # Should be different instances
+        assert stages1[0] is not stages2[0]
+        assert stages1[1] is not stages2[1]
+
+        # But should have same configuration
+        assert stages1[0].file_paths == stages2[0].file_paths
+        assert stages1[0].limit == stages2[0].limit
+        assert stages1[1].verbose == stages2[1].verbose
+
+    def test_decompose_preserves_parameters(self) -> None:
+        """Test that decompose preserves all input parameters correctly."""
+        stage = VideoLoadingStage(
+            input_video_path="/complex/path/with spaces",
+            video_limit=999,
+            verbose=True
+        )
+
+        stages = stage.decompose()
+        file_stage, reader_stage = stages
+
+        # Ensure all parameters are correctly passed through
+        assert file_stage.file_paths == "/complex/path/with spaces"
+        assert file_stage.files_per_partition == 1
+        assert file_stage.limit == 999
+        assert set(file_stage.file_extensions) == {".mp4", ".mov", ".avi", ".mkv", ".webm"}
+
+        assert reader_stage.verbose is True
+
+    @pytest.mark.parametrize(("video_limit", "expected_limit"), [
+        (-1, -1),
+        (0, 0),
+        (1, 1),
+        (100, 100),
+        (999999, 999999)
+    ])


All these tests feel very LLM-y to me, with no value add, unless I'm overlooking something here?

I think your test above of test_decompose_basic covers most of the cases right, that we we decompose and that we indeed pass the correct values? Having a test for each parameter seems like an overkill.

praateekmahajan · 2025-07-24T17:02:51Z

ray-curator/tests/stages/video/io/test_video_reader.py

+
+    # Additional comprehensive tests below:
+
+    def test_download_video_bytes_permission_error(self) -> None:


Same in this test file, feels very bloated. Can we see if theres stuff we don't need..

suiyoubi added 9 commits July 23, 2025 07:21

Refactor video reading stages: Rename VideoReaderStage to VideoListSt…

67eeda4

…age and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality. Signed-off-by: Ao Tang <aot@nvidia.com>

Rename test_video_reader to test_video_list

29a5510

Signed-off-by: Ao Tang <aot@nvidia.com>

Update test assertions in VideoReaderDownloadStage to use "video_list…

85f6eb0

…" instead of "video_reader" Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor video processing stages: Replace VideoDownloadStage with Vid…

302f0d2

…eoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase. Signed-off-by: Ao Tang <aot@nvidia.com>

Enhance VideoListStage and VideoReaderStage documentation

e33c117

Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor video reading pipeline: Introduce VideoLoadingStage as a com…

841793b

…posite stage that combines VideoListStage and VideoReaderStage. Signed-off-by: Ao Tang <aot@nvidia.com>

Remove SplitPipeTask from video module and update imports accordingly.

b3f4b02

Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor video task imports: Update import statements in video_list, …

41de3dd

…video_loading, video_reader, and related test files to use the new video module structure. Signed-off-by: Ao Tang <aot@nvidia.com>

ruff fix

7aa2ca4

Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot bot temporarily deployed to test July 23, 2025 15:14 Inactive

suiyoubi requested a review from praateekmahajan July 23, 2025 15:14

copy-pr-bot bot temporarily deployed to nemo-ci July 23, 2025 15:14 Inactive

copy-pr-bot bot temporarily deployed to test July 23, 2025 17:10 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci July 23, 2025 17:10 Inactive

praateekmahajan reviewed Jul 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Ray Video Reader Enhancement #848

Ray Video Reader Enhancement #848

Uh oh!

suiyoubi commented Jul 23, 2025 •

edited

Loading

Uh oh!

copy-pr-bot bot commented Jul 23, 2025

Uh oh!

suiyoubi commented Jul 23, 2025

Uh oh!

suiyoubi commented Jul 23, 2025

Uh oh!

praateekmahajan Jul 24, 2025

Uh oh!

praateekmahajan Jul 24, 2025

Uh oh!

praateekmahajan Jul 24, 2025

Uh oh!

praateekmahajan Jul 24, 2025

Uh oh!

praateekmahajan Jul 24, 2025 •

edited

Loading

Uh oh!

Uh oh!

		@@ -0,0 +1,233 @@
		"""Tests for FilePartitioningStage."""

		from pathlib import Path


		# Additional comprehensive tests below:

		def test_download_video_bytes_permission_error(self) -> None:

Ray Video Reader Enhancement #848

Are you sure you want to change the base?

Ray Video Reader Enhancement #848

Uh oh!

Conversation

suiyoubi commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Usage

Checklist

Uh oh!

copy-pr-bot bot commented Jul 23, 2025

Uh oh!

suiyoubi commented Jul 23, 2025

Uh oh!

suiyoubi commented Jul 23, 2025

Uh oh!

praateekmahajan Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

praateekmahajan Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

praateekmahajan Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

praateekmahajan Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

praateekmahajan Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

suiyoubi commented Jul 23, 2025 •

edited

Loading

praateekmahajan Jul 24, 2025 •

edited

Loading