-
Notifications
You must be signed in to change notification settings - Fork 154
Ray Video Reader Enhancement #848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: ray-api
Are you sure you want to change the base?
Conversation
…age and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality. Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…ing convention - Changed the internal name of VideoListStage from "video_reader" to "video_list". - Updated assertions in the test for VideoListStage to match the new name. - Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader". This ensures consistency across the codebase following the recent refactor. Signed-off-by: Ao Tang <aot@nvidia.com>
…" instead of "video_reader" Signed-off-by: Ao Tang <aot@nvidia.com>
…eoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase. Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…posite stage that combines VideoListStage and VideoReaderStage. Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…video_loading, video_reader, and related test files to use the new video module structure. Signed-off-by: Ao Tang <aot@nvidia.com>
/ok to test 7aa2ca4 |
…ng files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness. Signed-off-by: Ao Tang <aot@nvidia.com>
/ok to test b0e2439 |
|
||
|
||
@dataclass | ||
class VideoReaderDownloadStage(CompositeStage[_EmptyTask, VideoTask]): | ||
class VideoLoadingStage(CompositeStage[_EmptyTask, VideoTask]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what convention is right, and if we want to change it in future, but take a look here https://github.com/NVIDIA-NeMo/Curator/blob/ray-api/ray-curator/ray_curator/stages/io/reader/jsonl.py
We have a stage called JsonlReaderStage(ProcessingStage[FileGroupTask, DocumentBatch])
and then the composite stage is called JsonlReader(CompositeStage[_EmptyTask, DocumentBatch])
. Do we want to match that, and call it VideoReaderStage and VideoReader respectively?
@@ -0,0 +1,233 @@ | |||
"""Tests for FilePartitioningStage.""" | |||
|
|||
from pathlib import Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding these :)
""" | ||
reader_stage = VideoReaderStage( | ||
input_video_path=self.input_video_path, | ||
video_limit=self.video_limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nit should we also have video_limit be an int | None
similar to how you have it in FilePartitioningStage
|
||
def test_get_description_limited(self) -> None: | ||
"""Test get_description method with limited videos.""" | ||
stage = VideoLoadingStage( | ||
input_video_path="/test/videos", | ||
video_limit=25 | ||
) | ||
|
||
description = stage.get_description() | ||
expected = ( | ||
"Reads video files from '/test/videos' " | ||
"(limit: 25) " | ||
"and downloads/processes them with metadata extraction" | ||
) | ||
assert description == expected | ||
|
||
def test_get_description_zero_limit(self) -> None: | ||
"""Test get_description method with zero limit.""" | ||
stage = VideoLoadingStage( | ||
input_video_path="/test/videos", | ||
video_limit=0 | ||
) | ||
|
||
description = stage.get_description() | ||
expected = ( | ||
"Reads video files from '/test/videos' " | ||
"(limit: unlimited) " | ||
"and downloads/processes them with metadata extraction" | ||
) | ||
assert description == expected | ||
|
||
def test_inputs_outputs_delegation(self) -> None: | ||
"""Test that inputs/outputs are properly delegated to constituent stages.""" | ||
stage = VideoLoadingStage(input_video_path="/test/videos") | ||
|
||
# Should delegate to first stage for inputs | ||
inputs = stage.inputs() | ||
# FilePartitioningStage inputs should be empty | ||
assert inputs == ([], []) | ||
|
||
# Should delegate to last stage for outputs | ||
outputs = stage.outputs() | ||
# VideoReaderStage outputs | ||
assert outputs == (["data"], ["source_bytes", "metadata"]) | ||
|
||
def test_post_init_calls_super(self) -> None: | ||
"""Test that __post_init__ properly calls parent initialization.""" | ||
with patch("ray_curator.stages.base.CompositeStage.__init__") as mock_super_init: | ||
VideoLoadingStage(input_video_path="/test/videos") | ||
|
||
# Should have called parent __init__ | ||
mock_super_init.assert_called_once() | ||
|
||
def test_decompose_stage_independence(self) -> None: | ||
"""Test that each call to decompose returns independent stage instances.""" | ||
stage = VideoLoadingStage( | ||
input_video_path="/test/videos", | ||
video_limit=10, | ||
verbose=True | ||
) | ||
|
||
# Get two decompositions | ||
stages1 = stage.decompose() | ||
stages2 = stage.decompose() | ||
|
||
# Should be different instances | ||
assert stages1[0] is not stages2[0] | ||
assert stages1[1] is not stages2[1] | ||
|
||
# But should have same configuration | ||
assert stages1[0].file_paths == stages2[0].file_paths | ||
assert stages1[0].limit == stages2[0].limit | ||
assert stages1[1].verbose == stages2[1].verbose | ||
|
||
def test_decompose_preserves_parameters(self) -> None: | ||
"""Test that decompose preserves all input parameters correctly.""" | ||
stage = VideoLoadingStage( | ||
input_video_path="/complex/path/with spaces", | ||
video_limit=999, | ||
verbose=True | ||
) | ||
|
||
stages = stage.decompose() | ||
file_stage, reader_stage = stages | ||
|
||
# Ensure all parameters are correctly passed through | ||
assert file_stage.file_paths == "/complex/path/with spaces" | ||
assert file_stage.files_per_partition == 1 | ||
assert file_stage.limit == 999 | ||
assert set(file_stage.file_extensions) == {".mp4", ".mov", ".avi", ".mkv", ".webm"} | ||
|
||
assert reader_stage.verbose is True | ||
|
||
@pytest.mark.parametrize(("video_limit", "expected_limit"), [ | ||
(-1, -1), | ||
(0, 0), | ||
(1, 1), | ||
(100, 100), | ||
(999999, 999999) | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these tests feel very LLM-y to me, with no value add, unless I'm overlooking something here?
I think your test above of test_decompose_basic
covers most of the cases right, that we we decompose and that we indeed pass the correct values? Having a test for each parameter seems like an overkill.
|
||
# Additional comprehensive tests below: | ||
|
||
def test_download_video_bytes_permission_error(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same in this test file, feels very bloated. Can we see if theres stuff we don't need..
Description
This PR addresses the following issues:
tasks/__init__.py
Usage
# Add snippet demonstrating usage
Checklist