Add video splitting pipeline with fixed stride extraction and transcoding Stage #783

suiyoubi · 2025-07-10T16:08:05Z

Note: this requires #775 to be merged first (the base is currently set to aot/ray-video-reader instead of ray-api).

Introduced video_split_clip_example.py to demonstrate video splitting functionality.
Added ClipTranscodingStage and FixedStrideExtractorStage for processing video clips.
Implemented command-line arguments for configuring video processing parameters.
Created utility functions for grouping iterables in grouping.py.
Added unit tests for the new stages in test_clip_transcoding_stage.py and test_fixed_stride_extractor_stage.py.

Description

Usage

python Curator/ray-curator/ray_curator/examples/video/video_split_clip_example.py --debug --verbose

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-07-10T16:08:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

abhinavg4 · 2025-07-21T05:28:46Z

ray-curator/ray_curator/examples/video/video_split_clip_example.py

+    parser.add_argument("--verbose", action="store_true", default=False)
+    parser.add_argument("--output-clip-path", type=str, default="/mnt/mint/output")
+    parser.add_argument(
+        "--no-upload-clips",


Is this argument used anywhere?

This will be used once we have S3 client support; I just added it here in advance.

nit: As a general practice, let's add things when the time comes; otherwise the diff becomes hard to reason about today (and in the future). Similarly, the Video dataclass has more fields than the currently merged stages need, which makes it hard to work out what's needed where.

abhinavg4

A couple of minor comments, but I think the major blockers are:

Copy stats
Entire_gpu should not be present

abhinavg4 · 2025-07-21T05:35:30Z

ray-curator/ray_curator/stages/video/clipping/clip_extraction_stages.py

+        """Resource requirements for this stage."""
+        if self.encoder == "h264_nvenc" or self.use_hwaccel:
+            # TODO: support partial GPU usage
+            return Resources(entire_gpu=True)


Can you please add your name to the TODO so that we know it's our TODO and not Cosmos Curate's?

I think this should be Resources(gpus=1) rather than entire_gpu. The difference is in the NVENCs and NVDECs: entire_gpu gives you the NVENCs and NVDECs as well, whereas gpus=1 does not.

Also, just wondering if there are any blockers to making this partial GPU?

Modify this to use the partial GPU now

…ntegrate new functionalities - Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process. - Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability. - Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage. These changes improve the clarity and efficiency of video processing within the ray-curator framework. Signed-off-by: Ao Tang <aot@nvidia.com>

- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing. - Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware. - Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings. These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation. Signed-off-by: [Your Name] <your.email@example.com> Signed-off-by: Ao Tang <aot@nvidia.com>

sarahyurick · 2025-07-22T20:42:07Z

/ok to test c6a8a1b

abhinavg4

Looks good. But I think verbose is still pending.

praateekmahajan · 2025-07-23T00:35:07Z

ray-curator/ray_curator/utils/grouping.py

+T = typing.TypeVar("T")
+
+
+def split_by_chunk_size(


Can you add tests for these?

praateekmahajan · 2025-07-23T00:37:39Z

ray-curator/ray_curator/examples/video/video_split_clip_example.py

@@ -0,0 +1,267 @@
+import argparse


I think we don't want to commit these? Let's check with Arham and team on what the final decision is. One example per module is going to be some serious bloat. We can have an integration test if we want to "test it".

If you mean this file, I actually directly use this file to integrate the rest of the modules.

I mean you can keep it locally but need not push.

suiyoubi · 2025-07-23T13:26:34Z

/ok to test 9f39885

suiyoubi · 2025-07-23T13:31:35Z

/ok to test 8472134

- Added `ray_stage_spec` method to `ClipTranscodingStage`, `VideoDownloadStage`, and `VideoReaderStage` to define stage characteristics for Ray integration. - Updated input and output methods in `ClipTranscodingStage` to include additional input parameters. - Modified `SplitPipeTask` to return properties from `data` instead of `video`, ensuring consistency in task data handling. - Added unit tests to verify the correctness of the new `ray_stage_spec` implementations. These changes improve the integration of video processing stages with Ray's architecture and enhance test coverage for the new functionalities. Signed-off-by: Ao Tang <aot@nvidia.com>

praateekmahajan · 2025-07-24T22:03:01Z

ray-curator/ray_curator/stages/video/clipping/clip_extraction_stages.py

+        encoder_threads: Number of threads per encoder.
+        encode_batch_size: Number of clips to encode in parallel.
+        nb_streams_per_gpu: Number of streams per GPU.
+        use_hwaccel: Whether to use hardware acceleration.


It seems that even when use_hwaccel is False we still use Resources(gpus=1) (at line 76). Can you elaborate on what happens there?

praateekmahajan · 2025-07-24T22:03:49Z

ray-curator/ray_curator/stages/video/clipping/clip_extraction_stages.py

+    def resources(self) -> Resources:
+        """Resource requirements for this stage."""
+        if self.encoder == "h264_nvenc" or self.use_hwaccel:
+            if self.nb_streams_per_gpu > 0:
+                # Assume that we have same type of GPUs
+                gpu_info = _get_local_gpu_info()[0]
+                nvencs = _make_gpu_resources_from_gpu_name(gpu_info.name).num_nvencs
+                gpu_memory_gb = _get_gpu_memory_gb()
+                return Resources(nvencs=nvencs // self.nb_streams_per_gpu, gpu_memory_gb=gpu_memory_gb // self.nb_streams_per_gpu)
+            else:
+                return Resources(gpus=1)
+
+        return Resources(cpus=self.num_cpus_per_worker)


Let's not override the property; instead, override self._resources in post_init. If you override the property, we'll end up with a weird case when someone calls ClipTranscodingStage.with(resources=???).
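For context, a minimal sketch of the pattern suggested here. The `Resources`, `Stage`, `TranscodeStage`, and `with_resources` names below are hypothetical stand-ins written for illustration, not the ray-curator classes, and the `libx264` default and CPU count are assumptions. The point is that when the default is computed once in `__post_init__` and stored in `_resources`, a later override is just another assignment to the same attribute; a subclass that overrides the `resources` property would silently ignore it.

```python
import copy
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for the framework's resource description."""
    cpus: float = 1.0
    gpus: float = 0.0


@dataclass
class Stage:
    """Hypothetical base stage: resources live in a plain attribute."""
    _resources: Resources = field(default_factory=Resources, init=False, repr=False)

    @property
    def resources(self) -> Resources:
        return self._resources

    def with_resources(self, resources: Resources) -> "Stage":
        # Return a shallow copy whose resources are replaced by the caller's value.
        clone = copy.copy(self)
        clone._resources = resources
        return clone


@dataclass
class TranscodeStage(Stage):
    """Illustrative stage that derives its default resources from its config."""
    encoder: str = "libx264"          # assumed default, for illustration only
    use_hwaccel: bool = False
    num_cpus_per_worker: float = 6.0  # assumed default, for illustration only

    def __post_init__(self) -> None:
        # Compute the default once and store it; the property is left untouched.
        if self.encoder == "h264_nvenc" or self.use_hwaccel:
            self._resources = Resources(gpus=1.0)
        else:
            self._resources = Resources(cpus=self.num_cpus_per_worker)


default_stage = TranscodeStage(encoder="h264_nvenc")
custom_stage = default_stage.with_resources(Resources(cpus=2.0, gpus=0.5))
```

Here `custom_stage.resources` reflects the caller's override while `default_stage` keeps the value computed in `__post_init__`.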

praateekmahajan · 2025-07-24T22:04:52Z

ray-curator/ray_curator/stages/video/clipping/clip_extraction_stages.py

+        if video.source_bytes is None:
+            msg = "Video source bytes are not available"
+            raise ValueError(msg)


Is this for the IDE to be happy? Because in theory the stage's validate(..) should check the inputs for source_bytes, right? If not, then we've gone wrong somewhere in our validate implementation.

praateekmahajan · 2025-07-24T22:05:39Z

ray-curator/ray_curator/stages/video/clipping/clip_extraction_stages.py

+        output_tasks = []
+        clip_durations = [clip.duration for clip in video.clips]
+        if len(clip_durations) > 0:
+            logger.info(


inside verbose?
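A small sketch of what gating this log call on a verbose flag could look like. It uses the standard-library `logging` module, and the `summarize_clip_durations` helper and its message format are hypothetical; the stage in this PR may use a different logger and different wording.

```python
import logging
import statistics
from typing import Optional

logger = logging.getLogger("clip_transcoding_sketch")


def summarize_clip_durations(durations: list[float], verbose: bool = False) -> Optional[str]:
    """Build a duration summary and log it only when verbose is enabled."""
    if not durations or not verbose:
        return None
    summary = (
        f"{len(durations)} clips, "
        f"min={min(durations):.2f}s, "
        f"max={max(durations):.2f}s, "
        f"mean={statistics.fmean(durations):.2f}s"
    )
    logger.info(summary)
    return summary
```

With `verbose=False` the function returns early, so the statistics are not even computed on the default path.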

praateekmahajan · 2025-07-24T22:07:55Z

ray-curator/ray_curator/stages/video/clipping/clip_extraction_stages.py

+    def ray_stage_spec(self) -> dict[str, Any]:
+        """Ray stage specification for this stage."""
+        return {
+            RayStageSpecKeys.IS_ACTOR_STAGE: True,


Fanout makes sense since we go from X -> list[X], but do we need this to be an Actor? For general context, let's use Actors when we want to maintain state or when our setup(..) loads a model or similar, i.e. when init time is expensive. If it's a simpler map-style operation, it doesn't need to be an Actor; you can remove that value and the RayData executor will decide automatically whether it should be one.

FWIW RayData won't work with this stage because it has nvencs.

praateekmahajan · 2025-07-24T22:08:52Z

ray-curator/tests/utils/test_grouping.py

+    def test_drop_incomplete_chunk_false_explicit(self):
+        """Test keeping incomplete chunks when drop_incomplete_chunk=False."""
+        data = [1, 2, 3, 4, 5]
+        chunks = list(split_by_chunk_size(data, 3, drop_incomplete_chunk=False))
+        expected = [[1, 2, 3], [4, 5]]
+        assert chunks == expected


This could be combined with the test at line 18.

praateekmahajan · 2025-07-24T22:10:53Z

ray-curator/tests/utils/test_grouping.py

+    def test_basic_functionality(self):
+        """Test basic splitting into n chunks."""
+        data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
+        chunks = list(split_into_n_chunks(data, 3))


How does this differ in functionality from list(split_by_chunk_size(data, 3))?
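To make the question concrete, here is a sketch of how the two helpers would typically differ. The implementations below are assumptions reconstructed from the function names and the tests quoted in this thread, not the code in `grouping.py`. With nine items and an argument of 3 the two calls coincide, which is why the test above cannot tell them apart; with ten items they diverge.

```python
from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def split_by_chunk_size(
    items: Iterable[T], chunk_size: int, drop_incomplete_chunk: bool = False
) -> Iterator[list[T]]:
    """Yield consecutive chunks of exactly chunk_size items (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk: list[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Emit the trailing partial chunk unless the caller asked to drop it.
    if chunk and not drop_incomplete_chunk:
        yield chunk


def split_into_n_chunks(items: Iterable[T], num_chunks: int) -> Iterator[list[T]]:
    """Yield at most num_chunks chunks whose sizes differ by at most one."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    data = list(items)
    base, extra = divmod(len(data), num_chunks)
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        yield data[start:start + size]
        start += size


data = list(range(1, 11))  # ten items
by_size = list(split_by_chunk_size(data, 3))  # fixes the size of each chunk
into_n = list(split_into_n_chunks(data, 3))   # fixes the number of chunks
```

Under these assumptions `by_size` has four chunks (the last holding a single item) while `into_n` has exactly three, so a test input whose length is not a multiple of 3 would distinguish the two.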

suiyoubi changed the title ~~Add video splitting pipeline with fixed stride extraction and transco…~~ Add video splitting pipeline with fixed stride extraction and transcoding Stage Jul 10, 2025

abhinavg4 reviewed Jul 21, 2025

abhinavg4 requested changes Jul 21, 2025

suiyoubi changed the base branch from aot/ray-video-reader to ray-api July 22, 2025 20:15

suiyoubi added 6 commits July 22, 2025 13:23

Refactor video splitting pipeline to remove debug mode and enhance st…

e43daa3

…age integration Signed-off-by: Ao Tang <aot@nvidia.com>

Add video limit argument to video split clip example

0002c6b

Signed-off-by: Ao Tang <aot@nvidia.com>

Remove deprecated GPU resource tests from ClipTranscodingStage

012a2e1

Signed-off-by: Ao Tang <aot@nvidia.com>

suiyoubi force-pushed the aot/ray-video-clip-extraction branch from 29a106c to 012a2e1 Compare July 22, 2025 20:24

Merge branch 'ray-api' of github.com:NVIDIA-NeMo/Curator into aot/ray…

c6a8a1b

…-video-clip-extraction

copy-pr-bot bot temporarily deployed to test July 22, 2025 20:42 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci July 22, 2025 20:42 Failure

abhinavg4 approved these changes Jul 22, 2025

praateekmahajan reviewed Jul 23, 2025

suiyoubi and others added 2 commits July 23, 2025 09:22

Merge branch 'ray-api' into aot/ray-video-clip-extraction

6365bfb

Remove unused test for processing in debug mode from ClipTranscodingS…

9f39885

…tage tests Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot bot temporarily deployed to test July 23, 2025 13:26 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci July 23, 2025 13:26 Error

Add unit tests for grouping utilities in the ray_curator.utils module

8472134

Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot bot temporarily deployed to test July 23, 2025 13:32 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci July 23, 2025 13:32 Inactive

praateekmahajan reviewed Jul 24, 2025

praateekmahajan mentioned this pull request Jul 24, 2025

Reasoning Data Curation pipeline #782

Open

3 tasks
