Insights: NVIDIA-NeMo/Curator
Overview
11 Pull requests merged by 7 people
-
Update ray-api auto label action trigger
#856 merged
Jul 25, 2025 -
ci(fix): Use GITHUB_TOKEN for community bot
#853 merged
Jul 25, 2025 -
Auto label PRs with the ray-api label
#845 merged
Jul 24, 2025 -
cp: `ci: Add community-bot (846)` into `ray-api`
#849 merged
Jul 23, 2025 -
ci: Add community-bot
#846 merged
Jul 23, 2025 -
cp: `chore: Add new trustees and vetters to the copy-pr-bot configuration (841)` into `ray-api`
#842 merged
Jul 22, 2025 -
chore: Add new trustees and vetters to the copy-pr-bot configuration
#841 merged
Jul 22, 2025 -
Ray Video Pipeline : Video Reader
#775 merged
Jul 22, 2025 -
Re-enable CI/CD for Ray API branch
#840 merged
Jul 22, 2025 -
Update ray version to 2.48
#839 merged
Jul 22, 2025 -
Update Authors
#836 merged
Jul 21, 2025
10 Pull requests opened by 7 people
-
Adding get client feature
#834 opened
Jul 20, 2025 -
Adding function decorator for very simple functions to be converted into stages
#835 opened
Jul 20, 2025 -
Initial Minhash implementation on Ray
#837 opened
Jul 21, 2025 -
Add secrets detector
#843 opened
Jul 22, 2025 -
Initialize and shutdown ray session in each executor
#844 opened
Jul 22, 2025 -
Add aesthetic filtering capabilities to video processing pipeline
#847 opened
Jul 23, 2025 -
Ray Video Reader Enhancement
#848 opened
Jul 23, 2025 -
Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage
#850 opened
Jul 23, 2025 -
cp: `Auto label PRs with the ray-api label (845)` into `ray-api`
#851 opened
Jul 24, 2025 -
cp: `ci(fix): Use GITHUB_TOKEN for community bot (853)` into `ray-api`
#854 opened
Jul 25, 2025
4 Issues closed by 1 person
-
Extend support to non-English languages for PII Deidentifier
#554 closed
Jul 25, 2025 -
Running Curator under SLURM Cluster
#531 closed
Jul 25, 2025 -
Remove dask conditionals from our codebase
#680 closed
Jul 25, 2025
3 Issues opened by 2 people
-
[Ray] Pass expected language into `FastTextLangId` filter
#855 opened
Jul 25, 2025 -
Batch size tuning for Hugging Face text classifiers
#838 opened
Jul 21, 2025
59 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Add video splitting pipeline with fixed stride extraction and transcoding Stage
#783 commented on
Jul 24, 2025 • 26 new comments -
[Ray] Classifiers
#753 commented on
Jul 25, 2025 • 20 new comments -
Reasoning Data Curation pipeline
#782 commented on
Jul 25, 2025 • 19 new comments -
Add frame extraction stage to video splitting pipeline
#803 commented on
Jul 22, 2025 • 4 new comments -
Add motion filtering stages to video splitting pipeline
#797 commented on
Jul 22, 2025 • 4 new comments -
Add ClipWriterStage to video splitting pipeline
#786 commented on
Jul 23, 2025 • 3 new comments -
Add a way to pass expected language to FastTextLangId filter
#565 commented on
Jul 26, 2025 • 0 new comments -
Create FastText classifier module
#546 commented on
Jul 24, 2025 • 0 new comments -
Hard negative mining for Retriever fine-tuning
#523 commented on
Jul 25, 2025 • 0 new comments -
Added LookUp error handling during encoding detection.
#502 commented on
Jul 25, 2025 • 0 new comments -
[WIP] Add RAPIDS Nightly to GPU CI
#436 commented on
Jul 26, 2025 • 0 new comments -
Updating the Quick Example
#432 commented on
Jul 26, 2025 • 0 new comments -
Bump nltk from 3.8.1 to 3.9 in /tutorials/dapt-curation/code
#429 commented on
Jul 26, 2025 • 0 new comments -
Fix GPU error messages for fuzzy deduplication
#387 commented on
Jul 26, 2025 • 0 new comments -
Remove `max_text_bytes_per_part`
#385 commented on
Jul 26, 2025 • 0 new comments -
Create `Cache` class for exact, fuzzy, and semantic deduplication
#384 commented on
Jul 26, 2025 • 0 new comments -
ci: Add `copyright-check` workflow
#369 commented on
Jul 26, 2025 • 0 new comments -
Running dapt tutorial is giving error
#833 commented on
Jul 19, 2025 • 0 new comments -
Add option to skip data by adding a flag instead of removing them
#566 commented on
Jul 24, 2025 • 0 new comments -
Add Regex Modifier
#568 commented on
Jul 24, 2025 • 0 new comments -
[WIP] Remote I/O in SemDedup
#621 commented on
Jul 24, 2025 • 0 new comments -
Change prompt to try and get only topic names
#623 commented on
Jul 26, 2025 • 0 new comments -
Fail loudly for NeMo Curator Dask-Cuda cluster creation CUDA context issues
#675 commented on
Jul 24, 2025 • 0 new comments -
Presidio pii redaction
#765 commented on
Jul 24, 2025 • 0 new comments -
Initial PR for Synthetic data generation
#767 commented on
Jul 24, 2025 • 0 new comments -
[Ray] Download and extract ArXiv
#805 commented on
Jul 25, 2025 • 0 new comments -
Add VideoFrameExtractionStage to video splitting pipeline
#808 commented on
Jul 22, 2025 • 0 new comments -
Add TransNetV2ClipExtractionStage to video splitting pipeline
#809 commented on
Jul 22, 2025 • 0 new comments -
Add Cosmos-Embed1 model and embedding stages
#825 commented on
Jul 22, 2025 • 0 new comments -
docs: refactor all the things
#826 commented on
Jul 25, 2025 • 0 new comments -
import cudf will cause CUDARuntime error
#714 commented on
Jul 24, 2025 • 0 new comments -
Add example of how to resume an interrupted `download_common_crawl` job
#710 commented on
Jul 24, 2025 • 0 new comments -
Add more Docker instructions to README
#693 commented on
Jul 24, 2025 • 0 new comments -
Multilingual PII support
#661 commented on
Jul 24, 2025 • 0 new comments -
Exact / Fuzzy Duplicate Removal Improvements at Scale
#529 commented on
Jul 25, 2025 • 0 new comments -
nemo_curator.utils.distributed_utils.read_data doesn't work for my own parquet dataset unless cleaning text by myself
#482 commented on
Jul 25, 2025 • 0 new comments -
fuzzy_dedup OOM issue
#471 commented on
Jul 25, 2025 • 0 new comments -
Fuzzy Duplicates Identification fails on batched_merge_and_write when document dataset is read with blocksize
#462 commented on
Jul 25, 2025 • 0 new comments -
Consecutive execution of fuzzy deduplication on different columns fails with errors
#501 commented on
Jul 25, 2025 • 0 new comments -
[FEA] Remove GPU-related messages on CPU-only servers
#535 commented on
Jul 25, 2025 • 0 new comments -
Migration to Ray - Text Curation
#745 commented on
Jul 25, 2025 • 0 new comments -
Migration to Ray Backend - Core Infra
#744 commented on
Jul 25, 2025 • 0 new comments -
Fuzzy Duplicate Removal fails at scale
#723 commented on
Jul 25, 2025 • 0 new comments -
Ruff Bug fixes in code
#627 commented on
Jul 25, 2025 • 0 new comments -
[FEA] Enable Best Fit Packing
#492 commented on
Jul 25, 2025 • 0 new comments -
Post to internal slack if nightly tests fail
#488 commented on
Jul 25, 2025 • 0 new comments -
Refactor separate_by_metadata and Partition On to use the same code paths.
#524 commented on
Jul 25, 2025 • 0 new comments -
[FEA] Add Sampling-Based Clustering in SemDedup
#538 commented on
Jul 25, 2025 • 0 new comments -
Remove dependency on `convert_str_id_to_int` in FuzzyDedup Scripts
#447 commented on
Jul 26, 2025 • 0 new comments -
PII Modifier fails to load on worker sporadically raising `cannot reshape array of size`
#424 commented on
Jul 26, 2025 • 0 new comments -
Pii Modifier should work with `DocumentDataset` on cudf
#418 commented on
Jul 26, 2025 • 0 new comments -
PII Modifier should support documents greater than pre-configured length
#417 commented on
Jul 26, 2025 • 0 new comments -
`LookupError` not caught during Encoding handling
#411 commented on
Jul 26, 2025 • 0 new comments -
Use CrossFit for `TokenizerFertilityFilter`
#377 commented on
Jul 26, 2025 • 0 new comments -
[IMP] Decrease Merge Peak Memory Usage of ConnectedComponents
#375 commented on
Jul 26, 2025 • 0 new comments -
Zyda2 tutorial - key error when running compute_counts script
#345 commented on
Jul 26, 2025 • 0 new comments -
Zyda2 tutorial - TypeError when initializing Dask CPU cluster
#344 commented on
Jul 26, 2025 • 0 new comments -
Standardize `text_field`, `id_field`, etc. terminology
#342 commented on
Jul 26, 2025 • 0 new comments -
Faster/More efficient duplicate removal for exact/fuzzy dedup.
#335 commented on
Jul 26, 2025 • 0 new comments