
ggcr/codecurator


CodeCurator

An end-to-end tool for curating GitHub repositories into structured code datasets.

  • Fast parallel processing - Download and extract with configurable workers
  • Smart filtering - Only processes programming files using GitHub Linguist
  • GPT-2 tokenization - Ready-to-use token counts for ML workflows
  • Efficient caching - Uses ETags to avoid re-downloading unchanged repos

Perfect for curating training data, running code analysis, or creating repository archives.

Installation

cargo install --path .

Usage

Create an input file (for example ./configs/repos.jsonl) with one GitHub repository per line, each written as a quoted owner/name string:

"microsoft/vscode"
"vercel/next.js"
"tensorflow/tensorflow"
"bitcoin/bitcoin"
"rust-lang/rust"
"kubernetes/kubernetes"
"facebook/react"
"docker/compose"
"ansible/ansible"
"elastic/elasticsearch"
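To make the input format concrete, here is a minimal Python sketch that reads such a file. It assumes each non-blank line is a JSON string of the form owner/name, as in the list above. The function name `parse_repo_list` is hypothetical and is not part of CodeCurator.

```python
import json


def parse_repo_list(text):
    """Parse lines such as "owner/name" (JSON strings) into (owner, name) tuples."""
    repos = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        slug = json.loads(line)  # each line is assumed to be a JSON string literal
        if not isinstance(slug, str):
            raise ValueError(f"line {line_no}: expected a JSON string")
        owner, sep, name = slug.partition("/")
        if not sep or not owner or not name or "/" in name:
            raise ValueError(f"line {line_no}: expected owner/name, got {slug!r}")
        repos.append((owner, name))
    return repos


sample = '"microsoft/vscode"\n"vercel/next.js"\n'
print(parse_repo_list(sample))
```

Validating the list up front like this catches a malformed line before any download starts.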

Download repositories:

codecurator download ./configs/repos.jsonl

This creates ZIP files in the /zip/ directory. The main branch is tried first, with a fallback to master if needed.
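The branch fallback and the ETag caching listed among the features can be pictured with the following Python sketch. It is not CodeCurator's implementation. The `fetch` callable, the `etags` dictionary, and the function names are hypothetical, and the archive URL pattern is an assumption about which GitHub endpoint is used. The network call is injected so the logic can be run without network access.

```python
def archive_url(repo, branch):
    # Assumed URL pattern for a branch archive on GitHub.
    return f"https://github.com/{repo}/archive/refs/heads/{branch}.zip"


def download_repo(repo, fetch, etags, branches=("main", "master")):
    """Try each branch in order.

    fetch(url, etag) is a hypothetical callable returning (status, etag, body);
    it is expected to send the cached ETag in an If-None-Match header.
    Returns (branch, body), where body is None if the cached copy is current,
    or None if no branch could be found.
    """
    for branch in branches:
        url = archive_url(repo, branch)
        status, etag, body = fetch(url, etags.get(url))
        if status == 304:
            return branch, None  # unchanged since the last download
        if status == 200:
            if etag:
                etags[url] = etag  # remember the ETag for the next run
            return branch, body
        if status != 404:
            raise RuntimeError(f"unexpected status {status} for {url}")
        # 404: this branch does not exist, try the next one
    return None


def stub_fetch(url, etag):
    # In-memory stand-in for an HTTP client: only "master" exists.
    if url.endswith("/main.zip"):
        return 404, None, b""
    if etag == '"v1"':
        return 304, etag, b""
    return 200, '"v1"', b"zip-bytes"


cache = {}
print(download_repo("bitcoin/bitcoin", stub_fetch, cache))  # first run downloads
print(download_repo("bitcoin/bitcoin", stub_fetch, cache))  # second run hits the cache
```

On the second call the stub answers 304 Not Modified, so nothing is downloaded again.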

Extract and process:

codecurator extract ./configs/repos.jsonl --languages Python Rust Verilog

Processes programming files (restricted here to Python, Rust, and Verilog via --languages), tokenizes their content, and writes structured data to the /jsonl/ directory.
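As a rough illustration of language filtering, the Python sketch below keeps only files whose extension maps to a requested language. The three-entry table is a hand-written stand-in for GitHub Linguist's much larger language data, and the function name is hypothetical. The real tool may classify files differently.

```python
from pathlib import PurePosixPath

# Tiny illustrative table; Linguist's real data covers far more
# extensions and handles ambiguous ones.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".rs": "Rust",
    ".v": "Verilog",
}


def filter_by_language(paths, languages):
    """Return (path, language) pairs for files in the requested languages."""
    wanted = set(languages)
    kept = []
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())
        if language in wanted:
            kept.append((path, language))
    return kept


files = ["src/main.rs", "README.md", "rtl/alu.v", "tools/build.py"]
print(filter_by_language(files, ["Python", "Rust"]))
```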

Deduplication:

codecurator dedupe ./configs/repos.jsonl

Hashes the contents of every file and removes duplicates. The deduplicated data is written to /dedup/ by default.
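Exact-match deduplication by content hash can be sketched in a few lines of Python. SHA-256 and the `content` and `path` field names are assumptions made for illustration. The source does not say which hash function or record schema CodeCurator uses.

```python
import hashlib


def dedupe_by_content(records, key="content"):
    """Keep the first record for each distinct content hash, preserving order."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record[key].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content was already kept
        seen.add(digest)
        unique.append(record)
    return unique


records = [
    {"path": "a/util.py", "content": "print('hi')\n"},
    {"path": "b/util.py", "content": "print('hi')\n"},  # same bytes as above
    {"path": "c/main.rs", "content": "fn main() {}\n"},
]
print([r["path"] for r in dedupe_by_content(records)])
```

Storing only digests in the set keeps memory use proportional to the number of distinct files rather than their total size.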

Statistics:

$ bash stats/count_records.sh ./jsonl/
Total records: 110645

$ bash stats/count_tokens.sh ./dedup/
Total tokens: 346574283
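For readers who prefer Python to shell, similar totals can be computed with the sketch below. It assumes the output directories hold .jsonl files with one JSON object per line. The `n_tokens` field name is hypothetical, so check the actual output for the real key. This is not a port of the scripts in stats/.

```python
import json
from pathlib import Path


def iter_jsonl_lines(directory):
    """Yield every non-blank line from all .jsonl files under directory."""
    for path in sorted(Path(directory).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield line


def count_records(directory):
    # One non-blank line is treated as one record.
    return sum(1 for _ in iter_jsonl_lines(directory))


def count_tokens(directory, token_field="n_tokens"):
    # token_field is a hypothetical key; records without it count as 0.
    return sum(
        int(json.loads(line).get(token_field, 0))
        for line in iter_jsonl_lines(directory)
    )
```

Usage would be `count_records("./jsonl/")` and `count_tokens("./dedup/")`, matching the directories in the commands above.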

About

A Rust tool for curating and processing GitHub repos as datasets 🦀
