Starred repositories
The official repository for Toxic Commons and Celadon. Toxicity Classification for public domain data.
The official GitHub page for the survey paper of AIGTD entitled "The Imitation Game Revisited: A Comprehensive Survey on Recent Advances in AI-generated Text Detection."
RAID is the largest and most challenging benchmark for AI-generated text detection. (ACL 2024)
Small Python package to measure OCR quality and other related metrics.
The legal review and SBOM system used by SUSE and openSUSE
Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM (CHI 2024 paper). LLooM automatically surfaces high-level concepts to analyze unstructured text.
Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24)
Frankentext: Stitching random text fragments into long-form narratives
Hypernetworks that adapt LLMs for specific benchmark tasks using only textual task description as the input
An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.
Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for Qwen2.5, Qwen3, Llama, Kimi, and more!
Download the entire Wayback Machine archive for a given URL.
A blazing fast, async-first, undetectable webscraping/web automation framework based on ultrafunkamsterdam/nodriver. Now with Docker support!
[SIGIR24] Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval
Successor of Undetected-Chromedriver. Providing a blazing fast framework for web automation, webscraping, bots and any other creative ideas which are normally hindered by annoying anti-bot systems …
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems (WWW 2025)
A modest JavaScript framework for the HTML you already have
Scaling Deep Research via Reinforcement Learning in Real-world Environments.
verl: Volcano Engine Reinforcement Learning for LLMs
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflo…
Unzipping with yauzl with added support for Mac OS Archive Utility ZIP files
🔑 Deno library for hashing passwords using scrypt