Stars
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…
nv-one-logger enables tracking of GPU application progress over time and can help to identify overhead from workload and cluster inefficiencies to provide efficiency metrics.
Training library for Megatron-based models
Scalable toolkit for efficient model reinforcement
A Datacenter Scale Distributed Inference Serving Framework
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
watchpoints is an easy-to-use, intuitive variable/object monitoring tool for Python that behaves similarly to watchpoints in gdb.
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data…
alibaba / Megatron-LLaMA
Forked from NVIDIA/Megatron-LM. Best practice for training LLaMA models in Megatron-LM
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Provides end-to-end model development pipelines for LLMs and Multimodal models that can be launched on-prem or cloud-native.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Seamlessly integrate LLMs into scikit-learn.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Open-source search and retrieval database for AI applications.
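A plugin that improves the data security and efficiency of ChatGPT. It also shares many innovative features for free, such as auto refresh, keep alive, data security, audit cancellation, conversation cloning, unrestricted output, page cleanup, large-screen display, tracking interception, continuous updates, and fine-grained detail. It makes the AI experience exceptionally secure, smooth, efficient, and clean.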
🦜🔗 Build context-aware reasoning applications
JARVIS, a system to connect LLMs with ML community. Paper: https://arxiv.org/pdf/2303.17580.pdf
Time series forecasting with PyTorch
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Always know what to expect from your data.
Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and mo…
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Library for exploring and validating machine learning data