Rampati / MIT-6.S191-Lab3

In this lab, you will fine-tune a multi-billion-parameter language model to generate responses in a specific style. You'll work with tokenization strategies, prompt templates, and a complete fine-tuning workflow to improve LLM outputs. 🐙✨

nlp evaluation text-generation language-model gemma opik fine-tuning peft huggingface llm mit-6s191

Updated Jul 8, 2025
Jupyter Notebook

natanim-haile / prompt-optimizer

Prompt Optimizer enhances your prompts for better results in AI applications. Join the community on GitHub and contribute to improving prompt efficiency! 🌟🌐

python nlp agent ai evaluation prompt multi-agent gpt datasets self-improvement natural-language-generation nlg prompt-learning prompt-tuning chatgpt society-of-mind ollama prompt-optimization

Updated Jul 8, 2025
TypeScript

justplus / llm-eval

A large language model evaluation platform that supports multiple evaluation benchmarks, custom datasets, and performance testing.

evaluation large-language-models llm evalscope

Updated Jul 8, 2025
Python

Kakz / prometheus-llm

PrometheusLLM is a unique transformer architecture inspired by dignity and recursion. This project aims to explore new frontiers in AI research and welcomes contributions from the community. 🐙🌟

deep-learning mcp evaluation pipelines tracing language-model self-organization cognitive-architecture hermeneutics philosophy-of-mind gpt4 llm llmops ollama litellm llm-as-evaluator autopoietic-systems prompt-logging

Updated Jul 8, 2025
Python

ModelTC / llmc

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

benchmark deployment tool evaluation pruning quantization wan post-training-quantization awq large-language-models llm vllm smoothquant token-reduction mixtral internlm2 lvlm deepseek-v3 deepseek-r1

Updated Jul 8, 2025
Python

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Jul 8, 2025
TypeScript

tenemos / langwatch

The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨

ai analytics evaluation openai gpt datasets observability low-code dspy llm prompt-engineering llmops

Updated Jul 8, 2025
TypeScript

stack-rs / mitosis

Mitosis: A Unified Transport Evaluation Framework

rust cli distributed-systems library evaluation distributed transport-layer evaluation-framework

Updated Jul 8, 2025
Rust

yassinelahdiy / page-language-model

Open-source framework for defining Page Language Models (PLMs) for intelligent app understanding and AI-assisted testing.

nlp agent benchmark gui webpage rest-api evaluation image-recognition model-assessment pre-training playwright prompt-engineering instruction-tuning set-of-mark

Updated Jul 8, 2025
Python

bibanyok89 / karate-ui-test

This repository contains automated tests using Karate Test Automation. The tests include scenarios for validating different pages and actions on the NASA Hubble Mission website and image download functionality.

java webdriver tdd bdd evaluation test-automation test-scenarios ui-testing karate web-testing karate-framework karateui karatedsl karate-test-automation

Updated Jul 8, 2025
Gherkin

Abhisang3 / xVerify

xVerify: Efficient Answer Verifier for Large Language Model Evaluations

benchmark reliability evaluation llm reliability-tools chatgpt open-compass deepseek-math judge-model reasoning-models open-r1 xverify math-verify

Updated Jul 8, 2025
Python

InfernoWasTaken2 / kendin-denetle

Automated testing tool designed for developers to quickly and efficiently test their own code. Simplifies the process of checking for bugs and errors in software projects.

review personal evaluation audit assessment introspection selfcheck self examination self-assessment self-check self-examination self-evaluation self-review kendindenetle kendin-denetle

Updated Jul 8, 2025

open-compass / VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip claude openai-api gpt4 large-language-models llm chatgpt llava qwen gpt-4v

Updated Jul 8, 2025
Python

rabahdic / SE-TESTING

Explore the SE-TESTING repository for comprehensive labs and activities in software testing. Collaborate and enhance your skills with practical examples. 🛠️🌟

javascript debugging security benchmark spring microservice serverless jpa aws-s3 evaluation owasp jee serverless-framework iast dast enterprise-systems owasp-website wstg

Updated Jul 8, 2025
JavaScript

deepsense-ai / ragbits

Building blocks for rapid development of GenAI applications

optimization evaluation agents prompts document-search rag guardrails llms vector-stores

Updated Jul 8, 2025
Python

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.