AI-assisted tool for reviewing and curating gene annotations. This project provides a structured workflow for validating existing Gene Ontology (GO) annotations using AI-driven analysis combined with literature research and bioinformatics evidence.
The AI Gene Review tool helps researchers and curators:
- Review existing GO annotations using strict, defined criteria
- Synthesize high-quality annotations from multiple evidence sources
- Fetch and organize gene data from UniProt and GOA databases
- Validate annotation files against LinkML schemas
- Manage references and supporting literature
- Install uv for dependency management
- Clone the repository and install dependencies:
git clone https://github.com/cmungall/ai-gene-review.git
cd ai-gene-review
uv sync --group dev
Fetch gene data:
uv run ai-gene-review fetch-gene human TP53
Validate a gene review file:
uv run ai-gene-review validate genes/human/TP53/TP53-ai-review.yaml
Fetch publications for a gene:
uv run ai-gene-review fetch-gene-pmids genes/human/TP53/TP53-ai-review.yaml
Generate statistics report:
just stats # Generate HTML report
just stats-open # Generate and open in browser
- Fetch Gene Data: Download UniProt records and GO annotations
- Literature Research: Gather supporting publications and evidence
- Create Review: Structure annotations using the YAML schema
- Validate: Check against LinkML schema and best practices
- Iterate: Refine annotations based on validation results
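The command-line parts of this workflow can be scripted. The following is a minimal sketch, assuming the review file lives at genes/<organism>/<GENE>/<GENE>-ai-review.yaml as in the TP53 example; the helper name `workflow_commands` is hypothetical and not part of the package. It only builds the argument lists for the subcommands shown above and does not run them.

```python
from pathlib import PurePosixPath


def workflow_commands(organism: str, gene: str) -> list[list[str]]:
    """Build argv lists for one fetch -> fetch-PMIDs -> validate pass.

    Hypothetical helper: the review path layout is inferred from the
    TP53 example and is an assumption, not a documented guarantee.
    """
    review = PurePosixPath("genes") / organism / gene / f"{gene}-ai-review.yaml"
    base = ["uv", "run", "ai-gene-review"]
    return [
        base + ["fetch-gene", organism, gene],
        # The next two steps expect the review YAML to exist already.
        base + ["fetch-gene-pmids", str(review)],
        base + ["validate", str(review)],
    ]


if __name__ == "__main__":
    for argv in workflow_commands("human", "TP53"):
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` so that a failing step stops the run before the next one starts.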
- 🧬 Multi-organism support: Human, mouse, worm, and other model organisms
- 📚 Literature integration: Automatic PubMed citation fetching and caching
- ✅ Schema validation: LinkML-based validation for consistency
- 🛡️ Anti-hallucination validation: ID/label tuple checksums prevent AI fabrication of terms
- 🔄 Batch processing: Handle multiple genes efficiently
- 📊 Structured reviews: YAML-based gene annotation reviews
- 🔍 Evidence tracking: Detailed provenance and supporting text
- Documentation Website: https://monarch-initiative.github.io/ai-gene-review
- Interactive Web App: https://ai4curation.github.io/ai-gene-review/app/index.html - Browse and explore gene annotation reviews
- Statistics Dashboard: https://ai4curation.github.io/ai-gene-review/docs/stats_report.html - Summary Stats
Each gene review follows a structured YAML format containing:
- Gene metadata: UniProt ID, gene symbol, taxon information
- Description: Comprehensive summary of gene function
- References: Literature and bioinformatics sources
- Existing annotations: Review of current GO annotations with actions (ACCEPT, MODIFY, REMOVE, etc.)
- Core functions: Curated essential gene functions
Example structure:
id: Q9BRQ4
gene_symbol: CFAP300
taxon:
  id: NCBITaxon:9606
  label: Homo sapiens
description: >-
  CFAP300 is a cilium- and flagellum-specific protein...
existing_annotations:
  - term:
      id: GO:0005515
      label: protein binding
    action: MODIFY
    reason: "While evidence is strong, 'protein binding' is uninformative..."
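Once a review has been parsed into a mapping (for example with a YAML loader), the actions assigned to existing annotations are easy to tally. The sketch below writes the example structure as a Python dict to stay dependency-free; `summarize_actions` is a hypothetical helper, not a function provided by the package.

```python
from collections import Counter

# The example review as a parsed mapping (what a YAML loader would return).
review = {
    "id": "Q9BRQ4",
    "gene_symbol": "CFAP300",
    "taxon": {"id": "NCBITaxon:9606", "label": "Homo sapiens"},
    "existing_annotations": [
        {
            "term": {"id": "GO:0005515", "label": "protein binding"},
            "action": "MODIFY",
        },
    ],
}


def summarize_actions(review: dict) -> Counter:
    """Count how many existing annotations received each review action."""
    return Counter(
        annotation["action"]
        for annotation in review.get("existing_annotations", [])
        if "action" in annotation
    )


if __name__ == "__main__":
    print(summarize_actions(review))
```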
The repository includes example gene reviews for:
- Human: BRCA1, CFAP300, RBFOX3, TP53
- Mouse: Various examples
- Worm: lrx-1
Browse the genes/ directory to see complete examples.
The review of pedH revealed several important curation insights:
- Lanthanide vs Calcium Dependency: PedH was incorrectly annotated with "calcium ion binding" (GO:0005509) when it actually requires lanthanide ions (La³⁺, Ce³⁺, Pr³⁺, Nd³⁺, Sm³⁺) for activity. This highlights the importance of reviewing automated annotations based on sequence similarity.
- Cellular Localization Precision: Bioinformatics analysis confirmed PedH is a soluble periplasmic enzyme, not membrane-associated:
  - Signal peptide (aa 1-25) directs export, then is cleaved
  - No transmembrane regions in mature protein
  - Functions throughout periplasmic space, not just at membrane boundaries
  - Led to choosing GO:0042597 (periplasmic space) over GO:0030288 (outer membrane-bounded periplasmic space)
- Dual Functional Roles: PedH serves both as:
  - Metabolic enzyme: Oxidizes alcohols in 2-phenylethanol degradation pathway
  - Regulatory sensor: Part of lanthanide-sensing system controlling gene expression via PedS2/PedR2 two-component system
- Missing GO Terms Identified: The review revealed gaps in GO:
  - No term for "lanthanide ion binding" (distinct from transition metal binding)
  - No term for "lanthanide-dependent alcohol dehydrogenase activity"
- Verify metal cofactors carefully - Don't assume calcium when other metals are possible
- Consider protein mobility - Soluble vs membrane-associated matters for localization terms
- Look for regulatory functions - Enzymes may have sensory/regulatory roles beyond catalysis
- Use bioinformatics to validate - Signal peptide and TM predictions can clarify localization
The AI Gene Review system implements a robust anti-hallucination validation mechanism using ID/label tuple checksums to prevent AI systems from fabricating or misusing ontological terms.
Every ontology term in the system requires both an id (semantic identifier) and label (human-readable name):
term:
  id: GO:0005515 # Ontology identifier
  label: protein binding # Canonical label
The TermValidator performs multi-layer validation:
- Format Validation: Ensures IDs follow proper CURIE patterns (PREFIX:NUMBER)
- Existence Validation: Verifies terms exist in authoritative ontologies via OAK/OLS APIs
- Label Matching: Cross-references provided labels against canonical ontology labels
- Branch Validation: Ensures GO terms are in correct ontological branches (MF/BP/CC)
- Obsolescence Checking: Flags outdated terms
- ✅ Dual Verification: Both ID and label must be correct and consistent
- ✅ External Truth Source: Validates against authoritative ontologies (GO, HP, MONDO, etc.)
- ✅ Real-time Checking: Uses live API calls to catch fabricated terms
- ✅ Semantic Consistency: Ensures terms make sense in their context
# ❌ This would be caught as invalid
term:
  id: GO:0005515
  label: "DNA binding" # Wrong label for GO:0005515

# ✅ This passes validation
term:
  id: GO:0005515
  label: "protein binding" # Correct canonical label

# ❌ This would be flagged as fabricated
term:
  id: GO:9999999
  label: "made up function" # Non-existent term
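The three cases above can be reproduced with a small sketch of the format, existence, and label checks. This is not the project's TermValidator: the in-memory `CANONICAL_LABELS` table is an illustrative stand-in for the OAK/OLS lookups, and the regular expression and the `check_term` function are assumptions made for this example.

```python
import re

# Stand-in for an authoritative ontology lookup (assumption for this sketch;
# the real validator is described as querying OAK/OLS).
CANONICAL_LABELS = {
    "GO:0005515": "protein binding",
    "GO:0005509": "calcium ion binding",
}

# PREFIX:NUMBER, e.g. GO:0005515 or NCBITaxon:9606.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*:\d+$")


def check_term(term_id: str, label: str) -> str:
    """Return 'ok' or the name of the first check that the ID/label pair fails."""
    if not CURIE_PATTERN.match(term_id):
        return "bad_format"
    canonical = CANONICAL_LABELS.get(term_id)
    if canonical is None:
        return "unknown_term"
    if canonical != label:
        return "label_mismatch"
    return "ok"


if __name__ == "__main__":
    for term_id, label in [
        ("GO:0005515", "DNA binding"),
        ("GO:0005515", "protein binding"),
        ("GO:9999999", "made up function"),
    ]:
        print(term_id, repr(label), "->", check_term(term_id, label))
```

Requiring the ID and label to agree means a model has to get two independent facts right at once: a plausible label attached to the wrong ID fails the label check, and an invented ID fails the existence check.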
The validator supports 10+ major ontologies:
- GO: Gene Ontology (molecular functions, biological processes, cellular components)
- HP: Human Phenotype Ontology
- MONDO: Mondo Disease Ontology
- CL: Cell Ontology
- UBERON: Uberon Anatomy Ontology
- CHEBI: Chemical Entities of Biological Interest
- PR: Protein Ontology
- SO: Sequence Ontology
- PATO: Phenotype And Trait Ontology
- NCBITaxon: NCBI Taxonomy
This validation system represents a novel approach to preventing ontological hallucination in AI curation workflows and could serve as a model for other AI applications working with structured biological knowledge.
- genes/ - Gene review data organized by organism
  - human/, mouse/, worm/ - Species-specific gene directories
  - Each gene folder contains: YAML review, UniProt data, GO annotations, notes
- docs/ - MkDocs-managed documentation
- src/ai_gene_review/ - Core Python package
  - cli.py - Command-line interface
  - schema/ - LinkML schema definitions
  - etl/ - Data extraction and loading modules
- tests/ - Python tests and example data
- publications/ - Cached PubMed articles
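Given this layout, the review files can be enumerated with a glob. The sketch below assumes each review is named <GENE>-ai-review.yaml inside genes/<organism>/<GENE>/, as in the TP53 example; `find_reviews` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def find_reviews(root: Path) -> dict[str, list[str]]:
    """Map each organism directory to the genes that have a review file.

    Assumes the genes/<organism>/<GENE>/<GENE>-ai-review.yaml layout.
    """
    found: dict[str, list[str]] = {}
    for review in sorted(root.glob("*/*/*-ai-review.yaml")):
        organism = review.parent.parent.name
        found.setdefault(organism, []).append(review.parent.name)
    return found


if __name__ == "__main__":
    for organism, genes in find_reviews(Path("genes")).items():
        print(f"{organism}: {', '.join(genes)}")
```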
This project uses the just command runner for development tasks.
Available commands:
just --list # Show all available commands
just test # Run tests, type checking, and linting
just format # Run code formatting checks
just install # Install project dependencies
CLI Commands:
uv run ai-gene-review --help # Show CLI help
uv run ai-gene-review fetch-gene human BRCA1 # Fetch gene data
uv run ai-gene-review validate <yaml-file> # Validate review file
uv run ai-gene-review batch-fetch <input-file> # Process multiple genes
See CONTRIBUTING.md for detailed contribution guidelines including:
- Code of conduct and best practices
- Understanding LinkML schemas
- Pull request workflow
- Development setup
This project uses the monarch-project-copier template.