Platypus

Crates.io Documentation License: MIT

Platypus is a Rust-based hybrid search engine that unifies Keyword Search, Semantic Search, and Multimodal Search into a single, cohesive system.

The name comes from the platypus — one of the most remarkable real-world creatures, known for combining traits from mammals, birds, and reptiles into a single organism. This unique fusion of distinct evolutionary features mirrors the three complementary forms of understanding in modern search:

🦫 Keyword Search — precise retrieval through lexical, symbolic, and linguistic matching.

🦫 Semantic Search — meaning-based retrieval powered by vector representations and embeddings.

🦫 Multimodal Search — bridging text, images, and other modalities through shared latent representations.

Together, these capabilities form a unified hybrid search architecture — much like the platypus itself, where diverse traits work in harmony to navigate complex environments.

Built in Rust for performance, safety, and extensibility, Platypus aims to provide a next-generation information retrieval platform that supports a broad range of use cases, from research exploration to production deployment.

✨ Features

Core Search Capabilities

  • Pure Rust Implementation - Memory safety and high performance with zero-cost abstractions
  • Keyword Search - Full-text search with inverted index and BM25 scoring
  • Semantic Search - HNSW-based approximate nearest neighbor search with multiple distance metrics (Cosine, Euclidean, Dot Product)
  • Multimodal Search - Combined lexical and vector search with configurable score fusion strategies
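The score fusion mentioned above can be illustrated with a minimal sketch. This is not the Platypus API (the `normalize` and `fuse` helpers here are hypothetical); it shows one common fusion strategy, a weighted sum of min-max-normalized lexical and vector scores:

```rust
// Illustrative score fusion: weighted sum of normalized scores.
// A sketch of the general technique, not the Platypus API.

/// Min-max normalize scores into [0, 1]; returns zeros if all scores are equal.
fn normalize(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}

/// Fuse BM25 and cosine-similarity scores for the same ranked documents.
/// `alpha` controls the lexical/vector balance (1.0 = lexical only).
fn fuse(lexical: &[f32], vector: &[f32], alpha: f32) -> Vec<f32> {
    let l = normalize(lexical);
    let v = normalize(vector);
    l.iter()
        .zip(&v)
        .map(|(a, b)| alpha * a + (1.0 - alpha) * b)
        .collect()
}

fn main() {
    let bm25 = [12.0, 3.0, 7.5]; // raw lexical scores
    let cosine = [0.82, 0.91, 0.40]; // raw vector similarities
    let fused = fuse(&bm25, &cosine, 0.5);
    for (i, s) in fused.iter().enumerate() {
        println!("doc {}: fused score {:.3}", i, s);
    }
}
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities live in [-1, 1]; without it, one signal would dominate the sum.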

Text Analysis

  • Flexible Text Analysis Pipeline - Configurable tokenization, stemming, and filtering
  • Multi-language Support - Built-in support for Japanese, Korean, and Chinese via Lindera
  • Custom Analyzers - Create custom analysis pipelines with pluggable tokenizers and filters
  • Synonym Support - Synonym expansion for improved recall
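Query-time synonym expansion can be sketched in a few lines (a hypothetical `expand` helper, not the Platypus API): each query token is replaced by itself plus any configured synonyms, widening the set of terms matched against the index.

```rust
use std::collections::HashMap;

// Query-time synonym expansion: each token expands to itself plus any
// configured synonyms. A sketch of the technique, not the Platypus API.
fn expand(tokens: &[&str], synonyms: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut out = Vec::new();
    for &tok in tokens {
        out.push(tok.to_string());
        if let Some(alts) = synonyms.get(tok) {
            out.extend(alts.iter().map(|s| s.to_string()));
        }
    }
    out
}

fn main() {
    let mut synonyms = HashMap::new();
    synonyms.insert("fast", vec!["quick", "rapid"]);
    let expanded = expand(&["fast", "search"], &synonyms);
    println!("{:?}", expanded); // ["fast", "quick", "rapid", "search"]
}
```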

Advanced Query Types

  • Term Query - Simple keyword matching
  • Phrase Query - Exact phrase matching with positional information
  • Boolean Query - Complex combinations with AND/OR/NOT logic
  • Range Query - Numeric and date range queries
  • Fuzzy Query - Approximate string matching with edit distance
  • Wildcard Query - Pattern matching with * and ? wildcards
  • Geographic Query - Location-based search with distance and bounding box queries

Embedding & Semantic Search

  • Text Embeddings - Generate semantic embeddings with Candle (local BERT models) or OpenAI API
  • Multimodal Search - Cross-modal search with CLIP models for text-to-image and image-to-image similarity
  • Automatic Model Loading - Models are automatically downloaded from HuggingFace on first use
  • GPU Acceleration - Automatic GPU usage when available for embedding generation

Storage & Performance

  • Multiple Storage Backends - Filesystem, memory-mapped files, and in-memory storage
  • SIMD Acceleration - Optimized vector operations for improved performance
  • Incremental Updates - Real-time document addition without full reindexing

Additional Features

  • Spell Correction - Built-in spell checking and query suggestion system
  • Faceted Search - Multi-dimensional search with facet aggregation and filtering
  • Schemaless Indexing - Dynamic schema support for flexible document structures
  • Document Parsing - Built-in support for various document formats

🚀 Quick Start

Add Platypus to your Cargo.toml:

[dependencies]
platypus = "0.1"

Basic Usage

use std::sync::Arc;

use tempfile::TempDir;
use platypus::analysis::analyzer::analyzer::Analyzer;
use platypus::analysis::analyzer::standard::StandardAnalyzer;
use platypus::document::document::Document;
use platypus::document::field::{IntegerOption, TextOption};
use platypus::error::Result;
use platypus::lexical::engine::LexicalEngine;
use platypus::lexical::index::config::{InvertedIndexConfig, LexicalIndexConfig};
use platypus::lexical::index::factory::LexicalIndexFactory;
use platypus::lexical::index::inverted::query::term::TermQuery;
use platypus::lexical::search::searcher::LexicalSearchRequest;
use platypus::storage::file::FileStorageConfig;
use platypus::storage::{StorageConfig, StorageFactory};

fn main() -> Result<()> {
    // Create storage in a temporary directory
    let temp_dir = TempDir::new().unwrap();
    let storage = StorageFactory::create(StorageConfig::File(FileStorageConfig::new(
        temp_dir.path(),
    )))?;

    // Configure the inverted index with a StandardAnalyzer
    let analyzer: Arc<dyn Analyzer> = Arc::new(StandardAnalyzer::new()?);
    let config = LexicalIndexConfig::Inverted(InvertedIndexConfig {
        analyzer: Arc::clone(&analyzer),
        ..InvertedIndexConfig::default()
    });
    let index = LexicalIndexFactory::create(storage, config)?;

    // Create a lexical search engine
    let mut engine = LexicalEngine::new(index)?;

    // Add documents with explicit field options
    let documents = vec![
        Document::builder()
            .add_text("title", "Rust Programming", TextOption::default())
            .add_text(
                "content",
                "Rust is a systems programming language",
                TextOption::default(),
            )
            .add_integer("year", 2024, IntegerOption::default())
            .build(),
        Document::builder()
            .add_text("title", "Python Guide", TextOption::default())
            .add_text(
                "content",
                "Python is a versatile programming language",
                TextOption::default(),
            )
            .add_integer("year", 2023, IntegerOption::default())
            .build(),
    ];

    for doc in documents {
        engine.add_document(doc)?;
    }
    engine.commit()?;

    // Search documents
    let query = Box::new(TermQuery::new("content", "programming"));
    let request = LexicalSearchRequest::new(query)
        .load_documents(true)
        .max_docs(10);
    let results = engine.search(request)?;

    println!("Found {} matches", results.total_hits);
    for hit in results.hits {
        if let Some(doc) = hit.document {
            println!("Score: {:.2}, Doc ID: {}", hit.score, hit.doc_id);
            if let Some(title) = doc.get_text("title") {
                println!("  -> {}", title);
            }
        }
    }

    Ok(())
}

🏗️ Architecture

Platypus is built with a modular architecture:

Core Components

  • Schema & Fields - Define document structure with typed fields (text, numeric, boolean, geographic, vector)
  • Analysis Pipeline - Configurable text processing with tokenizers, filters, and stemmers
  • Storage Layer - Pluggable storage backends (filesystem, memory-mapped, in-memory) with transaction support
  • Lexical Index - Inverted index with posting lists and term dictionaries for full-text search
  • Vector Index - HNSW-based approximate nearest neighbor search for semantic similarity
  • Hybrid Search - Combined lexical and vector search with configurable score fusion
  • Query Engine - Flexible query system supporting multiple query types

Field Types

Platypus supports the following field value types through the Document builder API:

use chrono::Utc;
use platypus::document::document::Document;
use platypus::document::field::{
    BinaryOption, BooleanOption, DateTimeOption, FloatOption, GeoOption, IntegerOption,
    TextOption, VectorOption,
};

let doc = Document::builder()
    // Text field for full-text search
    .add_text("title", "Introduction to Rust", TextOption::default())

    // Integer field for numeric queries
    .add_integer("year", 2024, IntegerOption::default())

    // Float field for floating-point values
    .add_float("price", 49.99, FloatOption::default())

    // Boolean field for filtering
    .add_boolean("published", true, BooleanOption::default())

    // DateTime field for temporal queries
    .add_datetime("created_at", Utc::now(), DateTimeOption::default())

    // Geographic field for spatial queries
    .add_geo("location", 35.6762, 139.6503, GeoOption::default())

    // Binary field for arbitrary data
    .add_binary("thumbnail", vec![0u8, 1, 2, 3], BinaryOption::default())

    // Vector field storing text that will be embedded during indexing
    .add_vector("title_embedding", "Introduction to Rust", VectorOption::default())

    .build();

Query Types

use platypus::lexical::index::inverted::query::boolean::BooleanQuery;
use platypus::lexical::index::inverted::query::fuzzy::FuzzyQuery;
use platypus::lexical::index::inverted::query::geo::{
    GeoBoundingBox, GeoBoundingBoxQuery, GeoDistanceQuery, GeoPoint,
};
use platypus::lexical::index::inverted::query::phrase::PhraseQuery;
use platypus::lexical::index::inverted::query::range::NumericRangeQuery;
use platypus::lexical::index::inverted::query::term::TermQuery;
use platypus::lexical::index::inverted::query::wildcard::WildcardQuery;

// Term query - simple keyword matching
let query = Box::new(TermQuery::new("field", "term"));

// Phrase query - exact phrase matching
let query = Box::new(PhraseQuery::new("field", vec!["hello", "world"]));

// Numeric range query
let query = Box::new(NumericRangeQuery::new_float("price", Some(100.0), Some(500.0)));

// Boolean query - combine multiple queries
let mut bool_query = BooleanQuery::new();
bool_query.add_must(Box::new(TermQuery::new("category", "book")));
bool_query.add_should(Box::new(TermQuery::new("author", "tolkien")));
let query = Box::new(bool_query);

// Fuzzy query - approximate string matching
let query = Box::new(FuzzyQuery::new("title", "progamming", 2)); // max edit distance: 2

// Wildcard query - pattern matching
let query = Box::new(WildcardQuery::new("filename", "*.pdf"));

// Geographic distance query
let query = Box::new(GeoDistanceQuery::new(
    "location",
    GeoPoint::new(40.7128, -74.0060), // NYC coordinates
    10.0, // 10km radius
));

// Geographic bounding box query
let query = Box::new(GeoBoundingBoxQuery::new(
    "location",
    GeoBoundingBox::new(
        GeoPoint::new(40.5, -74.5), // bottom-left
        GeoPoint::new(41.0, -73.5), // top-right
    ),
));
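The fuzzy query above matches terms within a maximum edit distance. As background, the classic Levenshtein distance (insertions, deletions, substitutions) can be computed with a small dynamic program; this illustrates the metric itself, not Platypus's internal implementation:

```rust
// Levenshtein edit distance via row-by-row dynamic programming.
// Illustrates the metric behind fuzzy matching; not Platypus internals.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

fn main() {
    // "progamming" is one insertion away from "programming", so a fuzzy
    // query with max edit distance 2 would match it.
    println!("{}", levenshtein("progamming", "programming")); // 1
}
```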

🎯 Advanced Features

Vector Search with Text Embeddings

Platypus supports semantic search using text embeddings. You can use local BERT models via Candle or OpenAI's API.

Using Candle (Local BERT Models)

[dependencies]
platypus = { version = "0.1", features = ["embeddings-candle"] }

use platypus::embedding::candle_text_embedder::CandleTextEmbedder;
use platypus::embedding::text_embedder::TextEmbedder;
use platypus::vector::DistanceMetric;
use platypus::vector::engine::VectorEngine;
use platypus::vector::index::{FlatIndexConfig, VectorIndexConfig, VectorIndexFactory};
use platypus::vector::types::VectorSearchRequest;
use platypus::storage::memory::{MemoryStorage, MemoryStorageConfig};
use std::sync::Arc;

#[tokio::main]
async fn main() -> platypus::error::Result<()> {
    // Initialize embedder with a sentence-transformers model
    let embedder = CandleTextEmbedder::new("sentence-transformers/all-MiniLM-L6-v2")?;

    // Generate embeddings for documents
    let documents = vec![
        (1, "Rust is a systems programming language"),
        (2, "Python is great for data science"),
        (3, "Machine learning with neural networks"),
    ];

    // Create vector index configuration
    let vector_config = VectorIndexConfig::Flat(FlatIndexConfig {
        dimension: embedder.dimension(),
        distance_metric: DistanceMetric::Cosine,
        normalize_vectors: true,
        ..Default::default()
    });

    // Create storage and index
    let storage = Arc::new(MemoryStorage::new(MemoryStorageConfig::default()));
    let index = VectorIndexFactory::create(storage, vector_config)?;
    let mut engine = VectorEngine::new(index)?;

    // Add documents with their embeddings
    for (id, text) in &documents {
        let vector = embedder.embed(text).await?;
        engine.add_vector(*id, vector)?;
    }
    engine.commit()?;

    // Search with query embedding
    let query_vector = embedder.embed("programming languages").await?;
    let request = VectorSearchRequest::new(query_vector).top_k(10);
    let results = engine.search(request)?;

    for result in results.results {
        println!("Doc ID: {}, Similarity: {:.4}", result.doc_id, result.similarity);
    }

    Ok(())
}

Using OpenAI Embeddings

[dependencies]
platypus = { version = "0.1", features = ["embeddings-openai"] }

use platypus::embedding::openai_text_embedder::OpenAITextEmbedder;
use platypus::embedding::text_embedder::TextEmbedder;

// Initialize with API key
let embedder = OpenAITextEmbedder::new(
    "your-api-key".to_string(),
    "text-embedding-3-small".to_string()
)?;

// Generate embeddings
let vector = embedder.embed("your text here").await?;

Multimodal Search (Text + Images)

Platypus supports cross-modal search using CLIP (Contrastive Language-Image Pre-Training) models, enabling semantic search across text and images. This allows you to:

  • Text-to-Image Search: Find images using natural language queries
  • Image-to-Image Search: Find visually similar images using an image query
  • Semantic Understanding: Search based on content meaning, not just keywords

Setup

Add the embeddings-multimodal feature to your Cargo.toml:

[dependencies]
platypus = { version = "0.1", features = ["embeddings-multimodal"] }

Text-to-Image Search Example

use platypus::embedding::candle_multimodal_embedder::CandleMultimodalEmbedder;
use platypus::embedding::text_embedder::TextEmbedder;
use platypus::embedding::image_embedder::ImageEmbedder;
use platypus::vector::engine::VectorEngine;
use platypus::vector::index::{HnswIndexConfig, VectorIndexConfig, VectorIndexFactory};
use platypus::vector::types::VectorSearchRequest;
use platypus::vector::DistanceMetric;
use platypus::storage::memory::{MemoryStorage, MemoryStorageConfig};
use std::sync::Arc;

#[tokio::main]
async fn main() -> platypus::error::Result<()> {
    // Initialize CLIP embedder (automatically downloads model from HuggingFace)
    let embedder = CandleMultimodalEmbedder::new("openai/clip-vit-base-patch32")?;

    // Create vector index with CLIP's embedding dimension (512)
    let vector_config = VectorIndexConfig::Hnsw(HnswIndexConfig {
        dimension: ImageEmbedder::dimension(&embedder), // 512 for CLIP ViT-Base-Patch32
        distance_metric: DistanceMetric::Cosine,
        ..Default::default()
    });

    let storage = Arc::new(MemoryStorage::new(MemoryStorageConfig::default()));
    let index = VectorIndexFactory::create(storage, vector_config)?;
    let mut engine = VectorEngine::new(index)?;

    // Index your image collection
    let image_paths = vec!["image1.jpg", "image2.jpg", "image3.jpg"];
    for (id, image_path) in image_paths.iter().enumerate() {
        let vector = embedder.embed_image(image_path).await?;
        engine.add_vector(id as u64, vector)?;
    }
    engine.commit()?;

    // Search images using natural language
    let query_vector = embedder.embed("a photo of a cat playing").await?;
    let request = VectorSearchRequest::new(query_vector).top_k(10);
    let results = engine.search(request)?;

    for result in results.results {
        println!("Image ID: {}, Similarity: {:.4}", result.doc_id, result.similarity);
    }

    Ok(())
}

Image-to-Image Search Example

// Find visually similar images using an image as query
let query_image_vector = embedder.embed_image("query.jpg").await?;
let request = VectorSearchRequest::new(query_image_vector).top_k(5);
let results = engine.search(request)?;

for result in results.results {
    println!("Similar Image ID: {}, Similarity: {:.4}", result.doc_id, result.similarity);
}

How It Works

  1. CLIP Model: Uses OpenAI's CLIP model which maps both text and images into the same 512-dimensional vector space
  2. Automatic Download: Models are automatically downloaded from HuggingFace on first use
  3. GPU Acceleration: Automatically uses GPU if available (via Candle)
  4. Shared Embedding Space: Text and image embeddings can be directly compared using cosine similarity
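Because text and image embeddings land in the same vector space, comparing them reduces to cosine similarity. A standalone sketch of that comparison (Platypus computes this internally; the function here is illustrative):

```rust
// Cosine similarity between two embeddings in a shared vector space.
// Standalone illustration; not the Platypus implementation.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must share a dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Toy 4-dimensional "embeddings" standing in for 512-d CLIP vectors.
    let text = [0.1, 0.9, 0.2, 0.0];
    let image = [0.2, 0.8, 0.1, 0.1];
    println!("similarity: {:.4}", cosine_similarity(&text, &image));
}
```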

Supported Models

Currently supports CLIP ViT-Base-Patch32 architecture:

  • Model: openai/clip-vit-base-patch32
  • Embedding Dimension: 512
  • Image Size: 224x224

Complete Examples

Working examples with detailed explanations are included in the repository.

Run the examples:

# Text-to-image search
cargo run --example text_to_image_search --features embeddings-multimodal

# Image-to-image search
cargo run --example image_to_image_search --features embeddings-multimodal -- query.jpg

Faceted Search

use platypus::lexical::index::inverted::query::term::TermQuery;
use platypus::lexical::search::facet::{FacetConfig, FacetedSearchEngine};
use platypus::lexical::types::LexicalSearchRequest;

// Create faceted search engine
let facet_config = FacetConfig {
    max_facet_count: 10,
    min_count: 1,
};

let mut faceted_engine = FacetedSearchEngine::new(
    engine,
    vec!["category".to_string(), "author".to_string()],
    facet_config,
)?;

// Perform faceted search
let query = Box::new(TermQuery::new("content", "programming"));
let request = LexicalSearchRequest::new(query).max_docs(10);
let results = faceted_engine.search(request)?;

// Access facet results
for facet in &results.facets {
    println!("Facet field: {}", facet.field);
    for count in &facet.counts {
        println!("  {}: {} documents", count.value, count.count);
    }
}

Spell Correction

use platypus::spelling::corrector::{SpellingCorrector, CorrectorConfig};
use platypus::spelling::dictionary::Dictionary;

// Build a dictionary from your corpus
let mut dictionary = Dictionary::new();
dictionary.add_word("programming", 100);
dictionary.add_word("program", 80);
dictionary.add_word("programmer", 60);

// Create spell corrector with configuration
let config = CorrectorConfig {
    max_edit_distance: 2,
    min_word_frequency: 5,
    max_suggestions: 5,
};

let corrector = SpellingCorrector::new(dictionary, config);

// Check and suggest corrections
if let Some(correction) = corrector.correct("progamming") {
    println!("Did you mean: '{}'? (confidence: {:.2})",
        correction.suggestion, correction.confidence);
}

Custom Analysis Pipeline

use platypus::analysis::analyzer::pipeline::PipelineAnalyzer;
use platypus::analysis::tokenizer::whitespace::WhitespaceTokenizer;
use platypus::analysis::token_filter::lowercase::LowercaseFilter;
use platypus::analysis::token_filter::stop::StopWordFilter;

// Create custom analyzer with multiple filters
let mut analyzer = PipelineAnalyzer::new(Box::new(WhitespaceTokenizer));
analyzer.add_filter(Box::new(LowercaseFilter));
analyzer.add_filter(Box::new(StopWordFilter::english()));

// Analyze text
let text = "The Quick Brown Fox Jumps Over the Lazy Dog";
let tokens = analyzer.analyze(text)?;

for token in tokens {
    println!("Token: {}, Position: {}", token.text, token.position);
}

For language-specific tokenization (Japanese, Korean, Chinese):

use platypus::analysis::tokenizer::lindera::LinderaTokenizer;
use platypus::analysis::analyzer::pipeline::PipelineAnalyzer;

// Japanese tokenization with Lindera
let tokenizer = LinderaTokenizer::japanese()?;
let analyzer = PipelineAnalyzer::new(Box::new(tokenizer));

let text = "東京は日本の首都です";
let tokens = analyzer.analyze(text)?;

📊 Performance

Platypus is designed for high performance:

  • SIMD Acceleration - Uses wide instruction sets for vector operations
  • Memory-Mapped I/O - Efficient file access with minimal memory overhead
  • Incremental Updates - Real-time document addition without full reindexing
  • Index Optimization - Background merge operations for optimal search performance
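The style of vectorized code referenced above can be sketched as a chunked dot product whose inner loop compilers readily auto-vectorize. This is an illustration of the approach, assuming nothing about Platypus's actual kernels:

```rust
// Chunked dot product written so the compiler can auto-vectorize the
// fixed-width inner loop. Illustrates SIMD-friendly code; not Platypus's kernels.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4]; // four independent accumulator lanes
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let idx = i * 4 + lane;
            acc[lane] += a[idx] * b[idx];
        }
    }
    let mut sum: f32 = acc.iter().sum();
    // Scalar tail for lengths not divisible by 4.
    for idx in chunks * 4..a.len() {
        sum += a[idx] * b[idx];
    }
    sum
}

fn main() {
    let a: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let b = vec![1.0f32; 8];
    println!("dot = {}", dot(&a, &b)); // 0 + 1 + ... + 7 = 28
}
```

Keeping four independent accumulators breaks the loop-carried dependency on a single sum, which is what lets the hardware process lanes in parallel.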

🛠️ Development

Building from Source

git clone https://github.com/mosuka/platypus.git
cd platypus
cargo build --release

Running Tests

cargo test

Running Benchmarks

cargo bench

Checking Code Quality

cargo clippy
cargo fmt --check

📖 Examples

Platypus includes numerous examples covering lexical search, vector search, and advanced features.

Run any example with:

cargo run --example <example_name>

# For embedding examples, use feature flags:
cargo run --example vector_search --features embeddings-candle
cargo run --example embedding_with_openai --features embeddings-openai
cargo run --example text_to_image_search --features embeddings-multimodal
cargo run --example image_to_image_search --features embeddings-multimodal

🔧 Feature Flags

Platypus uses feature flags to enable optional functionality:

[dependencies]
# Default features only
platypus = "0.1"

# With Candle embeddings (local BERT models)
platypus = { version = "0.1", features = ["embeddings-candle"] }

# With OpenAI embeddings
platypus = { version = "0.1", features = ["embeddings-openai"] }

# With all embedding features
platypus = { version = "0.1", features = ["embeddings-all"] }

Available features:

  • embeddings-candle - Local text embeddings using Candle and BERT models
  • embeddings-openai - OpenAI API-based text embeddings
  • embeddings-multimodal - Multimodal embeddings (text and images) using CLIP models
  • embeddings-all - All embedding providers

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under either of the licenses included in the repository, at your option.
