
AutoML Mini

A streamlined automated machine learning library focusing on core functionality and best practices


🎯 Overview

AutoML Mini is a Python library designed for rapid prototyping and implementation of automated machine learning workflows. Built with engineering best practices in mind, it demonstrates clean architecture, comprehensive testing, and production-ready code quality following Python standards.

Key Features

  • 🔄 Automated Pipeline: End-to-end ML workflow automation
  • 🔧 Preprocessing: Intelligent handling of mixed data types (numerical & categorical)
  • 🤖 Model Selection: Cross-validated comparison of multiple algorithms
  • 📊 Evaluation: Comprehensive performance metrics and reporting
  • 🏗️ SOLID Principles: Clean, extensible, and maintainable architecture with proven design patterns
  • ✅ Comprehensive Testing: 75 pytest-based unit and integration tests with high coverage
  • 📝 Documentation: Clear usage examples and library reference following PEP 257
  • ⚡ Modern Tooling: Fast development with uv, pre-commit hooks, and automated quality checks
  • 🐍 Python Standards: PEP 8 compliant code with type hints (PEP 484) and modern Python practices

🚀 Quick Start

# 1. Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install AutoML Mini from GitHub
uv pip install "automl_mini @ git+https://github.com/alakob/automl_mini.git"

# 3. Use it! (the following is Python, not shell)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from automl_mini import AutoMLPipeline

# Load data and train
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# AutoML in 3 lines
pipeline = AutoMLPipeline()
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

print(f"Test Accuracy: {pipeline.score(X_test, y_test):.3f}")

That's it! AutoML Mini handles preprocessing, model selection, and evaluation automatically.

For more detailed examples and configuration options, see the Usage Examples section below.

📋 Table of Contents

  • 🎯 Overview
  • 🚀 Quick Start
  • 📦 Installation
  • 📚 Usage Examples
  • 🏗️ Architecture
  • 🧪 Testing
  • 🐍 Python Best Practices
  • 📈 Performance & Capabilities
  • 🎛️ Library Reference

📦 Installation

Standard Installation

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install from GitHub
uv pip install "automl_mini @ git+https://github.com/alakob/automl_mini.git"

Development Installation

git clone https://github.com/alakob/automl_mini.git
cd automl_mini
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

🔧 Alternative Methods & Troubleshooting

Using pip (traditional)

python -m venv .venv
source .venv/bin/activate
pip install "automl_mini @ git+https://github.com/alakob/automl_mini.git"

Prerequisites

  • Python 3.8 - 3.11
  • Git for cloning repository

Troubleshooting

  • uv not found: Restart shell or export PATH="$HOME/.cargo/bin:$PATH"
  • Permission errors: Use virtual environments (recommended)

📚 Usage Examples

Basic Usage

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

from automl_mini import AutoMLPipeline

# Load the iris dataset (following scikit-learn naming conventions)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and predict
pipeline = AutoMLPipeline()
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

# View results
results = pipeline.get_results()
print(results.summary())

# Additional information
print(f"Best model: {results.best_model}")
print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")

# Feature importance (formatted)
print(pipeline.format_feature_importance())

# Or get raw feature importance as dictionary
# print(f"Feature importance: {pipeline.get_feature_importance()}")

Custom Configuration

from automl_mini import AutoMLPipeline, PipelineConfig, ProblemType

# Advanced configuration following Python best practices
config = PipelineConfig(
    cv_folds=5,
    problem_type=ProblemType.CLASSIFICATION,
    verbose=True,
    random_state=42
)

pipeline = AutoMLPipeline(config=config)
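
The same configuration mechanism applies to regression. A minimal sketch using scikit-learn's diabetes dataset, assuming (per the regression metrics listed later) that score() returns R² for regression problems:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from automl_mini import AutoMLPipeline, PipelineConfig, ProblemType

# Force regression explicitly instead of relying on automatic detection
config = PipelineConfig(problem_type=ProblemType.REGRESSION, cv_folds=5, random_state=42)

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.3, random_state=42
)

pipeline = AutoMLPipeline(config=config)
pipeline.fit(X_train, y_train)
print(f"Test score (assumed R²): {pipeline.score(X_test, y_test):.3f}")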

Run Complete Examples

A comprehensive example demonstrating all library features is available in the examples/ directory:

cd examples
python basic_usage.py

The basic_usage.py script includes:

  • Classification Example: Generated dataset with mixed numeric and categorical features (see the sketch after this list)
  • Regression Example: Diabetes progression dataset with feature engineering
  • Real-world Example: Complete Iris classification workflow
  • Error Handling Demo: Input validation and error handling showcase
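
To illustrate the mixed-type workflow from the classification example, here is a minimal sketch; the synthetic DataFrame and its column names are hypothetical, and it assumes the pipeline accepts object-dtype columns directly (as the preprocessing features described below suggest):

import numpy as np
import pandas as pd

from automl_mini import AutoMLPipeline

rng = np.random.default_rng(42)
n = 200

# Hypothetical mixed-type dataset: two numeric columns, one categorical
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "city": rng.choice(["london", "paris", "berlin"], size=n),
})
y = pd.Series(rng.integers(0, 2, size=n))  # binary target

pipeline = AutoMLPipeline()
pipeline.fit(X, y)  # numeric and categorical columns handled automatically
print(pipeline.predict(X.head()))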

🏗️ Architecture

The library follows SOLID principles and implements proven design patterns for maintainable and extensible code, adhering to Python best practices:

Core Components

src/automl_mini/
├── __init__.py          # Public interface (PEP 257 docstrings)
├── pipeline.py          # Main orchestration (AutoMLPipeline)
├── preprocessing.py     # Data preprocessing transformers
├── models.py            # Model selection and evaluation
└── utils.py             # Utilities and validation

Python Standards Compliance

  • PEP 8: Code style and formatting with black and isort
  • PEP 257: Comprehensive docstring conventions
  • PEP 484: Type hints throughout the codebase
  • PEP 518: Modern pyproject.toml configuration
  • Import Organization: Grouped imports (stdlib, third-party, local)
  • Naming Conventions: snake_case for variables, PascalCase for classes

Design Patterns Implemented

  • Factory Pattern: Model creation for different problem types (Classification/Regression)
  • Strategy Pattern: Problem type selection and algorithm strategy
  • Template Method Pattern: Transformer workflow with ABC inheritance
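
As an illustration of the Template Method pattern with ABC inheritance, here is a hedged sketch; the class and method names are hypothetical, not the library's actual API. The base class fixes the workflow skeleton while subclasses supply the individual steps:

from abc import ABC, abstractmethod

import pandas as pd

class BaseTransformer(ABC):
    """Template Method: fit_transform() fixes the workflow; subclasses fill in the steps."""

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        self._validate(X)          # shared step
        self._fit(X)               # subclass hook
        return self._transform(X)  # subclass hook

    def _validate(self, X: pd.DataFrame) -> None:
        if X.empty:
            raise ValueError("Input DataFrame is empty")

    @abstractmethod
    def _fit(self, X: pd.DataFrame) -> None: ...

    @abstractmethod
    def _transform(self, X: pd.DataFrame) -> pd.DataFrame: ...

class MeanImputer(BaseTransformer):
    def _fit(self, X: pd.DataFrame) -> None:
        self.means_ = X.mean(numeric_only=True)

    def _transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.fillna(self.means_)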

Key Classes

  • AutoMLPipeline: Main orchestrator coordinating the entire workflow
  • DataPreprocessor: Automatic feature type detection and preprocessing
  • ModelSelector: Cross-validated model comparison and selection

📖 Detailed Architecture: See ARCHITECTURE.md for comprehensive design patterns and SOLID principles implementation.

🧪 Testing

The library includes comprehensive testing with 75 test cases following Python testing best practices:

Test Distribution:

  • Pipeline Tests (29 tests): End-to-end workflow and integration testing
  • Preprocessing Tests (29 tests): Individual component testing (transformers, data processing)
  • Utility Tests (17 tests): Input validation, helper functions, and error handling scenarios

Python Testing Standards

  • pytest: Modern Python testing framework
  • Test Coverage: High coverage with detailed reporting
  • Test Organization: Clear separation of unit vs integration tests
  • Fixtures: Reusable test data and configurations
  • Parameterized Tests: Data-driven testing for multiple scenarios
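
A hedged sketch of what such tests can look like; the fixture and parameterization below are illustrative, not copied from the actual test suite:

import pytest
from sklearn.datasets import load_iris

from automl_mini import AutoMLPipeline

@pytest.fixture
def iris_data():
    """Reusable test data shared across tests."""
    iris = load_iris()
    return iris.data, iris.target

@pytest.mark.parametrize("test_rows", [1, 5, 20])
def test_predict_output_length(iris_data, test_rows):
    # Data-driven check: prediction count matches input row count
    X, y = iris_data
    pipeline = AutoMLPipeline()
    pipeline.fit(X, y)
    assert len(pipeline.predict(X[:test_rows])) == test_rows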

Running Tests

# Run all tests with coverage
uv run python -m pytest tests/ --cov=automl_mini

# Run specific test modules
uv run python -m pytest tests/test_pipeline.py -v

# Generate HTML coverage report
uv run python -m pytest tests/ --cov=automl_mini --cov-report=html

Test Results

============= 75 passed in 11.09s =============
All tests passing with comprehensive coverage

🐍 Python Best Practices

This library demonstrates and follows established Python standards:

Code Quality

  • Linting: ruff for fast Python linting
  • Formatting: black for consistent code formatting
  • Import Sorting: isort for organized imports
  • Type Checking: Type hints with mypy compatibility
  • Pre-commit Hooks: Automated quality checks

Development Standards

  • Virtual Environments: Proper isolation with venv/uv
  • Dependency Management: Modern pyproject.toml configuration
  • Version Control: Meaningful commit messages and branch strategy
  • Documentation: Comprehensive docstrings following PEP 257

Performance Considerations

  • Lazy Loading: Efficient memory usage patterns
  • Type Hints: Better IDE support and static analysis
  • Error Handling: Proper exception hierarchy and handling
  • Resource Management: Context managers where appropriate

📈 Performance & Capabilities

Performance Features

  • Parallel Cross-Validation: All model evaluation uses n_jobs=-1 for parallel processing (see the sketch after this list)
  • Parallel Model Training: Random Forest and Linear models (LogisticRegression, LinearRegression) support parallel training
  • Sequential Models: Gradient Boosting models run sequentially (inherent algorithm limitation)
  • Efficient Libraries: Built on NumPy/Pandas optimized operations
  • Memory Management: Proper DataFrame operations with index management
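
In scikit-learn terms, the parallel cross-validation above corresponds to something like the following (a sketch of the general mechanism, not the library's internal code):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_jobs=-1 fans the CV folds out across all available CPU cores
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X, y, cv=3, n_jobs=-1
)
print(scores.mean())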

Supported Problem Types

  • Classification: Multi-class classification with probability estimates
  • Regression: Continuous target prediction

Algorithms Included

  • Random Forest: Robust ensemble method
  • Logistic/Linear Regression: Linear baseline models
  • Gradient Boosting: Advanced boosting algorithms

Preprocessing Features

  • Numerical: Mean imputation + StandardScaler
  • Categorical: Mode imputation + OneHot/Label encoding
  • Automatic type detection: No manual specification needed
  • Missing value handling: Robust imputation strategies
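
The behavior above is roughly equivalent to this scikit-learn construction (a sketch of the equivalent transformations, not the library's internal implementation; the column names are hypothetical, since the library detects feature types automatically):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical: mean imputation followed by standard scaling
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical: mode imputation followed by one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "income"]),  # hypothetical numeric columns
    ("cat", categorical, ["city"]),       # hypothetical categorical column
])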

Evaluation Metrics

  • Classification: F1-score (weighted), accuracy, precision, recall
  • Regression: R², RMSE, MAE
  • Cross-validation: K-fold validation for reliable estimates
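
These metrics map directly onto scikit-learn's implementations; a minimal sketch with toy values:

import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error, mean_squared_error, r2_score

# Classification: weighted F1 accounts for class imbalance
y_true, y_pred = [0, 1, 2, 2], [0, 1, 1, 2]
print(f1_score(y_true, y_pred, average="weighted"))

# Regression: R², RMSE, MAE
y_true_r, y_pred_r = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
print(r2_score(y_true_r, y_pred_r))
print(np.sqrt(mean_squared_error(y_true_r, y_pred_r)))  # RMSE
print(mean_absolute_error(y_true_r, y_pred_r))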

🎛️ Library Reference

AutoMLPipeline

Type-hinted API following PEP 484:

from typing import Dict, Optional, Union
import numpy as np
import pandas as pd

class AutoMLPipeline:
    def __init__(self, config: Optional[PipelineConfig] = None) -> None: ...
    def fit(self, X: Union[np.ndarray, pd.DataFrame], y: Union[np.ndarray, pd.Series]) -> 'AutoMLPipeline': ...
    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> np.ndarray: ...
    def predict_proba(self, X: Union[np.ndarray, pd.DataFrame]) -> np.ndarray: ...  # Classification only
    def score(self, X: Union[np.ndarray, pd.DataFrame], y: Union[np.ndarray, pd.Series]) -> float: ...
    def get_results(self) -> PipelineResult: ...
    def get_feature_importance(self) -> Optional[Dict[str, float]]: ...
    def format_feature_importance(self, top_n: Optional[int] = None) -> Optional[str]: ...
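
For classification problems, a short usage sketch of the probability interface (assuming predict_proba follows the scikit-learn convention of one probability column per class):

from sklearn.datasets import load_iris

from automl_mini import AutoMLPipeline

X, y = load_iris(return_X_y=True)

pipeline = AutoMLPipeline()
pipeline.fit(X, y)

proba = pipeline.predict_proba(X[:3])
print(proba.shape)  # assumed (3, n_classes), one column per class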

Configuration Classes

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ProblemType(Enum):
    CLASSIFICATION = "classification"
    REGRESSION = "regression"

@dataclass
class PipelineConfig:
    cv_folds: int = 3
    test_size: float = 0.2
    random_state: int = 42
    problem_type: Optional[ProblemType] = None
    verbose: bool = False

Result Objects

from dataclasses import dataclass
from typing import List

from sklearn.base import BaseEstimator

@dataclass
class PipelineResult:
    best_model: BaseEstimator
    best_score: float
    problem_type: ProblemType
    model_results: List[ModelResult]
    total_time: float
    # ... additional fields with proper type annotations
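
A short sketch of inspecting these fields after fitting; iterating model_results just prints each entry's repr, since ModelResult's own fields are not documented here:

# Assumes a fitted pipeline, as in the Quick Start example
results = pipeline.get_results()

print(f"Problem type: {results.problem_type.value}")  # "classification" or "regression"
print(f"Best score: {results.best_score:.3f} (total time: {results.total_time:.1f}s)")

# Each entry describes one cross-validated candidate model
for model_result in results.model_results:
    print(model_result)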

Python Conventions:

  • snake_case: All variable and function names
  • PascalCase: Class names (AutoMLPipeline, PipelineConfig)
  • Type Hints: Complete type annotations for better IDE support
  • Dataclasses: Modern Python data structures with automatic methods
  • Enums: Type-safe constants for problem types
