Freamon: Feature-Rich EDA, Analytics, and Modeling Toolkit

Freamon is a comprehensive Python toolkit for exploratory data analysis, feature engineering, and model development with a focus on practical data science workflows.

Quick Start | Documentation | Installation | Examples

Features

  • Exploratory Data Analysis: Automatic EDA with comprehensive reporting in HTML, Markdown, Excel, PowerPoint, and interactive Jupyter notebook displays
  • Advanced Multivariate Analysis: PCA visualization, correlation networks, and target-oriented analysis
  • Feature Engineering: Advanced feature engineering for numeric, categorical, and text data
  • Feature Selection: Statistical feature selection including Chi-square, ANOVA F-test, and effect size analysis
  • Deduplication: High-performance deduplication with Polars optimization (2-5x faster, 60-70% less memory), LSH, supervised ML, and active learning
  • Topic Modeling: Optimized text analysis with NMF and LDA, supporting large datasets up to 100K documents
  • Automated Modeling: Intelligent end-to-end modeling workflow for text, tabular, and time series data
  • Modeling: Custom model implementations with feature importance and model interpretation
  • Pipeline: Scikit-learn compatible pipeline with additional features
  • Drift Analysis: Tools for detecting and analyzing data drift
  • Word Embeddings: Integration with various word embedding techniques
  • Visualization: Publication-quality visualizations with proper handling of all special characters
  • Performance Optimization: Multiprocessing support and intelligent sampling for large-dataset analysis (a sampling sketch follows this list)
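For tables too large to analyze in full, one simple pattern is to run the analysis on a representative sample. The sketch below does the sampling manually with pandas before handing the frame to EDAAnalyzer; freamon's own sampling and multiprocessing options may expose this differently, so treat the file name and row cap here as illustrative assumptions.

import pandas as pd
from freamon.eda import EDAAnalyzer

# Hypothetical large table; the file name and 200,000-row cap are placeholders
df = pd.read_csv("transactions.csv")
sample_df = df.sample(n=min(len(df), 200_000), random_state=42)

# Analyze the sample instead of the full frame
analyzer = EDAAnalyzer(sample_df, target_column="target")
analyzer.run_full_analysis()
analyzer.generate_report("eda_report_sample.html")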

Installation

Basic Installation

For basic functionality (EDA, visualization, core deduplication):

pip install freamon

Installation with All Features

For full functionality including advanced modeling, text processing, and performance optimizations:

pip install "freamon[all]"

Feature-Specific Installation

For specific feature sets:

# For high-performance with Polars acceleration
pip install "freamon[performance]"

# For text analysis and topic modeling
pip install "freamon[topic_modeling]"

# For word embeddings support
pip install "freamon[word_embeddings]"

# For extended features (modeling, Polars, LightGBM, SHAP, etc.)
pip install "freamon[extended]"

# For Markdown report generation
pip install "freamon[markdown_reports]"

Dependencies by Feature

Here's what each optional dependency set provides (a runtime availability check is sketched after the list):

  • Core (always installed):
    • numpy, pandas, scikit-learn, matplotlib, seaborn, networkx
  • Performance (freamon[performance]):
    • pyarrow - For faster data processing
  • Extended (freamon[extended]):
    • polars - High-performance DataFrame library (2-5x faster than pandas)
    • lightgbm - Gradient boosting framework
    • optuna - Hyperparameter optimization
    • shap - Model explanation
    • spacy - NLP processing
    • statsmodels - Statistical modeling
    • dask - Parallel computing
  • Topic Modeling (freamon[topic_modeling]):
    • gensim - Topic modeling
    • pyldavis - Topic visualization
    • wordcloud - Word cloud generation
  • Word Embeddings (freamon[word_embeddings]):
    • gensim - Word vectors
    • nltk - Natural language toolkit
    • spacy - Linguistic features
  • Markdown Reports (freamon[markdown_reports]):
    • markdown - Report generation
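Because several features sit behind optional extras, it can help to check at runtime which optional packages are actually importable before relying on them. The snippet below is plain Python using importlib, not a freamon utility, and the package-to-extra mapping simply mirrors the list above.

import importlib.util

# Optional packages grouped by the extras listed above; import names may differ
# from PyPI names (e.g. the pyldavis package is imported as pyLDAvis)
optional_deps = {
    "performance": ["pyarrow"],
    "extended": ["polars", "lightgbm", "optuna", "shap", "spacy", "statsmodels", "dask"],
    "topic_modeling": ["gensim", "pyLDAvis", "wordcloud"],
    "word_embeddings": ["gensim", "nltk", "spacy"],
    "markdown_reports": ["markdown"],
}

for extra, packages in optional_deps.items():
    missing = [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]
    status = "available" if not missing else f"missing: {', '.join(missing)}"
    print(f"freamon[{extra}]: {status}")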

Quick Start

import pandas as pd
from freamon.eda import EDAAnalyzer

# Load your tabular data (any pandas DataFrame works; the file name is a placeholder)
df = pd.read_csv('your_data.csv')

# Create an analyzer instance
analyzer = EDAAnalyzer(df, target_column='target')

# Run the analysis
analyzer.run_full_analysis()

# Generate an HTML report
analyzer.generate_report('eda_report.html')

# Or a Markdown report for version control
analyzer.generate_report('eda_report.md', format='markdown')

Key Components

Step-by-Step Workflow Example

Below is a complete workflow showing data type detection, EDA analysis, modeling, and reporting:

import pandas as pd
import numpy as np
from freamon.eda import EDAAnalyzer
from freamon.utils.datatype_detector import detect_datatypes
from freamon import auto_model

# Required for PowerPoint/Excel reports
# pip install "freamon[extended]"
from freamon.eda.export import export_to_powerpoint, export_to_excel

# 1. Load sample data
df = pd.read_csv('customer_data.csv')
print(f"Dataset shape: {df.shape}")

# 2. Run data type detection
datatype_results = detect_datatypes(df)
print("\nDetected data types:")
print(f"Text columns: {datatype_results['text_columns']}")
print(f"Categorical columns: {datatype_results['categorical_columns']}")
print(f"Numeric columns: {datatype_results['numeric_columns']}")
print(f"Date columns: {datatype_results['date_columns']}")

# 3. Generate data type detection report
from freamon.utils.datatype_fixes import save_detection_report
save_detection_report(
    datatype_results,
    'datatype_detection_report.html',
    title='Customer Data Type Detection'
)
print("\nData type detection report saved to 'datatype_detection_report.html'")

# 4. Run EDA analysis
analyzer = EDAAnalyzer(
    df,
    target_column='churn',  # For supervised analysis
    text_columns=datatype_results['text_columns'],
    categorical_columns=datatype_results['categorical_columns'],
    numeric_columns=datatype_results['numeric_columns'],
    datetime_columns=datatype_results['date_columns']
)
analyzer.run_full_analysis()

# 5. Generate EDA reports in different formats
analyzer.generate_report('eda_report.html')  # HTML report
analyzer.generate_report('eda_report.md', format='markdown')  # Markdown report
print("\nEDA reports generated in HTML and Markdown formats")

# 6. Export EDA results to PowerPoint for presentations
export_to_powerpoint(
    analyzer.get_report_data(),
    'eda_presentation.pptx',
    report_type='eda'
)
print("\nEDA results exported to PowerPoint")

# 7. Run automated modeling with data type detection
# Note: Install required dependencies for advanced modeling:
# pip install "freamon[extended,topic_modeling]"
results = auto_model(
    df=df,
    target_column='churn',
    problem_type='classification',
    # Use our detected data types
    text_columns=datatype_results['text_columns'],
    categorical_columns=datatype_results['categorical_columns'],
    date_column=datatype_results['date_columns'][0] if datatype_results['date_columns'] else None
)

# 8. Examine model results
print("\nModel Performance:")
for metric, value in results['metrics'].items():
    if 'mean' in metric:
        print(f"{metric}: {value:.4f}")

# 9. Plot model visualizations
fig1 = results['autoflow'].plot_metrics()
fig1.savefig('cv_metrics.png')

fig2 = results['autoflow'].plot_importance(top_n=15)
fig2.savefig('feature_importance.png')

# 10. Export model results to Excel
model_data = {
    'model_type': results['autoflow'].model_type,
    'metrics': results['metrics'],
    'feature_importance': results['feature_importance'],
    'training_date': pd.Timestamp.now().strftime('%Y-%m-%d')
}
export_to_excel(model_data, 'model_performance.xlsx', report_type='model')

# 11. Export model results to PowerPoint
export_to_powerpoint(model_data, 'model_presentation.pptx', report_type='model')
print("\nModel results exported to Excel and PowerPoint")

# 12. Make predictions on new data
new_data = pd.read_csv('new_customers.csv')
predictions = results['autoflow'].predict(new_data)
new_data['predicted_churn'] = predictions
new_data.to_csv('predictions.csv', index=False)
print("\nPredictions saved to 'predictions.csv'")

Comprehensive Deduplication Workflow

Complete step-by-step process for deduplication, including analysis, visualization, and modeling:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from freamon.deduplication.exact_deduplication import hash_deduplication 
from freamon.deduplication.lsh_deduplication import lsh_deduplication
from freamon.data_quality.duplicates import detect_duplicates, get_duplicate_groups
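# Note: IndexTracker is defined in the repository's examples directory, so the import
# below assumes you are running from a checkout of the freamon repo with examples/ importable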
from examples.deduplication_tracking_example import IndexTracker
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# For network visualization (optional)
# pip install networkx
import networkx as nx

# 1. Load sample data with text and duplicates
df = pd.read_csv('data_with_duplicates.csv')
print(f"Original dataset shape: {df.shape}")

# 2. Analyze duplicates using built-in detection
duplicate_stats = detect_duplicates(df)
print(f"\nDuplicate analysis:")
print(f"Exact duplicates: {duplicate_stats['duplicate_count']} records")
print(f"Duplicate percentage: {duplicate_stats['duplicate_percent']:.2f}%")

# 3. Get duplicate groups for examination
duplicate_groups = get_duplicate_groups(df)
print(f"\nFound {len(duplicate_groups)} duplicate groups")
for i, group in enumerate(duplicate_groups[:3]):  # Show first 3 groups
    print(f"\nDuplicate group {i+1}:")
    print(df.iloc[group].head(1))  # Show one example from each group

# 4. Initialize index tracker to maintain mapping
tracker = IndexTracker().initialize_from_df(df)

# 5. Find duplicates using LSH (locality-sensitive hashing) for text similarity
print("\nRunning LSH deduplication...")
kept_indices, similarity_dict = lsh_deduplication(
    df['description'],
    threshold=0.8,
    num_bands=20,
    preprocess=True,
    return_similarity_dict=True
)

# 6. Analyze LSH results
print(f"LSH kept {len(kept_indices)} out of {len(df)} records ({len(kept_indices)/len(df)*100:.1f}%)")

# 7. Visualize similarity network (for smaller datasets)
if len(df) < 1000:
    G = nx.Graph()
    
    # Add all nodes (documents)
    for i in range(len(df)):
        G.add_node(i)
    
    # Add edges (similarities)
    for doc_id, similar_docs in similarity_dict.items():
        for similar_id in similar_docs:
            G.add_edge(doc_id, similar_id)
    
    # Plot network
    plt.figure(figsize=(10, 8))
    pos = nx.spring_layout(G)
    nx.draw(G, pos, node_size=50, node_color='blue', alpha=0.6)
    plt.title('Document Similarity Network')
    plt.savefig('similarity_network.png')
    plt.close()
    print("\nSaved similarity network visualization to 'similarity_network.png'")

# 8. Create deduplicated dataframe
deduped_df = df.iloc[kept_indices].copy()

# 9. Update tracker with kept indices
tracker.update_from_kept_indices(kept_indices, deduped_df)

# 10. Train model on deduplicated data
print("\nTraining model on deduplicated data...")
X = deduped_df.drop(['target', 'description'], axis=1)  # Exclude text column
y = deduped_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 11. Evaluate model
y_pred = model.predict(X_test)
print("\nModel performance on deduplicated test data:")
print(classification_report(y_test, y_pred))

# 12. Make predictions and generate results dataframe
y_pred_series = pd.Series(y_pred, index=X_test.index)
results_df = pd.DataFrame({'prediction': y_pred_series, 'actual': y_test})

# 13. Map results back to original dataset with all records
full_results = tracker.create_full_result_df(
    results_df, df, fill_value={'prediction': None, 'actual': None}
)

print(f"\nMapping results:")
print(f"Original dataset size: {len(df)}")
print(f"Deduplicated dataset size: {len(deduped_df)}")
print(f"Number of records with predictions: {full_results['prediction'].notna().sum()}")

# 14. Save full dataset with deduplication information
df['is_duplicate'] = ~df.index.isin(kept_indices)
df['has_prediction'] = full_results['prediction'].notna()
df['predicted'] = full_results['prediction']
df.to_csv('deduplication_results.csv', index=False)
print("\nSaved full dataset with deduplication and prediction information to 'deduplication_results.csv'")

Duplicate Flagging for Unlabeled Data

Comprehensive workflow to identify potential duplicates without removing them, with analysis and visualization:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Required for duplicate flagging functionality 
# pip install "freamon[extended]"
from freamon.deduplication.flag_duplicates import flag_similar_records, flag_text_duplicates

# Required for PowerPoint/Excel export
# pip install "freamon[extended]"
from freamon.eda.export import export_to_excel, export_to_powerpoint

# 1. Load unlabeled dataset
unlabeled_df = pd.read_csv('unlabeled_customer_data.csv')
print(f"Dataset shape: {unlabeled_df.shape}")

# 2. Flag potential text duplicates using LSH
print("\nProcessing text duplicates...")
text_df = flag_text_duplicates(
    unlabeled_df,
    text_column='description',
    threshold=0.8,
    method='lsh',
    add_group_id=True,
    add_similarity_score=True,
    add_duplicate_flag=True
)

# 3. Analyze text duplicate results
duplicate_text_groups = text_df['duplicate_group_id'].dropna().nunique()
duplicate_text_records = text_df['is_text_duplicate'].sum()
print(f"Text duplicate analysis:")
print(f"Found {duplicate_text_groups} potential duplicate text groups")
print(f"Found {duplicate_text_records} records ({duplicate_text_records/len(text_df)*100:.1f}%) with similar text")

# 4. Flag similar records across multiple fields using weighted similarity
print("\nProcessing multi-field similarity...")
similar_df = flag_similar_records(
    text_df,  # Use the dataframe that already has text duplicate info
    columns=['name', 'address', 'phone', 'email'],
    weights={'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1},
    threshold=0.7,
    similarity_column="similarity_score",  # Column to store similarity scores
    group_column="multifield_group_id",    # Column to store group IDs
    flag_column="is_multifield_duplicate"  # Column to store duplicate flags
)

# 5. Analyze multi-field similarity results
multifield_groups = similar_df['multifield_group_id'].dropna().nunique()
multifield_duplicates = similar_df['is_multifield_duplicate'].sum()
print(f"Multi-field duplicate analysis:")
print(f"Found {multifield_groups} potential duplicate groups based on multiple fields")
print(f"Found {multifield_duplicates} records ({multifield_duplicates/len(similar_df)*100:.1f}%) with similar fields")

# 6. Create a combined duplicate flag
similar_df['is_potential_duplicate'] = similar_df['is_text_duplicate'] | similar_df['is_multifield_duplicate']
total_duplicates = similar_df['is_potential_duplicate'].sum()
print(f"\nCombined results: {total_duplicates} potential duplicates ({total_duplicates/len(similar_df)*100:.1f}%)")

# 7. Visualize similarity score distribution
plt.figure(figsize=(10, 6))
sns.histplot(similar_df['similarity_score'].dropna(), bins=20)
plt.title('Distribution of Similarity Scores')
plt.xlabel('Similarity Score')
plt.ylabel('Count')
plt.axvline(x=0.7, color='r', linestyle='--', label='Threshold (0.7)')
plt.axvline(x=0.9, color='g', linestyle='--', label='High Similarity (0.9)')
plt.legend()
plt.savefig('similarity_distribution.png')
plt.close()
print("\nSaved similarity distribution chart to 'similarity_distribution.png'")

# 8. Create a group size analysis
group_sizes = similar_df[similar_df['multifield_group_id'].notna()].groupby('multifield_group_id').size()
plt.figure(figsize=(10, 6))
sns.histplot(group_sizes, bins=10)
plt.title('Duplicate Group Size Distribution')
plt.xlabel('Group Size')
plt.ylabel('Count')
plt.savefig('group_size_distribution.png')
plt.close()
print(f"Largest duplicate group has {group_sizes.max()} records")

# 9. Add confidence level based on combined evidence
similar_df['duplicate_confidence'] = 'None'
# Both text and multifield similarity = high confidence
similar_df.loc[(similar_df['is_text_duplicate']) & 
               (similar_df['is_multifield_duplicate']), 'duplicate_confidence'] = 'High'
# Only one method but high score = medium confidence
similar_df.loc[(similar_df['is_potential_duplicate']) & 
               (similar_df['similarity_score'] > 0.9) &
               (similar_df['duplicate_confidence'] == 'None'), 'duplicate_confidence'] = 'Medium'
# Flagged but lower score = low confidence
similar_df.loc[(similar_df['is_potential_duplicate']) & 
               (similar_df['duplicate_confidence'] == 'None'), 'duplicate_confidence'] = 'Low'

confidence_counts = similar_df['duplicate_confidence'].value_counts()
print("\nDuplicate confidence levels:")
for level, count in confidence_counts.items():
    print(f"{level} confidence: {count} records")

# 10. Export high confidence duplicates for review
high_confidence = similar_df[similar_df['duplicate_confidence'] == 'High']
medium_confidence = similar_df[similar_df['duplicate_confidence'] == 'Medium']

# 11. Create summary report with examples from each confidence level
report_data = []
for group_id in high_confidence['multifield_group_id'].dropna().unique()[:5]:  # Top 5 high confidence groups
    group_records = similar_df[similar_df['multifield_group_id'] == group_id]
    report_data.append({
        'confidence': 'High',
        'group_id': group_id,
        'group_size': len(group_records),
        'similarity_score': group_records['similarity_score'].mean(),
        'sample_records': group_records.head(2).to_dict('records')
    })

# 12. Export results in different formats
# CSV exports for review
similar_df.to_csv('duplicate_analysis_complete.csv', index=False)
high_confidence.to_csv('high_confidence_duplicates.csv', index=False)
medium_confidence.to_csv('medium_confidence_duplicates.csv', index=False)

# 13. Export summary data for PowerPoint
summary_data = {
    'dataframe_size': len(similar_df),
    'duplicate_count': total_duplicates,
    'duplicate_percent': total_duplicates/len(similar_df)*100,
    'confidence_distribution': confidence_counts.to_dict(),
    'group_count': multifield_groups,
    'largest_group_size': group_sizes.max(),
    'similarity_scores': similar_df['similarity_score'].dropna().tolist(),
    'threshold': 0.7
}

# Create presentation-ready dictionary
presentation_data = {
    'metrics': {
        'dataset_size': len(similar_df),
        'duplicate_count': total_duplicates,
        'duplicate_percent': total_duplicates/len(similar_df)*100,
        'high_confidence': confidence_counts.get('High', 0),
        'medium_confidence': confidence_counts.get('Medium', 0),
        'low_confidence': confidence_counts.get('Low', 0),
    }
}

# 14. Export to PowerPoint (use the 'model' report type since it includes charts)
export_to_powerpoint(
    presentation_data, 
    'duplicate_analysis.pptx', 
    report_type='model'
)
print("\nExported reports to CSV files and PowerPoint")

print("\nDuplicate analysis complete.")

Performance Optimization for Large-Scale Deduplication

When working with large datasets, flag_similar_records offers powerful memory optimization options to balance performance and accuracy:

from freamon.deduplication.flag_duplicates import flag_similar_records

# For a dataset with 100,000+ records
result_df = flag_similar_records(
    large_df,
    columns=['name', 'address', 'phone', 'email'],
    weights={'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1},
    threshold=0.85,           # Higher threshold for precision
    chunk_size=500,           # Optimize memory usage
    max_comparisons=1000000,  # Limit total comparisons
    n_jobs=4,                 # Parallel processing
    use_polars=True           # Use Polars if available
)

Chunk Size and Accuracy Tradeoffs

The chunk_size parameter creates a fundamental tradeoff between memory efficiency and detection accuracy:

| Dataset Size | Recommended Chunk Size | Recommended max_comparisons | Impact on Accuracy |
|---|---|---|---|
| < 20,000 rows | Non-chunked (None) | Default | Highest accuracy, full comparison |
| 20,000-100,000 rows | 1000-2000 | 1,000,000-3,000,000 | Good balance of accuracy and memory usage |
| 100,000-500,000 rows | 500-1000 | 1,000,000-5,000,000 | Some potential duplicates might be missed |
| > 500,000 rows | 250-500 | 500,000-1,000,000 | Focus on highest-quality matches |

How it works:

  • Smaller chunks reduce memory usage dramatically but may miss some potential duplicates
  • The algorithm prioritizes within-chunk comparisons where duplicates are more likely
  • Connected components analysis helps capture relationships between records even across chunks (a conceptual sketch follows this list)
  • For critical applications, start with larger chunk sizes and decrease only if memory issues occur
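The sketch below illustrates that chunking strategy in miniature. It is not freamon's internal implementation: the toy records, the string-similarity function, and the comparison budget are simplified placeholders, and only the overall shape (compare within chunks, spend a limited budget across chunks, then merge groups via connected components) mirrors the description above.

import itertools
from difflib import SequenceMatcher

import networkx as nx

# Toy records standing in for rows of a DataFrame
records = ["Acme Pty Ltd", "ACME Pty. Ltd", "Globex Corp", "Globex Corporation", "Initech"]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; freamon's weighted multi-field scoring is richer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

chunk_size = 2
threshold = 0.7
max_cross_comparisons = 10  # analogous in spirit to max_comparisons

# Split record positions into chunks
chunks = [list(range(i, min(i + chunk_size, len(records))))
          for i in range(0, len(records), chunk_size)]

G = nx.Graph()
G.add_nodes_from(range(len(records)))

# Within-chunk comparisons are always performed
for chunk in chunks:
    for i, j in itertools.combinations(chunk, 2):
        if similarity(records[i], records[j]) >= threshold:
            G.add_edge(i, j)

# Spend a limited budget on cross-chunk comparisons
budget = max_cross_comparisons
for i, j in itertools.combinations(range(len(records)), 2):
    if budget <= 0:
        break
    budget -= 1
    if similarity(records[i], records[j]) >= threshold:
        G.add_edge(i, j)

# Connected components stitch groups together, even when the links cross chunks
for group_id, component in enumerate(nx.connected_components(G)):
    if len(component) > 1:
        print(f"Group {group_id}: {[records[i] for i in sorted(component)]}")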

For extremely large datasets, consider increasing the similarity threshold to focus on higher-quality matches:

# For very large datasets (500k+ records)
result_df = flag_similar_records(
    very_large_df,
    columns=columns,
    weights=weights,
    threshold=0.9,            # Higher threshold
    chunk_size=250,           # Very small chunks
    max_comparisons=500000,   # Limited comparisons
    n_jobs=8                  # More parallel workers
)

Advanced EDA and Feature Selection

Perform advanced multivariate analysis and feature selection:

from freamon.eda.advanced_multivariate import visualize_pca, analyze_target_relationships
from freamon.features.categorical_selection import chi2_selection, anova_f_selection

# PCA visualization with target coloring
fig, pca_results = visualize_pca(df, target_column='target')

# Target-oriented feature analysis
figures, target_results = analyze_target_relationships(df, target_column='target')

# Select important categorical features
selected_features, scores = chi2_selection(df, target='target', k=5, return_scores=True)
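
The example above imports anova_f_selection but only exercises chi2_selection. A parallel call for numeric features might look like the following; the signature here is an assumption based on chi2_selection, so check the library's documentation before relying on it.

# Assumed to mirror chi2_selection's signature (not verified against the library)
selected_numeric, f_scores = anova_f_selection(df, target='target', k=5, return_scores=True)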

See Advanced EDA documentation for more details.

EDA Module

The EDA module provides comprehensive data analysis:

from freamon.eda import EDAAnalyzer

analyzer = EDAAnalyzer(df, target_column='target')
analyzer.run_full_analysis()

# Generate different types of reports
analyzer.generate_report('report.html')  # HTML report
analyzer.generate_report('report.md', format='markdown')  # Markdown report
analyzer.generate_report('report.md', format='markdown', convert_to_html=True)  # Both formats

# For Jupyter notebooks, display interactive report
analyzer.display_eda_report()  # Interactive display in notebook

Documentation

For more detailed information, refer to the examples directory in the repository.

License

MIT License
