Freamon is a comprehensive Python toolkit for exploratory data analysis, feature engineering, and model development with a focus on practical data science workflows.
Quick Start | Documentation | Installation | Examples
- Exploratory Data Analysis: Automatic EDA with comprehensive reporting in HTML, Markdown, Excel, PowerPoint, and interactive Jupyter notebook displays
- Advanced Multivariate Analysis: PCA visualization, correlation networks, and target-oriented analysis
- Feature Engineering: Advanced feature engineering for numeric, categorical, and text data
- Feature Selection: Statistical feature selection including Chi-square, ANOVA F-test, and effect size analysis
- Deduplication: High-performance deduplication with Polars optimization (2-5x faster, 60-70% less memory), LSH, supervised ML, and active learning
- Topic Modeling: Optimized text analysis with NMF and LDA, supporting large datasets up to 100K documents
- Automated Modeling: Intelligent end-to-end modeling workflow for text, tabular, and time series data
- Modeling: Custom model implementations with feature importance and model interpretation
- Pipeline: Scikit-learn compatible pipeline with additional features
- Drift Analysis: Tools for detecting and analyzing data drift
- Word Embeddings: Integration with various word embedding techniques
- Visualization: Publication-quality visualizations with proper handling of all special characters
- Performance Optimization: Multiprocessing support and intelligent sampling for large dataset analysis
For basic functionality (EDA, visualization, core deduplication):
pip install freamon
For full functionality including advanced modeling, text processing, and performance optimizations:
pip install "freamon[all]"
For specific feature sets:
# For high-performance with Polars acceleration
pip install "freamon[performance]"
# For text analysis and topic modeling
pip install "freamon[topic_modeling]"
# For word embeddings support
pip install "freamon[word_embeddings]"
# For extended features (modeling, Polars, LightGBM, SHAP, etc.)
pip install "freamon[extended]"
# For Markdown report generation
pip install "freamon[markdown_reports]"
Here's what each optional dependency provides:
- Core (always installed): numpy, pandas, scikit-learn, matplotlib, seaborn, networkx
- Performance [freamon[performance]]:
  - pyarrow - For faster data processing
- Extended [freamon[extended]]:
  - polars - High-performance DataFrame library (2-5x faster than pandas)
  - lightgbm - Gradient boosting framework
  - optuna - Hyperparameter optimization
  - shap - Model explanations
  - spacy - NLP processing
  - statsmodels - Statistical modeling
  - dask - Parallel computing
- Topic Modeling [freamon[topic_modeling]]:
  - gensim - Topic modeling
  - pyldavis - Topic visualization
  - wordcloud - Word cloud generation
- Word Embeddings [freamon[word_embeddings]]:
  - gensim - Word vectors
  - nltk - Natural language toolkit
  - spacy - Linguistic features
- Markdown Reports [freamon[markdown_reports]]:
  - markdown - Report generation
from freamon.eda import EDAAnalyzer
# Create an analyzer instance
analyzer = EDAAnalyzer(df, target_column='target')
# Run the analysis
analyzer.run_full_analysis()
# Generate a report
analyzer.generate_report('eda_report.html')
# Or a markdown report for version control
analyzer.generate_report('eda_report.md', format='markdown')
Below is a complete workflow showing data type detection, EDA analysis, modeling, and reporting:
import pandas as pd
import numpy as np
from freamon.eda import EDAAnalyzer
from freamon.utils.datatype_detector import detect_datatypes
from freamon import auto_model
# Required for PowerPoint/Excel reports
# pip install "freamon[extended]"
from freamon.eda.export import export_to_powerpoint, export_to_excel
# 1. Load sample data
df = pd.read_csv('customer_data.csv')
print(f"Dataset shape: {df.shape}")
# 2. Run data type detection
datatype_results = detect_datatypes(df)
print("\nDetected data types:")
print(f"Text columns: {datatype_results['text_columns']}")
print(f"Categorical columns: {datatype_results['categorical_columns']}")
print(f"Numeric columns: {datatype_results['numeric_columns']}")
print(f"Date columns: {datatype_results['date_columns']}")
# 3. Generate data type detection report
from freamon.utils.datatype_fixes import save_detection_report
save_detection_report(
datatype_results,
'datatype_detection_report.html',
title='Customer Data Type Detection'
)
print("\nData type detection report saved to 'datatype_detection_report.html'")
# 4. Run EDA analysis
analyzer = EDAAnalyzer(
df,
target_column='churn', # For supervised analysis
text_columns=datatype_results['text_columns'],
categorical_columns=datatype_results['categorical_columns'],
numeric_columns=datatype_results['numeric_columns'],
datetime_columns=datatype_results['date_columns']
)
analyzer.run_full_analysis()
# 5. Generate EDA reports in different formats
analyzer.generate_report('eda_report.html') # HTML report
analyzer.generate_report('eda_report.md', format='markdown') # Markdown report
print("\nEDA reports generated in HTML and Markdown formats")
# 6. Export EDA results to PowerPoint for presentations
export_to_powerpoint(
analyzer.get_report_data(),
'eda_presentation.pptx',
report_type='eda'
)
print("\nEDA results exported to PowerPoint")
# 7. Run automated modeling with data type detection
# Note: Install required dependencies for advanced modeling:
# pip install "freamon[extended,topic_modeling]"
results = auto_model(
df=df,
target_column='churn',
problem_type='classification',
# Use our detected data types
text_columns=datatype_results['text_columns'],
categorical_columns=datatype_results['categorical_columns'],
date_column=datatype_results['date_columns'][0] if datatype_results['date_columns'] else None
)
# 8. Examine model results
print("\nModel Performance:")
for metric, value in results['metrics'].items():
if 'mean' in metric:
print(f"{metric}: {value:.4f}")
# 9. Plot model visualizations
fig1 = results['autoflow'].plot_metrics()
fig1.savefig('cv_metrics.png')
fig2 = results['autoflow'].plot_importance(top_n=15)
fig2.savefig('feature_importance.png')
# 10. Export model results to Excel
model_data = {
'model_type': results['autoflow'].model_type,
'metrics': results['metrics'],
'feature_importance': results['feature_importance'],
'training_date': pd.Timestamp.now().strftime('%Y-%m-%d')
}
export_to_excel(model_data, 'model_performance.xlsx', report_type='model')
# 11. Export model results to PowerPoint
export_to_powerpoint(model_data, 'model_presentation.pptx', report_type='model')
print("\nModel results exported to Excel and PowerPoint")
# 12. Make predictions on new data
new_data = pd.read_csv('new_customers.csv')
predictions = results['autoflow'].predict(new_data)
new_data['predicted_churn'] = predictions
new_data.to_csv('predictions.csv', index=False)
print("\nPredictions saved to 'predictions.csv'")
Complete step-by-step process for deduplication, including analysis, visualization, and modeling:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from freamon.deduplication.exact_deduplication import hash_deduplication
from freamon.deduplication.lsh_deduplication import lsh_deduplication
from freamon.data_quality.duplicates import detect_duplicates, get_duplicate_groups
from examples.deduplication_tracking_example import IndexTracker
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# For network visualization (optional)
# pip install networkx
import networkx as nx
# 1. Load sample data with text and duplicates
df = pd.read_csv('data_with_duplicates.csv')
print(f"Original dataset shape: {df.shape}")
# 2. Analyze duplicates using built-in detection
duplicate_stats = detect_duplicates(df)
print(f"\nDuplicate analysis:")
print(f"Exact duplicates: {duplicate_stats['duplicate_count']} records")
print(f"Duplicate percentage: {duplicate_stats['duplicate_percent']:.2f}%")
# 3. Get duplicate groups for examination
duplicate_groups = get_duplicate_groups(df)
print(f"\nFound {len(duplicate_groups)} duplicate groups")
for i, group in enumerate(duplicate_groups[:3]): # Show first 3 groups
print(f"\nDuplicate group {i+1}:")
print(df.iloc[group].head(1)) # Show one example from each group
# 4. Initialize index tracker to maintain mapping
tracker = IndexTracker().initialize_from_df(df)
# 5. Find duplicates using LSH (locality-sensitive hashing) for text similarity
print("\nRunning LSH deduplication...")
kept_indices, similarity_dict = lsh_deduplication(
df['description'],
threshold=0.8,
num_bands=20,
preprocess=True,
return_similarity_dict=True
)
# 6. Analyze LSH results
print(f"LSH kept {len(kept_indices)} out of {len(df)} records ({len(kept_indices)/len(df)*100:.1f}%)")
# 7. Visualize similarity network (for smaller datasets)
if len(df) < 1000:
G = nx.Graph()
# Add all nodes (documents)
for i in range(len(df)):
G.add_node(i)
# Add edges (similarities)
for doc_id, similar_docs in similarity_dict.items():
for similar_id in similar_docs:
G.add_edge(doc_id, similar_id)
# Plot network
plt.figure(figsize=(10, 8))
pos = nx.spring_layout(G)
nx.draw(G, pos, node_size=50, node_color='blue', alpha=0.6)
plt.title('Document Similarity Network')
plt.savefig('similarity_network.png')
plt.close()
print("\nSaved similarity network visualization to 'similarity_network.png'")
# 8. Create deduplicated dataframe
deduped_df = df.iloc[kept_indices].copy()
# 9. Update tracker with kept indices
tracker.update_from_kept_indices(kept_indices, deduped_df)
# 10. Train model on deduplicated data
print("\nTraining model on deduplicated data...")
X = deduped_df.drop(['target', 'description'], axis=1)  # Exclude target and text columns
y = deduped_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 11. Evaluate model
y_pred = model.predict(X_test)
print("\nModel performance on deduplicated test data:")
print(classification_report(y_test, y_pred))
# 12. Make predictions and generate results dataframe
y_pred_series = pd.Series(y_pred, index=X_test.index)
results_df = pd.DataFrame({'prediction': y_pred_series, 'actual': y_test})
# 13. Map results back to original dataset with all records
full_results = tracker.create_full_result_df(
results_df, df, fill_value={'prediction': None, 'actual': None}
)
print(f"\nMapping results:")
print(f"Original dataset size: {len(df)}")
print(f"Deduplicated dataset size: {len(deduped_df)}")
print(f"Number of records with predictions: {full_results['prediction'].notna().sum()}")
# 14. Save full dataset with deduplication information
df['is_duplicate'] = ~df.index.isin(kept_indices)
df['has_prediction'] = full_results['prediction'].notna()
df['predicted'] = full_results['prediction']
df.to_csv('deduplication_results.csv', index=False)
print("\nSaved full dataset with deduplication and prediction information to 'deduplication_results.csv'")
Comprehensive workflow to identify potential duplicates without removing them, with analysis and visualization:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Required for duplicate flagging functionality
# pip install "freamon[extended]"
from freamon.deduplication.flag_duplicates import flag_similar_records, flag_text_duplicates
# Required for PowerPoint/Excel export
# pip install "freamon[extended]"
from freamon.eda.export import export_to_excel, export_to_powerpoint
# 1. Load unlabeled dataset
unlabeled_df = pd.read_csv('unlabeled_customer_data.csv')
print(f"Dataset shape: {unlabeled_df.shape}")
# 2. Flag potential text duplicates using LSH
print("\nProcessing text duplicates...")
text_df = flag_text_duplicates(
unlabeled_df,
text_column='description',
threshold=0.8,
method='lsh',
add_group_id=True,
add_similarity_score=True,
add_duplicate_flag=True
)
# 3. Analyze text duplicate results
duplicate_text_groups = text_df['duplicate_group_id'].dropna().nunique()
duplicate_text_records = text_df['is_text_duplicate'].sum()
print(f"Text duplicate analysis:")
print(f"Found {duplicate_text_groups} potential duplicate text groups")
print(f"Found {duplicate_text_records} records ({duplicate_text_records/len(text_df)*100:.1f}%) with similar text")
# 4. Flag similar records across multiple fields using weighted similarity
print("\nProcessing multi-field similarity...")
similar_df = flag_similar_records(
text_df, # Use the dataframe that already has text duplicate info
columns=['name', 'address', 'phone', 'email'],
weights={'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1},
threshold=0.7,
similarity_column="similarity_score", # Column to store similarity scores
group_column="multifield_group_id", # Column to store group IDs
flag_column="is_multifield_duplicate" # Column to store duplicate flags
)
# 5. Analyze multi-field similarity results
multifield_groups = similar_df['multifield_group_id'].dropna().nunique()
multifield_duplicates = similar_df['is_multifield_duplicate'].sum()
print(f"Multi-field duplicate analysis:")
print(f"Found {multifield_groups} potential duplicate groups based on multiple fields")
print(f"Found {multifield_duplicates} records ({multifield_duplicates/len(similar_df)*100:.1f}%) with similar fields")
# 6. Create a combined duplicate flag
similar_df['is_potential_duplicate'] = similar_df['is_text_duplicate'] | similar_df['is_multifield_duplicate']
total_duplicates = similar_df['is_potential_duplicate'].sum()
print(f"\nCombined results: {total_duplicates} potential duplicates ({total_duplicates/len(similar_df)*100:.1f}%)")
# 7. Visualize similarity score distribution
plt.figure(figsize=(10, 6))
sns.histplot(similar_df['similarity_score'].dropna(), bins=20)
plt.title('Distribution of Similarity Scores')
plt.xlabel('Similarity Score')
plt.ylabel('Count')
plt.axvline(x=0.7, color='r', linestyle='--', label='Threshold (0.7)')
plt.axvline(x=0.9, color='g', linestyle='--', label='High Similarity (0.9)')
plt.legend()
plt.savefig('similarity_distribution.png')
plt.close()
print("\nSaved similarity distribution chart to 'similarity_distribution.png'")
# 8. Create a group size analysis
group_sizes = similar_df[similar_df['multifield_group_id'].notna()].groupby('multifield_group_id').size()
plt.figure(figsize=(10, 6))
sns.histplot(group_sizes, bins=10)
plt.title('Duplicate Group Size Distribution')
plt.xlabel('Group Size')
plt.ylabel('Count')
plt.savefig('group_size_distribution.png')
plt.close()
print(f"Largest duplicate group has {group_sizes.max()} records")
# 9. Add confidence level based on combined evidence
similar_df['duplicate_confidence'] = 'None'
# Both text and multifield similarity = high confidence
similar_df.loc[(similar_df['is_text_duplicate']) &
(similar_df['is_multifield_duplicate']), 'duplicate_confidence'] = 'High'
# Only one method but high score = medium confidence
similar_df.loc[(similar_df['is_potential_duplicate']) &
(similar_df['similarity_score'] > 0.9) &
(similar_df['duplicate_confidence'] == 'None'), 'duplicate_confidence'] = 'Medium'
# Flagged but lower score = low confidence
similar_df.loc[(similar_df['is_potential_duplicate']) &
(similar_df['duplicate_confidence'] == 'None'), 'duplicate_confidence'] = 'Low'
confidence_counts = similar_df['duplicate_confidence'].value_counts()
print("\nDuplicate confidence levels:")
for level, count in confidence_counts.items():
print(f"{level} confidence: {count} records")
# 10. Export high confidence duplicates for review
high_confidence = similar_df[similar_df['duplicate_confidence'] == 'High']
medium_confidence = similar_df[similar_df['duplicate_confidence'] == 'Medium']
# 11. Create summary report with examples from each confidence level
report_data = []
for group_id in high_confidence['multifield_group_id'].dropna().unique()[:5]: # Top 5 high confidence groups
group_records = similar_df[similar_df['multifield_group_id'] == group_id]
report_data.append({
'confidence': 'High',
'group_id': group_id,
'group_size': len(group_records),
'similarity_score': group_records['similarity_score'].mean(),
'sample_records': group_records.head(2).to_dict('records')
})
# 12. Export results in different formats
# CSV exports for detailed review
similar_df.to_csv('duplicate_analysis_complete.csv', index=False)
high_confidence.to_csv('high_confidence_duplicates.csv', index=False)
medium_confidence.to_csv('medium_confidence_duplicates.csv', index=False)
# 13. Export summary data for PowerPoint
summary_data = {
'dataframe_size': len(similar_df),
'duplicate_count': total_duplicates,
'duplicate_percent': total_duplicates/len(similar_df)*100,
'confidence_distribution': confidence_counts.to_dict(),
'group_count': multifield_groups,
'largest_group_size': group_sizes.max(),
'similarity_scores': similar_df['similarity_score'].dropna().tolist(),
'threshold': 0.7
}
# Create presentation-ready dictionary
presentation_data = {
'metrics': {
'dataset_size': len(similar_df),
'duplicate_count': total_duplicates,
'duplicate_percent': total_duplicates/len(similar_df)*100,
'high_confidence': confidence_counts.get('High', 0),
'medium_confidence': confidence_counts.get('Medium', 0),
'low_confidence': confidence_counts.get('Low', 0),
}
}
# 14. Export to PowerPoint (use the 'model' report type since it includes charts)
export_to_powerpoint(
presentation_data,
'duplicate_analysis.pptx',
report_type='model'
)
print("\nExported reports to CSV files and PowerPoint")
print("\nDuplicate analysis complete.")
When working with large datasets, flag_similar_records offers powerful memory optimization options to balance performance and accuracy:
from freamon.deduplication.flag_duplicates import flag_similar_records
# For a dataset with 100,000+ records
result_df = flag_similar_records(
large_df,
columns=['name', 'address', 'phone', 'email'],
weights={'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1},
threshold=0.85, # Higher threshold for precision
chunk_size=500, # Optimize memory usage
max_comparisons=1000000, # Limit total comparisons
n_jobs=4, # Parallel processing
use_polars=True # Use Polars if available
)
The chunk_size parameter creates a fundamental tradeoff between memory efficiency and detection accuracy; the quick calculation after the table gives a sense of scale:
| Dataset Size | Recommended Chunk Size | Recommended max_comparisons | Impact on Accuracy |
|---|---|---|---|
| < 20,000 rows | Non-chunked (None) | Default | Highest accuracy, full comparison |
| 20,000-100,000 rows | 1000-2000 | 1,000,000-3,000,000 | Good balance of accuracy and memory usage |
| 100,000-500,000 rows | 500-1000 | 1,000,000-5,000,000 | Some potential duplicates might be missed |
| > 500,000 rows | 250-500 | 500,000-1,000,000 | Focus on highest-quality matches |
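To get a rough sense of scale, here is an illustrative back-of-the-envelope count of pairwise comparisons with and without chunking. The helper functions below are only for illustration and are not part of Freamon; max_comparisons then caps whatever remains.

```python
# Back-of-the-envelope comparison counts (illustration only, not a benchmark).
def full_pairs(n: int) -> int:
    """All-pairs comparisons for n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def chunked_pairs(n: int, chunk_size: int) -> int:
    """Within-chunk comparisons when n records are split into equal-size chunks."""
    full_chunks, remainder = divmod(n, chunk_size)
    return full_chunks * full_pairs(chunk_size) + full_pairs(remainder)

for n, chunk_size in [(20_000, None), (100_000, 1_000), (500_000, 500)]:
    pairs = full_pairs(n) if chunk_size is None else chunked_pairs(n, chunk_size)
    print(f"{n:>9,} rows, chunk_size={chunk_size}: ~{pairs:,} comparisons")
    # max_comparisons caps this number further (e.g. at 1,000,000).
```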
How it works:
- Smaller chunks reduce memory usage dramatically but may miss some potential duplicates
- The algorithm prioritizes within-chunk comparisons where duplicates are more likely
- Connected components analysis helps capture relationships between records even across chunks (see the sketch after this list)
- For critical applications, start with larger chunk sizes and decrease only if memory issues occur
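For intuition, below is a minimal, self-contained sketch of this chunking-plus-connected-components idea. It is not Freamon's implementation of flag_similar_records: the SequenceMatcher similarity, the random cross-chunk sampling, and the function name are stand-ins chosen so the sketch stays runnable with only core dependencies.

```python
# Sketch only -- NOT Freamon's implementation of flag_similar_records.
# Within-chunk pairs are compared exhaustively, a bounded budget of random
# cross-chunk pairs is sampled, and connected components merge the matches
# into duplicate groups that can span chunk boundaries.
import random
from difflib import SequenceMatcher
from itertools import combinations

import networkx as nx
import pandas as pd


def chunked_duplicate_groups(texts: pd.Series, chunk_size: int = 500,
                             threshold: float = 0.85,
                             cross_chunk_budget: int = 10_000,
                             seed: int = 0) -> dict:
    """Return {row_label: group_id} for rows that match at least one other row."""
    rng = random.Random(seed)
    labels = list(texts.index)
    graph = nx.Graph()
    graph.add_nodes_from(labels)

    def is_similar(a, b) -> bool:
        return SequenceMatcher(None, str(texts.loc[a]), str(texts.loc[b])).ratio() >= threshold

    # 1. Exhaustive comparisons within each chunk (memory stays bounded by chunk_size).
    for start in range(0, len(labels), chunk_size):
        for a, b in combinations(labels[start:start + chunk_size], 2):
            if is_similar(a, b):
                graph.add_edge(a, b)

    # 2. A limited number of random cross-chunk comparisons (analogous to max_comparisons).
    if len(labels) >= 2:
        for _ in range(cross_chunk_budget):
            a, b = rng.sample(labels, 2)
            if is_similar(a, b):
                graph.add_edge(a, b)

    # 3. Connected components turn pairwise matches into duplicate groups.
    return {label: gid
            for gid, component in enumerate(nx.connected_components(graph))
            if len(component) > 1
            for label in component}


# Usage sketch:
# groups = chunked_duplicate_groups(df['description'], chunk_size=500, threshold=0.85)
# df['dup_group'] = df.index.map(groups)
```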
For extremely large datasets, consider increasing the similarity threshold to focus on higher-quality matches:
# For very large datasets (500k+ records)
result_df = flag_similar_records(
very_large_df,
columns=columns,
weights=weights,
threshold=0.9, # Higher threshold
chunk_size=250, # Very small chunks
max_comparisons=500000, # Limited comparisons
n_jobs=8 # More parallel workers
)
Perform advanced multivariate analysis and feature selection:
from freamon.eda.advanced_multivariate import visualize_pca, analyze_target_relationships
from freamon.features.categorical_selection import chi2_selection, anova_f_selection
# PCA visualization with target coloring
fig, pca_results = visualize_pca(df, target_column='target')
# Target-oriented feature analysis
figures, target_results = analyze_target_relationships(df, target_column='target')
# Select important categorical features
selected_features, scores = chi2_selection(df, target='target', k=5, return_scores=True)
See Advanced EDA documentation for more details.
The EDA module provides comprehensive data analysis:
from freamon.eda import EDAAnalyzer
analyzer = EDAAnalyzer(df, target_column='target')
analyzer.run_full_analysis()
# Generate different types of reports
analyzer.generate_report('report.html') # HTML report
analyzer.generate_report('report.md', format='markdown') # Markdown report
analyzer.generate_report('report.md', format='markdown', convert_to_html=True) # Both formats
# For Jupyter notebooks, display interactive report
analyzer.display_eda_report() # Interactive display in notebook
For more detailed information, refer to the examples directory and the following resources: