A comprehensive Python-based data processing system for analyzing Mongolian customs and trade data, with AI-powered brand name extraction capabilities using the DeepSeek API.
- Overview
- Features
- Project Structure
- Installation
- Configuration
- Usage
- Data Processing Workflows
- AI Brand Extraction
- Cleaning & Resetting
- Contributing
- License
ShareWallet is a data analysis toolkit designed to process and analyze large-scale Mongolian trade and customs data. The project focuses on:
- Trade Data Analysis: Processing 800,000+ trade transaction records
- AI-Powered Brand Extraction: Automated brand name identification from product descriptions
- Weekly Data Organization: Structured processing of weekly trade data (W14-W27)
- Multi-format Support: Handles CSV, XLSX, XLSB, and PBIX files
- Cost-Optimized AI Processing: Smart caching and pattern matching to minimize API costs
- Total Records: 837,664 trade transactions (April-June 2025)
- Data Period: 2021-2025
- Weekly Files: 14 weeks of complete trade data (W14-W27)
- Total Data Size: 800+ MB of processed trade data
- DeepSeek API integration for intelligent brand name prediction
- Multi-language support (English, Cyrillic, Mongolian)
- Cost-optimized processing with caching and pattern matching
- Batch processing capabilities for large datasets
- Excel/CSV file conversion and manipulation
- XLSB to XLSX conversion
- Random sampling for testing
- Data validation and quality checks
- Weekly trade data organization
- Product categorization
- Company and manufacturer analysis
- Financial data aggregation
- Temporal pattern analysis
- Executive summary generation
- Detailed analysis reports
- Cost optimization reports
- Processing progress tracking
```
sharewallet-main/
├── complete_weekly_data/                  # Weekly trade data files (W14-W27)
│   ├── complete_trade_data_2025_W14.csv
│   ├── complete_trade_data_2025_W15.csv
│   └── ... (14 weekly files)
│
├── Core Processing Scripts
│   ├── ai_brand_extractor.py              # Main AI brand extraction engine
│   ├── cost_optimized_extractor.py        # Cost-efficient brand extraction
│   ├── quality_brand_extractor.py         # High-quality extraction mode
│   ├── smart_brand_finder.py              # Pattern-based brand detection
│   └── product_data_processor.py          # General data processing
│
├── Batch Processing Scripts
│   ├── full_smart_brand_processing.py     # Process all weekly data
│   ├── process_all_weekly_trade_brands.py # Weekly trade brand filling
│   ├── ultra_fast_smart_brands.py         # Parallel processing mode
│   └── ultra_fast_trade_brands.py         # Ultra-fast batch processing
│
├── Data Conversion & Analysis
│   ├── convert_xlsb_to_xlsx.py            # XLSB file conversion
│   ├── convert_csv_to_excel.py            # CSV to Excel conversion
│   ├── analyze_product_names_all_weeks.py # Product name analysis
│   ├── detailed_data_analysis.py          # Comprehensive data analysis
│   └── examine_data.py                    # Data exploration
│
├── Testing & Validation
│   ├── test_brand_filling.py              # Test brand extraction
│   ├── test_ai_brands.py                  # AI extraction testing
│   ├── validate_results.py                # Result validation
│   └── check_test_results.py              # Test result verification
│
├── Utility Scripts
│   ├── setup_brand_filling.py             # Setup and configuration
│   ├── extract_random_sample.py           # Random sampling
│   ├── split_csv_by_weeks.py              # Weekly data splitting
│   ├── organize_weekly_data.py            # Data organization
│   └── recover_all_records.py             # Data recovery
│
├── Documentation
│   ├── README.md                          # This file
│   ├── BRAND_FILLING_AUTOMATION_GUIDE.md  # Brand filling guide
│   ├── EXECUTIVE_SUMMARY_REPORT.md        # Executive summary
│   ├── cost_optimization_report.md        # Cost analysis
│   └── ai_brand_extraction_report.md      # AI extraction report
│
└── Data Files
    ├── 2023.xlsx                          # Historical data (86.7 MB)
    ├── processed_import_data.xlsx         # Processed import data (15.6 MB)
    ├── ________.xlsb                      # Raw data file (17.7 MB)
    └── Monos Food 2021-2023.pbix          # Power BI report (26.5 MB)
```
- Python: 3.12 or higher (tested with Python 3.13.7)
- Operating System: Windows, macOS, or Linux
- Memory: Minimum 4GB RAM (8GB+ recommended for large datasets)
- Storage: At least 2GB free space
```bash
# Install required packages
pip install pandas openpyxl xlrd requests python-dotenv

# Optional packages for advanced features
pip install pyxlsb  # For XLSB file support
pip install tqdm    # For progress bars
```

To get a DeepSeek API key:

- Visit the DeepSeek Platform
- Sign up or log in to your account
- Navigate to the API section
- Generate a new API key (starts with `sk-...`)
- Set the environment variable:

Windows (PowerShell):

```powershell
$env:DEEPSEEK_API_KEY = "your-api-key-here"
```

macOS/Linux:

```bash
export DEEPSEEK_API_KEY='your-api-key-here'
```

Or create a `.env` file:

```
DEEPSEEK_API_KEY=your-api-key-here
```
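Once the key is set, a script can resolve it at startup and fail fast when it is missing. A minimal sketch (the function name and error message are illustrative, not taken from the actual scripts):

```python
import os

def get_api_key(env=os.environ):
    """Resolve the DeepSeek API key from the environment."""
    key = env.get("DEEPSEEK_API_KEY")
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set; see Configuration.")
    return key
```

Failing early with a clear message avoids burning half a batch run before discovering the key was never exported.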
The project uses DeepSeek API for brand extraction. Configure your API keys in the scripts or environment variables:
```python
import os

# Primary API key (from environment or hardcoded)
primary_key = os.getenv('DEEPSEEK_API_KEY') or 'sk-your-primary-key'

# Backup API key (optional)
backup_key = 'sk-your-backup-key'
```

Key parameters you can adjust in the scripts:
- Batch Size: Number of records to process at once (default: 100-500)
- Rate Limiting: Delay between API calls (default: 0.1-2 seconds)
- Sample Size: Number of records for testing (default: 50-500)
- Max Tokens: Maximum tokens per API request (default: 15-50)
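A rough sketch of how batch size and rate limiting interact (the function and constant names are illustrative, not the scripts' actual internals):

```python
import time

BATCH_SIZE = 100   # records per batch (scripts default to 100-500)
RATE_DELAY = 0.1   # seconds between API calls (scripts default to 0.1-2)

def process_in_batches(records, handle_batch):
    """Send records in fixed-size batches, pausing between calls."""
    results = []
    for i in range(0, len(records), BATCH_SIZE):
        results.extend(handle_batch(records[i:i + BATCH_SIZE]))
        time.sleep(RATE_DELAY)  # stay under the API rate limit
    return results
```

Larger batches mean fewer API round trips; longer delays trade throughput for safety against rate-limit errors.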
- Test the setup:

```bash
python setup_brand_filling.py
```

- Test brand extraction (5 records):

```bash
python test_brand_filling.py
```

- Process a sample file:

```bash
python fill_brand_with_deepseek.py
```

```bash
# Process a specific file with AI brand extraction
python ai_brand_extractor.py

# Process all weekly trade data files
python process_all_weekly_trade_brands.py

# Use parallel processing for maximum speed
python ultra_fast_smart_brands.py

# Minimize API costs with smart caching
python cost_optimized_extractor.py

# Analyze product names across all weeks
python analyze_product_names_all_weeks.py

# Detailed trade data analysis
python detailed_data_analysis.py
```

1. Extract random sample → extract_random_sample.py
2. Test brand filling → test_brand_filling.py
3. Fill brands with AI → fill_brand_with_deepseek.py
4. Validate results → validate_results.py
5. Apply to full dataset → process_all_weekly_trade_brands.py
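The sample-first step can be approximated with pandas; this is a sketch of the idea, and the actual behavior of extract_random_sample.py may differ:

```python
import pandas as pd

def extract_sample(df, n=100, seed=42):
    """Reproducible random sample for cheap end-to-end testing."""
    return df.sample(n=min(n, len(df)), random_state=seed)
```

Fixing the seed makes the test run repeatable, so brand-filling results can be compared across code changes.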
1. Organize weekly data → organize_weekly_data.py
2. Split by weeks → split_csv_by_weeks.py
3. Process each week → full_smart_brand_processing.py
4. Generate reports → detailed_data_analysis.py
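The splitting step can be sketched with pandas ISO-week numbering; the `date` column name is an assumption, and split_csv_by_weeks.py may work differently:

```python
import pandas as pd

def split_by_week(df, date_col="date", year=2025):
    """Partition rows by ISO week number of the (assumed) date column."""
    weeks = pd.to_datetime(df[date_col]).dt.isocalendar().week.astype(int)
    return {f"{year}_W{w}": part for w, part in df.groupby(weeks)}
```

Each partition can then be written out in the `complete_trade_data_2025_W14.csv` naming pattern used by `complete_weekly_data/`.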
1. Convert XLSB to XLSX → convert_xlsb_to_xlsx.py
2. Convert to CSV → convert_excel_to_csv.py
3. Process and analyze → analyze_excel.py
The AI brand extraction system uses DeepSeek's language model to intelligently identify brand names from product descriptions:
- Input Data: Product name, code, description, manufacturer
- Pattern Matching: First attempts rule-based extraction (free)
- AI Processing: Uses DeepSeek API for complex cases
- Caching: Stores results to avoid duplicate API calls
- Validation: Checks and cleans extracted brand names
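The tiered flow above can be sketched roughly as follows; the seed brand list, the in-memory cache, and the `call_api` hook are illustrative stand-ins, not the actual internals of `ai_brand_extractor.py`:

```python
import re

KNOWN_BRANDS = {"samsung", "lg", "nestle"}  # illustrative seed list
cache = {}  # product name -> brand; persisted to disk in the real scripts

def rule_based_brand(product_name):
    """Free tier: look for a known brand token (Latin or Cyrillic)."""
    for token in re.findall(r"[A-Za-z\u0400-\u04FF-]+", product_name):
        if token.lower() in KNOWN_BRANDS:
            return token
    return None

def extract_brand(product_name, call_api):
    """Rules first, then cache, then the paid API call."""
    brand = rule_based_brand(product_name)
    if brand:
        return brand
    if product_name in cache:
        return cache[product_name]
    brand = call_api(product_name)  # only complex cases reach the API
    cache[product_name] = brand
    return brand
```

Because the cache is consulted before the API, reprocessing a file with repeated product names costs at most one call per distinct name.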
- Rule-based extraction: ~60% of brands extracted for free
- Smart caching: Avoids duplicate API calls
- Shorter prompts: Reduced token usage (70% fewer tokens)
- Batch processing: Efficient API usage
Estimated Costs:
- 500 records: ~$0.11 USD
- 5,000 records: ~$1.10 USD
- 50,000 records: ~$11.00 USD
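These figures scale linearly from the 500-record estimate, which can be expressed as a quick estimator (a sketch only; real costs vary with token counts and cache hit rate):

```python
COST_PER_500 = 0.11  # USD per 500 records, from the estimates above

def estimated_cost(n_records):
    """Linear extrapolation of DeepSeek API cost in USD."""
    return round(n_records * COST_PER_500 / 500, 2)
```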
```python
from ai_brand_extractor import AIBrandExtractor

# Initialize extractor
extractor = AIBrandExtractor(api_key="your-api-key")

# Extract brand from product description
brand = extractor.extract_brand(
    product_name="Samsung Galaxy S21",
    product_code="SM-G991B",
    description="Smartphone 5G 128GB",
    manufacturer="Samsung Electronics"
)
# Returns: "Samsung"
```

```powershell
# Windows PowerShell
Remove-Item -Recurse -Force __pycache__
```

```bash
# macOS/Linux
find . -type d -name "__pycache__" -exec rm -rf {} +
```

```powershell
# Remove temporary Excel files
Remove-Item ~$*.xlsx, ~$*.xlsb

# Remove test output files
Remove-Item *_test_*.xlsx, *_test_*.csv
```

```powershell
# Keep only source code and documentation
# Remove all data files (WARNING: This deletes data!)

# Windows PowerShell
Remove-Item *.xlsx, *.xlsb, *.csv, *.pbix -Exclude "README.md", "*.py", "*.md"
```

```bash
# macOS/Linux
find . -type f \( -name "*.xlsx" -o -name "*.xlsb" -o -name "*.csv" -o -name "*.pbix" \) -delete
```

This project primarily uses file-based data storage. To reset:

- Delete all CSV files in `complete_weekly_data/`
- Delete processed Excel files
- Re-run data recovery:

```bash
python recover_all_records.py
```
Contributions are welcome! Please follow these guidelines:

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes
- Test thoroughly: run the test scripts to ensure nothing breaks
- Commit your changes: `git commit -m "Add your feature"`
- Push to the branch: `git push origin feature/your-feature-name`
- Create a Pull Request
- Follow PEP 8 guidelines for Python code
- Use meaningful variable and function names
- Add docstrings to functions and classes
- Comment complex logic
Before submitting a PR:
- Test with sample data (50-100 records)
- Verify API costs are reasonable
- Check for data integrity
- Validate output files
This project is proprietary and confidential. All rights reserved.
Copyright © 2025 ShareWallet Project
Unauthorized copying, distribution, or use of this software is strictly prohibited.
For questions, issues, or support:
- GitHub Issues: Report an issue
- Documentation: see `BRAND_FILLING_AUTOMATION_GUIDE.md` for detailed guides
Last Updated: January 2025
Version: 1.0.0
Maintained by: Boldbat