ShareWallet - Mongolian Trade Data Analysis & Brand Extraction

A comprehensive Python-based data processing system for analyzing Mongolian customs and trade data, with AI-powered brand name extraction capabilities using the DeepSeek API.

🎯 Overview

ShareWallet is a data analysis toolkit designed to process and analyze large-scale Mongolian trade and customs data. The project focuses on:

  • Trade Data Analysis: Processing 800,000+ trade transaction records
  • AI-Powered Brand Extraction: Automated brand name identification from product descriptions
  • Weekly Data Organization: Structured processing of weekly trade data (W14-W27)
  • Multi-format Support: Handles CSV, XLSX, XLSB, and PBIX files
  • Cost-Optimized AI Processing: Smart caching and pattern matching to minimize API costs

Key Statistics

  • Total Records: 837,664 trade transactions (April-June 2025)
  • Data Period: 2021-2025
  • Weekly Files: 14 weeks of complete trade data (W14-W27)
  • Total Data Size: ~800+ MB of processed trade data

✨ Features

1. AI Brand Extraction

  • DeepSeek API integration for intelligent brand name prediction
  • Multi-language support (English, Cyrillic, Mongolian)
  • Cost-optimized processing with caching and pattern matching
  • Batch processing capabilities for large datasets

2. Data Processing

  • Excel/CSV file conversion and manipulation
  • XLSB to XLSX conversion
  • Random sampling for testing
  • Data validation and quality checks

3. Trade Data Analysis

  • Weekly trade data organization
  • Product categorization
  • Company and manufacturer analysis
  • Financial data aggregation
  • Temporal pattern analysis

4. Reporting

  • Executive summary generation
  • Detailed analysis reports
  • Cost optimization reports
  • Processing progress tracking

📁 Project Structure

sharewallet-main/
├── complete_weekly_data/          # Weekly trade data files (W14-W27)
│   ├── complete_trade_data_2025_W14.csv
│   ├── complete_trade_data_2025_W15.csv
│   └── ... (14 weekly files)
│
├── Core Processing Scripts
│   ├── ai_brand_extractor.py              # Main AI brand extraction engine
│   ├── cost_optimized_extractor.py        # Cost-efficient brand extraction
│   ├── quality_brand_extractor.py         # High-quality extraction mode
│   ├── smart_brand_finder.py              # Pattern-based brand detection
│   └── product_data_processor.py          # General data processing
│
├── Batch Processing Scripts
│   ├── full_smart_brand_processing.py     # Process all weekly data
│   ├── process_all_weekly_trade_brands.py # Weekly trade brand filling
│   ├── ultra_fast_smart_brands.py         # Parallel processing mode
│   └── ultra_fast_trade_brands.py         # Ultra-fast batch processing
│
├── Data Conversion & Analysis
│   ├── convert_xlsb_to_xlsx.py            # XLSB file conversion
│   ├── convert_csv_to_excel.py            # CSV to Excel conversion
│   ├── analyze_product_names_all_weeks.py # Product name analysis
│   ├── detailed_data_analysis.py          # Comprehensive data analysis
│   └── examine_data.py                    # Data exploration
│
├── Testing & Validation
│   ├── test_brand_filling.py              # Test brand extraction
│   ├── test_ai_brands.py                  # AI extraction testing
│   ├── validate_results.py                # Result validation
│   └── check_test_results.py              # Test result verification
│
├── Utility Scripts
│   ├── setup_brand_filling.py             # Setup and configuration
│   ├── extract_random_sample.py           # Random sampling
│   ├── split_csv_by_weeks.py              # Weekly data splitting
│   ├── organize_weekly_data.py            # Data organization
│   └── recover_all_records.py             # Data recovery
│
├── Documentation
│   ├── README.md                          # This file
│   ├── BRAND_FILLING_AUTOMATION_GUIDE.md  # Brand filling guide
│   ├── EXECUTIVE_SUMMARY_REPORT.md        # Executive summary
│   ├── cost_optimization_report.md        # Cost analysis
│   └── ai_brand_extraction_report.md      # AI extraction report
│
└── Data Files
    ├── 2023.xlsx                          # Historical data (86.7 MB)
    ├── processed_import_data.xlsx         # Processed import data (15.6 MB)
    ├── ________.xlsb                      # Raw data file (17.7 MB)
    └── Monos Food 2021-2023.pbix          # Power BI report (26.5 MB)

🚀 Installation

Prerequisites

  • Python: 3.12 or higher (tested with Python 3.13.7)
  • Operating System: Windows, macOS, or Linux
  • Memory: Minimum 4GB RAM (8GB+ recommended for large datasets)
  • Storage: At least 2GB free space

Required Python Packages

# Install required packages
pip install pandas openpyxl xlrd requests python-dotenv

# Optional packages for advanced features
pip install pyxlsb  # For XLSB file support
pip install tqdm    # For progress bars

DeepSeek API Setup

  1. Visit DeepSeek Platform
  2. Sign up or log in to your account
  3. Navigate to the API section
  4. Generate a new API key (starts with sk-...)
  5. Set the environment variable:

Windows (PowerShell):

$env:DEEPSEEK_API_KEY = "your-api-key-here"

macOS/Linux:

export DEEPSEEK_API_KEY='your-api-key-here'

Or create a .env file:

DEEPSEEK_API_KEY=your-api-key-here
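
Whichever method you use, a script can validate the key at startup before making any API calls. A minimal stdlib-only sketch (the `load_deepseek_key` helper is illustrative, not part of the repo):

```python
import os

def load_deepseek_key(env_var: str = "DEEPSEEK_API_KEY") -> str:
    """Read the API key from the environment and sanity-check its shape."""
    # If you keep the key in a .env file, call dotenv.load_dotenv()
    # (from the python-dotenv package) before this function runs.
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; see the setup steps above")
    if not key.startswith("sk-"):
        raise RuntimeError(f"{env_var} does not look like a DeepSeek key (expected 'sk-...')")
    return key
```

Failing fast here avoids burning a whole batch run on a misconfigured key.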

⚙️ Configuration

API Configuration

The project uses DeepSeek API for brand extraction. Configure your API keys in the scripts or environment variables:

# Primary API key (from environment or hardcoded)
primary_key = os.getenv('DEEPSEEK_API_KEY') or 'sk-your-primary-key'

# Backup API key (optional)
backup_key = 'sk-your-backup-key'

Processing Parameters

Key parameters you can adjust in the scripts:

  • Batch Size: Number of records to process at once (default: 100-500)
  • Rate Limiting: Delay between API calls (default: 0.1-2 seconds)
  • Sample Size: Number of records for testing (default: 50-500)
  • Max Tokens: Maximum tokens per API request (default: 15-50)
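
These knobs could be grouped in a single config object rather than edited per script. A hypothetical sketch (`ProcessingConfig` is not part of the repo; the defaults below are simply midpoints of the documented ranges):

```python
from dataclasses import dataclass

@dataclass
class ProcessingConfig:
    batch_size: int = 200      # records per batch (documented range: 100-500)
    rate_limit_s: float = 0.5  # delay between API calls (0.1-2 seconds)
    sample_size: int = 100     # records used for test runs (50-500)
    max_tokens: int = 30       # max tokens per API request (15-50)
```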

📖 Usage

Quick Start

  1. Test the setup:

python setup_brand_filling.py

  2. Test brand extraction (5 records):

python test_brand_filling.py

  3. Process a sample file:

python fill_brand_with_deepseek.py

Common Workflows

1. Extract Brands from a CSV File

# Process a specific file with AI brand extraction
python ai_brand_extractor.py

2. Process All Weekly Data

# Process all weekly trade data files
python process_all_weekly_trade_brands.py

3. Ultra-Fast Batch Processing

# Use parallel processing for maximum speed
python ultra_fast_smart_brands.py

4. Cost-Optimized Processing

# Minimize API costs with smart caching
python cost_optimized_extractor.py

5. Data Analysis

# Analyze product names across all weeks
python analyze_product_names_all_weeks.py

# Detailed trade data analysis
python detailed_data_analysis.py

🔄 Data Processing Workflows

Workflow 1: Brand Filling Pipeline

1. Extract random sample → extract_random_sample.py
2. Test brand filling → test_brand_filling.py
3. Fill brands with AI → fill_brand_with_deepseek.py
4. Validate results → validate_results.py
5. Apply to full dataset → process_all_weekly_trade_brands.py
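
Assuming each script reads the previous step's output files from disk, the five steps can be chained with a small runner (`run_pipeline` is a hypothetical helper, not part of the repo):

```python
import subprocess
import sys

PIPELINE = [
    "extract_random_sample.py",
    "test_brand_filling.py",
    "fill_brand_with_deepseek.py",
    "validate_results.py",
    "process_all_weekly_trade_brands.py",
]

def run_pipeline(scripts=PIPELINE) -> int:
    """Run each stage in order; check=True stops on the first failure."""
    for script in scripts:
        subprocess.run([sys.executable, script], check=True)
    return len(scripts)
```

Stopping on the first failure matters here: filling brands into the full dataset before validation passes would waste API spend.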

Workflow 2: Weekly Data Processing

1. Organize weekly data → organize_weekly_data.py
2. Split by weeks → split_csv_by_weeks.py
3. Process each week → full_smart_brand_processing.py
4. Generate reports → detailed_data_analysis.py

Workflow 3: Data Conversion

1. Convert XLSB to XLSX → convert_xlsb_to_xlsx.py
2. Convert to CSV → convert_excel_to_csv.py
3. Process and analyze → analyze_excel.py

🤖 AI Brand Extraction

How It Works

The AI brand extraction system uses DeepSeek's language model to intelligently identify brand names from product descriptions:

  1. Input Data: Product name, code, description, manufacturer
  2. Pattern Matching: First attempts rule-based extraction (free)
  3. AI Processing: Uses DeepSeek API for complex cases
  4. Caching: Stores results to avoid duplicate API calls
  5. Validation: Checks and cleans extracted brand names

Cost Optimization

  • Rule-based extraction: ~60% of brands extracted for free
  • Smart caching: Avoids duplicate API calls
  • Shorter prompts: Reduced token usage (70% fewer tokens)
  • Batch processing: Efficient API usage

Estimated Costs:

  • 500 records: ~$0.11 USD
  • 5,000 records: ~$1.10 USD
  • 50,000 records: ~$11.00 USD
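
The figures above imply a roughly linear rate of about $0.00022 per record ($0.11 / 500). A quick estimator under that assumption (actual costs depend on prompt length and cache hit rate):

```python
# Assumed linear rate derived from the documented estimates, not an API quote.
COST_PER_RECORD_USD = 0.11 / 500

def estimate_cost(n_records: int) -> float:
    """Rough projected API cost in USD for a batch of n_records."""
    return round(n_records * COST_PER_RECORD_USD, 2)
```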

Example Usage

from ai_brand_extractor import AIBrandExtractor

# Initialize extractor
extractor = AIBrandExtractor(api_key="your-api-key")

# Extract brand from product description
brand = extractor.extract_brand(
    product_name="Samsung Galaxy S21",
    product_code="SM-G991B",
    description="Smartphone 5G 128GB",
    manufacturer="Samsung Electronics"
)
# Returns: "Samsung"

🧹 Cleaning & Resetting

Clear Python Cache

# Windows PowerShell
Remove-Item -Recurse -Force __pycache__

# macOS/Linux
find . -type d -name "__pycache__" -exec rm -rf {} +

Remove Temporary Files

# Remove temporary Excel files
Remove-Item '~$*.xlsx', '~$*.xlsb'

# Remove test output files
Remove-Item *_test_*.xlsx, *_test_*.csv

Reset to Clean State

# Keep only source code and documentation
# Remove all data files (WARNING: This deletes data!)

# Windows PowerShell
Remove-Item *.xlsx, *.xlsb, *.csv, *.pbix

# macOS/Linux
find . -type f \( -name "*.xlsx" -o -name "*.xlsb" -o -name "*.csv" -o -name "*.pbix" \) -delete

Database Reset (if applicable)

This project primarily uses file-based data storage. To reset:

  1. Delete all CSV files in complete_weekly_data/
  2. Delete processed Excel files
  3. Re-run data recovery: python recover_all_records.py

🤝 Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Make your changes
  4. Test thoroughly: Run test scripts to ensure nothing breaks
  5. Commit your changes: git commit -m "Add your feature"
  6. Push to the branch: git push origin feature/your-feature-name
  7. Create a Pull Request

Code Style

  • Follow PEP 8 guidelines for Python code
  • Use meaningful variable and function names
  • Add docstrings to functions and classes
  • Comment complex logic

Testing

Before submitting a PR:

  • Test with sample data (50-100 records)
  • Verify API costs are reasonable
  • Check for data integrity
  • Validate output files

📄 License

This project is proprietary and confidential. All rights reserved.

Copyright © 2025 ShareWallet Project

Unauthorized copying, distribution, or use of this software is strictly prohibited.


📞 Support & Contact

For questions, issues, or support:

  • GitHub Issues: Report an issue
  • Documentation: See BRAND_FILLING_AUTOMATION_GUIDE.md for detailed guides

Last Updated: January 2025
Version: 1.0.0
Maintained by: Boldbat
