A comprehensive Python-based data processing system for analyzing Mongolian customs and trade data, with AI-powered brand name extraction capabilities using the DeepSeek API.
- Overview
- Features
- Project Structure
- Installation
- Configuration
- Usage
- Data Processing Workflows
- AI Brand Extraction
- Cleaning & Resetting
- Contributing
- License
ShareWallet is a data analysis toolkit designed to process and analyze large-scale Mongolian trade and customs data. The project focuses on:
- Trade Data Analysis: Processing 800,000+ trade transaction records
- AI-Powered Brand Extraction: Automated brand name identification from product descriptions
- Weekly Data Organization: Structured processing of weekly trade data (W14-W27)
- Multi-format Support: Handles CSV, XLSX, XLSB, and PBIX files
- Cost-Optimized AI Processing: Smart caching and pattern matching to minimize API costs
- Total Records: 837,664 trade transactions (April-June 2025)
- Data Period: 2021-2025
- Weekly Files: 14 weeks of complete trade data (W14-W27)
- Total Data Size: 800+ MB of processed trade data
- DeepSeek API integration for intelligent brand name prediction
- Multi-language support (English, Cyrillic, Mongolian)
- Cost-optimized processing with caching and pattern matching
- Batch processing capabilities for large datasets
- Excel/CSV file conversion and manipulation
- XLSB to XLSX conversion
- Random sampling for testing
- Data validation and quality checks
- Weekly trade data organization
- Product categorization
- Company and manufacturer analysis
- Financial data aggregation
- Temporal pattern analysis
- Executive summary generation
- Detailed analysis reports
- Cost optimization reports
- Processing progress tracking
```
sharewallet-main/
├── complete_weekly_data/                  # Weekly trade data files (W14-W27)
│   ├── complete_trade_data_2025_W14.csv
│   ├── complete_trade_data_2025_W15.csv
│   └── ... (14 weekly files)
│
├── Core Processing Scripts
│   ├── ai_brand_extractor.py              # Main AI brand extraction engine
│   ├── cost_optimized_extractor.py        # Cost-efficient brand extraction
│   ├── quality_brand_extractor.py         # High-quality extraction mode
│   ├── smart_brand_finder.py              # Pattern-based brand detection
│   └── product_data_processor.py          # General data processing
│
├── Batch Processing Scripts
│   ├── full_smart_brand_processing.py     # Process all weekly data
│   ├── process_all_weekly_trade_brands.py # Weekly trade brand filling
│   ├── ultra_fast_smart_brands.py         # Parallel processing mode
│   └── ultra_fast_trade_brands.py         # Ultra-fast batch processing
│
├── Data Conversion & Analysis
│   ├── convert_xlsb_to_xlsx.py            # XLSB file conversion
│   ├── convert_csv_to_excel.py            # CSV to Excel conversion
│   ├── analyze_product_names_all_weeks.py # Product name analysis
│   ├── detailed_data_analysis.py          # Comprehensive data analysis
│   └── examine_data.py                    # Data exploration
│
├── Testing & Validation
│   ├── test_brand_filling.py              # Test brand extraction
│   ├── test_ai_brands.py                  # AI extraction testing
│   ├── validate_results.py                # Result validation
│   └── check_test_results.py              # Test result verification
│
├── Utility Scripts
│   ├── setup_brand_filling.py             # Setup and configuration
│   ├── extract_random_sample.py           # Random sampling
│   ├── split_csv_by_weeks.py              # Weekly data splitting
│   ├── organize_weekly_data.py            # Data organization
│   └── recover_all_records.py             # Data recovery
│
├── Documentation
│   ├── README.md                          # This file
│   ├── BRAND_FILLING_AUTOMATION_GUIDE.md  # Brand filling guide
│   ├── EXECUTIVE_SUMMARY_REPORT.md        # Executive summary
│   ├── cost_optimization_report.md        # Cost analysis
│   └── ai_brand_extraction_report.md      # AI extraction report
│
└── Data Files
    ├── 2023.xlsx                          # Historical data (86.7 MB)
    ├── processed_import_data.xlsx         # Processed import data (15.6 MB)
    ├── ________.xlsb                      # Raw data file (17.7 MB)
    └── Monos Food 2021-2023.pbix          # Power BI report (26.5 MB)
```
- Python: 3.12 or higher (tested with Python 3.13.7)
- Operating System: Windows, macOS, or Linux
- Memory: Minimum 4GB RAM (8GB+ recommended for large datasets)
- Storage: At least 2GB free space
```bash
# Install required packages
pip install pandas openpyxl xlrd requests python-dotenv

# Optional packages for advanced features
pip install pyxlsb  # For XLSB file support
pip install tqdm    # For progress bars
```

To get a DeepSeek API key:

- Visit the DeepSeek Platform
- Sign up or log in to your account
- Navigate to the API section
- Generate a new API key (starts with `sk-...`)
- Set the environment variable:

Windows (PowerShell):

```powershell
$env:DEEPSEEK_API_KEY = "your-api-key-here"
```

macOS/Linux:

```bash
export DEEPSEEK_API_KEY='your-api-key-here'
```

Or create a `.env` file:

```
DEEPSEEK_API_KEY=your-api-key-here
```
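Once the key is set, a script can resolve it at startup and fail fast when it is missing. A minimal sketch (the function name and error message are illustrative, not taken from the actual scripts):

```python
import os

def get_api_key(env=os.environ):
    """Resolve the DeepSeek API key from the environment."""
    key = env.get("DEEPSEEK_API_KEY")
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set; see Configuration.")
    return key
```

Failing early with a clear message avoids burning half a batch run before discovering the key was never exported.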
The project uses DeepSeek API for brand extraction. Configure your API keys in the scripts or environment variables:
```python
import os

# Primary API key (from environment or hardcoded)
primary_key = os.getenv('DEEPSEEK_API_KEY') or 'sk-your-primary-key'

# Backup API key (optional)
backup_key = 'sk-your-backup-key'
```

Key parameters you can adjust in the scripts:
- Batch Size: Number of records to process at once (default: 100-500)
- Rate Limiting: Delay between API calls (default: 0.1-2 seconds)
- Sample Size: Number of records for testing (default: 50-500)
- Max Tokens: Maximum tokens per API request (default: 15-50)
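A rough sketch of how batch size and rate limiting interact (the function and constant names are illustrative, not the scripts' actual internals):

```python
import time

BATCH_SIZE = 100   # records per batch (scripts default to 100-500)
RATE_DELAY = 0.1   # seconds between API calls (scripts default to 0.1-2)

def process_in_batches(records, handle_batch):
    """Send records in fixed-size batches, pausing between calls."""
    results = []
    for i in range(0, len(records), BATCH_SIZE):
        results.extend(handle_batch(records[i:i + BATCH_SIZE]))
        time.sleep(RATE_DELAY)  # stay under the API rate limit
    return results
```

Larger batches mean fewer API round trips; longer delays trade throughput for safety against rate-limit errors.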
- Test the setup:

```bash
python setup_brand_filling.py
```

- Test brand extraction (5 records):

```bash
python test_brand_filling.py
```

- Process a sample file:

```bash
python fill_brand_with_deepseek.py
```

```bash
# Process a specific file with AI brand extraction
python ai_brand_extractor.py

# Process all weekly trade data files
python process_all_weekly_trade_brands.py

# Use parallel processing for maximum speed
python ultra_fast_smart_brands.py

# Minimize API costs with smart caching
python cost_optimized_extractor.py

# Analyze product names across all weeks
python analyze_product_names_all_weeks.py

# Detailed trade data analysis
python detailed_data_analysis.py
```

1. Extract random sample → extract_random_sample.py
2. Test brand filling → test_brand_filling.py
3. Fill brands with AI → fill_brand_with_deepseek.py
4. Validate results → validate_results.py
5. Apply to full dataset → process_all_weekly_trade_brands.py
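The sample-first step can be approximated with pandas; this is a sketch of the idea, and the actual behavior of extract_random_sample.py may differ:

```python
import pandas as pd

def extract_sample(df, n=100, seed=42):
    """Reproducible random sample for cheap end-to-end testing."""
    return df.sample(n=min(n, len(df)), random_state=seed)
```

Fixing the seed makes the test run repeatable, so brand-filling results can be compared across code changes.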
1. Organize weekly data → organize_weekly_data.py
2. Split by weeks → split_csv_by_weeks.py
3. Process each week → full_smart_brand_processing.py
4. Generate reports → detailed_data_analysis.py
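The splitting step can be sketched with pandas ISO-week numbering; the `date` column name is an assumption, and split_csv_by_weeks.py may work differently:

```python
import pandas as pd

def split_by_week(df, date_col="date", year=2025):
    """Partition rows by ISO week number of the (assumed) date column."""
    weeks = pd.to_datetime(df[date_col]).dt.isocalendar().week.astype(int)
    return {f"{year}_W{w}": part for w, part in df.groupby(weeks)}
```

Each partition can then be written out in the `complete_trade_data_2025_W14.csv` naming pattern used by `complete_weekly_data/`.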
1. Convert XLSB to XLSX → convert_xlsb_to_xlsx.py
2. Convert to CSV → convert_excel_to_csv.py
3. Process and analyze → analyze_excel.py
The AI brand extraction system uses DeepSeek's language model to intelligently identify brand names from product descriptions:
- Input Data: Product name, code, description, manufacturer
- Pattern Matching: First attempts rule-based extraction (free)
- AI Processing: Uses DeepSeek API for complex cases
- Caching: Stores results to avoid duplicate API calls
- Validation: Checks and cleans extracted brand names
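The tiered flow above can be sketched roughly as follows; the seed brand list, the in-memory cache, and the `call_api` hook are illustrative stand-ins, not the actual internals of `ai_brand_extractor.py`:

```python
import re

KNOWN_BRANDS = {"samsung", "lg", "nestle"}  # illustrative seed list
cache = {}  # product name -> brand; persisted to disk in the real scripts

def rule_based_brand(product_name):
    """Free tier: look for a known brand token (Latin or Cyrillic)."""
    for token in re.findall(r"[A-Za-z\u0400-\u04FF-]+", product_name):
        if token.lower() in KNOWN_BRANDS:
            return token
    return None

def extract_brand(product_name, call_api):
    """Rules first, then cache, then the paid API call."""
    brand = rule_based_brand(product_name)
    if brand:
        return brand
    if product_name in cache:
        return cache[product_name]
    brand = call_api(product_name)  # only complex cases reach the API
    cache[product_name] = brand
    return brand
```

Because the cache is consulted before the API, reprocessing a file with repeated product names costs at most one call per distinct name.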
- Rule-based extraction: ~60% of brands extracted for free
- Smart caching: Avoids duplicate API calls
- Shorter prompts: Reduced token usage (70% fewer tokens)
- Batch processing: Efficient API usage
Estimated Costs:
- 500 records: ~$0.11 USD
- 5,000 records: ~$1.10 USD
- 50,000 records: ~$11.00 USD
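These figures scale linearly from the 500-record estimate, which can be expressed as a quick estimator (a sketch only; real costs vary with token counts and cache hit rate):

```python
COST_PER_500 = 0.11  # USD per 500 records, from the estimates above

def estimated_cost(n_records):
    """Linear extrapolation of DeepSeek API cost in USD."""
    return round(n_records * COST_PER_500 / 500, 2)
```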
```python
from ai_brand_extractor import AIBrandExtractor

# Initialize extractor
extractor = AIBrandExtractor(api_key="your-api-key")

# Extract brand from product description
brand = extractor.extract_brand(
    product_name="Samsung Galaxy S21",
    product_code="SM-G991B",
    description="Smartphone 5G 128GB",
    manufacturer="Samsung Electronics"
)
# Returns: "Samsung"
```

```powershell
# Windows PowerShell
Remove-Item -Recurse -Force __pycache__
```

```bash
# macOS/Linux
find . -type d -name "__pycache__" -exec rm -rf {} +
```

```powershell
# Remove temporary Excel files
Remove-Item ~$*.xlsx, ~$*.xlsb

# Remove test output files
Remove-Item *_test_*.xlsx, *_test_*.csv
```

```powershell
# Keep only source code and documentation
# Remove all data files (WARNING: This deletes data!)

# Windows PowerShell
Remove-Item *.xlsx, *.xlsb, *.csv, *.pbix -Exclude "README.md", "*.py", "*.md"
```

```bash
# macOS/Linux
find . -type f \( -name "*.xlsx" -o -name "*.xlsb" -o -name "*.csv" -o -name "*.pbix" \) -delete
```

This project primarily uses file-based data storage. To reset:

- Delete all CSV files in `complete_weekly_data/`
- Delete processed Excel files
- Re-run data recovery:

```bash
python recover_all_records.py
```
Contributions are welcome! Please follow these guidelines:

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes
- Test thoroughly: run the test scripts to ensure nothing breaks
- Commit your changes: `git commit -m "Add your feature"`
- Push to the branch: `git push origin feature/your-feature-name`
- Create a Pull Request
- Follow PEP 8 guidelines for Python code
- Use meaningful variable and function names
- Add docstrings to functions and classes
- Comment complex logic
Before submitting a PR:
- Test with sample data (50-100 records)
- Verify API costs are reasonable
- Check for data integrity
- Validate output files
This project is proprietary and confidential. All rights reserved.
Copyright © 2025 ShareWallet Project
Unauthorized copying, distribution, or use of this software is strictly prohibited.
For questions, issues, or support:
- GitHub Issues: Report an issue
- Documentation: see `BRAND_FILLING_AUTOMATION_GUIDE.md` for detailed guides
Last Updated: January 2025
Version: 1.0.0
Maintained by: Boldbat