A modern React application that automatically classifies PDF documents as Invoice, Receipt, or Delivery Order using intelligent text analysis and keyword matching.
Visit the live application: [Invoice Classifier](https://onsenix12.github.io/dtx-assignment1)
- Drag & Drop Interface: Intuitive file upload with drag-and-drop support
- AI-Powered Classification: Uses Hugging Face BART model for zero-shot classification
- Hybrid AI Approach: Combines AI results (70%) with keyword analysis (30%)
- Visual Results: Progress bars and confidence scores for each document type
- PDF Text Extraction: Extracts text from PDF documents using PDF.js
- Enhanced Security: PDF parsing happens entirely client-side; only a short extracted text snippet is sent to the AI API when an API key is configured
- Accessibility: Full keyboard navigation and screen reader support
- Responsive Design: Works seamlessly on desktop and mobile devices
- Real-time Processing: Instant classification with detailed processing pipeline
- Debug Mode: Advanced debugging information for AI analysis
The classifier can identify three types of business documents:
- Invoice
  - Keywords: "invoice", "bill", "amount due", "payment terms", "net 30", "invoice number", etc.
  - Common business billing documents
- Receipt
  - Keywords: "receipt", "thank you", "purchase", "transaction", "cash", "credit card", etc.
  - Purchase confirmations and payment receipts
- Delivery Order
  - Keywords: "delivery", "order", "shipment", "tracking", "shipping", "carrier", etc.
  - Shipping and delivery documentation
- Frontend: React 19.1.1 with Hooks optimization
- PDF Processing: PDF.js 5.4.149 with enhanced error handling
- AI Classification: Hugging Face BART model (zero-shot classification)
- AI Fallback: Local pattern analysis with weighted features
- Performance: useCallback, useMemo for optimized rendering
- Security: Client-side processing with file validation
- Accessibility: ARIA labels, keyboard navigation, screen reader support
- Styling: Custom CSS with modern gradients and animations
- Build Tool: Create React App
- Deployment: GitHub Pages
- Node.js (version 14 or higher)
- npm or yarn package manager
- (Optional) Hugging Face API key for enhanced AI classification
- Clone the repository

      git clone https://github.com/onsenix12/dtx-assignment1.git
      cd dtx-assignment1

- Install dependencies

      npm install

- Configure AI API (Optional but Recommended)

  For enhanced AI classification using Hugging Face models:

  a. Get a free API key from Hugging Face

  b. Create a `.env` file in the project root:

      # Create .env file
      echo "REACT_APP_HF_API_KEY=your_huggingface_api_key_here" > .env

  c. Replace `your_huggingface_api_key_here` with your actual API key (starts with `hf_`)

- Start the development server

      npm start

- Open your browser

  Navigate to http://localhost:3000
The application uses a hybrid AI approach:
- With API Key: Uses Hugging Face's BART model for zero-shot classification
- Without API Key: Uses sophisticated local AI pattern analysis
- Always: Combines AI results with keyword analysis for best accuracy
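The weighting described in this README (70% AI, 30% keywords) can be sketched as below. This is a minimal illustration, not the app's actual code: the function name `combineScores`, the input shape (each label mapped to a score between 0 and 1), and the even-split fallback are all assumptions.

```javascript
// Weights taken from the README's stated 70% AI / 30% keyword split.
const AI_WEIGHT = 0.7;
const KEYWORD_WEIGHT = 0.3;
const LABELS = ["Invoice", "Receipt", "Delivery Order"];

// Hypothetical helper: both inputs map each label to a score in [0, 1].
function combineScores(aiScores, keywordScores) {
  const blended = {};
  for (const label of LABELS) {
    const ai = aiScores[label] ?? 0;
    const kw = keywordScores[label] ?? 0;
    blended[label] = AI_WEIGHT * ai + KEYWORD_WEIGHT * kw;
  }
  const total = Object.values(blended).reduce((sum, value) => sum + value, 0);
  // Convert to percentages; use an even split when nothing matched (assumption).
  const results = LABELS.map((label) => ({
    label,
    confidence: total > 0 ? (blended[label] / total) * 100 : 100 / LABELS.length,
  }));
  // Highest confidence first, so results[0] is the "Best Match".
  results.sort((a, b) => b.confidence - a.confidence);
  return results;
}
```

Calling `combineScores(aiScores, keywordScores)` returns an array sorted by confidence, which maps directly onto the progress bars and the "Best Match" badge.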
- Upload a Document
  - Drag and drop a PDF file onto the upload area, or
  - Click "Choose File" to browse and select a PDF
- Classify the Document
  - Click the "Classify Document" button
  - The app will extract text from the PDF and analyze it
- View Results
  - See classification results with confidence scores
  - The best match is highlighted with a "Best Match" badge
  - Results are displayed as progress bars with percentages
- Upload Another Document
  - Click "Upload New Document" to start over
| Command | Description |
|---|---|
| `npm start` | Runs the app in development mode |
| `npm test` | Launches the test runner |
| `npm run build` | Builds the app for production |
| `npm run eject` | Ejects from Create React App (one-way operation) |
    dtx-assignment1/
    ├── public/
    │   ├── pdf.worker.min.js   # PDF.js worker for text extraction
    │   └── index.html
    ├── src/
    │   ├── App.js              # Main application component
    │   ├── App.css             # Application styles
    │   └── index.js            # Application entry point
    ├── package.json
    └── README.md
- Uses PDF.js to extract text content from PDF documents
- Processes all pages of multi-page documents
- Handles text-based PDFs (not scanned images)
- Enhanced Security: File size validation (10MB limit) and type checking
- Error Handling: Graceful fallback for corrupted or image-based PDFs
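The validation and extraction steps above could look roughly like the following sketch. The function names are hypothetical, the 10MB limit is assumed to mean 10 × 1024 × 1024 bytes, and `pdfjsLib` is passed in as a parameter so the example makes no assumption about how the app imports PDF.js. The calls used (`getDocument(...).promise`, `numPages`, `getPage`, `getTextContent`) are part of the public PDF.js API.

```javascript
const MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024; // 10MB limit from the README (byte count is an assumption)

// Hypothetical validator: accepts any object with `type` and `size`, such as a browser File.
function validatePdfFile(file) {
  if (!file) return { ok: false, error: "No file selected." };
  if (file.type !== "application/pdf") {
    return { ok: false, error: "Only PDF files are supported." };
  }
  if (file.size > MAX_FILE_SIZE_BYTES) {
    return { ok: false, error: "File exceeds the 10MB limit." };
  }
  return { ok: true, error: null };
}

// Hypothetical extractor: reads every page and joins the text items.
// `data` is the PDF content as an ArrayBuffer or Uint8Array.
async function extractPdfText(pdfjsLib, data) {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    pages.push(content.items.map((item) => item.str ?? "").join(" "));
  }
  const text = pages.join("\n").trim();
  // Scanned or image-only PDFs yield no text items, so report that clearly.
  if (!text) {
    throw new Error("No extractable text found (the PDF may be a scanned image).");
  }
  return text;
}
```

In the browser, the `data` argument would typically come from `await file.arrayBuffer()` after `validatePdfFile(file)` reports `ok: true`.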
- AI Analysis:
  - With API: Uses Hugging Face BART model for zero-shot classification
  - Without API: Uses sophisticated local AI pattern analysis with weighted features
  - Timeout Protection: 30-second timeout to prevent hanging requests
- Keyword Matching: Analyzes extracted text against predefined keyword lists
- Hybrid Approach: Combines AI results (70%) with keyword analysis (30%)
- Confidence Scoring: Converts AI and keyword scores to percentage confidence
- Fallback System: Uses business indicators when no clear patterns are found
- Performance: Optimized with React hooks for better rendering performance
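One way to implement the 30-second timeout and the fallback to local analysis is with `AbortController`, as sketched below. The helper names `withTimeout` and `classifyWithFallback` are hypothetical, and the sketch assumes the request function honors the abort signal (as `fetch` does when given `{ signal }`).

```javascript
const AI_TIMEOUT_MS = 30000; // 30-second limit stated in the README

// Hypothetical wrapper: `startRequest` receives an AbortSignal and returns a promise,
// for example (signal) => fetch(url, { ...options, signal }).
// The timeout only takes effect if the request rejects when the signal aborts.
async function withTimeout(startRequest, timeoutMs = AI_TIMEOUT_MS) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await startRequest(controller.signal);
  } finally {
    clearTimeout(timer); // always clear so no stray timer outlives the request
  }
}

// Fall back to local analysis when the remote call fails or times out.
async function classifyWithFallback(remoteClassify, localClassify, timeoutMs = AI_TIMEOUT_MS) {
  try {
    return await withTimeout(remoteClassify, timeoutMs);
  } catch (error) {
    return localClassify();
  }
}
```

Keeping the fallback in a separate function means the same timeout wrapper can be reused for any other network call.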
The system uses comprehensive keyword lists for each document type:
- Invoice: 20+ keywords including billing terms, payment conditions
- Receipt: 15+ keywords including transaction terms, payment methods
- Delivery Order: 15+ keywords including shipping terms, tracking info
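As an illustration of keyword matching, the sketch below scores a text by the fraction of each type's keywords it contains. The keyword lists are abbreviated to the examples quoted in this README (the app's full lists are longer), and the function name `scoreKeywords` and the scoring formula are assumptions rather than the app's actual logic.

```javascript
// Abbreviated keyword lists, limited to the examples quoted in this README.
const KEYWORDS = {
  Invoice: ["invoice", "bill", "amount due", "payment terms", "net 30", "invoice number"],
  Receipt: ["receipt", "thank you", "purchase", "transaction", "cash", "credit card"],
  "Delivery Order": ["delivery", "order", "shipment", "tracking", "shipping", "carrier"],
};

// Hypothetical scorer: fraction of each label's keywords found in the text (0 to 1).
function scoreKeywords(text) {
  const haystack = text.toLowerCase();
  const scores = {};
  for (const [label, keywords] of Object.entries(KEYWORDS)) {
    const hits = keywords.filter((keyword) => haystack.includes(keyword)).length;
    scores[label] = hits / keywords.length;
  }
  return scores;
}
```

Simple substring matching is case-insensitive here but does not respect word boundaries, so "order" would also match inside "border"; a stricter version could use word-boundary regular expressions.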
- Modern UI: Clean, professional interface with gradient backgrounds
- Interactive Elements: Hover effects, smooth transitions, and animations
- Visual Feedback: Progress bars, badges, and color-coded results
- Responsive Layout: Optimized for all screen sizes
- Enhanced Accessibility:
  - ARIA labels and roles for screen readers
  - Full keyboard navigation support
  - Progress bar accessibility with proper values
  - Alert regions for dynamic content updates
- Debug Information: Toggle-able detailed AI processing information
- Status Indicators: Real-time AI processing status with visual feedback
The application is automatically deployed to GitHub Pages:
- Production URL: https://onsenix12.github.io/dtx-assignment1
- Build Command: `npm run build`
- Deploy Branch: `gh-pages`
- PDF Text Only: Works best with PDFs containing extractable text
- No OCR Support: Scanned images within PDFs may not be processed
- File Size Limit: Maximum 10MB file size for processing
- Text Extraction: Requires PDFs with extractable text (not scanned images)
- English Language: Optimized for English business documents
- Internet Required: Hugging Face API requires internet connection (fallback available)
- Client-Side Processing: PDF files never leave your browser
- Local Text Extraction: All PDF processing happens locally using PDF.js
- Minimal Data Transmission: Only extracted text snippets (500 characters) are sent to the AI API
- No File Storage: Documents are not saved or stored anywhere
- Privacy Mode: Can operate entirely offline with local AI analysis
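The 500-character limit on transmitted text could be enforced with a helper like the one below. This is a sketch under assumptions: the name `buildApiSnippet` is hypothetical, and taking the snippet from the start of the text after collapsing whitespace is a guess at how the snippet is chosen.

```javascript
const MAX_SNIPPET_CHARS = 500; // snippet length stated in the README

// Hypothetical helper: collapse whitespace, then keep only the first `maxChars` characters.
function buildApiSnippet(text, maxChars = MAX_SNIPPET_CHARS) {
  return text.replace(/\s+/g, " ").trim().slice(0, maxChars);
}
```

Only the return value of this helper would be included in the API request; the PDF file itself stays in the browser.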
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is part of a Digital Transformation assignment and is for educational purposes.
Onsenix12
- GitHub: @onsenix12
- Project: Digital Transformation Assignment 1
- Built with Create React App
- PDF processing powered by PDF.js
- Deployed on GitHub Pages
Built for Digital Transformation Assignment • Powered by Text Analysis