M8 Final Project · Applied NLP for university course evaluation
Team: Miao Jingzhe · Zeng · Zhong · Liang
An end-to-end NLP pipeline that turns unstructured student course reviews into actionable insights for professors and administrators. Built on Zhipu AI's GLM-4-Flash model — multilingual (English + Chinese), free tier, sub-second latency per review.
Feed in a CSV of anonymous student course reviews. Out comes:
- Sentiment classification — positive / neutral / negative, with a confidence score
- Aspect-based tagging — which part of the course is being discussed (teaching style, workload, materials, exams, instructor, logistics)
- Keyword extraction — the phrases students actually use
- An interactive dashboard — filterable by course, language, sentiment
Plus a live-demo mode for the final presentation: paste any comment, watch GLM classify it in real time.
campusvoice/
├── src/
│ ├── glm_client.py # GLM-4-Flash wrapper (retry, cache, JSON parsing)
│ ├── generate_data.py # Synthesize 300 labeled reviews for the demo
│ ├── pipeline.py # Run each review through GLM → enriched CSV
│ ├── evaluate.py # Accuracy + confusion matrix vs intended labels
│ └── app.py # Streamlit dashboard
├── data/
│ ├── feedback_raw.csv # generated
│ ├── feedback_analyzed.csv # pipeline output
│ └── .glm_cache/ # response cache (auto, gitignored)
├── outputs/ # evaluation artifacts for the slides
├── requirements.txt
├── Makefile
└── .env.example
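The tree above lists retry as one of the jobs of glm_client.py. The sketch below shows one common way to do that, exponential backoff around an arbitrary callable. It is not the project's actual code: the name `with_retry` and its parameters are hypothetical, and the real wrapper may retry on different errors or with different delays.

```python
import time


def with_retry(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `call()` up to `attempts` times, doubling the delay after each failure.

    Hypothetical helper for illustration; not taken from glm_client.py.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demo with a stand-in for the API call that fails twice, then succeeds.
attempts_made = []
delays = []


def flaky():
    attempts_made.append(1)
    if len(attempts_made) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Passing delays.append as `sleep` records the waits instead of pausing.
result = with_retry(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
```

Injecting the `sleep` function keeps the helper easy to test without real waiting.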
pip install -r requirements.txt
Create a file named .env containing:
ZHIPUAI_API_KEY=your_glm_api_key_here
Get a key at open.bigmodel.cn. GLM-4-Flash is free.
make generate
This asks GLM to write 300 realistic student reviews: a mix of positive, negative, and neutral, in both English and Chinese, across 8 courses. Each review comes with a target sentiment label we use later for evaluation.
make analyze
Every review gets classified. Results are cached on disk, so re-running is free.
make eval
Prints accuracy and a confusion matrix. Writes artifacts to outputs/.
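As a rough illustration of what the evaluation step computes, the sketch below derives accuracy and a confusion matrix from two label lists. It is a minimal stand-in, not the contents of evaluate.py: the function name `evaluate` is hypothetical, and the toy labels are made up for the example, not project results.

```python
from collections import Counter

LABELS = ["positive", "neutral", "negative"]


def evaluate(intended, predicted, labels=LABELS):
    """Return (accuracy, confusion).

    confusion[i][j] counts reviews whose intended label is labels[i]
    and whose predicted label is labels[j].
    """
    if len(intended) != len(predicted):
        raise ValueError("intended and predicted must have the same length")
    counts = Counter(zip(intended, predicted))
    confusion = [[counts[(t, p)] for p in labels] for t in labels]
    correct = sum(counts[(label, label)] for label in labels)
    accuracy = correct / len(intended) if intended else 0.0
    return accuracy, confusion


# Toy labels for illustration only.
intended = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "negative", "positive", "positive"]
accuracy, confusion = evaluate(intended, predicted)
```

Rows are intended labels and columns are predicted labels, so off-diagonal cells show which sentiments get confused with each other.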
make run
Opens at http://localhost:8501. Three views:
- Overview — all courses, all feedback
- By Course — pick one course, see only its reviews
- Live Demo — paste any text, analyze it live (this is the presentation mode)
| Member | Role | Modules |
|---|---|---|
| Miao | Data Engineering & Preprocessing | generate_data.py, pipeline.py |
| Zeng | NLP Integration & Model Deployment | glm_client.py, prompt design |
| Zhong | Frontend UI & Data Visualization | app.py, chart design |
| Liang | Project Management & Testing | evaluate.py, README, demo script |
Why GLM-4-Flash, not a fine-tuned classifier? A fine-tuned BERT would need labeled data we don't have and a training loop we don't have time for. GLM-4-Flash is free, handles Chinese + English out of the box, and lets us do both sentiment classification and aspect tagging in a single API call via structured JSON output.
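To make the "single API call via structured JSON output" idea concrete, here is a sketch of how one model reply could be parsed and validated. The field names (`sentiment`, `aspects`, `keywords`) and the function `parse_analysis` are assumptions for illustration. The actual prompt and schema live in glm_client.py and may differ.

```python
import json

SENTIMENTS = {"positive", "neutral", "negative"}
ASPECTS = {"teaching style", "workload", "materials",
           "exams", "instructor", "logistics"}


def parse_analysis(raw):
    """Extract and validate one JSON object from a model reply.

    Hypothetical schema: {"sentiment": str, "aspects": [str], "keywords": [str]}.
    """
    # Models sometimes add text around the JSON, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start:end + 1])

    sentiment = str(data.get("sentiment", "")).lower()
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")

    # Drop aspect tags outside the fixed category list.
    aspects = [a for a in data.get("aspects", []) if a in ASPECTS]
    keywords = [str(k) for k in data.get("keywords", [])]
    return {"sentiment": sentiment, "aspects": aspects, "keywords": keywords}


# Made-up reply for illustration.
reply = ('Here is the result: {"sentiment": "Negative", '
         '"aspects": ["workload", "parking"], '
         '"keywords": ["too many assignments"]}')
parsed = parse_analysis(reply)
```

Because sentiment and aspects arrive in the same object, one request per review covers both tasks.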
Why cache every response?
During development we re-run the pipeline dozens of times. Caching on a
SHA-256 of (system, prompt) means we only pay (in latency and quota) for
each unique review once. Purge with make clean.
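A minimal sketch of this caching scheme, assuming one JSON file per key in a cache directory. The helper names (`cache_key`, `cached_call`) and the file layout are hypothetical, and `fake_model` stands in for the real GLM call. The project's own cache in data/.glm_cache/ may be organized differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(system, prompt):
    """SHA-256 hex digest of the (system, prompt) pair."""
    # JSON-encoding the pair keeps ("a", "b") distinct from ("ab", "").
    payload = json.dumps([system, prompt], ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_call(system, prompt, call_model, cache_dir):
    """Return the cached response if present, else call the model and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(system, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = call_model(system, prompt)
    path.write_text(json.dumps(response, ensure_ascii=False), encoding="utf-8")
    return response


# Demo with a stand-in model that records how often it is called.
calls = []


def fake_model(system, prompt):
    calls.append(prompt)
    return {"sentiment": "positive"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_call("You are a classifier.", "Great course!", fake_model, tmp)
    second = cached_call("You are a classifier.", "Great course!", fake_model, tmp)
```

The second call is served from disk, so the stand-in model runs only once. Changing either the system message or the prompt produces a new key and a fresh request.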
Why synthesize the data instead of scraping real reviews? Two reasons. (1) It gives us ground-truth labels for evaluation: we ask GLM to write a review with a target sentiment, then check whether it classifies its own output correctly. (2) We control the distribution, so every aspect category is represented.
Evaluating GLM on reviews it wrote itself is documented as a limitation in the report. A real deployment would use human-annotated data.
- Open the dashboard on Overview — show the KPIs and sentiment-by-aspect chart.
- Switch to By Course — pick a course with mixed feedback, narrate what a professor would learn from this view.
- Switch to Live Demo — paste a review the audience suggests, or a pre-prepared bilingual one. Show the JSON output.
- Close with the accuracy number from make eval.