> A personal, automated intelligence pipeline that aggregates ArXiv research papers and tech news, processes them with an LLM, and publishes structured daily and weekly digests to a GitHub Pages site — with a Telegram push for on-the-go reading.
Keeping up with AI/ML research and tech news is a full-time job. ArXiv alone publishes hundreds of papers per day across cs.AI, cs.LG, cs.CV, cs.CL, and cs.RO. Manually filtering signal from noise is exhausting.
AI Digest Daily automates the entire curation loop:
- Every morning it fetches new ArXiv papers and RSS news from curated sources
- Low-signal papers are pruned early using a Semantic Scholar h-index prefilter — before any LLM call is made
- An LLM scores relevance, categorises each item by topic, and generates structured insight blocks for papers
- Everything is persisted to a local SQLite database for deduplication and weekly rollups
- The day's results are rendered to Markdown, committed to GitHub Pages, and pushed as a Telegram message
The system runs entirely on a self-hosted GitHub Actions runner, so the SQLite database survives between runs naturally — no cloud database required.
Two complementary discovery modes run every day and feed the same downstream pipeline:
| Mode | How it works |
|---|---|
| Subject (RSS) | ArXiv RSS feed queried per configured category (cs.AI, cs.LG, cs.CV, cs.CL, cs.RO, cs.HC) — global trend discovery |
| Topic (keyword search) | ArXiv search API queried per topic_interests entry using ti:/abs: phrase search with a date-range filter — personalized discovery |
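
The topic mode above can be sketched as a small URL builder. This is a minimal illustration, not the project's `fetch_arxiv_by_topic()` implementation: the function name `build_topic_query`, the `max_results` default, and the sort options are assumptions. The `ti:`/`abs:` prefixes and the `submittedDate:[... TO ...]` range syntax come from the public arXiv API.

```python
from datetime import date
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_topic_query(topic: str, start: date, end: date, max_results: int = 50) -> str:
    """Build an arXiv API URL that phrase-searches title and abstract in a date window.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    phrase = f'"{topic}"'  # quotes force an exact phrase match
    window = f"[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"  # YYYYMMDDHHMM bounds
    search = f"(ti:{phrase} OR abs:{phrase}) AND submittedDate:{window}"
    params = {
        "search_query": search,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

The returned URL can be passed to any HTTP client or to `feedparser.parse`, since the API responds with an Atom feed.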
| Step | What happens |
|---|---|
| Fetch | Both modes run; each paper tagged with discovery_type (subject or topic) and search_topic |
| Dedup | Each arxiv_id is checked against SQLite before any processing begins |
| H-index filter | Semantic Scholar API fetches author h-indexes; papers with no high-h-index author are dropped, saving LLM budget |
| Author watchlist | Papers by watched authors receive an automatic relevance boost |
| LLM pass 1 — Scoring | Papers are batched and sent to the LLM; each returns {arxiv_id, relevance 1–5, topic_category} as JSONL |
| LLM pass 2 — Insights | Papers above the relevance cutoff receive a full structured insight block: contribution, core idea, technique, pipeline, methodology, results, limitations |
| Code detection | Abstract is scanned for GitHub URLs; Papers With Code API queried for official repositories |
| Persist | Results written to the papers table; embedding column reserved (NULL) for V2 RAG |
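
The JSONL contract of scoring pass 1 suggests a tolerant parser along the following lines. This is a sketch under assumptions: the function name `parse_scores` is hypothetical, and so is the policy of silently skipping malformed or out-of-range lines. The actual parser in `scoring.py` may behave differently.

```python
import json


def parse_scores(raw: str) -> list[dict]:
    """Parse JSONL scoring output, keeping only well-formed records.

    Assumed record shape (from the table above):
    {"arxiv_id": ..., "relevance": 1-5, "topic_category": ...}
    """
    records = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            continue  # the model emitted prose or a truncated line; skip it
        if not isinstance(item, dict) or "arxiv_id" not in item:
            continue
        relevance = item.get("relevance")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(relevance, int) or isinstance(relevance, bool):
            continue
        if not 1 <= relevance <= 5:
            continue
        records.append(item)
    return records
```

Skipping bad lines rather than failing the batch means one malformed line costs one paper instead of the whole LLM call.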
| Step | What happens |
|---|---|
| Fetch | All enabled RSS sources polled for items newer than the previous run window |
| Dedup | URL deduplicated against SQLite before LLM calls |
| LLM summarise | Single call per item returns {summary, tags, topic_category, relevance 1–5} |
| Filter & persist | Items below relevance cutoff are discarded; the rest are written to news_items |
| Step | What happens |
|---|---|
| Fetch | gtrending library queries trending repos for each configured language ("" = all, python, jupyter-notebook, etc.) with the configured since window (daily / weekly / monthly) |
| Archived filter | Repos flagged archived=True are immediately dropped |
| GitHub API enrichment | Topics, pushed_at, description, and README text fetched via GitHub REST API (parallel, up to 4 workers) |
| Stage 1 — Keyword match | Repo name, description, topics, and README scanned against merged keyword list (ArXiv topic_interests + interests) — misses are dropped without LLM cost |
| Stage 2 — LLM score | Each surviving repo sent to LLM with full context; returns {relevance 1–5, topic_category, summary, tags}; repos below relevance_cutoff dropped |
| Sort & cap | Repos sorted by pushed_at descending, capped at max_repos |
| Persist | Written to news_items table with source="GitHub Trending" |
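
Stage 1 above is a plain substring check, which is what makes it free of LLM cost. A minimal sketch follows, assuming each repo is a dict with `name`, `description`, `topics`, and `readme` keys. Those key names and the function name `matches_keywords` are illustrative assumptions, not the pipeline's actual data model.

```python
def matches_keywords(repo: dict, keywords: list[str]) -> bool:
    """Return True if any keyword occurs in the repo's name, description, topics, or README.

    Matching is case-insensitive substring search; missing fields count as empty.
    """
    haystack = " ".join([
        repo.get("name") or "",
        repo.get("description") or "",
        " ".join(repo.get("topics") or []),
        repo.get("readme") or "",
    ]).lower()
    # Ignore blank keywords so an empty config entry cannot match everything
    return any(kw.lower() in haystack for kw in keywords if kw.strip())
```

Only repos for which this returns `True` would go on to the Stage 2 LLM call.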
- GitHub Pages — Daily digest and weekly rollup pages committed and deployed on every run
- Telegram — Numbered list of paper titles with ArXiv links + news headlines pushed to a private bot; weekly push includes the LLM-generated narrative paragraph
- Weekly rollup — Every Sunday, a separate workflow queries the week's DB entries, generates a 4–5 sentence narrative with the LLM, and publishes a rollup page
- Reading progress bar — 3 px accent bar fixed to the top, filled by 8 lines of vanilla JS
- Page fade-in — 300 ms ease-in with 6 px upward drift, CSS only
- Smooth `<details>` expand — `max-height` transition, no snap-open
- Tag hover — background color shift in 150 ms
- Relevance dots — `●●●○○` visual scannable score (no numbers needed)
- GitHub code badge — Shields.io badge linking directly to the paper's code repository when available
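
The relevance dots can be produced with a one-line string expression. A small sketch, where the function name `relevance_dots` and the clamping of out-of-range scores are assumptions for illustration:

```python
def relevance_dots(score: int, scale: int = 5) -> str:
    """Render a 1-5 relevance score as filled and empty dots, e.g. 3 -> ●●●○○."""
    clamped = max(0, min(scale, score))  # guard against out-of-range LLM output
    return "●" * clamped + "○" * (scale - clamped)
```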
| Library | Role |
|---|---|
| `feedparser` | Parse ArXiv RSS and all news RSS/Atom feeds with one unified interface |
| `gtrending` | Scrape GitHub Trending page for rising repositories by language and time window |
| `openai` | OpenAI-compatible client used for both the self-hosted vLLM endpoint and the Azure OpenAI fallback |
| `requests` | Synchronous HTTP for Semantic Scholar API, Papers With Code API, and Telegram Bot API |
| `PyYAML` | Load config.yaml — user preferences, source list, topic categories |
| `python-dotenv` | Inject secrets from .env at runtime, keeping credentials out of version control |
| `retry` | Transparent retry decorator on flaky external API calls (Semantic Scholar, ArXiv RSS) |
| `tqdm` | Progress bars during batch operations |
| `sqlite3` | Standard library — no ORM, no migration framework; intentionally minimal |
| Component | Detail |
|---|---|
| Primary | Self-hosted vLLM endpoint — any OpenAI-compatible model |
| Fallback 1 | Local Ollama — used if vLLM is unavailable; endpoint and model configurable via OLLAMA_ENDPOINT / OLLAMA_MODEL env vars |
| Fallback 2 | Azure OpenAI — last resort if both primary and Ollama fail |
| Routing | Sequential try/except per tier — no circuit breaker complexity |
| Deep model | ArXiv insight generation pass — higher quality, used once per paper |
| Fast model | Relevance scoring and news summarisation — throughput-optimised |
| Component | Detail |
|---|---|
| Jekyll | Static site generator; Markdown files from the pipeline become HTML pages at deploy time |
| Minimal Mistakes | Jekyll theme providing layout, typography, and responsive design |
| `peaceiris/actions-gh-pages` | GitHub Action that publishes site/ to the gh-pages branch on every pipeline run |
| API | Used for |
|---|---|
| ArXiv RSS | Source of new paper submissions — official feed, no scraping |
| Semantic Scholar | Author h-index lookup for prefiltering; watched-author matching |
| Papers With Code | Code repository lookup per paper (official repo, sorted by stars) |
| Telegram Bot API | Outbound-only push for daily and weekly summaries |
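
The h-index prefilter reduces to a simple predicate once the author h-indexes have been fetched from Semantic Scholar. The sketch below assumes they arrive as a list of integers, with `None` for authors the API could not resolve. The function name `passes_hindex_filter` and the threshold of 20 are illustrative assumptions; the real cutoff lives in the project's config.

```python
from typing import Optional


def passes_hindex_filter(author_hindexes: list[Optional[int]], min_hindex: int = 20) -> bool:
    """Keep a paper if at least one author meets the (assumed) h-index threshold."""
    # Authors without a resolved h-index are treated as unknown, not as zero
    known = [h for h in author_hindexes if h is not None]
    return any(h >= min_hindex for h in known)
```

A paper whose authors are all unknown is dropped by this sketch. Whether unknown authors should instead pass through is a design choice the source does not state.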
The project is structured as three independent pipelines (ArXiv, news, and GitHub Trending) sharing a common database, LLM client, and rendering layer.
ai-digest-daily/
├── src/
│ ├── configs/
│ │ ├── config.yaml ← user preferences (sources, categories, cutoffs)
│ │ └── authors.txt ← watched Semantic Scholar author IDs
│ │
│ ├── db/
│ │ └── database.py ← SQLite init, upsert, dedup helpers
│ │
│ ├── llm/
│ │ ├── client.py ← primary vLLM → Ollama → Azure fallback routing
│ │ ├── prompts.py ← all prompt templates (scoring, insights, news, narrative)
│ │ └── scoring.py ← batched relevance scoring + JSONL parser
│ │
│ ├── pipelines/
│ │ ├── arxiv_pipeline.py ← fetch (RSS + keyword search) → filter → score → code lookup → insights → DB
│ │ ├── news_pipeline.py ← fetch → dedup → summarise → DB
│ │ └── github_trending_pipeline.py ← fetch → keyword filter → LLM score → DB (source="GitHub Trending")
│ │
│ ├── render/
│ │ ├── daily.py ← daily digest Markdown with relevance dots + code badges
│ │ └── weekly.py ← weekly rollup Markdown with narrative blockquote
│ │
│ ├── delivery/
│ │ └── telegram.py ← outbound Telegram push (daily + weekly)
│ │
│ ├── main.py ← daily entrypoint → python -m src.main
│ └── weekly_main.py ← weekly entrypoint → python -m src.weekly_main
│
├── site/ ← Jekyll source (pushed to gh-pages on each run)
│ ├── _config.yml
│ ├── _posts/ ← generated daily digest pages land here
│ ├── _weekly/ ← generated weekly rollup pages land here
│ ├── _pages/ ← static pages: archive, papers, news, weekly index
│ ├── assets/css/digest.css ← progress bar, fade-in, transitions
│ └── assets/js/progress.js ← 8-line scroll progress bar
│
├── .github/workflows/
│ ├── daily.yml ← cron 03:00 UTC (10:00 UTC+7), self-hosted runner
│ └── weekly.yml ← cron Sunday 20:00 UTC, self-hosted runner
│
├── digest/ ← gitignored; SQLite DB lives here
└── .env ← secrets, never committed
ArXiv RSS (per category) ──┐
                           │  fetch_arxiv_rss()
ArXiv keyword search ──────┘  fetch_arxiv_by_topic()
(per topic_interest)       │  discovery_type + search_topic tagged
                           │
                    dedup vs SQLite ──── (already seen → skip)
                           │
                 Semantic Scholar API
                   (h-index filter)
                           │
                      LLM pass 1
                    (batch scoring)
                           │
                   relevance cutoff ──── (score < 2 → drop)
                           │
                   Papers With Code
                   (code URL lookup)
                           │
                      LLM pass 2
                 (per-paper insights)
                           │
                    upsert_paper()
                           │

RSS Feeds ─────────┐
                   ▼
           fetch_rss_feed()
                   │
            dedup vs SQLite
                   │
             LLM summarise
                   │
           relevance cutoff
                   │
          upsert_news_item()
                   │

GitHub Trending ───────────┐
                           ▼
                gtrending.fetch_repos()
                           │
                    archived filter
                           │
                 GitHub API enrichment
              (topics, README, pushed_at)
                           │
                Stage 1: keyword match ──── (no match → skip)
                           │
                  Stage 2: LLM score ──────── (score < cutoff → drop)
                           │
                  upsert_news_item()
              (source="GitHub Trending")
                           │

                    render_daily()
                           │
               site/_posts/YYYY-MM-DD.md
                           │
                  GitHub Pages deploy ──► telegram.push_daily()
papers
| Column | Type | Notes |
|---|---|---|
| `arxiv_id` | TEXT PK | Deduplication key |
| `title` | TEXT | |
| `authors` | TEXT | Comma-separated |
| `abstract` | TEXT | Raw from ArXiv |
| `categories` | TEXT | ArXiv categories |
| `topic_category` | TEXT | LLM-assigned |
| `relevance` | INTEGER | 1–5 |
| `insights` | TEXT | JSON blob (7 structured fields) |
| `code_url` | TEXT | GitHub repo URL if found |
| `discovery_type` | TEXT | subject (RSS) or topic (keyword search) |
| `search_topic` | TEXT | Interest topic that triggered discovery (topic mode only) |
| `published_date` | TEXT | |
| `fetched_date` | TEXT | |
| `digest_date` | TEXT | |
| `digest_week` | INTEGER | ISO week — enables weekly rollup queries |
| `embedding` | BLOB | NULL in V1, reserved for V2 RAG backfill |
news_items
| Column | Type | Notes |
|---|---|---|
| `url` | TEXT PK | Deduplication key |
| `title` | TEXT | |
| `source` | TEXT | Source name from config |
| `summary` | TEXT | LLM-generated, 2–4 sentences |
| `tags` | TEXT | Comma-separated |
| `topic_category` | TEXT | LLM-assigned |
| `relevance` | INTEGER | 1–5 |
| `published_date` | TEXT | |
| `fetched_date` | TEXT | |
| `digest_date` | TEXT | |
| `digest_week` | INTEGER | ISO week |
| `embedding` | BLOB | NULL in V1 |
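
The `news_items` schema above maps directly onto standard-library `sqlite3`. The following is a sketch, not the contents of `database.py`: the helper names `init_db` and `is_seen`, the choice of which columns an upsert overwrites, and stamping `digest_date` with the run date are all assumptions. It uses SQLite's `ON CONFLICT ... DO UPDATE` clause, which requires SQLite 3.24 or newer.

```python
import sqlite3
from datetime import date


def init_db(conn: sqlite3.Connection) -> None:
    """Create the news_items table with the columns listed above."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS news_items (
            url TEXT PRIMARY KEY,
            title TEXT, source TEXT, summary TEXT, tags TEXT,
            topic_category TEXT, relevance INTEGER,
            published_date TEXT, fetched_date TEXT, digest_date TEXT,
            digest_week INTEGER,
            embedding BLOB  -- stays NULL in V1
        )""")


def is_seen(conn: sqlite3.Connection, url: str) -> bool:
    """Dedup check: has this URL already been stored?"""
    row = conn.execute("SELECT 1 FROM news_items WHERE url = ?", (url,)).fetchone()
    return row is not None


def upsert_news_item(conn: sqlite3.Connection, item: dict, today: date) -> None:
    """Insert an item, or refresh summary and relevance if the URL already exists."""
    conn.execute(
        """INSERT INTO news_items
               (url, title, source, summary, tags, topic_category, relevance,
                fetched_date, digest_date, digest_week)
           VALUES (:url, :title, :source, :summary, :tags, :topic_category, :relevance,
                   :day, :day, :week)
           ON CONFLICT(url) DO UPDATE SET
               summary = excluded.summary, relevance = excluded.relevance""",
        {**item, "day": today.isoformat(), "week": today.isocalendar()[1]},
    )
    conn.commit()
```

Storing the ISO week number at write time keeps the weekly rollup query to a single equality filter on `digest_week`.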
complete(prompt)
│
├─ try: POST vLLM endpoint (self-hosted)
│ └─ success → return response
│
├─ except: POST Ollama (localhost:11434, fallback 1)
│ └─ success → return response
│
└─ except: POST Azure OpenAI (fallback 2)
└─ return response
No retry loops, no circuit breaker — sequential try/except per tier. Simple and auditable.
Ollama endpoint and model are configurable via OLLAMA_ENDPOINT and OLLAMA_MODEL env vars (defaults: http://localhost:11434/v1, qwen3:14b).
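
The sequential routing described above can be sketched independently of any particular client library. Here each tier is an ordinary callable taking a prompt and returning text. The two-argument `complete` signature, the `tiers` list, and the `AllTiersFailed` exception are assumptions made to keep the example self-contained; the project's `complete(prompt)` presumably builds its tiers internally.

```python
from typing import Callable, Sequence


class AllTiersFailed(RuntimeError):
    """Raised when every configured LLM tier raised an exception (hypothetical name)."""


def complete(prompt: str, tiers: Sequence[tuple[str, Callable[[str], str]]]) -> str:
    """Try each tier in order and return the first successful response."""
    errors = []
    for name, call in tiers:
        try:
            return call(prompt)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise AllTiersFailed("; ".join(errors))
```

In practice the callables would wrap the vLLM, Ollama, and Azure OpenAI clients in that order. Collecting the per-tier error messages keeps the failure auditable when all three are down.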
- Python 3.10+
- Ruby + Bundler (for local site preview only)
- A self-hosted GitHub Actions runner (for scheduled runs with persistent SQLite)
git clone https://github.com/hiimmuc/ai-digest-daily
cd ai-digest-daily
pip install -r requirements.txt
cp .env.example .env
# Edit .env — minimum required: AZURE_OPENAI_* keys
# VLLM_ENDPOINT is optional; falls back to Ollama then Azure automatically
# OLLAMA_ENDPOINT / OLLAMA_MODEL are optional (defaults: localhost:11434, qwen3:14b)
Edit src/configs/config.yaml to set:
- ArXiv categories and relevance cutoffs
- News RSS sources (enable/disable per source)
- Telegram delivery toggle
- Topic categories for LLM classification
Optionally add watched author IDs to src/configs/authors.txt:
# Name, Semantic Scholar Author ID
Yann LeCun, 1688882
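
A file in this `Name, ID` format can be read with a few lines of string handling. The sketch below is an assumption about how such a parser might look: the function name `parse_watchlist` is hypothetical, as is the policy of skipping lines without a numeric ID.

```python
def parse_watchlist(text: str) -> dict[str, str]:
    """Map Semantic Scholar author ID -> display name, ignoring comments and blank lines."""
    watched = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last comma so names that contain commas still parse
        name, sep, author_id = line.rpartition(",")
        author_id = author_id.strip()
        if not sep or not author_id.isdigit():
            continue  # malformed line: no comma or a non-numeric ID
        watched[author_id] = name.strip()
    return watched
```

Keying the result by author ID makes the watchlist check a dictionary lookup against the IDs returned by Semantic Scholar.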
# Daily pipeline (ArXiv + News → DB → Markdown → Telegram)
python -m src.main
# Weekly rollup (query DB → narrative → Markdown → Telegram)
python -m src.weekly_main
# Preview the Jekyll site locally
cd site && bundle install && bundle exec jekyll serve
Cron trigger (GitHub Actions)
↓
Self-hosted runner wakes up
↓
main.py runs both pipelines
↓
SQLite updated
↓
Markdown digest generated
↓
Committed to gh-pages branch
↓
GitHub Pages deploys automatically
↓
Telegram push sent
The pipeline requires a self-hosted runner so that the SQLite database persists between runs (GitHub-hosted runners are ephemeral and would lose the DB on every run).
- Register a runner on your machine:

      # Download from: Settings → Actions → Runners → New self-hosted runner
      mkdir actions-runner && cd actions-runner
      curl -o actions-runner-linux-x64.tar.gz -L https://github.com/actions/runner/releases/latest/download/actions-runner-linux-x64.tar.gz
      tar xzf actions-runner-linux-x64.tar.gz
      ./config.sh --url https://github.com/<owner>/<repo> --token <RUNNER_TOKEN>

- Install as a systemd service so it survives reboots:

      sudo ./svc.sh install
      sudo ./svc.sh start
      # Check status
      sudo ./svc.sh status

- Verify the runner appears as Idle under Settings → Actions → Runners in your repository.

The workflows use `runs-on: self-hosted`, so GitHub routes all scheduled jobs to this runner automatically.
Set the following secrets under Settings → Secrets and variables → Actions:
| Secret | Required | Description |
|---|---|---|
| `VLLM_ENDPOINT` | Optional | Self-hosted vLLM base URL |
| `VLLM_MODEL` | Optional | Model name on the vLLM instance |
| `OLLAMA_ENDPOINT` | Optional | Ollama base URL (default: http://localhost:11434/v1) |
| `OLLAMA_MODEL` | Optional | Ollama model name (default: qwen3:14b) |
| `AZURE_OPENAI_ENDPOINT` | Required | Azure OpenAI resource URL |
| `AZURE_OPENAI_KEY` | Required | Azure API key |
| `AZURE_OPENAI_MODEL` | Required | Deployment name (e.g. gpt-4o) |
| `SEMANTIC_SCHOLAR_KEY` | Optional | S2 API key (avoids rate limits) |
| `GITHUB_TOKEN` | Optional | GitHub PAT for GitHub API enrichment in the Trending pipeline (avoids rate limits) |
| `TELEGRAM_BOT_TOKEN` | Optional | Bot token from @BotFather |
| `TELEGRAM_CHAT_ID` | Optional | Target chat or channel ID |
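
One way to turn these variables into an ordered tier list is sketched below. The function name `llm_tiers_from_env` and the shape of the returned dicts are assumptions for illustration. The variable names, defaults, and required/optional status are taken from the table above.

```python
import os
from typing import Mapping, Optional

REQUIRED_AZURE = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY", "AZURE_OPENAI_MODEL")


def llm_tiers_from_env(env: Optional[Mapping[str, str]] = None) -> list[dict]:
    """Build the vLLM -> Ollama -> Azure tier list from environment variables."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_AZURE if not env.get(key)]
    if missing:
        raise KeyError(f"missing required secrets: {', '.join(missing)}")

    tiers = []
    if env.get("VLLM_ENDPOINT"):  # optional primary tier
        tiers.append({"name": "vllm", "base_url": env["VLLM_ENDPOINT"],
                      "model": env.get("VLLM_MODEL", "")})
    tiers.append({"name": "ollama",
                  "base_url": env.get("OLLAMA_ENDPOINT", "http://localhost:11434/v1"),
                  "model": env.get("OLLAMA_MODEL", "qwen3:14b")})
    tiers.append({"name": "azure", "base_url": env["AZURE_OPENAI_ENDPOINT"],
                  "model": env["AZURE_OPENAI_MODEL"]})
    return tiers
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching real secrets.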