Promote Lance datasets to a top-level Datasets tab with auto-sync #237

Merged
prrao87 merged 3 commits into main from claude/great-chebyshev-111a9d on May 12, 2026

@prrao87 prrao87 commented May 12, 2026

Summary

  • Promote Hugging Face Lance datasets out of Integrations > Hugging Face Hub into a top-level Datasets tab between Demos and API Reference. Each of the ~30 datasets in lance-format/lance-huggingface gets its own page, organized into 9 categories in the sidebar (Image Classification, Object Detection & Segmentation, Image Retrieval, VQA, Text QA, Text Corpora, Speech, Video, Robotics).
  • Per-dataset pages are auto-generated from upstream HF_DATASET_CARD.md files. The GitHub repo stays the single source of truth — the same files are pushed to HF Hub as each dataset's README.md, so we don't maintain content in two places.
  • The Hugging Face Hub integration walkthrough (the LAION example) moves to integrations/ai/huggingface.mdx, alongside the other AI-platform integration pages. Old /huggingface/... URLs redirect to the new locations.

How the sync works

scripts/sync_hf_datasets.py (run via make hf-sync) reads scripts/hf_datasets.yaml, fetches each upstream card, and updates three things in one pass:

  1. docs/datasets/<slug>.mdx — one MDX page per dataset, with Mintlify frontmatter (title, sidebarTitle, description derived from the first prose paragraph), a "View on Hugging Face" link card injected at the top, the upstream H1 stripped, and the rest of the body passed through.
  2. docs/datasets/index.mdx — the landing-page card grid is regenerated between HF_SYNC:START / HF_SYNC:END MDX comment markers. The hand-written intro and the "Share your own dataset" footer stay untouched.
  3. docs/docs.json — the Datasets tab's groups array is rebuilt to match the yaml, so the sidebar stays in sync without manual edits.
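The three updates above can be sketched roughly as follows. This is an illustrative sketch, not the code in scripts/sync_hf_datasets.py: the function names, the exact marker spelling, the `<Card>` markup, and the `navigation.tabs` layout of docs.json are all assumptions made for the example, and the sketch ignores where the landing page sits in the tab.

```python
import json
import re

# Assumed spelling of the MDX comment markers in docs/datasets/index.mdx.
START, END = "{/* HF_SYNC:START */}", "{/* HF_SYNC:END */}"


def first_prose_paragraph(body: str) -> str:
    """Return the first blank-line-separated block that looks like prose."""
    for block in re.split(r"\n\s*\n", body):
        text = " ".join(block.split())
        # Skip headings, HTML/JSX, images, tables, code, and list items.
        if text and not text.startswith(("#", "<", "!", "|", "`", "-", "*")):
            return text
    return ""


def build_page(card_md: str, title: str, hf_url: str) -> str:
    """Step 1: frontmatter + link card + upstream body without its H1."""
    # Drop the YAML frontmatter that Hub cards usually carry, if present.
    body = re.sub(r"\A---\n.*?\n---\n", "", card_md, flags=re.DOTALL).lstrip()
    # Strip a leading H1; the title moves into the frontmatter.
    body = re.sub(r"\A# .*\n+", "", body)
    quote = lambda s: json.dumps(s, ensure_ascii=False)  # JSON strings are valid YAML
    frontmatter = "\n".join([
        "---",
        f"title: {quote(title)}",
        f"sidebarTitle: {quote(title)}",
        f"description: {quote(first_prose_paragraph(body))}",
        "---",
    ])
    card = f'<Card title="View on Hugging Face" href="{hf_url}" />'
    return f"{frontmatter}\n\n{card}\n\n{body.rstrip()}\n"


def replace_between_markers(text: str, generated: str) -> str:
    """Step 2: swap only the region between the markers; keep intro and footer."""
    head, rest = text.split(START, 1)   # raises ValueError if a marker is missing
    _, tail = rest.split(END, 1)
    return f"{head}{START}\n{generated}\n{END}{tail}"


def rebuild_groups(docs: dict, categories: dict[str, list[str]]) -> dict:
    """Step 3: rebuild the Datasets tab's groups from {category: [slug, ...]}."""
    for tab in docs["navigation"]["tabs"]:
        if tab.get("tab") == "Datasets":
            tab["groups"] = [
                {"group": name, "pages": [f"datasets/{slug}" for slug in slugs]}
                for name, slugs in categories.items()
            ]
    return docs
```

Because each function is a pure transformation of its inputs, running the same inputs through twice produces the same output, which is the property the idempotency item in the test plan relies on.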

The yaml encodes four fields per dataset (dir, slug, hf, title) because the GitHub directory names, the HF Hub repo slugs, and the desired URL slugs don't follow a derivable convention (e.g. imagenet1k_val → imagenet-1k-val-lance → /datasets/imagenet-1k-val). Verbose but explicit; adding a new dataset is still one line.
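To make the four fields concrete, here is a hedged sketch of how one entry could be modeled once loaded. The `DatasetEntry` class, the `RAW_BASE` URL (including the `main` branch and the one-directory-per-dataset layout), and the example title are assumptions for illustration; only the field names and the three imagenet names come from the description above.

```python
from dataclasses import dataclass

# Assumed location of upstream cards; branch name and path layout are guesses.
RAW_BASE = "https://raw.githubusercontent.com/lance-format/lance-huggingface/main"


@dataclass(frozen=True)
class DatasetEntry:
    dir: str    # directory name in the GitHub repo
    slug: str   # URL slug under /datasets/
    hf: str     # Hugging Face Hub repo name
    title: str  # display title used in the page and sidebar

    @property
    def card_url(self) -> str:
        """Where the upstream HF_DATASET_CARD.md would be fetched from."""
        return f"{RAW_BASE}/{self.dir}/HF_DATASET_CARD.md"

    @property
    def page_path(self) -> str:
        """Where the generated MDX page is written."""
        return f"docs/datasets/{self.slug}.mdx"


entry = DatasetEntry(
    dir="imagenet1k_val",
    slug="imagenet-1k-val",
    hf="imagenet-1k-val-lance",
    title="ImageNet-1k (val)",  # hypothetical display title
)
```

Keeping all three names explicit means none of them has to be guessed from another, at the cost of a slightly longer line per dataset.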

MDX sanitization

Two upstream cards triggered MDX parse errors during the initial sync, so the script's sanitizer now handles them defensively:

  • openvid_hf/HF_DATASET_CARD.md has a bare @article{...} citation outside a code fence — MDX treats {...} as a JSX expression and rejects it. The script auto-wraps these in a ```bibtex block.
  • laion-1M/HF_DATASET_CARD.md contains the string Lance<>HF in prose — MDX parses <> as an empty React fragment. The script escapes literal <> to &lt;&gt; outside code regions.

Both could also be fixed upstream; the sanitizer is belt-and-braces for future cards.
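A minimal sketch of the two sanitization rules, assuming a line-by-line pass that tracks fenced code blocks and inline code spans. The function names and the brace-counting heuristic for finding the end of a citation are assumptions for this example and may differ from what the script actually does.

```python
import re

FENCE = "`" * 3                           # triple backtick, built indirectly
BIBTEX_START = re.compile(r"^@\w+\{")     # e.g. "@article{key,"
INLINE_CODE = re.compile(r"(`[^`]*`)")    # capture group keeps code spans in split()


def escape_fragments(line: str) -> str:
    """Escape literal <> in prose, leaving inline code spans untouched."""
    parts = INLINE_CODE.split(line)
    # Even indices are prose; odd indices are the captured inline code spans.
    return "".join(
        p if i % 2 else p.replace("<>", "&lt;&gt;") for i, p in enumerate(parts)
    )


def sanitize(md: str) -> str:
    out: list[str] = []
    in_fence = False
    depth = 0  # > 0 while copying a bare BibTeX entry that is being wrapped
    for line in md.splitlines():
        if depth > 0:
            out.append(line)
            depth += line.count("{") - line.count("}")
            if depth <= 0:                # braces balanced: entry is complete
                out.append(FENCE)
                depth = 0
        elif line.lstrip().startswith(FENCE):
            in_fence = not in_fence       # opening or closing an existing fence
            out.append(line)
        elif in_fence:
            out.append(line)              # never touch existing code blocks
        elif BIBTEX_START.match(line):
            out.extend([FENCE + "bibtex", line])
            depth = line.count("{") - line.count("}")
            if depth <= 0:                # single-line entry
                out.append(FENCE)
                depth = 0
        else:
            out.append(escape_fragments(line))
    if depth > 0:
        out.append(FENCE)                 # close an entry whose braces never balanced
    return "\n".join(out) + ("\n" if md.endswith("\n") else "")
```

Once a citation is wrapped, a second pass sees it as an ordinary fenced block and leaves it alone, so the sanitizer does not double-wrap on repeated syncs.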

Adding a new dataset

  1. Author the new HF_DATASET_CARD.md upstream in lance-format/lance-huggingface and push it to the Hub as usual.
  2. Add one line under the right category in scripts/hf_datasets.yaml.
  3. Run make hf-sync and commit the regenerated MDX, docs.json, and the yaml entry.

See the new "Sync Hugging Face dataset pages" section in README.md for the full maintainer workflow.

Test plan

  • cd docs && mint dev — verify the Datasets tab renders with the 9 categories in the sidebar
  • Click into 3–4 dataset pages (e.g. /datasets/flickr30k, /datasets/laion-1m, /datasets/openvid, /datasets/lerobot-pusht) — confirm titles, descriptions, "View on Hugging Face" link, and body content render correctly
  • Visit /huggingface/overview and /huggingface/datasets — confirm both redirect to the new locations
  • Verify cd docs && mint broken-links reports no broken links
  • Re-run make hf-sync — confirm it's idempotent (no diff on a clean run)
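The idempotency item can also be checked programmatically. The sketch below is one possible approach, not part of this PR: it hashes every file under a directory before and after calling a sync function and reports what changed. `snapshot` and `changed_files` are hypothetical helper names; in practice, running `git diff --exit-code` after `make hf-sync` on a clean tree is a simpler equivalent.

```python
import hashlib
from pathlib import Path
from typing import Callable


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root to the SHA-256 digest of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(sync: Callable[[], None], root: Path) -> list[str]:
    """Run sync once and list files that were added, removed, or modified."""
    before = snapshot(root)
    sync()
    after = snapshot(root)
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

An idempotent sync should return an empty list on its second consecutive run.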

🤖 Generated with Claude Code

prrao87 and others added 2 commits May 12, 2026 10:28
…nc from upstream cards

Move Hugging Face datasets out of the Integrations > Hugging Face Hub subgroup
into their own top-level Datasets tab between Demos and API Reference. Each
dataset gets its own page, populated automatically from the
HF_DATASET_CARD.md files in lance-format/lance-huggingface so the upstream
repo remains the single source of truth.

- huggingface/overview.mdx -> integrations/ai/huggingface.mdx (the integration
  walkthrough now lives next to the other AI-platform integration pages)
- huggingface/datasets.mdx -> datasets/index.mdx (landing page + auto-generated
  card grid between HF_SYNC:START/HF_SYNC:END markers)
- 30 per-dataset MDX pages under docs/datasets/, organized into 9 categories
  in the sidebar
- scripts/hf_datasets.yaml: explicit mapping of upstream directory, URL slug,
  HF Hub repo, and display title for each dataset (the three names don't have
  a derivable relationship)
- scripts/sync_hf_datasets.py: fetches each upstream card, rewrites frontmatter
  for Mintlify, strips the H1, injects a "View on Hugging Face" card, and
  sanitizes known MDX hazards (orphan bibtex citations, literal "<>" in prose)
- Makefile: hf-sync target wires it up; pyproject adds pyyaml
- Redirects in docs.json keep old /huggingface/* URLs working
- README documents why the scripts exist and the maintainer workflow for
  adding a new dataset

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…e page

The slug ms-marco-v2.1 contains a period that Mintlify can mis-parse as a
file extension, which can silently drop the entire Datasets tab from the
top nav. Use ms-marco-v2 instead — the v2.1 detail is still in the page
title and the source repo name on Hugging Face.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
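A guard against this class of problem could look like the sketch below. It is an assumption-laden example rather than code from this PR: the commit only identifies periods as a hazard, and the stricter rule here (lowercase letters, digits, and single hyphens) is chosen for illustration. `check_slug` is a hypothetical name.

```python
import re

# Illustrative rule: lowercase alphanumerics separated by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_slug(slug: str) -> str:
    """Return the slug unchanged, or raise if it could confuse the docs build."""
    if not SLUG_RE.fullmatch(slug):
        raise ValueError(f"unsafe dataset slug: {slug!r}")
    return slug


check_slug("ms-marco-v2")        # passes
# check_slug("ms-marco-v2.1")    # would raise ValueError because of the period
```

Running a check like this over every slug before writing any files would surface the problem at sync time instead of as a missing tab in the rendered site.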

mintlify Bot commented May 12, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
lancedb-bcbb4faf 🟢 Ready View Preview May 12, 2026, 5:59 PM

@prrao87 prrao87 merged commit 2830b37 into main May 12, 2026
2 checks passed
@prrao87 prrao87 deleted the claude/great-chebyshev-111a9d branch May 12, 2026 18:14