Promote Lance datasets to a top-level Datasets tab with auto-sync #237
Merged
…nc from upstream cards

Move Hugging Face datasets out of the Integrations > Hugging Face Hub subgroup into their own top-level Datasets tab between Demos and API Reference. Each dataset gets its own page, populated automatically from the HF_DATASET_CARD.md files in lance-format/lance-huggingface so the upstream repo remains the single source of truth.

- huggingface/overview.mdx -> integrations/ai/huggingface.mdx (the integration walkthrough now lives next to the other AI-platform integration pages)
- huggingface/datasets.mdx -> datasets/index.mdx (landing page + auto-generated card grid between HF_SYNC:START/HF_SYNC:END markers)
- 30 per-dataset MDX pages under docs/datasets/, organized into 9 categories in the sidebar
- scripts/hf_datasets.yaml: explicit mapping of upstream directory, URL slug, HF Hub repo, and display title for each dataset (the three names don't have a derivable relationship)
- scripts/sync_hf_datasets.py: fetches each upstream card, rewrites frontmatter for Mintlify, strips the H1, injects a "View on Hugging Face" card, and sanitizes known MDX hazards (orphan bibtex citations, literal "<>" in prose)
- Makefile: hf-sync target wires it up; pyproject adds pyyaml
- Redirects in docs.json keep old /huggingface/* URLs working
- README documents why the scripts exist and the maintainer workflow for adding a new dataset

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…e page

The slug ms-marco-v2.1 contains a period that Mintlify can mis-parse as a file extension, which can silently drop the entire Datasets tab from the top nav. Use ms-marco-v2 instead; the v2.1 detail is still in the page title and the source repo name on Hugging Face.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
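
A guard along these lines could flag such slugs before they reach `docs.json`. This is a minimal sketch and not part of the PR: the name `find_risky_slugs` and the allowed-character rule (lowercase letters, digits, hyphens) are assumptions made for illustration. Only the period hazard itself comes from the commit above.

```python
import re

# Assumed rule for illustration: lowercase letters and digits, hyphen-separated.
# The commit above only establishes that a period in a slug is hazardous.
SAFE_SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def find_risky_slugs(slugs):
    """Return the slugs that do not fully match the assumed safe pattern."""
    return [s for s in slugs if not SAFE_SLUG.fullmatch(s)]


if __name__ == "__main__":
    # Hypothetical input; in practice the slugs would come from the yaml mapping.
    print(find_risky_slugs(["ms-marco-v2.1", "ms-marco-v2", "imagenet-1k-val"]))
```

Running such a check inside the sync step would turn a silently missing tab into an explicit failure, at the cost of being stricter than Mintlify itself may require.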
Summary
- Moves Hugging Face datasets out of `Integrations > Hugging Face Hub` into a top-level Datasets tab between Demos and API Reference. Each of the ~30 datasets in `lance-format/lance-huggingface` gets its own page, organized into 9 categories in the sidebar (Image Classification, Object Detection & Segmentation, Image Retrieval, VQA, Text QA, Text Corpora, Speech, Video, Robotics).
- Dataset pages are populated automatically from the upstream `HF_DATASET_CARD.md` files. The GitHub repo stays the single source of truth: the same files are pushed to HF Hub as each dataset's `README.md`, so we don't maintain content in two places.
- The integration walkthrough now lives at `integrations/ai/huggingface.mdx`, alongside the other AI-platform integration pages. Old `/huggingface/...` URLs redirect to the new locations.

How the sync works
`scripts/sync_hf_datasets.py` (run via `make hf-sync`) reads `scripts/hf_datasets.yaml`, fetches each upstream card, and updates three things in one pass:

- `docs/datasets/<slug>.mdx`: one MDX page per dataset, with Mintlify frontmatter (`title`, `sidebarTitle`, `description` derived from the first prose paragraph), a "View on Hugging Face" link card injected at the top, the upstream H1 stripped, and the rest of the body passed through.
- `docs/datasets/index.mdx`: the landing-page card grid is regenerated between `HF_SYNC:START`/`HF_SYNC:END` MDX comment markers. The hand-written intro and the "Share your own dataset" footer stay untouched.
- `docs/docs.json`: the `Datasets` tab's `groups` array is rebuilt to match the yaml, so the sidebar stays in sync without manual edits.

The yaml encodes four fields per dataset (`dir`, `slug`, `hf`, `title`) because the GitHub directory names, the HF Hub repo slugs, and the desired URL slugs don't follow a derivable convention (e.g. `imagenet1k_val` ↔ `imagenet-1k-val-lance` ↔ `/datasets/imagenet-1k-val`). Verbose but explicit; adding a new dataset is still one line.

MDX sanitization
Two upstream cards triggered MDX parse errors during the initial sync, so the script's sanitizer now handles them defensively:
- `openvid_hf/HF_DATASET_CARD.md` has a bare `@article{...}` citation outside a code fence. MDX treats `{...}` as a JSX expression and rejects it. The script auto-wraps these in a fenced `bibtex` code block.
- `laion-1M/HF_DATASET_CARD.md` contains the string `Lance<>HF` in prose. MDX parses `<>` as an empty React fragment. The script escapes literal `<>` outside code regions so it renders as plain text.

Both could also be fixed upstream; the sanitizer is belt-and-braces for future cards.
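
The two rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in `scripts/sync_hf_datasets.py`: the function names, the regexes, the brace-counting heuristic for finding the end of a citation, and the choice of `&lt;&gt;` as the escaped form are all hypothetical.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit in a fenced block
ENTRY_START = re.compile(r"^@\w+\{")      # assumed shape of a bare bibtex entry
INLINE_CODE = re.compile(r"(`[^`\n]*`)")  # single-backtick inline code spans


def escape_fragments(line):
    """Escape literal <> on one prose line, leaving inline code spans alone."""
    parts = INLINE_CODE.split(line)
    # With one capture group, odd indices hold the inline code spans.
    return "".join(
        p if i % 2 else p.replace("<>", "&lt;&gt;") for i, p in enumerate(parts)
    )


def sanitize(markdown):
    """Wrap bare bibtex entries in a fence and escape <> outside code regions."""
    out = []
    in_fence = False
    depth = 0  # > 0 while inside a bare bibtex entry that is being wrapped
    for line in markdown.split("\n"):
        if depth:
            out.append(line)
            depth += line.count("{") - line.count("}")
            if depth <= 0:
                depth = 0
                out.append(FENCE)
            continue
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            out.append(line)  # existing code blocks pass through untouched
        elif ENTRY_START.match(line):
            out.append(FENCE + "bibtex")
            out.append(line)
            depth = line.count("{") - line.count("}")
            if depth <= 0:
                depth = 0
                out.append(FENCE)
        else:
            out.append(escape_fragments(line))
    if depth:
        out.append(FENCE)  # close an entry whose braces never balanced
    return "\n".join(out)
```

Because already-fenced citations and already-escaped text are left alone, a second pass over sanitized output changes nothing, which is the property the idempotency check in the test plan depends on.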
Adding a new dataset
1. Add the `HF_DATASET_CARD.md` upstream in `lance-format/lance-huggingface` and push it to the Hub as usual.
2. Add an entry for the dataset to `scripts/hf_datasets.yaml`.
3. Run `make hf-sync` and commit the regenerated MDX, `docs.json`, and the yaml entry.

See the new "Sync Hugging Face dataset pages" section in `README.md` for the full maintainer workflow.

Test plan
- `cd docs && mint dev`: verify the Datasets tab renders with the 9 categories in the sidebar
- Spot-check dataset pages (`/datasets/flickr30k`, `/datasets/laion-1m`, `/datasets/openvid`, `/datasets/lerobot-pusht`): confirm titles, descriptions, "View on Hugging Face" link, and body content render correctly
- Visit `/huggingface/overview` and `/huggingface/datasets`: confirm both redirect to the new locations
- `cd docs && mint broken-links` reports no broken links
- `make hf-sync`: confirm it's idempotent (no diff on a clean run)

🤖 Generated with Claude Code
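
The idempotency item in the test plan depends on the marker-delimited region of `docs/datasets/index.mdx` being rewritten the same way on every run. A minimal sketch of that kind of splice is shown here. The exact marker text (`{/* HF_SYNC:START */}`) and the function name `splice_generated` are assumptions; the PR only states that MDX comment markers named `HF_SYNC:START` and `HF_SYNC:END` are used.

```python
# Assumed marker spelling; the PR only names HF_SYNC:START / HF_SYNC:END.
START = "{/* HF_SYNC:START */}"
END = "{/* HF_SYNC:END */}"


def splice_generated(page, generated):
    """Replace the text between the markers, keeping everything outside them."""
    head, found_start, rest = page.partition(START)
    _, found_end, tail = rest.partition(END)
    if not found_start or not found_end:
        raise ValueError("sync markers not found or out of order")
    # Normalizing whitespace around the generated block keeps reruns byte-identical.
    return f"{head}{START}\n{generated.strip()}\n{END}{tail}"


if __name__ == "__main__":
    page = f"Hand-written intro\n{START}\nstale cards\n{END}\nShare your own dataset\n"
    once = splice_generated(page, "<Card title='Example' />")
    twice = splice_generated(once, "<Card title='Example' />")
    print(once == twice)
```

Text before the start marker and after the end marker is copied through unchanged, so the hand-written intro and footer survive each regeneration.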