add webshart format exporter #67

Merged
bghira merged 3 commits into main from feature/webshart-exporter on Apr 27, 2026
bghira (Owner) commented Apr 27, 2026

This pull request adds support for exporting captions to the "webshart" format, which writes caption data directly into existing webshart shard metadata JSON files. It also updates the CLI, exporter, and processor logic to handle this new format, and includes tests to ensure correct integration. Additionally, there are improvements to metadata handling in the webdataset processor and enhanced test coverage.

Webshart format support:

  • Added "webshart" as a valid export format to the CLI (src/caption_flow/cli.py), documentation (README.md), and the exporter (src/caption_flow/storage/exporter.py).
  • Implemented the to_webshart_metadata method in StorageExporter to write captions into webshart metadata JSON files using the webshart API.
  • Updated the export logic to handle the "webshart" format, including output file path handling and error checking for webshart API presence and version.
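As a rough illustration of what "writing captions into a shard metadata JSON file" can look like, here is a minimal sketch that uses only the standard library. It does not use the webshart API that the PR delegates to; the `{"files": [{"path": ...}]}` layout, the `captions` field name, and the function names are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def write_captions_to_metadata(metadata_path, captions_by_file, field="captions"):
    """Merge captions into a shard metadata JSON file.

    Assumes a hypothetical layout: {"files": [{"path": "a.jpg", ...}, ...]}.
    Returns the number of file entries that received captions.
    """
    path = Path(metadata_path)
    metadata = json.loads(path.read_text(encoding="utf-8"))
    updated = 0
    for entry in metadata.get("files", []):
        captions = captions_by_file.get(entry.get("path"))
        if captions is None:
            continue  # no captions for this file; leave the entry untouched
        entry[field] = list(captions)
        updated += 1
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return updated


def demo():
    """Write a tiny metadata file, add captions to it, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        shard_json = Path(tmp) / "shard-0000.json"
        shard_json.write_text(
            json.dumps({"files": [{"path": "a.jpg"}, {"path": "b.jpg"}]}),
            encoding="utf-8",
        )
        count = write_captions_to_metadata(shard_json, {"a.jpg": ["a cat"]})
        return count, json.loads(shard_json.read_text(encoding="utf-8"))
```

Entries without a matching caption are left as they were, so rerunning the export with a partial caption set would not erase existing metadata in this sketch.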

Webdataset processor improvements:

  • Improved how the processor retrieves the number of samples from shard info, falling back to "num_files" if "num_samples" is missing.
  • Enhanced metadata extraction in process_unit to include additional fields from the entry's metadata, such as json_path and other custom metadata, while avoiding duplication of standard fields.
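The two processor changes above can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the helper names, the default of 0 when neither count is present, and the example field names other than `num_samples`, `num_files`, and `json_path` are hypothetical.

```python
def shard_sample_count(shard_info):
    """Return "num_samples" if present, else "num_files", else 0 (assumed default)."""
    count = shard_info.get("num_samples")
    if count is None:
        count = shard_info.get("num_files", 0)
    return count


def merge_entry_metadata(standard_fields, entry_metadata):
    """Add extra entry metadata without overwriting the standard fields."""
    merged = dict(standard_fields)
    for key, value in entry_metadata.items():
        if key not in merged:  # standard fields win; extras are only appended
            merged[key] = value
    return merged


# Hypothetical usage: "index" and "source" are made-up field names.
item = merge_entry_metadata(
    {"index": 3},
    {"index": 99, "json_path": "shard-0000.json", "source": "custom"},
)
```

Letting the standard fields take precedence is one way to read "avoiding duplication of standard fields"; the PR may resolve conflicts differently.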

Testing and validation:

  • Added a unit test to verify that webshart export delegates to the webshart API and correctly writes caption data.
  • Updated webdataset processor tests to mock and validate the presence and extraction of metadata fields, including new webshart-specific fields.
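A delegation test of the kind described above can be sketched with `unittest.mock`. The `export_captions` function, the `writer.write` method, and the row layout are hypothetical stand-ins; they are not the exporter or the webshart API from the PR.

```python
from unittest import mock


def export_captions(rows, writer):
    """Hypothetical exporter: group rows by shard, then delegate the write."""
    by_shard = {}
    for row in rows:
        by_shard.setdefault(row["shard"], {})[row["file"]] = row["captions"]
    for shard, captions in by_shard.items():
        writer.write(shard, captions)
    return len(by_shard)


def test_export_delegates_to_writer():
    writer = mock.Mock()  # stands in for the external API object
    rows = [
        {"shard": "s0", "file": "a.jpg", "captions": ["a cat"]},
        {"shard": "s0", "file": "b.jpg", "captions": ["a dog"]},
    ]
    assert export_captions(rows, writer) == 1
    # The exporter should hand the grouped captions to the writer exactly once.
    writer.write.assert_called_once_with(
        "s0", {"a.jpg": ["a cat"], "b.jpg": ["a dog"]}
    )
```

Mocking the writer keeps the test independent of the real library, so it checks only that the exporter passes the expected data along.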

Together, these changes allow caption data to be exported into webshart metadata files, improve metadata handling in the webdataset processor, and add test coverage for the new and updated code paths.

This comment was marked as resolved.

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 76.00000% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/caption_flow/storage/exporter.py | 72.58% | 17 Missing ⚠️ |
| src/caption_flow/processors/webdataset.py | 92.30% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/caption_flow/cli.py | 46.75% <ø> (ø) |
| src/caption_flow/processors/webdataset.py | 88.53% <92.30%> (-0.03%) ⬇️ |
| src/caption_flow/storage/exporter.py | 64.90% <72.58%> (+1.50%) ⬆️ |
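The per-file patch percentages and the 76% headline figure can be cross-checked with a little arithmetic. The changed-line counts below (62 and 13) are not stated in the report; they are inferred from the percentages and the missing-line counts, and the truncation to two decimals is likewise an assumption that happens to match the reported figures.

```python
def percent_truncated(covered, total):
    """Coverage percentage truncated (not rounded) to two decimals."""
    return (covered * 10000 // total) / 100


# (changed lines, missing lines); the changed-line counts are inferred.
patch = {
    "src/caption_flow/storage/exporter.py": (62, 17),
    "src/caption_flow/processors/webdataset.py": (13, 1),
}

per_file = {
    name: percent_truncated(changed - missing, changed)
    for name, (changed, missing) in patch.items()
}

total_changed = sum(changed for changed, _ in patch.values())
total_missing = sum(missing for _, missing in patch.values())
overall = percent_truncated(total_changed - total_missing, total_changed)
```

With these inferred counts, 57 of 75 changed lines are covered, which agrees with the 18 missing lines and 76% patch coverage reported.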

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


Comment thread src/caption_flow/processors/webdataset.py Outdated
Comment thread src/caption_flow/storage/exporter.py Outdated
Comment thread src/caption_flow/storage/exporter.py Outdated
@bghira bghira merged commit c72d964 into main Apr 27, 2026
4 checks passed
@bghira bghira deleted the feature/webshart-exporter branch April 27, 2026 15:53
2 participants