update vllm version compatibility for disabling mm preprocessor args, add image support for HF datasets #66

Merged: bghira merged 1 commit into main from chore/vllm-compat on Apr 26, 2026
bghira (Owner) commented Apr 26, 2026

This pull request improves how Hugging Face datasets, especially those made of raw image files, are handled in the data processing pipeline, and makes the vLLM worker compatible with more vLLM versions. The pipeline now identifies and processes image shards, tracks every shard that a chunk touches, and passes only the initialization parameters that the installed vLLM engine supports.

Hugging Face dataset processing improvements:

  • Added detection and handling for raw image files in Hugging Face datasets, including new logic that identifies image file extensions. The pipeline treats each image file as a single-item shard and loads and processes it accordingly.
  • Updated work unit and shard assignment logic to track all shards touched by a chunk or range, not just a single shard. This keeps work distribution and metadata accurate for both image and parquet datasets.
  • Improved result handling so that an item is marked as processed even when the result metadata contains only a single item index.
  • Fixed job ID and processed-index tracking to use global indices, so bookkeeping and downstream processing stay correct.
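
As a rough illustration of the first two points, the sketch below shows one way to detect image files by extension and to compute every shard that a range of global indices touches. It is not the code from this PR: the extension set, the function names, and the offsets-based lookup are all assumptions made for the example.

```python
from bisect import bisect_right
from pathlib import PurePosixPath

# Hypothetical extension set; the list used by the PR is not shown here.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".gif"}


def is_image_shard(filename: str) -> bool:
    """Return True when a dataset file looks like a raw image (case-insensitive)."""
    return PurePosixPath(filename).suffix.lower() in IMAGE_EXTENSIONS


def shard_offsets(shard_sizes: list[int]) -> list[int]:
    """Global start index of each shard, given the item count of each shard."""
    offsets, total = [], 0
    for size in shard_sizes:
        offsets.append(total)
        total += size
    return offsets


def shards_touched(offsets: list[int], start: int, count: int) -> list[int]:
    """Indices of every shard overlapped by the global items [start, start + count)."""
    if count <= 0:
        return []
    # bisect_right finds the last shard whose start offset is <= the index.
    first = bisect_right(offsets, start) - 1
    last = bisect_right(offsets, start + count - 1) - 1
    return list(range(first, last + 1))


# Usage: one parquet shard (assumed to hold 100 rows) followed by two image files,
# each of which counts as a single-item shard.
files = ["data/train.parquet", "images/cat.JPG", "images/dog.png"]
sizes = [1 if is_image_shard(name) else 100 for name in files]
offsets = shard_offsets(sizes)
touched = shards_touched(offsets, start=95, count=7)
```

Because a chunk is described by global indices, a single range can span the tail of a parquet shard and several image shards at once, which is why the lookup returns a list rather than one shard.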

vLLM worker compatibility and configuration:

  • Added new optional vLLM engine parameters (mm_processor_cache_gb, mm_processor_cache_type, mm_shm_cache_max_object_size_mb) to the configuration; each is included only if present.
  • Implemented a filtering method in the caption worker that passes only supported initialization parameters to vLLM's engine, and maps deprecated or version-specific parameters to their supported equivalents.
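
The filtering idea can be sketched roughly as follows. This is a minimal illustration, not the worker's actual method: `filter_engine_kwargs` and `fake_engine` are hypothetical names, the stand-in function takes the place of the real vLLM engine, and the mapping from `disable_mm_preprocessor_cache` to `mm_processor_cache_gb=0` is an assumption about what such a compatibility mapping might look like.

```python
import inspect
from typing import Any, Callable


def filter_engine_kwargs(factory: Callable[..., Any], kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only the keyword arguments that `factory` declares in its signature.

    Note: a factory that accepts **kwargs exposes only that catch-all name here,
    so a real implementation would need another source of supported names.
    """
    accepted = set(inspect.signature(factory).parameters)
    filtered: dict[str, Any] = {}
    for name, value in kwargs.items():
        if value is None:
            continue  # optional settings are forwarded only when present
        if name in accepted:
            filtered[name] = value
        elif name == "disable_mm_preprocessor_cache" and "mm_processor_cache_gb" in accepted:
            # Assumed mapping: express "disable the cache" as a zero-sized cache.
            # An explicitly configured cache size takes precedence.
            if value:
                filtered.setdefault("mm_processor_cache_gb", 0)
        # Any other unsupported name is dropped.
    return filtered


def fake_engine(model: str, mm_processor_cache_gb: float = 4.0,
                mm_processor_cache_type: str = "lru") -> dict[str, Any]:
    """Stand-in for an engine constructor; it is not the vLLM API."""
    return {"model": model, "cache_gb": mm_processor_cache_gb,
            "cache_type": mm_processor_cache_type}


config = {
    "model": "example/model",
    "disable_mm_preprocessor_cache": True,   # not accepted by fake_engine
    "mm_shm_cache_max_object_size_mb": 128,  # not accepted by fake_engine
    "mm_processor_cache_type": None,         # absent, so not forwarded
}
engine = fake_engine(**filter_engine_kwargs(fake_engine, config))
```

Inspecting the callable at runtime avoids hard-coding a version table, at the cost of relying on the constructor declaring its parameters explicitly.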

Documentation updates:

  • Updated installation instructions in README.md to recommend the [vllm] extra, so users get the correct dependencies for vLLM support.

bghira merged commit 35590cf into main on Apr 26, 2026 (4 checks passed)
bghira deleted the chore/vllm-compat branch on April 26, 2026 23:19
codecov (bot) commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 82.95455% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/caption_flow/processors/huggingface.py | 83.33% | 10 Missing ⚠️ |
| src/caption_flow/workers/caption.py | 78.26% | 5 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/caption_flow/utils/vllm_config.py | 93.33% <100.00%> (+0.37%) ⬆️ |
| src/caption_flow/workers/caption.py | 62.57% <78.26%> (+0.65%) ⬆️ |
| src/caption_flow/processors/huggingface.py | 71.48% <83.33%> (+1.13%) ⬆️ |