feat: implement deep dependency tracking in scientific manifests #246

Open
google-labs-jules[bot] wants to merge 1 commit into develop from jules/scientific-snapshot-manifest-js1-b6c6fc17-fdd0-44d1-95bb-a5667e1b5e9b

Overview

This PR extends the manifest generation utility to record a snapshot of the scientific computing environment. Previously, manifests recorded only the versions of Python and eegprep, which was not enough to diagnose numerical drift caused by version differences in critical dependencies such as NumPy or SciPy.

Rationale

Scientific reproducibility depends heavily on the specific versions and sources of underlying libraries. Identical pipeline code can produce different results if one environment uses a stable PyPI release while another uses a custom Git-based build or a local development version.

By capturing "deep" environment metadata, we let researchers audit their analysis runs in detail, so that differences in results can be traced to specific environmental discrepancies rather than attributed to flaws in the pipeline logic.

Key Changes

  • Dynamic Dependency Discovery: Updated src/eegprep/cli/core.py to inspect sys.modules at runtime, so that only packages actually imported into the running process are recorded.
  • Provenance Tracking: Utilized importlib.metadata to extract not just version strings, but also the origin of the package (e.g., PyPI registry vs. Direct URL/Git).
  • Metadata Serialization: Integrated a new software_info block into the centralized manifest generation logic.
  • Robustness Improvements: Added logic to handle non-standard package versions and local paths to prevent serialization failures during JSON export.

Technical Decisions

  • importlib.metadata over pkg_resources: We chose the standard library's metadata module to avoid the import-time overhead and the deprecation of pkg_resources, which ships with setuptools.
  • Runtime Discovery: Instead of listing all installed packages (which can be noisy), we specifically target packages active in the current process to keep the manifest concise and relevant.
  • JSON Compatibility: All metadata is sanitized into a safe format for JSON storage to maintain compatibility with existing downstream validation schemas and tools.
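
The sanitization mentioned in the last point might look roughly like the sketch below. The helper name `to_json_safe` and its fallback rules are assumptions made for illustration; the PR does not show the actual conversion logic.

```python
import json
from pathlib import PurePath


def to_json_safe(value):
    """Recursively coerce a value into types the json module can serialize.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, dict):
        # JSON object keys must be strings.
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, PurePath):
        # Local paths are stored with forward slashes on every platform.
        return value.as_posix()
    # Fallback for version objects and anything else: use the string form.
    return str(value)
```

Passing the manifest through a function like this before `json.dumps` means an unexpected value type degrades to a string instead of raising `TypeError` during export.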

Success Criteria Verification

  • CLI executions now produce a software_info block in the output manifest.
  • Manifests distinguish between standard PyPI installs and local/Git-based builds.
  • No serialization failures encountered with complex or "editable" package installs.
  • Validated against existing JSON schemas.
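
One way the PyPI versus local/Git distinction could be derived is from the `direct_url.json` file that PEP 610 defines for direct-URL installs, read through `importlib.metadata`. The sketch below is an assumption about the approach, not the PR's code; `classify_origin`, `package_origin`, and the returned keys are hypothetical names. Note that a missing `direct_url.json` only suggests an install by name from an index; installers that do not write the file look the same.

```python
import json
from importlib import metadata


def classify_origin(direct_url_text):
    """Classify an install from the text of its direct_url.json, or None."""
    if not direct_url_text:
        # Assumption: no direct_url.json means installed by name from an index.
        return {"source": "index"}
    info = json.loads(direct_url_text)
    origin = {"url": info.get("url", "")}
    if "vcs_info" in info:
        # Installed from a version-control URL, e.g. a Git repository.
        origin["source"] = "vcs"
        origin["commit"] = info["vcs_info"].get("commit_id")
    elif "dir_info" in info:
        # Installed from a local directory, possibly in editable mode.
        origin["source"] = "local"
        origin["editable"] = bool(info["dir_info"].get("editable", False))
    else:
        # Remaining case: a direct link to an archive (sdist or wheel).
        origin["source"] = "archive"
    return origin


def package_origin(dist_name):
    """Look up the origin of an installed distribution by name."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return {"source": "unknown"}
    # read_text returns None when the file is absent from the metadata directory.
    return classify_origin(dist.read_text("direct_url.json"))
```

Keeping the classification separate from the metadata lookup makes it possible to test the parsing against literal JSON strings without installing packages in different ways.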

This commit modifies software_info to capture all active packages from sys.modules and to record their versions and source origins using importlib.metadata.