pyProteoFlow

1. What is pyProteoFlow

pyProteoFlow is a modular and extensible Python toolkit for LC–MS/MS proteomics workflows. It covers the entire processing chain from raw data parsing and assay quantification to filtering, profiling, modeling, and visualization.

Originally designed for untargeted and semi-targeted LC–MS proteomics, the toolkit also includes a powerful assay quantifier/analyzer for multi-plate experiments — initially developed for protein quantification assays, but easily adaptable to any quantitative plate-based assay (e.g. enzymatic, metabolite, or fluorescence assays).

The workflow was originally implemented for fully automated bioreactor cultivation experiments, but is equally applicable to shake flasks and manual bioreactor setups. It integrates sample metadata, dilution information, LC retention times, and raw fragment data to enable automated quality control and quantification across multiple biological levels.

Experimental setup

The pyProteoFlow workflow is typically carried out in two experimental stages:

  1. Dilution series (calibration run): A simple dilution series of the protein matrix is analyzed to establish calibration curves, assess their goodness of fit (R²), and determine the linear dynamic range. This calibration defines the quantitative range for all downstream analysis.

  2. Actual experiment (quantitative run): The main LC–MS/MS measurement of biological samples (e.g., bioreactor time course, perturbation experiment, etc.). Quantification and filtering are performed using the calibration parameters from the dilution series.

If sufficient material is available, the dilution series can be prepared using remaining lysate or residual sample material from the main experiment — ensuring matrix similarity between calibration and quantification runs.
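
As an illustration of how a linear dynamic range could be derived from such a dilution series, the sketch below searches for the longest run of consecutive dilution points whose straight-line fit reaches a minimum R². This is not pyProteoFlow's actual algorithm: the function names, the R² threshold of 0.99, and the minimum of four points are assumptions chosen for the example, and the points are assumed to be sorted by concentration.

```python
from typing import Optional, Sequence, Tuple


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R² of an ordinary least-squares line through (x, y).

    For a straight-line fit this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate window: no spread in x or y
    return (sxy * sxy) / (sxx * syy)


def longest_linear_window(
    x: Sequence[float],
    y: Sequence[float],
    min_r2: float = 0.99,   # hypothetical acceptance threshold
    min_points: int = 4,    # hypothetical minimum window size
) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) indices of the longest contiguous
    window with R² >= min_r2, or None if no window qualifies.
    Ties in length are broken by the higher R²."""
    best = None
    best_key = (0, 0.0)
    n = len(x)
    for start in range(n):
        for end in range(start + min_points - 1, n):
            r2 = r_squared(x[start:end + 1], y[start:end + 1])
            if r2 >= min_r2 and (end - start + 1, r2) > best_key:
                best, best_key = (start, end), (end - start + 1, r2)
    return best


if __name__ == "__main__":
    conc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 5.2, 5.3]  # saturates at the top
    print(longest_linear_window(conc, signal))
```

The exhaustive window search is quadratic in the number of dilution steps, which is negligible for a typical dilution series of a dozen points or fewer.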


Key features include:

  • 🔬 LC–MS/MS workflow for untargeted proteomics
  • ⚗️ Assay quantifier/analyzer for plate-based assays
    • Originally developed for protein quantification
    • Estimation of key quality metrics (linearity, reproducibility, dynamic range)
    • Modeling of linear and non-linear calibration curves (5PL, 4PL, linear)
    • Automatic detection of the linear dynamic range
    • Identification of shared linear regions between different plates
    • Requires a dilution series of the sample matrix to determine the linear range and fit calibration models (assessed by R²)
    • Outlier detection at multiple experimental levels:
      • Technical replicates
      • Analytical replicates
      • Biological replicates
  • 📊 Export of all key results (regressions, QC summaries, filtered data, etc.) as Excel reports
  • 🧪 Bradford assay parser — reads both text and Excel formats for absorbance-based quantification
  • 🧬 Sciex file parser — imports and standardizes SCIEX Analyst outputs (.txt, .xls, .xlsx, .slx)
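
The 4PL and 5PL calibration models named in the feature list are commonly written as four- and five-parameter logistic functions. The sketch below uses the standard textbook parameterization; the function and parameter names are illustrative and may differ from what pyProteoFlow uses internally.

```python
def four_pl(x: float, a: float, b: float, c: float, d: float) -> float:
    """Four-parameter logistic curve.

    a: response at zero concentration
    b: slope factor (Hill slope)
    c: inflection point (concentration at the half-way response)
    d: response at infinite concentration
    """
    if x < 0 or c <= 0:
        raise ValueError("x must be >= 0 and c must be > 0")
    return d + (a - d) / (1.0 + (x / c) ** b)


def five_pl(x: float, a: float, b: float, c: float, d: float, g: float) -> float:
    """Five-parameter logistic curve; g adds asymmetry (g = 1 gives the 4PL)."""
    if x < 0 or c <= 0:
        raise ValueError("x must be >= 0 and c must be > 0")
    return d + (a - d) / (1.0 + (x / c) ** b) ** g


def inverse_four_pl(y: float, a: float, b: float, c: float, d: float) -> float:
    """Back-calculate the concentration that produces response y on a 4PL curve."""
    if y == d:
        raise ValueError("response equals the upper asymptote; concentration is unbounded")
    ratio = (a - d) / (y - d) - 1.0
    if ratio < 0:
        raise ValueError("response lies outside the range spanned by the curve")
    return c * ratio ** (1.0 / b)
```

Inverting the fitted curve, as in `inverse_four_pl`, is how an unknown sample's signal is usually converted back into a concentration once the parameters have been estimated from the standards.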

2. Required Input Files

To run a complete pyProteoFlow analysis, several input files are needed. These provide structured metadata, calibration information, and raw LC–MS data for automated filtering and quantification.

| File Type | Description | Format | Required For |
|---|---|---|---|
| 1. Metadata / Sample List | Contains experimental metadata (sample IDs, conditions, biological and technical replicates). A preformatted Excel template is provided in the repository. | .xlsx | Required for all modules |
| 2. Dilution Series of the Protein Matrix | Defines the serial dilution steps used to build calibration curves. This is critical for determining the R² fit and linear range of the assay. | .xlsx | Required for assay quantification |
| 3. LC Retention Time Data | Contains measured LC retention times for each peptide or fragment (for both the dilution series and the actual experiment), used for retention time–based outlier detection. | .xlsx | Required for filtering and QC |
| 4. Raw Data with Fragments | The main data table from the LC–MS system, containing fragment intensities, retention times, and peptide/protein identifiers. | .xlsx, .xls, .txt, .slx | Required for parsing, quantification, and profiling |

All files can be automatically loaded and validated via the parsing and assay quantification modules. Retention time information and fragment-level data are especially important for fragment-specific QC and replicate validation steps.


3. Module Overview

🔹 parsing

Tools for reading, reshaping, and interpreting raw proteomics files.

  • Handles SCIEX Analyst and PeakView exports (.txt, .xls, .xlsx, .slx)
  • Extracts protein / peptide / fragment information from complex peak names (e.g., sp|P63286.1|CLPB_ECOL6.QLPDKAIDLIDEAASSIR.+3y8+1)
  • Provides DataFrame utilities:
    • Wide-to-long transformations
    • Mapping of replicate → group labels
  • Includes specialized parsers for:
    • Bradford assay results
    • Quantitative tables from LC–MS instrument software
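
To make the peak-name extraction concrete, the sketch below splits the example name from the list above into its parts with a regular expression. It is not the parser shipped with pyProteoFlow. The reading of the last field (`+3y8+1` as precursor charge 3, y8 fragment ion, fragment charge 1) is an assumption, and the `PeakName` class and `parse_peak_name` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: db|accession|entry_name.PEPTIDE.+<z_prec><ion><n>+<z_frag>
# The accession may itself contain a dot (e.g. "P63286.1"), so a plain
# split on "." would not work.
_PEAK_NAME = re.compile(
    r"^(?P<db>\w+)\|(?P<accession>[^|]+)\|(?P<entry_name>[^.]+)"
    r"\.(?P<peptide>[A-Z]+)"
    r"\.\+(?P<precursor_charge>\d+)(?P<ion_type>[a-z])(?P<ion_number>\d+)"
    r"\+(?P<fragment_charge>\d+)$"
)


class PeakName(NamedTuple):
    accession: str
    entry_name: str
    peptide: str
    precursor_charge: int
    ion_type: str
    ion_number: int
    fragment_charge: int


def parse_peak_name(name: str) -> Optional[PeakName]:
    """Return the parsed fields, or None if the name does not match the assumed layout."""
    match = _PEAK_NAME.match(name.strip())
    if match is None:
        return None
    return PeakName(
        accession=match["accession"],
        entry_name=match["entry_name"],
        peptide=match["peptide"],
        precursor_charge=int(match["precursor_charge"]),
        ion_type=match["ion_type"],
        ion_number=int(match["ion_number"]),
        fragment_charge=int(match["fragment_charge"]),
    )


if __name__ == "__main__":
    print(parse_peak_name("sp|P63286.1|CLPB_ECOL6.QLPDKAIDLIDEAASSIR.+3y8+1"))
```

Peptides carrying modification annotations (for example bracketed mass shifts) would not match this pattern and return `None`; a real parser would need to handle those cases.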

🔹 proteo_filter

Applies quality filters at multiple analysis stages.

Available filters include:

  • NaN ratio filter → remove features with too many missing values
  • R² filter → exclude poorly fitting calibration curves
  • Signal threshold filters → remove low-intensity or noisy features
  • Outlier filters → identify and exclude abnormal measurements
  • Replicate consistency filters → enforce reproducibility across replicates
  • Peptide/fragment abundance filter → count how many fragments/peptides occur per protein

Each filter logs and reports its impact, enabling reproducibility and transparent QC.
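
As a minimal illustration of the first filter in the list, the sketch below drops features whose share of missing values exceeds a cutoff and reports which features were removed. The data layout (a dict mapping feature name to a list of measurements), the function names, and the default cutoff of 0.5 are assumptions for this example, not pyProteoFlow's API.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Value = Optional[float]


def nan_ratio(values: Sequence[Value]) -> float:
    """Fraction of entries that are None or NaN (1.0 for an empty series)."""
    if not values:
        return 1.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def filter_by_nan_ratio(
    table: Dict[str, Sequence[Value]],
    max_ratio: float = 0.5,  # hypothetical default cutoff
) -> Tuple[Dict[str, Sequence[Value]], List[str]]:
    """Keep features whose missing-value ratio is <= max_ratio.

    Returns the filtered table and the list of removed feature names,
    so the impact of the filter step can be logged.
    """
    kept: Dict[str, Sequence[Value]] = {}
    removed: List[str] = []
    for feature, values in table.items():
        if nan_ratio(values) <= max_ratio:
            kept[feature] = values
        else:
            removed.append(feature)
    return kept, removed
```

Returning the removed feature names alongside the filtered table mirrors the idea that each filter should report its impact.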


🔹 proteo_assay_quantifier

A comprehensive module for assay quantification and regression modeling.

  • Originally built for protein quantification in LC–MS proteomics
  • Fully extensible to other plate-based assays (enzymatic, metabolite, fluorescence, etc.)
  • Linear and non-linear regression fitting (1st- to 5th-order polynomials, 4PL, 5PL)
  • Automatic linear region detection
  • Computation of calibration metrics:
    • R², slope, intercept, standard error
    • LOD, LOQ (limits of detection and quantification)
  • Detection of shared linear ranges across multiple plates
  • Integration of technical, analytical, and biological replicates
  • Excel export of full quantification summaries
  • Visualization of:
    • Fitted calibration curves with regression equations
    • Residual and RSD plots
    • Outlier diagnostics and reproducibility summaries
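
The calibration metrics listed above can be illustrated with an ordinary least-squares fit. The sketch below computes slope, intercept, and R², and derives LOD and LOQ with the widely used 3.3·σ/slope and 10·σ/slope convention, where σ is the residual standard deviation of the fit. Whether pyProteoFlow uses this convention is an assumption, and `linear_calibration` is a hypothetical name.

```python
import math
from typing import Dict, Sequence


def linear_calibration(conc: Sequence[float], signal: Sequence[float]) -> Dict[str, float]:
    """Fit signal = slope * conc + intercept and report calibration metrics."""
    n = len(conc)
    if n != len(signal) or n < 3:
        raise ValueError("need at least three paired calibration points")
    mean_x = sum(conc) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    syy = sum((y - mean_y) ** 2 for y in signal)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, signal))
    if sxx == 0 or syy == 0:
        raise ValueError("points must vary in both concentration and signal")
    slope = sxy / sxx
    if slope == 0:
        raise ValueError("zero slope: signal does not depend on concentration")
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(conc, signal))
    residual_sd = math.sqrt(ss_res / (n - 2))  # n - 2 degrees of freedom
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1.0 - ss_res / syy,
        "residual_sd": residual_sd,
        # Assumed convention: LOD = 3.3 * sigma / slope, LOQ = 10 * sigma / slope
        "lod": 3.3 * residual_sd / abs(slope),
        "loq": 10.0 * residual_sd / abs(slope),
    }


if __name__ == "__main__":
    print(linear_calibration([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 4.0]))
```

LOD and LOQ come out in concentration units because the residual standard deviation (signal units) is divided by the slope (signal per concentration).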

🔹 proteo_profiler

Annotation and pathway profiling tools.

  • Interfaces with NCBI and KEGG databases for automatic annotation
  • Assigns proteins to pathways, subsystems, and functional categories
  • Supports Escher-based pathway visualization → detected proteins are highlighted on metabolic pathway maps
  • Generates:
    • Volcano plots (ANOVA / fold-change-based)
    • Pathway coverage and subsystem summaries
    • KEGG enrichment and function-based groupings
  • Enables biological interpretation of quantitative proteomics results

🔹 stats

Mathematical and statistical foundation of pyProteoFlow.

Includes:

  • Outlier detection algorithms
    • Tukey IQR, Thompson Tau, Rosner’s test, Modified Z-score
  • Descriptive and comparative statistics
    • Mean, median, SD, CV, RSD, t-tests, ANOVA, fold-change
  • Regression utilities
    • Linear regression, R² computation, slope comparison
  • Error propagation and uncertainty estimation
  • Robust summarization methods for replicate data
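
Of the outlier tests listed above, the modified Z-score is the simplest to show in a few lines. The sketch below uses the usual definition, 0.6745 · (x − median) / MAD, with the conventional cutoff of 3.5. The function names and the handling of the zero-MAD case are assumptions made for this example, not taken from pyProteoFlow.

```python
import math
from statistics import median
from typing import List, Sequence


def modified_z_scores(values: Sequence[float]) -> List[float]:
    """Modified Z-score of each value, based on the median and the
    median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical: values equal to the
        # median score 0, anything else is treated as infinitely far away.
        return [0.0 if v == med else math.copysign(math.inf, v - med) for v in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Indices of values whose absolute modified Z-score exceeds the threshold."""
    return [i for i, z in enumerate(modified_z_scores(values)) if abs(z) > threshold]


if __name__ == "__main__":
    replicates = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0]
    print(flag_outliers(replicates))
```

Because it relies on the median rather than the mean, this score is far less distorted by the outlier it is trying to detect, which makes it suitable for small replicate groups.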

🔹 utils

General-purpose helper functions used across modules.

Features:

  • File and path management
  • Safe directory creation and file export helpers
  • DataFrame and dictionary flattening
  • Data intersection and grouping utilities
  • Sampling and randomization tools
  • Logging, configuration, and runtime utilities

🔹 viz

Visualization and reporting module — a key part of the analysis workflow.

  • Generates publication-quality plots using Matplotlib + Seaborn
  • Integrated into all major modules (filtering, quantification, profiling)
  • Available plots:
    • Calibration and regression curves
    • R² and RSD bar charts
    • UpSet and Venn diagrams
    • Volcano plots
    • Filter-step protein count charts
    • Correlation and heatmap matrices
    • Pathway coverage overlays
  • Supports automated figure saving and report integration

4. Aim of the Project

During LC–MS proteomics analysis, peak detection errors and replicate variability can lead to inconsistent quantification and misinterpretation. To ensure a reproducible, traceable, and extensible workflow, pyProteoFlow reconstructs and extends the SCIEX Analyst / MultiQuant data-processing pipeline entirely in Python.

The main goals are:

  • To provide a robust, reproducible end-to-end pipeline for proteomics and assay data
  • To enable transparent filtering, quantification, and reporting
  • To support reliable protein identification and quantification using robust statistical methods
  • To include automated, high-quality visualization and QC reports
  • To create a standardized, reproducible workflow across experiments and users

5. Supported File Formats

| Category | Supported Formats | Example Sources |
|---|---|---|
| LC–MS exports | .txt, .xls, .xlsx, .slx | SCIEX Analyst / PeakView |
| Assay data | .txt, .xls, .xlsx | Bradford assay, quantitative plates |
| Intermediate data | .csv, .tsv | User-generated or exported data |
| Annotation | .gff, .gtf, .json, .csv | NCBI, KEGG, or Escher models |
| Outputs | .xlsx, .csv, .png, .pdf | Reports, plots, and summaries |

6. License & Disclaimer

This package is intended for research and analytical workflows. Users are encouraged to critically review all filtering, regression, and outlier-detection parameters before publication or reporting.


7. Contributing & Development Installation

To install pyProteoFlow for development, clone the repository and create an isolated Python environment.

  1. Clone the repository via git, e.g. from the command line: git clone https://github.com/JuBiotech/pyProteoFlow.git
  2. Open a terminal in the project folder and set up a local environment (recommended: miniconda)
  3. Create a fresh environment:
    conda create --name pyprot python=3.12 -y
    conda activate pyprot
  4. Install uv via:
    pip install uv
  5. Editable install of pyProteoFlow:
    uv pip install -e ".[dev]"
  6. Set up pre-commit:
    pre-commit install
    (If not installed, run pip install pre-commit first)