pyProteoFlow

1. What is pyProteoFlow

pyProteoFlow is a modular and extensible Python toolkit for LC–MS/MS proteomics workflows. It covers the entire processing chain from raw data parsing and assay quantification to filtering, profiling, modeling, and visualization.

Originally designed for untargeted and semi-targeted LC–MS proteomics, the toolkit also includes an assay quantifier/analyzer for multi-plate experiments. It was initially developed for protein quantification assays but can be adapted to other quantitative plate-based assays (e.g. enzymatic, metabolite, or fluorescence assays).

The workflow was originally implemented for fully automated bioreactor cultivation experiments, but is equally applicable to shake flasks and manual bioreactor setups. It integrates sample metadata, dilution information, LC retention times, and raw fragment data to enable automated quality control and quantification across multiple biological levels.

Experimental setup

The pyProteoFlow workflow is typically carried out in two experimental stages:

- Dilution series (calibration run): A simple dilution series of the protein matrix is analyzed to establish calibration curves and determine the linear dynamic range, with R² as the measure of fit quality. This calibration defines the quantitative range for all downstream analysis.
- Actual experiment (quantitative run): The main LC–MS/MS measurement of biological samples (e.g., a bioreactor time course or perturbation experiment). Quantification and filtering are performed using the calibration parameters from the dilution series.

If sufficient material is available, the dilution series can be prepared using remaining lysate or residual sample material from the main experiment — ensuring matrix similarity between calibration and quantification runs.

Key features include:

- 🔬 LC–MS/MS workflow for untargeted proteomics
- ⚗️ Assay quantifier/analyzer for plate-based assays
  - Originally developed for protein quantification
  - Estimation of key quality metrics (linearity, reproducibility, dynamic range)
  - Modeling of linear and non-linear calibration curves (5PL, 4PL, linear)
  - Automatic detection of the linear dynamic range
  - Identification of shared linear regions between different plates
  - Requires a dilution series of the sample matrix to determine the linear range and fit the calibration models (evaluated by R²)
  - Outlier detection at multiple experimental levels:
    - Technical replicates
    - Analytical replicates
    - Biological replicates
- 📊 Export of all key results (regressions, QC summaries, filtered data, etc.) as Excel reports
- 🧪 Bradford assay parser: reads both text and Excel formats for absorbance-based quantification
- 🧬 SCIEX file parser: imports and standardizes SCIEX Analyst outputs (.txt, .xls, .xlsx, .slx)

2. Required Input Files

To run a complete pyProteoFlow analysis, several input files are needed. These provide structured metadata, calibration information, and raw LC–MS data for automated filtering and quantification.

| File Type | Description | Format | Required For |
|---|---|---|---|
| 1. Metadata / Sample List | Contains experimental metadata (sample IDs, conditions, biological and technical replicates). A preformatted Excel template is provided in the repository. | `.xlsx` | Required for all modules |
| 2. Dilution Series of the Protein Matrix | Defines the serial dilution steps used to build calibration curves. This is critical for determining the R² of the fit and the linear range of the assay. | `.xlsx` | Required for assay quantification |
| 3. LC Retention Time Data | Contains measured LC retention times for each peptide or fragment (for both the dilution series and the actual experiment), used for retention time–based outlier detection. | `.xlsx` | Required for filtering and QC |
| 4. Raw Data with Fragments | The main data table from the LC–MS system, containing fragment intensities, retention times, and peptide/protein identifiers. | `.xlsx`, `.xls`, `.txt`, `.slx` | Required for parsing, quantification, and profiling |

All files can be automatically loaded and validated via the parsing and assay quantification modules. Retention time information and fragment-level data are especially important for fragment-specific QC and replicate validation steps.
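
As an illustration of what such a validation step might look like, the following sketch checks that each required input is present and has one of the file extensions listed above. The dictionary keys and the function name are hypothetical labels chosen for this example, not pyProteoFlow identifiers, and the sketch checks names only, not file contents.

```python
from pathlib import Path

# Allowed extensions per input type, as listed in the table above.
# The keys are hypothetical labels, not pyProteoFlow identifiers.
ALLOWED_EXTENSIONS = {
    "metadata": {".xlsx"},
    "dilution_series": {".xlsx"},
    "retention_times": {".xlsx"},
    "raw_fragments": {".xlsx", ".xls", ".txt", ".slx"},
}


def check_input_files(files):
    """Return a list of problems; an empty list means all checks passed.

    `files` maps an input type (see ALLOWED_EXTENSIONS) to a file name or path.
    """
    problems = []
    for kind, allowed in ALLOWED_EXTENSIONS.items():
        if kind not in files:
            problems.append(f"missing input: {kind}")
            continue
        suffix = Path(files[kind]).suffix.lower()
        if suffix not in allowed:
            problems.append(
                f"{kind}: unsupported extension '{suffix}' "
                f"(expected one of {sorted(allowed)})"
            )
    return problems


problems = check_input_files({
    "metadata": "samples.xlsx",
    "dilution_series": "dilution.xlsx",
    "raw_fragments": "export.csv",
})
for line in problems:
    print(line)
```

Returning a list of problems instead of raising on the first one lets a user fix all input issues in a single pass.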

3. Module Overview

🔹 `parsing`

Tools for reading, reshaping, and interpreting raw proteomics files.

- Handles SCIEX Analyst and PeakView exports (.txt, .xls, .xlsx, .slx)
- Extracts protein / peptide / fragment information from complex peak names (e.g., `sp|P63286.1|CLPB_ECOL6.QLPDKAIDLIDEAASSIR.+3y8+1`)
- Provides DataFrame utilities:
  - Wide-to-long transformations
  - Mapping of replicate → group labels
- Includes specialized parsers for:
  - Bradford assay results
  - Quantitative tables from LC–MS instrument software
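
To make the peak-name extraction concrete, here is a minimal sketch that splits the example name above into its parts with a regular expression. The assumed layout (database, accession, entry name, peptide sequence, then precursor charge, ion type, ion number, and fragment charge) is an interpretation of that single example, and `parse_peak_name` is a hypothetical helper, not the parser shipped with pyProteoFlow.

```python
import re

# Assumed layout, inferred from the example peak name only:
# <db>|<accession>|<entry>.<peptide>.+<precursor charge><ion type><ion number>+<fragment charge>
PEAK_NAME = re.compile(
    r"^(?P<db>[^|]+)\|(?P<accession>[^|]+)\|(?P<entry>[^.]+)"
    r"\.(?P<peptide>[A-Z]+)"
    r"\.\+(?P<precursor_charge>\d+)(?P<ion_type>[a-z])(?P<ion_number>\d+)"
    r"\+(?P<fragment_charge>\d+)$"
)


def parse_peak_name(name):
    """Split a peak name into protein, peptide, and fragment fields."""
    match = PEAK_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized peak name: {name!r}")
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("precursor_charge", "ion_number", "fragment_charge"):
        fields[key] = int(fields[key])
    return fields


parsed = parse_peak_name("sp|P63286.1|CLPB_ECOL6.QLPDKAIDLIDEAASSIR.+3y8+1")
print(parsed["accession"], parsed["peptide"], parsed["ion_type"], parsed["ion_number"])
```

Names that do not follow the assumed layout raise a `ValueError` rather than being silently mis-parsed, which keeps unexpected export formats visible.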

🔹 `proteo_filter`

Applies quality filters at multiple analysis stages.

Available filters include:

- NaN ratio filter → remove features with too many missing values
- R² filter → exclude poorly fitting calibration curves
- Signal threshold filters → remove low-intensity or noisy features
- Outlier filters → identify and exclude abnormal measurements
- Replicate consistency filters → enforce reproducibility across replicates
- Peptide/fragment abundance filter → determine how many fragments or peptides are detected per protein

Each filter logs and reports its impact, enabling reproducibility and transparent QC.
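
A filter that reports its own impact could look roughly like the sketch below, which implements a NaN ratio filter on plain Python data. The function name, the report fields, and the default threshold of 0.5 are assumptions made for illustration; the filters in pyProteoFlow may be structured differently.

```python
import math


def nan_ratio_filter(features, max_nan_ratio=0.5):
    """Keep features whose fraction of missing values is at most `max_nan_ratio`.

    `features` maps a feature name to a list of intensities; missing values
    are encoded as None or float("nan").
    Returns (kept_features, report).
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    kept, removed = {}, []
    for name, values in features.items():
        # An empty measurement list is treated as fully missing.
        ratio = sum(is_missing(v) for v in values) / len(values) if values else 1.0
        if ratio <= max_nan_ratio:
            kept[name] = values
        else:
            removed.append(name)

    report = {
        "filter": "nan_ratio",
        "threshold": max_nan_ratio,
        "n_in": len(features),
        "n_out": len(kept),
        "removed": removed,
    }
    return kept, report


nan = float("nan")
features = {
    "pepA": [1.0, 2.0, nan, 4.0],
    "pepB": [nan, None, nan, 3.0],
    "pepC": [5.0, 6.0, 7.0, 8.0],
}
kept, report = nan_ratio_filter(features)
print(report)
```

Returning the report alongside the filtered data means each step can be logged or exported without re-deriving which features were dropped.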

🔹 `proteo_assay_quantifier`

A comprehensive module for assay quantification and regression modeling.

- Originally built for protein quantification in LC–MS proteomics
- Fully extensible to other plate-based assays (enzymatic, metabolite, fluorescence, etc.)
- Linear and non-linear regression fitting (1st–5th order, 4PL, 5PL)
- Automatic linear region detection
- Computation of calibration metrics:
  - R², slope, intercept, standard error
  - LOD, LOQ (limits of detection and quantification)
- Detection of shared linear ranges across multiple plates
- Integration of technical, analytical, and biological replicates
- Excel export of full quantification summaries
- Visualization of:
  - Fitted calibration curves with regression equations
  - Residual and RSD plots
  - Outlier diagnostics and reproducibility summaries

🔹 `proteo_profiler`

Annotation and pathway profiling tools.

- Interfaces with NCBI and KEGG databases for automatic annotation
- Assigns proteins to pathways, subsystems, and functional categories
- Supports Escher-based pathway visualization → detected proteins are highlighted on metabolic pathway maps
- Generates:
  - Volcano plots (ANOVA / fold-change-based)
  - Pathway coverage and subsystem summaries
  - KEGG enrichment and function-based groupings
- Enables biological interpretation of quantitative proteomics results

🔹 `stats`

Mathematical and statistical foundation of pyProteoFlow.

Includes:

- Outlier detection algorithms
  - Tukey IQR, Thompson Tau, Rosner’s test, Modified Z-score
- Descriptive and comparative statistics
  - Mean, median, SD, CV, RSD, t-tests, ANOVA, fold-change
- Regression utilities
  - Linear regression, R² computation, slope comparison
- Error propagation and uncertainty estimation
- Robust summarization methods for replicate data
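
Two of the outlier rules named above can be written in a few lines with the standard library. The sketch below uses the conventional defaults (a 1.5 × IQR fence for Tukey's rule, and a cutoff of 3.5 with the 0.6745 scaling constant for the modified Z-score); the function names and the choice of the "inclusive" quantile method are assumptions for illustration, and the pyProteoFlow implementations may differ in detail.

```python
from statistics import median, quantiles


def tukey_iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def modified_z_outliers(values, threshold=3.5):
    """Return values whose modified Z-score exceeds `threshold` in magnitude.

    The score is 0.6745 * (x - median) / MAD, where MAD is the median
    absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All central values are identical; the score is undefined.
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical replicate measurements with one suspicious value.
replicates = [10.1, 9.8, 10.3, 10.0, 9.9, 14.5]
print(tukey_iqr_outliers(replicates))
print(modified_z_outliers(replicates))
```

Both rules are based on medians and quartiles rather than the mean and standard deviation, so a single extreme replicate does not mask itself by inflating the spread estimate.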

🔹 `utils`

General-purpose helper functions used across modules.

Features:

- File and path management
- Safe directory creation and file export helpers
- DataFrame and dictionary flattening
- Data intersection and grouping utilities
- Sampling and randomization tools
- Logging, configuration, and runtime utilities
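
For orientation, helpers of this kind are typically small. The sketch below shows one possible form of dictionary flattening and safe directory creation; `flatten_dict` and `ensure_dir` are hypothetical names used for this example and are not necessarily the functions provided by the `utils` module.

```python
from pathlib import Path


def flatten_dict(nested, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with joined keys."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def ensure_dir(path):
    """Create a directory (including parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Hypothetical nested configuration.
config = {"filter": {"r2_min": 0.99, "nan_ratio": {"max": 0.5}}, "export": "xlsx"}
flat = flatten_dict(config)
print(flat)
```

Flattened keys such as `filter.nan_ratio.max` map directly onto column headers, which is convenient when settings are written to a tabular report.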

🔹 `viz`

Visualization and reporting module — a key part of the analysis workflow.

- Generates publication-quality plots using Matplotlib + Seaborn
- Integrated into all major modules (filtering, quantification, profiling)
- Available plots:
  - Calibration and regression curves
  - R² and RSD bar charts
  - UpSet and Venn diagrams
  - Volcano plots
  - Filter-step protein count charts
  - Correlation and heatmap matrices
  - Pathway coverage overlays
- Supports automated figure saving and report integration

4. Aim of the Project

During LC–MS proteomics analysis, peak detection errors and replicate variability can lead to inconsistent quantification and misinterpretation. To ensure a reproducible, traceable, and extensible workflow, pyProteoFlow reconstructs and extends the SCIEX Analyst / MultiQuant data-processing pipeline entirely in Python.

The main goals are:

- To provide a robust, reproducible end-to-end pipeline for proteomics and assay data
- To enable transparent filtering, quantification, and reporting
- To support reliable protein identification and quantification using robust statistical methods
- To include automated, high-quality visualization and QC reports
- To create a standardized, reproducible workflow across experiments and users

5. Supported File Formats

| Category | Supported Formats | Example Sources |
|---|---|---|
| LC–MS exports | `.txt`, `.xls`, `.xlsx`, `.slx` | SCIEX Analyst / PeakView |
| Assay data | `.txt`, `.xls`, `.xlsx` | Bradford assay, quantitative plates |
| Intermediate data | `.csv`, `.tsv` | User-generated or exported data |
| Annotation | `.gff`, `.gtf`, `.json`, `.csv` | NCBI, KEGG, or Escher models |
| Outputs | `.xlsx`, `.csv`, `.png`, `.pdf` | Reports, plots, and summaries |

6. License & Disclaimer

This package is intended for research and analytical workflows. Users are encouraged to critically review all filtering, regression, and outlier-detection parameters before publication or reporting.

7. Contributing & Development Installation

To install pyProteoFlow for development, clone the repository and create an isolated Python environment.

- Clone the repository via git, e.g. from the command line: `git clone https://github.com/JuBiotech/pyProteoFlow.git`
- Open a terminal in the project folder and set up a local environment (recommended: miniconda)

Create a fresh environment:

conda create --name pyprot python==3.12 -y
conda activate pyprot

Install uv via:
```
pip install uv
```
Editable install of pyProteoFlow:
```
uv pip install -e ".[dev]"
```
Set up pre-commit:
```
pre-commit install
```
(If pre-commit is not installed, run `pip install pre-commit` first.)
