
AndreaBozzo/dataprof

dataprof

High-performance data profiling with ISO 8000/25012 quality metrics

Crates.io docs.rs PyPI License: MIT OR Apache-2.0


dataprof is a Rust and Python library for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standards. Memory usage stays bounded, so you can profile datasets far larger than your available RAM.

It is built for the first ten minutes with unfamiliar data: find sparse columns, unstable types, duplicate keys, stale timestamps, and suspicious values before they turn into pipeline bugs.

Note

dataprof is in beta. Current releases ship a Rust crate and a Python package. The historical CLI remains documented only for older releases.

What dataprof answers quickly

| Question | What you get back |
| --- | --- |
| Which columns are thin, empty, or structurally broken? | Null counts, completeness metrics, and schema shape in one pass |
| Did this feed drift or spike somewhere suspicious? | Numeric summaries, outlier signals, and range checks |
| Are these IDs really unique or just pretending to be keys? | Distinct counts, uniqueness ratios, and duplicate warnings |
| Are my timestamps plausible and fresh? | Future-date detection, stale-data signals, and timeliness scoring |
| Did parsing silently go wrong? | Type inference, pattern matches, format violations, and source metadata |
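
To make the uniqueness question in the table above concrete, here is a minimal standard-library sketch that computes a distinct count, a uniqueness ratio, and the duplicated values for a candidate key column. It is not dataprof's implementation or API; the helper name `key_uniqueness` and the sample IDs are hypothetical.

```python
from collections import Counter


def key_uniqueness(values):
    """Return (distinct_count, uniqueness_ratio, duplicated_values) for a column.

    Hypothetical helper for illustration only. None is treated as a missing
    value and excluded from the counts.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    distinct = len(counts)
    ratio = distinct / len(present) if present else 0.0
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return distinct, ratio, duplicates


ids = ["a1", "a2", "a2", "a3", None, "a3"]
distinct, ratio, duplicates = key_uniqueness(ids)
print(distinct, ratio, duplicates)
```

A ratio below 1.0 means the column cannot serve as a key without deduplication; here three distinct values cover five non-null entries.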

Pick your entry point

| You are doing this | Start with |
| --- | --- |
| Embedding profiling in a Rust service, ETL job, or batch tool | cargo add dataprof@0.8 and Profiler::new().analyze_file(...) |
| Inspecting files in notebooks, validation scripts, or data apps | uv pip install dataprof and dp.profile(...) |
| Profiling streams, remote Parquet, or database queries | Rust feature flags, or a source-built Python extension with async/database features enabled |

Start in 30 Seconds

Python

uv pip install dataprof

Pre-built PyPI wheels ship the base Python API for local files, DataFrames, and Arrow objects. Async URL profiling and database helpers are available only in opt-in source builds.

import dataprof as dp

report = dp.profile("data.csv", metrics=["schema", "statistics", "quality"])
print(f"{report.rows} rows, {report.columns} columns")
print(f"quality={report.quality_score:.1f}")

age = report["age"]
print(age.data_type, age.mean, age.null_percentage)

report.save("report.json")

Rust

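[dependencies]
dataprof = "0.8"
# or: dataprof = { version = "0.8", default-features = false }

use dataprof::Profiler;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Profile the file and get a report back.
    let report = Profiler::new().analyze_file("data.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    // Fall back to 0.0 when no quality score is available.
    println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));

    // One line per column: name, inferred type, and null count.
    for col in &report.column_profiles {
        println!("{} {:?} nulls={}", col.name, col.data_type, col.null_count);
    }
    Ok(())
}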

Why it feels modern

  • Fast first-pass signal -- surface null pockets, type drift, duplicate keys, and outliers quickly
  • True streaming -- bounded-memory profiling with online algorithms for files bigger than RAM
  • Multi-format by default -- move from CSV and JSON to Parquet, live databases, DataFrames, and Arrow batches without changing tools
  • Two polished entry points -- a compact Rust facade and a Python package that feels natural in notebooks
  • Async-ready -- Rust async APIs and opt-in Python extension builds cover stream pipelines, services, and remote Parquet sources
  • ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
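
As an illustration of what "online algorithms" means for bounded-memory profiling, the sketch below uses Welford's algorithm to keep a running mean and variance in constant memory, one value at a time. This is a generic textbook technique shown under the assumption that it resembles the kind of accumulator a streaming profiler uses; it is not taken from dataprof's source, and the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Online mean/variance via Welford's algorithm: O(1) memory per column."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.count - 1) if self.count > 1 else float("nan")

    @property
    def stddev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stands in for a stream
    stats.update(value)
print(stats.count, stats.mean, stats.variance)
```

Because each update touches only three numbers, the memory cost is independent of how many rows pass through, which is the property that makes profiling files larger than RAM possible.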

Feature Flags

| Feature | Description |
| --- | --- |
| arrow | Arrow-backed columnar engine |
| parquet (default) | Parquet profiling; includes arrow |
| async-streaming | Async profiling engine with tokio |
| parquet-async | Profile Parquet files over HTTP; includes parquet and async-streaming |
| database | Database profiling (connection handling, retry, SSL) |
| postgres | PostgreSQL connector (includes database) |
| mysql | MySQL/MariaDB connector (includes database) |
| sqlite | SQLite connector (includes database) |
| all-db | All three database connectors |

For the leanest Rust build, set default-features = false in Cargo.toml or pass --no-default-features to cargo build; there is no separate minimal alias.

Supported Formats

| Format | Engine | Notes |
| --- | --- | --- |
| CSV | Incremental, Columnar | Auto-detects comma, semicolon, pipe, and tab delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
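
To show what delimiter auto-detection for CSV can look like, here is a deliberately naive sketch that tries each candidate delimiter on a text sample and keeps the one that splits every line into the same number of fields. The heuristic and the name `guess_delimiter` are assumptions made for illustration; dataprof's actual detection logic is not documented here and may differ.

```python
CANDIDATES = [",", ";", "|", "\t"]


def guess_delimiter(sample, candidates=CANDIDATES):
    """Pick the candidate that splits every sampled line into the same
    number (> 1) of fields, preferring the one that yields the most fields.

    Naive on purpose: it ignores quoting, so a quoted field that contains
    a delimiter will confuse it.
    """
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_fields = None, 1
    for delim in candidates:
        field_counts = {line.count(delim) + 1 for line in lines}
        if len(field_counts) == 1:
            fields = field_counts.pop()
            if fields > best_fields:
                best, best_fields = delim, fields
    return best


print(repr(guess_delimiter("id;name;price\n1;tea;3,50\n2;coffee;4,20\n")))
print(repr(guess_delimiter("id\tname\n1\ttea\n")))
```

In the first sample the comma appears only inside decimal prices, so its per-line count is inconsistent and the semicolon wins. The function returns None when no candidate splits the sample consistently.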

Quality Metrics

dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:

| Dimension | What it measures |
| --- | --- |
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
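
For a feel of how two of these dimensions reduce to simple ratios, the sketch below computes the completeness measures (missing values ratio, complete records ratio, fully-null columns) and two timeliness measures (future dates, stale data ratio) over a small list of records. The function names, the sample rows, and the one-year staleness threshold are all assumptions for illustration, not dataprof's definitions.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, columns):
    """Return (missing_values_ratio, complete_records_ratio, fully_null_columns)."""
    total_cells = len(rows) * len(columns)
    missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    complete = sum(
        1 for row in rows if all(row.get(col) is not None for col in columns)
    )
    fully_null = [
        col for col in columns if rows and all(row.get(col) is None for row in rows)
    ]
    return (
        missing / total_cells if total_cells else 0.0,
        complete / len(rows) if rows else 0.0,
        fully_null,
    )


def timeliness(timestamps, now, max_age=timedelta(days=365)):
    """Return (future_ratio, stale_ratio). The one-year threshold is an assumption."""
    present = [ts for ts in timestamps if ts is not None]
    if not present:
        return 0.0, 0.0
    future = sum(1 for ts in present if ts > now)
    stale = sum(1 for ts in present if now - ts > max_age)
    return future / len(present), stale / len(present)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "email": "a@example.com", "notes": None, "seen": now - timedelta(days=3)},
    {"id": 2, "email": None, "notes": None, "seen": now + timedelta(days=10)},
    {"id": 3, "email": "c@example.com", "notes": None, "seen": now - timedelta(days=800)},
    {"id": 4, "email": "d@example.com", "notes": None, "seen": now - timedelta(days=30)},
]
missing_ratio, complete_ratio, null_columns = completeness(rows, ["id", "email", "notes"])
future_ratio, stale_ratio = timeliness([row["seen"] for row in rows], now)
print(missing_ratio, complete_ratio, null_columns, future_ratio, stale_ratio)
```

The always-empty `notes` column drives the complete records ratio to zero on its own, which is why a fully-null column is worth reporting separately from the overall missing values ratio.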

An overall quality score (0 to 100) is computed as a weighted average of the five dimension scores.
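
The weighted average can be sketched as score = sum(w_d * s_d) / sum(w_d), where s_d is the score of dimension d and w_d its weight. The weights and dimension scores below are hypothetical values picked only to make the arithmetic easy to follow; dataprof's actual weights are not stated in this README.

```python
def overall_quality(scores, weights):
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical weights and scores, not dataprof's real configuration.
weights = {
    "completeness": 0.30,
    "consistency": 0.25,
    "uniqueness": 0.20,
    "accuracy": 0.15,
    "timeliness": 0.10,
}
scores = {
    "completeness": 90.0,
    "consistency": 80.0,
    "uniqueness": 100.0,
    "accuracy": 70.0,
    "timeliness": 60.0,
}
overall = overall_quality(scores, weights)
print(f"{overall:.1f}")
```

With these numbers the weighted terms are 27 + 20 + 20 + 10.5 + 6, giving 83.5. Dividing by the weight total means the weights do not have to sum to exactly 1.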

Documentation

Start here

Integrate it

  • Python API Guide -- files, DataFrames, Arrow interop, exports, and optional source-built async/database features
  • Database Connectors -- PostgreSQL, MySQL, SQLite setup and connection patterns

Understand it

Historical

Academic Work

dataprof is the subject of a paper submitted to IEEE ScalCom 2026 and currently under peer review:

A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]

The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.

BibTeX

@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}

License

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.