dataprof is a Rust and Python library for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000 and ISO/IEC 25012 standards, all with bounded memory usage that lets you profile datasets far larger than your available RAM.
It is built for the first ten minutes with unfamiliar data: find sparse columns, unstable types, duplicate keys, stale timestamps, and suspicious values before they turn into pipeline bugs.

> **Note:** dataprof is in beta. Current releases ship a Rust crate and a Python package. The historical CLI remains documented only for older releases.

| Question | What you get back |
|---|---|
| Which columns are thin, empty, or structurally broken? | Null counts, completeness metrics, and schema shape in one pass |
| Did this feed drift or spike somewhere suspicious? | Numeric summaries, outlier signals, and range checks |
| Are these IDs really unique or just pretending to be keys? | Distinct counts, uniqueness ratios, and duplicate warnings |
| Are my timestamps plausible and fresh? | Future-date detection, stale-data signals, and timeliness scoring |
| Did parsing silently go wrong? | Type inference, pattern matches, format violations, and source metadata |
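
To make the signals in the table concrete, here is a minimal pure-Python sketch of how null counts, completeness, distinct counts, and a uniqueness ratio could be derived for one column. It does not use dataprof and is not its implementation; the `column_signals` helper and the `NULL_MARKERS` set are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical set of strings treated as missing; an assumption for this sketch.
NULL_MARKERS = {"", "null", "NULL", "NA", "N/A"}


def column_signals(values):
    """Return basic completeness and uniqueness signals for one column."""
    total = len(values)
    # Keep only values that are neither None nor a known null marker.
    non_null = [
        v for v in values
        if v is not None and str(v).strip() not in NULL_MARKERS
    ]
    counts = Counter(non_null)
    return {
        "null_count": total - len(non_null),
        "completeness": len(non_null) / total if total else 0.0,
        "distinct_count": len(counts),
        # 1.0 means every non-null value is unique, as a real key would be.
        "uniqueness_ratio": len(counts) / len(non_null) if non_null else 0.0,
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
    }


ids = ["a1", "a2", "a2", "", "a3", None]
signals = column_signals(ids)
print(signals)
```

A uniqueness ratio below 1.0 on a column that is supposed to be a key is the kind of "pretending to be a key" warning the table refers to.
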

| You are doing this | Start with |
|---|---|
| Embedding profiling in a Rust service, ETL job, or batch tool | `cargo add dataprof@0.8` and `Profiler::new().analyze_file(...)` |
| Inspecting files in notebooks, validation scripts, or data apps | `uv pip install dataprof` and `dp.profile(...)` |
| Profiling streams, remote Parquet, or database queries | Rust feature flags, or a source-built Python extension with async/database features enabled |

```bash
uv pip install dataprof
```

Pre-built PyPI wheels ship the base Python API for local files, DataFrames, and Arrow objects. Async URL profiling and database helpers remain opt-in source builds.

```python
import dataprof as dp

report = dp.profile("data.csv", metrics=["schema", "statistics", "quality"])
print(f"{report.rows} rows, {report.columns} columns")
print(f"quality={report.quality_score:.1f}")

age = report["age"]
print(age.data_type, age.mean, age.null_percentage)

report.save("report.json")
```

```toml
[dependencies]
dataprof = "0.8"
# or: dataprof = { version = "0.8", default-features = false }
```

```rust
use dataprof::Profiler;

let report = Profiler::new().analyze_file("data.csv")?;
println!("Rows: {}", report.execution.rows_processed);
println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));

for col in &report.column_profiles {
    println!("{} {:?} nulls={}", col.name, col.data_type, col.null_count);
}
```

- Fast first-pass signal -- surface null pockets, type drift, duplicate keys, and outliers quickly
- True streaming -- bounded-memory profiling with online algorithms for files bigger than RAM
- Multi-format by default -- move from CSV and JSON to Parquet, live databases, DataFrames, and Arrow batches without changing tools
- Two polished entry points -- a compact Rust facade and a Python package that feels natural in notebooks
- Async-ready -- Rust async APIs and opt-in Python extension builds cover stream pipelines, services, and remote Parquet sources
- ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
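
The "online algorithms" mentioned in the list above can be illustrated with Welford's algorithm, a well-known way to keep a running mean and variance in constant memory. The sketch below is a generic example of that technique, not dataprof's actual code; `RunningStats` and `profile_stream` are hypothetical names.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        """Fold one value into the statistics without storing it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than 2 values."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def profile_stream(values):
    """Consume any iterable, including a generator, one value at a time."""
    stats = RunningStats()
    for value in values:
        stats.update(value)
    return stats


# The generator is never materialized as a list, so memory use stays flat.
stats = profile_stream(float(i) for i in range(1, 100_001))
print(stats.count, stats.mean, math.sqrt(stats.variance))
```

Because each value is discarded after `update`, the same loop works whether the source is a small list or a file far larger than memory.
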

| Feature | Description |
|---|---|
| `arrow` | Arrow-backed columnar engine |
| `parquet` (default) | Parquet profiling; includes `arrow` |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP; includes `parquet` and `async-streaming` |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |

For the leanest Rust build, set `default-features = false` on the dependency or pass `--no-default-features` to cargo; there is no separate minimal alias.

| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects `,`, `;`, `\|`, and `\t` delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
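
Delimiter auto-detection, as listed for CSV in the table above, can be illustrated with Python's standard `csv.Sniffer`. This is only a sketch of the general idea using the standard library; dataprof's own detection logic is not described here and may work differently. `detect_delimiter` and `CANDIDATE_DELIMITERS` are hypothetical names.

```python
import csv

# The four delimiters named in the format table.
CANDIDATE_DELIMITERS = ",;|\t"


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a small text sample of the file."""
    dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
    return dialect.delimiter


semicolon_sample = "id;name;score\n1;ada;9.5\n2;bob;7.0\n"
tab_sample = "id\tname\tscore\n1\tada\t9.5\n2\tbob\t7.0\n"

print(repr(detect_delimiter(semicolon_sample)))
print(repr(detect_delimiter(tab_sample)))
```

`csv.Sniffer.sniff` raises `csv.Error` when it cannot decide, so production code would normally catch that and fall back to a default such as a comma.
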

dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:

| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |

An overall quality score (0 to 100) is computed as a weighted average of the five dimension scores.

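
As a rough illustration of a weighted average over dimension scores, consider the sketch below. The weights are hypothetical values picked for the example; the weights dataprof actually uses are not stated in this document, and `overall_quality` is not part of its API.

```python
# Hypothetical weights for illustration only; not dataprof's real weights.
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "uniqueness": 0.20,
    "accuracy": 0.15,
    "timeliness": 0.10,
}


def overall_quality(dimension_scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each on a 0 to 100 scale."""
    total_weight = sum(weights[name] for name in dimension_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        score * weights[name] for name, score in dimension_scores.items()
    )
    # Dividing by the weights actually used keeps the result on the 0 to 100
    # scale even when some dimensions are missing from the input.
    return weighted_sum / total_weight


scores = {
    "completeness": 90.0,
    "consistency": 80.0,
    "uniqueness": 100.0,
    "accuracy": 70.0,
    "timeliness": 60.0,
}
quality = overall_quality(scores)
print(f"quality={quality:.1f}")
```

With these example weights the result is about 83.5: a weak timeliness score lowers the total less than a weak completeness score would, because it carries a smaller weight.
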
- Getting Started -- the shortest path from mystery dataset to useful signal
- Examples Cookbook -- focused Rust and Python recipes you can adapt quickly
- Python API Guide -- files, DataFrames, Arrow interop, exports, and optional source-built async/database features
- Database Connectors -- PostgreSQL, MySQL, SQLite setup and connection patterns
- Crate Redesign Notes -- what the facade owns and why the workspace is split this way
- Contributing
- Changelog
- Archived CLI Guide -- pre-0.8 reference only

dataprof is the subject of a paper submitted for peer review to IEEE ScalCom 2026:

> A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]

The paper benchmarks dataprof against YData Profiling, Polars, and pandas on execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.

```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```

Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
