Add SKILL.md and enrich package docstring #1497
Conversation
Add python/datafusion/AGENTS.md as a comprehensive DataFrame API guide for AI agents and users. It ships with pip automatically (Maturin includes everything under python-source = "python"). Covers core abstractions, import conventions, data loading, all DataFrame operations, expression building, a SQL-to-DataFrame reference table, common pitfalls, idiomatic patterns, and a categorized function index. Enrich the __init__.py module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md. Closes apache#1394 (PR 1a) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root AGENTS.md (symlinked as CLAUDE.md) is for contributors working on the project. Add a pointer to python/datafusion/AGENTS.md which is the user-facing DataFrame API guide shipped with the package. Also add the Apache license header to the package AGENTS.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document that all PRs must follow .github/pull_request_template.md and that pre-commit hooks must pass before committing. List all configured hooks (actionlint, ruff, ruff-format, cargo fmt, cargo clippy, codespell, uv-lock) and the command to run them manually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Let the hooks be discoverable from .pre-commit-config.yaml rather than maintaining a separate list that can drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify that DataFusion works with any Arrow C Data Interface implementation, not just PyArrow.
- Show the `filter` keyword argument on aggregate functions (the idiomatic HAVING equivalent) instead of the post-aggregate `.filter()` pattern.
- Update the SQL reference table to show FILTER (WHERE ...) syntax.
- Remove the now-incorrect "Aggregate then filter for HAVING" pitfall.
- Add `.collect()` to the fluent chaining example so the result is clearly materialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Positive update: After my latest push 4429a08 it now correctly creates an idiomatic datafusion-python file for the first TPC-H query, using only the text description from the specification and being directed strictly not to use the SQL as a reference. I didn't feed it the SQL, but I gave it those instructions so it didn't find the answer during its searching. When I get more time I plan on working through each one of the queries until we have an agent file that can reproduce all of TPC-H with idiomatic code.
FYI @ntjohnson1 you might get some value out of grabbing the
Thanks for the heads up! @iblnkn is going to do some query work in the short term, so it would be good to try this out in addition to some of the internal AGENTS.md stuff we have.
With my latest push I have a folder that contains only the text descriptions of the TPC-H queries, and I gave it this guidance:

> Review the @README.md and @AGENTS.md in this directory. Each of the problem statements is listed in @problems/. I want you to generate solutions for each problem statement. However, when you do this you are forbidden from making any changes to your solution after your first evaluation. This is an attempt to test that our agents file contains all of the necessary instructions, so you should be able to get each one right on the first attempt.

The contents of README.md was:

> **DataFusion Python - TPC-H Queries**
>
> **Overview**
>
> This project implements TPC-H benchmark queries using idiomatic datafusion-python code. The goal is to translate natural language problem descriptions into DataFrame API queries, not to transliterate SQL into Python.
>
> **Data**
>
> TPC-H parquet files are located in the
>
> **Approach**
>
> Each query should be written as idiomatic datafusion-python, using the DataFrame
>
> **Allowed Sources**
>
> **Restrictions**

Additionally I have a CLAUDE.md file with:

> Do not store auto-memory for this folder. The user is developing and testing skills here, and cross-session memory may bias how skills get written or evaluated between runs.
>
> Do not write to
>
> Do not read prior query solutions under
>
> Whenever you hit a problem while generating a query — a DataFusion error, a surprising planner rejection, a type mismatch, an API quirk not covered by the existing guide — after resolving it, propose a concrete addition or edit to

**Results**

Using this it created all 22 TPC-H queries. I then validated that they all work at scale factor 1 and produce the expected results. I also checked each file to make sure it created idiomatic code.
We need this for datafusion too :)
Pull request overview
Adds in-package, user-facing guidance for writing idiomatic DataFusion Python DataFrame API code, and makes it discoverable via the package docstring and repo root instructions.
Changes:
- Add a comprehensive `python/datafusion/AGENTS.md` DataFrame API guide intended to ship in the wheel.
- Expand the `python/datafusion/__init__.py` module docstring with core abstractions, a quick start, and a pointer to the shipped guide.
- Update the repo-root `AGENTS.md` to clarify it targets contributors and link to the user-facing guide.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `python/datafusion/__init__.py` | Replaces the minimal module docstring with a richer overview + quick start and pointer to shipped AGENTS.md |
| python/datafusion/AGENTS.md | New, comprehensive DataFrame API reference/guide intended for agent + human consumption |
| AGENTS.md | Clarifies contributor-focused scope and points users/agents to python/datafusion/AGENTS.md |
- Wrap CASE/WHEN method-chain examples in parentheses and assign to a variable so they are valid Python as shown (Copilot #1, #2).
- Fix INTERSECT/EXCEPT mapping: the default distinct=False corresponds to INTERSECT ALL / EXCEPT ALL, not the distinct forms. Updated both the Set Operations section and the SQL reference table to show both the ALL and distinct variants (Copilot apache#4).
- Change write_parquet / write_csv / write_json examples to file-style paths (output.parquet, etc.) to match the convention used in existing tests and examples. Note that a directory path is also valid for partitioned output (Copilot apache#5).

Verified INTERSECT/EXCEPT semantics with a script:

- df1.intersect(df2) -> [1, 1, 2] (= INTERSECT ALL)
- df1.intersect(df2, distinct=True) -> [1, 2] (= INTERSECT)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop lit() on the RHS of comparison operators since Expr auto-wraps raw Python values, matching the style the guide recommends (Copilot apache#3, apache#6). Updates examples in the Aggregation, CASE/WHEN, SQL reference table, Common Pitfalls, Fluent Chaining, and Variables-as-CTEs sections, plus the __init__.py quick-start snippet. Prose explanations of the rule (which cite the long form as the thing to avoid) are left unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ntjohnson1
left a comment
I think this looks great! I'm not sure if you want to try to land this or wait for some people to test it out first. I figure landing it then iterating might make the most sense.
One concern is how to maintain/validate that the stuff in AGENTS.md is actually up to date: if nothing ever runs the examples, do they still execute? I think doctests can run on markdown, or we could do a more complex method where the md gets built as an artifact.
You mentioned how to distribute this, probably as follow-on work. One idea could be to register this as a skill in one of the various online registries. Then people could install datafusion-python support, just run /dfn-py, and then ask for queries.
```python
count = df.count()  # int

# Streaming
stream = df.execute_stream()  # RecordBatchStream (single partition)
```
I think this needs more context. Is this fetching 1 at a time, fetching everything, fetching up to some internal prefetch buffer? When to prefer this over collect etc? A few sentences would probably help a lot.
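The distinction behind this question, incremental per-batch consumption versus materializing everything up front, is the generator/list distinction in plain Python. A sketch under that analogy (the batch producer below is hypothetical, not DataFusion's API):

```python
# Generator vs. list as an analogy for execute_stream() vs. collect():
# a stream yields one batch at a time, bounding peak memory, while
# collect materializes every batch before returning.
def produce_batches(n_batches, batch_size):
    for i in range(n_batches):
        # each yielded list stands in for one RecordBatch
        yield list(range(i * batch_size, (i + 1) * batch_size))

# Streaming: only one batch is alive at a time.
running_total = 0
for batch in produce_batches(3, 4):
    running_total += sum(batch)

# Collect: everything materialized up front.
all_batches = list(produce_batches(3, 4))
assert running_total == sum(sum(b) for b in all_batches)
```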
> ### Date Arithmetic
>
> `Date32` columns require `Interval` types for arithmetic, not `Duration`. Use
Does Date64 work with Duration, or is this only discussing Date32?
Python's datetime only has microsecond precision, so exporting to NumPy for its datetime64 (or having pandas installed when trying to go to raw Python types) feels worth describing as a dates-related footgun: it is a PyArrow problem, but it gets inherited here.
The in-wheel AGENTS.md was not a real distribution channel -- no shipping agent walks site-packages for AGENTS.md files. Moving to SKILL.md at the repo root, with YAML frontmatter, lets the skill ecosystems (npx skills, Claude Code plugin marketplaces, community aggregators) discover it. Update the pointers in the contributor AGENTS.md and the __init__.py module docstring accordingly. The docstring now references the GitHub URL since the file no longer ships with the wheel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Convert the __init__.py quick-start block to doctest format so it is picked up by `pytest --doctest-modules` (already the project default), preventing silent rot.
- Extract streaming into its own SKILL.md subsection with guidance on when to prefer execute_stream() over collect(), sync and async iteration, and execute_stream_partitioned() for per-partition streams.
- Generalize the date-arithmetic rule from Date32 to both Date32 and Date64 (both reject Duration at any precision, both accept month_day_nano_interval), and note that Timestamp columns differ and do accept Duration.
- Document the PyArrow-inherited type mapping returned by to_pydict()/to_pylist(), including the nanosecond fallback to pandas.Timestamp / pandas.Timedelta and the to_pandas() footgun where date columns come back as an object dtype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docstring pointed readers at SKILL.md as a "comprehensive guide," but SKILL.md is written in a dense, skill-oriented format for agents — humans are better served by the online user guide. Put the online docs first as the primary reference and label the SKILL.md link as the agent reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Addresses part of #1394 (PR 1a from the implementation plan).
Rationale for this change
AI agents (and humans) encountering `datafusion` currently get a 2-line module docstring and no structured guide to the DataFrame API. Agents are very capable with SQL but struggle to produce idiomatic DataFrame code without a reference for imports, expression building, boolean-operator quirks, and SQL-to-DataFrame mappings.

An earlier draft of this PR shipped the guide as `python/datafusion/AGENTS.md` inside the wheel, on the theory that an installed package would surface its own guide. In practice no shipping agent walks `site-packages/*/AGENTS.md`, so the in-wheel file was not actually a discovery channel. This PR takes the honest route: publish the guide as `SKILL.md` at the repo root (where skill ecosystems such as `npx skills`, Claude Code plugin marketplaces, and community aggregators look for it), and enrich the module docstring for the one surface that does reach agents today (`help(datafusion)` / IDE introspection / PyPI rendering).

What changes are included in this PR?
- `SKILL.md` (new, repo root) — comprehensive DataFrame API guide with Agent Skills YAML frontmatter (`name`, `description`) so skill tooling can auto-activate it. Covers common pitfalls: `lit()` wrapping, column quoting, immutable DataFrames, window frame defaults, arithmetic on aggregates, join-column aliasing.
- `python/datafusion/__init__.py` — enriched module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a link to `SKILL.md` on GitHub.
- `AGENTS.md` (root) — clarified that the root file is for contributors working on the project, and pointed agents that need to use the DataFrame API at `SKILL.md`.

Are there any user-facing changes?
The `datafusion` wheel now has a richer module docstring visible via `help(datafusion)`. No API changes. The DataFrame API guide is not bundled in the wheel — it lives at the repo root as `SKILL.md` so skill-aware tooling can discover it.
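One of the pitfalls the guide covers, the boolean-operator quirk, comes from Python itself: `and`/`or` go through `__bool__` and cannot be overloaded to build an expression tree, while `&`/`|` can. A hypothetical mini-`Expr` (not the datafusion-python class) makes the mechanics concrete:

```python
# Why DataFrame expression builders use & / | instead of `and` / `or`:
# `and`/`or` force evaluation via __bool__, but __and__/__or__ can be
# overloaded to return a new expression node instead of a bool.
# This Expr is a toy illustration, not the datafusion-python class.
class Expr:
    def __init__(self, desc):
        self.desc = desc

    def __gt__(self, other):
        # comparisons build nodes too, auto-wrapping raw Python values
        return Expr(f"({self.desc} > {other!r})")

    def __and__(self, other):
        return Expr(f"({self.desc} AND {other.desc})")

    def __repr__(self):
        return self.desc

col_a, col_b = Expr("a"), Expr("b")
combined = (col_a > 1) & (col_b > 2)
print(combined)  # ((a > 1) AND (b > 2))
```

The parentheses around each comparison are required because `&` binds more tightly than `>`, which is why guides for such APIs routinely call this out.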