One possibility to simplify access to data files would be to include them in a Python package that could be made available on PyPI and conda-forge. The package could include functionality to open files with pandas, xarray, or h5py, which could then be imported into PlasmaPy.
Instead of being downloaded separately, the data files could be installed via pip install plasmapy-data and then accessed by PlasmaPy. We could potentially make plasmapy-data a dependency of PlasmaPy. We could perhaps even allow installation without plasmapy-data via pip install plasmapy[lite] if the size of the data grows to ≳ 10 MB.
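To make the idea a bit more concrete, here is a minimal sketch of how such a package might expose its bundled files. The import name plasmapy_data and the helper data_file are hypothetical (neither exists yet), and the sketch assumes the data files ship inside the package directory itself. It uses only importlib.resources from the standard library (Python 3.9+).

```python
from importlib.resources import files


def data_file(name, package="plasmapy_data"):
    """Return a path-like handle to a data file bundled inside ``package``.

    ``plasmapy_data`` is a hypothetical import name for the proposed package;
    any importable package that ships data files would work the same way.
    """
    # files() locates the installed package; joinpath() points at a file in it.
    resource = files(package).joinpath(name)
    if not resource.is_file():
        raise FileNotFoundError(f"{name!r} is not bundled with {package!r}")
    return resource
```

For a regular (non-zipped) install, the returned object is an ordinary pathlib.Path, so it could be passed directly to a reader such as pandas.read_csv or xarray.open_dataset. If zipped installs needed to be supported, importlib.resources.as_file could be used to get a real filesystem path.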
So far, the data files in this repository are well within the size that can reasonably be included in a Python package. For comparison, PlasmaPy wheels are ∼9 MB and source distributions are ∼14 MB.
The main disadvantage of creating a package is that we would have an additional package to maintain, though tools like cruft could simplify that, and I don't expect the maintenance burden to be large compared to the main PlasmaPy repo. We would want to make the release process simpler than for the main PlasmaPy repo (e.g., by avoiding changelogs).
We'd have to figure out what we'd want to do with data used in tests. If PlasmaPy moves to an src layout with a separate tests directory, then the test data could live in the tests directory.
An advantage of incorporating the data into a Python package is that it could be cached in GitHub Actions very straightforwardly.
I do not know if this is the best approach, so I'd also like to look into best practices and check with people in pyOpenSci about alternatives.
This will take quite a bit more discussion, so in the meantime we should proceed with PlasmaPy/PlasmaPy#2570 (which we may need for especially large data sets).
@pheuer, @JaydenR2305 — I'm curious what your thoughts are on this!