Add Support for Virtualizarr files saved as Kerchunk parquet files #76

@rmendels

VirtualiZarr/kerchunk parquet files store the byte-range information for chunks in any of the supported file formats, and make the original files look like Zarr stores. To use that information you need to be able to:

  1. read and interpret the parquet file
  2. make the byte-range requests.
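To make step 2 concrete, here is a minimal sketch of a byte-range read against a local file, using only the Java standard library. It is not the zarr-java API and not the test code mentioned in this issue. The `ChunkRef` record and `readRange` method are hypothetical names, and the idea that each reference boils down to a (path, offset, size) triple is an assumption about what the parquet rows provide.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class ByteRangeSketch {
    // Hypothetical holder for one chunk reference: which file, where, how many bytes.
    record ChunkRef(String path, long offset, int size) {}

    // Read exactly ref.size() bytes starting at ref.offset() in the referenced file.
    static byte[] readRange(ChunkRef ref) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of(ref.path()), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(ref.size());
            long pos = ref.offset();
            // A positional read may return fewer bytes than requested, so loop until full.
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new IOException("Range extends past end of file: " + ref);
                }
                pos += n;
            }
            return buf.array();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real data file: 8 bytes with known values.
        Path tmp = Files.createTempFile("range", ".bin");
        try {
            Files.write(tmp, new byte[] {10, 11, 12, 13, 14, 15, 16, 17});
            byte[] chunk = readRange(new ChunkRef(tmp.toString(), 2, 4));
            System.out.println(Arrays.toString(chunk)); // prints [12, 13, 14, 15]
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

For remote files the same triple would presumably translate into an HTTP `Range` request instead of a positional file read.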

zarr-java already has what is needed to do step 2. For step 1 you can use either the Java version of DuckDB or the Java Parquet library directly. In fact, I have test Java code that successfully uses either. But:

  1. The code was generated by Claude.ai with a lot of back and forth. I am not a Java programmer, so I have no idea of the quality of the code, and so I am not going to make a pull request. Nor are there appropriate tests, etc. It just shows that this is doable and in fact works.
  2. I would put the example code here, except the folder is quite large (over 100 MB). If there is another way I can get the files to someone, I would gladly do so. I likely cannot zip it and mail it, because my mailer doesn't like zipped files with executable code in them.

Anyway, here is a sample session for some netCDF4 files, where I used Python to create the Kerchunk parquet file:

java --enable-native-access=ALL-UNNAMED \
   -jar ~/kerchunk-zarr-reader/target/kerchunk-zarr-reader-1.0-SNAPSHOT.jar \
   /Users/rmendels/kerchunk-test/VHN2015056_2015056_chla.parquet chla
Detected VirtualiZarr 2.x partitioned format
Materializing virtual store → /var/folders/46/jyz1mm5x5bvbf59b5g8f7f580000gn/T/zarr_virtual_5248494852501128730
6 variable(s): [altitude, chla, latitude, longitude, time, time_bnds]
altitude              shape=[1]  chunks=[1]
  → 0 chunk(s)
chla                  shape=[1, 1, 11985, 9338]  chunks=[1, 1, 2997, 2335]
  → 16 chunk(s)
latitude              shape=[11985]  chunks=[11985]
  → 0 chunk(s)
longitude             shape=[9338]  chunks=[9338]
  → 0 chunk(s)
time                  shape=[1]  chunks=[1]
  → 0 chunk(s)
time_bnds             shape=[1, 2]  chunks=[1, 2]
  → 1 chunk(s)
Materialization complete.
Detected Zarr v2 store — using direct chunk reader
Variable  : chla
Shape     : [1, 1, 11985, 9338]
Dtype     : <f4
Chunks    : [1, 1, 2997, 2335]
Compressor: none
Filters   : zlib
DimSep    : "."

=== Data summary: chla ===
Shape : [1, 1, 11985, 9338]
Size  : 111915930 elements
First 8: -999.0000 -999.0000 -999.0000 -999.0000 -999.0000 -999.0000 -999.0000 -999.0000
Min=-999.0000  Max=392.4024  Mean=-868.0739  NaN=0
Temp store deleted.
(work) ➜  ~ java --enable-native-access=ALL-UNNAMED \
   -jar ~/kerchunk-zarr-reader/target/kerchunk-zarr-reader-1.0-SNAPSHOT.jar \
   /Users/rmendels/kerchunk-test/VHN2015056_2015056_chla.parquet chla \
   0,0,0,0  1,1,100,100
Detected VirtualiZarr 2.x partitioned format
Materializing virtual store → /var/folders/46/jyz1mm5x5bvbf59b5g8f7f580000gn/T/zarr_virtual_14639025837829272872
6 variable(s): [altitude, chla, latitude, longitude, time, time_bnds]
altitude              shape=[1]  chunks=[1]
  → 0 chunk(s)
chla                  shape=[1, 1, 11985, 9338]  chunks=[1, 1, 2997, 2335]
  → 16 chunk(s)
latitude              shape=[11985]  chunks=[11985]
  → 0 chunk(s)
longitude             shape=[9338]  chunks=[9338]
  → 0 chunk(s)
time                  shape=[1]  chunks=[1]
  → 0 chunk(s)
time_bnds             shape=[1, 2]  chunks=[1, 2]
  → 1 chunk(s)
Materialization complete.
Detected Zarr v2 store — using direct chunk reader
Variable  : chla
Shape     : [1, 1, 11985, 9338]
Dtype     : <f4
Chunks    : [1, 1, 2997, 2335]
Compressor: none
Filters   : zlib
DimSep    : "."

=== Data summary: chla ===
Shape : [1, 1, 100, 100]
Size  : 10000 elements
First 8: -999.0000 -999.0000 -999.0000 -999.0000 -999.0000 -999.0000 -999.0000 -999.0000
Min=-999.0000  Max=-999.0000  Mean=-999.0000  NaN=0
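
As a cross-check on the numbers in the sessions above, the chunk count and element count follow from the reported shape and chunk sizes, and the small subset in the second session should only need one chunk. The sketch below is illustrative only: `ChunkGridSketch` and its methods are hypothetical names, and it assumes the two command-line arguments `0,0,0,0  1,1,100,100` mean an origin and a count per dimension (the reported subset shape is consistent with that reading).

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkGridSketch {
    // Number of chunks along each axis: ceil(shape / chunk), in integer arithmetic.
    static long[] gridShape(long[] shape, long[] chunks) {
        long[] grid = new long[shape.length];
        for (int i = 0; i < shape.length; i++) {
            grid[i] = (shape[i] + chunks[i] - 1) / chunks[i];
        }
        return grid;
    }

    static long product(long[] values) {
        long p = 1;
        for (long v : values) {
            p *= v;
        }
        return p;
    }

    // Keys of all chunks that intersect the box [origin, origin + count); count must be > 0.
    static List<String> chunkKeys(long[] origin, long[] count, long[] chunks, String sep) {
        int n = origin.length;
        long[] first = new long[n];
        long[] last = new long[n];
        for (int i = 0; i < n; i++) {
            first[i] = origin[i] / chunks[i];
            last[i] = (origin[i] + count[i] - 1) / chunks[i];
        }
        List<String> keys = new ArrayList<>();
        long[] idx = first.clone();
        while (true) {
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if (i > 0) key.append(sep);
                key.append(idx[i]);
            }
            keys.add(key.toString());
            // Odometer-style increment, last axis fastest.
            int axis = n - 1;
            while (axis >= 0 && idx[axis] == last[axis]) {
                idx[axis] = first[axis];
                axis--;
            }
            if (axis < 0) break;
            idx[axis]++;
        }
        return keys;
    }

    public static void main(String[] args) {
        long[] shape = {1, 1, 11985, 9338};
        long[] chunks = {1, 1, 2997, 2335};
        System.out.println(product(gridShape(shape, chunks))); // 16 chunks
        System.out.println(product(shape));                    // 111915930 elements
        // Subset from the second session: only chunk "0.0.0.0" is touched.
        System.out.println(chunkKeys(new long[] {0, 0, 0, 0}, new long[] {1, 1, 100, 100}, chunks, "."));
    }
}
```

If this reading is right, a reader that fetches only the intersecting chunks would need a single byte-range request for the second session rather than all 16.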

If there is any way this capability can be added to zarr-java it would be really neat, but I perfectly understand that time is limited, and it is not something I can do myself.
