This repository contains the implementation of a JSON-Iceberg streaming plugin for Arcane. Use this app to livestream JSON files to an Iceberg table, backed by Trino as a streaming batch merge consumer and Lakekeeper as a data catalog.
This source continuously ingests files with multiline-JSON content into a target Iceberg table. In order to configure the stream, you must provide the following:
- Desired AVRO schema for the source. Note that this schema should conform with the JSON produced after JSON pointers and array explode have been applied. All fields in the schema must be defined as nullable. You can use this handy tool to generate the schema.
- Source S3 path
- JSON pointer expression, if the desired data is a subset of the source JSON. For example, given
{
"colA": "a",
"colB": {
"colC": "c",
"propA": 1,
"propB": "ABC"
}
}
and jsonPointerExpression set to /colB, the source will be transformed to:
{
"colC": "c",
"propA": 1,
"propB": "ABC"
}
- JSON pointers for array explode, if any. For example, given
{
"colA": "a",
"colB": [{
"colC": "c1",
"propA": 1,
"propB": "ABC1"
},{
"colC": "c2",
"propA": 2,
"propB": "ABC2"
}]
}
and jsonArrayPointers set to "/colB": {}, the source will be transformed to:
{"colC": "c1", "propA": 1, "propB": "ABC1"}
{"colC": "c2", "propA": 2, "propB": "ABC2"}
emitting 2 rows from 1 source file entry.
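The two transformations above can be reproduced outside the plugin to check that a schema matches the rows you expect. The following is a minimal Python sketch, not the plugin's implementation: `resolve_pointer`, `explode`, `all_fields_nullable` and the record name `Row` are hypothetical names introduced for illustration, and only the single-level `"/colB": {}` case from the example is covered. It assumes the usual Avro convention that a nullable field is a union containing `"null"`.

```python
import json
from typing import Any, Iterator


def resolve_pointer(doc: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON pointer such as "/colB" against a parsed document."""
    if pointer == "":
        return doc
    if not pointer.startswith("/"):
        raise ValueError(f"JSON pointer must be empty or start with '/': {pointer!r}")
    node = doc
    for raw in pointer[1:].split("/"):
        # RFC 6901 escapes: "~1" stands for "/", "~0" stands for "~" (decode in this order).
        token = raw.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def explode(doc: Any, array_pointer: str) -> Iterator[Any]:
    """Yield one row per element of the array found at array_pointer."""
    target = resolve_pointer(doc, array_pointer)
    if not isinstance(target, list):
        raise TypeError(f"{array_pointer!r} does not point to an array")
    yield from target


def all_fields_nullable(schema: dict) -> bool:
    """Return True if every top-level field type is a union that includes "null"."""
    return all(
        isinstance(field["type"], list) and "null" in field["type"]
        for field in schema["fields"]
    )


# Assumed schema for the rows produced by the examples above (illustrative only).
ROW_SCHEMA = {
    "type": "record",
    "name": "Row",  # hypothetical record name
    "fields": [
        {"name": "colC", "type": ["null", "string"], "default": None},
        {"name": "propA", "type": ["null", "int"], "default": None},
        {"name": "propB", "type": ["null", "string"], "default": None},
    ],
}

nested = json.loads('{"colA": "a", "colB": {"colC": "c", "propA": 1, "propB": "ABC"}}')
subset = resolve_pointer(nested, "/colB")

with_array = json.loads(
    '{"colA": "a", "colB": ['
    '{"colC": "c1", "propA": 1, "propB": "ABC1"},'
    '{"colC": "c2", "propA": 2, "propB": "ABC2"}]}'
)
rows = list(explode(with_array, "/colB"))
```

Here `subset` holds the object under `/colB`, `rows` holds the two exploded entries, and `all_fields_nullable(ROW_SCHEMA)` can be used as a quick check before handing the schema to the stream configuration.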
Install the following tools:
- mise - for managing tooling versions, environment variables: https://github.com/jdx/mise
- just - for orchestrating tasks: https://github.com/casey/just
- Docker/Docker Compose - for integration testing: https://www.docker.com/products/docker-desktop/
Once the above are installed, run mise install.
It will install other necessary tools (e.g. JDK and SBT) at recommended versions for this project only.
To build, test and run the project, the GITHUB_TOKEN environment variable needs to be set.
It is used to authenticate against the GitHub Maven package registry, specifically for JAR dependencies under
https://maven.pkg.github.com/SneaksAndData/arcane-framework-scala.
Create a new personal access token (PAT), for example a fine-grained token with "Public repositories" access and without explicit permissions.
Export the GITHUB_TOKEN environment variable before running any sbt commands.
For example, put the line export GITHUB_TOKEN=github_pat_xxx in your .zshrc/.bashrc file.
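If builds fail with authentication errors against the package registry, it can help to confirm that the variable is actually visible to the process that launches sbt. The snippet below is a small pre-flight sketch in Python; `check_github_token` is a hypothetical helper, not part of this project, and it only checks that the variable is present and non-empty, not that the token is valid.

```python
import os
from typing import Mapping


def check_github_token(env: Mapping[str, str] = os.environ) -> str:
    """Return GITHUB_TOKEN if it is set and non-empty, otherwise raise with a hint."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITHUB_TOKEN is not set; export it before running any sbt commands"
        )
    return token
```

Calling `check_github_token()` from the same shell session you use for the build shows whether the export in your shell profile has taken effect.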
- Building the project (fat JAR):
  just build
- Building Docker image:
  just docker-build [tag]
- Running integration tests:
  just it
- Running streaming application locally:
  - via just stream [--debug] or just backfill [--debug] (backfill mode). Note: dev.env is required, see dev.env.example for an example application configuration.
- Cleaning build artifacts:
  just clean
- Code style check:
  just check
A local K8S cluster (e.g. Kind) can be used to verify that the Arcane operator and its dependencies coming from Helm charts are set up correctly.
Furthermore, Arcane is lightweight enough that actual streams can be deployed on the local K8S cluster to, for example, try out or test features in a dev setup.
Kind itself should already be installed if you ran mise install. Next steps:
- Create Kind cluster:
  kind create cluster --name arcane-json-dev
- Create namespace:
  kubectl create namespace arcane --context kind-arcane-json-dev
- Install required CRDs:
  helm install arcane-crd oci://ghcr.io/sneaksanddata/helm/arcane-crd \
    --version vX.Y.Z \
    --namespace arcane \
    --kube-context kind-arcane-json-dev
- Install Arcane operator:
  helm install arcane oci://ghcr.io/sneaksanddata/helm/arcane-operator \
    --version vX.Y.Z \
    --namespace arcane \
    --kube-context kind-arcane-json-dev
- Build a Docker image for this project:
  mise docker-build kind-dev
- Load the Docker image to Kind cluster:
  kind load docker-image \
    ghcr.io/sneaksanddata/arcane-stream-json:kind-dev \
    --name arcane-json-dev
- Install chart from this project:
  helm upgrade --install arcane-json ./.helm \
    --kube-context kind-arcane-json-dev \
    --namespace arcane \
    --set image.repository=ghcr.io/sneaksanddata/arcane-stream-json \
    --set image.tag=kind-dev \
    --set image.pullPolicy=IfNotPresent

To be added...
The project uses Scala 3.8.3 and is tested on JDK 25.