- Jan 16, 2026
- 9 min read
Mastering Healthcare Analytics
Nessie effectively defines its own data stack, or more accurately, a specific and advanced version of the open-source cloud/hybrid stack.
It introduces a new philosophical layer focused on data governance and version control.
The core philosophy of a data lakehouse built with open-source tools like Nessie: You can deploy and host the entire stack on-premise or in any cloud (Azure, GCP, AWS), while actively avoiding vendor lock-in.
Why Nessie’s Stack Avoids Vendor Lock-in 🔒
The key to this stack’s portability is its decoupled and open-source nature.
Each component can be swapped out or moved to a different environment, giving you complete control.
- Open Formats and APIs: The stack is built on open standards. Apache Iceberg is an open table format, and MinIO implements the S3 API, the de facto open standard for object storage. Nessie itself provides an open REST API for its catalog. This means your data is stored in non-proprietary formats and is accessible by any tool that understands these standards, regardless of the vendor.
- Separate Compute and Storage: The architecture intentionally separates the storage layer (MinIO) from the query engine (Trino/PySpark). This is a stark contrast to monolithic, vendor-specific data warehouses like BigQuery or Snowflake, where compute and storage are tightly integrated and managed as a single service.
- Self-Hosted Components: You have the option to host all the components yourself. While it’s easier to run Trino on a managed service like GCP’s Dataproc, you can also run it on a set of virtual machines on GCP or on your own on-premise servers. The same applies to Nessie, PySpark, and MinIO, which can all be run as Docker containers or on Kubernetes (k8s) clusters.
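The "open formats and APIs" point can be made concrete. Because MinIO implements the S3 API, the same client configuration works against AWS S3, MinIO, or any other S3-compatible store; only the endpoint changes. Here is a minimal sketch (the endpoint URL is an illustrative placeholder, and the settings are meant to be passed to an S3 client such as boto3's):

```python
# Sketch: building S3-client settings that target either AWS S3 or a
# self-hosted MinIO instance. The API is the same; only the endpoint differs.

def s3_client_settings(endpoint_url=None):
    """Return keyword arguments for an S3-compatible client,
    e.g. boto3.client(**settings). endpoint_url=None targets AWS S3."""
    settings = {"service_name": "s3"}
    if endpoint_url is not None:
        # Point at a self-hosted MinIO (or any S3-compatible) endpoint.
        settings["endpoint_url"] = endpoint_url
    return settings

aws_settings = s3_client_settings()
minio_settings = s3_client_settings("http://localhost:9000")  # assumed MinIO port
```

Swapping `endpoint_url` is the whole migration story at the storage layer, which is precisely why this design avoids lock-in.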
Hosting the Nessie Stack on a Public Cloud
Even when you choose to run this stack on a public cloud like GCP or Azure, you’re not locked into their proprietary services.
You’re simply using their infrastructure as a service (IaaS).
- Cloud Infrastructure: You can use virtual machines (VMs) or container services (like GKE on GCP or AKS on Azure) to host the various components. This is similar to hosting a website on a VM; you’re using the cloud provider’s hardware, but the software running on it is your choice.
- Managed Services as a “Crutch”: You can use managed services like Cloud Composer (GCP’s managed Airflow) to reduce operational overhead, but you can always switch to a self-hosted Airflow instance on your own VMs if you decide to migrate away. This is a deliberate trade-off between convenience and portability.
In summary, the Nessie data stack is designed to be portable and anti-lock-in.
You get the flexibility to choose where your data resides, which tools you use to process it, and which platforms you use to host it.
This “Nessie Stack” is a sophisticated variation of the Snowflake, dbt, Superset/Redash stack, but with a different core.
Instead of a centralized data warehouse, it’s a data lakehouse architecture where the data resides in low-cost storage, but is managed with the quality and control of a warehouse.
The Nessie Data Stack
This stack prioritizes flexible data storage, transactional integrity, and collaborative versioning.
| Layer | Component | Who Uses It | What They Do |
|---|---|---|---|
| Data Lake | MinIO or S3 | Data Engineers | The scalable, low-cost storage layer for raw data files (e.g., Parquet, Avro). It’s the foundation of the data lakehouse. |
| Table Format | Apache Iceberg | Data Engineers | An open-source table format that adds a transactional, structured layer on top of your data lake files. It enables features like schema evolution and time travel. |
| Data Catalog | Project Nessie | Analytics Engineers, Data Scientists | A Git-like data catalog that manages the metadata for Iceberg tables. It enables **branching, committing, and tagging** of data, allowing for isolated work and version control. |
| Query & Transformation | Trino or PySpark | Data Engineers, Data Analysts | High-performance query engines that connect to Nessie. They read the table metadata from Nessie and execute queries directly against the data files in the data lake. |
| BI & Visualization | Superset/Redash | Data Analysts, Business Users | Connect to Trino or PySpark. These tools provide the user interface for building dashboards and reports on top of the tables managed by Nessie. |
How It’s Different from the Other Stacks
- Decoupled Architecture: This stack is defined by its separation of compute (Trino/PySpark) and storage (MinIO/S3). In contrast, BigQuery and Snowflake are monolithic, with compute and storage tightly integrated.
- Metadata-Driven: The entire stack revolves around the data catalog (Nessie) and the table format (Iceberg). The catalog isn’t just a discovery tool; it’s a central control plane for managing the entire data lake.
- Version Control: The most unique feature is the application of Git-like principles to data. This allows for isolated experimentation and a high degree of data governance that is difficult to achieve in other stacks without manual processes.
In short, the Nessie stack represents a true data lakehouse architecture, bringing the best of a data warehouse’s functionality to a data lake’s flexible and low-cost storage.
Nessie and Apache Iceberg
Nessie is an open-source, Git-like data catalog that adds version control semantics to your data lake.
Its namespaces are a key part of this, acting like file folders to organize tables within the catalog.
Instead of being a separate tool, Nessie fits into a data stack as a central metadata service that provides a single, versioned view of your data to various query engines.
Nessie’s Place in the Data Stack 🗺️
Nessie isn’t a replacement for a data warehouse, a transformation tool, or a BI platform.
It’s a foundational component that sits on top of your data lake (like GCS, S3, or MinIO) and provides a crucial layer of metadata and version control.
Here’s how Nessie fits into this open-source stack:
- Data Lake (MinIO/S3): This is where your actual data files are stored (e.g., in Parquet, Avro, or ORC format).
- Table Format (e.g., Apache Iceberg): This layer adds a structured, transactional layer on top of your data files. It provides features like schema evolution and time travel.
- Data Catalog (Nessie): This is where Nessie fits in. It acts as the central metastore that tracks the metadata for your Iceberg tables. It tells query engines like Trino or Spark where the data for each table is located, but its unique feature is that it does so with Git-like semantics like branches, commits, and tags.
- Query Engines (Trino, PySpark): These engines connect to Nessie to find out which tables exist, their schemas, and where the data is stored. Nessie’s versioning allows a data scientist to create a branch of the catalog to test a new model without affecting the main branch.
- BI & Visualization (Superset/Redash): These tools connect to a query engine (like Trino), which in turn uses Nessie to resolve its queries.
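To show how a query engine is actually pointed at Nessie, here is a configuration sketch for PySpark using Iceberg’s Nessie catalog integration. Treat it as an untested outline: the Nessie URI, warehouse location, and branch name are assumptions you would adapt to your deployment, and the cluster must have the matching Iceberg/Nessie jars on its classpath.

```python
# Sketch: configuring a SparkSession to use Nessie as an Iceberg catalog.
# The URI, warehouse path, and branch ("ref") below are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-demo")
    # Register a catalog named "nessie" backed by Iceberg's NessieCatalog.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")        # Git-like branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

# Tables are then addressed as nessie.<namespace>.<table>, e.g.:
# spark.sql("SELECT * FROM nessie.production.marketing.sales")
```

Changing the `ref` property is all it takes to point the whole session at a different branch of the catalog.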
What are Nessie Namespaces?
In Nessie, a namespace is a logical container for tables, similar to a database or a schema in a traditional relational database.
For example, in the table name production.marketing.sales, production.marketing is the namespace and sales is the
table name.
Namespaces are important because they provide:
- Organization: They help you organize a large number of tables into a logical hierarchy, which is crucial for data discovery.
- Access Control: You can apply permissions and security policies at the namespace level, making it easier to manage who can access what data.
A key feature of Nessie is that its namespaces are versioned along with the rest of the catalog. If you add a new table to a namespace, that change is a commit to the catalog’s history.
This lets you track not just changes to the data itself, but also changes to its organization and structure.
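To make the naming scheme concrete, here is a tiny, self-contained sketch of how a dotted table identifier splits into a namespace and a table name. This is pure illustration; Nessie’s client libraries handle identifier parsing for you:

```python
# Sketch: splitting a dotted identifier like "production.marketing.sales"
# into its namespace ("production.marketing") and table name ("sales").

def split_identifier(identifier: str) -> tuple[str, str]:
    *namespace_parts, table = identifier.split(".")
    return ".".join(namespace_parts), table

namespace, table = split_identifier("production.marketing.sales")
# namespace == "production.marketing", table == "sales"
```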
What is Apache Iceberg?
Apache Iceberg is an open-source table format for data lakes. Think of it as an intelligent layer that sits on top of your data lake files (e.g., in S3 or GCS), bringing database-like capabilities to your data.
It addresses the limitations of simply having a directory of files by providing a robust way to manage large, analytical tables.
Key Features of Apache Iceberg 🧊
Iceberg solves critical problems that arise from traditional file-based data lakes:
- ACID Transactions: It ensures that write operations (like appends, deletes, and updates) are atomic and consistent, so you don’t get corrupt data from concurrent jobs.
- Time Travel: Iceberg keeps a history of table snapshots, allowing you to query data as it existed at a specific point in time. This is invaluable for auditing or reproducing past analyses.
- Schema and Partition Evolution: It allows you to safely change your table’s schema (add, remove, or rename columns) and even change how the data is partitioned without a full table rewrite.
- Hidden Partitioning: Iceberg manages partitioning for you. You can define a partition strategy, but users don’t have to specify partition columns in their queries. This makes queries more flexible and less error-prone.
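The snapshot mechanism behind time travel can be illustrated with a deliberately simplified model. This is not Iceberg’s actual implementation (which tracks snapshots through metadata and manifest files), just a sketch of the concept: every write produces a new immutable snapshot, and reads can target either the latest snapshot or an older one.

```python
# Toy model of snapshot-based time travel: appends never mutate history,
# they create a new snapshot; scans can target any past snapshot id.

class ToyTable:
    def __init__(self):
        self._snapshots = []          # list of immutable row tuples

    def append(self, rows):
        current = self._snapshots[-1] if self._snapshots else ()
        self._snapshots.append(current + tuple(rows))
        return len(self._snapshots) - 1   # new snapshot id

    def scan(self, snapshot_id=None):
        if not self._snapshots:
            return ()
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return self._snapshots[snapshot_id]

t = ToyTable()
s0 = t.append([{"id": 1}])
t.append([{"id": 2}])
assert t.scan() == ({"id": 1}, {"id": 2})   # latest state
assert t.scan(s0) == ({"id": 1},)           # "time travel" to snapshot 0
```

In real Iceberg the same idea appears as `SELECT ... FOR VERSION AS OF`-style queries against retained snapshots, which is what makes audits and reproducible analyses possible.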
How Apache Iceberg Relates to Project Nessie
Apache Iceberg and Project Nessie are separate but complementary projects; you can’t have a Nessie stack without Iceberg. Their relationship is best understood as Table Format vs. Catalog.
- Iceberg is the Table Format: It’s the specification for how your data files are organized and how metadata is stored to provide ACID transactions and time travel for a single table. It creates its own set of metadata files to manage snapshots, manifests, and data file locations.
- Nessie is the Catalog: Nessie is an open-source catalog that serves as a central registry for all your Iceberg tables. Its key innovation is bringing Git-like version control to the entire data catalog, not just a single table.
| Feature | Apache Iceberg | Project Nessie |
|---|---|---|
| Purpose | Table format | Data catalog |
| Focus | Single-table versioning and metadata management (snapshots). | Multi-table versioning and metadata management (branches, commits, tags). |
| Analogy | A file system with a built-in log of changes. | The Git repository that manages multiple file systems. |
Nessie’s value comes from its ability to orchestrate and version multiple Iceberg tables together. For example:
- You can create a branch to perform a series of transformations on multiple Iceberg tables in an isolated environment.
- You can then commit those changes to the main branch, making the entire set of transformations visible as a single, atomic update.
- You can tag a specific state of your catalog to mark a “production release” for reporting.
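The branch/commit/tag workflow above can be sketched with a toy in-memory model. This is purely illustrative of the semantics (the real Nessie server exposes them over a REST API and versions Iceberg metadata pointers, not strings):

```python
# Toy model of Nessie-style catalog versioning: branches point at commits,
# each commit is an immutable snapshot of table metadata, tags are frozen refs.

class ToyCatalog:
    def __init__(self):
        self._commits = [{}]                 # commit 0: empty catalog
        self.branches = {"main": 0}
        self.tags = {}

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch, table, metadata):
        # A commit snapshots the whole catalog state, then advances the branch.
        state = dict(self._commits[self.branches[branch]])
        state[table] = metadata
        self._commits.append(state)
        self.branches[branch] = len(self._commits) - 1

    def merge(self, source, into="main"):
        # Fast-forward style: the target adopts the source's commit atomically,
        # so every table change on the branch becomes visible at once.
        self.branches[into] = self.branches[source]

    def tag(self, name, branch="main"):
        self.tags[name] = self.branches[branch]

    def tables(self, ref):
        return self._commits[self.branches.get(ref, self.tags.get(ref))]

cat = ToyCatalog()
cat.create_branch("etl-experiment")
cat.commit("etl-experiment", "marketing.sales", "v2-schema")
assert cat.tables("main") == {}                 # main stays isolated
cat.merge("etl-experiment")                     # atomic multi-table publish
cat.tag("release-2026-01")
assert cat.tables("main") == {"marketing.sales": "v2-schema"}
```

The key property the sketch captures is that readers on `main` never see half-finished work: the branch is invisible until the merge swings the pointer.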
In short, Iceberg provides the low-level, transactional guarantee for a single table, while Nessie provides the high-level, multi-table version control and collaborative framework that makes a data lakehouse a true production-ready environment.