Databricks Unity Catalog 101: A Complete Overview (2024)

Data plays a role in nearly every part of daily life, and with so much of it in circulation, keeping it safe, accurate, and well-organized matters. That's where data governance comes in: setting up guidelines and using the right tools to manage data throughout its lifecycle, which involves controlling who can access the data, understanding where it comes from, and making sure it's protected. To help with this, Databricks created Unity Catalog, a unified governance layer for data and AI within the Databricks platform. It lets users and organizations govern everything in one place, from structured and unstructured data to ML models, notebooks, dashboards, and files, across any cloud or platform.

In this article, we will cover Databricks Unity Catalog's architecture, features, and best practices for effective data governance within the Databricks platform, along with a simple step-by-step guide to setting up and managing Unity Catalog.

What Is the Unity Catalog in Databricks?

Databricks announced Unity Catalog at the Data and AI Summit in 2021 to address the complexities of data governance by providing fine-grained access control within the Databricks ecosystem. Before Unity Catalog, data governance in Databricks was typically handled by various third-party and open source tools. While effective, these tools often lacked seamless integration with the Databricks ecosystem and granular security controls tailored for data lakes, and some were limited to certain cloud platforms. Unity Catalog was created to address exactly these challenges: fine-grained access control that works across cloud platforms and integrates natively with Databricks.

Databricks Unity Catalog serves as a centralized governance layer within the Databricks Data Intelligence Platform, streamlining the management and security of various data and AI assets. It supports a wide range of assets including files, tables, machine learning models, notebooks, and dashboards. Unity Catalog uniquely identifies each asset type, simplifying access control and ensuring that only authorized users can interact with specific data elements.

Databricks Unity Catalog diagram (Source: databricks.com)

If you’re transitioning from Hive metastore to Databricks Unity Catalog, rest assured that Unity Catalog enhances the Hive metastore experience. It not only supports the same functionalities but also introduces advanced governance features. Unity Catalog enables fine-grained access control, allowing you to designate column-level permissions to safeguard sensitive data, such as Personally Identifiable Information (PII). It also provides visibility into data lineage, tracing the flow and transformation of your data across different processes.

But that's not all! Databricks Unity Catalog also allows you to share selected tables across different platforms and clouds using Delta Sharing.

Key features and benefits of Databricks Unity Catalog:

Define once, secure everywhere: Databricks Unity Catalog offers a single place to administer data access policies that apply across all workspaces and user personas, ensuring consistent governance.
Standards-compliant security model: Databricks Unity Catalog's security model is based on standard ANSI SQL, allowing administrators to grant permissions using familiar syntax at various levels (catalogs, schemas, tables, and views).
Built-in auditing and lineage: Databricks Unity Catalog automatically captures user-level audit logs and data lineage, tracking data access and usage patterns for compliance and troubleshooting purposes.
Data discovery: Databricks Unity Catalog enables tagging and documenting data assets, providing a search interface to help users easily find and understand the data they need.
System tables (Public Preview): Databricks Unity Catalog provides access to operational data, including audit logs, billable usage, and lineage, through system tables.
Managed storage: Databricks Unity Catalog supports managed storage locations at the metastore, catalog, or schema levels, enabling organizations to isolate data physically in their cloud storage based on governance requirements.
External data access: Databricks Unity Catalog allows registering and governing access to external data sources, such as cloud object storage, databases, and data lakes, through external locations and Lakehouse Federation.

Source: Databricks

Databricks Unity Catalog Architecture Breakdown

Here’s an architecture breakdown of Databricks Unity Catalog:

1) Unified Governance Layer

Databricks Unity Catalog offers a unified governance layer for both structured and unstructured data, tables, machine learning models, notebooks, dashboards and files across any cloud or platform. This layer enables organizations to govern their data and AI assets seamlessly, ensuring regulatory compliance and accelerating data initiatives.

Unified Governance Layer of Databricks Unity Catalog (Source: databricks.com)

2) Data Discovery

Finding the right data for your needs is simple with Databricks Unity Catalog's discovery features. You can tag and document your data assets, and then use the search interface to locate the specific data you need based on keywords, tags, or other metadata.

Databricks Unity Catalog's data discovery features
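
As a rough illustration of the tagging and documenting step, the SQL below adds a comment and tags to a table. The table demo_catalog.sales.orders, the customer_email column, and the tag keys are hypothetical names chosen for this sketch. It assumes compute recent enough to support SET TAGS and that you hold the privileges needed to modify and tag the table.

```sql
-- Hypothetical table, column, and tag names, for illustration only.

-- Document the table so it is easier to understand when it shows up in search.
COMMENT ON TABLE demo_catalog.sales.orders IS 'One row per customer order, loaded nightly.';

-- Table-level tags that users can search on.
ALTER TABLE demo_catalog.sales.orders SET TAGS ('domain' = 'sales', 'contains_pii' = 'true');

-- Column-level tag flagging a sensitive column (assumed to exist in this table).
ALTER TABLE demo_catalog.sales.orders ALTER COLUMN customer_email SET TAGS ('pii' = 'email');
```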

3) Access Control and Security

Databricks Unity Catalog simplifies access management by providing a single interface to define access policies on data and AI assets. It supports fine-grained control on rows and columns and manages access through low-code attribute-based access policies that scale seamlessly across different clouds and platforms.

Access Control and Security - Databricks Unity Catalog

4) Auditing and Lineage

Databricks Unity Catalog automatically captures audit logs that record who accessed which data assets and when. It also tracks the lineage of your data, so you can see how assets were created and how they're being used across different languages and workflows. This lineage information is crucial for understanding data flows and dependencies.

Auditing and Lineage - Databricks Unity Catalog
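
To make the audit side more concrete, here is a minimal sketch of a query against the audit log system table. It assumes system tables are enabled for your account, that you have been granted access to the system.access schema, and that the column names match the audit log schema in your workspace, so check them before relying on it.

```sql
-- Sketch only: recent Unity Catalog audit events from the last 7 days.
-- Assumes the system.access.audit system table is enabled and readable.
SELECT
  event_time,
  user_identity.email AS user_email,  -- who performed the action
  action_name                         -- what they did
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC
LIMIT 100;
```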

5) Data Sharing

Databricks Unity Catalog integrates with open source Delta Sharing, which allows you to securely share data and AI assets across clouds, regions, and platforms without relying on proprietary formats or complex ETL processes.

Databricks Delta Sharing - Databricks Unity Catalog
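
A data provider's side of Delta Sharing can be sketched in a few statements. The share name sales_share, the recipient partner_co, and the table demo_catalog.sales.orders are hypothetical, and the sketch assumes Delta Sharing is enabled on the metastore and that you hold the privileges to create shares and recipients.

```sql
-- Hypothetical share, recipient, and table names, for illustration only.

-- A share is a named collection of assets to expose to recipients.
CREATE SHARE IF NOT EXISTS sales_share COMMENT 'Curated sales tables for partners';

-- Add an existing Unity Catalog table to the share.
ALTER SHARE sales_share ADD TABLE demo_catalog.sales.orders;

-- A recipient represents the party receiving the data.
CREATE RECIPIENT IF NOT EXISTS partner_co;

-- Give the recipient read access to everything in the share.
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co;
```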

6) Object Model

Databricks Unity Catalog organizes your data and AI assets into a hierarchical structure: Metastore ► Catalog ► Schema ► Tables, Views, Volumes, and Models. At the top level, you have the metastore, which contains your catalogs, and each catalog contains schemas. Within each schema, you can have tables, views, volumes (for unstructured data), and models. To reference any asset, you use a three-part naming convention: <catalog>.<schema>.<asset>. We will dive deeper into the Unity Catalog object model in the next section.

7) Operational Intelligence

Databricks Unity Catalog provides AI-powered monitoring and observability capabilities that give you deep insights into your data and AI assets. You can set up active alerts, track data lineage at the column level, and gain comprehensive visibility into how your assets are being used and managed.

Databricks Unity Catalog (Source: databricks.com)

Databricks Unity Catalog Object Model

Databricks Unity Catalog follows a hierarchical architecture with the following components: Metastore ► Catalog ► Schema ► Tables, Views, Volumes, and Models.

Metastore: The top-level container for metadata, exposing a three-level namespace (catalog.schema.table) to organize data assets.
Catalog: The first layer of the object hierarchy, used to organize data assets logically, often aligned with organizational units or data domains.
Schema: Also known as databases, schemas are the second layer of the object hierarchy, containing tables, views, and volumes.
Tables, Views, and Volumes: The lowest level in the data object hierarchy, where:
- Table: A structured data asset that represents a collection of rows with a defined schema.
- View: A virtual table that is defined by a query.
- Volume: A container for unstructured (non-tabular) data files.
Models: Machine learning models registered in the MLflow Model Registry can also be managed within Databricks Unity Catalog.
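
One way to get a feel for this hierarchy is to walk it from the top down in SQL. The sketch below assumes a catalog named demo_catalog with a sales schema and an orders table. All three names are hypothetical, and you would need USE CATALOG, USE SCHEMA, and SELECT on them for the statements to succeed.

```sql
-- Hypothetical names; substitute a catalog, schema, and table you can access.

SHOW CATALOGS;                        -- catalogs in the metastore
SHOW SCHEMAS IN demo_catalog;         -- schemas inside one catalog
SHOW TABLES IN demo_catalog.sales;    -- tables and views inside one schema
SHOW VOLUMES IN demo_catalog.sales;   -- volumes inside the same schema

-- Fully qualified, three-part reference to a table.
SELECT * FROM demo_catalog.sales.orders LIMIT 10;

-- Alternatively, set the defaults once and use the short name.
USE CATALOG demo_catalog;
USE SCHEMA sales;
SELECT * FROM orders LIMIT 10;
```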

What Are the Supported Compute and Cluster Access Modes of Databricks Unity Catalog?

To fully leverage the capabilities of Unity Catalog, clusters must operate on compatible Databricks Runtime versions and be configured with appropriate access modes.

Databricks Unity Catalog is supported on clusters running Databricks Runtime 11.3 LTS or later. All SQL warehouse compute versions inherently support Unity Catalog, ensuring seamless integration with the latest data governance features. Clusters running on earlier Databricks Runtime versions may not provide support for all Databricks Unity Catalog features and functionality.

Unity Catalog is secure by default, meaning that if a cluster is not configured with one of the Unity Catalog-capable access modes, it cannot access data in Unity Catalog. Not all access modes are compatible with Unity Catalog. Here's a breakdown of the supported modes:

Supported Access Modes:

Shared Access Mode: This is recommended for sharing clusters among multiple users. It provides a balance between isolation and collaboration.
Single User Access Mode: This mode is ideal for automated jobs and machine learning workloads where a single user is responsible for the computations. It offers the highest level of isolation.

Unsupported Modes:

No-Isolation Shared Mode: This is a legacy mode that doesn't meet Unity Catalog's security requirements.

Databricks recommends using compute policies to simplify configuration for managing Unity Catalog access on clusters. There are certain limitations associated with using Databricks Unity Catalog in different access modes. These limitations can affect aspects like init scripts, libraries, network access, and file system access.

Check out this article for compute access mode limitations for Databricks Unity Catalog.

What Are the Supported Regions and Data File Formats of Databricks Unity Catalog?

Databricks Unity Catalog is designed to be region-agnostic, meaning it adapts to the specific region where your Databricks workspace is located. Given that Databricks provides workspaces across a range of cloud service providers, Unity Catalog can be conveniently leveraged wherever your workspace is deployed.

Check out this official documentation on the list of supported Databricks clouds and regions.

When it comes to data formats, Databricks Unity Catalog supports different file formats for managed and external tables:

Managed Tables: Managed tables in Databricks Unity Catalog must use the Delta table format.
External (Unmanaged) Tables: External tables in Databricks Unity Catalog can use various file formats, including Delta, CSV, JSON, Avro, Parquet, ORC, and Text.
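
The difference between the two table types shows up in the DDL. In this sketch, the table names are hypothetical and the S3 path is a placeholder. The external table assumes the path is already covered by an external location you are allowed to create tables on.

```sql
-- Hypothetical names and a placeholder path, for illustration only.

-- Managed table: no LOCATION clause, so Unity Catalog chooses and manages
-- the storage, and the table is stored in Delta format.
CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders_managed (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
);

-- External table: data stays at a path you control and can use another
-- supported format, here Parquet.
CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders_external (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
)
USING PARQUET
LOCATION 's3://<bucket>/<path>/orders/';
```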

What Is the Difference Between Unity Catalog and Hive Metastore?

Databricks Unity Catalog and Hive Metastore are both metadata management systems, but they serve different purposes and have distinct functionalities within their respective ecosystems. Here's a table that highlights the key differences between Databricks Unity Catalog and Hive Metastore:

Databricks Unity Catalog	Hive Metastore
Databricks Unity Catalog is a centralized service for managing data governance and access control across workspaces in the Databricks platform	Hive Metastore is a central repository for storing metadata about Hive databases, tables, partitions, and other objects in the Apache Hive data warehousing system
Databricks Unity Catalog supports a wide range of data sources, including Apache Spark tables, Delta Lake tables, AWS S3, Azure Blob Storage, HDFS, and more.	Hive Metastore is primarily designed for Hive tables and databases, but can also store metadata for external data sources like HDFS or cloud storage
Databricks Unity Catalog provides APIs and tools for managing and updating metadata, enabling automated metadata capture and synchronization with external metadata sources	Metadata management is primarily done through Hive commands or directly interacting with the underlying database
Databricks Unity Catalog offers fine-grained access control and data lineage tracking, allowing administrators to define and enforce policies for data access and modification	Access control is typically handled through Hadoop permissions or external tools like Apache Ranger
Databricks Unity Catalog is designed specifically for Databricks, offering seamless integration and collaboration within the platform	Hive Metastore is primarily designed for Hadoop-based environments, but can be used with other systems that support the Hive Metastore interface
Databricks Unity Catalog facilitates data sharing and collaboration by allowing users to grant and revoke access to data assets across different environments and teams	In Hive Metastore, data sharing is typically achieved through Hadoop permissions or external tools like Apache Ranger
Databricks Unity Catalog is tightly integrated with the Databricks Unified Analytics Platform and other components of the Databricks ecosystem	Hive Metastore integrates with the Apache Hive ecosystem and can be used with other tools like Apache Spark, Apache Impala, and Apache Ranger
Databricks Unity Catalog is designed to handle large-scale data and metadata operations with high performance and scalability	In Hive Metastore, scalability and performance can vary depending on the underlying database and configuration
Databricks Unity Catalog provides a searchable interface for data discovery and exploration	Metadata management is typically done through Hive commands or directly interacting with the underlying database

How to Create Unity Catalog Metastore (AWS)

To create a Unity Catalog metastore in the AWS cloud environment, make sure you meet the following prerequisites, then follow the steps below:

Prerequisites

A Databricks account with admin privileges (Premium plan or above).
The permissions and IAM roles needed to create resources in your AWS account.
The AWS region where you want to create the Databricks Unity Catalog metastore.
A unique name for your Unity Catalog metastore.
An AWS S3 bucket location to store managed data for the metastore (optional).

Step 1—Configure Storage

Before creating a Databricks Unity Catalog metastore, you may want to create an S3 bucket to store data that is managed at the metastore level. This is optional, as you can use the default bucket provided by Databricks.

Follow the Create your first S3 bucket guide if needed.

Step 2—Create an IAM Role to Access the Storage Location

Next, create an AWS IAM role to allow Databricks to access the S3 bucket you created in Step 1. This role should have the necessary permissions to read and write data to the bucket.

Follow the Creating IAM roles guide if needed.

Step 3—Create the Metastore and Attach a Workspace

Finally, create the Unity Catalog metastore in the Databricks account console. To do so:

1) Log in to the Databricks console and navigate to the Data option.

Navigating to the Data option in the Databricks console - Databricks Unity Catalog

2) Click "Create Metastore".

Creating Metastore - Databricks Unity Catalog

3) Provide a name and choose the region for the metastore (same region as your workspaces).

Providing a name and choosing the region for the metastore - Databricks Unity Catalog

4) Optionally, attach the storage location you created in step 1.

5) Assign the metastore to your workspace.

Assign metastore to workspace - Databricks Unity Catalog

Check out this documentation for a full guide on creating a Databricks Unity Catalog metastore in AWS.

Step-By-Step Guide to Enable Your Workspace for Databricks Unity Catalog

Enabling Databricks Unity Catalog on your workspace is a straightforward process that can be done through the Databricks account console or during workspace creation. Follow these step-by-step instructions to enable Databricks Unity Catalog:

Step 1—Log in to the Databricks Account

Step 2—Click on “Data” Option

Click on the "Data" option in the left-hand navigation panel.

Navigating to “Data” Option - Databricks Unity Catalog

Step 3—Access the Metastore

Access the metastore by clicking on the metastore name.

Accessing the Metastore - Databricks Unity Catalog

Step 4—Navigate to the Workspaces Tab

Within the metastore, head over to the "Workspaces" tab.

Navigate to the “Workspaces” Tab within metastore - Databricks Unity Catalog

Step 5—Assign to Workspaces

Click the "Assign to workspaces" button to enable Databricks Unity Catalog on one or more workspaces.

Step 6—Choose One or More Workspaces to Enable

In the "Assign Workspaces" dialog, select one or more workspaces you want to enable for Databricks Unity Catalog.

Choosing one or more workspaces to enable - Databricks Unity Catalog

Step 7—Assign and Confirm

Click "Assign", and then confirm by clicking "Enable" on the dialog that appears.

Optional—Enable Databricks Unity Catalog When Creating a Workspace

If you are creating a new workspace, you can enable Databricks Unity Catalog during the workspace creation process:

1) Toggle the "Enable Unity Catalog" option

Toggling the "Enable Unity Catalog" option - Databricks Unity Catalog

2) Select the metastore you want to associate with the new workspace

3) Confirm by clicking "Enable"

4) Complete the process by providing the necessary configuration settings and clicking "Save"

Step 8—Confirm the workspace assignment

Confirm the workspace assignment by checking the "Workspaces" tab within the metastore. The workspace(s) you enabled should now be listed.

Confirming workspace assignment - Databricks Unity Catalog

Once you've completed these steps, Databricks Unity Catalog is enabled for your workspace(s), and you can take advantage of its governance capabilities within the Databricks ecosystem.

Step-By-Step Guide to Set up and Manage Unity Catalog

Setting up and managing Databricks Unity Catalog involves several steps to ensure proper configuration, access control, and object management. Here's a step-by-step guide to help you through the process:

Prerequisites

Databricks account with Admin privileges (Premium plan or above)
Unity Catalog metastore already created and associated with your workspace(s)
Familiarity with SQL and data management concepts
Appropriate roles and privileges, depending on the workspace enablement status (account admin, metastore admin, or workspace admin).

Step 1—Check That Your Workspace Is Enabled for Unity Catalog

To verify if your Databricks workspace is enabled for Databricks Unity Catalog, sign in to your Databricks account as an account admin. Click on the Workspaces icon and locate your workspace. Check the Metastore column—if there's a metastore name listed, your workspace is connected to a Databricks Unity Catalog metastore, indicating Unity Catalog is enabled.

Checking workspace is enabled for Databricks Unity Catalog

Alternatively, run a quick SQL query in the SQL query editor or a notebook connected to a cluster:

SELECT CURRENT_METASTORE();

If the query result shows a metastore ID, your workspace is attached to a Unity Catalog metastore.

Step 2—Add Users and Assign Roles

As a workspace admin, you can add and invite users, assign admin roles, and create service principals and groups. Account admins can manage users, service principals, and groups in your workspace and grant admin roles.

Check out this documentation for more information on managing Databricks users.

Step 3—Set up Compute Resources for Running Queries and Creating Objects

Unity Catalog workloads require compute resources like SQL warehouses or clusters. These resources must meet certain security requirements to access data and objects within Unity Catalog. SQL warehouses are always compliant, but cluster access modes may vary.

As a workspace admin, decide whether to restrict compute resource creation to admins or allow users to create their own SQL warehouses and clusters. To ensure compliance, set up cluster policies that guide users in creating Unity Catalog-compliant resources.

See “Supported Compute and Cluster Access Modes of Databricks Unity Catalog” section above for more details.

Step 4—Grant User Privileges

To create and access objects in Databricks Unity Catalog catalogs and schemas, users must have proper permissions. Let's discuss default user privileges and how to grant additional privileges.

Default User Access:

If your workspace launched with an auto-provisioned workspace catalog, all workspace users can create objects in the default schema of that catalog by default.
If your workspace was manually enabled for Unity Catalog, it has a main catalog. Users have the USE CATALOG privilege on this main catalog by default, allowing them to work with objects but not create new ones initially.
Some workspaces have no default catalogs or user privileges set up.

Default Admin Privileges:

If auto-enabled for Databricks Unity Catalog, workspace admins can create new catalogs, objects within them, and grant access to others. There's no designated metastore admin initially.
If manually enabled, workspace admins have no special Unity Catalog privileges by default. Metastore admins must exist and have full control.

Granting More Access:

To allow a user group to create new schemas in a catalog you own, run this SQL command:

GRANT CREATE SCHEMA ON CATALOG <my-catalog> TO `data-consumers`;

or for the auto-provisioned workspace catalog:

GRANT CREATE SCHEMA ON CATALOG <workspace-catalog> TO `data-consumers`;

You can also manage privileges through the Catalog Explorer UI.

Note: You can only grant privileges to account-level groups, not the workspace-local admin/user groups.

Step 5—Create Catalogs and Schemas

To start using Unity Catalog, you need at least one catalog, which is the primary way to organize and isolate data in Unity Catalog. Catalogs contain schemas, tables, volumes, views, and models. Some workspaces don't have an automatically-provisioned catalog, and in these cases, a workspace admin must create the first catalog.

Other workspaces have access to a pre-provisioned catalog that users can immediately use, either the workspace catalog or the main catalog, depending on how Unity Catalog was enabled for your workspace. As you add more data and AI assets to Databricks, you can create additional catalogs to logically group assets, simplifying data governance.
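
As a minimal sketch of this step, the statements below create a catalog and a schema and then open them up to a group. The catalog finance and the schema reporting are hypothetical names. The data-consumers group is the one used in the previous step, and the sketch assumes you are allowed to create catalogs in the metastore.

```sql
-- Hypothetical catalog and schema names, for illustration only.

-- A catalog to group the assets of one data domain.
CREATE CATALOG IF NOT EXISTS finance COMMENT 'Finance domain assets';

-- A schema inside that catalog.
CREATE SCHEMA IF NOT EXISTS finance.reporting COMMENT 'Curated reporting tables';

-- Users need USE CATALOG and USE SCHEMA before they can work with
-- objects inside the schema.
GRANT USE CATALOG ON CATALOG finance TO `data-consumers`;
GRANT USE SCHEMA ON SCHEMA finance.reporting TO `data-consumers`;
```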

If you follow these steps, you'll have a well-configured Unity Catalog environment with users, computing resources, and a data organization structure ready to go.

How to Control Access to Data and Other Objects in Databricks Unity Catalog?

Databricks Unity Catalog provides a comprehensive access control model for governing data and objects within the Lakehouse Platform. Let's explore key concepts and approaches for managing access privileges in Databricks Unity Catalog.

1) Understanding Admin Privileges

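At the highest level, account admins and metastore admins manage the Unity Catalog environment: account admins create metastores and assign them to workspaces, while metastore admins can manage ownership and privileges across the catalog hierarchy. Workspace admins manage workspace-level settings, and their Unity Catalog privileges depend on how the workspace was enabled (see the default admin privileges above).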

2) Object Ownership Model

Every securable object (e.g. catalogs, schemas, tables, views, functions) in Unity Catalog has an assigned owner. By default, the creator of an object becomes its owner. Object owners have complete privileges over that asset, including the ability to grant access to other users or groups.

3) Privilege Inheritance Hierarchy

Databricks Unity Catalog leverages a privilege inheritance model, where access permissions propagate from higher to lower levels. For instance, granting a privilege at the catalog level automatically applies it to all schemas, tables, and views within that catalog. Similarly, schema-level privileges apply to all contained objects.
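
To see inheritance in action, you can grant at the catalog level and then inspect a table further down. This is a sketch with hypothetical names (demo_catalog, its sales.orders table, and an account-level group called analysts), run as a user who is allowed to grant on the catalog.

```sql
-- Hypothetical catalog, table, and group names, for illustration only.

-- One grant at the catalog level covers current and future schemas and
-- tables inside the catalog.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG demo_catalog TO `analysts`;

-- Inspect what the group can do on a table that lives inside the catalog;
-- privileges inherited from the catalog should be reflected here.
SHOW GRANTS `analysts` ON TABLE demo_catalog.sales.orders;
```

Granting high in the hierarchy keeps the number of grants small, at the cost of being broader, so reserve catalog-level grants for groups that really need everything underneath.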

4) Basic Object Privileges

You can grant or revoke privileges such as SELECT and MODIFY (which covers inserting, updating, and deleting data) on securable objects using standard SQL syntax (GRANT and REVOKE statements). Metastore admins, object owners, or owners of the parent catalog/schema can execute these statements.
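
A short sketch of the GRANT and REVOKE pattern on a single table follows. The table and group names are hypothetical. Note that Unity Catalog expresses write access to a table through the MODIFY privilege, and that the groups also need USE CATALOG and USE SCHEMA on the parent objects before these table privileges take effect.

```sql
-- Hypothetical table and group names, for illustration only.

-- Read-only access for one group.
GRANT SELECT ON TABLE demo_catalog.sales.orders TO `analysts`;

-- Write access (insert, update, delete) for another group.
GRANT MODIFY ON TABLE demo_catalog.sales.orders TO `data-engineers`;

-- Take the write access away again.
REVOKE MODIFY ON TABLE demo_catalog.sales.orders FROM `data-engineers`;
```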

5) Transferring Ownership

If needed, you can transfer ownership of a catalog, schema, table, or view to another user or group. This can be done via SQL commands or the Catalog Explorer UI.
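
In SQL, an ownership transfer is an ALTER statement on the object. The sketch below uses hypothetical object names and a hypothetical data-stewards group. It assumes you are the current owner or a metastore admin.

```sql
-- Hypothetical object and group names, for illustration only.

-- Hand a single table over to a group.
ALTER TABLE demo_catalog.sales.orders OWNER TO `data-stewards`;

-- The same pattern applies to schemas and catalogs.
ALTER SCHEMA demo_catalog.sales OWNER TO `data-stewards`;
ALTER CATALOG demo_catalog OWNER TO `data-stewards`;
```

Assigning ownership to a group rather than an individual avoids orphaned objects when someone changes roles.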

6) Managing External Locations and Storage Credentials

For datasets in external storage systems like AWS S3 or Azure Blob Storage, configure external locations and storage credentials within Unity Catalog. This enables secure access to those data sources.
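
A minimal sketch of registering an external location follows. It assumes a storage credential named my_storage_cred already exists (for example, one created through Catalog Explorer from an IAM role). The credential name, the location name raw_landing, the group name, and the S3 path are all placeholders.

```sql
-- Hypothetical names and a placeholder path, for illustration only.

-- Tie a cloud storage path to an existing storage credential.
CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing
URL 's3://<bucket>/<path>'
WITH (STORAGE CREDENTIAL my_storage_cred)
COMMENT 'Landing zone for raw files';

-- Let a group read files under that path through Unity Catalog.
GRANT READ FILES ON EXTERNAL LOCATION raw_landing TO `data-engineers`;
```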

7) Dynamic Views for Row/Column-Level Security

Going beyond object-level permissions, dynamic views in Unity Catalog allow you to filter or mask data at the row or column level based on the accessing user's identity. This ensures that users only see the subset of data they are authorized to view.
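
Here is a sketch of a dynamic view that combines column masking and row filtering. The source table, its columns (order_id, customer_email, region, amount), and the group names pii-readers and global-analysts are hypothetical. The is_account_group_member function checks the querying user's account-level group membership.

```sql
-- Hypothetical table, column, and group names, for illustration only.
CREATE OR REPLACE VIEW demo_catalog.sales.orders_restricted AS
SELECT
  order_id,
  -- Column-level masking: only members of pii-readers see the real email.
  CASE
    WHEN is_account_group_member('pii-readers') THEN customer_email
    ELSE 'REDACTED'
  END AS customer_email,
  region,
  amount
FROM demo_catalog.sales.orders
-- Row-level filtering: everyone else is limited to one region.
WHERE is_account_group_member('global-analysts')
   OR region = 'EMEA';
```

Users are then granted SELECT on the view rather than on the underlying table, so the filtering logic cannot be bypassed.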

If you leverage these key capabilities, you can implement a comprehensive data access control strategy tailored to your organization's security policies and compliance requirements.

Step-By-Step Guide for Getting Started With Databricks Unity Catalog

Prerequisites

Databricks workspace enabled for Unity Catalog.
Access to compute resources (SQL warehouse or cluster) that support Unity Catalog.
Appropriate privileges on Databricks Unity Catalog objects (USE CATALOG, USE SCHEMA, and CREATE TABLE).
Users and groups added to the workspace.

Namespace Overview:

Now, let's talk about Unity Catalog's three-level namespace. It organizes your data into catalogs, schemas (databases), and tables or views. Think of it like a filing cabinet with drawers (catalogs), folders (schemas), and documents (tables or views).

When referring to a table, you'll use this format:

<catalog>.<schema>.<table>

If you have data in your Databricks workspace's local Hive metastore or an external Hive metastore, it becomes a catalog called hive_metastore, and you can access tables like this:

hive_metastore.<schema>.<table>

Step 1—Create a New Catalog

Create a new catalog using the CREATE CATALOG command, either in the SQL editor or wrapped in spark.sql() in a notebook. To create a catalog, you must be a metastore admin or have the CREATE CATALOG privilege on the metastore.

CREATE CATALOG IF NOT EXISTS <catalog>;