HOW TO: Use Databricks INSERT INTO for Appending Data (2025)

Loading data into tables is one of the most frequent tasks you tackle as a data engineer, analyst or data scientist. So, whether you’re prepping raw CSVs, logs or streaming events, you need those records in a table before you can query them. On Databricks, you use the Databricks INSERT INTO command to append new rows to existing Databricks tables without dropping or overwriting existing data. The INSERT INTO command is more than just an "add data" instruction; it comes with various options and specific behaviors. Behind the scenes, it hooks into Delta Lake’s versioned storage, respects your table schema and updates partitions as needed.

In this article, we will cover everything you need to know about using the Databricks INSERT INTO command, including syntax, some real-world examples, and best practices for implementation.

How to Insert Data Into Databricks Tables?

Before diving into the Databricks INSERT INTO command and populating tables with data, you need a Databricks table. Let's review Databricks tables. Databricks tables store structured data and support various file formats. These tables are registered in a metastore (like Unity Catalog or the legacy Hive metastore), making them discoverable and queryable via SQL.

Databricks heavily emphasizes, and defaults to, using Delta Lake as the format for tables. Delta Lake is an open source storage framework that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to your data lakes. It builds upon open formats like Parquet and adds a transaction log, enabling features such as:

  • ACID transactions for concurrent reads and writes
  • Time travel (querying or restoring earlier table versions)
  • Schema enforcement and schema evolution
  • Scalable metadata handling

These Delta Lake features directly and significantly impact how INSERT INTO behaves, making it far more robust and versatile than inserts into traditional data lake tables (e.g., tables based purely on Parquet or ORC files without Delta Lake). The fact that an INSERT INTO operation on a Delta Lake table is a transactional operation, recorded in a log, is a fundamental difference from simply adding a new file to a directory.
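
Because each insert is a commit in the Delta transaction log, you can inspect it afterwards. Here is a minimal sketch (the table name my_delta_table and its two-column schema are hypothetical):

-- Each INSERT INTO a Delta table is recorded as a new table version
INSERT INTO my_delta_table VALUES (1, 'first batch');
DESCRIBE HISTORY my_delta_table; -- the append shows up as the most recent version, with operation details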


Let's create a simple Databricks table

Naturally, before you can insert any data, a table must exist. The Databricks CREATE TABLE statement is the Data Definition Language (DDL) command used for this purpose. The structure you define with Databricks CREATE TABLE (column names, data types, nullability constraints, default values for Delta Lake tables in Databricks SQL and Databricks Runtime 10.4 LTS and above, and the partitioning scheme) directly dictates how Databricks INSERT INTO will function and what kind of data the table will accept.

The general syntax for creating a table in Databricks SQL is:

CREATE TABLE [IF NOT EXISTS] table_name (column_definitions)
  [USING format]
  [OPTIONS (key = value, ...)]
  [PARTITIONED BY (col_name, ...)]
  [LOCATION path]
  [COMMENT 'table_comment']
  [TBLPROPERTIES (key = value, ...)];

For Delta Lake tables, the default and recommended format, you can explicitly specify USING DELTA (Delta is used automatically if no format is given). Advanced features like schema enforcement, schema evolution, support for DEFAULT values, and reliable partitioning are best supported (or exclusively supported) by Delta Lake, and they are crucial for Databricks INSERT INTO operations.

Here's a quick example of creating a simple table in Databricks:

CREATE TABLE my_simple_delta_table (
  id INT,
  name STRING,
  value DOUBLE
) USING DELTA; -- Explicitly stating DELTA, though often the default

Output:

Creating Databricks Table - Databricks INSERT INTO

Here you can see that this command creates a managed Databricks Delta table named my_simple_delta_table with three columns.

For the examples that follow in this article, we'll use a slightly more complex students_delta table, which includes a DEFAULT value and a PARTITIONED BY clause.

CREATE TABLE IF NOT EXISTS students_delta (
    name string,
    student_address string DEFAULT 'unknown', -- Note the DEFAULT value
    student_id INT
)
USING DELTA -- stating Delta
PARTITIONED BY (student_id); -- Partitioning by student_id

Output:

Creating Databricks Table - Databricks INSERT INTO

Databricks INSERT INTO—Syntax & Variations

Now, let's dive into the core of our discussion: the Databricks INSERT INTO command. This is your fundamental tool for adding new data into your tables.

The INSERT INTO command in Databricks SQL and Apache Spark SQL (which Databricks Runtime uses) presents a couple of primary structures:

The general form for appending or overwriting data is:

INSERT { OVERWRITE | INTO } [ TABLE ] table_name
    [ PARTITION clause ]
    [ ( column_name [, ...] ) | BY NAME ]
    query

And for a specific conditional overwrite operation on Delta Lake tables:

INSERT INTO table_name
    REPLACE WHERE predicate
    query

As you can see, these structures cover the fundamental approaches to an insert operation: you can either append data (using INTO), replace data (using OVERWRITE), or perform a special conditional replacement (using Databricks REPLACE WHERE), which is specific to Databricks Delta tables.

Note that the optional TABLE keyword is purely for syntactic convenience; 

INSERT INTO my_table 

And

INSERT INTO TABLE my_table 

Both of them are functionally identical. It's also worth noting that for basic append operations, the INTO keyword itself is optional, meaning INSERT my_table... is also valid syntax.

Dissecting the Parameters

The flexibility of the Databricks INSERT INTO command lies in its parameters. Let's break them down:

INTO — Specifies that rows should be appended to the existing table data. This is often the default behavior if no mode (like OVERWRITE) is specified. The INTO keyword itself is optional for basic appends.

OVERWRITE — Removes existing data before insertion

  • Without a PARTITION clause — Truncates the entire table (deletes all existing rows) before inserting the new rows from the query.
  • With a PARTITION clause — Truncates only the partitions matching the PARTITION specification before inserting new rows from the query into those partitions. For dynamic partition overwrites, the table must be a Delta Lake table or specific configurations must be enabled.

table_name

  • Identifies the target table.
  • Must not include a temporal specification (FOR TIMESTAMP AS OF).
  • Can include an OPTIONS specification in Databricks Runtime 16.0 and above.
  • Cannot reference a foreign table for INSERT operations directly (limitation from Databricks Runtime 16.0 onwards).

PARTITION ( partition_spec )

  • Optional parameter specifying target partitions.
  • Can be a partial or full specification (PARTITION (col1=val1, col2=val2) or PARTITION (col1=val1)).
  • For static partitions (PARTITION (column = value)), the partition column(s) must not be repeated in the insert column list if one is provided.
  • For dynamic partition INSERT OVERWRITE, the table_name must be a Delta Lake table.

( column_name [, ...] )

  • Optional list of columns in the target table to insert data into.
  • Each column can appear at most once. The order of columns in this list dictates the mapping from the query.
  • If omitted, columns are mapped by position from the query to the table columns.
  • If a target column is not in this list (or if the list is omitted and the query provides fewer columns positionally, though this is less common for modern Spark SQL):
    • Databricks SQL (12.2 LTS and above) assigns the DEFAULT value if defined.
    • If no DEFAULT exists and the column is nullable, NULL is assigned.
    • If no DEFAULT exists and the column is not nullable, an error occurs.
  • Omitting the column list is generally equivalent to specifying all table columns in their defined order (excluding static partition columns if PARTITION clause is used).
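
As a quick illustration of that fallback order, here is a minimal sketch using a hypothetical demo_defaults table with one NOT NULL column, one DEFAULT column, and one plain nullable column (following the same CREATE TABLE ... DEFAULT pattern used elsewhere in this article):

CREATE TABLE demo_defaults (
  id INT NOT NULL,
  city STRING DEFAULT 'unknown',
  note STRING -- nullable, no DEFAULT
) USING DELTA;

INSERT INTO demo_defaults (id) VALUES (1);         -- city falls back to 'unknown', note gets NULL
-- INSERT INTO demo_defaults (city) VALUES ('NY'); -- would fail: id is NOT NULL and has no DEFAULT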

BY NAME

  • Available in Databricks Runtime 13.3 LTS and above.
  • Uses column names from the query results to match target table columns. The order of columns in the query does not need to match the target table's column order.
  • Each column name from the query must exist in the target table. Duplicate names in the query's output for matching are problematic.
  • DEFAULT values (or NULL/error logic) are used for target table columns not matched by name from the query.
  • Matches attributes of structs by name as well.
  • No columns from the query can match columns specified in a static PARTITION clause or generated columns in the target table.
  • Source columns from the query that do not match any target table column name are ignored.

Databricks REPLACE WHERE boolean_expression

  • Available for Delta Lake tables in Databricks Runtime 12.2 LTS and above.
  • Atomically deletes rows in the target table matching the boolean_expression, then inserts rows from the query that also match the same boolean_expression.
  • Rows in the query that do not match the boolean_expression are ignored and not inserted.
  • Supports any expression that evaluates to a boolean result, typically referencing columns of the target table.

query

  • A SQL query (e.g., SELECT ..., VALUES ..., TABLE ...) that produces the rows to be inserted.
  • Must match the number of columns in the specified or implied column list (unless BY NAME is used).
  • Data types must be safely castable to target column types.
  • Can include the DEFAULT keyword in VALUES clauses or SELECT statements (Databricks Runtime 11.3 LTS and above) to explicitly use a column's default value.
  • Supports schema evolution for Delta Lake tables if the table is configured for it (TBLPROPERTIES('delta.autoMerge.enabled' = 'true')) or if ALTER TABLE ADD COLUMNS is used prior to insertion. This is distinct from the DataFrame writer's mergeSchema option.
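
A short hedged sketch of those last two points, using a hypothetical events table: the DEFAULT keyword inside VALUES, and widening the schema with ALTER TABLE before inserting rows that carry the new column.

-- Explicitly request the column default (Databricks Runtime 11.3 LTS and above)
INSERT INTO events (event_name, region) VALUES ('signup', DEFAULT);

-- Add the column first, then insert rows that include it
ALTER TABLE events ADD COLUMNS (source STRING);
INSERT INTO events (event_name, region, source) VALUES ('signup', 'EU', 'mobile');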

Here is the summary of Databricks INSERT INTO Command Parameters:

  • INTO: Appends new rows to the table. Existing data is preserved. The INTO keyword itself is optional for basic appends.
  • OVERWRITE: Replaces existing data. Truncates the entire table if no PARTITION clause is used. If a PARTITION clause is present, only the specified partitions are truncated. Dynamic partition overwrites require Delta Lake.
  • table_name: Identifies the target table. Must be an existing table. Cannot include a temporal specification. Must not be a foreign table (for Databricks Runtime 16.0 and above). Can include an OPTIONS specification.
  • PARTITION clause: Optional. Specifies a target partition for the insert. Can be a full or partial specification. For static partitions (column = value), the partition column must not be repeated in the insert column list. For dynamic partition INSERT OVERWRITE, the table_name must be a Delta Lake table.
  • (column_name [, ...]): Optional. An explicit list of columns in the target table to insert data into. The order of columns in this list matters and must correspond to the order of columns in the query. If columns are omitted (Databricks Runtime 12.2 LTS+), Databricks SQL assigns the DEFAULT value if defined, then NULL if nullable, otherwise an error. Implies all columns if absent.
  • BY NAME: Optional (Databricks Runtime 13.3 LTS+). Matches columns from the query to the target table_name based on column names from the query. If a column in table_name is not in the query's output, its DEFAULT value (or NULL/error) is used. Matches struct attributes by name. Query columns cannot match partition/generated columns. Extra source columns are ignored.
  • Databricks REPLACE WHERE predicate: Delta Lake tables only (Databricks Runtime 12.2 LTS+). Atomically deletes rows matching the predicate, then inserts rows from the query that also match the predicate. Rows in the query not matching the predicate are ignored. Provides granular, conditional overwrite.
  • query: A SQL query (SELECT, VALUES, TABLE) that produces the rows to be inserted. Column count from the query must match the insert list (unless BY NAME is used). Types must be castable. Can use the DEFAULT keyword (Databricks Runtime 11.3 LTS+). Supports schema evolution for Delta tables (if the table is configured for it).

Databricks INSERT INTO Variations

Databricks provides several ways to utilize the INSERT statement, each designed for specific data appending requirements. These variants allow for flexibility, from inserting individual rows to performing bulk data movements or conditional replacements.

Once you know these variant types, it will help you in picking the most efficient and appropriate method for a given task. The ability to combine different clauses, such as INSERT OVERWRITE with a PARTITION clause and a SELECT query, allows you to create sophisticated ETL/ELT logic directly within SQL.

On top of that, the capability to INSERT INTO or OVERWRITE a Delta table referenced by its directory path (e.g. delta.`/path/to/table`) underscores Delta Lake's architecture, where tables are fundamentally a combination of data files in cloud storage and a transaction log that governs them.

Let's explore these variants with practical examples, using the students_delta table we created earlier.

1) Basic Databricks INSERT INTO (using VALUES, subquery, TABLE clause, into a directory)

This is the most common form, used for appending data.

a) Using VALUES  for single or multiple rows:

INSERT INTO students_delta VALUES ('Elon Musk', 'Texas', 1);

Output:

Inserting data to student_delta Databricks table - Databricks INSERT INTO

This is the simplest way to add a specific record.

INSERT INTO students_delta (name, student_id) VALUES ('Jeff Bezos', 2);

Output:

Inserting data to student_delta Databricks table - Databricks INSERT INTO

Here the student_address will be 'unknown' due to the DEFAULT constraint.

INSERT INTO students_delta VALUES ('Sam Altman', NULL, 3);

Output:

Inserting data to student_delta Databricks table - Databricks INSERT INTO

Unlike the previous insert, the DEFAULT is not applied here; NULL is explicitly inserted for student_address.

INSERT INTO students_delta VALUES
   ('Mark Zuckerberg', 'California', 4),
   ('Larry Ellison', 'Palo Alto', 5);

Output:

Inserting data to student_delta Databricks table - Databricks INSERT INTO

Inserting multiple rows this way is more efficient than separate single-row inserts.

Here is what our table might look like after these inserts (order may vary):

Inserting data to student_delta Databricks table - Databricks INSERT INTO

b) Using a Subquery

This is powerful for ETL processes where data is transformed or selected from another source.

First, let's create a trainees table and populate it:

CREATE TABLE trainees (name string, address string, trainee_id INT);

INSERT INTO trainees VALUES ('Warren Buffett', 'Washington', '123');

SELECT * FROM trainees;
Inserting data to student_delta Databricks table - Databricks INSERT INTO
INSERT INTO students_delta PARTITION (student_id = 4) -- Static partition assignment
   SELECT name, address FROM trainees WHERE name = 'Warren Buffett';
Inserting data to student_delta Databricks table - Databricks INSERT INTO

Here, data for student 'Warren Buffett' is selected from the trainees table and inserted into the students_delta table, specifically into the partition where student_id is 4. The student_id column is not part of the SELECT list because its value is provided by the PARTITION clause.

c) Using TABLE clause

The TABLE clause provides a concise way to insert all rows from another table. Assuming an exchange_students table exists with a compatible schema:

CREATE TABLE exchange_students (name STRING, address STRING, student_id INT);

INSERT INTO exchange_students VALUES ('Anil Ambani', 'Mumbai, India', 777777);

INSERT INTO students_delta TABLE exchange_students;

Output:

Inserting data to student_delta Databricks table - Databricks INSERT INTO

Here you can see that it effectively copies all data from the exchange_students table into the students_delta table.

d) Insert into a directory (for Databricks Delta tables)

This syntax is used when the Delta Lake table is identified by its path in cloud storage.

Let's assume a table students_at_path is managed at a specific location:

INSERT INTO delta.`/tmp/path/to/students_table` VALUES ('Steve Ballmer', 'United States, San Jose', 11);

This directly inserts data into the Delta Lake table managed at the specified file system path. Ensure the path exists and is a Delta table.

2) Insert with a column list

This variant allows you to specify which columns you are inserting data into and in what order, which can differ from the table's actual column order.

INSERT INTO students_delta (student_address, name, student_id) VALUES
   ('Bangalore, India', 'Ram', 8);

Output:

Inserting data to student_delta Databricks table - Databricks INSERT INTO

Here, the values in the VALUES clause correspond to address, then name, then student_id, respectively, matching the provided column list.

3) Insert with both a partition spec and a column list

This combines the specificity of inserting into a particular partition with the flexibility of a column list.

INSERT INTO students_delta PARTITION (student_id = 9) (student_address, name) VALUES
   ('Delhi, India', 'Hari');
Inserting data to student_delta Databricks table - Databricks INSERT INTO

Data is inserted into the partition where student_id is 9. The VALUES provide data only for student_address and name; student_id is taken from the PARTITION clause.

4) INSERT using the BY NAME clause

Available in Databricks Runtime 13.3 LTS and above, this matches columns from the source query to the target table by their names, rather than their positions. First, create a source table (here with a different column layout and an extra column with a default):

CREATE TABLE source_for_by_name (
  id INT,
  full_name STRING,
  city STRING,
  country STRING DEFAULT 'USA'
);
INSERT INTO source_for_by_name VALUES (13, 'Sundar Pichai', 'Mountain View', 'USA');

Now, insert into students_delta using BY NAME:

INSERT INTO students_delta -- target table has (name, student_address, student_id)
BY NAME
SELECT
  full_name AS name, -- Matches 'name' in students_delta
  id AS student_id,  -- Matches 'student_id' in students_delta
  city AS student_address, -- Matches 'student_address' in students_delta
  country            -- This column from source_for_by_name is ignored as no 'country' in students_delta
FROM source_for_by_name WHERE id = 13;

5) Databricks REPLACE WHERE

This Delta Lake-specific feature (Databricks Runtime 12.2 LTS+) allows for atomic, conditional overwrites. It first deletes rows in the target table that match the WHERE condition, and then inserts rows from the source query that also match that same condition.

Let's first create a staffs table and populate it:

CREATE TABLE staffs (
  name       STRING,
  department STRING
);


-- Databricks INSERT INTO

INSERT INTO staffs VALUES
  ('Badri',   'Finance'),
  ('Bob',     'Engineering'),
  ('Kedar', 'Finance');

At this point, your table holds:

Creating staffs table and populating it - Databricks INSERT INTO
INSERT INTO staffs
REPLACE WHERE department = 'Finance'
VALUES
  ('Prayag', 'Finance'),
  ('Raj',  'Finance');

Output:

Selecting all from staffs table - Databricks INSERT INTO

6) Databricks INSERT OVERWRITE (using VALUES, subquery, TABLE clause, into a directory) 

Databricks INSERT OVERWRITE is similar to its Databricks INSERT INTO counterparts but replaces data instead of appending.

a) Databricks INSERT OVERWRITE using a VALUES clause

INSERT OVERWRITE students_delta VALUES
    ('Ashua Hill', 'Cupertino', 1),
    ('Brian Reed', 'Palo Alto', 2);

SELECT * FROM students_delta;

This replaces the entire content of the students_delta table with the two provided rows.

Output:

Overwriting students_delta and querying the table - Databricks INSERT INTO

b) INSERT OVERWRITE using a subquery (with partitioning)

INSERT OVERWRITE students_delta PARTITION (student_id = 2) -- Overwrites only partition student_id=2
    SELECT name, address FROM trainees WHERE name = 'Warren Buffett'; -- Assuming Warren Buffett's ID should be mapped to student_id 2 for this example

It overwrites only the data in the partition where student_id is 2. Data in other partitions remains untouched.

Output:

Overwriting student_id=2 partition and querying trainees records - Databricks INSERT INTO

c) INSERT OVERWRITE using a TABLE clause

INSERT OVERWRITE students_delta TABLE exchange_students;

SELECT * FROM students_delta;

The entire students_delta table is replaced with the content of exchange_students.

Output:

Overwriting students_delta with exchange_students and querying the table

d) INSERT OVERWRITE a directory

INSERT OVERWRITE delta.`/tmp/path/to/students_table` VALUES
    ('New Path Person', 'San Jose', 111);

Each INSERT variant is designed for particular scenarios, from adding single, explicit records with VALUES, to bulk data movements using subqueries or the TABLE clause, surgical updates with REPLACE WHERE, and complete data refreshes with OVERWRITE.

What Is the Difference Between INSERT INTO and INSERT OVERWRITE in Databricks? Let's Clear That Up.

A key distinction in Databricks SQL is how data is added:

  • INSERT INTO appends rows to a table or partition without altering existing data. Therefore, use INSERT INTO when you want to keep past data and add new records.
  • INSERT OVERWRITE, on the other hand, replaces data. It truncates the target table or partitions before inserting, so old rows are completely removed. Consequently, use Databricks INSERT OVERWRITE to replace old records with fresh data.
Here is a side-by-side comparison of Databricks INSERT INTO and Databricks INSERT OVERWRITE:

  • Behavior: INSERT INTO appends new rows; INSERT OVERWRITE deletes existing rows, then inserts new rows.
  • Effect on existing data: INSERT INTO preserves existing data (new data is added alongside it); INSERT OVERWRITE deletes existing data (either the full table or only the specified partitions) before inserting the new data.
  • Use case: INSERT INTO adds incremental or batch data to a table; INSERT OVERWRITE refreshes or overwrites all/some data (e.g. a daily refresh).
  • Idempotency: INSERT INTO is not naturally idempotent (running the same command twice typically adds the data twice, leading to duplicates); INSERT OVERWRITE can be idempotent, especially with partition overwrites, if the source data for the replacement is consistent (re-running replaces the target with the same data instead of duplicating it), as illustrated in the sketch below.
  • Partition handling: INSERT INTO adds data to the appropriate partitions based on values and creates new partitions if they don't exist (Delta); INSERT OVERWRITE truncates all data for a full-table overwrite, or only the specified partitions when a PARTITION clause is used (data in other partitions is safe; dynamic partition overwrite needs Delta).
  • Performance: INSERT INTO is generally efficient for appends; INSERT OVERWRITE can involve significant data rewrite, so use it when a full replacement is intended for the scope.
  • Risk of data loss: Low for INSERT INTO (primarily if inserting incorrect data; Delta Time Travel can mitigate); higher for INSERT OVERWRITE if used incorrectly (e.g. an OVERWRITE without a PARTITION clause on a large table by mistake). Delta Time Travel is a key mitigation for this risk.
  • Syntax: INSERT INTO [TABLE] table_name … (optionally with PARTITION, BY NAME, REPLACE WHERE) vs INSERT OVERWRITE [TABLE] table_name … (optionally with PARTITION).
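
To make the idempotency row concrete, here is a minimal sketch with hypothetical sales_daily and staged_sales tables (sales_daily partitioned by sale_date): re-running the append duplicates the row, while re-running the partition overwrite converges to the same result.

-- Append: running this twice leaves two identical rows in the table
INSERT INTO sales_daily VALUES ('2025-01-01', 'store_1', 100.0);
INSERT INTO sales_daily VALUES ('2025-01-01', 'store_1', 100.0);

-- Partition overwrite: running this twice still leaves exactly one copy of that day's data
INSERT OVERWRITE sales_daily PARTITION (sale_date = '2025-01-01')
SELECT store_id, amount FROM staged_sales WHERE sale_date = '2025-01-01';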

Databricks also supports REPLACE WHERE, which is like a filtered INSERT INTO: it atomically replaces only rows matching a predicate. More on that later.

Databricks INSERT INTO Command—In-Depth Practical Examples: Let's Get Our Hands Dirty!

Theory and syntax are essential, but seeing the Databricks INSERT INTO command in action truly solidifies understanding. Let's deep dive into a series of practical examples of Databricks INSERT INTO.

Prerequisites:

  • Access to a Databricks Workspace. You'll need an active Databricks workspace, typically hosted on a cloud platform like AWS, Azure, or GCP.
  • A running Databricks Cluster or Databricks SQL Warehouse.
  • A foundational understanding of SQL.
  • Familiarity with the concept of Databricks tables, especially Delta Lake tables and their general characteristics (like ACID properties and schema management), will be beneficial.
  • Appropriate permissions within your Databricks workspace to create tables (if you're starting from scratch for these examples) and, crucially, to insert data into them within your chosen catalog and schema.

Before we proceed with the examples, make sure your Databricks environment is properly configured. Let’s quickly walk through how to do that. First things first: running the following commands requires either the Databricks SQL editor (connected to a SQL warehouse) or a Databricks notebook cell (attached to an interactive Databricks cluster). In this article, we’ll use a Databricks cluster (Databricks Runtime) rather than a SQL warehouse, so let’s set up the cluster.

Step 1—Create a Databricks Cluster

First things first, log into your Databricks workspace and create a new Databricks cluster. In the Databricks UI, click Compute and then Create Compute.

Creating Databricks Cluster - Databricks INSERT INTO

Give the cluster a name, pick a Databricks Runtime, and select the number and type of worker nodes. Finally, click Create compute.

Creating Databricks Cluster - Databricks INSERT INTO

You will see that the Databricks cluster will spin up and show status Running when ready.

Step 2—Create and Attach a Databricks Notebook

With the Databricks cluster running, create a new Databricks notebook to run commands. In the sidebar, go to Workspace > your user folder (or the shared workspace), click Create > Databricks Notebook.  Databricks Notebooks provide an interactive environment for running code and SQL queries. They're perfect for exploring data and testing schema changes.

Creating Databricks Notebook - Databricks INSERT INTO

Name it something like 1) Databricks_INSERT_INTO. For language, choose Python or SQL (we'll use both).

Naming Databricks Notebook - Databricks INSERT INTO

After creating the Databricks notebook, attach it to the Databricks cluster you just created by selecting the cluster name in the top toolbar. You now have an interactive workspace to run Databricks PySpark and SQL commands.

Attaching Databricks Cluster to Databricks Notebook - Databricks INSERT INTO

Now let's jump to the examples.

Practical Demo of Databricks INSERT

For our examples, we'll use a fresh table new_students_delta:

CREATE TABLE IF NOT EXISTS new_students_delta (
    name STRING,
    student_address STRING DEFAULT 'unknown',
    student_id INT
)
USING DELTA
PARTITIONED BY (student_id);

Also, let's create a source table named new_students_source to use in some of the more complex insert examples:

CREATE TABLE IF NOT EXISTS new_students_source (
    full_name STRING,
    city STRING,
    id INT,
    country_code STRING DEFAULT 'USA'
);

INSERT INTO new_students_source VALUES
('Warren Buffett', 'Omaha, Nebraska', 777777, 'N/A'),
('Sergey Brin', 'California', 888888, 'USA'),
('Amancio Ortega', 'Coruna, Galicia', 888889, 'ES'),
('Larry Ellison', 'Hawaii', 777778, 'N/A');
Creating and inserting data in the new_students_source Databricks table

Now, let's proceed with the examples.

Databricks INSERT INTO Example 1—Simple Single-Row Insert with VALUES

Let's add one student record with all values explicitly provided.

INSERT INTO new_students_delta VALUES ('Alice Wonderland', 'Rabbit Hole', 100001);

To Verify:

SELECT * FROM new_students_delta WHERE student_id = 100001;
Inserting and verifying data in the new_students_delta Databricks table

This is the most basic form of inserting a single, complete row.

Databricks INSERT INTO Example 2—Multi-Row Insert with VALUES

Here we’ll add a couple of student records simultaneously using a single INSERT statement.

INSERT INTO new_students_delta VALUES
('Bob The Builder', '1 Construction Lane', 200002),
('Charlie Brown', '5 Cartoon St', 300003);

To Verify:

SELECT * FROM new_students_delta WHERE student_id IN (200002, 300003) ORDER BY student_id;
Inserting multi-row data in the new_students_delta Databricks table - Databricks INSERT

The multi-row method is more efficient than executing multiple single-row INSERT statements.

Databricks INSERT INTO Example 3—Inserting with an Explicit Column List

Here we’ll add a student by providing only their name and ID, allowing the student_address column to use its predefined DEFAULT value ('unknown').

INSERT INTO new_students_delta (name, student_id) VALUES ('Dora Explorer', 400004);

To Verify:

SELECT name, student_address, student_id FROM new_students_delta WHERE student_id = 400004;
Inserting data with an explicit column list in the new_students_delta Databricks table

Databricks INSERT INTO Example 4—Inserting with DEFAULT Keyword in VALUES

Our goal here: explicitly instruct Databricks to use the DEFAULT value for the student_address column.

INSERT INTO new_students_delta VALUES ('Eve Online', DEFAULT, 500005);

To Verify:

SELECT name, student_address, student_id FROM new_students_delta WHERE student_id = 500005;

Databricks INSERT INTO Example 5—Inserting from a Subquery

Now let’s insert students into the new_students_delta table by selecting and mapping data from the new_students_source table.

INSERT INTO new_students_delta (name, student_address, student_id)
SELECT full_name, city, id
FROM new_students_source
WHERE country_code = 'USA';

To Verify:

SELECT * FROM new_students_delta;
Inserting USA students into new_students_delta and querying new_students_source

As you can see, this is a common ETL pattern, where data is transformed or filtered from a source before being loaded into a target.

Databricks INSERT INTO Example 6—INSERT INTO a Specific Partition (Static Partitioning)

Now here in this example, we’ll add a new student record directly into a designated student_id partition.

INSERT INTO new_students_delta PARTITION (student_id = 600006) (name, student_address)
VALUES ('Blobby', 'Transylvania');

Notice that student_id is not included in the VALUES list or the explicit column list (name, student_address). Its value is fixed by the PARTITION (student_id = 600006) clause for this insert operation.

To Verify:

SELECT * FROM new_students_delta WHERE student_id = 600006;

This method is efficient for loading data into known partitions.

Inserting data into student_id=600006 partition in new_students_delta - Databricks INSERT

Databricks INSERT INTO Example 7—INSERT OVERWRITE (Entire Table)

Here we will completely replace all existing data in the new_students_delta table with records from the new_students_source table.

INSERT OVERWRITE TABLE new_students_delta (name, student_address, student_id)
SELECT full_name, city, id
FROM new_students_source;

To Verify:

SELECT * FROM new_students_delta ORDER BY student_id;
Overwriting new_students_delta with data from new_students_source

It is a powerful command for full data refreshes. Remember, with Databricks Delta Lake, you can use Databricks Time Travel to revert if this was a mistake. For subsequent examples, we might need to repopulate the new_students_delta table or assume a clean state.
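
If you do need to look back after an overwrite, Delta's history and time travel make that straightforward. A quick hedged sketch (the version number is hypothetical; read it off DESCRIBE HISTORY first):

DESCRIBE HISTORY new_students_delta;              -- find the version just before the overwrite
SELECT * FROM new_students_delta VERSION AS OF 1; -- query the table as it looked at that version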

Databricks INSERT INTO Example 8—INSERT OVERWRITE a Specific Partition (Static Partitioning)

First, let's make sure there's some data in a partition we want to overwrite and some data in other partitions that should remain.

Repopulate new_students_delta for this example:

DELETE FROM new_students_delta;
INSERT INTO new_students_delta VALUES ('Old Record In777', 'Old Address A', 777777);
INSERT INTO new_students_delta VALUES ('Other Student', 'Other Address', 123456);

Replace the data only for the partition student_id = 777777 using data from new_students_source.

INSERT OVERWRITE TABLE new_students_delta PARTITION (student_id = 777777)
SELECT full_name, city -- student_id is derived from the PARTITION clause
FROM new_students_source
WHERE id = 777777; 

To Verify:

SELECT * FROM new_students_delta ORDER BY student_id;

As you can see, this is how you refresh a specific slice of data without affecting other partitions.

Databricks INSERT INTO Example 9—INSERT... BY NAME

Here we will insert data from new_students_source into a differently structured target table, relying on column name matching. This makes the insert more resilient to column order differences.

Let's set up the target table students_by_name with columns in a different order and an extra column with a default value:

CREATE TABLE IF NOT EXISTS students_by_name (
    student_id INT,    -- Note: student_id is first here
    name STRING,
    address STRING DEFAULT 'unknown',
    enrollment_date DATE DEFAULT CURRENT_DATE() -- New column with a default
) USING DELTA;

INSERT INTO students_by_name BY NAME
SELECT
  id AS student_id,    -- Aliasing source 'id' to match target 'student_id'
  full_name AS name,   -- Aliasing source 'full_name' to match target 'name'
  city AS address,             -- Aliasing source 'city' to match target 'address'
  country_code         -- This column exists in source but not in target, so it will be ignored by BY NAME
FROM new_students_source;

The enrollment_date column in students_by_name is not present in the SELECT statement, so it will receive its DEFAULT value. The country_code from the source is ignored as there's no matching column name in the target.

To Verify:

SELECT * FROM students_by_name ORDER BY student_id;

BY NAME is particularly useful for ETL pipelines where source and target schemas might not perfectly align in order, or when target tables have additional metadata columns with defaults.

Databricks INSERT INTO Example 10—INSERT INTO... REPLACE WHERE (Delta Lake Specific)

Here we will atomically update specific records based on a condition. This is more granular than partition overwrites.

First, let's set up the new_students_delta table with some specific data:

DELETE FROM new_students_delta; -- Clear the table for a fresh start
INSERT INTO new_students_delta VALUES
('Alice Wonderland', '12 Rabbit Hole', 100001),
('Bob The Builder', '1 Construction Lane', 200002),
('Old Alice', 'Old Wonderland', 100001); -- Another record with student_id = 100001

Now, suppose we want to replace all records where student_id = 100001 with a single, new corrected record for Alice.

INSERT INTO new_students_delta REPLACE WHERE student_id = 100001
VALUES ('Alice Wonderland Corrected', 'New Wonderland Address', 100001);

This command will first delete both 'Alice Wonderland' and 'Old Alice' (since they both match student_id = 100001). Then, it will insert the new record for 'Alice Wonderland Corrected' because this new record also matches the student_id = 100001 condition.

To Verify:

SELECT * FROM new_students_delta ORDER BY student_id, name;
Deleting new_students_delta data, inserting values, and replacing student_id=100001

Databricks REPLACE WHERE is a powerful tool for such surgical updates in Delta tables. While it offers precise conditional replacement, for more complex scenarios involving different actions for matched versus non-matched rows (like updating if matched, inserting if not matched), the Databricks MERGE INTO command (an alternative to INSERT INTO) provides a more comprehensive set of "upsert" capabilities. 

These examples cover many common use cases for Databricks INSERT INTO. The choice of which variant and which clauses to use depends heavily on the specific requirements of your data loading task.

Programmatic Insertion via DataFrame API (Databricks PySpark/Scala)

SQL is great for working with data, but in some cases, especially within larger data pipelines or when you need to do complex transformations, you'll probably be using DataFrames in Databricks PySpark or Scala. Databricks offers a solid DataFrameWriter API that lets you write DataFrame data into tables, which is often a more flexible option than using Databricks INSERT INTO commands directly.

Let's assume spark is an available SparkSession and students_df is a PySpark DataFrame with new student data.

%python
data = [
    ("Elon Musk", "Palo Alto", 1001),
    ("Bernard Arnault", "Paris", 1002),
    ("Jeff Bezos", "Seattle", 1003),
    ("Mark Zuckerberg", "Menlo Park", 1004)
]
schema = ["name", "address", "student_id"]

students_df = spark.createDataFrame(data, schema)

# Define the target Databricks Delta table name
target_table_name = "students_delta_pyspark"

Option 1—Using saveAsTable with append mode (can create if not exists, but better to pre-create for schema control)

Let's now create a table if it doesn't exist, or append if it does.

%python
try:
    students_df.write \
       .format("delta") \
       .mode("append") \
       .saveAsTable(target_table_name)
    print(f"Data appended to {target_table_name} using saveAsTable.")
except Exception as e:
    print(f"Error using saveAsTable: {e}")
Creating DataFrame and appending data to students_delta_pyspark table

To verify:

spark.sql(f"SELECT * FROM students_delta_pyspark).show()
Querying and displaying data from students_delta_pyspark table

Option 2—Using insertInto (table must exist, preferred for appending to existing tables)

Note that for this to work, the table students_delta_pyspark must already exist. If you ran the saveAsTable above, it does. Otherwise, create it first:

%python
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {target_table_name} (
  name       STRING,
  address    STRING,
  student_id INT
)
USING DELTA
PARTITIONED BY (student_id)
""")

more_data = [("Bill Gates", "Medina", 900009)]
more_students_df = spark.createDataFrame(
    more_data,
    schema=["name", "address", "student_id"]
)

try:
    more_students_df.write \
        .mode("append") \
        .insertInto(target_table_name)
    print(f"Data appended to {target_table_name} using insertInto.")
except Exception as e:
    print(f"Error appending data: {e}")
Creating table and appending data to target_table_name - Databricks INSERT INTO

To verify:

spark.sql(f"SELECT * FROM {target_table_name} WHERE student_id = 900009").show()
Querying and displaying data from target_table_name - Databricks INSERT INTO

These examples demonstrate appending data. saveAsTable is convenient as it creates the table if it's missing, while insertInto strictly requires the table to exist.

Scala Example—Overwriting a Databricks Delta Table (with Schema Evolution)

Let's assume spark is an available SparkSession and studentsDF is a Scala DataFrame. We'll add a new column to demonstrate schema evolution.

%scala
// Add a new 'house' column (placeholder values) to demonstrate schema evolution on overwrite
val studentsData = Seq(
  ("Elon Musk",       "Palo Alto",  1001, "House A"),
  ("Bernard Arnault", "Paris",      1002, "House B"),
  ("Jeff Bezos",      "Seattle",    1003, "House A"),
  ("Mark Zuckerberg", "Menlo Park", 1004, "House C"),
  ("Bill Gates",      "Medina",     1005, "House B")
)
val studentsDF = studentsData.toDF("name", "address", "student_id", "house")

val targetScalaTableName = "students_delta_scala"

try {
  studentsDF.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(targetScalaTableName)

  println(s"Table $targetScalaTableName overwritten successfully with new schema.")
} catch {
  case e: Exception =>
    println(s"Error overwriting table $targetScalaTableName: ${e.getMessage}")
}
Creating DataFrame and overwriting targetScalaTableName with new schema

To verify:

spark.sql(s"DESCRIBE TABLE $targetScalaTableName").show(false)
spark.sql(s"SELECT * FROM $targetScalaTableName").show()
Showing table description and records from targetScalaTableName - Databricks INSERT

Here in this Scala example, .mode("overwrite") combined with .option("overwriteSchema", "true") allows the saveAsTable operation to not only replace the data but also update the table's schema to include the new house column.

When is the DataFrame API the better pick?

  • You’ve got complex pre-insert logic (cleaning, joins, enrichment or custom functions) that maps more naturally to code than to a single SQL statement.
  • Table names, schemas, partition columns or write options must be set dynamically at runtime based on parameters or config, something you can’t do inside a static SQL string as cleanly.
  • This insert is one step in a broader Spark app written in Python or Scala: sticking with df.write… keeps your codebase consistent and maintainable.
  • You want to wrap logic in functions or classes and run unit tests against them—DataFrame-based code is far easier to test than detached SQL scripts.
  • You need advanced Spark features—structured streaming, custom UDFs or MLlib pipelines—that aren’t exposed through plain SQL.
  • You need fine-grained control of the execution plan (e.g. caching, Tungsten optimizations) that only the DataFrame API surface exposes.

Remember: DataFrameWriter.insertInto(tableName) requires your DataFrame’s schema to exactly match the target table’s schema.

When does a plain Databricks INSERT INTO make sense?

  • You’ve got a simple, ad-hoc load or you’re already deep in a SQL-centric workflow—typing a one-liner INSERT INTO … SELECT … is just quicker
  • Your team is more comfortable writing SQL than coding in PySpark or Scala. SQL is more readable.

🔮 There’s no hidden speed gap: Spark SQL and the DataFrame API both compile down to the same execution engine (Catalyst + Tungsten), so equivalent operations run with equal efficiency.

Best Practices, Notes, and Limitations of Databricks INSERT INTO

Here are some tips and caveats when using Databricks INSERT INTO:

Best Practices:

1) Use Delta Tables

Use Databricks Delta tables. Databricks SQL INSERT is primarily meant for Delta tables (ACID transactional). Some legacy table formats (like Hive Avro with timestamp-millis) might not be supported.

2) Manage Small Files

Frequent, small INSERT INTO operations (common in streaming micro-batches or row-by-row insertions from some sources) can lead to an accumulation of many small files in your Delta Lake table. Consider using Databricks OPTIMIZE command which compacts small files into larger, more optimally sized ones, improving read throughput.
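
A minimal sketch of the compaction step (table and column names are placeholders):

OPTIMIZE my_delta_table;                        -- compact small files into larger ones
OPTIMIZE my_delta_table ZORDER BY (student_id); -- optionally co-locate data on common filter columns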

3) Leverage Partitioning

When inserting large amounts of data, leverage partitioning to parallelize writes. Static partitions can be faster if you know the target partition.

4) No Automatic Deduplication

Databricks INSERT INTO only appends data. If you need to insert new records or update existing ones (upsert), use the MERGE INTO command.

5) OVERWRITE Caution

Be extremely careful with INSERT OVERWRITE. An accidental overwrite without a proper PARTITION clause on a large table can lead to significant data loss. Delta Lake's Time Travel can help recover, but prevention is better.
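
If an accidental overwrite does happen on a Delta table, recovery is usually a history lookup plus a restore. A hedged sketch (table name and version number are hypothetical):

DESCRIBE HISTORY my_delta_table;                  -- identify the last good version
RESTORE TABLE my_delta_table TO VERSION AS OF 12; -- roll the table back to that version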

6) Use REPLACE WHERE for Precision

For targeted, atomic replacement of rows in Delta tables based on a condition, REPLACE WHERE is generally preferred over a manual DELETE followed by an INSERT due to its atomicity and conciseness.

Important Notes to Keep in Mind:

  • For basic appends, you don’t need the INTO keyword. Writing INSERT my_table SELECT * FROM src is identical to INSERT INTO my_table SELECT * FROM src.
  • When you specify an explicit column list in INSERT INTO and leave out some columns, Databricks SQL (12.2 LTS+) fills them this way:
    • Use the DEFAULT value defined in the table schema.
    • If there’s no DEFAULT and the column is nullable, insert NULL.
    • If the column is NOT NULL with no DEFAULT, the insert fails with an error
  • The BY NAME clause maps source query columns to target table columns by name
    • If a target column isn’t in the source, the default/null/error logic applies.
    • If the source has extra columns, they’re silently ignored.
  • All Databricks INSERT INTO operations on Delta Lake tables use the Delta protocol’s ACID support. That means each insert is atomic, consistent, isolated, and durable—you either get all rows or none, with no partial writes.
  • You can use Databricks INSERT INTO with Parquet or CSV tables, but you lose ACID guarantees, schema evolution, and REPLACE WHERE logic.
    • Appends just drop in new files.
    • Overwrites replace existing files.
    • There’s no transaction log to roll back partial writes
  • If a Delta table isn’t registered in the metastore, you can use its path:
INSERT INTO delta.`/path/to/delta_table` ...

It writes straight to the table’s storage location, treating the path as the table identifier.

Limitations of Databricks INSERT INTO

Understanding the limitations helps in avoiding errors and choosing the right approach:

  1. Databricks does not support INSERT operations for Hive Avro tables if the timestamp-millis data type is present in the table schema.
  2. The table_name specified in an INSERT statement must not be a foreign table (this limitation applies from Databricks Runtime 16.0 onwards). Data should typically be ingested into a staging Databricks Delta table first, then inserted into the final target if the ultimate destination involves a federated system.
  3. BY NAME Clause Restrictions:
    • Columns in the source query cannot share names with columns specified in a static PARTITION clause when BY NAME is used.   
    • Similarly, columns in the query cannot match generated columns in the target table when using BY NAME.
  4. Databricks REPLACE WHERE clause is exclusively for use with Delta Lake tables.
  5. Any data inserted via INSERT INTO must satisfy all enforced NOT NULL and CHECK constraints defined on the target Databricks Delta table. Violation of these constraints will cause the insert operation to fail (see the sketch after this list).
  6. Direct Databricks INSERT INTO operations are not supported for evolving the schema or directly manipulating data in streaming tables. These tables are managed by their Delta Live Tables (DLT) flow definitions.
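
To illustrate limitation 5, here is a minimal sketch on a hypothetical orders table with an id and an amount column; the commented-out insert would be rejected because it violates the CHECK constraint:

ALTER TABLE orders ADD CONSTRAINT positive_amount CHECK (amount > 0);

INSERT INTO orders VALUES (1, 49.99);    -- satisfies the constraint
-- INSERT INTO orders VALUES (2, -5.00); -- would fail: violates positive_amount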

Alternatives to Databricks INSERT INTO: Other Ways to Fill Your Tables

While Databricks INSERT INTO is a versatile and commonly used command for adding data to tables, Databricks provides a suite of other tools and commands that can be more suitable for specific data ingestion scenarios.

1) Databricks CREATE TABLE AS SELECT (CTAS)

Databricks CTAS creates and populates a new table from a query result. It is useful for ad-hoc snapshots or one-time transformations.

Here is the syntax of Databricks CREATE TABLE AS SELECT (CTAS):

CREATE TABLE new_table
USING DELTA
AS SELECT * FROM source_table

2) Databricks COPY INTO

Databricks COPY INTO is a SQL command specifically designed for idempotent and efficient bulk loading of data from files residing in cloud object storage (like Amazon S3, Azure Data Lake Storage Gen2, Google Cloud Storage) or Databricks Unity Catalog volumes directly into a Databricks Delta table.

Here is the syntax of Databricks COPY INTO:

COPY INTO my_delta
FROM 's3://bucket/path/'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true');

Databricks COPY INTO is excellent for bulk ingestion pipelines and handles schema inference/evolution.

3) Databricks MERGE INTO

Databricks MERGE INTO (often referred to as "upsert") is a powerful SQL command that allows you to conditionally update existing rows, insert new rows, and optionally delete rows in a target Databricks Delta table based on a comparison with a source table or query result.

Here is the syntax of Databricks MERGE INTO:

MERGE INTO target AS T
USING updates AS S
ON T.id = S.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Databricks MERGE INTO vs Databricks INSERT INTO

Databricks INSERT INTO solely adds new rows. Databricks MERGE INTO is far more versatile, capable of combining inserts, updates, and deletes in one operation. While INSERT INTO... REPLACE WHERE offers a form of conditional overwrite, MERGE INTO provides a much richer set of conditions and actions for more complex synchronization tasks.

4) DataFrameWriter API (Programmatic Inserts)

As discussed in a previous section, the DataFrameWriter API in Databricks PySpark and Scala allows you to write the contents of a DataFrame to a table using methods like .saveAsTable("table_name") or .insertInto("table_name") with various modes (append, overwrite).


Conclusion

And that’s a wrap! The Databricks INSERT INTO command is your go‑to for adding or replacing data in tables. You can append new rows, overwrite old ones, or use Delta Lake’s REPLACE WHERE to swap out specific parts. Remember, INSERT INTO won’t change your table’s layout unless you turn on schema updates first. For quick jobs, plain SQL works; for tougher cases, switch to Spark’s DataFrame API. To keep things running smoothly, batch your inserts, pick sensible partitions, and run Databricks OPTIMIZE on your Delta tables. Nail these steps and your data will flow cleanly into reports, models, or dashboards.

In this article, we have covered:

  • Databricks INSERT INTO syntax, parameters, and variations (VALUES, subqueries, the TABLE clause, partitions, BY NAME, REPLACE WHERE)
  • The difference between INSERT INTO and INSERT OVERWRITE
  • In-depth practical examples, from single-row inserts to partition overwrites
  • Programmatic insertion via the DataFrame API (PySpark and Scala)
  • Best practices, notes, and limitations
  • Alternatives such as CTAS, COPY INTO, MERGE INTO, and the DataFrameWriter API

… and so much more!

FAQs

Can I INSERT into non-Delta tables?

Yes, with caveats. Databricks INSERT INTO is primarily designed for Delta tables; you can insert into other formats such as Parquet or CSV, but you lose ACID guarantees, schema evolution, and REPLACE WHERE. Some older formats (e.g. Hive Avro with timestamp-millis) are not supported. You cannot INSERT INTO a view or temporary view; the target must be a full table.

What happens if I omit a column with no default?

If you use an explicit column list in your Databricks INSERT INTO statement and omit a target table column:

  • If the omitted column has a DEFAULT value defined in the table schema, that default value will be used.   
  • If there's no DEFAULT value and the column is nullable, NULL will be inserted.   
  • If there's no DEFAULT value and the column is defined as NOT NULL, the INSERT INTO operation will fail with an error

What if the table schema changes (new column)?

By default, an insert whose source includes a new column fails schema validation. If schema evolution is enabled, Databricks INSERT INTO can add the new column (appended at the end of the schema); otherwise you must alter the table first with ALTER TABLE ... ADD COLUMNS.

Does INSERT INTO guarantee no duplicates?

No, it simply appends. If you need to avoid duplicates or update rows, consider using MERGE (upsert) or INSERT INTO ... REPLACE WHERE for specific keys.

What is the REPLACE WHERE clause?

It’s a Databricks extension that lets you atomically replace only those rows matching a predicate. It’s like “overwrite these rows only”.

Can Databricks INSERT OVERWRITE and REPLACE WHERE be used together?

No, Databricks INSERT OVERWRITE and REPLACE WHERE are distinct mechanisms for modifying data and cannot be used together in a single INSERT statement.

What is the difference between INSERT and INSERT INTO? Is INTO optional?

In Databricks SQL, for basic append operations, the INTO keyword is optional. So, both INSERT INTO my_table SELECT... and INSERT my_table SELECT... are valid and will append data.

How can we insert data into a view?

Generally, you cannot directly INSERT INTO a standard SQL view in Databricks. Standard views are virtual tables defined by a query; they don't store data themselves. Data modifications must happen on the base tables from which the view derives its data.

Can I insert data into a partition that doesn’t yet exist?

Yes. If you use a static partition, specifying a new partition value will create it. For dynamic partitions, Spark creates partitions as needed.

How to insert with null or default values?

If you omit columns in the INSERT list, Databricks uses column defaults or NULL if allowed. Make sure your table has defaults or nullable columns as needed.

Alternative commands?

Besides INSERT, Databricks supports Databricks COPY INTO for file ingestion, Databricks MERGE for upsert, and DataFrame .save() operations. For full refresh, CREATE OR REPLACE TABLE AS SELECT can be simpler.

How do I rollback if something goes wrong?

If on Delta Lake, use Delta’s time travel (RESTORE) or transactions. INSERT INTO is a transactional write, so on failure it won’t partially commit.