HOW TO: Identify and Eliminate Duplicate Snowflake Data

Snowflake can handle and analyze large volumes   of data. But, like any other cloud data warehouses/dbs, Snowflake is not immune to data duplication. Like any other warehouse and DBs Snowflake does support unique, primary, and foreign keys but it does not enforce these constraints (except for NOT NULL constraints). This means that duplicate rows can be inserted into Snowflake tables, which could lead to certain records being inserted more than once. There are a number of strategies that can be used to mitigate this issue.

In this article, we will cover everything you need to know about identifying and eliminating duplicate data in Snowflake. We will start by defining duplicate data and explaining why it is a problem. Then, we will discuss the different methods for identifying duplicate data in Snowflake. We will also cover some preventive measures that you can take to avoid duplicate data in the future. Lastly, we will provide you with a hands-on example on removing duplicate data while loading CSV to Snowflake.

Let's dive right in!!

Understanding Duplicate Data

Duplicate data is one of the main data quality issues in data warehouses like Snowflake. Having multiple copies of the same data row reduces data accuracy and reliability. It is important to regularly identify and remove duplicates

Common causes of Snowflake data duplication:

There are a few common causes of Snowflake data duplication:

1) Snowflake data duplication Cause 1—Human error:

Data duplication can occur due to human mistakes during data entry, such as accidentally entering the same information multiple times or copying and pasting data incorrectly.

2) Snowflake data duplication Cause 2—System glitches or software bugs:

Technical issues within a system or software can sometimes lead to duplicate data creation. It can happen when there are errors in data synchronization or when system processes fail to handle data properly.

3) Snowflake data duplication Cause 3—Data integration or migration processes:

During data integration or migration from one system to another, data duplication can occur if the mapping or transformation rules are not properly defined or if there are inconsistencies in data formats between the source and target systems.

4) Snowflake data duplication Cause 4—System integration:

When importing or bringing data together from multiple sources, duplicates can be created if there is no any kind of matching process. The same entity may exist in both systems but with slightly different details.

5) Snowflake data duplication Cause 5—Lack of constraints:

Primary keys and unique constraints can help prevent duplicates

Note: Snowflake does not enforce these constraints, but if they are not properly implemented, duplicates can emerge.

Impacts of Snowflake data duplication:

The impacts of duplicate Snowflake data are significant:

1) Snowflake data duplication Impact 1— Inaccurate Analysis and Poor Metrics:

Aggregate metrics and KPIs will be inflated and inaccurate with duplicate data. Analysis will also yield incorrect insights.

2) Snowflake data duplication Impact 2—Very poor data quality:

Duplicate data reduces the overall accuracy, completeness and reliability of data which reduces data quality, which can significantly undermine confidence in the data.

3) Snowflake data duplication Impact 3—Operational inefficiencies:

Additional time & resources are required to handle and resolve the duplicate data which reduces operational efficiency.

4) Snowflake data duplication Impact 4—Higher costs:

The costs to identify, resolve, and prevent duplicate data can be quite high especially if done manually. These costs reduce the ROI on data initiatives.

5) Snowflake data duplication Impact 5—Bad decision making:

Inaccurate data and analysis due to data duplication can lead to bad decision making. When businesses make decisions based on incomplete or incorrect information, it can have a negative impact on key business outcomes. For example, let’s say a company/business may make a decision to launch a new product based on the assumption that there is a large market fit for it, when in reality the market is much smaller than they thought. This can lead to lost sales + wasted marketing spend and other financial losses.

Methods for Eliminating Snowflake Data Duplication

To ensure accurate and reliable flow of Snowflake data, it is crucial to have effective methods for eliminating duplicate data. Snowflake offers various techniques and functionalities to address this issue. Here are some of the key methods available in Snowflake for eliminating duplicate data:

1) Using Snowflake DISTINCT

In Snowflake DISTINCT keyword is used in conjunction with the SELECT statement to retrieve only distinct (unique) values from a dataset. So by only selecting columns with the DISTINCT keyword, Snowflake returns only the unique records, eliminating any duplicates.

For example;

Let's begin by creating a sample table and inserting some dummy data for testing purposes.

CREATE OR REPLACE TABLE STUDENT_RECORD (
STUDENT_ID NUMBER(6,0),
FIRST_NAME VARCHAR2(20),
LAST_NAME VARCHAR2(20),
AGE NUMBER(3,0),
ADDRESS VARCHAR2(100),
PHONE_NUMBER VARCHAR2(20),
GRADE VARCHAR2(10)
);

INSERT INTO STUDENT_RECORD(STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE) VALUES
(1, 'John', 'Cena', 18, '123 Main St, City', '123-456-7890', 'A'),
(1, 'John', 'Cena', 18, '123 Main St, City', '123-456-7890', 'A'),
(2, 'Rock', 'Bottom', 17, '456 Second St, Town', '987-654-3210', 'B'),
(2, 'Rock', 'Bottom', 17, '456 Second St, Town', '987-654-3210', 'B'),
(2, 'Rock', 'Bottom', 17, '456 Second St, Town', '987-654-3210', 'B'),
(3, 'James', 'Johnson', 16, '789 Oak St, Village', '456-123-7890', 'C'),
(3, 'James', 'Johnson', 16, '789 Oak St, Village', '456-123-7890', 'C'),
(4, 'Sarah', 'Williams', 18, '321 Pine St, County', '789-123-4560', 'A');

SELECT * FROM STUDENT_RECORD;
Creating a table and inserting multiple records - Snowflake data
Creating a table and inserting multiple records.

Let’s  retrieve unique records from the table . To do so, simply type in the following query.

SELECT DISTINCT * FROM STUDENT_RECORD;
Retrieving unique records from the table - Snowflake data quality
Retrieving unique records from the table

Now, let’s go a bit further and refine the student records and remove any duplicates.

First, let's create a separate table called "refined_student_record" by selecting distinct records from the "student_record" table. This will create a new table with unique records.

CREATE OR REPLACE TABLE refined_student_record AS SELECT DISTINCT * FROM student_record;

Next, we will rename the original "student_record" table to "student_record_not_refined". This step is optional but can be useful if you want to keep a backup of the original data.

ALTER TABLE student_record RENAME TO student_record_not_refined;

Then, we will rename the "refined_student_record" table to "student_record", effectively replacing the original table with the refined version.

ALTER TABLE refined_student_record RENAME TO student_record;

Lastly, we can verify the changes by selecting all records from the "student_record" table, which now contains the refined and deduplicated data.

SELECT * FROM student_record;
Selecting all records from the table - Snowflake data quality
Selecting all records from the table

2) Using Snowflake ROW_NUMBER() function

Snowflake ROW_NUMBER() function is an analytic function that assigns a unique sequential number to each row within a specified partition of the result set. Using this function, you can assign row numbers to records and then filter for rows with a row number of 1, effectively removing duplicates. This method is useful when you want to eliminate duplicates based on specific column combinations.

Now, let's write a query utilizes the Snowflake ROW_NUMBER() analytic function to assign row numbers to each unique set of records based on the specified columns:

 SELECT 
    STUDENT_ID,
    FIRST_NAME,
    LAST_NAME,
    AGE,
    ADDRESS,
    PHONE_NUMBER,
    GRADE,
  ROW_NUMBER() OVER(PARTITION BY STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE ORDER BY STUDENT_ID) AS ROW_NUMBER
FROM student_record_not_refined;
Retrieving refined student records from the table student_record_not_refined table - Snowflake data quality
Retrieving refined student records from the table student_record_not_refined table

After assigning row numbers to the records, you can extract the unique records from the table by querying the rows with a row number of 1. The following SQL query demonstrates how to extract unique records from the student_record_not_refined table using the ROW_NUMBER analytic function:

SELECT
  STUDENT_ID,
  FIRST_NAME,
  LAST_NAME,
  AGE,
  ADDRESS,
  PHONE_NUMBER,
  GRADE
FROM (
  SELECT
    STUDENT_ID,
    FIRST_NAME,
    LAST_NAME,
    AGE,
    ADDRESS,
    PHONE_NUMBER,
    GRADE,
    ROW_NUMBER() OVER (PARTITION BY STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, GRADE ORDER BY STUDENT_ID) AS ROW_NUMBER
  FROM student_record_not_refined
) AS subquery
WHERE ROW_NUMBER = 1;

As you can see, the inner subquery selects all columns from the student_record_not_refined table and includes the additional column ROW_NUMBER() OVER (...) AS ROW_NUMBER, which assigns row numbers to each unique set of records based on the specified columns. The outer query then retrieves the columns STUDENT_ID, FIRST_NAME, LAST_NAME, AGE, ADDRESS, PHONE_NUMBER, and GRADE from the subquery, and finally the WHERE clause filters the results to only include rows where the ROW_NUMBER column is equal to 1, effectively extracting the unique records.

3) Using Snowflake SWAP WITH command

Snowflake SWAP WITH command in Snowflake allows for the swapping of content and metadata between two specified tables, including integrity constraints, which involves creating a new table with the same structure as the original table, inserting unique records into the new table, and then swapping the data between the two tables. This approach is particularly suitable for removing duplicates from large datasets efficiently.

Let’s create a new table with the same structure as the source table.

Use the CREATE TABLE LIKE statement to create the STUDENT_RECORD_MIRROR table with the same structure as the STUDENT_RECORD_NOT_REFINED table, without any data.

CREATE OR REPLACE TABLE student_record_mirror LIKE student_record_not_refined;
Create a new table with the same structure as student_record_not_refined - Snowflake data quality
Create a new table with the same structure as student_record_not_refined

Now, insert unique records into the newly created table. Make sure to use the INSERT INTO statement to select distinct records from the STUDENT_RECORD_NOT_REFINED table and insert them into the STUDENT_RECORD_MIRROR table.

INSERT INTO student_record_mirror SELECT DISTINCT * FROM 
student_record_not_refined;
Inserting distinct records - snowflake data quality
Inserting distinct records

Finally, Swap the data between the two tables. Use the SWAP WITH command to swap the data and metadata between the STUDENT_RECORD_NOT_REFINED and STUDENT_RECORD_MIRROR tables. This operation is performed in a single transaction.

ALTER TABLE student_record_mirror SWAP WITH student_record_not_refined;
Swapping the contents and structure of the table - Snowflake data quality
Swapping the contents and structure of the table
select * from student_record_not_refined;
Selecting all record from the table - Snowflake data quality
Selecting all record from the table - Snowflake data quality

4) Using OVERWRITE command

Alternatively, you can use the INSERT statement with the OVERWRITE command to insert unique records directly into the student record table. This method eliminates the need for creating an intermediate table and swapping data. The existing data in the table will be truncated before inserting the unique records.

Let’s first test  our table has un-refined data or not. To do so, just select all from our unrefined table.

select * from student_record_not_refined_2;
Selecting all record from the table - Snowflake data quality
Selecting all record from the table

Now, lets overwrite it.

INSERT OVERWRITE INTO student_record_not_refined_2 SELECT DISTINCT * FROM student_record_not_refined_2;
select * from student_record_not_refined_2;
Inserting distinct records from the table student_record_not_refined_2 and overwriting the existing data - Snowflake data quality
Inserting distinct records from the table student_record_not_refined_2 and overwriting the existing data

5) Using Third-party tools to check Snowflake data quality

Apart from the built-in features, functionalities and commands provided by Snowflake, there are various third-party Snowflake tools available that specialize in data quality and deduplication. These tools, such as:

These tools offer exceptional functionalities to efficiently identify and eliminate duplicate data. So by carefully leveraging these tools, you can benefit from flexibility and customization options tailored to their specific data management needs. These third-party tools provide enhanced capabilities and can boost Snowflake's native capabilities, allowing for comprehensive and efficient duplicate data management.

Preventing Duplicate Data in the Future

So, how do you prevent data duplication issues in Snowflake? Welp!! There are several strategies you can employ, they are:


1) Standardizing data entry:

Establishing guidelines and enforcing consistent data entry formats can minimize the chaos of duplicate entries.

2) Implementing validation rules:

Applying validation rules and constraints during data entry can prevent duplicate records from being created.

3) Implementing unique identifiers:

Assigning unique identifiers to records ensures their uniqueness and helps in identifying and eliminating duplicates.

4) Setting up automated data cleaning processes:

Automating data cleaning processes can significantly reduce the occurrence of duplicate data, which involves implementing data quality tools or scripts that regularly scan and clean the data, identifying and removing duplicates.

5) Importance of regular data audits:

Conducting regular data audits is crucial for identifying and addressing duplicate data. Audits help in detecting duplicate records, understanding the root causes, and implementing necessary measures to prevent their occurrence in the future.

Hands-on Example:

Now, let's explore a hands-on example of loading a demo CSV file that contains duplicate records. We will refine the data before loading it into the table. Follow these steps carefully to proceed:

Here is the sample of what our CSV record looks like:

Sample CSV records - Snowflake data quality
Sample CSV records

To begin, we need to load our CSV file into Snowflake. There are multiple methods available for this task. We can use Snowsight to directly upload the file, utilize a Snowflake external stage, or employ a Snowflake managed stage (internal stage). For the purpose of this tutorial, we will use Snowsight to create a stage and load the CSV file into it.

Here are the steps to follow to upload CSV to Snowflake using stage:

Step 1: Log in to Snowsight.

Step 2: Head over to the navigation menu, select "Data" and then click on "Databases", which will display a list of available databases.

Select 'Data' > 'Databases' to view available databases - snowflake import csv - Snowflake data quality
Select 'Data' > 'Databases' to view available databases - snowflake csv

Step 3: Select the desired database and schema from the list to set the context for your data loading process.

Step 4: Choose the ‘stage’ option and click on “Snowflake Managed” stage.

Selecting 'stage' -> 'Snowflake Managed' option - upload CSV to snowflake - snowflake data quality
Selecting 'stage' -> 'Snowflake Managed' option - upload CSV to snowflake

Step 5: Enter a Stage Name. In the "Stage Name" field, provide a descriptive name for your stage.

Entering a descriptive Stage Name - upload CSV to snowflake
Entering a descriptive Stage Name - upload CSV to snowflake

Step 6: Click the "Create" button to create the Snowflake Managed stage.

Step 7: Choose the CSV file to load into the Snowflake stage.

Choosing the CSV to load into Snowflake stage - snowflake import csv - Snowflake data quality
Choosing the CSV to load into Snowflake stage - snowflake import csv

Step 8: After successfully uploading the CSV file, navigate to the worksheet and enter the following command to verify if the CSV file has been loaded or not.

list '@"MY_DATABASE"."MY_SCHEMA"."STUDENT_RECORD_STAGE"/Student_record.csv';
Listing all the records to verify if the CSV file has been loaded - Snowflake data quality
Listing all the records to verify if the CSV file has been loaded - Snowflake data quality

Step 9: Now, let's create a table using the CREATE TABLE statement that matches the structure of your CSV file.

Run the following command in SnowSQL to create the table:

CREATE OR REPLACE TABLE student_record_demo_csv_load_data (
    student_id INTEGER,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    date_of_birth DATE,
    email VARCHAR(100),
    address VARCHAR(200)
);
Creating a new table - Snowrflake data quality
Creating a new table

Step 10: Create a FILE FORMAT. Run the following command in SnowSQL to create a file format:

CREATE or REPLACE FILE FORMAT student_record_csv_file_format
type = 'csv' 
compression = 'none' 
field_delimiter = ',' 
record_delimiter = '\n' 
skip_header = 1 
Creating a File Format - Snowflake data quality
Creating a File Format

Step 11: Now that you have created the file format, let's check the table to see if there is any data. If the console displays "query produced no result," it means there is no data in the table.

select * from student_record_demo_csv_load_data;
Retrieving all records from the table - snowflake data quality
Retrieving all records from the table

Step 12: Finally, to load the data from the stage into a Snowflake table you can use the COPY INTO command. Choose the appropriate command based on your requirements:

  • To load data as it is organized in the CSV file:
copy into student_record_demo_csv_load_data
    from '@"MY_DATABASE"."MY_SCHEMA"."STUDENT_RECORD_STAGE"/Student_record.csv'
    file_format = student_record_csv_file_format
    force = true 
    on_error = 'continue';
loading data as it is organized in the CSV file - Snowflake data quality
loading data as it is organized in the CSV file

Step 13: Now, if you select everything from the table, you will be able to see all of your data. However, the issue of duplication still persists. In the next step, we will address this problem.

select * from student_record_demo_csv_load_data;
Selecting all the record - Snowflake data quality
Selecting all the record

Step 14: Truncate the table to remove the entire record from the table.

truncate table student_record_demo_csv_load_data;
Truncating the table to remove the entire record from the table - Snowflake data quality
Truncating the table to remove the entire record from the table - Snowflake data quality

Step 15: To remove the duplicates while copying the data from the CSV file to the Snowflake table, execute the following query:

copy into student_record_demo_csv_load_data
    from 
    (   
    select distinct * from '@"MY_DATABASE"."MY_SCHEMA"."STUDENT_RECORD_STAGE"/Student_record.csv'
    )
    file_format = student_record_csv_file_format
    force = true 
    on_error = 'continue';
Removing the duplicates while copying the data - Snowflake data quality
Removing the duplicates while copying the data

This particular query removes duplicates by selecting distinct records from the CSV file and then copies the refined data into the designated Snowflake table.

select * from student_record_demo_csv_load_data;
Selecting all the record - Snowflake data quality
Selecting all the record

Congratulations! As you can see, there are no more duplicate data entries. Just like this, you can effortlessly eliminate duplicate data from your dataset using the provided method. It's a simple and effective way to ensure data integrity and accuracy.

Conclusion

In conclusion, managing duplicate data in Snowflake, much like tending a garden, requires a keen eye for identifying the unwanted weeds (duplicates) and the right tools to eliminate them. To address this issue, we explored methods such as using the DISTINCT keyword, the Snowflake ROW_NUMBER() function, and the Snowflake SWAP WITH command to identify and eliminate duplicates in Snowflake. So by carefully following these strategies, you can ensure your data garden remains healthy and productive. Remember, prevention is better than cure; setting up preventive measures is akin to building a fence around your garden, keeping the weeds at bay. With these insights, you're now equipped to maintain the integrity of your data in Snowflake, ensuring it remains a reliable source for your analytical needs.

Want to take Chaos Genius for a spin?

FAQs

What is duplicate data?

Duplicate data are records that contain the same or similar information. They can occur in Snowflake tables for a variety of reasons, such as human error, data entry errors, or system failures.

Why is it important to identify and eliminate duplicate data?

Duplicate data can cause a number of problems, including:

  • Inaccuracy: Duplicate data can lead to inaccurate reports and analyses.
  • Wasted space: Duplicate data can take up unnecessary space in Snowflake tables.
  • Performance issues: Duplicate data can slow down queries and operations.
  • Data security risks: Duplicate data can increase the risk of data breaches and other security incidents.

What is duplicate data in Snowflake?

Duplicate data in Snowflake refers to identical rows or records that are inserted into Snowflake tables more than once. This can occur due to a lack of enforcement of unique, primary, and foreign key constraints.

Why is duplicate data a problem in Snowflake?

Duplicate data can lead to inaccuracies in data analysis and reporting. It can also consume unnecessary storage space, leading to increased costs and inefficient data management.

What are some preventive measures to avoid duplicate data in Snowflake?

Preventive measures include implementing data validation checks before insertion, using deduplication tools, and regularly monitoring and cleaning your data.

Does Snowflake support unique, primary, and foreign keys?

Yes, Snowflake supports unique, primary, and foreign keys. However, it does not enforce these constraints (except for NOT NULL constraints), which can lead to the insertion of duplicate rows.

Pramit Marattha

Technical Content Lead

Pramit is a Technical Content Lead at Chaos Genius.

People who are also involved

Great! You've successfully subscribed.
Great! Next, complete checkout for full access.
Welcome back! You've successfully signed in.
Success! Your account is fully activated, you now have access to all content.