Snowflake Query Tuning and Optimization: A Comprehensive Guide (Part 2)
Welcome to the second part of the Snowflake query tuning and optimization best practices series. In this two-part series, we will show you how to optimize Snowflake queries. Part 1 discussed various Snowflake features that can enhance Snowflake query performance. In this part (Part 2), we will dig deeper into some common mistakes to avoid while writing Snowflake queries.
Now let's get started!
4 Best Practices for Snowflake query tuning and Optimization
1) Select only required data columns (avoid select *)
A common, but effective, way to optimize queries is to avoid using the "SELECT *" statement in queries. Since Snowflake uses columnar storage, retrieving additional columns increases the query execution time significantly. When the "SELECT *" statement is used in Snowflake, it retrieves all columns from the table or tables involved in the query rather than just the columns required for the query, causing Snowflake to scan and transfer more data than necessary, increasing Snowflake query processing time.
To minimize query processing time and optimize Snowflake query performance, SELECTing only the required data columns in Snowflake queries is the best practice, which means specifying the exact columns needed for the query rather than SELECTing the entire thing (*).
You can use a “SELECT *” in a nested query or a CTE and specify the exact columns in the outer query, or vice versa. This will be optimized by Snowflake and it will not scan the columns which are not used.
Here are some quick HOT tips and tricks for SELECTing only the required data columns in Snowflake queries:
- Explicitly specify the required columns in the SELECT statement rather than "SELECT *"
- Avoid including unnecessary columns in the WHERE, JOIN, GROUP BY and ORDER BY clauses of the query.
- Use table aliases to simplify the query and make it easier to specify only the columns that are required.
- Use subqueries to retrieve only the required data columns from nested tables or views (if necessary).
2) Optimize Sorting
Sorting is an important aspect of Snowflake query tuning and optimization. If a query requires sorting, it can result in additional compute load, resulting in a delay in the completion of the query and therefore leading to increased Snowflake costs.
Sorting can impact snowflake query performance in the following ways:
Massive Resource utilization
Sorting huge datasets necessitates a significant amount of CPU and memory resources; the more resources required, the more expensive the query.
Whenever data is sorted, it may need to be written to disk, resulting in additional I/O operations that might hinder or slow down Snowflake query performance.
Network data transfer
Whenever a large dataset is sorted, the data needs to be distributed across multiple different nodes in the cluster so that each node can Sort its own data. For this method of distributing data across nodes, data needs to be shuffled, which means moving data from one node to another. Hence, because of this, more data is sent over the network, which can slow down the Snowflake query performance.
Here are some best practices for optimizing Snowflake Sorting:
Clustering is a technique used to group related data together on disk. So, by clustering tables based on the sort key, the data will be sorted internally and the ORDER BY can be completely avoided.
When loading data into Snowflake, consider Sorting it before loading it into the database, which can drastically reduce the amount of SORTing required during query execution.
So by incorporating these Snowflake query tuning practices, users can easily optimize Snowflake sorting and enhance their query performance.
3) Using JOINS instead of Subqueries?
Subqueries can be useful, but they can also slow down your queries. Try to avoid using subqueries whenever possible or rewrite them as joins.
Suppose you have two tables, orders and customers. Let's say you wish to access the order details for all customers in a specific region. One method is to use a subquery to acquire the customer IDs for customers in the specified region, and then join that result to the orders table to obtain the order details:
SELECT o.order_id, o.order_date, o.order_total, c.customer_name, c.customer_region FROM orders o JOIN ( SELECT customer_id FROM customers WHERE customer_region = 'East' ) cids ON o.customer_id = cids.customer_id JOIN customers c ON o.customer_id = c.customer_id ORDER BY o.order_date;
BUT, you can completely avoid using a subquery in this case by using a join to link the orders and customer tables directly on the customer_region column:
SELECT o.order_id, o.order_date, o.order_total, c.customer_name, c.customer_region FROM orders o JOIN customers c ON o.customer_id = c.customer_id AND c.customer_region = 'East' ORDER BY o.order_date;
4) Using performance optimization tools like Chaos Genius to get recommendations on Snowflake Query Performance
Chaos Genius is a Snowflake observability tool that can help identify and resolve performance issues in Snowflake by providing end-to-end visibility into your Snowflake data operations across various services and get recommendations on improving your Snowflake performance including suggestions for query tuning and optimizing query performance.
How Chaos Genius can help you improve Snowflake Query performance:
Monitoring Snowflake Query Performance: Chaos Genius is an exceptional tool for query tuning as it utilizes proprietary technology to analyze millions of queries by pattern identification and make smart recommendations to improve performance of your most expensive queries, leading to faster and more efficient data retrieval and improving the overall Snowflake query performance.
Query Recommendations: Chaos Genius helps you identify your most important query groups by various performance parameters like Total Execution Time, Spillage, Cost, Partitions Scanned and many others; It then identifies your most expensive query groups, which contribute to more than 50% of your Snowflake costs, and gives automated recommendations on query tuning and how to improve their performance.
Query Profile: You can get access to Snowflake query profile directly from the Chaos Genius dashboard w/o having to go to Snowflake UI and manually trying to get to those query profiles.
Monitoring & Alerting: You can get automated alerts on various anomalies and failures daily.
Snowflake can handle millions of queries, but it's important to remember that queries can't be run in any way to get the best optimal performance out of it. Utilizing best practices for Snowflake query tuning and optimization is the only viable option for running queries efficiently.
In Part 1 of this two-part article on how to optimize Snowflake queries, we talked about a number of Snowflake features that might make queries run faster. But, in this part (Part 2), we went over several typical pitfalls and common mistakes to avoid while writing Snowflake queries. Here's a summary of what we covered:
- Selecting relevant data columns only (avoid Select *)
- Optimize sorting
- Using JOINs instead of subqueries
- Leveraging Chaos Genius for query tuning and enhanced performance insights
Query tuning and optimization of Snowflake queries is like taking care of a houseplant—it's not a one-time job and needs to be done over and over again. Your queries are like a potted houseplant that needs care and attention to flourish, such as regular monitoring, fine-tuning, adjustment, trial and error, and analysis to ensure they run optimally.
How can sorting impact Snowflake query performance?
Sorting in Snowflake queries can result in increased compute load, disk I/O operations, and network data transfer, which can slow down query performance and increase costs.
Should I use subqueries or joins in Snowflake queries?
It is recommended to use joins instead of subqueries whenever possible. Subqueries can slow down queries, and rewriting them as joins can improve query performance.
How do you handle long running queries in a Snowflake?
- You can use the
QUERY_HISTORYfunction to identify long running queries. This function returns a table of all queries that have been run in your account, along with the start time, end time, and duration of each query.
- You can use the
SYSTEM$CANCEL_QUERYprocedure to cancel a long running query. This procedure takes a query ID as its argument.
- You can set a query timeout to prevent long running queries from monopolizing resources. This can be done using the
- You can create a dedicated warehouse for long running queries. This will ensure that these queries have enough resources to run without impacting other users.