Finding Expensive Queries in Snowflake by Calculating Cost per Query

by Vantage Team

Snowflake is famous for inventing the cloud data warehouse, a scalable “Data Cloud” with usage-based pricing. Users only pay when queries are running or data is moving. A classic example of the benefits of pay-as-you-go pricing comes from the company itself:

We have one customer that, every Friday afternoon, spins up a massive number of virtual warehouses to do their processing as a billing company. They spin up thousands of nodes, it runs for several hours, and then they spin them back down again.

The trade-off with usage-based pricing is that costs vary, even if they are lower. In Snowflake’s case, rogue queries can rack up bills, and any query is a target for optimization to lower costs.

By pulling together multiple tables and information about Snowflake’s pricing model, we can calculate cost per query. For advanced users, we can also employ query tags to get the most granular cost tracking available for the Data Cloud.

Querying for Query Cost

Some math is needed to find out how much each query costs in Snowflake. As described in the Vantage costs by query documentation:

Snowflake Compute is only billed when a query is running, rounded up to the hour. However, if multiple queries are running simultaneously the cost is still only reported for that hour.

Therefore we have to combine query execution time, warehouse size, and some other details to get the cost of each query. Snowflake offers an example query which ties together these metrics:

SET credit_price = 3.00;
SELECT ( QH.execution_time / (1000 * 3600) ) * WS.credits *
       ( $credit_price ) AS
       relative_performance_cost
FROM   query_history QH
       JOIN warehouse_size WS
         ON WS.wh_size = QH.warehouse_size

This query pulls from the QUERY_HISTORY table, part of the ACCOUNT_USAGE schema in the warehouse. The WAREHOUSE_SIZE table is a generated table to equate Snowflake credits to warehouse size. Compute is billed per credit, e.g. $3 credit price, and so the cost is based on the amount of credits that are used for a certain warehouse size:

SELECT wh_size, credits FROM (
      SELECT 'XSMALL'  AS wh_size, 1   AS credits UNION ALL
      SELECT 'SMALL'   AS wh_size, 2   AS credits UNION ALL
      SELECT 'MEDIUM'  AS wh_size, 4   AS credits UNION ALL
      SELECT 'LARGE'   AS wh_size, 8   AS credits UNION ALL
      SELECT 'XLARGE'  AS wh_size, 16  AS credits UNION ALL
      SELECT '2XLARGE' AS wh_size, 32  AS credits UNION ALL
      SELECT '3XLARGE' AS wh_size, 64  AS credits UNION ALL
      SELECT '4XLARGE' AS wh_size, 128 AS credits
)

These credits are mapped to warehouse sizes per hour. For example, the XLARGE warehouse consumes 16 credits for a full hour of usage. By combining the EXECUTION_TIME of each query with the warehouse size it ran on, we can get the cost in dollars of each query.

Grouping by Query Tag in the [console](https://console.vantage.sh/signup).

Example Snowflake queries ordered by their individual cost in the console.

There are two other details needed to rank Snowfalke queries by cost. Many queries will use the same structure but different values in the WHERE clause, or a different order of columns. In the table above, queries are stripped of their values (values are replaced with $1, $2, etc.) to make grouping easier. This also has the security beneift of not exposing sensitive values to analysts or engineers investigating costs.

Note: Startup times and Caching

When a warehouse first spins up, you are charged for 1 minute worth of credits even if the query only runs for 30 seconds. Opendoor engineering accounted for this in the query they built to track costs, as does the console. Additionally, Snowflake employs a global query cache which can have a dramatic impact on costs. We plan to cover these features in another post.

Using Query Tags

The method above gets cost per query, per warehouse, per user, and so forth, but often we are interested in tracking per-unit costs in Snowflake. Per-unit costs are the costs for an entire unit of the business - like all of the queries run by the production backend service, or all queries run by a data science team, or all the queries run for a consulting client.

Query tags are a nifty feature for tracking per-unit costs in Snowflake. Just like in AWS, tags are useful for mapping business costs to cloud spend. Snowflake says:

Tags enable data stewards to track sensitive data for compliance, discovery, protection, and resource usage use cases through either a centralized or decentralized data governance management approach.

Resource usage is our use case for query tags here. In any query session, i.e. writing queries in the Snowflake console or using the Python connector, it is possible to set a query_tag for the session.

ALTER SESSION SET
    QUERY_TAG = 'Monthly Close Books Process'
;

Then, all queries run subsequently in that session will be tagged with “Monthly Close Books Process”. Tags can also be set by user and by account.

ALTER USER vantage SET
    QUERY_TAG = 'Cloud Costs'
;

Setting the query tag at the session level is the most granular tagging possible and overrides any tags set at the user or account level.

Grouping by Query Tag in the [console](https://console.vantage.sh/signup).

Grouping by Query Tag in the console.

Query tags appear in the same QUERY_HISTORY table we used for cost per query, so tags can be used as another dimension to group costs by, along with organization, user, and warehouse.

Conclusion

The advent of a near-infinite scaling, cloud data warehouse has given organizations more control over how they consume their data. With Snowflake, using the platform maps more tightly to a company’s natural business processes than an on-prem data warehouse. With the right tables, engineering teams can drill down to an individual query to track the cost of using data for the business.

For more granular tracking, Snowflake has query tags, which are supported in the Snowflake integration with Vantage. Whichever method and tools you choose, tracking cost by query in Snowflake is a data engineering efficiency practice that is also great for the bottom line.