Finding Expensive Queries in Snowflake by Calculating Cost per Query

by Vantage Team


Originally published August 5, 2022.

Snowflake is famous for inventing the cloud data warehouse, a scalable “Data Cloud” with usage-based pricing. Users only pay when queries are running or data is moving. A classic example of the benefits of pay-as-you-go pricing comes from the company itself:

We have one customer that, every Friday afternoon, spins up a massive number of virtual warehouses to do their processing as a billing company. They spin up thousands of nodes, it runs for several hours, and then they spin them back down again.

The trade-off with usage-based pricing is that costs vary, even when overall spend is lower. In Snowflake’s case, a rogue query can rack up the bill, and every query is a candidate for optimization to lower costs.

By pulling together multiple tables and information about Snowflake’s pricing model, we can calculate cost per query. For advanced users, we can also employ query tags to get the most granular cost tracking available for the Data Cloud.

Querying for Snowflake Query Cost

Some math is needed to find out how much each query costs in Snowflake. As described in the Vantage “Costs per Query” documentation:

Snowflake Compute is billed only when a query is running, rounded up to the hour; however, if multiple queries are running simultaneously, the cost is still only reported for that hour.

Therefore, we have to combine query execution time, warehouse size, and some other details to get the cost of each query. Snowflake offers an example query that ties together these metrics:

SET credit_price = 3.00;

SELECT QH.query_id,
       ( QH.execution_time / (1000 * 3600) ) * WS.credits * ( $credit_price )
         AS relative_performance_cost
FROM   snowflake.account_usage.query_history QH
       -- WAREHOUSE_SIZE is a helper mapping of sizes to credit rates (below)
       JOIN warehouse_size WS
         ON WS.wh_size = QH.warehouse_size;

This query pulls from the QUERY_HISTORY view, part of the ACCOUNT_USAGE schema. WAREHOUSE_SIZE is a helper table we generate ourselves to map each warehouse size to its credit rate. Compute is billed per credit (e.g., $3 per credit), so a query’s cost depends on how many credits its warehouse size consumes:

CREATE OR REPLACE TEMPORARY TABLE warehouse_size AS
      SELECT 'XSMALL'  AS wh_size, 1   AS credits UNION ALL
      SELECT 'SMALL'   AS wh_size, 2   AS credits UNION ALL
      SELECT 'MEDIUM'  AS wh_size, 4   AS credits UNION ALL
      SELECT 'LARGE'   AS wh_size, 8   AS credits UNION ALL
      SELECT 'XLARGE'  AS wh_size, 16  AS credits UNION ALL
      SELECT '2XLARGE' AS wh_size, 32  AS credits UNION ALL
      SELECT '3XLARGE' AS wh_size, 64  AS credits UNION ALL
      SELECT '4XLARGE' AS wh_size, 128 AS credits;

These credits are mapped to warehouse sizes per hour. For example, the XLARGE warehouse consumes 16 credits for a full hour of usage. By combining the EXECUTION_TIME of each query with the warehouse size it ran on, we can get the cost in dollars of each query.
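As a sanity check, the arithmetic can be reproduced outside the warehouse. A minimal sketch in Python, mirroring the SQL above (the 90-second runtime and $3 credit price here are illustrative):

```python
def query_cost(execution_time_ms: float,
               warehouse_credits_per_hour: float,
               credit_price_usd: float = 3.00) -> float:
    """Dollar cost of a query: hours of runtime x credits/hour x $/credit.

    EXECUTION_TIME in QUERY_HISTORY is reported in milliseconds,
    hence the division by (1000 * 3600) to convert to hours.
    """
    hours = execution_time_ms / (1000 * 3600)
    return hours * warehouse_credits_per_hour * credit_price_usd

# A 90-second query on an XLARGE warehouse (16 credits/hour) at $3/credit:
# 0.025 h x 16 credits/h x $3/credit, roughly $1.20
cost = query_cost(90_000, 16)
```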

One more detail is needed to rank Snowflake queries by cost: many queries share the same structure but differ only in their WHERE-clause literals or column order, so they should be grouped together before ranking.
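One way to group such near-duplicate queries is to normalize literals out of the query text before aggregating. This is a rough sketch, not a full SQL parser; newer Snowflake releases expose a parameterized query hash that serves the same purpose:

```python
import re

def normalize_query(sql: str) -> str:
    """Collapse literals so structurally identical queries group together."""
    s = re.sub(r"'[^']*'", "?", sql)        # string literals -> ?
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)  # numeric literals -> ?
    return re.sub(r"\s+", " ", s).strip().lower()

a = normalize_query("SELECT * FROM orders WHERE id = 42")
b = normalize_query("SELECT * FROM orders  WHERE id = 7")
assert a == b  # both become: select * from orders where id = ?
```

Grouping per-query costs by this normalized text (or a hash of it) surfaces which query *pattern* is most expensive, rather than which single execution.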

Note: Startup Times and Caching

When a warehouse first spins up, you are charged a minimum of one minute’s worth of credits, even if the query runs for only 30 seconds. Opendoor engineering accounted for this in the query they built to track costs, as does the Vantage console. Additionally, Snowflake employs a global result cache, which can have a dramatic impact on costs. These features are covered in this deep dive on Snowflake Compute costs.

Using Query Tags

The method above breaks down cost per query, per warehouse, and per user, but often we want to track per-unit costs in Snowflake: the costs for an entire unit of the business, such as all the queries run by the production backend service, by a data science team, or for a consulting client.

Query tags are a nifty feature for tracking per-unit costs in Snowflake. Just like in AWS, tags are useful for mapping business costs to cloud spend. Snowflake says:

Tags enable data stewards to monitor sensitive data for compliance, discovery, protection, and resource usage use cases through either a centralized or decentralized data governance management approach.

Resource usage is our use case for query tags here. In any query session—for example, writing queries in the Snowflake console or using the Python connector—it is possible to set a QUERY_TAG for the session.

ALTER SESSION SET
    QUERY_TAG = 'Monthly Close Books Process'
;

Then, all queries subsequently run in that session will be tagged with Monthly Close Books Process. Tags can also be set by user and by account.
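When queries are issued programmatically, the same tag can be set from the Python connector. `session_parameters` is a documented `snowflake-connector-python` connection option; the credentials below are placeholders:

```python
# Connection settings for snowflake-connector-python; session_parameters
# applies QUERY_TAG to every query run over the resulting connection.
# Account, user, and password values are placeholders.
connect_kwargs = {
    "account": "my_account",
    "user": "vantage",
    "password": "***",
    "session_parameters": {"QUERY_TAG": "Monthly Close Books Process"},
}

# import snowflake.connector
# conn = snowflake.connector.connect(**connect_kwargs)
# conn.cursor().execute("SELECT 1")  # recorded with the tag above
```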

ALTER USER vantage SET
    QUERY_TAG = 'Cloud Costs'
;

Setting the query tag at the session level is the most granular tagging possible and overrides any tags set at the user or account level.

Grouping by Query Tag in the [Vantage console](https://console.vantage.sh/signup).


Query tags appear in the same QUERY_HISTORY view we used for cost per query, so tags become another dimension for grouping costs, alongside organization, user, and warehouse.
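Once tags flow into QUERY_HISTORY, per-unit reporting is a group-by away. A sketch of the aggregation over rows already fetched from the warehouse (the tag/cost pairs here are made-up sample data):

```python
from collections import defaultdict

# (query_tag, cost_usd) rows, as they might come back from QUERY_HISTORY
# joined with the per-query cost calculation; values are illustrative.
rows = [
    ("Monthly Close Books Process", 1.20),
    ("Cloud Costs", 0.40),
    ("Monthly Close Books Process", 0.80),
]

cost_by_tag = defaultdict(float)
for tag, cost in rows:
    cost_by_tag[tag] += cost

# cost_by_tag now holds total spend per business unit:
# {'Monthly Close Books Process': 2.0, 'Cloud Costs': 0.4}
```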

Conclusion

The advent of a near-infinitely scalable cloud data warehouse has given organizations more control over how they consume their data. With Snowflake, platform usage maps more tightly to a company’s natural business processes than it does with an on-premises data warehouse. With the right tables, engineering teams can drill down to an individual query to track what using data costs the business.

For more granular tracking, Snowflake has query tags, which are supported in the Snowflake integration with Vantage. Whichever method and tools you choose, tracking cost per query in Snowflake is a data engineering efficiency practice that is also great for the bottom line.