Distributed SQL query engines like Trino and its commercial versions, such as Amazon Athena, excel at running fast interactive queries on large amounts of data. The problem: these queries can drive a spike in storage system costs, particularly with Amazon S3. Fortunately, there are tools and strategies you can use to speed up queries and cut costs.

Recap of Distributed Query Engines, Trino, and Amazon Athena

Query engines like Trino run interactive analytic queries against various data sources. These sources can include storage systems, such as S3 and Google Cloud Storage; databases, like MySQL, MongoDB, and Apache Cassandra; data streams, like Kafka and Amazon Kinesis; and more.

Trino is a powerful tool because of its advanced query execution and optimization, and because it connects to many data sources and data formats in one place. Trino has been widely adopted and is available in commercial versions, such as Athena. Athena is particularly useful because of its easy integration with other Amazon services, such as S3 and AWS Glue.

Why Distributed Query Engines Lead to Increased S3 Costs

Pricing for these query engines tends to be straightforward. For example, Athena charges $5.00 per TB of data scanned. However, it’s important to keep less obvious costs in mind. A major one is that when querying data in S3, you are also charged for S3 GET requests. S3 Standard GET requests are cheap at $0.0004 per 1,000 requests, yet with single queries that scan hundreds of thousands of objects, the cost adds up quickly. In some cases, customers using these systems end up with S3 request costs exceeding their S3 storage costs.
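To put that in perspective, here is a rough back-of-the-envelope sketch in Python. The prices are the list prices mentioned above; the object count, GETs per object, and query volume are purely illustrative assumptions.

```python
# Back-of-the-envelope estimate of the hidden S3 GET-request cost of a query.
# Assumptions: Athena's $5.00/TB scan price and S3 Standard GET pricing;
# the object count, GETs per object, and query volume are illustrative.
SCAN_PRICE_PER_TB = 5.00        # Athena: $5.00 per TB scanned
GET_PRICE_PER_1000 = 0.0004     # S3 Standard: $0.0004 per 1,000 GET requests

def query_cost(tb_scanned: float, objects_read: int, gets_per_object: int = 1):
    """Return (scan cost, GET-request cost) in dollars for one query."""
    scan_cost = tb_scanned * SCAN_PRICE_PER_TB
    get_cost = objects_read * gets_per_object * GET_PRICE_PER_1000 / 1000
    return scan_cost, get_cost

# Example: one query scanning 1 TB spread across 500,000 small objects.
scan, gets = query_cost(tb_scanned=1.0, objects_read=500_000)
print(scan, gets)            # 5.0 0.2 -> about 20 cents of GET requests per query
print(gets * 1_000 * 30)     # ~$6,000/month at 1,000 such queries per day
```

Real engines often issue more than one GET per object (for example, to read Parquet footers and individual column chunks), so gets_per_object is usually greater than one in practice.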

Apache Hive

These high S3 costs largely stem from a mismatch between how the data is laid out in storage and how query engines process it. Athena creates its default tables using the Apache Hive table format. If you’re not familiar with Hive: it’s a SQL layer on top of Hadoop that consists of a query engine and a Metastore. Its querying capabilities were once widely used but have largely been replaced by newer technologies (like Trino!).

The Metastore side of Hive (which AWS Glue’s Data Catalog is modeled on) is still used to manage metadata about tables, columns, partitions, and their corresponding locations. However, while the Hive table format was sufficient in the past, it lacks certain features and optimizations, and Apache Iceberg is a faster and more cost-effective alternative (see the next section for more information).

Optimizing S3 Costs with Apache Iceberg & Caching

You can save on S3 costs through strategies such as choosing the right table format and caching.

Apache Iceberg

Iceberg was designed to improve on the Hive table format by adding update and delete capabilities to data lakes, something that was previously only possible with data warehouses. What many people don’t realize is that adopting the Iceberg format can also help minimize those hidden S3 costs.

You can configure Iceberg to write larger files. With larger files, each partition contains fewer objects, so a query touches fewer objects and issues fewer S3 GET requests. A blog post by Deniz Parmaksız, a Senior Machine Learning Engineer at Insider, explains how their team used Iceberg to reduce costs by 90%.
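As a concrete illustration, Iceberg exposes the write.target-file-size-bytes table property, which controls how large newly written data files are. The sketch below sets it with PyIceberg; the catalog name, table name, and 1 GiB target are illustrative assumptions, and the same property can be set through Spark or Trino DDL instead.

```python
from pyiceberg.catalog import load_catalog

# Illustrative sketch: raise Iceberg's target data file size so each partition
# is written as fewer, larger objects. Catalog and table names are assumptions.
catalog = load_catalog("glue")                    # a catalog configured for your environment
table = catalog.load_table("analytics.events")    # hypothetical table

with table.transaction() as tx:
    # write.target-file-size-bytes is a standard Iceberg table property
    # (default ~512 MiB); here we target ~1 GiB files.
    tx.set_properties(**{"write.target-file-size-bytes": str(1024 * 1024 * 1024)})
```

A larger target only affects newly written files; to shrink the object count for data already in the table, you also need to compact existing small files (for example, with your engine’s compaction or OPTIMIZE procedure).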

Reduce GET Requests with a Caching Strategy

Most applications that access data stored in S3 have a common set of objects that are queried repeatedly. If you see a workload sending repeated GET requests for the same objects, it is best practice to implement a caching strategy with technologies like Amazon CloudFront or Amazon ElastiCache. A successful caching strategy not only lowers request latency but also reduces the number of requests sent directly to S3, reducing costs.
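To illustrate the pattern, here is a minimal read-through cache sketch for an application that reads S3 objects directly. It assumes a reachable Redis endpoint (for example, an ElastiCache node); the host name, bucket, and TTL are illustrative.

```python
import boto3
import redis

# Minimal read-through cache: check Redis before falling back to S3.
# The Redis host (e.g., an ElastiCache endpoint) and TTL are assumptions.
s3 = boto3.client("s3")
cache = redis.Redis(host="my-cache.example.internal", port=6379)

def cached_get(bucket: str, key: str, ttl_seconds: int = 3600) -> bytes:
    """Return an S3 object's bytes, serving repeat reads from the cache."""
    cache_key = f"{bucket}/{key}"
    hit = cache.get(cache_key)
    if hit is not None:
        return hit                                  # cache hit: no S3 GET charged
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    cache.setex(cache_key, ttl_seconds, body)       # cache with an expiry
    return body
```

When the GET requests are issued by the query engine itself rather than your own application code, the caching has to happen at the engine layer instead, which is where the approach discussed below comes in.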

It is important to note that an effective caching strategy can be difficult to implement if the repetitive queries are hard to predict. This is especially relevant when you’re dealing with multi-dimensional data that can be filtered in many different ways: think fraud detection, telemetry data, or log analysis. In these cases, you’re often joining the core data with “helper” tables for relevant information like product names, locations, and so on. So the best practice is to cache those helper tables, since they are accessed so frequently (see the sketch below).
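For instance, when your own pipeline does the enrichment, a small helper table can be loaded once per process and reused in memory instead of being re-read from S3 on every run. The paths and join keys below are hypothetical, and pandas needs s3fs/pyarrow installed to read s3:// URLs.

```python
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=None)
def helper_table(path: str) -> pd.DataFrame:
    """Load a small 'helper' (dimension) table once; later calls reuse the cached copy."""
    return pd.read_parquet(path)

def enrich(events: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper tables; each is fetched from S3 only on the first call.
    products = helper_table("s3://my-bucket/dims/products.parquet")
    locations = helper_table("s3://my-bucket/dims/locations.parquet")
    return events.merge(products, on="product_id").merge(locations, on="location_id")
```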

In these cases, we recommend using technology like Starburst’s Warp Speed data lakehouse engine, which dynamically and autonomously updates the cache based on analytical workload patterns. This means you don’t have to spend time manually implementing and monitoring your cache. Starburst customers like Junglee Games have seen Warp Speed reduce their S3 request costs by 70% and speed up subsequent reports by 30%.

Conclusion

Since the early days of data management, the landscape has undergone significant changes, with older technologies constantly being replaced by newer and improved ones. Trino represents the new generation of powerful query engines. However, it can still be inefficient in certain respects, such as driving up S3 costs. Newer technologies and strategies, such as adopting Iceberg as your table format instead of Hive and indexing and caching with tools like Starburst Warp Speed, can decrease S3 costs and improve query performance, leading to more efficient and cost-effective data analytics workflows.