Content • Oct 25, 2024

Recursive AWS Lambda Horror Stories and How to Avoid Them

by Emily Dunenfeld

Infinite recursion in Lambda can cause costs to spiral out of control. Following best practices can help you avoid getting caught up in your own Lambda horror story.

There’s no shortage of cloud horror stories where costs spiral out of control; common AWS ones include forgetting to turn off an expensive EC2 instance, publishing your security keys, and, of course, infinite recursion with AWS Lambda. Infinite recursion in Lambda is particularly gruesome because functions are invoked almost instantly and scale out rapidly, so a loop can multiply invocations, and costs, far faster than the steady growth you see with most other services. While there’s no way to fully eliminate this risk, following best practices can help you avoid getting caught up in your own Lambda horror story.

Causes and Examples of Infinite Recursive Loops in Lambda

If you research AWS Lambda, you don’t have to scroll far to come across dozens of AWS users who accidentally triggered infinite recursion with Lambda. Sometimes, by the time they log back in after the weekend, the bill exceeds what they were expecting to pay for the entire year. Often it’s students or those new to AWS, but no group is immune: even seasoned senior developers slip up and make these mistakes, too.

Lambda Architectural Design

A common cause is architectural design, like when Lambda gets triggered by a service and then writes to the same service, either directly or indirectly. This creates a feedback loop where the Lambda function continuously triggers itself. An example for S3 is when Lambda writes to an S3 bucket, that S3 bucket invokes the Lambda function, Lambda writes to S3 again, and behold—an infinite loop. This can happen with other AWS services as well, such as SNS, SQS, and DynamoDB, where Lambda’s interaction with these services triggers recursive invocations. As we will go over later, AWS’s built-in recursive loop detection will break some of these infinite loops, but it doesn’t cover all possible scenarios.

Lambda Retry Mechanisms

Another common cause stems from retry mechanisms—whether manually coded or due to external services. An example of a manually coded error comes from a blog detailing how a single function call unexpectedly cost them hundreds of dollars. The cause was writing a retry directly in the code that was called every time Lambda was invoked, which caused the Lambda function to keep being invoked.

A different user debugged a retry error caused by external services. Their Lambda was being triggered by a Telegram webhook, but API Gateway’s independent 30-second timeout was returning a 504 error before Lambda could complete. Telegram then resent the request since it didn’t receive a 200 OK response, and an infinite loop began. Fortunately, in this case, the Lambda would sometimes complete before the timeout, so it wasn’t truly infinite, though that did make it more difficult for the user to debug.

Refunds and Prevention

Understanding these common patterns of failure can help developers avoid costly mistakes in their serverless architectures. If you’re reading this because you’re currently dealing with a Lambda horror story of your own, don’t worry: in many of the cloud-cost horror stories we’ve seen, AWS either fully or partially reimbursed the charges, especially for first-time offenders. Note, though, that relying on refunds is risky and not a substitute for good architectural practices. AWS offers no guarantees of refunds, and the process can take time.

How to Avoid Infinite Recursive Loops in Lambda

While there’s no guarantee of never encountering an infinite loop, and AWS doesn’t offer an automatic shut-off option based on costs, there are best practices that can reduce the risk and mitigate the damage.

Awareness, Researching Lambda Best Practices, and Testing

The first step is awareness—understanding infinite recursion is possible and how detrimental and costly it is will make you more cautious. Knowing common causes helps you know what to look out for and reduces the likelihood of you making the same mistakes as others. Then, before writing any Lambda code, research and plan your architecture carefully, including the services Lambda interacts with and potential risk areas.

For example, if your Lambda function interacts with other AWS services, a best practice is to ensure that the resource that triggers your Lambda differs from the resource it writes to, which prevents a feedback loop. In the case of S3-triggered Lambda functions, using separate S3 buckets for input and output would avoid recursive triggers. Or, in cases where you must use the same resource, AWS recommends using a positive trigger, like an S3 object trigger based on a naming convention or metadata tag that activates only on the initial invocation.
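
As a rough illustration of the naming-convention approach, the Python sketch below filters an S3 event so that only objects under an input prefix are processed, and maps each one to an output key under a different prefix. The prefixes `incoming/` and `processed/` and both helper names are hypothetical and chosen only for this example.

```python
from urllib.parse import unquote_plus

INPUT_PREFIX = "incoming/"    # hypothetical prefix for objects that should be processed
OUTPUT_PREFIX = "processed/"  # hypothetical prefix where this function writes results


def keys_to_process(event):
    """Return the object keys in an S3 event that sit under the input prefix."""
    keys = []
    for record in event.get("Records", []):
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(INPUT_PREFIX):
            keys.append(key)
    return keys


def output_key_for(key):
    """Map an input key to an output key outside the triggering prefix."""
    return OUTPUT_PREFIX + key[len(INPUT_PREFIX):]
```

A guard like this is best treated as a second line of defense. Scoping the S3 event notification itself to the input prefix keeps the function from being invoked for its own output in the first place.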

Similarly, in the case of Lambda retry mechanisms, the author of the blog where the retry was written directly in the code recommends not invoking Lambdas from other Lambdas when possible and using AWS Step Functions instead. Also, remember that Lambda automatically retries function errors twice for asynchronous invocations, so further retry mechanisms may not be needed. In the API Gateway example, the poster recommends skipping API Gateway entirely for their problem and using a Lambda function URL instead.
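
If the built-in retries for asynchronous invocations are themselves more than you want, they can be reduced. The sketch below is one way to do that with boto3's `put_function_event_invoke_config`; the helper name and the default values are assumptions for illustration, and the client is passed in so the function can be exercised without AWS credentials.

```python
def limit_async_retries(lambda_client, function_name,
                        max_retries=0, max_event_age_seconds=60):
    """Cap Lambda's built-in retries for asynchronous invocations.

    lambda_client is expected to be a boto3 Lambda client,
    for example boto3.client("lambda").
    """
    if not 0 <= max_retries <= 2:
        raise ValueError("Lambda accepts between 0 and 2 retry attempts")
    return lambda_client.put_function_event_invoke_config(
        FunctionName=function_name,
        MaximumRetryAttempts=max_retries,
        MaximumEventAgeInSeconds=max_event_age_seconds,
    )
```

With a real client, a call might look like `limit_async_retries(boto3.client("lambda"), "my-function")`, where `my-function` is a placeholder name.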

Most novice Lambda users wouldn’t inherently know the best practices specific to their implementation, which is why it is so important to research before implementing.

Additionally, thorough testing—both automated and manual—can catch many potential issues before they hit production. Simulate edge cases, test timeouts, and analyze how your Lambda behaves when invoked under different conditions. This will help ensure that you’re not creating a loop inadvertently.

Lambda Recursive Loop Detection

AWS has built-in recursive loop detection that detects and stops loops after roughly 16 cycles for loops involving S3, SQS, and SNS. However, it is not a catch-all. For example, when other AWS services, such as DynamoDB, are part of the loop, the loop may remain infinite. It also does not work with every runtime; custom runtimes are among those not covered.
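
For loops that the built-in detection does not cover, one option is to carry your own hop counter in the data the function writes and refuse to continue past a limit. The sketch below is an application-level guard, not an AWS feature; the `hop_count` field, the `MAX_HOPS` value, and the function names are all hypothetical.

```python
MAX_HOPS = 5  # hypothetical limit; pick one that fits the longest legitimate chain


class LoopLimitExceeded(Exception):
    """Raised when a message has passed through the function too many times."""


def next_message(incoming, payload):
    """Build the outgoing message, carrying forward an incremented hop count.

    incoming is the message (a dict) that triggered this invocation;
    a message without a hop_count is treated as the start of a chain.
    """
    hops = int(incoming.get("hop_count", 0)) + 1
    if hops > MAX_HOPS:
        # Stop here instead of writing another record that would re-trigger us.
        raise LoopLimitExceeded(f"hop_count {hops} exceeds limit of {MAX_HOPS}")
    return {"hop_count": hops, "payload": payload}
```

The function would write the returned message to the downstream resource, so every later invocation in the chain sees how many hops came before it.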

Reserved Concurrency in Lambda

Reserved concurrency in Lambda sets the maximum number of concurrent instances of a function that can run at any given time. By carefully estimating your function’s concurrency needs and setting a conservative limit, you can control resource consumption and prevent your functions from scaling out unchecked. However, it does not guarantee complete protection against runaway costs since recursive invocations within the set concurrency can still accumulate charges quickly.
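
A minimal sketch of setting that limit with boto3's `put_function_concurrency` follows. The helper name is an assumption, and the client is passed in as a parameter so the example does not need AWS credentials to run.

```python
def cap_concurrency(lambda_client, function_name, limit):
    """Set reserved concurrency for one function.

    lambda_client is expected to be a boto3 Lambda client,
    for example boto3.client("lambda").
    """
    if limit < 0:
        raise ValueError("limit must be zero or greater")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

Setting the limit to 0 throttles every invocation of the function, which can double as an emergency stop if a loop is already running.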

Setting Low Timeouts in Lambda

Setting low timeouts in Lambda doesn’t prevent recursive loops from happening, but it can limit the financial damage. When a function has a low timeout, it stops sooner if it’s stuck in an unexpected loop, reducing the duration (and thus cost) of each invocation.
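
To make the effect concrete, the arithmetic can be sketched as a worst-case cost per invocation, where the function runs until it times out. The two prices below are assumptions based on commonly quoted x86 on-demand rates and should be checked against current AWS pricing for your region; the function name is hypothetical.

```python
# Assumed rates for illustration only; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def max_cost_per_invocation(timeout_seconds, memory_mb):
    """Upper bound on the cost of one invocation that runs until its timeout."""
    gb_seconds = timeout_seconds * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


# Compare a 15-minute timeout with a 3-second timeout at 1024 MB.
long_timeout = max_cost_per_invocation(900, 1024)
short_timeout = max_cost_per_invocation(3, 1024)
```

Under these assumed rates, the 15-minute case is bounded at about 1.5 cents per invocation, while the 3-second case is bounded at a small fraction of a cent. The gap matters when a loop produces invocations in very large numbers.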

Lambda Monitoring and Alerting

Finally, after deploying your Lambda function, use tools like Amazon CloudWatch to continuously monitor metrics like invocation counts, error rates, and execution duration to ensure everything is working as expected. Setting an alarm on the concurrency metric can help catch runaway loops early, as recommended by experts who have seen the impact of unmonitored Lambdas on cloud bills.
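
One possible shape for such an alarm, using boto3's CloudWatch `put_metric_alarm`, is sketched below. The alarm name, the one-minute period, and the choice of the `Maximum` statistic are assumptions for illustration, and the SNS topic ARN used for notifications is supplied by the caller.

```python
def create_concurrency_alarm(cloudwatch_client, function_name,
                             threshold, sns_topic_arn):
    """Alarm when a function's concurrent executions exceed a threshold.

    cloudwatch_client is expected to be a boto3 CloudWatch client,
    for example boto3.client("cloudwatch").
    """
    return cloudwatch_client.put_metric_alarm(
        AlarmName=f"{function_name}-high-concurrency",  # hypothetical naming scheme
        Namespace="AWS/Lambda",
        MetricName="ConcurrentExecutions",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        Statistic="Maximum",
        Period=60,                # evaluate one-minute windows
        EvaluationPeriods=1,      # fire after a single breaching window
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",  # no invocations means no alarm
        AlarmActions=[sns_topic_arn],
    )
```

A threshold slightly above the function's normal peak gives early warning without firing on ordinary traffic.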

Additionally, set up cost anomaly alerts and budgets using AWS Cost Explorer or use Vantage for more granular visibility and automatic ML-powered anomaly alerts for your Lambda costs. These tools provide early warnings when costs exceed expected levels, enabling you to act quickly before costs get too out of hand.

Conclusion

Though recursive loop detection is expanding to cover more services, it does not catch everything, and infinite recursion in Lambda is still very much a costly possibility. While there’s no one built-in fail-safe to end infinite loops in Lambda, with proper awareness and best practices, you can significantly mitigate the risks.
