AWS Lambda is Not a Magic Reliability Wand

A 100-line Lambda function runs fine for months, then goes down for two hours, and finally recovers on its own. Cost savings or reliability: pick one.

I recently got an email alert that a certain Lambda had an elevated error rate. The key error message in the logs was “getaddrinfo EMFILE”. No, that is not a DNS server failure. It is Node.js saying the process can’t allocate any more file descriptors and is stuck. Meanwhile, API Gateway was returning HTTP 502 errors for all requests.

We looked at the source code but found nothing obvious (disclaimer: I don’t write Node.js). The code has minimal dependencies. It creates a MySQL connection during each invocation, and there are no global variables referencing it, so presumably garbage collection should eventually close the socket…?

After two hours, the problem just went away. My guess is the Lambda container was recycled.

The next day I decided to do more testing and added this instrumentation code to the main function (of course _getActiveHandles is undocumented):

let handles = process._getActiveHandles();
console.log("HANDLE COUNT: " + handles.length + "\n");
console.log("HANDLES\n" + JSON.stringify(handles, null, 2));

And sure enough, when calling the Lambda in a loop, the handle count climbs past 900 (no, sadly it didn’t get over 9000), after which every invocation fails with FunctionError: Unhandled. The Lambda file descriptor limit is 1024, so this makes sense.

The mysql2 and mysql docs for Node.js had no example of how to ensure file descriptors are closed in an exception-safe way in code that uses await. We added a try/finally that explicitly closes the database connection, and that fixed the leak.

let conn = await MysqlDb.connect();
try {
  await do_queries_with_connection(conn);
} finally {
  // Without this, sockets are leaked
  await conn.end(); // assumed close method, named after mysql2's connection.end()
}
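The same try/finally shape can be wrapped in a small helper so that no call site can forget the cleanup. This is only a sketch: withConnection is a hypothetical name, and it assumes the connection object exposes an async end() method, as connections created by mysql2/promise do.

```javascript
// Runs `work` with a freshly opened connection and always closes the
// connection afterwards, whether `work` resolves or throws.
// Assumptions: `connect` is any async factory, and the object it resolves to
// has an async end() method (modelled on mysql2/promise connections).
async function withConnection(connect, work) {
  const conn = await connect();
  try {
    return await work(conn);
  } finally {
    // Runs on success and on failure, so the socket is released either way.
    await conn.end();
  }
}

// Possible use inside a handler, assuming mysql2 is installed and `config`
// holds the connection settings:
//
//   const mysql = require("mysql2/promise");
//   const [rows] = await withConnection(
//     () => mysql.createConnection(config),
//     (conn) => conn.query("SELECT 1")
//   );
```

One design caveat: if end() itself throws, that error replaces whatever error the queries raised, so it may be worth logging the original error first.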

I have learned to be very wary of “connection pools” and “caches” when making reliable services. These add hard-to-test, timing-dependent edge cases. Connection caching causes problems with load balancing (load does not shift quickly to the least loaded servers) and DNS failover (the TTL is not honored). I have seen downtime caused by a popular open-source connection pool getting stuck on a weird TLS error its developers had never encountered. In contrast, I admire the Route 53 design concept of “constant work”, which is the opposite of caching. I have learned that “premature optimization is evil”.

But the Lambda docs recommend connection pooling and caching, and don’t point out the drawbacks. Lambda itself caches your warm containers. Sure, it improves performance and reduces cost. But there is always a cost somewhere — in this case, a big reliability and testing cost. How many of you test that a warm Lambda succeeds after 1024 invocations, or that it gracefully handles a database failover?
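One way to approximate the warm-container test locally is to call the handler many times in a single Node.js process and compare the active handle count before and after. The sketch below is an illustration, not a Lambda feature: handleGrowth, settle, and makeHandler are hypothetical names, makeHandler is a stand-in that opens a listening socket instead of a MySQL connection, and the count relies on the undocumented process._getActiveHandles().

```javascript
const net = require("net");

// Closing a handle is asynchronous, so wait two turns of the event loop
// before counting, to let recently closed handles finish closing.
async function settle() {
  for (let turn = 0; turn < 2; turn++) {
    await new Promise((resolve) => setImmediate(resolve));
  }
}

// Calls `handler` repeatedly in one process, the way a warm container would,
// and returns how many active handles were gained along the way.
async function handleGrowth(handler, invocations) {
  await settle();
  const before = process._getActiveHandles().length; // undocumented API
  for (let i = 0; i < invocations; i++) {
    await handler({ invocation: i });
  }
  await settle();
  return process._getActiveHandles().length - before;
}

// Hypothetical stand-in for a real handler: opens one listening socket per
// call. With `close` set to false it never closes the socket, which mimics
// the unclosed database connection.
function makeHandler(close, leaked = []) {
  return async () => {
    const server = net.createServer();
    await new Promise((resolve, reject) => {
      server.once("error", reject);
      server.listen(0, "127.0.0.1", resolve);
    });
    if (close) {
      await new Promise((resolve) => server.close(resolve));
    } else {
      leaked.push(server); // kept only so a caller can clean up afterwards
    }
  };
}
```

With a real handler, an invocation count above 1024 is what makes the check meaningful against the Lambda limit. A growth that scales with the number of invocations is the signal to look for.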

So writing “serverless” Lambda code is, sadly, just like any other “serverful” programming you have done: you have to carefully ensure all your file descriptors are closed after every request, which even garbage-collected languages struggle with, or ensure you have a connection pool that is reliable. Neither option is trivial.

“Adjusting to the requirement for perfection is, I think, the most difficult part of learning to program.” (Fred Brooks, The Mythical Man-Month)

The recent RDS Proxy service acknowledges this problem:

“With RDS Proxy, you no longer need code that handles cleaning up idle connections and managing connection pools. Your function code is cleaner, simpler, and easier to maintain.”

I can attest that it is indeed simpler, but only for languages that dispose of sockets sanely… I wish more languages used RAII or reference-counted GC to force immediate cleanup, because a language should serve us and not be a source of constant foot-guns.

Ironically, we were using provisioned concurrency on this Lambda — we were running it like a “serverful” instance (with higher cost) but had no way to SSH in and debug it when it was hung. Be extra careful when running in this mode, because your container is even less likely to be recycled, and ask yourself why you’re not just using ECS or EC2.

Perhaps Lambda needs a container-level shallow health check, just like we have for EC2 and ECS. This could check if the file descriptors or memory usage were >50% used, and if so, force a container recycle. Because if it walks like a server, quacks like a server, and hangs like a server…
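As a rough illustration of what such a check could look at, here is a sketch that counts the process's open file descriptors from inside the function. It assumes a Linux environment where /proc/self/fd is readable. countOpenFds and shouldRecycle are hypothetical names, and the 1024 limit and 50% threshold are just the numbers mentioned in this post. How to actually force a recycle is deliberately left out, since Lambda offers no such hook that I know of.

```javascript
const fs = require("fs");

// Counts this process's open file descriptors by listing /proc/self/fd.
// Linux only: returns null where /proc is unavailable (for example macOS).
// The listing itself briefly opens a descriptor, so the count can be one high.
function countOpenFds() {
  try {
    return fs.readdirSync("/proc/self/fd").length;
  } catch {
    return null;
  }
}

// Pure decision function: true when more than `threshold` of `limit` is used.
// An unknown count (null) never triggers a recycle.
function shouldRecycle(openFds, limit = 1024, threshold = 0.5) {
  return openFds !== null && openFds / limit > threshold;
}

// Possible use at the start of each invocation:
//
//   const openFds = countOpenFds();
//   if (shouldRecycle(openFds)) {
//     console.warn("file descriptor usage is high: " + openFds);
//   }
```

Keeping the decision in a pure function makes it easy to unit test without having to exhaust real file descriptors.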

Update 2020-Oct: Lambda Insights (in preview) supports logging file descriptor usage counts. I’m glad AWS realizes this is important!
