Our production system works as expected and maintains close to 24/7/365 availability.
However, in the last month I’ve begun to see a small fraction of users (perhaps 0.1-0.3%) who get a persistent HTTP 502: Bad Gateway error when attempting to access our application, on any of its public endpoints. This starts happening at times when we are not deploying (perhaps also when we do deploy - unclear), and (probably) only after the user has already been using our system for a while.
Using Chrome’s Incognito mode on the same computer and browser solves this, as does clearing the site data via Chrome DevTools.
Looking in the CloudWatch Monitoring Details for the ALB, I see that indeed we’ve been getting sporadic 502s here and there.
I suspect the issue might be related to the AWSALB sticky-session cookie set in the user’s browser.
My questions:
1. How can I see what specifically caused the 502 errors? Are there ALB error logs stored by Convox that we can look at, or do we need to enable logging somewhere?
2. Any thoughts on how to approach investigating this problem?
There is indeed a known issue causing HTTP 502 errors in some versions of node.js (our web server) that surfaced about a year ago. Details here. However, our version of node.js was already patched and did not have this problem.
Anyways, we had another 502 a few hours ago.
I discovered that Convox (or was it me?) created the main ALB with access logging enabled. I found the logs in the S3 bucket (by the way - the location seems wrong/buggy, as I have both the rack and app access logs in the same folder…) and got to investigating: I downloaded the logs with the aws CLI, gunzipped them, and put everything into a nice CSV, where I could see that the 502 responses had a response_processing_time of -1.
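Something along these lines is enough to flatten the entries for a first look (a minimal sketch, assuming the gunzipped log files sit in a local ./logs folder and that the field positions follow the ALB access log format documented by AWS):

```typescript
// parse-alb-logs.ts - rough sketch for flattening ALB access logs into a CSV.
// Assumes the gunzipped .log files are in ./logs; field order follows the AWS docs.
import { readdirSync, readFileSync } from "fs";
import { join } from "path";

// Split a log line on spaces while keeping quoted segments (e.g. the request line) intact.
const tokenize = (line: string): string[] => line.match(/"[^"]*"|\S+/g) ?? [];

const rows = ["time,client,target,response_processing_time,elb_status,target_status,request"];

for (const file of readdirSync("./logs").filter((f) => f.endsWith(".log"))) {
  for (const line of readFileSync(join("./logs", file), "utf8").split("\n")) {
    if (!line.trim()) continue;
    const f = tokenize(line);
    // 0:type 1:time 2:elb 3:client 4:target 5:req_proc 6:target_proc 7:resp_proc 8:elb_status 9:target_status ... 12:"request"
    if (f[8] === "502") {
      rows.push([f[1], f[3], f[4], f[7], f[8], f[9], f[12]].join(","));
    }
  }
}

console.log(rows.join("\n"));
```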
Now, based on this and on the ALB access log documentation, it seems that response_processing_time is set to -1 when the load balancer can’t send the request to the target. This can happen if the target closes the connection before the idle timeout elapses, or if the client sends a malformed request.
I looked at the Target Group metrics in AWS and could not see any 502s, 500s, or errors of any kind during the window in which we last saw the 502 - not even Backend Connection Errors.
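A quick way to double-check that the 502s are generated by the ALB itself rather than by the targets is to compare the ELB-level and target-level 5xx metrics for the same window. A sketch using the AWS SDK v3 (the region and the load balancer dimension value are placeholders):

```typescript
// compare-502s.ts - sketch: compare ALB-generated 502s with target 5xx for the same period.
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" }); // region is an assumption

// Dimension value is the part of the ALB ARN after "loadbalancer/", e.g. "app/my-alb/1234567890abcdef".
const loadBalancer = "app/my-alb/1234567890abcdef"; // placeholder

const sum = async (metricName: string): Promise<number> => {
  const res = await cw.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/ApplicationELB",
      MetricName: metricName,
      Dimensions: [{ Name: "LoadBalancer", Value: loadBalancer }],
      StartTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
      EndTime: new Date(),
      Period: 300,
      Statistics: ["Sum"],
    })
  );
  return (res.Datapoints ?? []).reduce((acc, dp) => acc + (dp.Sum ?? 0), 0);
};

(async () => {
  console.log("ELB-generated 502s:", await sum("HTTPCode_ELB_502_Count"));
  console.log("Target 5xx:        ", await sum("HTTPCode_Target_5XX_Count"));
})();
```

If HTTPCode_ELB_502_Count is non-zero while HTTPCode_Target_5XX_Count stays at zero, the 502s never came from the microservice.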
I don’t think the issue is with the target microservice: it works consistently for almost everyone, while the 502 happens persistently to the affected users - unless they clear their cookies/app storage.
So the issue is either a malformed client request (which doesn’t make much sense - some of the requests that receive a 502 are just fetching an icon file… but it’s possible) or a connection idle timeout.
Connection idle timeout - as AWS recommends, I will try increasing the ALB idle timeout and also verify that HTTP keep-alive is enabled on our web servers, with a keep-alive timeout longer than the ALB idle timeout. That said, this doesn’t make much sense to me… the 502s are persistent and immediate for the affected clients, and are solved by deleting cookies.
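For our node.js web servers that check would look roughly like this (a sketch only; the 65-second value assumes the ALB idle timeout is left at its 60-second default):

```typescript
// server.ts - sketch of keep-alive settings for a node.js server running behind an ALB.
import http from "http";

const server = http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("ok");
});

// Keep idle connections open longer than the ALB idle timeout (60s by default),
// so the target never closes a connection the ALB still considers reusable.
server.keepAliveTimeout = 65_000;
// headersTimeout should be greater than keepAliveTimeout on recent node.js versions.
server.headersTimeout = 66_000;

server.listen(3000);
```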
I found this post from 6 months ago describing similar issues.
Hence, it makes more sense that the client request is somehow malformed. My guess: potentially a cookie that is too large, or too many values in the Cookie header, something of that sort. I think the ALB rejects the request itself; I’m just not sure why 502 is the status code it chooses.
We will keep researching and I’ll update here, but it seems the issue is most likely related to our client-side behavior rather than to the backend or the ALB itself.
I can confirm the root cause: the AWS ALB returns an HTTP 502 Bad Gateway error when the request headers are too large, instead of the expected 431 Request Header Fields Too Large.
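This is easy to reproduce against any affected endpoint (a sketch; the URL and the 100 KB cookie size are placeholders, and the exact size at which the ALB gives up may differ, so adjust as needed):

```typescript
// repro-502.ts - sketch: send an oversized Cookie header and observe the ALB's response.
// Requires node.js 18+ for the global fetch; the URL below is a placeholder.
const url = "https://app.example.com/favicon.ico";
const hugeCookie = "bloat=" + "x".repeat(100 * 1024); // ~100 KB cookie value

const main = async () => {
  const res = await fetch(url, { headers: { Cookie: hugeCookie } });
  // With normal-sized cookies this returns 200/304; past the ALB's header limit it comes back as 502.
  console.log(res.status, res.statusText);
};

main();
```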
Luckily, we were already in the process of minimizing our usage of request headers to the bare minimum as part of moving away from a PoC, so the solution for us is clear both in the short term and in the long term.
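Until that cleanup is fully rolled out, one way to spot clients drifting toward the limit is to log the total header size on the server side (a sketch for a plain node.js http server; the 8 KB alert threshold is an arbitrary assumption, not the ALB’s actual limit):

```typescript
// header-size-check.ts - sketch: warn when incoming request headers grow suspiciously large.
import http from "http";

const WARN_BYTES = 8 * 1024; // arbitrary threshold, not the ALB's documented limit

const server = http.createServer((req, res) => {
  // rawHeaders alternates name/value entries; summing their lengths approximates the size on the wire.
  const headerBytes = req.rawHeaders.reduce((acc, part) => acc + part.length + 2, 0);
  if (headerBytes > WARN_BYTES) {
    console.warn(`large request headers (${headerBytes} bytes) from ${req.socket.remoteAddress} for ${req.url}`);
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("ok");
});

server.listen(3000);
```

Note that requests the ALB already rejects never reach the target, so this only catches clients approaching the limit, not the ones already hitting the 502.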