Some updates, might help others in the future.
There is indeed a known issue with HTTP 502 errors in some versions of node.js (our web server) that started about a year ago. Details here. However, our version of node.js was patched and did not have this problem.
Anyways, we had another 502 a few hours ago.
- I discovered that Convox (or was it me?) created the main ALB with access logging enabled. I found the logs in the S3 bucket (by the way - the location seems wrong/buggy, as I have both the rack and app access logs in the same folder…) and got to investigating. I downloaded the logs with the aws CLI tool, gunzipped them, and put it all into a nice CSV, where I could see some lines like:
2019-11-21T06:18:15.993187Z app/porky-Route-8A77P039SDT4/1ef08b5c0115f899 184.108.40.206:58111 10.0.3.111:33111 0.001 0.001 -1 502 - 5671 764 "GET https://node.internet.com:443/home HTTP/2.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-1:222006111111:targetgroup/porky-Balan-17Y0W7ML03NUR/9e0873f5a983eb5f "Root=1-5dd62c27-deaa40cc6413111111111111" "node.internet.com" "arn:aws:acm:us-east-1:222006111111:certificate/11111111-1111-1111-1111-65d96b80fa84" 20180 2019-11-21T06:18:15.990000Z "forward" "-" "-" "10.0.3.111:33111" "-"
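For reference, a minimal sketch of the tokenizer I'd use to split those access-log lines into CSV columns. ALB logs are space-separated, but quoted fields (the request, user agent, trace id, etc.) can contain spaces, so they have to be matched as a unit. The field indices below are an assumption based on the sample line above; logs that include the leading `type` field shift everything by one.

```javascript
// Split one ALB access-log line into fields. Quoted strings may contain
// spaces, so match a whole quoted token before falling back to bare tokens.
function parseAlbLogLine(line) {
  const tokens = line.match(/"[^"]*"|\S+/g) || [];
  return tokens.map((t) => t.replace(/^"|"$/g, ''));
}

// In the sample line above, response_processing_time is field 6 and the
// ELB status code is field 7 (indices assumed from that line's layout).
function is502WithNoResponseTime(line) {
  const f = parseAlbLogLine(line);
  return f[6] === '-1' && f[7] === '502';
}
```

Running every downloaded line through `is502WithNoResponseTime` is how I'd isolate the suspect entries before dumping them into the CSV.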
But those came from many different requests to different hosts, both node web servers. The interesting parts are:
Now, based on this, it seems that the response_processing_time is -1 if the load balancer can’t send the request to the target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.
- I looked at the Target Group metrics in AWS and could not see any 502s, 500s, or errors at all during the time we last saw the 502. Not even Backend Connection Errors:
I don’t think the issue is with the target microservice, as it works reliably for almost everyone, while the 502 happens persistently to some users - unless they clear their cookies/app storage.
So the issue is either a malformed client request (doesn’t make too much sense - some of the requests that receive 502 are for getting an icon file… but well, it’s possible) or a connection idle timeout.
Connection idle timeout - as they recommend, I will try to increase the idle timeout and also verify that HTTP Keep-Alive is on for our web servers. That being said, this doesn’t make too much sense to me… as the HTTP 502s are persistent and immediate for affected clients, and are solved by deleting cookies.
Found this post from 6 months ago talking about such issues.
Hence, it makes more sense that the client request is somehow malformed. I am guessing - potentially too large a cookie? Too many values in the cookie headers? Something of that sort. I think the ALB rejects the request; I'm not sure why 502 is the error it chooses.
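If the oversized-cookie guess is right, the request might be tripping a header-size limit somewhere in the chain (Node itself caps total header size, configurable via `--max-http-header-size`, and the ALB has its own limits). A quick way to check is to log the Cookie header size server-side; the 6 KB threshold below is an arbitrary assumption for illustration:

```javascript
// Log requests whose Cookie header is suspiciously large, and return the
// size in bytes so the threshold can be tuned. (The 6 KB cutoff is a guess.)
function cookieHeaderBytes(req, warnThreshold = 6 * 1024) {
  const cookie = req.headers.cookie || '';
  const bytes = Buffer.byteLength(cookie, 'utf8');
  if (bytes > warnThreshold) {
    console.warn(`oversized Cookie header: ${bytes} bytes`);
  }
  return bytes;
}
```

Dropping this into the request handler for a few days should show whether the 502-afflicted users correlate with oversized cookies - which would also neatly explain why clearing cookies fixes it for them.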
We will keep researching and I’ll update here, but the issue most likely relates to our client-side behavior and not to the backend or the ALB itself.