Hello, I’ve been running a v3 rack in production for the last few months, and I’ve just noticed that I have a lot of unhealthy pods and pending instances:
$ convox ps
ID SERVICE STATUS RELEASE STARTED COMMAND
web-5b7d9797c9-6gvcq web unhealthy RLDZCFNCLKA 3 weeks ago run_as_deploy foreman start -m web=1
web-5b7d9797c9-b985h web unhealthy RLDZCFNCLKA 3 weeks ago run_as_deploy foreman start -m web=1
web-5b7d9797c9-h82qq web unhealthy RLDZCFNCLKA 1 month ago run_as_deploy foreman start -m web=1
web-5b7d9797c9-j84kx web unhealthy RLDZCFNCLKA 3 weeks ago run_as_deploy foreman start -m web=1
web-5b7d9797c9-kx9mf web unhealthy RLDZCFNCLKA 1 month ago run_as_deploy foreman start -m web=1
web-5b7d9797c9-nbbdk web unhealthy RLDZCFNCLKA 1 month ago run_as_deploy foreman start -m web=1
web-6f4b894fcf-bplzz web unhealthy RHERMQWZFTE 1 month ago run_as_deploy foreman start -m web=1
web-6f4b894fcf-rctrn web unhealthy RHERMQWZFTE 1 month ago run_as_deploy foreman start -m web=1
web-c5bd884cb-f6jsf web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-gc5vj web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kgph8 web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kpqmz web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kttsb web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
worker-77f69c5d49-nh6c8 worker running RWHBNVNMJJA 1 day ago run_as_deploy foreman start -m worker=1
worker-77f69c5d49-vj65w worker running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m worker=1
worker-77f69c5d49-xqvhx worker running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m worker=1
$ convox instances
ID STATUS STARTED PS CPU MEM PUBLIC PRIVATE
ip-10-1-109-105.ec2.internal pending 1 month ago 8 0.00% 0.00% 10.1.109.105
ip-10-1-146-11.ec2.internal running 3 weeks ago 8 0.00% 0.00% 10.1.146.11
ip-10-1-152-231.ec2.internal pending 3 weeks ago 6 0.00% 0.00% 10.1.152.231
ip-10-1-185-2.ec2.internal running 1 month ago 6 0.00% 0.00% 10.1.185.2
ip-10-1-194-125.ec2.internal running 1 day ago 6 0.00% 0.00% 10.1.194.125
ip-10-1-232-161.ec2.internal pending 1 month ago 6 0.00% 0.00% 10.1.232.161
ip-10-1-235-211.ec2.internal running 1 week ago 6 0.00% 0.00% 10.1.235.211
ip-10-1-66-58.ec2.internal running 2 months ago 17 0.00% 0.00% 10.1.66.58
ip-10-1-82-5.ec2.internal running 1 week ago 6 0.00% 0.00% 10.1.82.5
I’ve tried running convox ps stop ... to stop the containers, and convox instances terminate ... to terminate the instances, but neither command is working.
Does anyone know why this might have started happening, and is there anything I can adjust to prevent the unhealthy pods from sticking around? The application is running fine and I’m still able to deploy, but I just want to get rid of these unhealthy pods and pending instances. Thanks!
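One thing I’m wondering about is whether the health check settings for my web service are related, since “unhealthy” presumably means the check is failing. For reference, this is roughly the shape of the health section in convox.yml that I’d be tuning (the path and timing values here are placeholders, not my actual config):

```yaml
services:
  web:
    health:
      path: /health   # placeholder endpoint
      grace: 30       # seconds to wait after boot before the first check
      interval: 10    # seconds between checks
      timeout: 5      # seconds before a check counts as failed
```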
I found out that the three EC2 instances were actually running, even though they were marked as pending in the Convox CLI output. So I terminated them. Three new instances started up to replace them, and these ones came up fine.
$ convox instances
ID STATUS STARTED PS CPU MEM PUBLIC PRIVATE
ip-10-1-146-11.ec2.internal running 3 weeks ago 8 0.00% 0.00% 10.1.146.11
ip-10-1-176-139.ec2.internal running 1 minute ago 4 0.00% 0.00% 10.1.176.139
ip-10-1-185-2.ec2.internal running 1 month ago 6 0.00% 0.00% 10.1.185.2
ip-10-1-194-125.ec2.internal running 1 day ago 6 0.00% 0.00% 10.1.194.125
ip-10-1-235-211.ec2.internal running 1 week ago 6 0.00% 0.00% 10.1.235.211
ip-10-1-246-22.ec2.internal running 1 minute ago 4 0.00% 0.00% 10.1.246.22
ip-10-1-66-58.ec2.internal running 2 months ago 17 0.00% 0.00% 10.1.66.58
ip-10-1-82-5.ec2.internal running 1 week ago 6 0.00% 0.00% 10.1.82.5
ip-10-1-94-108.ec2.internal running 1 minute ago 4 0.00% 0.00% 10.1.94.108
Terminating the pending instances also knocked out quite a few of the unhealthy processes, leaving only two:
$ convox ps
ID SERVICE STATUS RELEASE STARTED COMMAND
web-5b7d9797c9-h82qq web unhealthy RLDZCFNCLKA 1 month ago run_as_deploy foreman start -m web=1
web-5b7d9797c9-kx9mf web unhealthy RLDZCFNCLKA 1 month ago run_as_deploy foreman start -m web=1
web-c5bd884cb-f6jsf web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-gc5vj web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kgph8 web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kpqmz web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kttsb web running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m web=1
worker-77f69c5d49-nh6c8 worker running RWHBNVNMJJA 1 day ago run_as_deploy foreman start -m worker=1
worker-77f69c5d49-vj65w worker running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m worker=1
worker-77f69c5d49-xqvhx worker running RWHBNVNMJJA 1 week ago run_as_deploy foreman start -m worker=1
I inspected the two remaining processes with convox ps info:
$ convox ps info web-5b7d9797c9-h82qq
Id web-5b7d9797c9-h82qq
App docspring
Command run_as_deploy foreman start -m web=1
Instance ip-10-1-185-2.ec2.internal
Release RLDZCFNCLKA
Service web
Started 1 month ago
Status unhealthy
$ convox ps info web-5b7d9797c9-kx9mf
Id web-5b7d9797c9-kx9mf
App docspring
Command run_as_deploy foreman start -m web=1
Instance ip-10-1-185-2.ec2.internal
Release RLDZCFNCLKA
Service web
Started 1 month ago
Status unhealthy
They were both running on ip-10-1-185-2.ec2.internal, so I decided to just kill that server. P.S. It would be nice to have this command implemented in the CLI:
$ convox instances terminate ip-10-1-185-2.ec2.internal
Terminating instance... ERROR: unimplemented
So I did it in the AWS console. This got rid of the last 2 unhealthy processes.
I’m still not sure why this happened or why everything got a bit stuck in the pending/unhealthy state. I guess I will just try to set up some monitoring and alerts so that I know when there’s a problem. Does anyone know how to do this in Datadog? (Please let me know if you have Convox + Datadog experience, I would love to pay for a consultation.)
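In case it helps anyone searching later: since v3 racks run on EKS, I believe the Datadog Agent’s kube-state metrics should expose pod phases, so a monitor query along these lines might catch stuck pods (the metric name and threshold are my assumptions from reading the Datadog docs, not something I’ve deployed yet):

```
min(last_15m):sum:kubernetes_state.pod.status_phase{phase:pending} by {kube_namespace} > 0
```

A similar monitor on phase:running dropping below the expected process count could cover the unhealthy case.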
This is still happening for me, and it’s quite frustrating (and expensive!).
If I don’t check my rack regularly, then I get a ton of unhealthy processes and lots of zombie instances that are running up costs while not being used. Example:
$ convox ps
ID SERVICE STATUS RELEASE STARTED COMMAND
web-58cf4578c5-8z22t web running RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m web=1
web-58cf4578c5-bxwt7 web unhealthy RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m web=1
web-58cf4578c5-flvrl web running RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m web=1
web-58cf4578c5-kk6mj web running RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m web=1
web-58cf4578c5-prmkf web unhealthy RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-2hwnc web unhealthy RXBEATZFBKN 1 month ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-9477p web unhealthy RXBEATZFBKN 2 months ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-f7nts web unhealthy RXBEATZFBKN 1 month ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-gjk5v web unhealthy RXBEATZFBKN 1 month ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-jvtr4 web unhealthy RXBEATZFBKN 2 months ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-jwrwr web unhealthy RXBEATZFBKN 1 month ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-mq97c web unhealthy RXBEATZFBKN 2 months ago run_as_deploy foreman start -m web=1
web-7b8dd5db4c-zndfk web unhealthy RXBEATZFBKN 2 months ago run_as_deploy foreman start -m web=1
worker-686ff9764d-6s88f worker running RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m worker=1
worker-686ff9764d-kl4s4 worker running RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m worker=1
worker-686ff9764d-rfmqj worker running RAWNGOQNRQL 6 days ago run_as_deploy foreman start -m worker=1
How can I stop this from happening and automatically kill these unhealthy processes?
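In the meantime I’m considering just running a watchdog on a cron schedule that parses the convox ps output and stops anything unhealthy (convox ps stop does work most of the time). A rough sketch of what I have in mind, assuming the ID/SERVICE/STATUS column order shown above:

```python
import subprocess

def unhealthy_ids(ps_output: str) -> list[str]:
    """Return the IDs of processes whose STATUS column is 'unhealthy'."""
    ids = []
    for line in ps_output.splitlines():
        parts = line.split()
        # Expected columns: ID SERVICE STATUS RELEASE STARTED COMMAND
        if len(parts) >= 3 and parts[2] == "unhealthy":
            ids.append(parts[0])
    return ids

def stop_unhealthy() -> None:
    """Run `convox ps` and issue `convox ps stop` for each unhealthy process."""
    out = subprocess.run(
        ["convox", "ps"], capture_output=True, text=True, check=True
    ).stdout
    for pid in unhealthy_ids(out):
        subprocess.run(["convox", "ps", "stop", pid], check=True)
```

Calling stop_unhealthy() every few minutes from cron would at least keep things tidy, though it obviously papers over whatever the root cause is.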
It’s also strange to see that processes from a previous release are still alive and reporting a “running” status:
$ cxps
ID SERVICE STATUS RELEASE STARTED COMMAND
web-698b4484d5-29lq5 web unhealthy ROPLEALFPGZ 2 weeks ago run_as_deploy foreman start -m web=1
web-698b4484d5-rtm67 web unhealthy ROPLEALFPGZ 2 weeks ago run_as_deploy foreman start -m web=1
web-7c8fc5d498-6tqxb web running RUPYPQESDBP 22 minutes ago run_as_deploy foreman start -m web=1
web-7c8fc5d498-9b97n web running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m web=1
web-7c8fc5d498-blk24 web running RUPYPQESDBP 3 days ago run_as_deploy foreman start -m web=1
web-7c8fc5d498-gth8v web running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m web=1
web-7c8fc5d498-r8djj web running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m web=1
web-7c8fc5d498-r9bfh web running RUPYPQESDBP 38 minutes ago run_as_deploy foreman start -m web=1
web-c5bd884cb-f6jsf web running RWHBNVNMJJA 3 weeks ago run_as_deploy foreman start -m web=1
web-c5bd884cb-kttsb web running RWHBNVNMJJA 3 weeks ago run_as_deploy foreman start -m web=1
worker-5888cc869f-4ptwh worker running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m worker=1
worker-5888cc869f-528cq worker running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m worker=1
worker-5888cc869f-j9qnn worker running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m worker=1
worker-5888cc869f-qgmmb worker running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m worker=1
worker-5888cc869f-r8fjg worker running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m worker=1
worker-5888cc869f-xf4hh worker running RUPYPQESDBP 6 days ago run_as_deploy foreman start -m worker=1
Is there something going wrong with my deployments where the deploy isn’t finishing properly and processes are getting stuck?
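My working theory is that the old ReplicaSet never gets scaled down when a rollout stalls. The pod names above encode the ReplicaSet template hash (e.g. web-7c8fc5d498-6tqxb → 7c8fc5d498), so a quick way to spot a stuck rollout from the convox ps output is to check whether any service still has pods from more than one hash. A sketch, assuming pod IDs always have the form service-hash-suffix:

```python
def active_template_hashes(ps_output: str) -> dict[str, set[str]]:
    """Group pods by service and collect the ReplicaSet template hash
    embedded in each pod name (e.g. web-7c8fc5d498-6tqxb -> 7c8fc5d498)."""
    hashes: dict[str, set[str]] = {}
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0] == "ID":
            continue  # skip blanks and the header row
        pod_id, service = parts[0], parts[1]
        segments = pod_id.rsplit("-", 2)
        if len(segments) == 3:
            hashes.setdefault(service, set()).add(segments[1])
    return hashes

def stuck_services(ps_output: str) -> list[str]:
    """Services with pods from more than one template hash still carry
    pods from an old ReplicaSet -- a sign the rollout never finished."""
    return sorted(
        s for s, h in active_template_hashes(ps_output).items() if len(h) > 1
    )
```

Running this against the output above flags web (three hashes) but not worker (one hash), which matches what I’m seeing.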