How to fix unhealthy pods and pending instances?

Hello, I’ve been running a v3 rack in production for the last few months, and I’ve just noticed that I have a lot of unhealthy pods and pending instances:

$ convox ps
ID                       SERVICE  STATUS     RELEASE      STARTED      COMMAND
web-5b7d9797c9-6gvcq     web      unhealthy  RLDZCFNCLKA  3 weeks ago  run_as_deploy foreman start -m web=1
web-5b7d9797c9-b985h     web      unhealthy  RLDZCFNCLKA  3 weeks ago  run_as_deploy foreman start -m web=1
web-5b7d9797c9-h82qq     web      unhealthy  RLDZCFNCLKA  1 month ago  run_as_deploy foreman start -m web=1
web-5b7d9797c9-j84kx     web      unhealthy  RLDZCFNCLKA  3 weeks ago  run_as_deploy foreman start -m web=1
web-5b7d9797c9-kx9mf     web      unhealthy  RLDZCFNCLKA  1 month ago  run_as_deploy foreman start -m web=1
web-5b7d9797c9-nbbdk     web      unhealthy  RLDZCFNCLKA  1 month ago  run_as_deploy foreman start -m web=1
web-6f4b894fcf-bplzz     web      unhealthy  RHERMQWZFTE  1 month ago  run_as_deploy foreman start -m web=1
web-6f4b894fcf-rctrn     web      unhealthy  RHERMQWZFTE  1 month ago  run_as_deploy foreman start -m web=1
web-c5bd884cb-f6jsf      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-gc5vj      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-kgph8      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-kpqmz      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-kttsb      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
worker-77f69c5d49-nh6c8  worker   running    RWHBNVNMJJA  1 day ago    run_as_deploy foreman start -m worker=1
worker-77f69c5d49-vj65w  worker   running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m worker=1
worker-77f69c5d49-xqvhx  worker   running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m worker=1

$ convox instances
ID                            STATUS   STARTED       PS  CPU    MEM    PUBLIC  PRIVATE
ip-10-1-109-105.ec2.internal  pending  1 month ago   8   0.00%  0.00%          10.1.109.105
ip-10-1-146-11.ec2.internal   running  3 weeks ago   8   0.00%  0.00%          10.1.146.11
ip-10-1-152-231.ec2.internal  pending  3 weeks ago   6   0.00%  0.00%          10.1.152.231
ip-10-1-185-2.ec2.internal    running  1 month ago   6   0.00%  0.00%          10.1.185.2
ip-10-1-194-125.ec2.internal  running  1 day ago     6   0.00%  0.00%          10.1.194.125
ip-10-1-232-161.ec2.internal  pending  1 month ago   6   0.00%  0.00%          10.1.232.161
ip-10-1-235-211.ec2.internal  running  1 week ago    6   0.00%  0.00%          10.1.235.211
ip-10-1-66-58.ec2.internal    running  2 months ago  17  0.00%  0.00%          10.1.66.58
ip-10-1-82-5.ec2.internal     running  1 week ago    6   0.00%  0.00%          10.1.82.5

I’ve tried running convox ps stop ... to stop the containers and convox instances terminate ... to terminate the instances, but neither command seems to work.

Does anyone know why this might have started happening, and is there anything I can adjust to prevent the unhealthy pods from sticking around? The application is running fine and I’m still able to deploy, but I just want to get rid of these unhealthy pods and pending instances. Thanks!
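
For context on what’s underneath: a v3 rack is an EKS cluster, so the pods can also be inspected directly with kubectl. This is only a rough sketch; the region, rack name, and namespace below are placeholders (I’m assuming the cluster is named after the rack and that the app namespace follows the <rack>-<app> pattern; kubectl get namespaces will show the real one):

# Point kubectl at the rack's EKS cluster (region and cluster name are assumptions)
$ aws eks update-kubeconfig --region us-east-1 --name myrack

# Find the app namespace, then look at the problem pods
$ kubectl get namespaces
$ kubectl get pods -n myrack-docspring -o wide

# An "unhealthy" process in convox ps should correspond to a pod that isn't Ready;
# describe shows failing readiness probes, scheduling problems, restarts, etc.
$ kubectl describe pod web-5b7d9797c9-h82qq -n myrack-docspring

# Deleting a pod lets its ReplicaSet recreate it if it's still supposed to exist
$ kubectl delete pod web-5b7d9797c9-h82qq -n myrack-docspring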

I found out that the three EC2 instances marked as pending in the Convox CLI output were actually running according to the AWS console.
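
To double-check the real EC2 state without clicking through the console, the AWS CLI can look the instances up by the private IPs from the convox instances output. A sketch, assuming the CLI is configured for the rack’s account and region:

$ aws ec2 describe-instances \
    --filters "Name=private-ip-address,Values=10.1.109.105,10.1.152.231,10.1.232.161" \
    --query "Reservations[].Instances[].[InstanceId,PrivateIpAddress,State.Name]" \
    --output table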

So I terminated them. Three new instances were then started to replace them, and those came up fine.

$ convox instances
ID                            STATUS   STARTED       PS  CPU    MEM    PUBLIC  PRIVATE
ip-10-1-146-11.ec2.internal   running  3 weeks ago   8   0.00%  0.00%          10.1.146.11
ip-10-1-176-139.ec2.internal  running  1 minute ago  4   0.00%  0.00%          10.1.176.139
ip-10-1-185-2.ec2.internal    running  1 month ago   6   0.00%  0.00%          10.1.185.2
ip-10-1-194-125.ec2.internal  running  1 day ago     6   0.00%  0.00%          10.1.194.125
ip-10-1-235-211.ec2.internal  running  1 week ago    6   0.00%  0.00%          10.1.235.211
ip-10-1-246-22.ec2.internal   running  1 minute ago  4   0.00%  0.00%          10.1.246.22
ip-10-1-66-58.ec2.internal    running  2 months ago  17  0.00%  0.00%          10.1.66.58
ip-10-1-82-5.ec2.internal     running  1 week ago    6   0.00%  0.00%          10.1.82.5
ip-10-1-94-108.ec2.internal   running  1 minute ago  4   0.00%  0.00%          10.1.94.108

Terminating the pending instances also cleared out most of the unhealthy processes, leaving only two:

$ convox ps
ID                       SERVICE  STATUS     RELEASE      STARTED      COMMAND
web-5b7d9797c9-h82qq     web      unhealthy  RLDZCFNCLKA  1 month ago  run_as_deploy foreman start -m web=1
web-5b7d9797c9-kx9mf     web      unhealthy  RLDZCFNCLKA  1 month ago  run_as_deploy foreman start -m web=1
web-c5bd884cb-f6jsf      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-gc5vj      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-kgph8      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-kpqmz      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
web-c5bd884cb-kttsb      web      running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m web=1
worker-77f69c5d49-nh6c8  worker   running    RWHBNVNMJJA  1 day ago    run_as_deploy foreman start -m worker=1
worker-77f69c5d49-vj65w  worker   running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m worker=1
worker-77f69c5d49-xqvhx  worker   running    RWHBNVNMJJA  1 week ago   run_as_deploy foreman start -m worker=1

I inspected the processes with convox ps info:

$ convox ps info web-5b7d9797c9-h82qq
Id        web-5b7d9797c9-h82qq
App       docspring
Command   run_as_deploy foreman start -m web=1
Instance  ip-10-1-185-2.ec2.internal
Release   RLDZCFNCLKA
Service   web
Started   1 month ago
Status    unhealthy

$ convox ps info web-5b7d9797c9-kx9mf
Id        web-5b7d9797c9-kx9mf
App       docspring
Command   run_as_deploy foreman start -m web=1
Instance  ip-10-1-185-2.ec2.internal
Release   RLDZCFNCLKA
Service   web
Started   1 month ago
Status    unhealthy

They were both running on ip-10-1-185-2.ec2.internal, so I just decided to kill that server. P.S. It would be nice if this command were actually implemented in the CLI:

$ convox instances terminate ip-10-1-185-2.ec2.internal
Terminating instance... ERROR: unimplemented

So I terminated the instance in the AWS console instead. That got rid of the last two unhealthy processes.
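
In case it saves someone else the trip to the console: the same private-IP lookup as above gives the instance ID, and the termination can then be scripted (the instance ID below is just a placeholder for whatever the lookup returns):

# Resolve the instance ID from the private IP shown by convox instances
$ aws ec2 describe-instances \
    --filters "Name=private-ip-address,Values=10.1.185.2" \
    --query "Reservations[].Instances[].InstanceId" --output text

# Placeholder ID; the rack's autoscaling group should bring up a replacement
$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0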

I’m still not sure why this happened or why everything got stuck in the pending/unhealthy state. For now I’ll try to set up some monitoring and alerts so that I know when there’s a problem. Does anyone know how to do this in Datadog? (Please let me know if you have Convox + Datadog experience; I would love to pay for a consultation.)
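
This is the kind of monitor I have in mind, created through the Datadog API. It assumes the Datadog Agent is running on the rack’s cluster with the Kubernetes state metrics enabled; the metric name, tag, and thresholds are my best guesses and would need to be checked against what actually shows up in your account:

# Alert when any pod has been stuck in the Pending phase for 15+ minutes
$ curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    -d '{
      "name": "Pods stuck in Pending",
      "type": "query alert",
      "query": "min(last_15m):sum:kubernetes_state.pod.status_phase{pod_phase:pending} > 0",
      "message": "Pods have been Pending for 15+ minutes on the rack cluster.",
      "options": {"thresholds": {"critical": 0}}
    }'

A similar monitor could presumably be set up for not-ready pods to cover the unhealthy case.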

This is still happening for me, and it’s quite frustrating (and expensive!). If I don’t check my rack regularly, I end up with a ton of unhealthy processes and zombie instances that run up costs while sitting unused. Example:

$ convox ps
ID                       SERVICE  STATUS     RELEASE      STARTED       COMMAND
web-58cf4578c5-8z22t     web      running    RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m web=1
web-58cf4578c5-bxwt7     web      unhealthy  RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m web=1
web-58cf4578c5-flvrl     web      running    RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m web=1
web-58cf4578c5-kk6mj     web      running    RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m web=1
web-58cf4578c5-prmkf     web      unhealthy  RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m web=1
web-7b8dd5db4c-2hwnc     web      unhealthy  RXBEATZFBKN  1 month ago   run_as_deploy foreman start -m web=1
web-7b8dd5db4c-9477p     web      unhealthy  RXBEATZFBKN  2 months ago  run_as_deploy foreman start -m web=1
web-7b8dd5db4c-f7nts     web      unhealthy  RXBEATZFBKN  1 month ago   run_as_deploy foreman start -m web=1
web-7b8dd5db4c-gjk5v     web      unhealthy  RXBEATZFBKN  1 month ago   run_as_deploy foreman start -m web=1
web-7b8dd5db4c-jvtr4     web      unhealthy  RXBEATZFBKN  2 months ago  run_as_deploy foreman start -m web=1
web-7b8dd5db4c-jwrwr     web      unhealthy  RXBEATZFBKN  1 month ago   run_as_deploy foreman start -m web=1
web-7b8dd5db4c-mq97c     web      unhealthy  RXBEATZFBKN  2 months ago  run_as_deploy foreman start -m web=1
web-7b8dd5db4c-zndfk     web      unhealthy  RXBEATZFBKN  2 months ago  run_as_deploy foreman start -m web=1
worker-686ff9764d-6s88f  worker   running    RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m worker=1
worker-686ff9764d-kl4s4  worker   running    RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m worker=1
worker-686ff9764d-rfmqj  worker   running    RAWNGOQNRQL  6 days ago    run_as_deploy foreman start -m worker=1

How can I stop this from happening and automatically kill these unhealthy processes?
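
The best stopgap I can think of is a small cleanup job run from cron that deletes pods which aren’t Ready, so their ReplicaSets can recreate the ones that are still wanted. This is only a sketch: the namespace is my assumed <rack>-<app> name, it needs jq, and a brand-new pod that simply hasn’t passed its readiness probe yet would also match, so a real version would want an age filter too:

# List pods whose Ready condition is not True (namespace is an assumption)
$ kubectl get pods -n myrack-docspring -o json \
    | jq -r '.items[]
        | select(any(.status.conditions[]?; .type == "Ready" and .status != "True"))
        | .metadata.name'

# Once the list looks right, append this to delete them:
#   ... | xargs -r kubectl delete pod -n myrack-docspring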

It’s also strange to see that processes from a previous release (RWHBNVNMJJA) are still running and apparently considered healthy:

$ convox ps
ID                       SERVICE  STATUS     RELEASE      STARTED         COMMAND
web-698b4484d5-29lq5     web      unhealthy  ROPLEALFPGZ  2 weeks ago     run_as_deploy foreman start -m web=1
web-698b4484d5-rtm67     web      unhealthy  ROPLEALFPGZ  2 weeks ago     run_as_deploy foreman start -m web=1
web-7c8fc5d498-6tqxb     web      running    RUPYPQESDBP  22 minutes ago  run_as_deploy foreman start -m web=1
web-7c8fc5d498-9b97n     web      running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m web=1
web-7c8fc5d498-blk24     web      running    RUPYPQESDBP  3 days ago      run_as_deploy foreman start -m web=1
web-7c8fc5d498-gth8v     web      running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m web=1
web-7c8fc5d498-r8djj     web      running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m web=1
web-7c8fc5d498-r9bfh     web      running    RUPYPQESDBP  38 minutes ago  run_as_deploy foreman start -m web=1
web-c5bd884cb-f6jsf      web      running    RWHBNVNMJJA  3 weeks ago     run_as_deploy foreman start -m web=1
web-c5bd884cb-kttsb      web      running    RWHBNVNMJJA  3 weeks ago     run_as_deploy foreman start -m web=1
worker-5888cc869f-4ptwh  worker   running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m worker=1
worker-5888cc869f-528cq  worker   running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m worker=1
worker-5888cc869f-j9qnn  worker   running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m worker=1
worker-5888cc869f-qgmmb  worker   running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m worker=1
worker-5888cc869f-r8fjg  worker   running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m worker=1
worker-5888cc869f-xf4hh  worker   running    RUPYPQESDBP  6 days ago      run_as_deploy foreman start -m worker=1

Is something going wrong with my deployments, where a deploy isn’t finishing properly and old processes are getting stuck?
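
One way to check exactly that: if a rollout never completes, Kubernetes keeps the old ReplicaSet’s pods around, which would explain the leftover RWHBNVNMJJA processes. The pod names suggest the deployment is simply called web, so something like this should show whether a rollout is stuck (namespace is my assumed <rack>-<app> name again):

# Blocks until the current rollout finishes, or gives up after the timeout
$ kubectl rollout status deployment/web -n myrack-docspring --timeout=5m

# Old ReplicaSets that still show desired/ready pods are holding on to old-release processes
$ kubectl get replicasets -n myrack-docspring -o wide

# The deployment's conditions and events usually say why a rollout stalled
$ kubectl describe deployment web -n myrack-docspring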