~5 minutes of downtime after running "convox rack update --wait"

Hello,

I just ran convox rack update --wait to update to version 20191120232059, and all of my web containers were suddenly killed without any replacements, so I had roughly 5 minutes of downtime before ECS started 3 new web containers.
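
For reference, the ECS service events should show roughly when the old web tasks were stopped and the new ones started. Something like this lists the recent events (the cluster and service names below are placeholders, not my real resource names):

# Placeholders - substitute the rack's actual ECS cluster and web service names
CLUSTER=my-rack-Cluster-XXXXXXXX
SERVICE=my-rack-myapp-ServiceWeb-XXXXXXXX

# Print the 20 most recent service events (task stops/starts, steady-state messages)
aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --query 'services[0].events[:20].[createdAt,message]' \
  --output text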

One strange thing is that convox rack releases had no output before I ran the update:

$ convox rack releases
VERSION  UPDATED

After the update:

$ convox rack releases
VERSION  UPDATED
20191120232059  32 minutes ago

Here are the logs from the update:

$ convox rack update --wait
Updating to 20191120232059...
2019-11-23T10:45:27Z system/aws/cfm my-rack UPDATE_IN_PROGRESS my-rack User Initiated
2019-11-23T10:45:32Z system/aws/cfm my-rack CREATE_IN_PROGRESS RecordSetGateway
2019-11-23T10:45:32Z system/aws/cfm my-rack UPDATE_IN_PROGRESS InstancesLifecycleHandler
2019-11-23T10:45:32Z system/aws/cfm my-rack UPDATE_IN_PROGRESS BalancerSecurity
2019-11-23T10:45:33Z system/aws/cfm my-rack UPDATE_IN_PROGRESS CustomTopic
2019-11-23T10:45:33Z system/aws/cfm my-rack UPDATE_COMPLETE InstancesLifecycleHandler
2019-11-23T10:45:33Z system/aws/cfm my-rack CREATE_IN_PROGRESS RecordSetGateway Resource creation Initiated
2019-11-23T10:45:34Z system/aws/cfm my-rack UPDATE_COMPLETE CustomTopic
2019-11-23T10:45:37Z system/aws/cfm my-rack UPDATE_IN_PROGRESS SpotLaunchConfiguration Requested update requires the creation of a new physical resource; hence creating one.
2019-11-23T10:45:37Z system/aws/cfm my-rack UPDATE_IN_PROGRESS LaunchConfiguration Requested update requires the creation of a new physical resource; hence creating one.
2019-11-23T10:45:37Z system/aws/cfm my-rack UPDATE_IN_PROGRESS SpotLaunchConfiguration Resource creation Initiated
2019-11-23T10:45:37Z system/aws/cfm my-rack UPDATE_IN_PROGRESS LaunchConfiguration Resource creation Initiated
2019-11-23T10:45:38Z system/aws/cfm my-rack UPDATE_COMPLETE SpotLaunchConfiguration
2019-11-23T10:45:38Z system/aws/cfm my-rack UPDATE_COMPLETE LaunchConfiguration
2019-11-23T10:45:41Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiMonitorTasks Requested update requires the creation of a new physical resource; hence creating one.
2019-11-23T10:45:41Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiMonitorTasks Resource creation Initiated
2019-11-23T10:45:42Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances
2019-11-23T10:45:42Z system/aws/cfm my-rack UPDATE_COMPLETE ApiMonitorTasks
2019-11-23T10:45:42Z system/aws/cfm my-rack UPDATE_IN_PROGRESS SpotInstances
2019-11-23T10:45:42Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiBuildTasks Requested update requires the creation of a new physical resource; hence creating one.
2019-11-23T10:45:42Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiBuildTasks Resource creation Initiated
2019-11-23T10:45:42Z system/aws/cfm my-rack UPDATE_COMPLETE ApiBuildTasks
2019-11-23T10:45:44Z system/aws/cfm my-rack UPDATE_COMPLETE SpotInstances
2019-11-23T10:45:45Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
2019-11-23T10:45:46Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Temporarily setting autoscaling group MinSize and DesiredCapacity to 4.
2019-11-23T10:45:48Z system/aws/cfm my-rack UPDATE_COMPLETE BalancerSecurity
2019-11-23T10:46:37Z system/aws/cfm my-rack CREATE_COMPLETE RecordSetGateway
2019-11-23T10:47:03Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2019-11-23T10:47:04Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Received SUCCESS signal with UniqueId i-0d3dfe5831b1c5b92
2019-11-23T10:47:04Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Terminating instance(s) [i-01e98d240b1f39431]; replacing with 1 new instance(s).
2019-11-23T10:47:57Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Successfully terminated instance(s) [i-01e98d240b1f39431] (Progress 33%).
2019-11-23T10:47:57Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2019-11-23T10:48:18Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Received SUCCESS signal with UniqueId i-0bb4eee17203e9c1c
2019-11-23T10:48:18Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Terminating instance(s) [i-01343393c303814b5]; replacing with 1 new instance(s).
2019-11-23T10:49:35Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Successfully terminated instance(s) [i-01343393c303814b5] (Progress 67%).
2019-11-23T10:49:35Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2019-11-23T10:49:36Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Received SUCCESS signal with UniqueId i-098e1034d823d26ee
2019-11-23T10:49:37Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Terminating instance(s) [i-0a80492157ce139d6]; replacing with 0 new instance(s).
2019-11-23T10:49:38Z system/aws/cfm my-rack UPDATE_IN_PROGRESS Instances Successfully terminated instance(s) [i-0a80492157ce139d6] (Progress 100%).
2019-11-23T10:49:40Z system/aws/cfm my-rack UPDATE_COMPLETE Instances
2019-11-23T10:49:43Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiMonitorService
2019-11-23T10:49:44Z system/aws/cfm my-rack UPDATE_IN_PROGRESS InstancesAutoscaler
2019-11-23T10:49:45Z system/aws/cfm my-rack UPDATE_COMPLETE InstancesAutoscaler
2019-11-23T10:49:48Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiWebTasks Requested update requires the creation of a new physical resource; hence creating one.
2019-11-23T10:49:49Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiWebTasks Resource creation Initiated
2019-11-23T10:49:49Z system/aws/cfm my-rack UPDATE_COMPLETE ApiWebTasks
2019-11-23T10:49:53Z system/aws/cfm my-rack UPDATE_IN_PROGRESS ApiWebService
2019-11-23T10:51:44Z system/cloudformation aws/cfm my-rack UPDATE_COMPLETE ApiMonitorService
2019-11-23T10:51:54Z system/cloudformation aws/cfm my-rack UPDATE_COMPLETE ApiWebService
2019-11-23T10:51:57Z system/cloudformation aws/cfm my-rack UPDATE_COMPLETE_CLEANUP_IN_PROGRESS my-rack
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_IN_PROGRESS LaunchConfiguration
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_IN_PROGRESS ApiMonitorTasks
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_IN_PROGRESS ApiBuildTasks
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_IN_PROGRESS ApiWebTasks
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_IN_PROGRESS SpotInstancesLifecycleInterrupting
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_IN_PROGRESS SpotLaunchConfiguration
2019-11-23T10:51:59Z system/cloudformation aws/cfm my-rack DELETE_COMPLETE ApiBuildTasks
2019-11-23T10:52:00Z system/cloudformation aws/cfm my-rack DELETE_COMPLETE ApiWebTasks
2019-11-23T10:52:00Z system/cloudformation aws/cfm my-rack DELETE_COMPLETE ApiMonitorTasks
2019-11-23T10:52:00Z system/cloudformation aws/cfm my-rack DELETE_COMPLETE SpotLaunchConfiguration
2019-11-23T10:52:00Z system/cloudformation aws/cfm my-rack DELETE_COMPLETE LaunchConfiguration
2019-11-23T10:53:00Z system/cloudformation aws/cfm my-rack DELETE_COMPLETE SpotInstancesLifecycleInterrupting
2019-11-23T10:53:00Z system/cloudformation aws/cfm my-rack UPDATE_COMPLETE my-rack
OK

I had previously run convox rack update on my staging rack, and that finished without any problems and with no downtime.

But now I have one more rack to update (my main production region), and I'm very nervous about updating it because I want to avoid any downtime.
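
Before running the production update, I'm planning to double-check the web service's deployment configuration, since as I understand it minimumHealthyPercent is what should keep old tasks in service while replacements start. Roughly this (placeholder names again):

# Placeholders - substitute the production rack's cluster and web service names
CLUSTER=my-prod-rack-Cluster-XXXXXXXX
SERVICE=my-prod-rack-myapp-ServiceWeb-XXXXXXXX

# Show desired/running counts and the deployment configuration
# (minimumHealthyPercent / maximumPercent)
aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --query 'services[0].{desired: desiredCount, running: runningCount, deployment: deploymentConfiguration}' \
  --output json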

I just found these lines in the logs:

2019-11-23T10:49:56Z service/web/d8f51d26-20bc-42a1-941a-f8ec8c0666f3 10:49:56 web.1     | E, [2019-11-23T10:49:56.994244 #40] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.5.0/gems/ddtrace-0.28.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 172.17.0.1:8126 (Connection refused - connect(2) for "172.17.0.1" port 8126) Location: /usr/local/lib/ruby/2.5.0/net/http.rb:939:in `rescue in block in connect'
2019-11-23T10:50:03Z service/web/d8f51d26-20bc-42a1-941a-f8ec8c0666f3 10:50:03 web.1     | E, [2019-11-23T10:50:03.531876 #70] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.5.0/gems/ddtrace-0.28.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 172.17.0.1:8126 (Connection refused - connect(2) for "172.17.0.1" port 8126) Location: /usr/local/lib/ruby/2.5.0/net/http.rb:939:in `rescue in block in connect'
2019-11-23T10:50:13Z service/web/d8f51d26-20bc-42a1-941a-f8ec8c0666f3 10:50:13 web.1     | E, [2019-11-23T10:50:13.757915 #55] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.5.0/gems/ddtrace-0.28.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 172.17.0.1:8126 (Connection refused - connect(2) for "172.17.0.1" port 8126) Location: /usr/local/lib/ruby/2.5.0/net/http.rb:939:in `rescue in block in connect'

EDIT: Actually, I don't think this is related. This web container was successfully serving web requests before and after these log messages, so the errors didn't indicate a crash.

Argh, this happened again: the rack update caused more downtime when I updated my main production region.

The worker containers completed the rolling deploy without any problems, but the web containers were not started on the new instances. All of the web containers had been killed by the time the old EC2 instances were drained and shut down, and only after the rack update finished did ECS start 3 new web containers at the same time.
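
Next time I run an update I'll watch the web task count and the container instance states while it runs, to see whether the old instances are actually put into DRAINING before they're terminated. A rough, untested sketch of what I have in mind (placeholder names again):

# Placeholders - substitute the rack's actual ECS cluster and web service names
CLUSTER=my-rack-Cluster-XXXXXXXX
SERVICE=my-rack-myapp-ServiceWeb-XXXXXXXX

# Poll the number of RUNNING web tasks while the rack update runs
while true; do
  TASKS=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SERVICE" \
    --desired-status RUNNING --query 'length(taskArns)' --output text)
  echo "$(date -u +%FT%TZ) running web tasks: $TASKS"
  sleep 15
done

# In another terminal: are the old instances DRAINING or still ACTIVE?
aws ecs describe-container-instances --cluster "$CLUSTER" \
  --container-instances $(aws ecs list-container-instances --cluster "$CLUSTER" \
    --query 'containerInstanceArns[]' --output text) \
  --query 'containerInstances[].[ec2InstanceId,status,runningTasksCount]' \
  --output table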

It would be great if this could be fixed so that I can avoid downtime during a rack update.