Issue promoting releases with rack release 20210310212746

I updated our racks to 20210310212746 this morning, and the next time I attempted to promote a release I encountered errors deploying the agent defined in our application. The following message repeated ad infinitum:

2021-03-16T21:17:37Z system/ecs aws/ecs (service staging-polytomic-ServiceAgent-1U8DZCW31U5JJ-Service-Hp68oJVjAcje) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance 67f69e6c1f1b490a9044a448863a6bc0) is already using a port required by your task. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.

After downgrading the rack to 20210217150446 (the release we were previously on) the promotion completes successfully.

I tried the interim releases as well, and 20210217150446 seems to be the last release in which the agent deploys properly.

Any suggestions on what might be causing this, or where to start debugging?

:cricket:

For future readers: we wound up working around this by first promoting a release that removed our agent, and then immediately promoting a release that added it back. After that, release promotion worked as expected. It would seem that something changed between rack releases, and recreating the agent service solved it.

Other things we tried that did not work:

  • Setting the singleton flag on the agent service definition seemingly had no effect. That’s not a flag I’ve used before, so it was admittedly a stab in the dark.

Setting the singleton flag on the agent service definition seemingly had no effect

My investigations indicate that the ECS service is being “replaced”, which involves spinning up the promoted release as a new service, then removing the old one as part of cleanup. This will never succeed, because the old service will still be using the ports.
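To illustrate why a replacement can never finish when the service binds a fixed host port, here is a minimal toy model in Python. It is only a sketch and not how ECS actually schedules tasks: the `Instance` class, the service names `agent-v1` and `agent-v2`, and port 8125 are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical container instance tracking which service owns each host port."""
    name: str
    ports: dict = field(default_factory=dict)  # host port -> owning service

    def place(self, service: str, port: int) -> bool:
        # Placement fails if the host port is already bound on this instance.
        if port in self.ports:
            return False
        self.ports[port] = service
        return True

    def stop(self, service: str) -> None:
        # Free every port held by the given service.
        self.ports = {p: s for p, s in self.ports.items() if s != service}


def replace_service(instances, old, new, port) -> bool:
    """Replacement: start `new` everywhere first, remove `old` only afterwards."""
    if not all(inst.place(new, port) for inst in instances):
        return False  # cleanup of the old service is never reached
    for inst in instances:
        inst.stop(old)
    return True


def update_in_place(instances, old, new, port) -> bool:
    """In-place update: stop `old` on an instance, then start `new` on the freed port."""
    for inst in instances:
        inst.stop(old)
        if not inst.place(new, port):
            return False
    return True


def make_rack(port, service="agent-v1", size=2):
    """Build a rack where the agent already runs on every instance."""
    rack = [Instance(f"instance-{i}") for i in range(size)]
    for inst in rack:
        inst.place(service, port)
    return rack


replaced = replace_service(make_rack(8125), "agent-v1", "agent-v2", 8125)
updated = update_in_place(make_rack(8125), "agent-v1", "agent-v2", 8125)
print(replaced, updated)  # False True
```

In this model the replacement fails on the very first instance because the old agent still holds the port, which matches the "already using a port required by your task" message in the original post.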

It appears something changed in the CloudFormation templates generated by this rack version, which forces the update to be a replacement for the service. Thus, it’s impossible to promote a release for any agent service.

It’d be convox/rack pull request #3445, “Make circuit breaker deployments optional and off by default” by beastawakens, that caused this change.

For future reference, I’ve identified this as the specific change that results in service replacement - it’s in release 20210302144350.

I’ve tested further, and confirmed this causes issues in the following scenarios:

  • When using an agent configuration with ports (deployment will be stuck until rollback)
  • When there aren’t enough resources to fully place the old and new copy of the app services at the same time (deployment will be stuck until rollback)
  • When using singleton (there’ll be two instances of the Convox service running until the CloudFormation change “cleans up”)
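The resource scenario in the list above can be sketched with a small first-fit calculation in Python. This is an assumption-laden illustration rather than the real ECS placement logic: the instance sizes, the 512 MB task size, and the `placeable` helper are all hypothetical.

```python
def placeable(free_mb, task_mb, count):
    """Greedy first-fit: how many of `count` tasks fit into the per-instance free memory."""
    free = list(free_mb)  # copy so the caller's list is not modified
    placed = 0
    for _ in range(count):
        for i, avail in enumerate(free):
            if avail >= task_mb:
                free[i] -= task_mb
                placed += 1
                break
    return placed


# Hypothetical rack: two instances with 1024 MB each, service runs 3 tasks of 512 MB.
task_mb, count = 512, 3

# Replacement: the old tasks are still running, so only 512 MB is free on one instance.
free_during_replacement = [0, 512]
new_placed = placeable(free_during_replacement, task_mb, count)
stuck = new_placed < count

# If the old tasks were stopped first, the whole rack would be available.
placed_on_empty_rack = placeable([1024, 1024], task_mb, count)

print(new_placed, stuck, placed_on_empty_rack)  # 1 True 3
```

With these made-up numbers only one of the three new tasks can be placed while the old copy is still running, so the deployment would sit there until rollback.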

@liam.dawson should I expect this to happen to me when updating from 20200529011310 to the latest version?

I believe so. I haven’t seen anything that addresses this issue in any of the Rack updates.