Certificate errors when starting a build

We have two racks and starting this morning one of them has been failing to build correctly.

When we run convox build, we get the following error:

Packaging source... OK
Uploading source... OK
Starting build... ERROR: Get "https://10.0.6.115:2376/containers/json?all=1&filters=%!B(MISSING)%!l(MISSING)abel%3A%!B(MISSING)%!c(MISSING)om.amazonaws.ecs.task-arn%!D(MISSING)arn%!A(MISSING)aws%!A(MISSING)ecs%!A(MISSING)us-west-2%!A(MISSING)568237466542%!A(MISSING)task%!F(MISSING)staging-BuildCluster-Q2YBPOFOG8T3%!F(MISSING)537aa976c0ba4351bceda779022e2b63%5D%!D(MISSING)": remote error: tls: bad certificate

Exited with code exit status 1

The MISSING bits in the URL, in particular, seem a little… distressing?

When I run convox certs for the rack, I don’t see any expired certs.

We’re running rack version 20230208173037; any help on where to start diagnosing the problem would be appreciated.

Thanks,

NRY

I’ve tried performing an Instance Refresh for both the build and running autoscaling groups, which didn’t seem to make an impact.
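
For anyone following along, the CLI equivalent of that is roughly the following; the group name is a placeholder for whatever Convox actually named your build/instance autoscaling groups:

aws autoscaling start-instance-refresh \
    --auto-scaling-group-name <rack build or instance ASG>
# check progress
aws autoscaling describe-instance-refreshes \
    --auto-scaling-group-name <rack build or instance ASG>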

I thought maybe there was an update to the CLI (we download it from http://download.convox.com/cli/linux/convox when our build runs in CI), but if that were the case I’d expect this to be impacting both racks.
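
For context, the CI install step is nothing fancy, roughly (the install path is just our choice):

curl -fsSL http://download.convox.com/cli/linux/convox -o /usr/local/bin/convox
chmod +x /usr/local/bin/convox
convox version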

I found this change, which references the lifetime of the Docker self-signed certificate. I checked CloudFormation, and the stack was created exactly a year ago this morning, so that seems to be the likely culprit.
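
Checking that is quick with the AWS CLI; the stack name below is a guess that the rack's stack is simply called staging, matching the cluster name in the error above:

aws cloudformation describe-stacks \
    --stack-name staging \
    --query "Stacks[0].CreationTime"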

We’re not in a position to apply a rack upgrade without disrupting customers; is it possible to simply regenerate those private certificates?

I ran a rack upgrade to the latest release but it doesn’t seem to have tainted/replaced the Docker certificates.

Is there something I need to do to get those to correctly regenerate? Delete/rename them in Parameter Store?
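
For anyone else poking at the same thing, this is roughly how I've been looking at them; the parameter path and name below are guesses on my part, so go by whatever the rack's CloudFormation template actually references:

# list parameters under the rack's namespace (path is an assumption)
aws ssm get-parameters-by-path --path /staging --recursive
# remove (or rename) one so the next stack update has to recreate it (name is hypothetical)
aws ssm delete-parameter --name /staging/DockerCertificate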

For posterity (or, more likely, future me when I come looking for an answer): the final issue was that I didn't update the CloudFormation template, only the version parameter.
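
In CloudFormation terms the difference is roughly the following; the stack name, the template URL, and the existence of a Version parameter are my shorthand for however the rack stack is actually wired up:

# bumping only the parameter while keeping the old template (what I'd effectively been doing)
aws cloudformation update-stack \
    --stack-name staging \
    --use-previous-template \
    --parameters <existing parameters, with Version bumped> \
    --capabilities CAPABILITY_IAM

# pointing the stack at the new release's template as well (what was actually needed)
aws cloudformation update-stack \
    --stack-name staging \
    --template-url <URL of the new release's rack template> \
    --parameters <existing parameters, with Version bumped> \
    --capabilities CAPABILITY_IAM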

Hi @nathan, we've just run into this as well, and we're in a very similar situation (bad certs, not in a great position to run a rack update). I'm able to find the Parameter Store values you're referencing. Just wanted to double-check: did you resolve this by renaming the parameters in Parameter Store and then triggering a CF stack update, or did you have to make other changes to the CF stack to point it at a new parameter group?

Hey Max,

Wish I had better news to share.

In the end I wound up effectively upgrading the rack: I wasn’t able to get CF to regenerate the certs. However, I didn’t try renaming the values in parameter store; that seems like it might be worth trying.

In our case we had long-running tasks, started outside of Convox, running on the EC2 instances, so my biggest concern was managing those and avoiding termination. I wound up doing the following (there's a rough AWS CLI sketch after this list):

  • Created a new, zero-instance autoscaling group with scale-in protection enabled to temporarily hold my running instances.
  • Added it as a capacity provider to my ECS cluster.
  • Manually detached the running instances from the existing Convox autoscaling group and attached them to the new group.
  • Applied the rack upgrade: at that point I had new instances running in the original group, as well as my old ones.
  • Set the old instances to DRAINING in ECS.
  • Once they'd drained, terminated them.
  • When the holding group was finally empty, removed the capacity provider and deleted the group.
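
If it's useful, here's very roughly what those steps look like with the AWS CLI. Every name, subnet, instance ID, and ARN below is a placeholder, and details like the launch template and your existing capacity-provider strategy will differ, so treat it as an outline rather than a recipe:

# 1. create the holding group, empty and protected from scale-in
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name rack-holding \
    --launch-template LaunchTemplateName=<existing launch template> \
    --min-size 0 --max-size 10 --desired-capacity 0 \
    --new-instances-protected-from-scale-in \
    --vpc-zone-identifier "<subnet ids>"

# 2. register it as an extra capacity provider on the rack's ECS cluster
aws ecs create-capacity-provider \
    --name rack-holding-cp \
    --auto-scaling-group-provider "autoScalingGroupArn=<holding ASG ARN>,managedScaling={status=DISABLED}"
aws ecs put-cluster-capacity-providers \
    --cluster <rack cluster> \
    --capacity-providers rack-holding-cp <existing providers> \
    --default-capacity-provider-strategy <existing default strategy>

# 3. move the running instances from the Convox group into the holding group
aws autoscaling detach-instances \
    --auto-scaling-group-name <Convox instances ASG> \
    --instance-ids <instance ids> \
    --should-decrement-desired-capacity
aws autoscaling attach-instances \
    --auto-scaling-group-name rack-holding \
    --instance-ids <instance ids>

# 4. after the rack upgrade brings up fresh instances, drain and terminate the old ones
aws ecs update-container-instances-state \
    --cluster <rack cluster> \
    --container-instances <old container instance ARNs> \
    --status DRAINING
aws ec2 terminate-instances --instance-ids <instance ids>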

Phew.

Hopefully the above helps in some way.

NRY


Hey @nathan and @max,

I’m sorry to hear you were having trouble solving this issue. I thought we had mentioned this in our newsletter that month and stressed updating racks if you encountered this error, but I apologize if our messaging on that didn’t make it to you.

I would suggest checking out this blog post I wrote up a little while ago that details the different rack and CLI versioning and update best practices.

But long story short: for v2 racks you are able to freely update across versions because of the way CloudFormation controls the stack. Simply run convox rack update -r rackName and you will be updated to the latest version.

You can also specify a version number in the command with convox rack update versionNumber -r rackName if you would like to update to a specific release.

The only reason you would not be able to update or downgrade to a version on a v2 rack would be if AWS has deprecated some aspect of that version and will not allow CloudFormation to create the resources.



For v3 racks, updating is a little more specific due to the k8s version updates. On v3 you must update through each minor version in sequence, e.g. 3.13.x → 3.14.x → 3.15.x.

Because you cannot downgrade back across a minor version on v3 racks, convox rack update -r rackName will only update to the highest patch within your current minor version, as we consider crossing a minor version a breaking change.

For v3 racks you must specify the version number to update to the next minor version, e.g. convox rack update 3.15.2 -r rackName if updating from a 3.14.x rack release.
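
As a concrete (made-up) example, moving a rack from 3.13.x to 3.15.2 would look something like this, one minor version at a time, with the intermediate version being whatever the latest 3.14.x patch release is at the time:

convox rack update <latest 3.14.x release> -r rackName
# wait for the rack to finish updating, then
convox rack update 3.15.2 -r rackName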

I hope this helps!

Regards,
Nick