Latest Rack Autoscaled from 10 -> 700 instances

Hi,

I’m running Convox Gen 2 and have been for several years.

On Jan 7th 2021, my integration rack auto-installed version 20210106115601.

From that point until a couple of hours ago, the rack grew from 10-or-so to 700-or-so instances. These instances were not running any services.

~I will attempt to resolve manually, but can you provide some advice on how this can happen?~

I have since applied convox scale to a particular service, which appeared to “kick” the Convox autoscaler: it then terminated 199 instances. We’re now down to 11 instances, which is in the right ballpark for this rack.

I’m using t3.small instances which, as far as I can tell, offer 2048 CPU units and 1955 MB of RAM. I can confirm that no service requires more than these values. The rack is configured for 3 on-demand instances, plus a variable number of spot instances as required.

I will continue to update and work through this as I understand more, but this seems to be quite an issue.

Any and all help appreciated.

This is happening again, on a different (V2) rack.

The logs are filled with entries like the following, and there are many empty ECS instances:

2021-02-02T01:35:32Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 who="EC2/ASG" what="Launching a new EC2 instance: i-024f6710cf7973fbc" why="At 2021-02-02T01:33:42Z a user request update of AutoScalingGroup constraints to min: 30, max: 1000, desired: 30 changing the desired capacity from 29 to 30. At 2021-02-02T01:33:46Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 29 to 30."

I resolved this by lowering a particular service’s memory scale from 100% to 75%.

This is what I see in the convox rack logs:

2021-02-02T03:25:32Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 who="EC2/ASG" what="Launching a new EC2 instance: i-0932c2e4894cb1c00" why="At 2021-02-02T03:23:43Z a user request update of AutoScalingGroup constraints to min: 85, max: 1000, desired: 85 changing the desired capacity from 84 to 85.  At 2021-02-02T03:23:53Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 84 to 85."
2021-02-02T03:25:32Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=workers.monitor count=85 connected='i-0025459c3bc7612f7,i-00512c12048a9372d,i-0057fae8d457d1167,i-00657cc25df643d3e,i-006a6e4fd52dd84c9,i-0099a7ffb3e19e536,i-00b87193392829a00,i-00eb81469fdcf54c7,i-010efe6130868c2f8,i-01138ff005f0421aa,i-0147dc7956d96b38b,i-018154b0a40848441,i-018418c495437dbf0,i-01a7a4de9c9dc93d3,i-01f2385f49cf65b5e,i-020677104280afa12,i-0220af70a0562a562,i-022bf368b01bcdffc,i-024f6710cf7973fbc,i-0251559cf6fb870b8,i-0281f7ba692a65806,i-028fe62e2331adb16,i-02d5124d4a600066f,i-030ac41b4cba7cb4f,i-034a0aa463ff770d4,i-0363ee2d85a6cd926,i-0364d102c1068d992,i-039144b2aafe94380,i-03aab9ca82878bdfd,i-04006f611416968d3,i-04141cb7122f547b7,i-042e54e5b3e268b7c,i-042ee4d68759438fd,i-0490f22219b90fad1,i-04e6b7eef232b1c70,i-04f371c3a08531002,i-051f3c73828ffc214,i-05418fc71cbaa2aa8,i-0556ad5be855886fe,i-057037fb7490d69fc,i-0590d48aa17c66b89,i-05ccffc330164644d,i-05ecb4272ac81ea61,i-06524fa204f97e0fd,i-068899d7676373194,i-0717259ebf3d7853a,i-0772702899ce81a08,i-07c0d1477553ceedb,i-07d880d3f46af5cae,i-07f9363be24e891c9,i-080966ec46c36ac93,i-0831585ca6d7fb5ff,i-085bb34d4eb2bea0b,i-08f0f13fae0f4d321,i-0932c2e4894cb1c00,i-096c4692eca8c0e11,i-09b9f7baf0293fe78,i-09fffe970898a8a55,i-0a05260a779de1f82,i-0a6710fbab6558620,i-0ad0b56ad8fd67923,i-0ad4113a156b02085,i-0ad603de66e6e558e,i-0b647b5e4e510837d,i-0b7fa8e26e44eae60,i-0bd8242b173f884c5,i-0c10ebb2d23752365,i-0c26a6d604bede30f,i-0c3a562cff94ebade,i-0c4d247eb0d6ccd89,i-0ca65f1c99ede025c,i-0cc37a55886db0436,i-0cdd5266daace5508,i-0eb1c87020b35a206,i-0ebc3de580c053e64,i-0ec0d9243b083e248,i-0eca0a851a54e440a,i-0edfc01e4d04f0c41,i-0f6326a1ad18a3e7c,i-0f8054d52dcd5cddd,i-0fb22d8117079fded,i-0fb58a4d264e1c79d,i-0fb9f11fee1d9e146,i-0fcaf0a30d724e93e,i-0fec13af7379c241f' 
healthy='i-0025459c3bc7612f7,i-00512c12048a9372d,i-0057fae8d457d1167,i-00657cc25df643d3e,i-006a6e4fd52dd84c9,i-0099a7ffb3e19e536,i-00b87193392829a00,i-00eb81469fdcf54c7,i-010efe6130868c2f8,i-01138ff005f0421aa,i-0147dc7956d96b38b,i-018154b0a40848441,i-018418c495437dbf0,i-01a7a4de9c9dc93d3,i-01f2385f49cf65b5e,i-020677104280afa12,i-0220af70a0562a562,i-022bf368b01bcdffc,i-024f6710cf7973fbc,i-0251559cf6fb870b8,i-0281f7ba692a65806,i-028fe62e2331adb16,i-02d5124d4a600066f,i-030ac41b4cba7cb4f,i-034a0aa463ff770d4,i-0363ee2d85a6cd926,i-0364d102c1068d992,i-039144b2aafe94380,i-03aab9ca82878bdfd,i-04006f611416968d3,i-04141cb7122f547b7,i-042e54e5b3e268b7c,i-042ee4d68759438fd,i-0490f22219b90fad1,i-04e6b7eef232b1c70,i-04f371c3a08531002,i-051f3c73828ffc214,i-05418fc71cbaa2aa8,i-0556ad5be855886fe,i-057037fb7490d69fc,i-0590d48aa17c66b89,i-05ccffc330164644d,i-05ecb4272ac81ea61,i-06524fa204f97e0fd,i-068899d7676373194,i-0717259ebf3d7853a,i-0772702899ce81a08,i-07c0d1477553ceedb,i-07d880d3f46af5cae,i-07f9363be24e891c9,i-080966ec46c36ac93,i-0831585ca6d7fb5ff,i-085bb34d4eb2bea0b,i-08f0f13fae0f4d321,i-0932c2e4894cb1c00,i-096c4692eca8c0e11,i-09b9f7baf0293fe78,i-09fffe970898a8a55,i-0a05260a779de1f82,i-0a6710fbab6558620,i-0ad0b56ad8fd67923,i-0ad4113a156b02085,i-0ad603de66e6e558e,i-0b647b5e4e510837d,i-0b7fa8e26e44eae60,i-0bd8242b173f884c5,i-0c10ebb2d23752365,i-0c26a6d604bede30f,i-0c3a562cff94ebade,i-0c4d247eb0d6ccd89,i-0ca65f1c99ede025c,i-0cc37a55886db0436,i-0cdd5266daace5508,i-0eb1c87020b35a206,i-0ebc3de580c053e64,i-0ec0d9243b083e248,i-0eca0a851a54e440a,i-0edfc01e4d04f0c41,i-0f6326a1ad18a3e7c,i-0f8054d52dcd5cddd,i-0fb22d8117079fded,i-0fb58a4d264e1c79d,i-0fb9f11fee1d9e146,i-0fcaf0a30d724e93e,i-0fec13af7379c241f' marked=''
2021-02-02T03:25:35Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=provider.aws at=stackResource stack=arn:aws:cloudformation:ap-southeast-2:143590141352:stack/production-social/104ad660-823f-11e7-a115-50fa575f6862 resource=LogGroup state=success physical=production-social-LogGroup-FK3PQJ92O5TY elapsed=116.669
2021-02-02T03:25:54Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=provider.aws at=stackResource stack=production-paid-campaigns resource=LogGroup state=success physical=production-paid-campaigns-LogGroup-72OXWY41LNCE elapsed=194.795
2021-02-02T03:25:54Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=provider.aws at=stackResource stack=production-social resource=LogGroup state=success physical=production-social-LogGroup-FK3PQJ92O5TY elapsed=123.123
2021-02-02T03:25:55Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=provider.aws at=stackResource stack=production-web-analytics resource=LogGroup state=success physical=production-web-analytics-LogGroup-1NVQUB4QUMV3H elapsed=188.066
2021-02-02T03:25:55Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=provider.aws at=stackResource stack=production-digivizer-app resource=LogGroup state=success physical=production-digivizer-app-LogGroup-1TQ5TBF7NICYI elapsed=153.573
2021-02-02T03:30:31Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=workers.monitor tick
2021-02-02T03:30:32Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 ns=provider.aws at=stackResource stack=production resource=Instances state=success physical=production-Instances-2ZWEATDNSSII elapsed=233.194
2021-02-02T03:30:32Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 who="EC2/ASG" what="Launching a new EC2 instance: i-0040125f2db8ebbe1" why="At 2021-02-02T03:29:41Z a user request update of AutoScalingGroup constraints to min: 15, max: 1000, desired: 15 changing the desired capacity from 14 to 15.  At 2021-02-02T03:29:50Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 14 to 15."
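For anyone trying to spot this early, the EC2/ASG lines above carry the min/max/desired values in their why="..." text, so the growth can be tracked from the rack logs. Here is a rough sketch in Python, assuming the log lines keep the wording shown above. The function name parse_asg_update and the regex are my own and not part of Convox.

```python
import re
from typing import Optional

# Matches the timestamp at the start of a rack log line and the
# "constraints to min: X, max: Y, desired: Z" text inside why="...".
# The pattern is an assumption based only on the log lines in this thread.
ASG_UPDATE = re.compile(
    r"^(?P<logged_at>\S+)\s.*?constraints to "
    r"min: (?P<min>\d+), max: (?P<max>\d+), desired: (?P<desired>\d+)"
)


def parse_asg_update(line: str) -> Optional[dict]:
    """Return the ASG constraints from one log line, or None if absent."""
    match = ASG_UPDATE.match(line)
    if match is None:
        return None
    return {
        "logged_at": match["logged_at"],
        "min": int(match["min"]),
        "max": int(match["max"]),
        "desired": int(match["desired"]),
    }


# One of the log lines from this thread, shortened to its first sentence.
SAMPLE = (
    '2021-02-02T03:25:32Z service/monitor/2bc1bddb-cf07-408d-9384-4285e0900226 '
    'who="EC2/ASG" what="Launching a new EC2 instance: i-0932c2e4894cb1c00" '
    'why="At 2021-02-02T03:23:43Z a user request update of AutoScalingGroup '
    'constraints to min: 85, max: 1000, desired: 85 changing the desired '
    'capacity from 84 to 85."'
)

if __name__ == "__main__":
    print(parse_asg_update(SAMPLE))
```

Running this over saved log output and watching the desired value climb while no new services are being placed would be one way to notice the runaway before it reaches hundreds of instances.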

We’ve run into this issue before, although thankfully we didn’t hit 200 instances!

I’m not sure there’s anything Convox can do about this. I think it would be up to ECS to recognize that no amount of instance scaling will produce an instance that can hold the given process.

But if there is anything Convox can do about it, I’d +1 it.
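To make the "no instance can hold the process" case concrete, here is a minimal sketch in Python of the check I have in mind. It is not how Convox or ECS actually decides anything. The t3.small figures come from earlier in this thread; the 64 MB overhead and the 256 CPU units for the service are hypothetical numbers, used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    cpu: int     # ECS CPU units (1024 per vCPU)
    memory: int  # MB


# Figures reported earlier in this thread for a t3.small instance.
T3_SMALL = Capacity(cpu=2048, memory=1955)


def scaling_can_help(service: Capacity, instance: Capacity,
                     overhead: Capacity = Capacity(0, 0)) -> bool:
    """True if the service would fit on a completely empty instance.

    If this is False, adding more instances of the same type can never
    place the service, so scaling out only creates idle instances.
    `overhead` stands for whatever else is already reserved on every
    instance (a hypothetical value, not something measured here).
    """
    usable_cpu = instance.cpu - overhead.cpu
    usable_memory = instance.memory - overhead.memory
    return service.cpu <= usable_cpu and service.memory <= usable_memory


if __name__ == "__main__":
    overhead = Capacity(cpu=0, memory=64)           # hypothetical
    at_100 = Capacity(cpu=256, memory=1955)         # 100% of instance memory
    at_75 = Capacity(cpu=256, memory=1955 * 3 // 4)  # 75% of instance memory
    print(scaling_can_help(at_100, T3_SMALL, overhead))
    print(scaling_can_help(at_75, T3_SMALL, overhead))
```

Under these assumptions, a service sized at 100% of the instance memory fails the check as soon as anything else is reserved on the instance, while the 75% setting passes. That would be consistent with the 100% to 75% workaround mentioned above, though I can't confirm it is the actual cause.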

I think it’s been resolved now - certainly I haven’t seen it in a while.

I think this commit resolved it: “Tweak logic for extreme scaling edge case” by beastawakens (Pull Request #3438, convox/rack on GitHub).