Those steps are pretty much exactly how we currently do production rollouts at HashiCorp. :)
Here's how we structure things:
resource "aws_launch_configuration" "someapp" {
lifecycle { create_before_destroy = true }
image_id = "${var.ami}"
instance_type = "${var.instance_type}"
key_name = "${var.key_name}"
security_group = ["${var.security_group}"]
}
resource "aws_autoscaling_group" "someapp" { lifecycle { create_before_destroy = true }
desired_capacity = "${var.nodes}" min_size = "${var.nodes}" max_size = "${var.nodes}" min_elb_capacity = "${var.nodes}" availability_zones = ["${split(",", var.azs)}"] vpc_zone_identifier = ["${split(",", var.subnet_ids)}"] }
The important bits are:
* Both LC and ASG have create_before_destroy set
* The LC omits the "name" attribute to allow Terraform to auto-generate a random one, which prevents collisions
* The ASG interpolates the launch configuration name into its name, so LC changes always force replacement of the ASG (and not just an ASG update)
* The ASG sets "min_elb_capacity", which means Terraform will wait for instances in the new ASG to show up as InService in the ELB before considering the ASG successfully created
The behavior when "var.ami" changes is:
(1) New "someapp" LC is created with the fresh AMI
(2) New "someapp" ASG is created with the fresh LC
(3) Terraform waits for the new ASG's instances to spin up and attach to the "someapp" ELB
(4) Once all new instances are InService, Terraform begins destroy of old ASG
(5) Once old ASG is destroyed, Terraform destroys old LC
If Terraform hits its 10m timeout during (3), the new ASG will be marked as "tainted" and the apply will halt, leaving the old ASG in service.
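(Side note: that 10 minute window is tunable if your version of the AWS provider supports wait_for_capacity_timeout on the ASG - the attribute also appears in a config later in this thread. A minimal sketch, with the value itself just an example:)

resource "aws_autoscaling_group" "someapp" {
  # ... as above ...

  min_elb_capacity          = "${var.nodes}"
  wait_for_capacity_timeout = "15m" # assumption: bump the 10m default for slow-booting apps
}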
Hope this helps! Happy to answer any further questions you might have,
Paul
Works like a charm... thanks! It'd be nice to have some of these tricks included in the ASG example in the repo for posterity, along with your very useful summary of why it works. :)

Can you please clarify what the behaviour would be if the autoscaling group is not attached to a load balancer? Would the old ASG and its instances be destroyed before the new ones are fully in service?
Regards,
Bruno

Very good question, Bruno. Your observation is 100% correct.
Because AWS responds successfully for the ASG create well before the instances are actually ready to perform service, it's true that without an ELB the booting instances will end up racing the destroy, and almost certainly lose, resulting in a service outage during replacement.
The way we've worked around this today for our non-ELB services is that time-honored tradition of "adding a sleep". :)
resource "aws_autoscaling_group" "foo" {
lifecycle { create_before_destroy = true }
# ...
# on replacement, gives new service time to spin up before moving on to destroy
provisioner "local-exec" {
command = "sleep 200"
} }
This obviously does not perform well if the replacement service fails to come up. A better solution would be to use a provisioner that actually checks service health. Something like:
resource "aws_autoscaling_group" "foo" {
lifecycle { create_before_destroy = true }
# ...
# on replacement, poll until new ASG's instances shows up healthy provisioner "remote-exec" { connection { # some node in your env with scripts+access to check services host = "${var.health_checking_host}" user = "${var.health_checking_user}" } # script that performs an app-specific check on the new ASG, exits non-0 after timeout inline = "poll-until-in-service --service foo --asg ${self.id}" } }
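If there's no convenient host to run remote checks from, roughly the same idea can be sketched with a local-exec provisioner and the AWS CLI. This is only an illustration - it assumes the CLI and credentials are available wherever Terraform runs and that var.nodes is the expected instance count:

resource "aws_autoscaling_group" "foo" {
  lifecycle { create_before_destroy = true }

  # ...

  # poll the new ASG until every instance reports InService, giving up after ~20 minutes
  provisioner "local-exec" {
    command = <<EOF
for i in $(seq 1 120); do
  count=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names ${self.id} \
    --query 'length(AutoScalingGroups[0].Instances[?LifecycleState==`InService`])' \
    --output text)
  [ "$count" -ge ${var.nodes} ] && exit 0
  sleep 10
done
exit 1
EOF
  }
}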
I'm also kicking ideas around in the back of my head about how TF can better support this "resource verification" / "poll until healthy" use case first class. Any ideas on design for features like that are welcome!
I think you really need some custom code, somewhere, somehow, running in a loop, to determine that a resource is ready for action, or is ready for rollover and destruction. For example, I have previously used a custom script to poll RabbitMQ until its queues are empty before rolling the server which is processing the queue.
Yep totally. So it's just a matter of how we model a Terraform feature such that it allows a flexible hook for said custom code.
I think there are several dimensions to this problem, and different issues that can be addressed independently:
- updating images and instance configuration in an ASG + LC setup
The solution you suggested seems to solve a specific case rather than being a general solution.
I think that for this specific problem it would be easier if Terraform, rather than delete + create new, used the strategy I've suggested for the LC. The strategy consists of copying the launch configuration into a new/temporary LC, updating the ASG to use the cloned LC, then deleting the old LC; at that point you can create a new LC with the new settings and finally update the ASG to use the new LC.
This would allow updating the LC without forcing an IMMEDIATE destruction and recreation of instances, which is very important for stateful systems such as databases, and it would allow implementing a rolling restart with far fewer constraints.
- Determine when an instance is READY and not just booted.
As noted by Matt, this can be application specific. For example, consider a database where, upon startup of a new instance, there is a synchronization phase in which the data from the existing nodes must be replicated to the new node before it can be considered READY. This operation can take from a few minutes to hours depending on the data size. An ASG update or rolling restart feature that doesn't take this into account would have disastrous consequences (i.e. new nodes up without full data, old nodes killed).
- Updating an ASG really means a rolling update
When updating an ASG, what the user really wants is a way to automate the update phase of the cluster (a rolling update).
Creating a new cluster all at once and killing the old one is not a suitable solution in all cases (for example for a DB, as explained above). A rolling update must consider the time it takes for an instance to join the set and replicate data, account for possible failures, and allow rolling back new instances.
Some systems, such as Zookeeper and Kafka, have unique IDs which must be preserved, so before starting a new instance with id=1 the old instance must be stopped first, which again is different from how you would normally approach, let's say, Cassandra or MongoDB.
So, as you can see, the update scenarios can be quite complex; I wouldn't use lifecycle { create_before_destroy = true } as a general solution to build upon.
If you have ideas on how to handle these scenarios currently I would be glad to see some examples.
Thanks,
Bruno

> I think that for this specific problem it would be easier if Terraform, rather than delete + create new, used the strategy I've suggested for the LC. The strategy consists of copying the launch configuration into a new/temporary LC, updating the ASG to use the cloned LC, then deleting the old LC; at that point you can create a new LC with the new settings and finally update the ASG to use the new LC.
You can do this today with Terraform, you'll just need to manage the rolling of your instances via a separate process.
resource "aws_launch_configuration" "foo" {
lifecycle { create_before_destroy = true }
# omit name so it's generated as a unique value
image_id = "${var.ami}" # ...
}
resource "aws_autoscaling_group" "foo" {
name = "myapp" # ^^ do not interpolate AMI or launch config name in the name. # this avoids forced ASG replacement on LC change
}
Given the above config, Terraform's behavior when `var.ami` is changed from `ami-abc123` to `ami-def456` is as follows:
* create LC with `ami-def456`
* update existing ASG with new LC name
* delete LC with `ami-abc123`
At this point, any new instance launched into the ASG will use `ami-def456`. So your deployment process can choose what behavior you want. Options include:
* scale up to 2x capacity then scale back down, which will terminate the oldest instances
* terminate existing instances one by one, allowing them to be replaced with new ones
(Note create_before_destroy on the ASG is optional here - it depends on the behavior you'd like to see if/when you do need to replace the ASG for some reason.)
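For the second of those options, the roll itself is driven outside Terraform. An illustrative shell sketch using the AWS CLI (the ASG name, pause length, and the lack of a real health check are all placeholders):

# terminate each instance in turn; the ASG replaces it with one built from the new LC
for id in $(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names myapp \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text); do
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$id" --no-should-decrement-desired-capacity
  sleep 300 # crude pause; ideally poll instance/app health here instead
done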
> As noted by Matt, this can be application specific; for example, consider a database where, upon startup of a new instance, there is a synchronization phase
Yep totally agreed. Terraform will need to provide hooks for delegating out to site-specific code for determining resource health. There's no one-size-fits-all solution here. Today this can be achieved to a certain extent by calling out via local-exec and remote-exec provisioners, but a more first-class primitive for expressing this behavior would be nice to add. It's just a matter of how we model it.
> When updating an ASG, what the user really wants is a way to automate the update phase of the cluster (a rolling update).
Terraform does not have a mechanism for managing a rolling update today. Unfortunately, though AWS provides rolling update behavior in CloudFormation, it's done via cfn-specific, internally implemented behavior [1], and there are no externally available APIs to trigger it on a vanilla ASG.
As I described in the example above - Terraform can make the resource adjustments necessary to support rolling update scenarios, but the actual roll will need to be managed by an external system.
| ja...@fpcomplete.com | 9/9/15 |

On Tuesday, September 8, 2015 at 11:45:24 AM UTC-4, Paul Hinze wrote:
> > I think you really need some custom code, somewhere, somehow, running in a loop, to determine that a resource is ready for action, or is ready for rollover and destruction.
>
> Yep totally. So it's just a matter of how we model a Terraform feature such that it allows a flexible hook for said custom code.
I have been thinking about similar topics recently. I almost wrote a basic utility that would poll/wait for a specific port to become available / service up - maybe Terraform could create similar functionality (the SSH polling is already doing this) - but with an ASG, it would be like: Terraform has to get a list of the new IPs in the ASG, and then poll SSH or similar ports there until the services are up. As a basic step, we could focus on starting with support for polling SSH and HTTP until the services are available on the specified ports.
Thoughts?
Hi Paul,
any suggestions?
Bruno
Responded on the issue. :)
Hey Paul,
Thanks for all the guidance, I really appreciate it. I'm actually wondering if there's a good way to scale in a new ASG while simultaneously scaling down the previous ASG. For example, in our situation I have an ASG with 60 servers in it that are queue workers. If we just bring in another 60 it'll blow out connection limits in places (we're working on it), so it'd be nicer if it brought instances in more gradually: launch the new ASG, bringing in X at a time while scaling X down in the old one.
Would something like this even be doable? The best I can see is the previous option mentioned: update the launch_config and then terminate instances to bring up new ones.
| ja...@fpcomplete.com | 10/6/15 |

On Monday, October 5, 2015 at 7:41:56 AM UTC-4, James Carr wrote:
> Hey Paul,
>
> Thanks for all the guidance, I really appreciate it. I'm actually wondering if there's a good way to scale in a new ASG while simultaneously scaling down the previous ASG. For example, in our situation I have an ASG with 60 servers in it that are queue workers. If we just bring in another 60 it'll blow out connection limits in places (we're working on it), so it'd be nicer if it brought instances in more gradually: launch the new ASG, bringing in X at a time while scaling X down in the old one.
Raising the limit is not a terrible idea. You could also split up the big ASG of 60 into smaller groups of 30 or whatever, and replace one group at a time. Going one route you have extra capacity for a little while, and going the other you're short, so that might be a way to make the decision.
| ti...@ibexlabs.com | 11/18/15 |

Paul,

Everything seems to be working as expected:

1 - Updating the AMI creates a new launch config
2 - The autoscaling group is updated to the new launch config name

However, new machines are not launching automatically using the new LC/ASG created. What could be the issue here?

> I'm actually wondering if there's a good way to scale in a new ASG while simultaneously scaling down the previous ASG.
This is an interesting and important question. If you're looking for fine grained control over a deploy like that today, I'd recommend a blue/green style deployment with two LC/ASGs that you can scale independently.
Assuming your example of 60 nodes, in a Blue/Green model you'd have steps like:
* Begin state: Blue in service at full 60 nodes, Green cold
* Replace Green with new LC/ASG holding fresh AMI, scale to a small number of instances
* Scale down Blue and up Green in as many batches as you like
* End state: Blue cold, Green at 60 nodes
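In config, that might look something like the sketch below - resource names and variables are illustrative, and the Green pair would mirror the Blue one:

resource "aws_launch_configuration" "blue" {
  lifecycle { create_before_destroy = true }

  image_id      = "${var.blue_ami}"
  instance_type = "${var.instance_type}"
}

resource "aws_autoscaling_group" "blue" {
  lifecycle { create_before_destroy = true }

  name                 = "someapp-blue-${aws_launch_configuration.blue.name}"
  launch_configuration = "${aws_launch_configuration.blue.name}"
  availability_zones   = ["${split(",", var.azs)}"]

  # stepped down in batches as the green ASG is stepped up
  desired_capacity = "${var.blue_nodes}"
  min_size         = "${var.blue_nodes}"
  max_size         = "${var.blue_nodes}"
}

# "green" is an identical LC/ASG pair driven by var.green_ami / var.green_nodes;
# a deploy is then a series of applies that walk green_nodes up and blue_nodes down.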
> However, new machines are not launching automatically using the new LC/ASG created. What could be the issue here?
This is the behavior of AWS when updating the LC. Existing instances are not touched, and new instances use the new LC.
What we do to force new instances to be created is interpolate the LC name into the ASG name - this forces the ASG to be recreated anytime the LC is replaced, which guarantees the nodes are rolled.
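Roughly, that's these two lines on the ASG - the exact name format doesn't matter, as long as the LC name is part of it:

resource "aws_autoscaling_group" "someapp" {
  # a new LC name means a new ASG name, which forces full replacement and rolls the nodes
  name                 = "someapp-${aws_launch_configuration.someapp.name}"
  launch_configuration = "${aws_launch_configuration.someapp.name}"
  # ...
}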
Paul
Paul,
A question about your rolling deployment strategy: how does it work with autoscaling policies? The aws_autoscaling_policy resource docs (https://www.terraform.io/docs/providers/aws/r/autoscaling_policy.html) recommend omitting the desired_capacity attribute from the aws_autoscaling_group. So say my ASG has a min size of 2 and a max size of 10, and based on traffic the autoscaling policy has scaled it up to 8 instances. If I try to roll out a new version, would it end up back at size 2 (the min size) if I don't specify desired_capacity?
Thanks!

Jim,
In my example, Terraform is managing the scaling "manually" from the ASG's perspective, which is why desired_capacity is being used.
If you wanted to combine this sort of a strategy with scaling policies, you'd need to play around with temporarily setting and removing desired_capacity to "pre-warm" your clusters as you switch over, then either removing it or adding ignore_changes to let the scaling policy take over.
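Something along those lines might look like this sketch - the sizes come from your example, and whether you later remove desired_capacity or just ignore it is up to you:

resource "aws_autoscaling_group" "someapp" {
  lifecycle {
    create_before_destroy = true

    # after the pre-warmed cut-over, let the scaling policy own the fleet size
    ignore_changes = ["desired_capacity"]
  }

  min_size         = 2
  max_size         = 10
  desired_capacity = 8 # set to the current size for the switch-over, then ignored by Terraform
  # ...
}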
Paul
Thanks Paul. It sounds a bit complicated to use this strategy with dynamically sized ASGs. And if we have to write custom scripts to make it work anyway, I wonder if it wouldn't be better to:

- Update the AMI version in the launch configuration.
- Deploy the new version with Terraform, which won't have any immediate effect on the ASG.
- Write a script to force the ASG to deploy instances with the new launch configuration. Anyone have experience with aws-ha-release?
- Perhaps run the script using a local provisioner on the launch configuration?
Yep that's definitely a valid strategy - use Terraform to _update_ the ASG and leave the instance rolling to an out-of-band process. Plenty of ways to slice the instance roll then.
-p
| jak...@thoughtworks.com | Jan 8 |

Hi Paul,

Thanks for the answer, it's clear and works well. I tried to separate my lc and asg into different modules and use a variable to pass the lc name (still generated by Terraform) into the asg module, but then it stops working and starts giving a cycle error. With the asg and lc in the same module, everything works fine. Any idea why this happens? Why can't the lc and asg be in different modules?

It sounds like your modules are depending on each other; have you tried to graph them and see if it loops?

--
Lowe Schmidt | +46 723 867 157
| jak...@thoughtworks.com | Jan 9 |

Yes, the asg module depends on the lc module, which is expected. And yes, I tried terraform graph -draw-cycles, but I got red lines on a very complicated graph and still couldn't figure out where the cycle is by reading it. Here is my top-level main.tf:

provider "aws" {
  region = "ap-southeast-2"
}
module "my_elb" { source = "../modules/elb" subnets = ["subnet-481d083f", "subnet-303cd454"] security_groups = ["sg-e8ac308c"] }
module "my_lc" { source = "../modules/lc" subnets = ["subnet-481d083f", "subnet-303cd454"] security_groups = ["sg-e8ac308c"] snapshot_id = "snap-00d5e8ef70d1b3e24" }
module "my_asg" { source = "../modules/asg" subnets = ["subnet-481d083f", "subnet-303cd454"] my_asg_name = "my_asg_${module.my_lc.my_lc_name}" my_lc_id = "${module.my_lc.my_lc_id}" my_elb_name = "${module.my_elb.my_elb_name}" }
And here is the main.tf of the lc module:

data "template_file" "userdata" {
  template = "${file("${path.module}/userdata.sh")}"

  vars {}
}

resource "aws_launch_configuration" "my_lc" {
  lifecycle { create_before_destroy = true }

  image_id                    = "ami-28cff44b"
  instance_type               = "t2.micro"
  security_groups             = ["${var.security_groups}"]
  user_data                   = "${data.template_file.userdata.rendered}"
  associate_public_ip_address = false
  key_name                    = "sydney"

  root_block_device {
    volume_size = 20
  }

  ebs_block_device {
    device_name = "/dev/sdi"
    volume_size = 10
    snapshot_id = "${var.snapshot_id}"
  }
}
And the main.tf of the asg module:

resource "aws_autoscaling_group" "my_asg" {
  name = "${var.my_asg_name}"

  lifecycle { create_before_destroy = true }

  max_size                  = 1
  min_size                  = 1
  vpc_zone_identifier       = ["${var.subnets}"]
  wait_for_elb_capacity     = true
  wait_for_capacity_timeout = "6m"
  min_elb_capacity          = 1
  launch_configuration      = "${var.my_lc_id}"
  load_balancers            = ["${var.my_elb_name}"]

  tag {
    key                 = "Role"
    value               = "API"
    propagate_at_launch = true
  }
}

resource "aws_autoscaling_policy" "scale_up" {
  name = "scale_up"

  lifecycle { create_before_destroy = true }

  scaling_adjustment = 1
  adjustment_type    = "ChangeInCapacity"
  cooldown           = 300
}

resource "aws_cloudwatch_metric_alarm" "scale_up_alarm" {
  alarm_name = "high_cpu"

  lifecycle { create_before_destroy = true }

  comparison_operator       = "GreaterThanThreshold"
  evaluation_periods        = "2"
  metric_name               = "CPUUtilization"
  namespace                 = "AWS/EC2"
  period                    = "120"
  statistic                 = "Average"
  threshold                 = "80"
  insufficient_data_actions = []
  alarm_description         = "EC2 CPU Utilization"
  alarm_actions             = ["${aws_autoscaling_policy.scale_up.arn}"]

  dimensions {}
}

resource "aws_autoscaling_policy" "scale_down" {
  name = "scale_down"

  lifecycle { create_before_destroy = true }

  scaling_adjustment = -1
  adjustment_type    = "ChangeInCapacity"
  cooldown           = 600
}

resource "aws_cloudwatch_metric_alarm" "scale_down_alarm" {
  alarm_name = "low_cpu"

  lifecycle { create_before_destroy = true }

  comparison_operator       = "LessThanThreshold"
  evaluation_periods        = "5"
  metric_name               = "CPUUtilization"
  namespace                 = "AWS/EC2"
  period                    = "120"
  statistic                 = "Average"
  threshold                 = "300"
  insufficient_data_actions = []
  alarm_description         = "EC2 CPU Utilization"
  alarm_actions             = ["${aws_autoscaling_policy.scale_down.arn}"]

  dimensions {}
}
Thanks very much, and sorry for the long email. I have created a repo here: https://github.com/JakimLi/terraform-error/. I have googled this for days, so any suggestions will be greatly appreciated.