Exotel is growing faster than it ever has. We have now reached a phase where we handle more than 4 million phone calls per day, and this number only keeps growing! A few months ago we decided to work on our technical debt by overhauling some of our DevOps practices. In this post, we talk about the problems we faced while re-architecting bits of our infrastructure and the approach we took to address them. Our hope is that people can use some of our learnings and re-use some of the code we wrote, to minimize the time and effort required to set up an auto-scaling infrastructure on Amazon AWS.
Here’s how things worked earlier – we’d bake service-based AMIs, then manually update the codebase and make other changes while adding new machines behind the ELBs. This greatly limited the rate at which we could add instances or push/revert code and environment changes. Besides, there was always scope for human error because the whole process was manual. We would usually add instances when we anticipated higher volumes or simply observed a high load average across the instances.
We started off at the sidelines – setting up a build pipeline and storing build artifacts on S3. This eliminated human error while updating or reverting code.
With builds formalized, we needed to formalize deployments. This is tricky because there are several ways of doing it and no one-size-fits-all solution. We experimented with a bunch of approaches: pre-baked or vanilla AMIs? wget+unarchive or rsync? More machines or larger machines? Eventually, what worked best for us was semi-baked AMIs containing the packages that take too long to install; everything else is taken care of by Ansible.
We used Jenkins to set up a build pipeline. The Ansible scripts and the actual code of the services have different pipelines.
Each service has a Jenkins job which builds the project and uploads the build artifacts to S3.
#!/bin/bash
cd $WORKSPACE

# Fetch the current release metadata (if any) and bump the version number.
aws s3 cp s3://build/$ARTIFACT/prod/latest.txt RELEASE
if [ "$?" -eq 0 ]; then
    VERSION=$(( `grep VERSION RELEASE | cut -d'=' -f2` + 1 ))
else
    VERSION=1
fi

# Write the new release metadata, one key per line so it can be grepped later.
echo "MODULE=$ARTIFACT
VERSION=$VERSION
BUILD=$BUILD_NUMBER
GIT_COMMIT=$GIT_COMMIT
TIMESTAMP=$BUILD_TIMESTAMP" > RELEASE

# Package the code along with the release metadata and push it to S3.
rm -rf $ARTIFACT.tar
tar cf $ARTIFACT.tar obelix commonix RELEASE --exclude="*/.git"
aws s3 cp $ARTIFACT.tar s3://build/$ARTIFACT/prod/$ARTIFACT-${VERSION}.tar --storage-class REDUCED_REDUNDANCY --sse AES256
aws s3 cp $WORKSPACE/RELEASE s3://build/$ARTIFACT/prod/latest.txt --storage-class REDUCED_REDUNDANCY --sse AES256
This Jenkins job takes the git branch to be used as a parameter. This allows us to deploy patches by forking a patch branch off the “release” branch, applying our patch and then deploying the patch branch to production.
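For illustration, the patch flow looks roughly like this (the branch name below is made up):

# Fork a patch branch off "release", apply the fix and push it.
git checkout -b release-patch-1 origin/release
# ...commit the patch...
git push origin release-patch-1
# Then run the build job with the branch parameter set to release-patch-1.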
This newly created build is considered to be the “latest” build.
When an ASG cluster scales out, the instance is configured with the “latest-stable” build version of the service.
There is another Jenkins job for each service to promote any of its builds to “latest-stable”.
#!/bin/bash
set -e

if [ "latest" == "$RELEASE_VERSION" ]; then
    # Promote whatever the "latest" build currently points to.
    aws s3 cp s3://build/$ARTIFACT/prod/latest.txt RELEASE
    GIT_COMMIT_ID=`grep GIT_COMMIT RELEASE | cut -d'=' -f2`
    cd $WORKSPACE
    git config user.email "xxx@yyy.com"
    git config user.name "Demo user"
    # Re-point the release tag at the commit that was built.
    git tag -d $GIT_TAG_NAME || echo "Tag doesn't exist. Creating one"
    git push origin :refs/tags/$GIT_TAG_NAME || echo "Tag doesn't exist. Creating one"
    git tag -a $GIT_TAG_NAME -m "$GIT_TAG_NAME" $GIT_COMMIT_ID
    git push --tags
    aws s3 cp s3://build/$ARTIFACT/prod/latest.txt s3://build/$ARTIFACT/prod/latest-stable.txt --storage-class REDUCED_REDUNDANCY --sse AES256
elif [[ $RELEASE_VERSION =~ ^[0-9]+$ ]]; then
    # Promote a specific build version, if it exists in S3.
    if ! aws s3 cp s3://build/$ARTIFACT/prod/$ARTIFACT-$RELEASE_VERSION.tar $ARTIFACT-$RELEASE_VERSION.tar; then
        echo "ERROR: s3://build/$ARTIFACT/prod/$ARTIFACT-$RELEASE_VERSION.tar does not exist."
        exit 1
    else
        rm -rf $ARTIFACT-$RELEASE_VERSION
        mkdir $ARTIFACT-$RELEASE_VERSION
        cd $ARTIFACT-$RELEASE_VERSION
        tar xf ../$ARTIFACT-$RELEASE_VERSION.tar
        GIT_COMMIT_ID=`grep GIT_COMMIT RELEASE | cut -d'=' -f2`
        cd $WORKSPACE
        git config user.email "xxx@yyy.com"
        git config user.name "Demo user"
        git tag -d $GIT_TAG_NAME || echo "Tag doesn't exist. Creating one"
        git push origin :refs/tags/$GIT_TAG_NAME || echo "Tag doesn't exist. Creating one"
        git tag -a $GIT_TAG_NAME -m "$GIT_TAG_NAME" $GIT_COMMIT_ID
        git push --tags
        aws s3 cp $ARTIFACT-$RELEASE_VERSION/RELEASE s3://build/$ARTIFACT/prod/latest-stable.txt
    fi
else
    echo "Invalid release number. Please specify a valid one."
    exit 1
fi
Either the latest build or a specific build version of that service can be made the stable version. When promoting a particular build to “latest-stable”, we also tag the commit ID the build was made from in our git repository so that we can track the version of code running in production.
Deploying code to a specific instance or an entire ASG cluster can be done by choosing the relevant options in a Jenkins job. This job internally uses the EC2 external inventory script (ec2.py) to get the IPs of the instances in the ASG to which the deployment has to be done, and then runs the service’s Ansible playbook against those hosts.
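The exact script is specific to our setup, but a simplified sketch of what the Jenkins job triggers could look like this (the playbook name, parameters and ec2.py group naming are assumptions):

#!/bin/bash
# Simplified sketch of the deploy step; ASG_NAME and VERSION come in as
# Jenkins job parameters, and obelix-appengine.yml is a placeholder playbook.
set -e
ASG_NAME="$1"
VERSION="${2:-latest-stable}"

# ec2.py (Ansible's EC2 external inventory script) groups instances by their
# tags; ASG members show up under the aws:autoscaling:groupName tag group,
# with non-alphanumeric characters sanitised to underscores.
ansible-playbook -i ec2.py obelix-appengine.yml \
    --limit "tag_aws_autoscaling_groupName_${ASG_NAME}" \
    -e "artifact_version=${VERSION}"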
Breaking from our earlier bad practice of configuring the infrastructure through the AWS management console, we decided to use Terraform so that the infrastructure setup is codified and versioned. Following the lessons Netflix shared from their auto-scaling implementation, we scale up early and scale down slowly. At steady state, every web server cluster has exactly one On-Demand m4.large instance and one or more m4.large Spot instances.
We use Spot instances very heavily. Spot instances are around 80% cheaper than On-Demand instances of the same configuration. We set the bid price for Spot instances higher than the price of the On-Demand instance of the same configuration. In effect, the Spot instances are almost never terminated, since it is unlikely for the Spot price to rise above the corresponding On-Demand price. When the Spot price does rise above our bid, AWS triggers a warning two minutes before the Spot instance is terminated (the Spot Instance Termination Notice). We have set up a cron job which polls this endpoint every 5 seconds. If it learns that the instance is scheduled for termination, we increase the desired count of the corresponding On-Demand cluster by one. Thus, even when Spot instances are terminated abruptly, there is no disruption of the service.
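The watcher itself is simple; a minimal sketch, assuming a placeholder name for the On-Demand ASG and region:

#!/bin/bash
# Minimal sketch of the Spot termination watcher run on each Spot instance.
# The ASG name and region below are placeholders.
ONDEMAND_ASG="obelix-ondemand-asg"
REGION="ap-southeast-1"

while true; do
    # The metadata endpoint returns 404 until a termination is scheduled and
    # a timestamp once the two-minute warning has been issued.
    if curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time > /dev/null; then
        CURRENT=$(aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names "$ONDEMAND_ASG" --region "$REGION" \
            --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
        aws autoscaling set-desired-capacity \
            --auto-scaling-group-name "$ONDEMAND_ASG" --region "$REGION" \
            --desired-capacity $((CURRENT + 1))
        break
    fi
    sleep 5
done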
Setting up the scaling policies was trickier than we expected. We tried out various combinations of scaling policies before settling on the one that worked best for us. This is what we finally ended up with –
Most of our web server clusters currently run Apache. We found 50% CPU utilization over 5 minutes to be a safe threshold above which a cluster has to be scaled out. To handle sudden spikes in traffic, another scaling policy scales the cluster out when the CPU utilization of the Spot cluster exceeds 85% over two consecutive 1-minute periods. The scale-out policy for the On-Demand cluster has a higher CPU utilization threshold than that of the Spot cluster, because we assume On-Demand instances will be needed only if Spot instances are not being spawned for some reason.
We noticed that AWS’ default scale-down policies don’t work well. Though there is an option to specify “Seconds to warm up after each step”, a scale-in activity is triggered every minute. This results in the cluster scaling in too aggressively and, at times, not having enough capacity. Instead, a “Simple scaling” policy works as expected for scaling down, since “Seconds before allowing another scaling activity” can be specified. We remove 10% of the instances of the cluster when its CPU utilization is below 30% for 10 consecutive periods of 60 seconds.
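Our policies are defined in Terraform, but expressed with the AWS CLI they look roughly like the sketch below. The ASG name and cooldown value are placeholders; the On-Demand cluster and the 85% spike policy follow the same pattern with different thresholds.

#!/bin/bash
ASG="obelix-spot-asg"

# Scale out: a step-scaling policy that adds one instance, driven by an alarm
# on average CPU > 50% over a 5-minute period.
SCALE_OUT_ARN=$(aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$ASG" --policy-name cpu-high-scale-out \
    --policy-type StepScaling --adjustment-type ChangeInCapacity \
    --estimated-instance-warmup 300 \
    --step-adjustments MetricIntervalLowerBound=0,ScalingAdjustment=1 \
    --query 'PolicyARN' --output text)

aws cloudwatch put-metric-alarm --alarm-name "${ASG}-cpu-high" \
    --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
    --dimensions Name=AutoScalingGroupName,Value="$ASG" \
    --period 300 --evaluation-periods 1 --threshold 50 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$SCALE_OUT_ARN"

# Scale in: a simple-scaling policy that removes 10% of the instances and then
# waits out a cooldown (600s here is illustrative), driven by an alarm on
# CPU < 30% for 10 consecutive 1-minute periods.
SCALE_IN_ARN=$(aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$ASG" --policy-name cpu-low-scale-in \
    --policy-type SimpleScaling --adjustment-type PercentChangeInCapacity \
    --scaling-adjustment -10 --min-adjustment-magnitude 1 --cooldown 600 \
    --query 'PolicyARN' --output text)

aws cloudwatch put-metric-alarm --alarm-name "${ASG}-cpu-low" \
    --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
    --dimensions Name=AutoScalingGroupName,Value="$ASG" \
    --period 60 --evaluation-periods 10 --threshold 30 \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$SCALE_IN_ARN"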
The ASG is associated with a target group. The target group is in turn associated with an Application Load Balancer (ALB). The instances are in a private subnet and they only accept requests on port 80 from the corresponding ALB. The ALB is associated with a public subnet and accepts requests on both ports 80 and 443 from everywhere. SSL termination is done only at the ALB.
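Expressed with the AWS CLI (the real setup is in Terraform), the wiring looks roughly like this; all names, IDs and the certificate ARN below are placeholders:

#!/bin/bash
# Target group for the web servers; instances are registered on port 80.
TG_ARN=$(aws elbv2 create-target-group \
    --name obelix-web-tg --protocol HTTP --port 80 \
    --vpc-id vpc-0123456789abcdef0 \
    --health-check-path /healthcheck.html \
    --query 'TargetGroups[0].TargetGroupArn' --output text)

# Internet-facing ALB in the public subnets.
ALB_ARN=$(aws elbv2 create-load-balancer \
    --name obelix-web-alb --scheme internet-facing \
    --subnets subnet-aaaa1111 subnet-bbbb2222 \
    --security-groups sg-0123456789abcdef0 \
    --query 'LoadBalancers[0].LoadBalancerArn' --output text)

# Listeners on ports 80 and 443; SSL terminates at the ALB.
aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
    --protocol HTTP --port 80 \
    --default-actions Type=forward,TargetGroupArn="$TG_ARN"
aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
    --protocol HTTPS --port 443 \
    --certificates CertificateArn=arn:aws:acm:region:account:certificate/example \
    --default-actions Type=forward,TargetGroupArn="$TG_ARN"

# Attach the ASG so scaled-out instances register automatically.
aws autoscaling attach-load-balancer-target-groups \
    --auto-scaling-group-name obelix-spot-asg \
    --target-group-arns "$TG_ARN"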
When an instance is spawned from an auto-scaling group, the user data is set to download a setup script from S3.
#!/bin/bash
SERVICE=obelix-appengine
S3BUCKET="s3://build/deploy-scripts/prod"

# Default both the Ansible scripts version and the service build version
# to "latest-stable".
if [ -z "$ANS_VERSION" ]; then
    ANS_VERSION=latest-stable
fi
if [ -z "$VERSION" ]; then
    VERSION=latest-stable
fi

# Download the service's setup script from S3 and run it.
aws s3 cp ${S3BUCKET}/${SERVICE}.sh obelix-appengine.sh
chmod +x obelix-appengine.sh
./obelix-appengine.sh $ANS_VERSION $VERSION
This script, in turn, downloads the “latest-stable” version of the Ansible scripts and the service code or binary from S3, and then executes the service’s Ansible playbook to configure the newly launched instance. The last step of the playbook, on successful execution, is to copy a health check file into the right location. Thus the instance is brought into service only after it has been fully set up and the target group’s health check passes.
We still had two things to figure out – logging and monitoring. When instances are spawned and terminated dynamically depending on traffic, we needed to figure out a way to ship logs out of the machines and dump them in a centralized location. We tried out Filebeat and Heka for shipping but eventually settled on good old rsyslog, which ships logs to a Kafka cluster. The log messages are later consumed from Kafka and indexed in ElasticSearch.
As far as monitoring is concerned, the major problem was the maintenance of dynamic inventory in Nagios. Since we were okay with a short delay in adding new hosts to monitoring, we decided to poll for ASG changes every few minutes and update the hosts in the monitoring config.
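A minimal sketch of such a poller, assuming placeholder names for the ASG, the config path and the Nagios host template:

#!/bin/bash
# Regenerate Nagios host definitions for an ASG; run from cron every few
# minutes. The ASG name, config path, host template and the Nagios service
# name are placeholders.
ASG="obelix-spot-asg"
CFG="/etc/nagios/conf.d/${ASG}-hosts.cfg"

# Private IPs of the running instances currently in the ASG.
IPS=$(aws ec2 describe-instances \
    --filters "Name=tag:aws:autoscaling:groupName,Values=${ASG}" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PrivateIpAddress' --output text)

TMP=$(mktemp)
for ip in $IPS; do
    cat >> "$TMP" <<EOF
define host {
    use        linux-server
    host_name  ${ASG}-${ip}
    address    ${ip}
}
EOF
done

# Reload Nagios only if the inventory actually changed.
if ! cmp -s "$TMP" "$CFG"; then
    mv "$TMP" "$CFG"
    service nagios reload
else
    rm -f "$TMP"
fi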
We created a dashboard on Grafana for each web server cluster, with metrics from AWS CloudWatch, to visualize the status of the cluster.
While the setup described above has worked pretty smoothly for a few months now, one day the default YUM repository mirrors used by the AWS instances went down. This caused deployment failures, and the cluster couldn’t be scaled up as expected. To prevent such an incident in the future, we now host the YUM repositories ourselves in an S3 bucket and use them instead.
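On the instances, switching over is just a matter of dropping in a repo file that points at the mirror; a sketch, with a made-up bucket name and repo layout:

# Point yum at our own mirror hosted on S3 (bucket name and paths are
# placeholders).
cat > /etc/yum.repos.d/exotel-mirror.repo <<'EOF'
[exotel-mirror]
name=Exotel mirrored packages
baseurl=https://s3.amazonaws.com/exotel-yum-mirror/centos/7/os/x86_64/
gpgcheck=1
gpgkey=https://s3.amazonaws.com/exotel-yum-mirror/RPM-GPG-KEY
enabled=1
EOF
yum clean all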
That pretty much sums up our little adventure with auto scaling at Exotel. Our key takeaways:
- Codify and version your infrastructure instead of hand-configuring it from the console.
- Scale up early, scale down slowly.
- Spot instances cut costs dramatically, but plan for their termination with an On-Demand fallback.
- AWS’ default scale-in behaviour can be too aggressive; simple scaling with a cooldown worked better for us.
- Ship logs off the instances and keep the monitoring inventory in sync with the ASGs.
- Don’t depend on external package mirrors for scaling; host your own.
Don’t like this approach or have ideas to make this better? Come join us. We’re hiring!