A decade ago, the idea of cloud computing was taking shape. The bold few moved their hardware to the cloud, while the rest were skeptical about renting hardware somewhere in the cloud. “How reliable would it be?” was the question that dampened enthusiasm for a while. Today, the move sounds like a no-brainer.
In 2014, Amazon EC2 instances had a downtime of just 2.41 hours across 20 outages. That is an uptime of 99.9974%. For the scale we’re talking about, this is a phenomenal number. Similarly, Google Cloud Platform’s storage service had a mind-boggling uptime of 99.9996%. Reliability is now a given, which has sped up the adoption of cloud computing. The debate now centres on security, but that too should die down as the space matures.
Cloud telephony is at a similar juncture today. Six years ago, cloud telephony did not exist. When we started Exotel in 2011, the people we spoke to first had to be educated about cloud telephony before there was any possibility of convincing them to adopt it. Today, when we speak to potential customers, they know what they want. They’re aware of the technology and the pros and cons of moving their phone system to the cloud.
As the adoption of cloud telephony continues to increase, customers expect a reliable solution – 24×7 access and the best possible uptimes. Customer phone calls are the backbone of most businesses in India. Imagine not being able to reach your favourite e-commerce brands, your internet service provider or even your food delivery services on the phone for even a few hours. All hell is sure to break loose. We have now become a core infrastructure provider to businesses.
Reliability is the one thing every business cares about.
Building a reliable business on unreliable blocks
Building a highly reliable cloud telephony platform in India comes with a whole lot of challenges. The toughest problem is building a reliable platform on top of highly unreliable legacy phone systems: a completely unrelated event, like heavy rain or someone digging up a road, could snap telephone lines. The first step towards building for high reliability was a decision to monitor and measure our uptimes, along with the false bravado to make our uptime numbers public: http://status.exotel.in/.
When we quickly put processes in place a couple of years back to monitor our systems and started measuring our uptimes, it dawned on us how much there was left to do to deliver the superior reliability we wanted to give our customers. And when we looked for benchmarks to measure ourselves against, we could not find anything satisfactory. One, because there isn’t an industry gold standard yet, and two, because uptime was not being measured in a fair way. Even cloud telephony pioneers like Twilio were measuring only “software uptime” and excluding telephony downtimes.
But from the customer’s perspective, telling them that our software is up and running and only the telephone lines are down does not spare them any hassle. That is why we decided to include operator uptimes as part of our uptime measurement.
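To make the difference between “software uptime” and customer-facing uptime concrete, here is a minimal Python sketch. It is not how our monitoring is actually implemented; the outage windows, the 30-day period and the function names are all made up for illustration. The idea is simply that a customer is affected whenever either the software or the operator line is down, so the two sets of outages are merged before the uptime is computed.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) outage windows, given in minutes."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of double counting.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uptime_percent(outages, period_minutes):
    """Uptime over the period, counting overlapping outages only once."""
    down = sum(end - start for start, end in merge_intervals(outages))
    return 100.0 * (1 - down / period_minutes)


# Hypothetical outage windows, in minutes since the start of a 30-day month.
PERIOD_MINUTES = 30 * 24 * 60
software_outages = [(1000, 1030)]
operator_outages = [(1020, 1080), (20000, 20045)]

software_only = uptime_percent(software_outages, PERIOD_MINUTES)
end_to_end = uptime_percent(software_outages + operator_outages, PERIOD_MINUTES)

print(f"software-only uptime: {software_only:.3f}%")
print(f"end-to-end uptime:    {end_to_end:.3f}%")
```

With these made-up numbers, the software-only figure looks better than the end-to-end figure, even though the customer experienced every one of those outages.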
It might even look absurd at first. What can a cloud telephony company do about telephony downtimes? That question made us look at a whole lot of factors differently: the operators we work with, the technologies those operators can provide us with, the datacenters where we host our servers, our methods of deployment and so on. This has helped us improve our uptimes from 99.7% to as much as 99.97%. And there is still a long way to go.
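To put the two percentages in perspective, the short Python sketch below converts an uptime percentage into hours of downtime per year. It assumes a 365-day year and continuous 24×7 operation; the function name is ours, chosen for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year of 24x7 operation


def downtime_hours_per_year(uptime_percent):
    """Hours of downtime per year implied by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for pct in (99.7, 99.97):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under those assumptions, 99.7% works out to roughly 26 hours of downtime a year, while 99.97% is under 3 hours. That extra “9” is a tenfold reduction.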
Design for Failure
Most of what is relevant to reliability in the cloud computing space holds in the cloud telephony space too. Building systems that will never fail is utopian. It never happens. Systems fail. That is why we have constantly strived to ensure that there is no single point of failure and that, when things fail, we detect it early. A reliable system must degrade gracefully even when there is a problem: a downtime must never affect all customers across the board, and even if a few systems are facing an issue, the most critical function must keep working. In our case, that is connecting calls.
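The two ideas above, no single point of failure and graceful degradation, can be sketched in a few lines of Python. This is only an illustration under made-up assumptions, not a description of our actual call-routing code: the operator names, the `dial` callables and the analytics hook are all hypothetical. The call is tried on each operator route in turn, and a failure in a non-critical step (here, recording analytics) is never allowed to stop the call from connecting.

```python
class OperatorDown(Exception):
    """Raised by a (hypothetical) dial function when its route is unavailable."""


def connect_call(number, operators):
    """Try each (name, dial) route in order; return the name that connected."""
    failures = []
    for name, dial in operators:
        try:
            dial(number)
            return name
        except OperatorDown as exc:
            # Remember the failure and fall through to the next route.
            failures.append((name, str(exc)))
    raise OperatorDown(f"all routes failed: {failures}")


def handle_call(number, operators, record_analytics):
    """Connect the call first; treat everything else as best effort."""
    route = connect_call(number, operators)  # critical path
    try:
        record_analytics(number, route)  # non-critical
    except Exception:
        # Degrade gracefully: losing analytics must not drop the call.
        pass
    return route


# Hypothetical stand-ins for real operator integrations.
def dial_via_operator_a(number):
    raise OperatorDown("line cut")


def dial_via_operator_b(number):
    return None  # pretend the call connected


def broken_analytics(number, route):
    raise RuntimeError("analytics store unavailable")


routes = [("operator_a", dial_via_operator_a), ("operator_b", dial_via_operator_b)]
used_route = handle_call("0000000000", routes, broken_analytics)
print(f"call connected via {used_route}")
```

In this sketch the first operator is down and the analytics step is broken, yet the call still goes through on the second route. A real system would also need to detect the failing operator early and alert on it, which is left out here.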
With a technology as nascent as cloud telephony, reliability is a major factor that will drive more and more businesses to take the plunge and make the shift.