Engineering values: infrastructure stability

Our company’s two products, Dialpad and UberConference, are quite similar once you think about it—they both bridge voice calls over the internet. A call on Dialpad is just an UberConference with two people!

That’s why both products run on the same backend infrastructure. Our systems engineers have designed our infrastructure to be silent, invisible, and reliable. As a result, it’s easy to overlook all the magic in the backend that connects thousands of calls every day.

Read on for a peek under the hood at the infrastructure behind Dialpad.

Stability & Development

A key value that drives our company’s engineering team is stability.

Stability is the foremost value in every systems engineer’s mind. Yet our engineers are also ambitious developers who want to add new features, expand our global footprint, and optimize code so it’s faster and easier to maintain.

To balance new development with strong stability, our team has two strategies:

  • Robust testing and quality assurance processes
  • Slow, gradual rollouts of new code

For testing, we maintain three separate environments: development, staging, and production. We also have a strong QA team that goes through all common use cases before we release new features. In fact, our engineers can duplicate our entire infrastructure locally in their development environment for end-to-end testing.

When rolling out new code, we adhere to a weekly push schedule that is both agile and efficient. We also push new code to the production environment gradually, so if something goes wrong, the scope of the problem is limited.
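To make the idea of a gradual push concrete, here is a minimal sketch of one common approach: percentage-based staging with deterministic bucketing. It is an illustration only, not a description of Dialpad’s actual deployment tooling. The function name `in_rollout`, the server names, the release label, and the stage percentages are all hypothetical.

```python
import hashlib


def in_rollout(server_id: str, release: str, percent: int) -> bool:
    """Return True if this server should run the release at the given stage.

    Each server is hashed into a stable bucket from 0 to 99, so a server
    selected at an early stage stays selected as the percentage grows.
    """
    digest = hashlib.sha256(f"{release}:{server_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Hypothetical fleet and rollout stages, for illustration only.
servers = [f"server-{i}" for i in range(20)]
for percent in (5, 25, 50, 100):
    selected = [s for s in servers if in_rollout(s, "release-42", percent)]
    print(f"{percent:>3}% stage: {len(selected)} of {len(servers)} servers")
```

Because the bucket depends only on the release and the server name, each stage is a superset of the previous one. If a problem shows up at an early stage, only the servers in that stage are affected.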

Stability & Monitoring

Both strong testing and gradual rollouts are key to catching human errors. Sometimes, though, the issue is beyond our control. For example, an unexpected spike in phone calls, bad weather, or datacenter hiccups can disrupt normal functionality.

That’s why, to ensure stability, we also rely on:

  • Continuous monitoring and alerts when call activity passes certain thresholds
  • Redundancy and retries at all levels for when (not if!) a machine fails

Alerts cover a wide variety of situations, but generally fall into two categories. First, there are heartbeat alerts that detect when a machine is down because of a power or network failure. Second, capacity alerts tell us when we’re approaching a certain threshold. A capacity alert may reflect just a temporary blip, or it may be a sign to add capacity for the long term.
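The two alert categories can be sketched in a few lines. This is a simplified illustration, not our monitoring system: the `MachineStatus` fields, the 30-second heartbeat timeout, and the 80% capacity threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MachineStatus:
    name: str
    last_heartbeat: float  # timestamp of the last heartbeat, in seconds
    active_calls: int
    call_capacity: int


def check_alerts(machines, now, heartbeat_timeout=30.0, capacity_threshold=0.8):
    """Return a list of (alert_type, machine_name) tuples."""
    alerts = []
    for m in machines:
        # Heartbeat alert: the machine has been silent for too long.
        if now - m.last_heartbeat > heartbeat_timeout:
            alerts.append(("heartbeat", m.name))
            continue  # load figures from a silent machine are stale
        # Capacity alert: the machine is approaching its call limit.
        if m.call_capacity > 0 and m.active_calls / m.call_capacity >= capacity_threshold:
            alerts.append(("capacity", m.name))
    return alerts


# Hypothetical snapshot taken at time 1000.0.
snapshot = [
    MachineStatus("media-1", last_heartbeat=995.0, active_calls=40, call_capacity=100),
    MachineStatus("media-2", last_heartbeat=900.0, active_calls=10, call_capacity=100),
    MachineStatus("media-3", last_heartbeat=998.0, active_calls=85, call_capacity=100),
]
print(check_alerts(snapshot, now=1000.0))
```

In this snapshot, `media-2` has missed its heartbeat window and `media-3` is above the capacity threshold, so those two machines raise alerts while `media-1` does not.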

For redundancy, we have a strong N+2 philosophy: every backup has a backup. Retries are also important. Within our infrastructure, our machines make a lot of requests, both internally and externally. If the first request doesn’t get a response, we retry, and the second request will usually get one. Simple recoveries like this aren’t a big deal. But if retries happen a lot more often than normal, it can be a sign that we’re overworking a server or that something funky is going on that requires investigation.
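A minimal sketch of the retry idea, including the bookkeeping needed to notice when retries become unusually frequent, might look like this. The names `RetryStats` and `call_with_retry`, the three-attempt limit, and the use of `TimeoutError` to represent a missing response are assumptions made for illustration.

```python
import time


class RetryStats:
    """Counts requests and retries so an unusual retry rate can be spotted."""

    def __init__(self):
        self.requests = 0
        self.retries = 0

    @property
    def retry_rate(self):
        return self.retries / self.requests if self.requests else 0.0


def call_with_retry(send, stats, max_attempts=3, delay=0.0):
    """Call send(), retrying on TimeoutError up to max_attempts times in total."""
    stats.requests += 1
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0:
            stats.retries += 1
            time.sleep(delay)
        try:
            return send()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


# Hypothetical request that gets no response on the first attempt.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TimeoutError("no response")
    return "ok"

stats = RetryStats()
print(call_with_retry(flaky_request, stats), stats.retry_rate)
```

A single retry like the one above is routine. The value of tracking `retry_rate` is that a sustained rise above its normal level can be fed into the same alerting described earlier.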

Here at Dialpad, stability is one of our core engineering values. We do everything we can to ensure Dialpad and UberConference are strong, robust, and dependable. From the engineering perspective, this means setting a high bar and designing a backend that’s available 24/7. All our engineering procedures, from robust testing and slow rollouts to monitoring, alerts, redundancy, and retries, are designed with stability at the core.
