Best Practices for Building a Secure and Scalable API
APIs are the core mechanism for decoupling front ends from back ends and for decomposing monolithic infrastructures into composable enterprises in the spirit of what's known as digital transformation. They're currently among the most significant enablers of innovation, mobility, and the Internet of Things (IoT). APIs enable teams to focus on their core value proposition while allowing customers to achieve bigger goals by connecting to data and functionality with tools they prefer to use. To deliver on these benefits, teams must design APIs with scale in mind. Yet the pressure to ship high-performing APIs that keep pace with the business ecosystem is pushing many development teams to build APIs that may end up restricting business growth.
APIs built without scale in mind suffer from poor usability, limited availability, security weaknesses, and more.
Here at Raygun, we recognize that a high-quality API is pivotal to our business growth, and cite scalability as a critical success factor. When we need to grow to meet customer demands, we must handle billions of data points comfortably and with minimum disruption. In our particular case, we're ingestion heavy, meaning we receive billions of requests a day and any delay to our API's response would result in a potentially bad experience for our customers.
The issues we discuss here apply to both reading and writing data through APIs. We'll discuss how Raygun's development team manages the infrastructure and maintenance of our APIs to enable growth.
Why is a scalable API so important to software businesses?
Raygun drives better API-driven customer engagement through our use of SDKs. One of the goals of our SDK architecture is to be lightweight and have little to no impact on the customer's application performance. While our API itself is a relatively straightforward endpoint, we found that providing SDKs makes API access easier and less error prone, and lets us include niceties (e.g., if connectivity is interrupted, the SDK sends the data later when it's restored).
As we accept large volumes of data from customers, managing our API effectively is critical to our business. If we can't receive data at volume, then our product would be useless. On average, we get thousands of requests per second to our API, with spikes into the hundreds of thousands per second, so we need to be able to handle a wide-ranging load.
Our product development is not all about the data handling, however. A great UI and nice features are what customers want on the front end, which isn't possible without a robust API.
As Uri Sarid, CTO of Mulesoft, articulates so well, "Much like a great UI is designed for optimal user experience, a great API is designed for optimal consumer experience."
For our survival as a company, offering large customers superior data management and a great front-end experience is mission critical, and we must be able to scale to meet larger customers' needs.
So how do we do it? At Raygun, we look at two main areas when building a scalable API: infrastructure and maintenance.
Infrastructure of a scalable API
When you create an API, you expose the business logic of your system to the outside world, and that logic needs to be protected.
To build a scalable API, mitigating vulnerabilities should be the first port of call. As ProgrammableWeb's editor-in-chief David Berlind explains, attacks are becoming sophisticated, multi-dimensional, and hyper-targeted.
The sensitive data Raygun collects must be protected with a multi-layered system that extends beyond the infrastructure of the API. As cyberattacks evolve, the systems we put in place need to be dynamic enough to change and evolve with them. (You can read more about security in APIs here.)
Figure 1. Raygun's API security layers
To mitigate risks, Raygun uses several layers of security for our APIs. All calls are done with a customer's API key and authentication credentials.
A simple first layer is to offer a "regenerate authentication credentials" option. If you choose to regenerate your credentials, the original credentials are no longer valid.
This is essential to protecting your system because it lets you lock out anyone with malicious intent who has obtained your credentials. For example, if a developer accidentally checks your credentials into a public repository, regenerating them keeps you safe because the leaked key is no longer valid.
After authenticating your credentials, we'll then generate a time-based token for subsequent API calls, expiring after 15 minutes.
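To make the two ideas above concrete (regenerable credentials and a short-lived token), here is a minimal sketch in Python. It is not Raygun's implementation: the token layout, the HMAC-SHA256 signing scheme, and the names `issue_token` and `verify_token` are assumptions chosen for illustration. Only the 15-minute lifetime comes from the text.

```python
import hashlib
import hmac
import time
from typing import Optional

TOKEN_TTL_SECONDS = 15 * 60  # the 15-minute lifetime mentioned in the article


def _sign(payload: str, secret: bytes) -> str:
    # Hypothetical signing scheme: HMAC-SHA256 over "<api_key>:<expiry>".
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(api_key: str, secret: bytes, now: Optional[float] = None) -> str:
    """Return '<api_key>:<expiry>:<signature>', valid for TOKEN_TTL_SECONDS."""
    issued_at = time.time() if now is None else now
    expires_at = int(issued_at) + TOKEN_TTL_SECONDS
    payload = f"{api_key}:{expires_at}"
    return f"{payload}:{_sign(payload, secret)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> bool:
    """Accept a token only if its signature matches and it has not expired."""
    current = time.time() if now is None else now
    try:
        api_key, expires_at, signature = token.rsplit(":", 2)
    except ValueError:
        return False  # not in the expected three-part layout
    expected = _sign(f"{api_key}:{expires_at}", secret)
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    return expires_at.isdigit() and current < int(expires_at)
```

In this sketch, regenerating credentials corresponds to replacing `secret`: any token signed with the old secret stops verifying immediately, and every token stops verifying once its 15 minutes are up.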
Raygun also employs an independent third party to run penetration tests (sometimes called pen tests) against the service every quarter, alongside automated security tests that run continually. As attackers become more sophisticated, you must continually invest in security.
Lastly, we run security training for our software team and review every pull request before it is merged, with an eye toward security concerns.
Hosting on multiple servers
When scaling an API across multiple servers, an important requirement is that every server runs the same code.
Depending on how you've scaled your systems so far, remember that an API call won't necessarily be handled by the first machine available. You don't know which server will get a given request, because requests are distributed across different servers.
Raygun uses autoscaling groups to handle volume. An autoscaling group contains a collection of EC2 instances that share similar characteristics and are treated as a logical grouping for the purposes of instance scaling and management. We also rely on a reasonably sized "warm pool" of servers (those ready to receive requests) that are available for sudden traffic spikes, which enables us to continue providing a great customer experience — even at busy times.
Use the right load balancer to autoscale
Using the correct load balancer for your system is very important for autoscaling your API.
The right load balancer will increase your application's capacity and reliability by sharing the workload evenly across the pool of servers in the load balancer. An ineffective load balancer will do the opposite, and you may find yourself unaware if a server falls over or there's an unknown, critical, and recurring error.
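As a rough illustration of what "sharing the workload evenly" and noticing a fallen server can mean, the following Python sketch implements a round-robin balancer that skips servers marked unhealthy. The class and method names are hypothetical, and a managed load balancer does far more than this (health probes, connection draining, TLS), so treat it as a toy model of the idea only.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked as unhealthy."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        self._healthy: Set[str] = set(self._servers)
        self._next_index = 0

    def mark_down(self, server: str) -> None:
        # In a real system a failed health check would trigger this.
        self._healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self._servers:
            self._healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, rotating through the pool."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next_index]
            self._next_index = (self._next_index + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

With three servers, successive calls to `pick()` rotate through all three; after `mark_down("b")`, traffic is shared between the remaining two until `mark_up("b")` is called.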
At Raygun, we use AWS load balancing, which is an effective way of building our load balancing service into our infrastructure so we can launch servers on demand.
Figure 2. Raygun uses AWS' load balancing. (Image source: AWS)
A high-traffic application like Reddit (which scaled its infrastructure to 1 billion page views per month) uses a mix of tools like HAProxy and Nginx to direct traffic. In the HighScalability.com article, Reddit: Lessons Learned from Mistakes Made Scaling to 1 Billion Pageviews a Month, Jeremy Edberg explains how they use HAProxy for load balancing and Nginx to terminate SSL and serve static content, enabling Reddit to manage billions of data points effectively.
The infrastructure of your API will be dependent on many factors, but we've found the above method is very effective.
Maintaining your API to handle scale
A sound infrastructure is, of course, only part of the story. We need to maintain our API to ensure it's at its most effective for our customers, and that it doesn't waste precious development time.
We need to maintain our API so we can add customers as needed, regardless of the amount of data they send us. To handle those data volumes, and to cope with even more over time, we need to have a lot of trust in our servers, which is why we use horizontal scale to our advantage.
There are two ways to scale your systems to enable a robust API: vertically and horizontally. Early in a business, scaling vertically makes more sense because it's more cost efficient (servers are expensive). However, you'll eventually have to scale horizontally to manage your data volumes. Here's how to determine which approach is best for your business scaling needs.
A simple way to scale your software is to pile more hardware onto a machine, which is vertical scaling. Faster processors and more memory will certainly increase capacity, but you'll need ever more space and hardware, which gets expensive quickly.
You can also scale by building code that is optimized for performance; however, premature performance optimization may result in overly complex software that's difficult to maintain. This is a common pitfall because more performant code is less human readable in most cases, and therefore can be more complex. Eventually, your software will evolve to have more features to satisfy the demands of your customers or to keep ahead of the competition. But remember, the more features you add, the harder it is to scale due to the data storage needs that go with that expansion. This is the point at which you will need to use horizontal scale.
Horizontally scaling your API means adding more servers instead of making a single server more powerful.
Horizontal scaling is used by companies like Facebook and Google, and it's also the model Raygun uses. We strongly recommend you do the same.
The main reason for using horizontal scaling is that it enables our systems to adjust to the load dynamically by automatically provisioning (or deprovisioning) more systems (nodes), rather than making one system larger. As a result, if our system loses a single node, the entire system will not collapse. More importantly, building our systems horizontally allows Raygun to scale at the right time.
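The provisioning decision described here can be sketched as a small function that turns observed load into a target node count. Everything in it is an assumption for illustration: the function name, the per-node capacity, the headroom factor, and the minimum and maximum bounds are hypothetical values, not figures from Raygun's setup.

```python
import math


def desired_node_count(
    requests_per_second: float,
    capacity_per_node: float,
    min_nodes: int = 2,      # hypothetical floor so one lost node is survivable
    max_nodes: int = 50,     # hypothetical ceiling to cap cost
    headroom: float = 0.25,  # hypothetical 25% spare capacity for spikes
) -> int:
    """Return how many nodes are needed to serve the given load."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    needed = math.ceil(requests_per_second * (1 + headroom) / capacity_per_node)
    # Clamp to the configured bounds: scale out under load, scale in when idle.
    return max(min_nodes, min(max_nodes, needed))
```

For example, with nodes that each handle 500 requests per second, `desired_node_count(1000, 500)` asks for 3 nodes, while an idle system falls back to the floor of 2 rather than shrinking to a single point of failure.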
Using horizontal scaling, the Raygun team discovered we could conserve server resources and add customers as needed, because we only add capacity to our environment when it's required. Make sure you're aware of the fine line between better customer experiences and biting off more than you can chew. (As you add new customers, your load increases.)
After horizontally scaling, if you find there's still a bottleneck, you can cache data to improve performance.
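One common form of caching is a small time-to-live (TTL) cache in front of an expensive lookup. The sketch below is a generic illustration rather than a description of how Raygun caches: the `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names introduced here.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache loader results per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, calling loader(key) only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is staleness: a longer TTL removes more load from the bottleneck but serves older data, so the right value depends on how fresh each response needs to be.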
This is where we use our own tool, Raygun Crash Reporting, as a guiding light to understand the capacity of our API. We use Real User Monitoring (defined on this Raygun page) paired with Crash Reporting so we can truly understand our software performance.
Queue up everything for better API performance
To get the best performance from your API, do minimal work at the point of receipt: queue incoming work (queuing is a way of exchanging work between systems with an added buffer for spikes in activity) and involve as few processes as possible.
At Raygun, for example, as data comes in, basic validation occurs and it's then passed off to a queue to transition work from one system to another in a scalable and redundant way. We then have backend workers (other processes) pick the next work item off the queue.
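The validate-then-enqueue pattern described above can be sketched with Python's standard library. This is an in-process stand-in only: a production system would put a message broker between the API and the workers, and the payload shape, the `api_key` check, and the function names here are assumptions for illustration.

```python
import queue
import threading
from typing import List

work_queue: "queue.Queue" = queue.Queue(maxsize=1000)  # buffer absorbs spikes
processed: List[str] = []
processed_lock = threading.Lock()


def ingest(payload: object) -> bool:
    """API side: do only basic validation, then hand the work to the queue."""
    if not isinstance(payload, dict) or "api_key" not in payload:
        return False  # reject early, before any expensive work
    work_queue.put(payload)
    return True


def worker() -> None:
    """Backend side: pick the next work item off the queue and process it."""
    while True:
        item = work_queue.get()
        try:
            if item is None:  # sentinel value tells the worker to stop
                return
            with processed_lock:
                processed.append(item["api_key"])  # placeholder for real work
        finally:
            work_queue.task_done()


def run_workers(count: int) -> List[threading.Thread]:
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads
```

Because `ingest` returns as soon as the item is queued, the response time of the API stays independent of how long the backend processing takes, and adding capacity is a matter of starting more workers.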
As we add more workers, we're careful not to reduce performance. How effective are our workers? We measure them, so we know well ahead of time if we need any more. Our consistent model is that all worker tasks have analytics endpoints, which are called and report into a StatsD endpoint (DataDog in our case). This allows operations to monitor the health of individual workers and to build dashboards that show overall system health.
Through consistent monitoring, we understand the capability of our API. Your API's traffic will likely resemble ours: it comes in bursts and follows a business-hours pattern.
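For readers unfamiliar with StatsD, its wire format is a short plain-text line of the form `name:value|type`, typically sent over UDP (port 8125 by default). The sketch below formats and sends such lines using only the standard library. The metric names in the usage note are hypothetical, and a real deployment would normally use a maintained client library rather than hand-rolled sockets.

```python
import socket

# StatsD metric types used here: "c" = counter, "g" = gauge, "ms" = timer.
VALID_TYPES = {"c", "g", "ms"}


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line, e.g. 'worker.processed:1|c'."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send, so reporting never blocks the worker."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A worker might call `send_metric(format_metric("worker.processed", 1, "c"))` after each item and `send_metric(format_metric("worker.duration", 42, "ms"))` with the measured processing time, which gives dashboards both throughput and latency per worker.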
Raygun uses RabbitMQ for queuing and DataDog to monitor the capacity of our workers.
Part of effectively scaling horizontally is the ability to replicate and deploy quickly. To do this, one of the first steps we took as a development team was to remove error-prone manual deployments.
To scale efficiently, you need to find templates that work for your development team. Templates are a set of instructions for the autoscaling mechanism to let it know what to do when it starts up.
With autoscaling, the best approach is to update your template whenever you make a change to your API, so that the template always contains the current API code. Then test the template by using it to cycle nodes: test on one node first, then redeploy to the rest.
Set up alerts
To build a scalable API, you need to know immediately when something goes wrong. Collect and display key metrics from your DevOps tools — the more publicly, the better. "Information Radiators," such as TVs with stats around the office, are a great way to keep system health at the forefront of your mind.
We find that more people pick up on problems this way, especially our engineers who are in the codebase all day long. Spikes are detected very quickly if engineers can access baseline figures — plus they can see the results of any improvements they've made to a piece of code.
Here at Raygun, we recognize that taking a proactive approach to understanding the traffic coming to our API is key for horizontal scaling. We use Crash Reporting to identify and raise problems in our code into a dashboard that is accessible by everyone on the development team. We also use Crash Reporting to monitor for errors in our API specifically and collect data on timings. We put a lot of effort into monitoring custom metrics, such as failure per application, so if people send us bad API keys, we can understand where and why traffic is being rejected.
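A custom metric such as "failures per application" can be as simple as a counter keyed by application and rejection reason. The sketch below shows one way to structure it. The `RejectionTracker` class, the application identifiers, and the reason strings are hypothetical, and in practice the counts would be shipped to a monitoring system rather than held in memory.

```python
from collections import Counter
from typing import Dict, List, Tuple


class RejectionTracker:
    """Count rejected requests per (application, reason) pair."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, application: str, reason: str) -> None:
        self._counts[(application, reason)] += 1

    def failures_for(self, application: str) -> Dict[str, int]:
        """Break down one application's rejections by reason."""
        return {
            reason: count
            for (app, reason), count in self._counts.items()
            if app == application
        }

    def worst_offenders(self, limit: int = 3) -> List[Tuple[str, int]]:
        """Applications with the most rejected requests, highest first."""
        totals: Counter = Counter()
        for (app, _reason), count in self._counts.items():
            totals[app] += count
        return totals.most_common(limit)
```

Recording the reason alongside the application is what answers the "where and why" question: a burst of `invalid_api_key` rejections from one application points to a misconfigured client rather than a problem in the API itself.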
We've found that thorough software testing, using both scale testing and production testing, is critical to maintaining quality code and ensuring our API is robust and able to scale when necessary. Here's a brief breakdown of how we use scale testing and production testing to ensure a robust API and create a better experience for our customers.
Scale testing is a strategy we employ at Raygun, so we know exactly where bottlenecks are. This strategy means we can cater to larger customers with no nasty surprises. To test what our API can handle, we run regular load tests, looking for our upper limit.
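The search for an upper limit can be pictured as a stepped ramp: increase the request rate until the error rate crosses a threshold, then report the last rate that passed. The sketch below captures only that control loop. The function name, step sizes, and 1% error threshold are assumptions, and `measure_error_rate` is a hypothetical callback that a real load-testing tool would implement by generating traffic at the given rate.

```python
from typing import Callable


def find_upper_limit(
    measure_error_rate: Callable[[int], float],
    start_rps: int = 100,
    step_rps: int = 100,
    max_rps: int = 10_000,
    max_error_rate: float = 0.01,  # hypothetical threshold: 1% failed requests
) -> int:
    """Ramp load in steps and return the highest rate that stayed healthy."""
    last_good = 0
    rps = start_rps
    while rps <= max_rps:
        if measure_error_rate(rps) > max_error_rate:
            break  # first failing step: the limit lies just below this rate
        last_good = rps
        rps += step_rps
    return last_good
```

Running this regularly and tracking the result over time shows whether a code change moved the limit, which is what makes the upper limit a useful number rather than a one-off measurement.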
While developing an API for any business, you'll be testing and looking for this upper limit so you can constantly push your boundaries and grow — that's the process of scaling, which should grow into autoscaling.
Anything in your system can fail at any time — just make sure you find it before your users do. At Raygun, we also test thoroughly in production. We operate under the assumption there will always be problems and software bugs, but we have visibility on problems in production with Crash Reporting tools. This acceptance of software problems leaves teams much better prepared for reacting and resolving errors faster.
A prevailing attitude at Raygun that has helped us to build and scale is our strategy to make incremental changes to the code so we can react quickly and roll back if necessary.
We have a "fail early and fail fast" approach, which allows us to move and scale up with our business goals. Remember, allow for failure and never get caught off-guard.
The key to scaling your software is locating bottlenecks before your users do, and often your API is what provides the biggest restrictions. Scale your API, and scale your business.