You are a forward-thinking organisation with a large quantity of complex data and/or transactional services. You already provide access to these on a website, and probably elsewhere too. To do that, you've built some internal APIs. Now you're thinking of letting other people use those APIs. You want to keep some control over who uses them, and to protect your systems from being overloaded.
We are (for the sake of argument) the kind of organisation you want using that API. We want to focus on building something great using your data. We want to avoid wasting time writing code that isn’t central to this.
Enter the API gateway providers, who offer services such as:

- API key management and permissioning
- Helpful tools like demo consoles, a documentation CMS and analytics
- Promotion of your API to developers
- Rate limiting and caching
We have no problem with the first three services, although we question the value of partnering for them. Don't you already do all these things for your website? Maybe you should handle the API the same way?
We have a big problem with the typical implementation of rate limiting and caching. It imposes fixed, low limits that don't burst to higher levels when we need them to. This makes life hard for us, and it doesn't meet your needs either. We think you should roll your own rate-limiting code, and in this post I will attempt to explain why.
When you create an account for us on an API gateway, you will be asked to select a rate limit, probably in terms of requests per second and requests per day. Naturally, you'll be thinking about the number of consumers like us who might turn up at the same time, so you will impose rate limits that are much, much lower than your system can handle. You're probably aware of parts of your system that are sensitive to load, so we're talking small numbers here, at least by default.
Although the rate limits you set are low, our code will naturally conform to them most of the time. But most of the time isn't good enough. As our systems become more complex, we will occasionally exceed the limit, and you will block us for a while. We'll get a hard-to-handle exception, and have to write a lot of complex code to monitor these errors, back off, and recover.
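To make that cost concrete, here is the kind of retry scaffolding every consumer ends up writing around a hard limit. It's a minimal sketch only: `RateLimited`, `fetch_with_backoff` and the delay numbers are all invented for illustration, not any particular library's API.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for whatever your HTTP client raises on an HTTP 429."""


def fetch_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` with exponential backoff plus jitter whenever the
    server signals rate limiting. `call` is any zero-argument function.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            # Wait base_delay, 2x, 4x, ... plus random jitter, so that
            # many blocked clients don't all retry in lock-step.
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    raise RuntimeError("rate limit never cleared; giving up")
```

And this sketch is the easy part: real consumers also need monitoring, persistence of partial progress, and coordination between processes sharing one key.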
This is particularly troublesome where the stuff we're building has a low latency requirement (i.e., we need to quickly reflect your updated data in our system) and you don't provide a change feed for that data. We soon end up with parallel processes, regularly polling time-sensitive feeds alongside slow-scanning the full dataset for low-priority changes. It is almost impossible to coordinate request rates in that situation without voluntarily imposing a further huge slowdown on our crawler.
It’s also troublesome if we need to bootstrap an element of our system, maybe because of a hardware failure, or because we’re doing a big re-engineering effort. We will have to keep a backup of your data, because we can’t get it fast enough from your systems. Then we’ll have to write re-import scripts. We’d prefer to be making something magnificent on top of your data.
Maybe we should only hit you when a user comes to us? Maybe, but then we end up exposing exceptions to them, and we can’t do any of the interesting offline analysis we excel at.
The bottom line is that we need to burst above the limits occasionally, on both a per-second and a per-day basis, either because of a quirk in the operation of our software or for an operational reason. Maybe API gateways offer this and we've just not seen it used. If so, I strongly suggest you apply those settings liberally.
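If you roll your own, burst support is not much code. Below is a minimal token-bucket sketch, assuming nothing about your stack (the class name, parameters and numbers are our invention): tokens refill at a steady sustained rate, but the bucket holds a larger burst allowance, so a consumer that is usually quiet can briefly exceed the steady rate.

```python
import time


class TokenBucket:
    """Token-bucket limiter: a steady refill rate plus burst headroom."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # sustained tokens added per second
        self.burst = burst          # maximum bucket size (burst headroom)
        self.tokens = burst         # start full, so new clients can burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise refuse the request."""
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For example, `TokenBucket(rate=10, burst=600)` sustains 10 requests per second but tolerates a minute's worth arriving at once. The `cost` argument also lets you charge expensive queries more than cheap ones, so the limit tracks the load you actually care about rather than a raw request count.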
But cloud API gateways will always be configured cautiously, because the limits they set don't control the things you actually care about. Any reasonable per-second or per-day limit will still allow us to put awkward load on your system if our queries are diverse enough to increase traffic to your database and/or cause a lot of cache churn.
Finally, API gateways are expensive. This is unsurprising, since they use real bandwidth and compute resources, with considerable inefficiency (and environmental impact?) compared to what you can do in-house. We suggest you get better bang for your buck by investing that money in performance improvements and/or caching that allow us to burst.
You will already have some basic protection against distributed denial-of-service attacks, to block people who are being really unreasonable. Maybe this should sit in front of the components that are contended, like your database, rather than the components that are cheap to scale up, like caches?
The best part of this in-house approach is that it's an investment in improving your core web infrastructure, and a chance for your engineers to spend time understanding the characteristics of your systems. What a shame to buy a sticking-plaster solution rather than spend that time on the important stuff.
Thoughts welcome. These gateways might be OK for organisations that don't care to build scalable web infrastructure. And maybe there's a good reason to use them that we just don't get?