KMS Request Quotas and Throttling: How to Handle Them#
When you start building applications that rely on AWS Key Management Service (KMS) for encryption and key management, you often discover that KMS operates under strict API request quotas. These quotas exist for good reasons—they protect the service from resource exhaustion and ensure fair usage across all customers. But if you’re not aware of them, they can become a painful surprise when your application hits production traffic levels.
The good news? Understanding how KMS quotas work, recognizing throttling errors when they occur, and knowing proven mitigation strategies will let you build applications that scale reliably without constantly bumping up against these limits. This article walks you through everything you need to know.
Understanding KMS Request Quotas#
KMS enforces API request quotas on a per-region basis, and here’s the critical detail: these quotas are shared across all cryptographic operations you perform against keys in that region. This means the quota applies to the aggregate of operations like Encrypt, Decrypt, GenerateDataKey, Sign, Verify, and other cryptographic calls—not individual per-operation quotas.
The default request quota for KMS in each region is 10,000 requests per second. For most applications, this sounds plenty generous. But the moment you start scaling beyond a certain point, or if you’re performing encryption operations across many parallel processes, you might find that this shared quota becomes your bottleneck.
Let’s consider a concrete example. Imagine you’re building a multi-tenant SaaS platform where each customer’s data gets encrypted with their own KMS key. You have 100 concurrent users, and each user’s request requires two cryptographic operations: one to generate a data key and another to encrypt a payload. At just 100 concurrent requests, you’re already consuming 200 requests per second. At 1,000 concurrent users, you’re at 2,000 requests per second. The quota won’t stop you yet, but you can see how quickly you’d approach that 10,000 limit with higher concurrency or more complex encryption workflows.
The quota is regional, which means if you deploy your application across multiple AWS regions, each region maintains its own 10,000 request per second quota. This is actually helpful for distributed architectures—you’re not sharing quotas across regions—but it does mean you need to monitor quotas independently in each region where you operate.
Detecting Throttling with ThrottlingException#
When you exceed your KMS quota, the service responds with a ThrottlingException. This is the error you’ll encounter when your application tries to make cryptographic calls faster than the quota allows.
Here’s what a throttling error looks like when returned by the KMS API:
ThrottlingException: Rate exceededThe exception includes a message indicating that you’ve hit the rate limit, and the HTTP status code is typically 400. In SDKs, this surfaces as an exception you can catch and handle programmatically.
Detecting throttling is straightforward if you’re monitoring your application logs and CloudWatch metrics. KMS publishes metrics to CloudWatch, and you can set up alarms that trigger when you see throttling exceptions in your logs or when KMS user errors spike. The challenge isn’t really detecting throttling—it’s handling it gracefully when it happens and architecting your application so you don’t hit it in the first place.
When throttling does occur, the standard AWS best practice is to use exponential backoff with jitter. Most AWS SDKs have built-in retry logic with exponential backoff, so if you’re using the AWS SDK for your language (Python, Java, JavaScript, Go, etc.), you often get this behavior by default. However, understanding what’s happening under the hood helps you tune the retry behavior to suit your application’s needs.
Strategy 1: Data Key Caching#
The most powerful mitigation strategy for KMS throttling is data key caching. Instead of calling GenerateDataKey every time you need to encrypt data, you generate a data key once and reuse it to encrypt multiple data items. The data key itself is protected by your master key in KMS, so this remains secure.
The AWS Encryption SDK (which is available for Python, Java, and other languages) provides a built-in caching layer specifically for this purpose. Here’s how the pattern works:
You call GenerateDataKey once against your KMS master key, which costs you one request against your quota. You then use the returned plaintext data key to encrypt multiple items locally using AES encryption (which is very fast and doesn’t consume KMS quota). Once you’re done encrypting with that data key, you encrypt the data key itself with your master key and store it alongside the encrypted data. Later, when you need to decrypt, you use KMS to decrypt the data key (one request), then use that plaintext key to decrypt all the items that were encrypted with it.
This approach can reduce your KMS API requests by orders of magnitude. If you’re encrypting thousands of items per second, data key caching lets you do so with just a handful of KMS calls instead of thousands.
Here’s a simplified conceptual example using the AWS SDK for Python:
import boto3
from aws_encryption_sdk import KMSMasterKeyProvider, encrypt, decrypt
# Create a master key provider that uses your KMS key
kms_provider = KMSMasterKeyProvider(key_ids=['arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012'])
# Encrypt some data - the SDK handles data key generation and caching internally
plaintext = b'My sensitive data'
ciphertext, encryptor = encrypt(source=plaintext, key_provider=kms_provider)
# Decrypt - KMS is called to decrypt the data key, which then decrypts the data
decrypted_data, decryptor = decrypt(source=ciphertext, key_provider=kms_provider)The AWS Encryption SDK abstracts away many of these details, but the key insight is that it generates a data key per encryption operation and caches it, reducing KMS quota consumption dramatically.
If you want more granular control, you can also manually call GenerateDataKey and manage the data key lifecycle yourself. This gives you the flexibility to implement custom caching strategies tuned to your specific application patterns.
Strategy 2: S3 Bucket Keys#
If you’re encrypting objects stored in Amazon S3 using KMS, AWS offers a powerful feature called S3 Bucket Keys that dramatically reduces your KMS API calls. When you enable this feature on a bucket or object, S3 generates a temporary unique key for each object and uses that to encrypt the object data. S3 then encrypts the temporary key with your KMS master key. The result is that encrypting thousands of objects in S3 requires only one or two KMS API calls instead of one per object.
To enable S3 Bucket Keys, you specify the BucketKeyEnabled parameter when uploading objects to S3:
import boto3
s3_client = boto3.client('s3')
s3_client.put_object(
Bucket='my-encrypted-bucket',
Key='my-object',
Body=b'sensitive data',
ServerSideEncryption='aws:kms',
SSEKMSKeyId='arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012',
BucketKeyEnabled=True
)When BucketKeyEnabled is set to True, S3 handles the data key generation and management, and your KMS quota consumption becomes negligible even if you’re uploading thousands of objects per second. This is one of the easiest wins if your workload involves S3 encryption—you get massive quota savings with a simple configuration change.
Strategy 3: Exponential Backoff and Retry Logic#
Sometimes throttling happens despite your best efforts, and when it does, you need to handle it gracefully. This is where exponential backoff comes in.
The idea is simple: when you hit a throttling exception, wait a bit and try again. The “exponential” part means you wait longer with each subsequent retry attempt. Most AWS SDKs implement this automatically, but understanding it helps you tune the behavior.
A typical exponential backoff implementation might look like this:
import boto3
import time
import random
from botocore.exceptions import ClientError
kms_client = boto3.client('kms')
def decrypt_with_backoff(ciphertext, max_retries=3):
for attempt in range(max_retries):
try:
response = kms_client.decrypt(CiphertextBlob=ciphertext)
return response['Plaintext']
except ClientError as e:
if e.response['Error']['Code'] == 'ThrottlingException':
if attempt == max_retries - 1:
raise # Give up after max retries
# Calculate backoff with jitter
wait_time = (2 ** attempt) + random.uniform(0, 1)
print(f"Throttled. Retrying after {wait_time:.2f} seconds...")
time.sleep(wait_time)
else:
raise # Re-raise non-throttling errorsThe key principle here is jitter—adding randomness to the wait time. When multiple clients hit throttling simultaneously and all retry at the same time, they can create a thundering herd that makes the problem worse. Adding jitter spreads out the retries, improving the chance that at least some will succeed on the next attempt.
Most modern AWS SDKs (especially versions released in the last few years) have built-in retry strategies with exponential backoff and jitter, so you often don’t need to implement this manually. However, if you’re using a lower-level client or working in an environment with custom retry requirements, understanding this pattern is essential.
Requesting Quota Increases#
If you’ve optimized your application using data key caching, S3 Bucket Keys, and efficient request patterns, but you still need more than the default 10,000 requests per second, you can request a quota increase from AWS.
To request a KMS quota increase, you’ll use the AWS Service Quotas console or the Service Quotas API. Navigate to the Service Quotas section in your AWS console, search for KMS, and look for the “Requests per second” quota for your region. You can then request an increase to a higher limit.
When you submit a quota increase request, AWS reviews your use case. They don’t always grant unlimited increases, but they’re typically generous with high-throughput applications that have legitimate needs. Be prepared to describe your workload: the type of encryption operations you’re performing, your target throughput, and why the default quota isn’t sufficient.
It’s worth noting that even if you request and receive a higher quota, the previous strategies remain important. Data key caching and S3 Bucket Keys aren’t just about staying under quota—they’re about architectural efficiency. They reduce operational overhead on KMS and make your application more scalable overall.
Monitoring and Alerting#
You can’t manage what you don’t measure. Set up CloudWatch alarms to monitor your KMS usage and alert you before you hit quotas.
KMS publishes user errors to CloudWatch, and you can create a metric filter to count ThrottlingException errors:
[time, request_id, event_type = "ThrottlingException", ...]Create an alarm that triggers if throttling exceptions exceed a certain threshold (even one per minute is worth investigating). Additionally, set up a metric for KMS API requests per second—this helps you understand your actual usage patterns and forecast when you might hit quotas.
CloudWatch dashboards are invaluable here. Build a dashboard showing KMS request volume, throttling error count, and quota utilization. This gives you visibility into the health of your KMS usage and helps you spot trends before they become problems.
Bringing It All Together#
Handling KMS quotas effectively requires a multi-layered approach. First, understand that the 10,000 requests per second quota is shared across all cryptographic operations in a region. Second, design your application to minimize KMS API calls by using data key caching when appropriate and S3 Bucket Keys when encrypting S3 objects. Third, implement robust error handling with exponential backoff to gracefully handle occasional throttling. Finally, monitor your KMS usage and be prepared to request quota increases if your legitimate workload demands it.
Most developers find that combining data key caching with S3 Bucket Keys handles the vast majority of throttling scenarios. These strategies are proven patterns built into AWS services themselves, so they’re well-tested and reliable. By thinking about KMS quotas early in your architecture rather than treating them as a surprise when you hit production, you’ll build systems that scale reliably and efficiently.