CloudFront cache tag invalidation with Lambda@Edge for multi-tenant SaaS

Some of the API endpoints in a multi-tenant SaaS I work on are slow enough that I had to ask AWS to bump the API Gateway timeout quota to 180 seconds, because aggregating the right slice of data across multiple AWS accounts and several DynamoDB tables sometimes really does take that long. Users can’t be expected to wait three minutes for a dashboard to load. The underlying queries can’t be made instant. Caching aggressively at the CloudFront edge is the workaround, with TTLs that match how often the underlying data actually changes, which is “not very often”.

The data refreshes for each tenant happen in a per-tenant Step Function. When that pipeline finishes, the cached responses are stale. They need to be invalidated, and then prewarmed so the next user gets a fast response instead of paying the 180-second price.

For a long time, this was a problem of which paths to invalidate. For each tenant, I had a list of cached endpoints, and the post-refresh step issued a CloudFront invalidation with a list of path patterns: /api/cost?org=acme*, /api/inventory?org=acme*, /api/security?org=acme*, and so on. Whenever I added a new endpoint that cached data, I had to remember to add the corresponding path to the per-tenant invalidation list. Whenever the URL structure changed, the invalidation logic had to change in lockstep.

This is the pattern most people end up with, and the one most people end up regretting. The list of paths to invalidate is inherently coupled to URL structure, to the set of cached endpoints at any given moment, and to whatever encoding you happen to use for the tenant identifier in the query string. Every change to any of those things ripples into the invalidation logic. The map between “this tenant’s data changed” and “these specific URLs need purging” lives in your code, gets out of date, and surfaces as bugs that are annoying to diagnose because cache staleness rarely fails loudly.

In my case the brittleness expressed itself in three places: the path list in the Step Function definition, the prewarm list in another Lambda, the cache key configuration in the CloudFront distribution. Three places that all had to agree about what “this tenant’s cached responses” meant, and that drifted out of agreement on a regular basis.

Some honest perspective on where AWS sat in the industry on this. Other CDNs have had cache tag invalidation natively for years. Fastly’s surrogate keys have been a core feature since around 2014. Cloudflare shipped Cache-Tag in 2016, initially Enterprise-only, then available across all plans in April 2025. Akamai has tag-based Fast Purge in the same vein, also for years. CloudFront sat on path-based invalidation for the entire decade. So if you wanted tag semantics on AWS, you had two options: switch CDN, or build the indexing yourself.

I knew about the second option. In March 2024 the AWS Networking blog published a tag-based invalidation pattern built on Lambda@Edge, DynamoDB, Step Functions, SNS, and SQS, all coordinating to track URL-to-tag mappings and batch invalidations within CloudFront’s API limits. For my use case, swapping a brittle list of paths for that much additional surface area to maintain didn’t seem like a clear win. The path list was at least visible in one place. Switching CDN wasn’t on the table either: the rest of the platform is on AWS, the cost model assumes AWS, the IAM integration assumes AWS. Staying in one ecosystem has real value, and that value was higher than the cost of living with the brittleness, so I lived with the brittleness.

Then in late April 2026, AWS shipped CloudFront invalidation by cache tag natively. Roughly a decade after the rest of the industry, which gave me plenty of time to rehearse the migration plan in my head. 🙃 The mapping piece lives in CloudFront now, and the only custom code on my side is a small Lambda@Edge that injects the tag header. I started the migration the day the announcement landed.

How CloudFront cache tag invalidation works

Instead of identifying cached objects by path, you tag them at cache time, then invalidate by tag. The mechanism is straightforward enough:

Configure the distribution with a CacheTagConfig that names which response header carries the tags.
The origin (or something on the path between the origin and the cache, like Lambda@Edge) sets that header on the response, with a comma-separated list of tags.
CloudFront stores those tags alongside the cached object.
When you want to invalidate, you call CreateInvalidation with --paths "#tag-value". The # prefix tells CloudFront to treat the value as a cache tag rather than a URL path.
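
The two string conventions in the steps above (a comma-separated header value when tagging, a #-prefixed item when invalidating) can be sketched as a pair of small helpers. The function names are hypothetical, and since nothing above says whether whitespace around the commas is tolerated, this sketch emits none.

```python
from typing import List


def cache_tag_header_value(tags: List[str]) -> str:
    """Join tags into the comma-separated value carried by the tag header."""
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    return ",".join(cleaned)


def tag_invalidation_item(tag: str) -> str:
    """Prefix a tag with '#' so CreateInvalidation treats it as a tag, not a URL path."""
    return tag if tag.startswith("#") else f"#{tag}"


if __name__ == "__main__":
    print(cache_tag_header_value(["tenant:acme", "report:cost"]))
    print(tag_invalidation_item("tenant:acme"))
```

Keeping both helpers next to each other makes it harder for the tagging side and the invalidation side to disagree about how a tag is spelled.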

In my case, the tag is the tenant identifier. Every cached response gets tenant:acme (or whatever the tenant is) attached. When a tenant’s data refreshes, the Step Function’s last step runs:

aws cloudfront create-invalidation \
  --distribution-id D123ABC \
  --paths "#tenant:acme"

That single call invalidates every cached response associated with that tenant, regardless of URL. The path list goes away. The Step Function doesn’t need to know which endpoints exist. It just needs to know the tenant.

Setting cache tag headers: origin, Lambda@Edge, or CloudFront Functions

The tag needs to be on the HTTP response before CloudFront caches it. Where you put the logic that adds it depends on two questions: where in the stack does the tenant context live, and how invasive is it to add an HTTP header at that point.

There are three plausible places.

The origin itself. The application code adds the header in its response. Cleanest in principle, because the origin is where the tenant context is most reliably available: it’s already extracted from the JWT or the URL or the authorization layer, and you have the full request context. The cost is how cross-cutting the change is. If your origin is a single application with a shared response middleware, adding one line there is trivial. If your origin is API Gateway with explicit response mapping templates per endpoint (my case), adding a header means modifying every template, and that’s exactly the kind of change you don’t want spread across the codebase. If your origin is a static S3 bucket, you literally can’t add dynamic per-request headers there at all.
A Lambda@Edge function on Origin Response. This is the trigger that runs after the origin returns the response, before CloudFront caches it. The function reads the request and response, computes the tag, and adds it as a header. The origin code stays unchanged. The cost is the operational footprint of Lambda@Edge: deployments live in us-east-1 even if your CDK app is in another region, debugging is harder than regional Lambda, and there’s a per-request charge. The benefit is that this trigger fires before caching, which is exactly when the header needs to be present.
CloudFront Functions. Cheaper and faster than Lambda@Edge, but they only run on Viewer Request and Viewer Response. Viewer Response runs after the cache lookup, on the way out to the user. By that point the object is already cached or not cached, and CloudFront has already decided what tags to associate. CloudFront Functions therefore can’t set cache tags. This is worth knowing if you reach for them by reflex like I did at first.

The decision tree, then, is roughly: if the tenant context is naturally available at your origin and adding a header is a small change, do it there. Otherwise, Lambda@Edge on Origin Response is the right place. CloudFront Functions are out of scope for this specific job, regardless of how attractive they look on price and latency.
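
For what it's worth, the decision tree above is small enough to write down as a function. This is only a restatement of the prose; the enum and parameter names are made up for illustration.

```python
from enum import Enum


class TagLocation(Enum):
    ORIGIN = "origin"
    LAMBDA_EDGE_ORIGIN_RESPONSE = "lambda@edge on origin response"
    # CloudFront Functions are deliberately absent: they run only on viewer
    # events, after the cache decision, so they can never be the answer here.


def choose_tag_location(tenant_context_at_origin: bool, header_change_is_small: bool) -> TagLocation:
    """Pick where the cache tag header should be added."""
    if tenant_context_at_origin and header_change_is_small:
        return TagLocation.ORIGIN
    return TagLocation.LAMBDA_EDGE_ORIGIN_RESPONSE
```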

I went with Lambda@Edge on Origin Response, for the reason above: the API Gateway response mapping templates would have made the origin-side change too invasive. The function is small. It reads the request URL and the JWT in the Authorization header, extracts the tenant identifier from the claims, and adds a Cache-Tag header to the response. The header name is configured in the distribution’s CacheTagConfig.HeaderName, so I picked Cache-Tag instead of the x-amz-meta-cache-tag convention used for S3-origin metadata. The AWS docs explicitly note that with Lambda@Edge you’re free of the S3 naming convention, which is a small but pleasant detail.

The function only runs on cache misses, because Origin Response only fires when CloudFront actually goes to the origin. The cold-start risk is bounded to the first request after a TTL expiry, which is the same request that’s already paying the slow-aggregation price. The relative latency added by the edge function is negligible against an already-slow origin.

Per-tenant invalidation and prewarm orchestration

Two principles that generalize beyond my case. First: invalidation and prewarm belong together when the origin is slow. A tag invalidation that isn’t followed by a prewarm just shifts the slow-aggregation cost onto whichever user happens to land first after the refresh. Tag invalidation is fast (CloudFront’s docs quote P95 propagation under 5 seconds), but “fast invalidation” doesn’t mean “fast next request”. The next request still pays the full origin cost on a cache miss. If you care about that latency, prewarm is part of the same workflow, not an optimization to add later.

Second: per-tenant orchestration with a per-tenant tag gives you isolation as a property of the architecture, not as a runtime check. A pipeline scoped to one tenant, issuing an invalidation for one tag, can only affect that tenant’s cached objects. There’s no IAM policy or runtime check involved, no “what if the tag is wrong” failure mode that operators need to monitor. Tenant A’s pipeline literally cannot invalidate tenant B’s cache, because the tag it issues only matches its own. The structural guarantee comes for free as long as the tag scheme is per-tenant and the orchestration is per-tenant.

In my implementation, both principles live in the per-tenant Step Function that handles data refresh. Its last two steps are:

Issue a tag invalidation: one CreateInvalidation call with --paths "#tenant:{tenant_id}". P95 propagation under 5 seconds means the next user’s request will see fresh data within seconds of the invalidation.
Prewarm by replaying a small set of representative endpoints for the tenant. There’s a service account in Cognito that exists for this purpose: a non-human user in the superadmin group, with credentials in Secrets Manager. The prewarm Lambda authenticates as that user, hits the slow endpoints for the tenant, and warms the cache with the new data.

The two steps together mean that by the time a real user lands on a dashboard for that tenant after a refresh, the cache is already populated with the fresh response. No 180-second wait.
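
A minimal sketch of the prewarm step, assuming the token for the service account has already been obtained (the Cognito and Secrets Manager part is left out), that the API accepts it as a Bearer token, and that endpoints take the tenant as an org query parameter as in the path patterns earlier. The function names, the -1 failure marker, and the 200-second timeout are all illustrative choices, not the production code.

```python
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable


def http_get_status(url: str, token: str, timeout_s: float = 200.0) -> int:
    """GET a URL with a bearer token and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        response.read()  # read the full body before reporting success
        return response.status


def prewarm_tenant(
    base_url: str,
    org_id: str,
    endpoints: Iterable[str],
    token: str,
    fetch: Callable[[str, str], int] = http_get_status,
) -> Dict[str, int]:
    """Request each endpoint once for the tenant; record the status, or -1 on failure."""
    results: Dict[str, int] = {}
    query = urllib.parse.urlencode({"org": org_id})
    for path in endpoints:
        try:
            results[path] = fetch(f"{base_url}{path}?{query}", token)
        except OSError:
            # Covers URLError, HTTPError and timeouts: one failed endpoint
            # should not stop the remaining ones from being warmed.
            results[path] = -1
    return results
```

The fetch parameter exists only so the loop can be exercised without a network. The timeout is set above the 180-second origin limit so that the prewarm request doesn't give up before the origin does.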

The choice of which endpoints to prewarm is a judgment call. Prewarming everything is wasteful: most cached responses won’t be hit before TTL expiry anyway. Prewarming nothing brings the slow-aggregation problem back to the first user. I prewarm the four or five endpoints that historically account for most of the post-refresh traffic, which is a small enough set to keep the prewarm Lambda fast and a large enough set to cover the realistic usage patterns. This is the kind of decision that depends on traffic shape and is worth measuring before deciding.
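
One way to turn that measurement into a list, sketched under the assumption that you have per-endpoint request counts for some window after each refresh (from CloudFront or API Gateway logs, say). The 80% coverage target and the cap of five are placeholders, not numbers from my setup.

```python
from typing import Dict, List


def pick_prewarm_endpoints(
    hits: Dict[str, int],
    coverage: float = 0.8,
    max_endpoints: int = 5,
) -> List[str]:
    """Pick the busiest endpoints until they cover the target share of traffic."""
    total = sum(hits.values())
    if total == 0:
        return []
    chosen: List[str] = []
    covered = 0
    # Walk endpoints from most to least requested.
    for path, count in sorted(hits.items(), key=lambda item: item[1], reverse=True):
        if covered / total >= coverage or len(chosen) >= max_endpoints:
            break
        chosen.append(path)
        covered += count
    return chosen
```

The cap matters as much as the coverage target: it keeps a long tail of rarely used endpoints from inflating the prewarm run.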

CDK and CloudFormation implementation

The CloudFormation property that enables tag-based invalidation is AWS::CloudFront::Distribution -> DistributionConfig.CacheTagConfig.HeaderName. You set it to the name of the response header that carries your tags. The schema is minimal:

Type: AWS::CloudFront::Distribution
Properties:
  DistributionConfig:
    CacheTagConfig:
      HeaderName: "Cache-Tag"

That’s it. Once this is set, CloudFront inspects the named header on every origin response and indexes the tags for later invalidation.

The CDK L2 Distribution construct does not expose CacheTagConfig as of mid-2026. There’s no cacheTagConfig prop, no builder method, nothing. You have to drop to the L1 escape hatch. This is the pattern:

from aws_cdk import Duration
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_cloudfront as cloudfront

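# Define the Lambda@Edge function (this runs inside the stack's __init__, hence `self`).
# The function has to be created in us-east-1, whatever region the rest of the app uses.
# Lambda@Edge requires a published version; $LATEST won't work.
# Accessing current_version makes CDK publish a new version whenever the code or
# configuration changes; current_version_options configures that version.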
self.cache_tag_edge_function = _lambda.Function(
    self,
    "CacheTagInjectionEdgeFunction",
    function_name="towerguard-cache-tag-injection",
    runtime=_lambda.Runtime.NODEJS_24_X,
    handler="index.handler",
    code=_lambda.Code.from_asset(
        "backend/lambda/edge/cache_tag_injection",
        exclude=["node_modules", "*.test.mjs", "package-lock.json", "vitest.config.mjs"],
    ),
    timeout=Duration.seconds(5),
    memory_size=128,
    description="Lambda@Edge: injects Cache-Tag header for tenant-based cache invalidation",
    current_version_options=_lambda.VersionOptions(
        description="Lambda@Edge version for CloudFront Origin Response",
    ),
)
self.cache_tag_edge_function_version = self.cache_tag_edge_function.current_version

# Create the distribution (your existing code)
self.distribution = self._create_distribution(certificate, origin_verify_token)

# L1 escape hatch: set CacheTagConfig on the underlying CfnDistribution
cfn_distribution = self.distribution.node.default_child
cfn_distribution.add_property_override(
    "DistributionConfig.CacheTagConfig.HeaderName",
    "Cache-Tag"
)

To attach the Lambda@Edge function as an Origin Response trigger, you pass it in the edge_lambdas list on each behavior:

edge_lambdas=[
    cloudfront.EdgeLambda(
        function_version=self.cache_tag_edge_function_version,
        event_type=cloudfront.LambdaEdgeEventType.ORIGIN_RESPONSE,
    ),
],

This goes on both the default_behavior and every entry in additional_behaviors. If you miss one, requests routed through that behavior won’t get tagged, and tag invalidation won’t cover them.
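
To make it harder to miss a behavior, one option is to keep each behavior's keyword arguments in a plain dict and add the trigger to all of them in one pass. This is a sketch of that idea rather than what my stack does; the helper name and the dict layout are assumptions, and the entries are whatever you would otherwise pass to cloudfront.BehaviorOptions.

```python
from typing import Any, Dict, List


def with_edge_lambdas(
    behaviors: Dict[str, Dict[str, Any]],
    edge_lambdas: List[Any],
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of every behavior's kwargs with the given edge lambdas appended."""
    tagged: Dict[str, Dict[str, Any]] = {}
    for path_pattern, kwargs in behaviors.items():
        merged = dict(kwargs)  # shallow copy, so the caller's dicts stay untouched
        merged["edge_lambdas"] = list(kwargs.get("edge_lambdas", [])) + list(edge_lambdas)
        tagged[path_pattern] = merged
    return tagged


# Hypothetical usage inside a stack:
#   tagged = with_edge_lambdas(behavior_kwargs, [cache_tag_edge_lambda])
#   additional = {p: cloudfront.BehaviorOptions(**kw) for p, kw in tagged.items()}
```

Behaviors that already carry other edge lambdas keep them; the tagger is appended, not substituted.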

The Lambda@Edge function

The whole function is small. The interesting parts are the handler signature and the fail-open pattern, because these are Lambda@Edge specifics that aren’t intuitive if you’ve only ever written regular Lambda:

export async function handler(event) {
  const response = event.Records[0].cf.response;

  try {
    const request = event.Records[0].cf.request;
    const orgId = extractAndSanitizeOrgId(request);

    if (!orgId) {
      return response; // No tenant context, leave object untagged
    }

    response.headers["cache-tag"] = [{
      key: "Cache-Tag",
      value: `tenant:${orgId}`
    }];

    return response;
  } catch (error) {
    // Fail-open: return original response unmodified rather than 502
    console.error(JSON.stringify({ error: error.message }));
    return response;
  }
}

A few things worth pulling out of this:

The Lambda@Edge event shape is event.Records[0].cf.{request,response}, not the regular Lambda event. The first time you write one of these, the destructuring is the part that catches you out.
Headers in the response object are arrays of { key, value } objects, not strings. CloudFront expects this exact shape, and silently ignores anything else.
The function fails open. If anything goes wrong (a malformed request, an unexpected exception, a missing field), returning the original response unmodified is preferable to throwing, because a thrown exception in Lambda@Edge surfaces as a 502 to the user. The one read that sits outside the try is cf.response itself, since without it there is nothing to return. The cost of failing open is that the object cached from that one request won’t be tag-invalidatable, but it’ll still expire via TTL.
The tenant extraction reads the x-organization-id request header first (already forwarded by the CloudFront origin request policy), falls back to query string parameters (organization or organization_id), and sanitizes the result to the allowed ASCII range with a length cap. Trivial to implement and not worth showing in full.
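
For readers who do want to see it, here is a rough Python rendering of that extraction logic (the production function is JavaScript). The header and query parameter names come from the description above; the allowed character set, the 64-character cap, and the lowercasing are assumptions made for this sketch, the last one chosen so the tag matches the lowercased value used on the invalidation side.

```python
import re
from typing import Optional
from urllib.parse import parse_qs

MAX_ORG_ID_LENGTH = 64                     # assumed cap
_DISALLOWED = re.compile(r"[^a-z0-9_-]")   # assumed allowed set: a-z, 0-9, '_' and '-'


def extract_and_sanitize_org_id(request: dict) -> Optional[str]:
    """Return a sanitized tenant id from a Lambda@Edge request dict, or None."""
    candidates = []
    # Lambda@Edge headers are keyed by lowercase name; each is a list of {key, value}.
    for header in request.get("headers", {}).get("x-organization-id", []):
        candidates.append(header.get("value", ""))
    # Fall back to the query string, which arrives as a raw string.
    query = parse_qs(request.get("querystring", ""))
    for name in ("organization", "organization_id"):
        candidates.extend(query.get(name, []))
    for raw in candidates:
        cleaned = _DISALLOWED.sub("", raw.strip().lower())[:MAX_ORG_ID_LENGTH]
        if cleaned:
            return cleaned
    return None
```

Stripping rather than rejecting disallowed characters is a design choice. Rejecting the value outright (and leaving the object untagged) is the stricter alternative if two tenant identifiers could collapse to the same sanitized tag.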

The invalidation Lambda

On the other side, the Lambda issues the tag-based CreateInvalidation call. The whole point reduces to one boto3 call, which is anticlimactic in the best way:

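import boto3

# Module-level client, reused across warm invocations.
cloudfront = boto3.client("cloudfront")


def create_tag_invalidation(distribution_id: str, org_id: str, caller_reference: str) -> str:
    # The leading '#' marks the item as a cache tag rather than a URL path.
    invalidation_tag = f"#tenant:{org_id.lower()}"

    response = cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            'Paths': {
                'Quantity': 1,
                'Items': [invalidation_tag]
            },
            'CallerReference': caller_reference
        }
    )

    # The ID is what GetInvalidation needs to poll for completion.
    return response['Invalidation']['Id']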

The # prefix in the path is what tells CloudFront the value should be treated as a tag rather than a URL path. Without it, CloudFront tries to invalidate a literal path called tenant:acme, which doesn’t exist, and the invalidation completes successfully having done nothing. There’s no error or warning. This is one of the easiest ways to spend an hour wondering why nothing is happening.

The real production wrapper around this adds two things that aren’t specific to cache tags: a retry loop with exponential backoff to handle TooManyInvalidationsInProgress errors when bursts of refresh pipelines run concurrently, and a polling loop on GetInvalidation so the Step Function’s next step (cache prewarm) doesn’t fire while the invalidation is still propagating. Both are standard boto3 patterns and don’t change for tag invalidations specifically. I sized the polling timeout at 60 seconds, which is generous given P95 propagation under 5 seconds, and let TTL handle the rest if it ever times out.
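
A stripped-down sketch of that wrapper, not the production code. It takes the boto3 CloudFront client as a parameter and detects throttling through the response dict that botocore's ClientError carries, so it has no import-time dependency on boto3. The function names, attempt count, and delays are illustrative; only the 60-second polling timeout comes from the text above.

```python
import time
from typing import Any, Callable, Dict


def _error_code(exc: Exception) -> str:
    """Read the AWS error code from a ClientError-style exception, if present."""
    return getattr(exc, "response", {}).get("Error", {}).get("Code", "")


def invalidate_and_wait(
    client: Any,
    distribution_id: str,
    batch: Dict[str, Any],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    poll_interval_s: float = 2.0,
    poll_timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Create an invalidation with backoff, then poll it. True once it has completed."""
    for attempt in range(max_attempts):
        try:
            created = client.create_invalidation(
                DistributionId=distribution_id, InvalidationBatch=batch
            )
            break
        except Exception as exc:
            throttled = _error_code(exc) == "TooManyInvalidationsInProgress"
            if not throttled or attempt == max_attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...

    invalidation_id = created["Invalidation"]["Id"]
    deadline = clock() + poll_timeout_s
    while clock() < deadline:
        status = client.get_invalidation(
            DistributionId=distribution_id, Id=invalidation_id
        )["Invalidation"]["Status"]
        if status == "Completed":
            return True
        sleep(poll_interval_s)
    return False  # timed out: the caller decides whether to prewarm anyway
```

Returning False instead of raising on timeout matches the "let TTL handle the rest" stance: the Step Function can carry on without treating a slow propagation as a failure.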

Migrating from path patterns to cache tags

Path-based invalidation is still there. Tag invalidation is additive. Mixing both in the same CreateInvalidation call works (--paths "/index.html" "#tenant:acme"). Existing wildcard invalidations still work. Nothing breaks if you adopt this incrementally.
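
During an incremental migration it can help to build the mixed batch in one place. A small sketch, with a hypothetical function name, that produces the InvalidationBatch structure for a call combining paths and tags:

```python
from typing import Any, Dict, Iterable


def build_invalidation_batch(
    paths: Iterable[str],
    tags: Iterable[str],
    caller_reference: str,
) -> Dict[str, Any]:
    """Build an InvalidationBatch that mixes URL paths and '#'-prefixed cache tags."""
    items = []
    for path in paths:
        if not path.startswith("/"):
            raise ValueError(f"not a URL path: {path!r}")
        items.append(path)
    for tag in tags:
        items.append("#" + tag.lstrip("#"))  # exactly one leading '#'
    if not items:
        raise ValueError("nothing to invalidate")
    return {
        "Paths": {"Quantity": len(items), "Items": items},
        "CallerReference": caller_reference,
    }
```

Rejecting a path that lacks the leading slash also catches a tag passed in the wrong argument, which would otherwise be invalidated as a nonexistent path.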

The thing that does change is the operational model. The map between “what changed” and “what to invalidate” used to be a list of paths someone had to maintain. Now it’s a tag taxonomy that the tagging logic and the invalidation logic both refer to. Tagging happens at cache time, invalidation happens at refresh time, and the only thing they share is the tag name.

For multi-tenant systems where the natural unit of invalidation is “everything for tenant X”, this is the cleanest model AWS has shipped for cache invalidation in a long time. I’m keeping it.

The broader lesson, for me, is about the value of waiting for native primitives versus assembling the capability yourself from a published pattern. The 2024 blog post showed the path was technically possible. I chose not to take it, and in retrospect that was right: two years of brittleness was cheaper than two years of maintaining a self-built indexing pipeline that I’d then have had to migrate off the moment AWS shipped the native version. The fragility had a known shape and a known cost. The custom infrastructure would have had unknown failure modes and a guaranteed deprecation date. When the gap between “the platform supports this natively” and “you can build it yourself” is wide enough, sitting on the slightly-uncomfortable-but-stable solution often beats the architecturally-cleaner-but-self-maintained one. Especially when the platform vendor has the feature on their long-rumored roadmap, and competitors have shipped it, and you can credibly believe it’s coming.

Implementation gotchas across CloudFront, Lambda@Edge, and CDK

A few notes from doing this, the kind of thing that doesn’t show up in the docs because it sits in between two sections that nobody reads back to back.

Lambda@Edge deployments live in us-east-1, regardless of where the rest of your stack is. My CDK app is in eu-central-1. The first deploy failed in a way that wasn’t immediately obvious, because the function was being created in the right place but the cross-region wiring to CloudFront wasn’t in place. If your deployment automation assumes a single region (mine did), this is the moment you discover that the assumption doesn’t hold.

The CDK L2 Distribution construct doesn’t expose CacheTagConfig. You drop to the L1 escape hatch with node.default_child and add_property_override. Easy fix, but the failure mode if you forget is silent: your Lambda@Edge happily injects Cache-Tag headers, CloudFront happily ignores them, tag invalidations complete successfully having done nothing. I spent more time than I want to admit chasing this before realizing the distribution wasn’t actually configured to look for the header.
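
Because the failure is silent, a check on the synthesized template is cheap insurance. The sketch below works on the template as a plain dict, so it makes no assumptions about how you obtain it (with CDK, Template.from_stack(stack).to_json() from aws_cdk.assertions should give you one). The function names are hypothetical.

```python
from typing import Dict, Optional


def cache_tag_header_names(template: dict) -> Dict[str, Optional[str]]:
    """Map each CloudFront distribution's logical ID to its CacheTagConfig header name."""
    found: Dict[str, Optional[str]] = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::CloudFront::Distribution":
            continue
        config = resource.get("Properties", {}).get("DistributionConfig", {})
        found[logical_id] = config.get("CacheTagConfig", {}).get("HeaderName")
    return found


def assert_cache_tags_enabled(template: dict, expected_header: str = "Cache-Tag") -> None:
    """Raise if any distribution in the template lacks the expected tag header."""
    missing = [
        logical_id
        for logical_id, name in cache_tag_header_names(template).items()
        if name != expected_header
    ]
    if missing:
        raise AssertionError(
            f"CacheTagConfig.HeaderName is not {expected_header!r} on: {missing}"
        )
```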

CloudFront Functions look like the natural fit, but they can’t do this job. They only run on Viewer Request and Viewer Response, and Viewer Response fires after the cache decision is made. Setting a tag at that point is too late, because the object is already in cache without it. I reached for CloudFront Functions first because they’re cheaper and faster than Lambda@Edge, and lost an afternoon understanding why nothing was happening before reading the trigger semantics carefully enough.

Multiple tags in a single invalidation use OR logic, not AND. Issuing create-invalidation --paths "#tenant:acme" "#region:eu" invalidates objects matching either tag, not both. Useful to know before you design a tagging scheme that assumes AND. If you need AND semantics, model the combination as a third tag and apply it at cache time.
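
A sketch of that combined-tag idea. The + separator and the sorting are assumptions made here: check the separator against the characters CloudFront allows in a tag before relying on it. Sorting the parts means the tagging side and the invalidation side produce the same string whatever order they list the parts in.

```python
from typing import List


def combined_tag(*tags: str, separator: str = "+") -> str:
    """Build one tag that stands for 'all of these tags together'."""
    return separator.join(sorted(tags))


def tags_for_object(tenant: str, region: str) -> List[str]:
    """Tags applied at cache time: each dimension alone, plus their combination."""
    base = [f"tenant:{tenant}", f"region:{region}"]
    return base + [combined_tag(*base)]
```

Invalidating "#" plus combined_tag("tenant:acme", "region:eu") then hits only objects carrying both, while the individual tags keep their OR behavior.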

The wildcard quota is separate, and tag invalidations don’t pull from it. CloudFront has two distinct concurrency limits per distribution: 3,000 individual paths in progress, and 15 wildcard paths in progress. Tag invalidations count as individual paths, so they pull from the larger pool. If you used to have 50 wildcard paths per refresh and were occasionally bumping into the 15-in-progress ceiling under bursty schedules, switching to tags fixes that as a side effect, even before talking about cost.

The cost reduction is real but small in absolute terms. Each tag counts as one path under the same pricing model: 1,000 free paths per month at the AWS account level, then $0.005 each. Replacing N path patterns per refresh with 1 tag per refresh is mathematically a big ratio, but the dollar number was small to begin with. The reasons to switch are operational, not financial.
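
The arithmetic is easy to check for your own numbers. The pricing constants below are the ones quoted above; the tenant and refresh counts in the usage note are invented for illustration and are not my actual figures.

```python
FREE_PATHS_PER_MONTH = 1_000
PRICE_PER_PATH_USD = 0.005


def monthly_invalidation_cost(tenants: int, refreshes_per_tenant: int, paths_per_refresh: int) -> float:
    """Monthly invalidation cost in USD, rounded to cents."""
    total_paths = tenants * refreshes_per_tenant * paths_per_refresh
    billable = max(0, total_paths - FREE_PATHS_PER_MONTH)
    return round(billable * PRICE_PER_PATH_USD, 2)
```

With a hypothetical 20 tenants refreshing 30 times a month, 10 path patterns per refresh come to 6,000 paths, 5,000 of them billable, so $25. One tag per refresh comes to 600 paths, which stays inside the free tier.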

Origin Response only fires on cache misses, which is a feature. The Lambda@Edge function only runs when CloudFront actually goes to the origin, so cache hits pay nothing. The flip side is that you can’t retroactively tag already-cached objects: anything cached before you deploy the function will live out its TTL untagged and unreachable by tag invalidation. Plan for a transition period where some objects are tag-invalidatable and some aren’t.