Per-tenant DynamoDB isolation with the Token Vending Machine pattern

If a Lambda function in your multi-tenant SaaS asked DynamoDB for the wrong tenant’s data, what would actually stop it? In most architectures I’ve seen, the answer is “the application code, hopefully”. The Lambda’s IAM role has read access to all tenant tables, and the only filter is the tenant_id parameter that gets passed through. If that parameter is wrong because of a routing bug, a cache key collision, or an off-by-one in pagination, IAM doesn’t care. It serves what you asked for.

This bothered me when I started designing the data layer for a multi-tenant platform I work on. The system has per-tenant DynamoDB tables (one per customer, named customer-data-{tenant_id} and a few others), Lambdas that read from them on behalf of authenticated users, and a tier of internal superadmins who can legitimately access any customer’s data for support and operational work. Cognito groups give us application-level authorization, an audit trail logs every authorization decision, and the API code is unit-tested for tenant filtering. None of that physically prevents a wrong query though. I wanted something that did.

This post is about the pattern I ended up with: a single IAM role, broad enough to read every tenant table, that gets narrowed at runtime to exactly one table per request. The narrowing happens via STS, with an inline session policy. AWS calls this the Token Vending Machine pattern; the most concrete walkthrough is the AWS Prescriptive Guidance pattern for S3, with broader treatment in the AWS SaaS Factory whitepapers and the Well-Architected SaaS Lens. What I rarely see written down is a real implementation walked through end to end for per-tenant tables, rather than the pooled shared-table model most TVM material assumes, together with the trust policy decisions and the gotchas you only learn by hitting them.

Multi-tenant scenarios where Token Vending Machine isolation matters

Before getting into the mechanics, it’s worth being precise about which scenarios actually need a pattern like this and which ones don’t, because the value is uneven across cases with separate tables per tenant. Skipping that framing leaves the rest of the article looking like an engineering exercise.

Take the simplest case first. A regular customer user logs in, Cognito issues a JWT with their tenant_id baked into a claim, and the Lambda derives the table name from that claim and nothing else. Cognito has already decided this user can only ever see their own tenant’s data, and the table name comes from the same claim that Cognito validated. There is no real path for the wrong table name to be queried, short of code that explicitly throws away the JWT-derived tenant and substitutes something else. In this scenario IAM-level tenant scoping is mostly defensive scaffolding: it enforces a property the application is already enforcing, and protects against future bugs that haven’t been written yet. Useful, but not load-bearing.

The picture changes the moment the architecture has any of the following.

Privileged users who legitimately access multiple tenants

Many real systems have an internal class of users (support engineers, operators, account managers) who can legitimately access any customer’s data for troubleshooting and operational work. From Cognito’s point of view, this kind of user asking for tenant A’s data and the same user asking for tenant B’s data are both authorized. The decision of which tenant gets read is made entirely in the application: a customer dropdown in the console, a path parameter, a filter, a saved view. None of that is something Cognito can validate, because all of it is legal for this user.

This is exactly the situation where the parameter that determines the table name is genuinely free to be wrong, in the normal ways bugs happen in admin consoles: UI state stuck on the previous customer after navigation, a cached selection that didn’t get invalidated when the user switched context, a request parameter parsed from the wrong place because the same field name exists in two different objects. Without IAM-level scoping, a bug like that means a support engineer investigating a ticket for one customer gets shown another customer’s data, with full audit consequences and a customer conversation no one wants to have.

With TVM, the credentials issued for that request are scoped to whichever table the application explicitly named when assuming the role. If the application later tries to read a different tenant’s table inside the same handler, IAM returns AccessDenied. The principle worth holding onto: a privileged user can read any tenant, but in any individual request can only read the tenant that was explicitly selected for that request. The capability is broad, the exercised authority is always narrow, and the gap between “Cognito says yes” and “the right table is read” is where bugs used to live.

This is also where the audit story matters most. The AssumeRole call goes into CloudTrail with a RoleSessionName that names the tenant explicitly, along with the calling identity, the timestamp, and the source IP. The calling identity recorded there is the Lambda’s execution role rather than the human user, so the session name (or a correlation ID) has to carry the engineer’s identity for the event to name a person. Six months later, when someone asks “which support engineer looked at customer X’s data the week of March 9”, you have an answer at the IAM layer, not an answer that depends on application logs that may have rotated, been bypassed by a code path you forgot about, or been written in a format nobody can grep efficiently. For broadly privileged users, this granular per-request trail is half the value of the pattern.
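For the session name to answer “who looked at which tenant”, it has to carry both pieces and still be something STS accepts. The sketch below is one way to build it, not the scheme used in the system described here: the helper name and the "tenant-<id>.<actor>" layout are assumptions for illustration. STS limits RoleSessionName to 64 characters drawn from letters, digits, underscore and "+=,.@-".

```python
import re

# Characters outside the set STS accepts for RoleSessionName.
_NOT_ALLOWED = re.compile(r"[^\w+=,.@-]", re.ASCII)
_MAX_SESSION_NAME_LEN = 64


def build_session_name(tenant_id: str, actor: str) -> str:
    """Return a RoleSessionName that names the tenant first, then the actor.

    Hypothetical layout: "tenant-<tenant_id>.<actor>". Rejected characters
    are replaced with "-", and the result is cut to 64 characters. The tenant
    comes first so that it is the part that survives truncation.
    """
    raw = f"tenant-{tenant_id}.{actor}"
    cleaned = _NOT_ALLOWED.sub("-", raw)
    return cleaned[:_MAX_SESSION_NAME_LEN]
```

Putting the tenant first is a deliberate choice: if a long actor identifier pushes the name past the limit, the CloudTrail record still says which tenant the credentials were scoped to.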

A side note on the implementation: the privileged path goes through the same TVM as everyone else, with the same per-request scoping. It would be tempting to give these users a separate role with broad access since they “need it anyway”, but that recreates the original problem for the user population that can do the most damage when something goes wrong.

Code paths where tenant_id doesn’t come from a JWT

The single-tenant-per-request, JWT-derived tenant_id model covers a lot of an application but not all of it. The places it doesn’t cover tend to be the ones where bugs are easiest to introduce.

Scheduled jobs that iterate over tenants for cleanup, billing aggregation, or health checks are one obvious example. There’s no JWT here, just a list of tenant IDs and a loop, and an off-by-one or a stale list with a tenant that was supposed to be archived can mean the wrong table gets touched. The Cognito layer that protects everything else simply isn’t in the loop.

Another case is callbacks from external systems, where the tenant identifier arrives in a payload from an upstream service. The upstream is responsible for sending the correct value, but the receiving Lambda has to do data access based on what arrived. Whatever validation you do, the credentials had better be scoped to whatever you concluded the tenant was, so a bug between “validated the payload” and “queried the table” can’t drift to a different tenant.

Then there are pagination tokens, entity lookups, and cross-references where the tenant gets derived from data rather than from auth. A pattern like “look up record X, then read that record’s tenant’s profile” depends on the lookup returning the right record. If a key collision or a stale index returns the wrong one, you’re now reading the wrong tenant.

In all of these, Cognito is either not in the loop at all, or only at the entry point and not at each subsequent data access. With TVM, the session policy gets built right where the table ARN is constructed, so whichever logic just decided “this is the tenant to access”, that decision is bound to the IAM credentials immediately and can’t drift afterwards.
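Since the table ARN is the value everything else hangs on, the place that builds it is worth keeping small and strict. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the tenant ID format (lowercase letters, digits, hyphens) is an assumption, not something the system above guarantees. The point of the validation is that a tenant identifier arriving from a payload or a lookup cannot smuggle a wildcard or a path separator into the resource of the session policy.

```python
import re

# Assumed tenant ID format: lowercase alphanumerics and hyphens, 1-63 chars.
# Anything else (notably "*" or "/") is rejected before it reaches an ARN.
_TENANT_ID = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def tenant_table_arn(tenant_id: str, region: str, account_id: str) -> str:
    """Build the ARN of one tenant's table from a validated tenant ID."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return (
        f"arn:aws:dynamodb:{region}:{account_id}"
        f":table/customer-data-{tenant_id}"
    )
```

A scheduled job or a callback handler would call this once per tenant it decided to touch and request credentials for exactly that ARN, so the decision and the credentials are produced in the same few lines.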

Read versus write action scoping

The session policy doesn’t only narrow which table, it narrows which actions. A handler that exists to read data can be wired to only ever request read-only credentials, and a bug that calls PutItem from inside it gets AccessDenied rather than a successful write to the wrong place. The parent role grants both reads and writes, the session policy decides which subset this particular request needs.

This matters most in code paths where reads and writes legitimately coexist, like an endpoint that reads to validate and then writes if validation passes. If the write step has a logic bug that runs when it shouldn’t, IAM stops it at the boundary, before the data layer has a chance to do the wrong thing.

Limiting the blast radius of leaked credentials

Credentials end up in places they shouldn’t from time to time: an exception that gets logged with the full request context, an HTTP response that includes more than it should, a debug session left running. With TVM, the credentials that escape are valid for 15 minutes and authorized for one tenant’s table. Without the scoping, whatever escaped would be authorized for every tenant table, and likely valid for longer. This isn’t a primary justification for TVM on its own, but it’s the kind of property that pays for itself the one time something does leak.

Alternatives: per-tenant roles, code filtering, and ABAC

Before settling on TVM, I considered three alternatives.

One IAM role per tenant. The Lambda assumes a different role depending on which tenant it’s serving. This works, but it doesn’t scale. Every new tenant means a new role, new trust policy, new identity policy. Your IAM gets cluttered with N variants of the same role, and the per-account quota on IAM roles (1,000 by default) turns that clutter into a hard limit as the tenant count grows.
One IAM role with all permissions, filter in code. The Lambda’s role can read every tenant table. The application code is responsible for only ever querying the right one. This is what most teams do, and it works fine until the day it doesn’t. The blast radius of an application bug becomes the entire tenant population.
ABAC with session tags. Achieves the same isolation goal as TVM through a different mechanism. Instead of generating a session policy at call time, you give the parent role a static identity policy that references session tags, like Resource: arn:aws:dynamodb:*:*:table/customer-data-${aws:PrincipalTag/TenantID}. The caller passes the tenant identifier as a session tag in the AssumeRole call, and IAM substitutes it into the resource ARN at evaluation time. AWS documents this approach explicitly for multi-tenant SaaS. It scales better at high tenant counts because session tags are smaller than session policies, and the parent role’s identity policy stays static. I went with inline session policies because the policy is inspectable at the call site: the full resource ARN is visible in the code that builds it, with no template substitution to chase down at audit time. For our scale and review habits, the explicit form was easier to reason about.
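For comparison, here is roughly what the ABAC variant looks like as data. This is a sketch of the alternative, not of what I deployed: the function names are hypothetical, and the tag key TenantID simply mirrors the example above. One detail that is easy to miss is that passing session tags requires the role’s trust policy to allow sts:TagSession for the caller in addition to sts:AssumeRole.

```python
def abac_identity_policy(actions: list) -> dict:
    """Static identity policy; IAM substitutes the session tag at evaluation."""
    table = "arn:aws:dynamodb:*:*:table/customer-data-${aws:PrincipalTag/TenantID}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": list(actions),
            "Resource": [table, f"{table}/index/*"],
        }],
    }


def abac_assume_role_kwargs(role_arn: str, tenant_id: str) -> dict:
    """Keyword arguments for sts.assume_role using a session tag, no Policy."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"tenant-{tenant_id}"[:64],
        "Tags": [{"Key": "TenantID", "Value": tenant_id}],
        "DurationSeconds": 900,
    }
```

The contrast with the session policy approach is visible in the second function: the tenant appears only as a tag value, and the resource ARN it resolves to lives in a different file.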

The Token Vending Machine takes a different shape: one role, broad permissions, narrowed per request via an inline session policy at AssumeRole time.

How the Token Vending Machine pattern works

The mechanism is simple, but the security guarantee depends on getting the details right. STS AssumeRole accepts an optional Policy parameter, called a session policy. When you pass one, the credentials returned by STS have permissions equal to the intersection of two things: the role’s identity policy, and the session policy you just submitted.
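The intersection is easier to trust once you have seen it as code. What follows is a deliberately simplified toy model, not IAM’s real evaluation logic: it ignores explicit Deny, conditions, permissions boundaries and case-insensitive action matching, and all the names and ARNs in it are made up for illustration. It only shows the one rule that matters here, which is that a request passes if both policies allow it.

```python
from fnmatch import fnmatchcase


def _allows(policy: dict, action: str, resource: str) -> bool:
    """True if any Allow statement matches both the action and the resource."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, p) for p in stmt["Action"])
        resource_ok = any(fnmatchcase(resource, p) for p in stmt["Resource"])
        if action_ok and resource_ok:
            return True
    return False


def effective_allow(identity_policy, session_policy, action, resource) -> bool:
    """Toy intersection: allowed only if BOTH policies allow the request."""
    return (_allows(identity_policy, action, resource)
            and _allows(session_policy, action, resource))


TABLE_A = "arn:aws:dynamodb:eu-west-1:123456789012:table/customer-data-a"
TABLE_B = "arn:aws:dynamodb:eu-west-1:123456789012:table/customer-data-b"

# Broad role: several actions on every tenant table.
IDENTITY_POLICY = {"Statement": [{
    "Effect": "Allow",
    "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
    "Resource": ["arn:aws:dynamodb:*:*:table/customer-data-*"],
}]}

# Narrow session: read actions on tenant A only. ConditionCheckItem is
# listed here but not in the role, so the intersection drops it.
READ_ONLY_SESSION_FOR_A = {"Statement": [{
    "Effect": "Allow",
    "Action": ["dynamodb:GetItem", "dynamodb:Query",
               "dynamodb:ConditionCheckItem"],
    "Resource": [TABLE_A],
}]}
```

With these inputs, a GetItem on tenant A passes, while a GetItem on tenant B, a PutItem on tenant A, and a ConditionCheckItem on tenant A are all refused. The last case is the one to remember: listing an action in the session policy grants nothing unless the role allows it too.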

Concretely, the role I have looks like this in CDK:

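from aws_cdk import Duration, aws_iam as iam

# Inside the stack's __init__. data_access_lambda_roles is the list of
# execution roles belonging to the Lambdas allowed to assume this role.
self.tenant_scoped_role = iam.Role(
    self,
    "TenantScopedRole",
    assumed_by=iam.CompositePrincipal(*data_access_lambda_roles),
    description="Assumed by data-processing Lambdas with a tenant-scoped session policy",
    max_session_duration=Duration.hours(1),
)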

self.tenant_scoped_role.add_to_policy(
    iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        actions=[
            "dynamodb:Query", "dynamodb:GetItem", "dynamodb:Scan",
            "dynamodb:BatchGetItem", "dynamodb:DescribeTable",
            "dynamodb:PutItem", "dynamodb:UpdateItem",
            "dynamodb:DeleteItem", "dynamodb:BatchWriteItem"
        ],
        resources=[
            "arn:aws:dynamodb:*:*:table/customer-data-*",
            "arn:aws:dynamodb:*:*:table/customer-data-*/index/*",
        ]
    )
)

That role can read and write to every tenant table in the account. On its own, it offers exactly zero isolation guarantees.

The isolation comes from how the role is consumed:

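import json

import boto3

sts_client = boto3.client("sts")


def get_tenant_scoped_credentials(
    tenant_table_arn: str,
    role_arn: str,
    duration_seconds: int = 900,
    read_write: bool = False,
) -> dict:
    """Assume the tenant-scoped role, narrowed to one table and one action set."""
    # Read-only baseline. A session policy can only narrow what the role
    # grants, so an action listed here has no effect unless the role allows
    # it too (ConditionCheckItem is not in the role definition shown above).
    actions = [
        "dynamodb:BatchGetItem", "dynamodb:ConditionCheckItem",
        "dynamodb:DescribeTable", "dynamodb:GetItem",
        "dynamodb:Query", "dynamodb:Scan",
    ]
    if read_write:
        actions.extend([
            "dynamodb:PutItem", "dynamodb:UpdateItem",
            "dynamodb:DeleteItem", "dynamodb:BatchWriteItem",
        ])

    # The table and its indexes are the only resources these credentials reach.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": actions,
            "Resource": [
                tenant_table_arn,
                f"{tenant_table_arn}/index/*",
            ],
        }],
    }

    # The session name is what shows up in CloudTrail; STS caps it at 64 chars.
    table_name = tenant_table_arn.split("/")[-1]
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"tenant-{table_name}"[:64],
        Policy=json.dumps(session_policy),
        DurationSeconds=duration_seconds,
    )
    return response["Credentials"]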

The Lambda calls this function with the ARN of one tenant’s table. The credentials it gets back can do nothing else. They cannot read another tenant’s table, even if the resulting boto3 client tries. IAM evaluation returns AccessDenied because the table ARN isn’t in the session policy.

This is the actual security property. The application code can have any kind of bug: mismatched tenant_id, wrong table name in the lookup, anything. If you scope the credentials before you build the boto3 client, the bug cannot result in cross-tenant access. IAM evaluates every DynamoDB call on the service side against the session policy, out of reach of whatever the application logic got wrong.
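“Scope the credentials before you build the boto3 client” comes down to a handful of lines. This is a minimal sketch with hypothetical helper names; the only real API it relies on is boto3.client accepting explicit credentials through aws_access_key_id, aws_secret_access_key and aws_session_token. The input is the Credentials dict that STS returns from AssumeRole.

```python
def client_kwargs(credentials: dict, region: str) -> dict:
    """Map the STS 'Credentials' dict onto boto3.client keyword arguments."""
    return {
        "region_name": region,
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


def scoped_dynamodb_client(credentials: dict, region: str):
    """Build a DynamoDB client that can only use the scoped credentials."""
    # Imported lazily so the mapping above can be exercised without boto3.
    import boto3

    return boto3.client("dynamodb", **client_kwargs(credentials, region))
```

The design choice worth keeping is that the handler never sees a client built from the Lambda’s default credential chain. If the only DynamoDB client in scope is the one returned here, there is no unscoped client left for a bug to reach for.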

Trust policy: explicit principals and the trust boundary

The trust policy of the TenantScopedRole enumerates exactly which Lambda execution roles can assume it. The list is built directly in CDK from the data-access Lambda roles defined in the same app, so adding a new consumer means adding it to that list, in code, in the same pull request that introduces the Lambda.

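Each Lambda listed in the trust policy also gets a matching sts:AssumeRole permission in its own identity policy:

{
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::ACCOUNT:role/TenantScopedRole"
}

The trust policy says “this role accepts being assumed by these principals”. The identity policy says “this principal is allowed to do sts:AssumeRole on this role”. The trust policy is the side that can never be skipped: without an entry there, AssumeRole fails with AccessDenied before any session policy is even evaluated, whatever the caller’s identity policy says. Cross-account, both sides have to allow the call. Within a single account, AWS treats a trust policy that names a principal directly as a resource-based policy that is sufficient on its own, so the identity statement is there to make the intent explicit on both sides rather than because the call would fail without it.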

The reason for this shape is that it puts the security boundary on the role itself, not on the discipline of every identity policy in the account. If some other Lambda in the same account has a permissive identity policy that grants sts:AssumeRole on *, it still cannot assume the TenantScopedRole, because the trust policy doesn’t list it. The role enforces who can use it, regardless of how generous other identity policies are written.

The cost is operational: adding a new Lambda that needs tenant-scoped access requires updating the trust policy. The number of legitimate consumers is small and grows slowly, so the explicit list is easy to read in code review and rarely changes. If the consumer set were large or volatile, trusting the account and restricting callers by naming convention (an aws:PrincipalArn condition in the trust policy, backed by an SCP at the organization level) would scale better; for our numbers, the explicit list is simpler to maintain and read.

If the role were ever made assumable cross-account, I would also add an external ID condition. Same-account, the explicit principal list does the work.
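As a plain document, the trust policy described in this section looks roughly like the output of the sketch below. The function name is hypothetical and the exact JSON that CDK synthesizes from a CompositePrincipal may be shaped differently; the optional external_id branch shows the sts:ExternalId condition I would add for the cross-account case.

```python
from typing import Iterable, Optional


def build_trust_policy(
    principal_role_arns: Iterable[str],
    external_id: Optional[str] = None,
) -> dict:
    """Trust policy that lists each allowed caller explicitly."""
    statement = {
        "Effect": "Allow",
        # Sorted so the rendered document is stable across deployments.
        "Principal": {"AWS": sorted(principal_role_arns)},
        "Action": "sts:AssumeRole",
    }
    if external_id is not None:
        # Only meaningful cross-account: the caller must present this value.
        statement["Condition"] = {
            "StringEquals": {"sts:ExternalId": external_id}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}
```

When the condition is present, the caller has to pass the same value as the ExternalId parameter of AssumeRole, otherwise the trust policy does not match and the call is denied.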

Token Vending Machine threat model and limitations

There are classes of compromise TVM does not address, and being clear about them matters as much as being clear about what it does.

Compromised Lambda execution environment. An attacker who can run code inside a Lambda can call STS directly with whatever session policy they like, including one that grants access to every tenant table. The execution role still has AssumeRole permission and the parent role still has broad access. The protection only holds when TVM is the only path through which credentials are obtained and the rest of the code is what you wrote.
Tampering with the tenant-scoped role’s identity policy. An attacker with permission to modify the parent role can broaden it. The whole intersection-with-session-policy guarantee assumes the parent role’s identity policy is the one you wrote and reviewed.
Misconfiguration of the parent role itself. If you grant it dynamodb:* instead of an explicit list of read and write actions, the session policy can only narrow within dynamodb:*. The narrower the parent role, the tighter the maximum reach of any session policy issued from it.

The class of bug TVM does eliminate is application-level: wrong tenant identifier flowing through the data path, wrong table name in a lookup, code paths where a parameter from outside the JWT got trusted too much. The classes it doesn’t eliminate require IAM-level compromise, which is a different problem with different controls.

TVM does not replace application-level authorization

Worth saying explicitly: TVM doesn’t replace application-level authorization. The Cognito-group-based check still runs before any data access, and a regular user with access to org A asking for org B’s data gets a 403 long before any table is queried.

Two layers, both failing closed. The application check turns wrong access into a deliberate decision someone has to make and the audit log has to record. The IAM scope turns it into AccessDenied with no work done.
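The ordering of the two layers can be sketched in a few lines. Everything named here is an assumption for illustration: the "superadmin" group, the tenant_id claim name, and the two injected callables stand in for whatever the real application uses. The only detail taken from Cognito itself is that group membership arrives in the cognito:groups claim.

```python
class Forbidden(Exception):
    """Application-level rejection; the API would map this to a 403."""


def authorize(claims: dict, requested_tenant: str) -> None:
    """Layer 1: is this user allowed to ask for this tenant at all?"""
    if "superadmin" in claims.get("cognito:groups", []):
        return  # privileged users may select any tenant
    if claims.get("tenant_id") == requested_tenant:
        return  # regular users may only select their own
    raise Forbidden(f"not allowed to access tenant {requested_tenant!r}")


def credentials_for_request(claims, requested_tenant, table_arn_for, get_credentials):
    """Run the application check first, then bind the decision to IAM."""
    authorize(claims, requested_tenant)
    # Layer 2: credentials scoped to exactly the tenant that was authorized.
    return get_credentials(table_arn_for(requested_tenant))
```

A rejected request never reaches the STS call, and an accepted one leaves this function holding credentials for one tenant only, including when the caller is a superadmin.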

Implementation notes for the Token Vending Machine pattern

A few things I’d do again:

Don’t put session policies in CDK source. They are computed per-request based on the caller’s tenant context. Putting them in identity policies misses the entire point.
Make the parent role’s identity policy as specific as you can stand. The session policy can only narrow what the parent grants. If you start with dynamodb:* “to be safe”, a well-formed session policy still narrows both resource and action, but the ceiling for every session issued from the role, including one built by buggy code, now includes things like dynamodb:DeleteTable.
Inline session policies are capped at 2048 characters in the Policy parameter. That’s plenty for a single-table policy like the one above, but worth keeping in mind if you find yourself adding conditions or expanding the resource list. Hit the cap and you’d have to switch to managed policy ARNs in PolicyArns, which makes the per-request narrowing story more cumbersome.
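Keep max_session_duration short. We use 1 hour as the role’s max, and request 15 minutes per session. A role’s maximum can be set as high as 12 hours; there is no good reason to. A Lambda assuming this role from its own execution role is role chaining anyway, which STS caps at 1 hour per session regardless of the role’s setting.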
Cache the resulting credentials in module scope, keyed by tenant table, with separate slots for read-only and read-write. STS has account-level rate limits on AssumeRole and adds tens to hundreds of milliseconds per call, so a warm Lambda container doing multiple operations for the same tenant shouldn’t be calling AssumeRole on every one. The per-tenant key keeps a container that just served tenant A from handing those credentials to the next request for tenant B, and the separate slots keep a read-only request from accidentally pulling RW credentials out of the cache. Refresh shortly before expiry.
Log the assume-role calls with correlation IDs that match your application logs. When something goes wrong, you want to correlate “the Lambda invocation for tenant A at 14:32” with “the AssumeRole call that scoped the credentials at 14:32”. CloudTrail has the STS event, your app logs have the request, the correlation ID joins them.
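The caching note above is the one most likely to be implemented subtly wrong, so here is a minimal sketch of it. The names, the 60-second refresh margin, and the injected fetch callable are assumptions for illustration; fetch stands in for a function like get_tenant_scoped_credentials with the role ARN already bound. It relies on the Expiration field of the STS response being a timezone-aware datetime, which is how boto3 returns it.

```python
from datetime import datetime, timezone

# Refresh when less than this much lifetime is left (assumed margin).
_REFRESH_MARGIN_SECONDS = 60

# Module scope: survives across invocations in a warm Lambda container.
# Key is (table ARN, read_write) so tenants and access modes never share.
_credential_cache = {}


def get_cached_credentials(table_arn, read_write, fetch, now=None):
    """Return cached scoped credentials, fetching only when needed."""
    now = now or datetime.now(timezone.utc)
    key = (table_arn, read_write)
    creds = _credential_cache.get(key)
    if creds is not None:
        remaining = (creds["Expiration"] - now).total_seconds()
        if remaining >= _REFRESH_MARGIN_SECONDS:
            return creds
    # Missing or about to expire: call STS again and replace the slot.
    creds = fetch(table_arn, read_write)
    _credential_cache[key] = creds
    return creds
```

The now parameter exists only to make the expiry logic testable without waiting; production callers would leave it out.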

One way to think about all this: tenant isolation is something you can keep pushing down the stack, from application code to IAM identity policy to runtime-scoped session policy. The further you push it, the less the layers above have to be correct for the guarantee to hold. With separate tables per tenant and only single-tenant users, the pushing is largely redundant. The moment you have privileged cross-tenant users, batch jobs, callbacks, or any other path where the tenant is selected as a parameter rather than read off the JWT, the lower layers earn their keep.