Virtual key management

Issue API keys with per-key token budgets and cost tracking (also known as virtual keys).

About

Virtual key management is a common feature in AI gateway solutions that allows you to issue API keys to users or applications, each with independent token budgets and cost tracking. Competitors like LiteLLM and Portkey offer this as a single “virtual keys” abstraction.

Agentgateway achieves the same outcome by composing three existing capabilities:

API key authentication: Identify incoming requests by API key
Token-based rate limiting: Enforce per-key token budgets
Observability metrics: Track per-key spending and usage

This composable approach gives you more flexibility in how you configure and apply virtual key management policies, while maintaining compatibility with standard Kubernetes patterns.

How virtual keys work

Virtual keys combine authentication, rate limiting, and observability to create isolated token budgets for each API key:

flowchart TD
  A[Request arrives with API key] --> B[Validate API key]
  B --> C[Extract user ID]
  C --> D[Check user's token budget]
  D --> E{Budget available?}
  E -->|Yes| F[Forward to LLM]
  F --> G[Track token usage]
  G --> H[Deduct from budget]
  E -->|No| I[Reject with 429]
  subgraph refill["Budget refills periodically"]
    H
  end

When a request arrives:

Agentgateway validates the API key
The user ID is read from the metadata of the matching API key
The request is checked against the user’s token budget
If budget is available, the request proceeds to the LLM
Token usage is tracked and deducted from the user’s budget
If budget is exhausted, the request is rejected with a 429 status code
Budgets refill at the configured interval (daily, hourly, etc.)
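
The flow above can be modeled for a single key in a few lines of shell. This is a simplified sketch, not the gateway's implementation: simulate_budget is a hypothetical helper, and it assumes that a request is admitted whenever the tokens consumed so far are still below the budget, with the token cost only known (and deducted) after the response.

```shell
# Hypothetical model of one key's budget; not agentgateway code.
# Usage: simulate_budget BUDGET COST [COST...]
# Assumption: a request is admitted while used < BUDGET, and its token
# cost is deducted only after the response arrives.
simulate_budget() (
  budget=$1
  shift
  used=0
  for cost in "$@"; do
    if [ "$used" -lt "$budget" ]; then
      used=$((used + cost))
      echo "200 used=$used"
    else
      echo "429 used=$used"
    fi
  done
)

# A 100-token budget and six requests with varying token costs.
simulate_budget 100 19 25 30 28 22 20
```

Under these assumptions, the fourth request crosses the budget and still completes, and the fifth and sixth are rejected.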

More considerations

Evaluation order: Rate limiting is evaluated before prompt guards (content safety checks). This means that requests rejected by guardrails (403 Forbidden) still consume quota from the user’s token budget. In contrast, authentication (JWT/OPA) is evaluated before rate limiting, so unauthenticated requests do not consume quota.
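
The evaluation order can be pictured as a chain of checks. The following sketch only restates the paragraph above as code: evaluate is a hypothetical function, its yes/no arguments stand in for the real checks, and the messages are illustrative rather than actual gateway output.

```shell
# Hypothetical model of the evaluation order described above; not agentgateway code.
# Usage: evaluate AUTH_OK BUDGET_OK GUARD_OK   (each "yes" or "no")
evaluate() {
  if [ "$1" != yes ]; then
    # Authentication runs first, so the rate limiter never sees this request.
    echo "rejected by authentication, quota untouched"
  elif [ "$2" != yes ]; then
    echo "429 from rate limiting"
  elif [ "$3" != yes ]; then
    # Rate limiting already ran, so the guardrail rejection still counts.
    echo "403 from prompt guard, quota consumed"
  else
    echo "forwarded to LLM, quota consumed"
  fi
}

evaluate no yes yes
evaluate yes yes no
```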

Multiple policies: When multiple AgentgatewayPolicy resources target the same Gateway or HTTPRoute, one policy silently overwrites the other based on creation order, even though both report ACCEPTED/ATTACHED status. There is no error to indicate that one policy’s settings are not taking effect. To avoid this conflict, combine the settings that apply to the same target into a single policy. For example, this guide puts API key authentication and per-key rate limiting in one policy rather than two.

Before you begin

Set up an agentgateway proxy.
Set up access to the OpenAI LLM provider.

Set up virtual keys

This example creates two virtual keys (for Alice and Bob) with independent daily token budgets. The budget is deliberately small (100 tokens per day) so that you can exhaust it in a few requests and see the enforcement in action. For production-sized budgets, see Advanced configuration.

Create API keys for users

Create an API key secret that stores keys and metadata for each user.

kubectl apply -f- <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-keys
  namespace: agentgateway-system
type: Opaque
stringData:
  alice: |
    {
      "key": "sk-alice-abc123def456",
      "metadata": {
        "user_id": "alice"
      }
    }
  bob: |
    {
      "key": "sk-bob-xyz789uvw012",
      "metadata": {
        "user_id": "bob"
      }
    }
EOF

Review the following table to understand this configuration.

Setting	Description
`stringData.<name>`	Each key in `stringData` represents a user. The value is a JSON object containing the API key and metadata.
`key`	The API key value that users include in their `Authorization: Bearer` header.
`metadata.user_id`	The user identifier extracted by rate limiting policies to enforce per-user budgets.
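
The example uses fixed, readable keys. If you want to generate your own, something like the following could work. Both gen_virtual_key and key_entry are hypothetical helpers, and the sk-<user>-<random> format simply mirrors the example keys above; this guide does not state that agentgateway requires it.

```shell
# Hypothetical helper: print a random key of the form sk-<user>-<24 hex characters>.
gen_virtual_key() {
  printf 'sk-%s-%s\n' "$1" "$(head -c 12 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

# Hypothetical helper: print the JSON value stored under stringData.<name>.
# Usage: key_entry USER_ID KEY
key_entry() {
  printf '{"key": "%s", "metadata": {"user_id": "%s"}}\n' "$2" "$1"
}

# Print a ready-to-paste entry for alice with a freshly generated key.
key_entry alice "$(gen_virtual_key alice)"
```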

Configure API key authentication

Create an AgentgatewayPolicy that requires API key authentication for all requests to the gateway. You can source the API keys from a single Secret with secretRef, or from multiple Secrets selected by label with secretSelector. Use secretSelector when you want to spread keys across many Secrets, such as one Secret per team or tenant, instead of maintaining a single Secret.

With secretRef, reference a single Secret by name. This example uses the llm-api-keys Secret that you created in the previous step.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
EOF

With secretSelector, select all Secrets that carry a particular label. Every matching Secret contributes its keys to the same key set, so you do not need to consolidate keys into one Secret. Label each Secret that holds virtual keys, for example:

kubectl label secret llm-api-keys -n agentgateway-system agentgateway.dev/apikey=true

Then reference the label with secretSelector instead of secretRef.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretSelector:
        matchLabels:
          agentgateway.dev/apikey: "true"
EOF

secretSelector matches Secret resources only. Keep key identifiers unique across the selected Secrets: if the same key is defined in more than one Secret, the behavior is undefined.
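
One way to catch collisions before you apply the policy is to list the key identifiers from all labeled Secrets and look for repeats. The sketch below covers only the comparison step: find_duplicate_ids is a hypothetical helper, the input values are illustrative, and how you export the identifiers from your cluster is left to you.

```shell
# Hypothetical helper: read one key identifier per line on stdin and print
# every identifier that appears more than once.
find_duplicate_ids() {
  sort | uniq -d
}

# Identifiers collected from two Secrets (illustrative values).
printf '%s\n' alice bob carol bob | find_duplicate_ids
```

An empty result means that no identifier is repeated in the list you supplied.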

Review the following table to understand this configuration.

Setting	Description
`targetRefs`	Apply the policy to the entire Gateway so all routes require API keys.
`apiKeyAuthentication.mode`	Set to `Strict` to require a valid API key for all requests.
`secretRef.name`	References a single Secret containing API keys and user metadata. Use this or `secretSelector`, not both.
`secretSelector.matchLabels`	Selects all Secrets that carry the given labels, combining their keys. Use instead of `secretRef` when keys are spread across multiple Secrets. Secret-only.

Configure per-key token budgets

Update the api-key-auth AgentgatewayPolicy from the previous step to also enforce a per-user token budget.

The policy sends a per-user token cost to the rate limit server. It extracts the user_id from each API key and reports the token usage of each response under that descriptor. The rate limit server holds the actual budget (100 tokens per day per user), which you deploy in the next step.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
    rateLimit:
      global:
        domain: agentgateway
        backendRef:
          kind: Service
          name: ratelimit
          namespace: ratelimit
          port: 8081
        descriptors:
          - entries:
              - name: user_id
                expression: 'apiKey.user_id'
            unit: Tokens
EOF

This example keeps the secretRef authentication from the previous step. If you used secretSelector instead, keep your secretSelector block in place of secretRef.

Review the following table to understand this configuration.

Setting	Description
`apiKeyAuthentication`	The API key authentication from the previous step. Keeping it in the same policy as the rate limit avoids the silent conflict that occurs when two policies target the same Gateway.
`rateLimit.global`	Use global rate limiting to enforce limits across all agentgateway instances.
`domain`	The rate limit domain. Must match the `domain` in the rate limit server configuration (`agentgateway`).
`backendRef`	References the rate limit server Service. Must include `kind`, `name`, `namespace`, and `port`. This example points at the `ratelimit` Service in the `ratelimit` namespace that you deploy in the next step.
`descriptors[].entries[].name`	The name of the descriptor entry. Must match a `key` in the rate limit server config. Set to `user_id` to rate limit per user.
`descriptors[].entries[].expression`	CEL expression to extract the user ID from the API key’s metadata.
`descriptors[].unit`	Set to `Tokens` so the gateway reports each response’s token count as the cost. The rate limit server subtracts that cost from the user’s budget.

Deploy the rate limit server

Global rate limiting requires an external rate limit server that stores the budgets and maintains the counters. Deploy Redis and the rate limit service as described in the "Deploy the rate limit service" section of the global rate limiting guide. That example deploys a ratelimit Service in the ratelimit namespace (the target of the backendRef in the previous step) and configures it with the user_id token-budget descriptor that this guide relies on:

# Excerpt from the rate limit server ConfigMap
domain: agentgateway
descriptors:
  - key: user_id
    rate_limit:
      unit: day
      requests_per_unit: 100   # 100 tokens per day per user

The key (user_id) matches the descriptor name in your token budget policy, and the domain (agentgateway) matches the policy’s domain. The requests_per_unit value is the per-user token budget, because the policy reports token usage with unit: Tokens. To change the budget, edit requests_per_unit in the server config; to change the window, edit unit (second, minute, hour, or day).
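
When you pick a value for requests_per_unit, it can help to estimate how many requests a budget allows. The following is a rough sketch: requests_before_429 is a hypothetical helper, and it assumes that every request costs the same number of tokens and that a request is admitted while the consumed total is still below the budget.

```shell
# Hypothetical helper: number of requests that succeed before the budget is exhausted.
# Usage: requests_before_429 BUDGET TOKENS_PER_REQUEST
# Assumption: uniform cost per request; the request that crosses the budget
# still completes, so the result is BUDGET / TOKENS_PER_REQUEST rounded up.
requests_before_429() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

requests_before_429 100 19   # 100-token budget, 19 tokens per request
requests_before_429 100 30   # same budget, 30 tokens per request
```

Real responses vary in length, so treat the result as an order-of-magnitude estimate.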

Set up an LLM backend

Create an AgentgatewayBackend that connects to your LLM provider.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: gpt-3.5-turbo
  policies:
    auth:
      secretRef:
        name: openai-secret
EOF

For detailed instructions on creating backends and storing provider API keys, see the API keys guide.

Create a route to the backend

Create an HTTPRoute that routes requests to your LLM backend.

kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: agentgateway-proxy
      namespace: agentgateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /openai
      backendRefs:
        - name: openai
          namespace: agentgateway-system
          group: agentgateway.dev
          kind: AgentgatewayBackend
EOF

Test the virtual keys

The following steps verify API key authentication, routing, and per-key token budget enforcement. Budget enforcement requires the rate limit server from the previous step.

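Send a request with Alice’s API key and verify that it succeeds. Use the first command if the gateway address is stored in $INGRESS_GW_ADDRESS, or the second if you reach the gateway on localhost:8080 (for example, through a port-forward).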

curl "$INGRESS_GW_ADDRESS/openai" \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

curl "localhost:8080/openai" \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Example successful response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "gpt-3.5-turbo",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  }
}

Send several more requests with Alice’s API key until her 100-token daily budget is exhausted. Because the LLM provider returns roughly 20-30 tokens per response, a handful of requests pushes Alice over the budget. The request that crosses the budget still completes; subsequent requests are rejected with a 429 status code.

for i in $(seq 1 10); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    "$INGRESS_GW_ADDRESS/openai" \
    -H "Authorization: Bearer sk-alice-abc123def456" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}')
  echo "Request $i: HTTP $STATUS"
done

Example 429 response:

HTTP/1.1 429 Too Many Requests
x-ratelimit-limit: 100
x-ratelimit-remaining: 0
x-ratelimit-reset: 43200

rate limit exceeded
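
To see at a glance how many requests succeeded before the budget ran out, you can tally the status codes that the loop prints. This is a small sketch: tally_statuses is a hypothetical helper that expects the "Request N: HTTP CODE" line format used by the loop above, and the input shown here is illustrative.

```shell
# Hypothetical helper: count how many requests returned each HTTP status.
# Expects lines in the "Request N: HTTP CODE" format that the loop prints.
tally_statuses() {
  awk '{ count[$NF]++ } END { for (code in count) print code, count[code] }' | sort
}

# Illustrative input; in practice, pipe the loop's output into tally_statuses.
{
  printf 'Request %s: HTTP 200\n' 1 2 3 4
  printf 'Request %s: HTTP 429\n' 5 6
} | tally_statuses
```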

Verify that Bob can still send requests with his own budget, independent of Alice’s usage. Use whichever of the two commands matches how you reach the gateway.

curl "$INGRESS_GW_ADDRESS/openai" \
  -H "Authorization: Bearer sk-bob-xyz789uvw012" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

curl "localhost:8080/openai" \
  -H "Authorization: Bearer sk-bob-xyz789uvw012" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Bob’s requests succeed because he has his own independent budget.

Monitor per-key spending

Track token usage and spending for each virtual key by using Prometheus metrics.

By default, the agentgateway token usage metric (agentgateway_gen_ai_client_token_usage) is broken down by dimensions such as the model and token type, but not by user. To attribute usage to each virtual key, add a user_id label to the metrics with a metrics policy, then query Prometheus.

Before you begin

Set up a Prometheus instance to scrape agentgateway metrics. The OpenTelemetry stack guide walks you through the full setup; at a minimum, complete the Prometheus step. The following steps assume the kube-prometheus-stack release exists in the telemetry namespace, as deployed by that guide.

Add a per-user metric label

Create an AgentgatewayPolicy that adds the user_id from each API key as a label on all Prometheus metrics. The frontend.metrics field can only be set on a policy that targets the Gateway.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: per-user-metrics
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: agentgateway-proxy
  frontend:
    metrics:
      attributes:
        add:
          - name: user_id
            expression: 'apiKey.user_id'
EOF

Review the following table to understand this configuration.

Setting	Description
`frontend.metrics.attributes.add[].name`	The name of the Prometheus label to add (`user_id`).
`frontend.metrics.attributes.add[].expression`	A CEL expression that is evaluated per request. Use `apiKey.user_id` to read the `user_id` from the authenticated API key. If the expression fails to evaluate (for example, on an unauthenticated request), the label value is set to `unknown`.

The user_id label is high cardinality: every unique value creates a new metric series, which increases Prometheus memory and storage. This is acceptable for tens or hundreds of keys, but avoid attaching unbounded identifiers (such as raw end-user IDs) to metrics at large scale. Prefer lower-cardinality dimensions like tier or team when possible.
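
To get a feel for the cardinality, you can multiply the number of distinct values of each label. The sketch below is a back-of-the-envelope estimate only: estimate_series is a hypothetical helper, the counts are illustrative, and it assumes that every combination of label values actually occurs. It estimates one series name, such as the _sum series that this guide queries.

```shell
# Hypothetical estimate: one series per combination of label values.
# Usage: estimate_series KEYS MODELS TOKEN_TYPES
estimate_series() {
  echo $(( $1 * $2 * $3 ))
}

estimate_series 200 3 3   # 200 keys, 3 models, 3 token types
```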

Send a few requests with each virtual key so that the metrics have per-user data to report. You can reuse the requests from Test the virtual keys.

Query per-key usage

Port-forward the Prometheus server from the OpenTelemetry stack.

kubectl port-forward -n telemetry svc/kube-prometheus-stack-prometheus 9090:9090

Then open the Prometheus UI at http://localhost:9090/graph and run the following queries, or send them to the HTTP API with curl. For example:

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (user_id) (agentgateway_gen_ai_client_token_usage_sum)'

Example output:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782410561.391,"720"]},{"metric":{"user_id":"bob"},"value":[1782410561.391,"448"]},{"metric":{"user_id":"alice"},"value":[1782410561.391,"448"]}]}}

Query token usage broken down by user ID. The token usage metric carries a separate series per token type (input, output, input_cache_read), so match both the input and output types in a single selector and sum them, rather than adding two selectors together.

# Total tokens consumed by user over the last 24 hours
sum by (user_id) (
  increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
)

# Percentage of a 100-token daily budget used (adjust the divisor to match your budget)
(sum by (user_id) (
  increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
) / 100) * 100

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (user_id) (increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h]))'

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(sum by (user_id) (increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])) / 100) * 100'

Each result series is labeled with a user_id, such as alice and bob. If a key is missing the user_id field, or the request is not attributed to a key, its usage appears under user_id="unknown".

Example output:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782411002.488,"0"]},{"metric":{"user_id":"bob"},"value":[1782411002.488,"372.2787929364588"]},{"metric":{"user_id":"alice"},"value":[1782411002.488,"309.56920815395927"]}]}}

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782411059.867,"0"]},{"metric":{"user_id":"bob"},"value":[1782411059.867,"370.95800165527817"]},{"metric":{"user_id":"alice"},"value":[1782411059.867,"307.9427844448483"]}]}}

increase() and rate() need at least two samples within the time range to report a value, so a brand-new user_id series shows no result until it has been scraped a few times under continuous traffic. For a quick instant check, query the cumulative counter directly: sum by (user_id) (agentgateway_gen_ai_client_token_usage_sum).

Calculate costs per user by multiplying token counts by your provider’s pricing. Input and output tokens are usually priced differently, so reduce each token type to a per-user series with sum by (user_id) before adding them, which keeps the two sides matchable.

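# Average cost per second per user over the last 24 hours (assuming $0.50 per 1M input tokens, $1.50 per 1M output tokens)
# rate() returns a per-second average; use increase() instead of rate() to get the total cost for the 24-hour window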
sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000 * 0.50
+
sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000 * 1.50

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000 * 0.50 + sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000 * 1.50'

Example output:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782410758.432,"0"]},{"metric":{"user_id":"bob"},"value":[1782410758.432,"6.101636101191084e-09"]},{"metric":{"user_id":"alice"},"value":[1782410758.432,"5.106526900820178e-09"]}]}}
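
The same pricing arithmetic can be checked by hand for a known token count. The following sketch is not part of agentgateway: token_cost is a hypothetical helper, the token counts are illustrative, and the prices are the example prices assumed in the query above.

```shell
# Hypothetical helper: dollar cost for a number of input and output tokens.
# Usage: token_cost INPUT_TOKENS OUTPUT_TOKENS INPUT_PRICE_PER_1M OUTPUT_PRICE_PER_1M
token_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" \
    'BEGIN { printf "%.6f\n", in_tok / 1000000 * in_price + out_tok / 1000000 * out_price }'
}

# 200 input tokens and 172 output tokens at the example prices.
token_cost 200 172 0.50 1.50
```

Input and output tokens are priced separately and then added, which is the same structure as the PromQL expression.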

For more information on cost tracking, see the cost tracking guide.

Advanced configuration

Tiered budgets based on user type

Provide different budget tiers for free, standard, and premium users.

Add tier metadata to each API key in the Secret.

apiVersion: v1
kind: Secret
metadata:
  name: llm-api-keys
  namespace: agentgateway-system
type: Opaque
stringData:
  alice: |
    {
      "key": "sk-alice-abc123def456",
      "metadata": {
        "user_id": "alice",
        "tier": "premium"
      }
    }
  charlie: |
    {
      "key": "sk-charlie-ghi345jkl678",
      "metadata": {
        "user_id": "charlie",
        "tier": "free"
      }
    }

Configure rate limiting to use the tier and user_id from API key metadata.

traffic:
  rateLimit:
    global:
      domain: agentgateway
      backendRef:
        kind: Service
        name: ratelimit
        namespace: ratelimit
        port: 8081
      descriptors:
        - entries:
            - name: tier
              expression: 'apiKey.tier'
            - name: user_id
              expression: 'apiKey.user_id'
          unit: Tokens

Configure the rate limit server with tier-based budgets.

domain: agentgateway
descriptors:
  - key: tier
    value: "free"
    descriptors:
      - key: user_id
        rate_limit:
          unit: day
          requests_per_unit: 10000  # 10K tokens/day for free tier
  - key: tier
    value: "standard"
    descriptors:
      - key: user_id
        rate_limit:
          unit: day
          requests_per_unit: 100000  # 100K tokens/day for standard tier
  - key: tier
    value: "premium"
    descriptors:
      - key: user_id
        rate_limit:
          unit: day
          requests_per_unit: 500000  # 500K tokens/day for premium tier
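
The effect of the tier configuration can be read as a lookup from tier to daily budget. The sketch below only mirrors the values in the config above: tier_budget is a hypothetical helper, and what actually happens to a tier that has no entry depends on the rate limit server, so the sketch just reports that case.

```shell
# Hypothetical lookup that mirrors the tier budgets in the config above.
# Usage: tier_budget TIER
tier_budget() {
  case "$1" in
    free)     echo 10000 ;;
    standard) echo 100000 ;;
    premium)  echo 500000 ;;
    *)        echo "no matching entry" ;;
  esac
}

tier_budget premium   # alice in the example Secret
tier_budget free      # charlie in the example Secret
```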

Hourly budget limits

Set a smaller budget that refreshes every hour for tighter cost control.

# In the ratelimit-config ConfigMap
domain: agentgateway
descriptors:
  - key: user_id
    rate_limit:
      unit: hour
      requests_per_unit: 10000  # 10,000 tokens per hour

Multi-tenant virtual keys

Create virtual keys scoped to both user and tenant for multi-tenant applications. Add tenant_id to the API key metadata.

# In the AgentgatewayPolicy (spec.traffic.rateLimit.global)
descriptors:
  - entries:
      - name: tenant_id
        expression: 'apiKey.tenant_id'
      - name: user_id
        expression: 'apiKey.user_id'
    unit: Tokens

# In the ratelimit-config ConfigMap
domain: agentgateway
descriptors:
  - key: tenant_id
    descriptors:
      - key: user_id
        rate_limit:
          unit: day
          requests_per_unit: 50000
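
Because the descriptor combines tenant_id and user_id, usage is counted per pair rather than per user. The following sketch illustrates that idea with made-up numbers: usage_by_tenant_user is a hypothetical helper that models the grouping only, not the rate limit server's storage.

```shell
# Hypothetical model: sum token usage per (tenant_id, user_id) pair.
# Expects "TENANT USER TOKENS" on each input line.
usage_by_tenant_user() {
  awk '{ used[$1 "/" $2] += $3 } END { for (k in used) print k, used[k] }' | sort
}

# Illustrative usage records: alice appears under two tenants.
printf '%s\n' 'acme alice 40' 'globex alice 25' 'acme alice 30' 'acme bob 10' |
  usage_by_tenant_user
```

In this model, alice under acme and alice under globex accumulate usage separately.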

For more advanced rate limiting patterns, see the budget and spend limits guide.

Cleanup

You can remove the resources that you created in this guide.

kubectl delete AgentgatewayPolicy api-key-auth per-user-metrics -n agentgateway-system --ignore-not-found
kubectl delete secret llm-api-keys -n agentgateway-system
kubectl delete httproute openai -n agentgateway-system
kubectl delete AgentgatewayBackend openai -n agentgateway-system

To remove the rate limit server, follow the cleanup steps in the global rate limiting guide.

What’s next

Manage API keys for detailed authentication configuration
Budget and spend limits for advanced rate limiting patterns
Track costs per request for cost calculation and monitoring
Set up observability to view token usage metrics and logs
