Virtual key management

Issue API keys with per-key token budgets and cost tracking (also known as virtual keys).

About

Virtual key management is a common feature in AI gateway solutions that allows you to issue API keys to users or applications, each with independent token budgets and cost tracking. Competitors like LiteLLM and Portkey offer this as a single “virtual keys” abstraction.

Agentgateway achieves the same outcome by composing three existing capabilities:

  • API key authentication: Identify incoming requests by API key
  • Token-based rate limiting: Enforce per-key token budgets
  • Observability metrics: Track per-key spending and usage

This composable approach gives you more flexibility in how you configure and apply virtual key management policies, while maintaining compatibility with standard Kubernetes patterns.

How virtual keys work

Virtual keys combine authentication, rate limiting, and observability to create isolated token budgets for each API key:

    flowchart TD
      A[Request arrives with API key] --> B[Validate API key]
      B --> C[Extract user ID]
      C --> D[Check user's token budget]
      D --> E{Budget available?}
      E -->|Yes| F[Forward to LLM]
      F --> G[Track token usage]
      G --> H[Deduct from budget]
      E -->|No| I[Reject with 429]
      subgraph refill["Budget refills periodically"]
        H
      end

When a request arrives:

  1. Agentgateway validates the API key
  2. The user ID is extracted from the API key’s metadata
  3. The request is checked against the user’s token budget
  4. If budget is available, the request proceeds to the LLM
  5. Token usage is tracked and deducted from the user’s budget
  6. If budget is exhausted, the request is rejected with a 429 status code
  7. Budgets refill at the configured interval (daily, hourly, etc.)
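
The steps above can be sketched as a small in-memory simulation. This is a minimal illustration in Python, not agentgateway's implementation: the `TokenBudget` class, the `handle` function, the fixed-window refill, and the 401 status for an unknown key are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical fixed-window token budget for one user."""
    limit: int             # tokens allowed per window
    window_seconds: float  # refill interval, e.g. 86400 for a daily budget
    used: int = 0
    window_start: float = 0.0

    def allow(self, now: float) -> bool:
        # Step 7: start a fresh window once the interval has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.used = 0
            self.window_start = now
        # Steps 3-4: the request proceeds only while budget remains.
        return self.used < self.limit

    def record(self, tokens: int) -> None:
        # Step 5: token usage is only known after the LLM responds.
        self.used += tokens


def handle(api_keys, budgets, key, response_tokens, now):
    """Return the HTTP status that this simplified flow would produce."""
    user_id = api_keys.get(key)      # steps 1-2: validate key, look up user ID
    if user_id is None:
        return 401                   # assumption: status for an unknown key
    budget = budgets[user_id]
    if not budget.allow(now):
        return 429                   # step 6: budget exhausted
    budget.record(response_tokens)
    return 200


api_keys = {"sk-alice-abc123def456": "alice"}
budgets = {"alice": TokenBudget(limit=100, window_seconds=86400)}

# Eight requests of 19 tokens each within the same day.
statuses = [handle(api_keys, budgets, "sk-alice-abc123def456", 19, now=0.0)
            for _ in range(8)]
print(statuses)  # [200, 200, 200, 200, 200, 200, 429, 429]

# One more request after the window has elapsed.
print(handle(api_keys, budgets, "sk-alice-abc123def456", 19, now=86400.0))  # 200
```

Because usage is recorded after the response, the sixth request starts with 95 tokens used and still completes, and only the seventh is rejected.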

More considerations

Evaluation order: Rate limiting is evaluated before prompt guards (content safety checks). This means that requests rejected by guardrails (403 Forbidden) still consume quota from the user’s token budget. In contrast, authentication (JWT/OPA) is evaluated before rate limiting, so unauthenticated requests do not consume quota.

Multiple policies: When multiple AgentgatewayPolicy resources target the same Gateway or HTTPRoute, one policy silently overwrites the other based on creation order, even though both report ACCEPTED/ATTACHED status. There is no error to indicate that one policy’s settings are not taking effect. To avoid this conflict, combine the settings that apply to the same target into a single policy. For example, this guide puts API key authentication and per-key rate limiting in one policy rather than two.

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up access to the OpenAI LLM provider.

Set up virtual keys

This example creates two virtual keys (for Alice and Bob) with independent daily token budgets. The budget is deliberately small (100 tokens per day) so that you can exhaust it in a few requests and see the enforcement in action. For production-sized budgets, see Advanced configuration.

Create API keys for users

Create an API key secret that stores keys and metadata for each user.

kubectl apply -f- <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-keys
  namespace: agentgateway-system
type: Opaque
stringData:
  alice: |
    {
      "key": "sk-alice-abc123def456",
      "metadata": {
        "user_id": "alice"
      }
    }
  bob: |
    {
      "key": "sk-bob-xyz789uvw012",
      "metadata": {
        "user_id": "bob"
      }
    }
EOF

Review the following table to understand this configuration.

| Setting | Description |
|---|---|
| stringData.<name> | Each key in stringData represents a user. The value is a JSON object containing the API key and metadata. |
| key | The API key value that users include in their Authorization: Bearer header. |
| metadata.user_id | The user identifier extracted by rate limiting policies to enforce per-user budgets. |
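
To make the shape of each Secret entry concrete, the following Python sketch parses the same JSON values and looks up a key from an Authorization header. It only illustrates the data layout: `build_key_index` and `authenticate` are hypothetical helpers, and the gateway's actual lookup logic is not shown in this guide.

```python
import json

# The JSON values stored under stringData in the example Secret.
secret_string_data = {
    "alice": '{"key": "sk-alice-abc123def456", "metadata": {"user_id": "alice"}}',
    "bob": '{"key": "sk-bob-xyz789uvw012", "metadata": {"user_id": "bob"}}',
}


def build_key_index(string_data):
    """Map each API key value to its metadata (hypothetical helper)."""
    index = {}
    for raw in string_data.values():
        entry = json.loads(raw)
        index[entry["key"]] = entry.get("metadata", {})
    return index


def authenticate(index, authorization_header):
    """Return the key's metadata, or None if the header has no known key."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer":
        return None
    return index.get(token)


index = build_key_index(secret_string_data)
print(authenticate(index, "Bearer sk-alice-abc123def456"))  # {'user_id': 'alice'}
print(authenticate(index, "Bearer sk-unknown"))             # None
```

The metadata returned here corresponds to what later policies read through expressions such as apiKey.user_id.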

Configure API key authentication

Create an AgentgatewayPolicy that requires API key authentication for all requests to the gateway. You can source the API keys from a single Secret with secretRef, or from multiple Secrets selected by label with secretSelector. Use secretSelector when you want to spread keys across many Secrets, such as one Secret per team or tenant, instead of maintaining a single Secret.

Reference a single Secret by name. This example uses the llm-api-keys Secret that you created in the previous step.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
EOF

Review the following table to understand this configuration.

| Setting | Description |
|---|---|
| targetRefs | Apply the policy to the entire Gateway so all routes require API keys. |
| apiKeyAuthentication.mode | Set to Strict to require a valid API key for all requests. |
| secretRef.name | References a single Secret containing API keys and user metadata. Use this or secretSelector, not both. |
| secretSelector.matchLabels | Selects all Secrets that carry the given labels, combining their keys. Use instead of secretRef when keys are spread across multiple Secrets. Secret-only. |

Configure per-key token budgets

Update the api-key-auth AgentgatewayPolicy from the previous step to also enforce a per-user token budget.

The policy sends a per-user token cost to the rate limit server. It extracts the user_id from each API key and reports the token usage of each response under that descriptor. The rate limit server holds the actual budget (100 tokens per day per user), which you deploy in the next step.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
    rateLimit:
      global:
        domain: agentgateway
        backendRef:
          kind: Service
          name: ratelimit
          namespace: ratelimit
          port: 8081
        descriptors:
          - entries:
              - name: user_id
                expression: 'apiKey.user_id'
            unit: Tokens
EOF

This example keeps the secretRef authentication from the previous step. If you used secretSelector instead, keep your secretSelector block in place of secretRef.

Review the following table to understand this configuration.

| Setting | Description |
|---|---|
| apiKeyAuthentication | The API key authentication from the previous step. Keeping it in the same policy as the rate limit avoids the silent conflict that occurs when two policies target the same Gateway. |
| rateLimit.global | Use global rate limiting to enforce limits across all agentgateway instances. |
| domain | The rate limit domain. Must match the domain in the rate limit server configuration (agentgateway). |
| backendRef | References the rate limit server Service. Must include kind, name, namespace, and port. This example points at the ratelimit Service in the ratelimit namespace that you deploy in the next step. |
| descriptors[].entries[].name | The name of the descriptor entry. Must match a key in the rate limit server config. Set to user_id to rate limit per user. |
| descriptors[].entries[].expression | CEL expression to extract the user ID from the API key’s metadata. |
| descriptors[].unit | Set to Tokens so the gateway reports each response’s token count as the cost. The rate limit server subtracts that cost from the user’s budget. |

Deploy the rate limit server

Global rate limiting requires an external rate limit server that stores the budgets and maintains the counters. Deploy Redis and the rate limit service as described in Deploy the rate limit service in the global rate limiting guide. That example deploys a ratelimit Service in the ratelimit namespace (the target of the backendRef in the previous step) and configures it with the user_id token-budget descriptor that this guide relies on:

# Excerpt from the rate limit server ConfigMap
domain: agentgateway
descriptors:
  - key: user_id
    rate_limit:
      unit: day
      requests_per_unit: 100   # 100 tokens per day per user

The key (user_id) matches the descriptor name in your token budget policy, and the domain (agentgateway) matches the policy’s domain. The requests_per_unit value is the per-user token budget, because the policy reports token usage with unit: Tokens. To change the budget, edit requests_per_unit in the server config; to change the window, edit unit (second, minute, hour, or day).
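
Because the policy and the server config must agree on the domain and descriptor names, a quick consistency check can catch typos before you test. The following Python sketch is a hypothetical lint, not a tool that ships with agentgateway: it uses plain dictionaries that mirror the two YAML snippets in this guide and only compares top-level descriptor keys.

```python
# Mirrors rateLimit.global from the AgentgatewayPolicy in this guide.
policy = {
    "domain": "agentgateway",
    "descriptors": [
        {"entries": [{"name": "user_id", "expression": "apiKey.user_id"}],
         "unit": "Tokens"},
    ],
}

# Mirrors the excerpt from the rate limit server ConfigMap.
server = {
    "domain": "agentgateway",
    "descriptors": [
        {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 100}},
    ],
}


def check_alignment(policy, server):
    """Return a list of mismatches between policy and server config."""
    problems = []
    if policy["domain"] != server["domain"]:
        problems.append(
            f"domain mismatch: {policy['domain']!r} != {server['domain']!r}")
    server_keys = {d["key"] for d in server["descriptors"]}
    for descriptor in policy["descriptors"]:
        for entry in descriptor["entries"]:
            if entry["name"] not in server_keys:
                problems.append(f"no server descriptor for {entry['name']!r}")
    return problems


print(check_alignment(policy, server))  # []
```

An empty list means the names line up; any entry in the list points at a name that one side defines and the other does not.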

Set up an LLM backend

Create an AgentgatewayBackend that connects to your LLM provider.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: gpt-3.5-turbo
  policies:
    auth:
      secretRef:
        name: openai-secret
EOF

For detailed instructions on creating backends and storing provider API keys, see the API keys guide.

Create a route to the backend

Create an HTTPRoute that routes requests to your LLM backend.

kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: agentgateway-proxy
      namespace: agentgateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /openai
      backendRefs:
        - name: openai
          namespace: agentgateway-system
          group: agentgateway.dev
          kind: AgentgatewayBackend
EOF

Test the virtual keys

The following steps verify API key authentication, routing, and per-key token budget enforcement. Budget enforcement requires the rate limit server that you deployed in Deploy the rate limit server.

  1. Send a request with Alice’s API key. Verify that the request succeeds.

    curl "$INGRESS_GW_ADDRESS/openai" \
      -H "Authorization: Bearer sk-alice-abc123def456" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

    Example successful response:

    {
      "id": "chatcmpl-abc123",
      "object": "chat.completion",
      "created": 1234567890,
      "model": "gpt-3.5-turbo",
      "choices": [{
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "Hello! How can I help you today?"
        },
        "finish_reason": "stop"
      }],
      "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 9,
        "total_tokens": 19
      }
    }

  2. Send several more requests with Alice’s API key until her 100-token daily budget is exhausted. Because the LLM provider returns roughly 20-30 tokens per response, a handful of requests pushes Alice over the budget. The request that crosses the budget still completes; subsequent requests are rejected with a 429 status code.

    for i in $(seq 1 10); do
      STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        "$INGRESS_GW_ADDRESS/openai" \
        -H "Authorization: Bearer sk-alice-abc123def456" \
        -H "Content-Type: application/json" \
        -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}')
      echo "Request $i: HTTP $STATUS"
    done

    Example 429 response:

    HTTP/1.1 429 Too Many Requests
    x-ratelimit-limit: 100
    x-ratelimit-remaining: 0
    x-ratelimit-reset: 43200
    
    rate limit exceeded

  3. Verify that Bob can still send requests with his own budget, independent of Alice’s usage.

    curl "$INGRESS_GW_ADDRESS/openai" \
      -H "Authorization: Bearer sk-bob-xyz789uvw012" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

    Bob’s requests succeed because he has his own independent budget.
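
The same test can be scripted with Python's standard library if you prefer it to a shell loop. This is a sketch under assumptions: `chat_once` and `send_until_rejected` are hypothetical helper names, and the `http://` scheme in the usage comment is a guess about how your gateway is exposed.

```python
import json
import urllib.error
import urllib.request


def chat_once(base_url, api_key, prompt="Hello!"):
    """POST one chat completion; return (status, total_tokens or None)."""
    body = json.dumps({
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    request = urllib.request.Request(
        f"{base_url}/openai",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            usage = json.load(response).get("usage", {})
            return response.status, usage.get("total_tokens")
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status is on the exception.
        return err.code, None


def send_until_rejected(send, max_requests=10):
    """Call send() until it reports 429 or max_requests is reached."""
    results = []
    for _ in range(max_requests):
        status, tokens = send()
        results.append((status, tokens))
        if status == 429:
            break
    return results


# Example wiring (not executed here; it needs a running gateway):
#   import os
#   base = "http://" + os.environ["INGRESS_GW_ADDRESS"]
#   for status, tokens in send_until_rejected(
#           lambda: chat_once(base, "sk-alice-abc123def456")):
#       print(status, tokens)
```

Passing the sender as a callable keeps the loop testable without a cluster, because you can substitute a stub that returns canned statuses.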

Monitor per-key spending

Track token usage and spending for each virtual key by using Prometheus metrics.

By default, the agentgateway token usage metric (agentgateway_gen_ai_client_token_usage) is broken down by dimensions such as the model and token type, but not by user. To attribute usage to each virtual key, add a user_id label to the metrics with a metrics policy, then query Prometheus.

Before you begin

Set up a Prometheus instance to scrape agentgateway metrics. The OpenTelemetry stack guide walks you through the full setup; at a minimum, complete the Prometheus step. The following steps assume the kube-prometheus-stack release exists in the telemetry namespace, as deployed by that guide.

Add a per-user metric label

  1. Create an AgentgatewayPolicy that adds the user_id from each API key as a label on all Prometheus metrics. The frontend.metrics field can only be set on a policy that targets the Gateway.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: per-user-metrics
      namespace: agentgateway-system
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: agentgateway-proxy
      frontend:
        metrics:
          attributes:
            add:
              - name: user_id
                expression: 'apiKey.user_id'
    EOF

    Review the following table to understand this configuration.

    | Setting | Description |
    |---|---|
    | frontend.metrics.attributes.add[].name | The name of the Prometheus label to add (user_id). |
    | frontend.metrics.attributes.add[].expression | A CEL expression that is evaluated per request. Use apiKey.user_id to read the user_id from the authenticated API key. If the expression fails to evaluate (for example, on an unauthenticated request), the label value is set to unknown. |

    The user_id label is high cardinality: every unique value creates a new metric series, which increases Prometheus memory and storage. This is acceptable for tens or hundreds of keys, but avoid attaching unbounded identifiers (such as raw end-user IDs) to metrics at large scale. Prefer lower-cardinality dimensions like tier or team when possible.

  2. Send a few requests with each virtual key so that the metrics have per-user data to report. You can reuse the requests from Test the virtual keys.
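
To get a feel for the cardinality cost of the user_id label, you can estimate how many series a single metric produces. The following Python sketch is a rough back-of-the-envelope calculation. It assumes that the token usage metric is exposed as a Prometheus histogram (which the _sum suffix used in the queries on this page suggests), and the bucket count of 12 is a placeholder, not a documented value.

```python
def estimated_series(users, models, token_types, buckets):
    """Rough series count for one histogram metric.

    Each unique label combination is its own series, and a Prometheus
    histogram exposes one series per bucket plus _sum and _count.
    """
    series_per_label_set = buckets + 2
    return users * models * token_types * series_per_label_set


# The two keys from this guide, one model, three token types.
print(estimated_series(users=2, models=1, token_types=3, buckets=12))     # 84

# A hypothetical large deployment.
print(estimated_series(users=5000, models=3, token_types=3, buckets=12))  # 630000
```

The count grows linearly with the number of distinct user_id values, which is why a bounded label such as tier or team scales better.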

Query per-key usage

  1. Port-forward the Prometheus server from the OpenTelemetry stack.

    kubectl port-forward -n telemetry svc/kube-prometheus-stack-prometheus 9090:9090

    Then open the Prometheus UI at http://localhost:9090/graph and run the following queries, or send them to the HTTP API with curl. For example:

    curl -s http://localhost:9090/api/v1/query \
      --data-urlencode 'query=sum by (user_id) (agentgateway_gen_ai_client_token_usage_sum)'

    Example output:

    {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782410561.391,"720"]},{"metric":{"user_id":"bob"},"value":[1782410561.391,"448"]},{"metric":{"user_id":"alice"},"value":[1782410561.391,"448"]}]}}

  2. Query token usage broken down by user ID. The token usage metric carries a separate series per token type (input, output, input_cache_read), so match both the input and output types in a single selector and sum them, rather than adding two selectors together.

    # Total tokens consumed by user over the last 24 hours
    sum by (user_id) (
      increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
    )
    
    # Percentage of a 100-token daily budget used (adjust the divisor to match your budget)
    (sum by (user_id) (
      increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
    ) / 100) * 100

    Each result series is labeled with a user_id, such as alice and bob. If a key is missing the user_id field, or the request is not attributed to a key, its usage appears under user_id="unknown".

    Example output:

    {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782411002.488,"0"]},{"metric":{"user_id":"bob"},"value":[1782411002.488,"372.2787929364588"]},{"metric":{"user_id":"alice"},"value":[1782411002.488,"309.56920815395927"]}]}}
    
    {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782411059.867,"0"]},{"metric":{"user_id":"bob"},"value":[1782411059.867,"370.95800165527817"]},{"metric":{"user_id":"alice"},"value":[1782411059.867,"307.9427844448483"]}]}}

    increase() and rate() need at least two samples within the time range to report a value, so a brand-new user_id series shows no result until it has been scraped a few times under continuous traffic. For a quick instant check, query the cumulative counter directly: sum by (user_id) (agentgateway_gen_ai_client_token_usage_sum).

  3. Calculate costs per user by multiplying token counts by your provider’s pricing. Input and output tokens are usually priced differently, so reduce each token type to a per-user series with sum by (user_id) before adding them, which keeps the two sides matchable. Because rate() returns a per-second average, the following query yields the average cost per second over the window. Multiply the result by 86400 for a daily cost, or use increase() in place of rate() to get the total for the window directly.

    # Average cost per second per user over the last 24 hours
    # (assuming $0.50 per 1M input tokens, $1.50 per 1M output tokens)
    sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000 * 0.50
    +
    sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000 * 1.50

    Example output:

    {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782410758.432,"0"]},{"metric":{"user_id":"bob"},"value":[1782410758.432,"6.101636101191084e-09"]},{"metric":{"user_id":"alice"},"value":[1782410758.432,"5.106526900820178e-09"]}]}}
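
If you post-process query results outside Prometheus, the JSON returned by the HTTP API is straightforward to reduce to per-user numbers. The following Python sketch parses the instant-query output shown in step 1 and applies the same example prices used in the PromQL above. The helper names and the 300/148 input/output split are hypothetical and chosen only to illustrate the arithmetic.

```python
import json

# Example response copied from the instant query in step 1.
raw = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{},"value":[1782410561.391,"720"]},'
    '{"metric":{"user_id":"bob"},"value":[1782410561.391,"448"]},'
    '{"metric":{"user_id":"alice"},"value":[1782410561.391,"448"]}]}}'
)


def tokens_by_user(response_text):
    """Reduce a Prometheus instant-vector response to {user_id: value}."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    totals = {}
    for series in payload["data"]["result"]:
        user = series["metric"].get("user_id", "(no user_id label)")
        _timestamp, value = series["value"]
        totals[user] = float(value)  # sample values arrive as strings
    return totals


def cost_usd(input_tokens, output_tokens,
             input_price_per_million=0.50, output_price_per_million=1.50):
    """Price input and output tokens separately, then add them."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)


def per_second_to_daily(value_per_second):
    """Convert a rate()-based per-second value to a 24-hour total."""
    return value_per_second * 86_400


print(tokens_by_user(raw))
print(round(cost_usd(input_tokens=300, output_tokens=148), 6))  # 0.000372
```

The `per_second_to_daily` helper covers the case where the value came from a rate() query, which reports a per-second average rather than a total.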

For more information on cost tracking, see the cost tracking guide.

Advanced configuration

Tiered budgets based on user type

Provide different budget tiers for free, standard, and premium users.

  1. Add tier metadata to each API key in the Secret.

    apiVersion: v1
    kind: Secret
    metadata:
      name: llm-api-keys
      namespace: agentgateway-system
    type: Opaque
    stringData:
      alice: |
        {
          "key": "sk-alice-abc123def456",
          "metadata": {
            "user_id": "alice",
            "tier": "premium"
          }
        }
      charlie: |
        {
          "key": "sk-charlie-ghi345jkl678",
          "metadata": {
            "user_id": "charlie",
            "tier": "free"
          }
        }
  2. Configure rate limiting to use the tier and user_id from API key metadata.

    traffic:
      rateLimit:
        global:
          domain: agentgateway
          backendRef:
            kind: Service
            name: ratelimit
            namespace: ratelimit
            port: 8081
          descriptors:
            - entries:
                - name: tier
                  expression: 'apiKey.tier'
                - name: user_id
                  expression: 'apiKey.user_id'
              unit: Tokens
  3. Configure the rate limit server with tier-based budgets.

    domain: agentgateway
    descriptors:
      - key: tier
        value: "free"
        descriptors:
          - key: user_id
            rate_limit:
              unit: day
              requests_per_unit: 10000  # 10K tokens/day for free tier
      - key: tier
        value: "standard"
        descriptors:
          - key: user_id
            rate_limit:
              unit: day
              requests_per_unit: 100000  # 100K tokens/day for standard tier
      - key: tier
        value: "premium"
        descriptors:
          - key: user_id
            rate_limit:
              unit: day
              requests_per_unit: 500000  # 500K tokens/day for premium tier
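
To see how the nested configuration resolves to a budget, the following Python sketch walks the same structure for a given tier and user. It is a simplified model, not the rate limit server's matching algorithm: `resolve_limit` is a hypothetical function, and it assumes that a descriptor without a value matches any value, which is how the user_id entries in this guide are used.

```python
SERVER_DESCRIPTORS = [
    {"key": "tier", "value": "free", "descriptors": [
        {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 10000}}]},
    {"key": "tier", "value": "standard", "descriptors": [
        {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 100000}}]},
    {"key": "tier", "value": "premium", "descriptors": [
        {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": 500000}}]},
]


def resolve_limit(descriptors, entries):
    """Follow ordered (name, value) entries through nested descriptors.

    Returns the matching rate_limit dict, or None if nothing matches.
    """
    if not entries:
        return None
    (name, value), rest = entries[0], entries[1:]
    for descriptor in descriptors:
        if descriptor["key"] != name:
            continue
        # A descriptor with an explicit value must match it exactly.
        if "value" in descriptor and descriptor["value"] != value:
            continue
        if rest:
            return resolve_limit(descriptor.get("descriptors", []), rest)
        return descriptor.get("rate_limit")
    return None


# Entry order matches the policy: tier first, then user_id.
print(resolve_limit(SERVER_DESCRIPTORS, [("tier", "premium"), ("user_id", "alice")]))
print(resolve_limit(SERVER_DESCRIPTORS, [("tier", "free"), ("user_id", "charlie")]))
print(resolve_limit(SERVER_DESCRIPTORS, [("tier", "gold"), ("user_id", "dave")]))  # None
```

In this model, a key whose tier value is not listed in the server config resolves to no limit at all, so it is worth confirming how your rate limit server treats unmatched descriptors.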

Hourly budget limits

Set a smaller budget that refreshes every hour for tighter cost control.

# In the ratelimit-config ConfigMap
domain: agentgateway
descriptors:
  - key: user_id
    rate_limit:
      unit: hour
      requests_per_unit: 10000  # 10,000 tokens per hour

Multi-tenant virtual keys

Create virtual keys scoped to both user and tenant for multi-tenant applications. Add tenant_id to the API key metadata.

# In the AgentgatewayPolicy
descriptors:
  - entries:
      - name: tenant_id
        expression: 'apiKey.tenant_id'
      - name: user_id
        expression: 'apiKey.user_id'
    unit: Tokens

# In the ratelimit-config ConfigMap
domain: agentgateway
descriptors:
  - key: tenant_id
    descriptors:
      - key: user_id
        rate_limit:
          unit: day
          requests_per_unit: 50000

For more advanced rate limiting patterns, see the budget and spend limits guide.

Cleanup

You can remove the resources that you created in this guide.

kubectl delete AgentgatewayPolicy api-key-auth per-user-metrics -n agentgateway-system --ignore-not-found
kubectl delete secret llm-api-keys -n agentgateway-system
kubectl delete httproute openai -n agentgateway-system
kubectl delete AgentgatewayBackend openai -n agentgateway-system

To remove the rate limit server, follow the cleanup steps in the global rate limiting guide.
