
qzira AI API Gateway
Technical Reference

Manage and unify multiple AI models with a single API key

Version 2.4 | Last updated: 2026-02-17

qzira is an AI API gateway that lets you manage multiple AI providers — OpenAI, Anthropic, Google AI, and DeepSeek — through a single unified endpoint.

BYOK (Bring Your Own Key) — use your existing API keys as-is. qzira handles request proxying, usage visibility, failover, and rate limiting automatically.

Why use qzira?

  • Cost visibility: Monitor request counts and token consumption in real time via the dashboard
  • Failover: Automatically switch to another provider when one goes down
  • Auto retry: qzira handles 429 retries on your behalf
  • Unified key management: Store all provider keys in qzira; your app only needs one gw_ key
  • Budget alerts & auto-stop: Prevent runaway AI agent costs
  • Usage export: Download logs as CSV for reporting or expense tracking NEW

Architecture overview

qzira sits as a proxy between your application and AI providers. One gw_ key grants access to all providers.

[Architecture diagram] Your app (Cursor / Cline / SDK) sends requests over HTTPS using a single gw_xxxxxxxx key to the qzira Gateway (🔑 API key auth & management, 💰 budget control & auto-stop, 📊 cache & logs, 🔄 failover, 🛡️ rate limit & retry), which forwards them to OpenAI (your key: sk-proj-...), Anthropic (your key: sk-ant-...), and Google AI (your key: AIza...). BYOK: each provider key belongs to you.
ℹ️ qzira is a proxy. Requests pass through qzira's servers and are forwarded to the provider. qzira does not issue its own AI keys — it securely manages and uses the keys you bring from each provider (BYOK).

⚠️ About This Documentation (As-Is)
This documentation is provided on an as-is basis. AI services evolve rapidly, and each provider's API specifications, pricing, and limits may change without notice. The content reflects information at the time of writing and does not permanently guarantee future behavior. Always check each provider's official documentation for the latest information.

1. Account Setup

1. Access the dashboard

Go to https://app.qzira.com.

2. Sign in with Google

Click "Sign in with Google" and authenticate with your Google account.

ℹ️ You will be asked to agree to the Terms of Service when signing in for the first time.

2. Create API Key

After logging in, click "API Keys" in the sidebar.

1. Create a new API key

Click "Create new API key" and give it a name (e.g., my-app-key, cursor-dev).

2. Copy your API key

Copy the gw_xxxxxxxx key shown immediately after creation and store it securely; it is displayed only once.

⚠️ Important: The API key is shown only once at creation. If lost, issue a new key or use the rotation feature to regenerate it.
💡 This key becomes your unified gateway key for all providers. Regular key rotation is recommended (see Section 11).

3. Register Provider API Keys (BYOK)

Click "Providers" in the sidebar.

Supported Providers

Provider  | Where to get your key               | Key format
OpenAI    | platform.openai.com/api-keys        | sk-...
Anthropic | console.anthropic.com/settings/keys | sk-ant-...
Google AI | aistudio.google.com/apikey          | AIza...
DeepSeek  | platform.deepseek.com/api_keys      | sk-...

Registration steps

  1. Click "Register API key" for the provider you want to use
  2. Enter your API key
  3. Click "Register" — qzira will automatically validate the key
🔒 Registered API keys are stored encrypted and used only for proxying requests. All plans support registering keys for all four providers.

4. Enable Providers

Registering a provider key alone does not enable routing through that provider. You also need to enable it.

How to enable

In the Providers screen, click the "Enable" button for the provider you want to use.

Simultaneous active providers by plan

Plan          | Max active providers
Free          | 1
Starter       | 3
Pro and above | Unlimited
💡 On the Free plan, only one provider can be active at a time. To switch providers, enable a new one and it will automatically replace the current one (no need to re-enter your key).

5. Code Migration

qzira provides an OpenAI-compatible API endpoint. Migration requires only two changes: Base URL and API key.

Endpoint info

Item                             | Value
Base URL                         | https://api.qzira.com/v1
Endpoint (OpenAI-compatible)     | /chat/completions
Endpoint (Anthropic-compatible)  | /v1/messages NEW
Authentication                   | Authorization: Bearer gw_xxxxxxxx or x-api-key: gw_xxxxxxxx

Migration example: Python (OpenAI SDK)

Before (direct call):

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxx"  # OpenAI API key
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

After (via qzira):

from openai import OpenAI

client = OpenAI(
    api_key="gw_xxxxxxxx",                  # qzira API key
    base_url="https://api.qzira.com/v1"     # qzira endpoint
)

response = client.chat.completions.create(
    model="gpt-4o",  # model name unchanged
    messages=[{"role": "user", "content": "Hello"}]
)

Migration example: Python (Anthropic SDK → OpenAI-compatible)

Before (direct Anthropic SDK call):

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-xxxxxxxx")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Hello"}]
)

After (via qzira — OpenAI-compatible format):

from openai import OpenAI

client = OpenAI(
    api_key="gw_xxxxxxxx",
    base_url="https://api.qzira.com/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",  # specify Claude model directly
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
)
ℹ️ qzira automatically detects the provider from the model name and converts the request to the appropriate format. system messages are handled automatically.
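The model-prefix routing described in this note can be pictured with a small sketch. The prefix table below is inferred from the model names in this document; it is an illustration, not qzira's actual routing table.

```python
def detect_provider(model: str) -> str:
    """Guess the upstream provider from a model name prefix (illustrative)."""
    prefixes = {
        ("gpt-", "o1", "o3"): "openai",
        ("claude-",): "anthropic",
        ("gemini-",): "google",
        ("deepseek-",): "deepseek",
    }
    for keys, provider in prefixes.items():
        if any(model.startswith(p) for p in keys):
            return provider
    raise ValueError(f"Unknown model: {model}")

print(detect_provider("claude-sonnet-4-20250514"))  # anthropic
print(detect_provider("gemini-2.0-flash"))          # google
```

Because the provider is inferred per request, a single client can mix models from different providers without reconfiguration.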

Migration example: Python (Google Gemini → OpenAI-compatible)

from openai import OpenAI

client = OpenAI(
    api_key="gw_xxxxxxxx",
    base_url="https://api.qzira.com/v1"
)

response = client.chat.completions.create(
    model="gemini-2.0-flash",  # Gemini model name unchanged
    messages=[{"role": "user", "content": "Hello"}]
)

Migration example: JavaScript / TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "gw_xxxxxxxx",
  baseURL: "https://api.qzira.com/v1",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet-4-20250514",
  messages: [{ role: "user", content: "Hello" }],
});

Managing keys with environment variables (recommended)

.env file:

QZIRA_API_KEY=gw_xxxxxxxx
QZIRA_BASE_URL=https://api.qzira.com/v1

Python:

import os
from openai import OpenAI

# Load variables from the .env file (requires the python-dotenv package);
# skip this if your shell or platform already exports them.
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(
    api_key=os.getenv("QZIRA_API_KEY"),
    base_url=os.getenv("QZIRA_BASE_URL")
)

Streaming support

qzira supports SSE (Server-Sent Events) streaming for all providers.

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    # Some chunks (e.g., a final usage chunk) may carry no choices
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Migration example: Anthropic SDK (native format) NEW

If you use the Anthropic SDK directly, you can use qzira's /v1/messages endpoint in native format.

⚠️ Set base_url to https://api.qzira.com (without /v1). The SDK automatically appends /v1/messages.

Before (direct call):

import anthropic

client = anthropic.Anthropic(
    api_key="sk-ant-xxxxxxxx"  # Anthropic API key
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

After (via qzira — only 2 lines change):

import anthropic

client = anthropic.Anthropic(
    api_key="gw_xxxxxxxx",       # qzira API key
    base_url="https://api.qzira.com"  # ⚠️ no /v1
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

Migration example: TypeScript / JavaScript (Anthropic SDK)

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "gw_xxxxxxxx",        // qzira API key
  baseURL: "https://api.qzira.com"  // ⚠️ no /v1
});

const message = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello" }]
});

Supported models (major examples)

Provider  | Example models
OpenAI    | gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o3-mini
Anthropic | claude-sonnet-4-20250514, claude-3-5-haiku-20241022, claude-3-opus-20240229
Google AI | gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro
DeepSeek  | deepseek-chat, deepseek-reasoner
💡 OpenAI's o1/o3 reasoning models are also supported; specify the model name directly. Tool Calling (Function Calling) is supported for all four providers.

6. Testing with curl

You can test qzira instantly using curl commands.

OpenAI model

curl -X POST https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Anthropic model

curl -X POST https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-5-haiku-20241022",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Google AI model

curl -X POST https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Anthropic native format (/v1/messages) NEW

curl -X POST https://api.qzira.com/v1/messages \
  -H "x-api-key: gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

/v1/messages streaming

curl -N -X POST https://api.qzira.com/v1/messages \
  -H "x-api-key: gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Count from 1 to 5"}],
    "stream": true
  }'

Streaming test

curl -N -X POST https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Count from 1 to 5"}],
    "stream": true
  }'

PowerShell (Windows)

$headers = @{
  "Authorization" = "Bearer gw_xxxxxxxx"
  "Content-Type"  = "application/json"
}
$body = '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}'

Invoke-RestMethod -Uri "https://api.qzira.com/v1/chat/completions" `
  -Method POST -Headers $headers -Body $body

Successful response example

{
  "id": "chatcmpl-xxxxx",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 9,
    "total_tokens": 17
  },
  "provider": "openai"
}
💡 The response includes a provider field so you can see which provider handled the request.
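When calling the endpoint directly (e.g., with requests) rather than through the typed SDK, the extra provider field can be read straight from the JSON body. A minimal sketch using the sample response above; only the parsing is shown, the request layer is assumed:

```python
import json

# A response like the example above, parsed as plain JSON
# (e.g., from `requests.post(...).json()`).
raw = json.loads("""
{
  "id": "chatcmpl-xxxxx",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "Hello!"},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 8, "completion_tokens": 9, "total_tokens": 17},
  "provider": "openai"
}
""")

# The extra "provider" field rides alongside the standard OpenAI fields
print(raw["provider"], raw["usage"]["total_tokens"])  # openai 17
```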

Tool Calling (Function Calling) NEW

qzira supports Tool Calling for all four providers. Simply send the OpenAI-compatible tools parameter and qzira relays it in the appropriate format for each provider.

Tool Calling support by provider

Provider        | Method                                                        | Status
OpenAI          | Pass-through (native OpenAI format)                           | ✅ Supported
Anthropic       | Pass-through (SDK-handled)                                    | ✅ Supported
DeepSeek        | Pass-through (OpenAI-compatible)                              | ✅ Supported
Google (Gemini) | Auto-conversion (OpenAI tools → Gemini functionDeclarations)  | ✅ Supported
💡 Gemini auto-conversion: When a Gemini model is specified, qzira automatically converts OpenAI-format tools to Gemini-format functionDeclarations, and converts the response's functionCall back to tool_calls. No changes needed on your end.

Tool Calling test (curl)

OpenAI model

curl -X POST https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the weather in Tokyo?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City and country, e.g. Tokyo, Japan"
              }
            },
            "required": ["location"]
          }
        }
      }
    ]
  }'

Gemini model (auto-converted)

# Same tools format as OpenAI — qzira auto-converts
curl -X POST https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer gw_xxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [
      {"role": "user", "content": "What is the weather in Osaka?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City and country, e.g. Osaka, Japan"
              }
            },
            "required": ["location"]
          }
        }
      }
    ]
  }'

Tool Calling response example

{
  "id": "chatcmpl-xxxxx",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_xxxxx",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Tokyo, Japan\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 53,
    "completion_tokens": 6,
    "total_tokens": 59
  },
  "provider": "openai"
}

tool_choice options

Value                                          | Behavior                             | Gemini conversion
"auto" (default)                               | Model decides whether to call a tool | AUTO
"required"                                     | Model must call at least one tool    | ANY
"none"                                         | No tool calls                        | NONE
{"type":"function","function":{"name":"xxx"}}  | Call a specific tool                 | ANY + allowedFunctionNames
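The mapping in the table above can be sketched as a small conversion function. The functionCallingConfig field names follow Gemini's API, but the function itself is illustrative, not qzira's actual code.

```python
def to_gemini_tool_config(tool_choice):
    """Convert an OpenAI-style tool_choice to a Gemini toolConfig (sketch)."""
    if tool_choice in (None, "auto"):
        return {"functionCallingConfig": {"mode": "AUTO"}}
    if tool_choice == "required":
        return {"functionCallingConfig": {"mode": "ANY"}}
    if tool_choice == "none":
        return {"functionCallingConfig": {"mode": "NONE"}}
    if isinstance(tool_choice, dict):
        # {"type": "function", "function": {"name": "xxx"}}
        name = tool_choice["function"]["name"]
        return {"functionCallingConfig": {"mode": "ANY",
                                          "allowedFunctionNames": [name]}}
    raise ValueError(f"Unsupported tool_choice: {tool_choice!r}")

print(to_gemini_tool_config("required"))
```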

Streaming with Tool Calling

Tool Calling works correctly with streaming ("stream": true). Chunk structure when a tool is called:

# Chunk 1: role
data: {"choices":[{"delta":{"role":"assistant"},...}]}

# Chunk 2: tool_calls (function name + arguments)
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_xxx","type":"function","function":{"name":"get_weather","arguments":"{...}"}}]},...}]}

# Chunk 3: done
data: {"choices":[{"delta":{},"finish_reason":"tool_calls",...}]}

# Chunk 4: usage
data: {"usage":{"prompt_tokens":24,"completion_tokens":6}}

data: [DONE]
⚠️ Note: If response caching is enabled, an old cached response (without tools) from before Tool Calling was added may be returned. If results are unexpected, slightly modify the message content to bypass the cache.
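In real streams the arguments string often arrives split across several chunks, so clients typically accumulate tool-call deltas by index. A minimal sketch over simplified sample chunks (not live output):

```python
import json

# Simplified sample chunks shaped like the deltas shown above
chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "id": "call_xxx", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"loca"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "tion\": \"Tokyo\"}"}}]}}]},
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
]

calls = {}  # tool-call index -> accumulated call
for chunk in chunks:
    for delta_call in chunk["choices"][0]["delta"].get("tool_calls", []):
        call = calls.setdefault(delta_call["index"],
                                {"id": None, "name": None, "arguments": ""})
        if "id" in delta_call:
            call["id"] = delta_call["id"]
        fn = delta_call.get("function", {})
        if "name" in fn:
            call["name"] = fn["name"]
        call["arguments"] += fn.get("arguments", "")  # concatenate fragments

args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args)  # get_weather {'location': 'Tokyo'}
```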

7. Cursor / AI Agent Integration

Cursor (Verified)

You can add qzira as an OpenAI-compatible API provider in Cursor's settings. Manage all OpenAI, Anthropic, and Google models with a single API key.

⚠️ Cursor Pro or higher is required. The Free tier does not support BYOK (Bring Your Own Key), so custom API keys and Base URLs cannot be configured.

Prerequisites

Step 1: Set up qzira

  1. Log in to app.qzira.com
  2. Under "Providers" in the sidebar, enable the providers you want:
    • OpenAI: for GPT-4o, GPT-4o-mini, etc.
    • Anthropic: for Claude Sonnet, Claude Haiku, etc.
    • Google AI: for Gemini 2.0 Flash, Gemini 2.5 Pro, etc.
    • DeepSeek: for DeepSeek V3 (deepseek-chat), DeepSeek R1 (deepseek-reasoner)
  3. Under "API Keys," create a key for Cursor (e.g., cursor-dev)
  4. Copy the displayed gw_xxxxxxxx key (⚠️ shown only once)

Step 2: Configure Cursor

  1. Open Cursor and go to Settings → Models
  2. Enter your qzira API key in OpenAI API Key:
    gw_xxxxxxxx
  3. Enter the following in Override OpenAI Base URL:
    https://api.qzira.com/v1
  4. Manually add the models you want (click "+ Add model"):
    • gpt-4o-mini (OpenAI)
    • gpt-4o (OpenAI)
    • claude-sonnet-4-20250514 (Anthropic)
    • gemini-2.0-flash (Google)
    • gemini-2.5-pro (Google)
  5. Toggle the added models ON
ℹ️ Models preset in Cursor like claude-3.5-sonnet use Cursor's built-in API. To use them via qzira, add the exact model name manually as described above.

Verification checklist

  • Select model: In Cursor chat or editor, select a manually added model (e.g., gemini-2.0-flash)
  • Test send: Send a simple prompt ("Hello") and confirm a response is returned
  • Check dashboard: Confirm the request is logged at app.qzira.com
  • Check provider: Confirm the correct provider name appears in the log's provider column

Troubleshooting

Symptom                       | Cause                              | Solution
401 Unauthorized              | Invalid or mistyped API key        | Check key status in qzira dashboard; create a new key if needed
403 Provider not enabled      | Provider not enabled               | Dashboard → Providers → enable the provider and set its API key
403 Plan upgrade required     | Feature not available on Free plan | Upgrade qzira plan to Pro or higher
Model not visible in selector | Model not added or toggle is OFF   | Settings → Models → "+ Add model" then toggle ON
No response / timeout         | Base URL misconfigured             | Confirm https://api.qzira.com/v1 is entered correctly (no trailing slash)
💡 Using separate API keys per agent isolates Cursor's usage from other tools. Combined with per-key budgets (Section 13), you can prevent runaway Cursor costs.

Claude Code NEW

Claude Code switches to qzira by setting environment variables. All requests are logged in the dashboard, and budget controls and auto-stop apply.

⚠️ Set ANTHROPIC_BASE_URL to https://api.qzira.com (without /v1). Claude Code automatically appends /v1/messages.

Linux / macOS

# Set environment variables
export ANTHROPIC_BASE_URL="https://api.qzira.com"
export ANTHROPIC_API_KEY="gw_xxxxxxxx"

# Start Claude Code
claude

PowerShell (Windows)

# Set environment variables
$env:ANTHROPIC_BASE_URL = "https://api.qzira.com"
$env:ANTHROPIC_API_KEY = "gw_xxxxxxxx"

# Start Claude Code
claude

Persist via ~/.bashrc or ~/.zshrc

# Add to ~/.bashrc or ~/.zshrc
export ANTHROPIC_BASE_URL="https://api.qzira.com"
export ANTHROPIC_API_KEY="gw_xxxxxxxx"

Verification

  • First launch: The Anthropic OAuth screen may appear (select your organization). This is Claude Code's own auth and is separate from qzira.
  • Auth conflict warning: "Using ANTHROPIC_API_KEY instead of Anthropic Console key" means the qzira key is being used ✅
  • Automatic model switching: Claude Code automatically uses Haiku (lightweight tasks) and Sonnet (responses). All requests are logged in the dashboard.
  • Dashboard check: Confirm requests with provider: anthropic appear in the usage log at app.qzira.com.

Cline (VS Code extension)

In Cline's settings, change "API Provider" to "OpenAI Compatible" and enter:

  • Base URL: https://api.qzira.com/v1
  • API Key: gw_xxxxxxxx
  • Model: the model name you want (e.g., claude-sonnet-4-20250514)

Windsurf

⚠️ Windsurf does not currently support qzira integration.

Windsurf's BYOK feature only supports Claude 4 Sonnet / Opus, and API keys are set in the Windsurf management panel (windsurf.com/subscription/provider-api-keys). There is no custom Base URL or OpenAI Compatible provider setting, so routing through an API gateway like qzira is not possible.

Source: Windsurf Docs — AI Models (verified February 2026)

Roo Code (VS Code extension)

Roo Code has been verified to work in both OpenAI Compatible mode and Anthropic mode (v3.47.3).

Method A: OpenAI Compatible mode (recommended)

  • API Provider: OpenAI Compatible
  • Base URL: https://api.qzira.com/v1
  • API Key: gw_xxxxxxxx
  • Model: gpt-4o-mini (or gpt-4o)

Method B: Anthropic mode

  • API Provider: Anthropic
  • ☑ Check "Use custom base URL"
  • Base URL: https://api.qzira.com (⚠️ no /v1)
  • API Key: gw_xxxxxxxx
  • Model: claude-sonnet-4-20250514

Other AI agents

Any tool that supports configuring an OpenAI-compatible Base URL and API key can be integrated the same way.

Tool         | Where to configure
Cursor       | Settings → Models → OpenAI API Key
Claude Code  | Env vars: ANTHROPIC_BASE_URL=https://api.qzira.com (no /v1) + ANTHROPIC_API_KEY=gw_xxx
Cline        | API Provider → OpenAI Compatible
Windsurf     | ❌ No custom Base URL support (BYOK limited to Claude, URL not configurable)
Roo Code     | API Provider → OpenAI Compatible or Anthropic (custom base URL)
Continue.dev | apiBase in config.json
Aider        | --openai-api-base option
LangChain    | base_url parameter

Tool × Provider Compatibility Matrix

Verified compatibility of AI coding tools via qzira (as of February 2026).

Tool        | OpenAI        | Anthropic | Google AI
curl        | Supported     | Supported | Supported
Claude Code | N/A           | Supported | N/A (Anthropic-only tool by design)
Cline       | Supported     | Supported | ⚠️ Text generation only
Roo Code    | Supported     | Supported | ⚠️ Text generation only
Cursor ※    | Supported     | Supported | ⚠️ Text generation only
Windsurf    | Not supported | Not supported | Not supported (no custom Base URL support)

※ Cursor requires Cursor Pro or higher (the Free plan ignores custom settings).
Legend: Supported / ⚠️ Partial / Not supported / N/A (tool limitation)
⚠️ Google AI "Text generation only": text generation via qzira → Google AI works, but Tool Calling (function calling) is not yet supported. AI coding tools rely heavily on Tool Calling, so practical use is limited.
ℹ️ The above reflects verification results for specific versions as of February 2026. Tool or provider updates may change compatibility. Check each tool's official documentation for the latest status.
💡 Try it free on the Free plan. Send feedback and tool reports to @qzira_dev.

8. Monitoring Usage in the Dashboard

Log in at https://app.qzira.com/dashboard to see the following in real time.

Dashboard overview

  • This month's requests: Total monthly request count (vs. plan limit)
  • Usage rate: Percentage of plan limit used
  • Input / output tokens: Total token consumption
  • Daily request graph: Visualizes historical request trends

Recent requests

Details for each request:

Field              | Description
API key name       | Name of the API key used for the request NEW
Model              | Model name used (e.g., claude-sonnet-4-20250514)
Provider           | Responding provider (e.g., Anthropic)
Tokens             | Input / output token count
Latency            | Response time (milliseconds)
Status             | Success / Error
Cost NEW           | Estimated cost (USD / JPY switchable). Failed requests show "—"
Tool Calls 🔧 NEW  | Number of tool calls. Click to view function names and arguments
Timestamp          | Request date and time

Usage export (CSV) NEW

Request logs can be downloaded as CSV. Useful for expense reporting and internal analytics.

Click the "Export CSV" button in the "Request Logs" section of the dashboard to download the displayed log data as a CSV file.

CSV columns

Column             | Description
id                 | Log ID
api_key_name       | Name of the API key used
model              | Model name
provider           | Provider name
input_tokens       | Input token count
output_tokens      | Output token count
latency_ms         | Response time (ms)
status             | Status code
created_at         | Request timestamp
estimated_cost_usd | Estimated cost (USD) NEW
estimated_cost_jpy | Estimated cost (JPY) NEW
tool_calls         | Tool call details (JSON format) NEW
ℹ️ CSV export covers data within the log retention period. We recommend exporting regularly.
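As a sketch of working with the export, the columns above can be summarized with the standard csv module. The two data rows below are made-up sample values.

```python
import csv
import io

# Sample export with the documented columns; values are invented for illustration
sample = """\
id,api_key_name,model,provider,input_tokens,output_tokens,latency_ms,status,created_at,estimated_cost_usd,estimated_cost_jpy,tool_calls
1,cursor-dev,gpt-4o-mini,openai,120,45,850,200,2026-02-01T09:00:00Z,0.0002,0.03,
2,my-app-key,claude-3-5-haiku-20241022,anthropic,300,150,1200,200,2026-02-01T09:05:00Z,0.0011,0.17,
"""

rows = list(csv.DictReader(io.StringIO(sample)))
total_tokens = sum(int(r["input_tokens"]) + int(r["output_tokens"]) for r in rows)
total_usd = sum(float(r["estimated_cost_usd"]) for r in rows)
print(total_tokens, round(total_usd, 4))  # 615 0.0013
```

For a real export, replace the inline string with `open("export.csv")`.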

Log retention period NEW

Request logs are automatically retained for a period based on your plan. Logs older than the retention period are automatically deleted daily.

Plan     | Log retention
Free     | 3 days
Starter  | 30 days
Pro      | 30 days
Business | 90 days
Scale    | 365 days
⚠️ Important: Logs past the retention period cannot be recovered. If you need long-term storage, export to CSV within the retention period.

9. Budget Alerts & Auto-Stop

Click "Budget" in the sidebar to configure cost controls (Starter and above).

⚠️ Budget limit enforcement timing
Usage aggregation is not real-time — it updates at intervals. As a result, spending may exceed the configured limit depending on aggregation timing.

This is common behavior in many API environments, and qzira behaves similarly.

qzira keeps aggregation intervals to a few minutes to minimize overage, but instantaneous hard stops are not guaranteed.
⚡ Realtime Budget Stop (Scale plan only)

In addition to the standard KV-based budget check (up to ~5 min delay), Scale plan users can enable instant enforcement via direct D1 query.

Item             | Detail
Plan             | Scale only (visible when Auto-Stop is enabled)
Effect           | Blocks the very next request after the limit is reached; no ~5 min delay
Latency overhead | +15–40ms per request (D1 query)
How to enable    | Dashboard → Budget Settings → Realtime Budget Stop toggle

Budget management modes

qzira supports two budget modes: request-count-based and cost-based (USD). Switch between them in the Budget settings page of the dashboard.

Mode          | Unit                     | Characteristics
Request count | API request count        | Simple: each request counts as 1, regardless of model or token usage
Cost (USD)    | Estimated API cost (USD) | Based on estimated cost from token usage; prevents overuse of expensive models

Configuration options

Setting                    | Description                                       | Available plan
Monthly limit              | Max monthly request count or cost (USD)           | All plans
Daily limit                | Max daily request count or cost (USD)             | Starter and above
Budget alert notifications | Email notification at 50% / 80% / 100%            | Starter and above
Auto-stop                  | Automatically block requests when limit is reached | Pro and above
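The 50% / 80% / 100% alert thresholds can be illustrated with a tiny sketch. The threshold values come from the table above; the function itself is illustrative, not a qzira API.

```python
def fired_alerts(used: float, limit: float) -> list[int]:
    """Return which alert levels (percent of limit) have been reached."""
    return [pct for pct in (50, 80, 100) if used >= limit * pct / 100]

print(fired_alerts(8_500, 10_000))   # [50, 80]
print(fired_alerts(10_000, 10_000))  # [50, 80, 100]
```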

Exchange rate display (USD / JPY) NEW

In cost mode, the USD budget setting also supports JPY display.

Item             | Details
Rate source      | ExchangeRate-API (open.er-api.com), daily updates
Update frequency | Once daily (15:00 UTC)
Cache            | KV store, 48-hour cache
Fallback         | Fixed rate of ¥150/USD if the API fetch fails
Currency switch  | Toggle USD ↔ JPY in the input form
ℹ️ JPY display is for reference only. Exchange rate fluctuations may cause discrepancies between the dashboard's JPY estimate and actual provider billing. The fallback rate (¥150/USD) is a last resort during API outages and may differ significantly from the market rate. Check each provider's dashboard for accurate billing amounts.

About cost estimation

Estimated costs in cost mode are calculated from each request's token usage and each model's published pricing. Please note:

  • Estimated costs are approximations and may not match actual provider billing
  • If a provider changes pricing, there may be a lag before it's reflected
  • Cost estimation accuracy may be lower for some models or special requests (e.g., image input)
⚠️ To guard against AI agents sending large numbers of overnight requests, we recommend setting a daily limit + auto-stop.

10. Plan Upgrade

Click "Plan & Billing" in the sidebar to change your plan.

Plan comparison

Plan     | Monthly | Requests/mo | Active providers | API keys | Log retention
Free     | $0      | 1,000       | 1                | 1        | 3 days
Starter  | $5      | 10,000      | 3                | 2        | 30 days
Pro      | $10     | 100,000     | Unlimited        | 5        | 30 days
Business | $29     | 500,000     | Unlimited        | 50       | 90 days
Scale    | $69     | 3,000,000   | Unlimited        | 100      | 365 days
ℹ️ Prices are shown in USD for reference; billing is in JPY via Stripe, so USD amounts are approximate.

Key features by plan

Feature | Free | Starter | Pro | Business | Scale
Streaming
API key rotation
Usage export (CSV)
Auto retry
Failover
Budget alerts
Budget limit & auto-stop
Response cache
Semantic cache
Per-key budget
Access control (Per-Key) NEW
Smart Routing NEW
Secret Shield NEW
Priority support

11. API Key Rotation NEW

As a security best practice, we recommend rotating (regenerating) API keys regularly. qzira supports one-click rotation from the dashboard.

What is rotation?

When you rotate a key, the existing API key is immediately invalidated and a new gw_ key is issued. The key's ID (internal identifier) and name are preserved, so dashboard usage history and settings carry over.

Rotation steps

  1. Click "API Keys" in the sidebar
  2. Click the "Rotate" button (🔄 icon) for the key you want to rotate
  3. Select "Execute rotation" in the confirmation dialog
  4. The new key is displayed — copy and store it securely immediately
🚨 Important: Rotation immediately invalidates the old key. There is no grace period. Any in-flight requests using the old key will immediately receive 401 Unauthorized.
⚠️ Recommended process: Before rotating, prepare the following:
1. Have a way to receive the new key ready (open your .env file for editing)
2. Execute rotation → immediately copy the new key
3. Update the new key in all apps, agents, and CI/CD pipelines
4. Confirm requests from the new key appear in the dashboard
※ A grace period feature is being considered for future implementation.

When to rotate

  • When a key may have been accidentally exposed (log output, git history, etc.)
  • When a team member leaves or changes role
  • Periodically as a security measure (every 30–90 days recommended)
  • When suspicious requests are detected in the dashboard
💡 If managing API keys via environment variables (.env file), just update the value after rotation — that's it.

Post-rotation update example

# Update .env file
QZIRA_API_KEY=gw_yyyyyyyy    # ← Replace with new key
QZIRA_BASE_URL=https://api.qzira.com/v1

Restart (or redeploy) your application and the new key will be used automatically.

12. Response Cache

Cache AI provider responses for identical requests to speed up subsequent responses and save tokens (Pro and above).

ℹ️ Response caching is disabled by default. Enable it explicitly from the dashboard. LLM response caching can cause unexpected behavior — understand the use case before enabling.

Benefits

  • Cost reduction: Skip provider API calls for identical requests — zero token consumption
  • Faster responses: Returned instantly from KV cache (hundreds of ms → tens of ms)
  • Provider outage mitigation: During cache TTL, you're unaffected by provider outages
  • Zero config: Flip the toggle to enable immediately — no code changes required

Downsides & caveats

  • Streaming requests are not cached
  • Exact match only (for fuzzy matching, use semantic cache)
  • Stale responses may be returned during the TTL period
  • Requests containing PII may also be cached

Good fit vs. poor fit

Good fit                                  | Poor fit
Repeated test / debug runs                | Generating unique creative content each time
Batch processing (same prompt, many runs) | Fetching real-time information
FAQ bots / templated responses            | Streaming agents
Cost minimization use cases               | Requests with heavy PII

Supported plans and TTL

Plan     | Available | Default TTL | Custom TTL
Free     | No        | N/A         | N/A
Starter  | No        | N/A         | N/A
Pro      | Yes       | 1 hour      | Up to 1 hour
Business | Yes       | 24 hours    | Up to 24 hours
Scale    | Yes       | 7 days      | Up to 7 days

How to enable

  1. Click "Cache" in the sidebar
  2. Toggle "Enable cache" to ON
  3. Optionally configure custom TTL or temperature limit

Configuration options

Setting           | Description                                                              | Default
Cache on/off      | Enable or disable response caching                                       | OFF
Custom TTL        | Cache retention duration (seconds); at most the plan's default TTL       | Plan default
Temperature limit | Exclude requests above a specified temperature from caching              | No limit (all requests cached)

How caching works

The following request fields are SHA-256 hashed to detect identical requests:

  • User ID
  • Model name
  • Message content
  • temperature / top_p / max_tokens

On a cache hit, the provider request is skipped and the stored response is returned immediately, saving token consumption.
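As an illustration, an exact-match cache key over these fields might be computed like this. The field selection follows this section, but the exact serialization qzira uses is not documented, so that part is an assumption.

```python
import hashlib
import json

def cache_key(user_id: str, request: dict) -> str:
    """SHA-256 fingerprint over the documented request fields (sketch)."""
    material = {
        "user_id": user_id,
        "model": request.get("model"),
        "messages": request.get("messages"),
        "temperature": request.get("temperature"),
        "top_p": request.get("top_p"),
        "max_tokens": request.get("max_tokens"),
    }
    # Canonical serialization so identical requests hash identically
    blob = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

a = cache_key("user-1", {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]})
b = cache_key("user-1", {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]})
print(a != b)  # True: any difference in content produces a new key
```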

Response headers

When caching is active, the following headers are added to responses:

Header              | Value                            | Description
X-Cache             | EXACT_HIT / SEMANTIC_HIT / MISS  | Exact hit / semantic hit / miss
X-Cache-TTL         | Seconds                          | Applied TTL
X-Semantic-Score    | 0.00–1.00                        | Semantic cache similarity score (on hit only)
X-Cache-Skip-Reason | String                           | Reason caching was skipped (on skip only)

Requests excluded from caching

  • Streaming requests ("stream": true): SSE format is not suited for caching
  • Temperature limit exceeded: Requests above the configured temperature limit
  • Error responses: Provider error responses are not cached
⚠️ Note: Response cache (exact match) only returns a cached result if all parameters are identical. Even a slight difference in message content results in a new request. Use semantic cache (Section 15) for semantically similar requests.
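A client can branch on these headers, for example when logging cache effectiveness. The helper below is a hypothetical sketch; only the header names come from the table above.

```python
def cache_status(headers: dict) -> str:
    """Classify a qzira response from its cache-related headers."""
    status = headers.get("X-Cache", "MISS")
    if status in ("EXACT_HIT", "SEMANTIC_HIT"):
        # On a hit, X-Cache-TTL reports the TTL applied to the entry
        return f"cache hit (TTL {headers.get('X-Cache-TTL', '?')}s)"
    reason = headers.get("X-Cache-Skip-Reason")
    return f"miss ({reason})" if reason else "miss"

print(cache_status({"X-Cache": "EXACT_HIT", "X-Cache-TTL": "3600"}))  # → cache hit (TTL 3600s)
```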

Reading cache statistics

Cache utilization statistics are shown at the bottom of the cache settings page.

All-time statistics

Item | Description
Total hits | Total number of cache hits
Input tokens saved | Total input tokens saved by cache hits
Output tokens saved | Total output tokens saved by cache hits
Unique prompt count | Number of unique prompts stored in cache

This month's statistics

Hit count and token savings for the current month. Resets at the start of each month.

Recent cache hits

A list of recent cache hits. Each entry includes:

  • Provider name / model name
  • Input / output tokens saved
  • Hit count
  • Last hit timestamp
💡 Cache statistics can be reset with the "Clear stats" button. Cached responses are automatically deleted when their TTL expires.

13. Per-Key Budget Settings

Set individual request limits per API key to control specific applications or agents (Business and above).

ℹ️ Per-key budgets are independent from the user-level budget alerts in Section 9. Combining both gives you a double layer of protection: user-wide + per-key.

Supported plans

Plan | Per-key budget
Free | —
Starter | —
Pro | —
Business | ✓
Scale | ✓

Setup steps

  1. Click "API Keys" in the sidebar
  2. Click the wallet icon for the key you want to configure
  3. Set the following in the budget modal:
    • Monthly request limit: Max monthly requests
    • Daily request limit: Max daily requests
    • Auto-stop: Automatically block requests when the limit is reached
  4. Click "Save"

Configuration options

Setting | Description | Required
Monthly request limit | Max monthly requests for this key. Leave blank for unlimited. | Optional
Daily request limit | Max daily requests for this key. Leave blank for unlimited. | Optional
Auto-stop | When ON, requests from this key are automatically blocked when the limit is reached. | Optional

How it works

Checks are performed in the following order on each request:

  1. User-level budget check (Section 9 settings)
  2. Per-key budget check (this section's settings)
  3. Only if both pass, the request is forwarded to the provider

Response when limit is reached:

{
  "error": {
    "message": "API key budget exceeded",
    "type": "budget_exceeded",
    "code": 403
  }
}
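A client that drives autonomous agents may want to pause rather than retry when it sees this error. A minimal detector for the response shape shown above (the helper name is ours):

```python
def is_budget_exceeded(status_code: int, body: dict) -> bool:
    """Match the per-key budget-exceeded error shape: 403 + type budget_exceeded."""
    err = body.get("error") or {}
    return status_code == 403 and err.get("type") == "budget_exceeded"

example = {"error": {"message": "API key budget exceeded",
                     "type": "budget_exceeded", "code": 403}}
print(is_budget_exceeded(403, example))  # → True
```

Unlike a 429, this error will not clear on its own until the daily or monthly window resets, so backing off and alerting a human is usually the right response.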

Notifications

  • Threshold alerts: Email at 50% / 80% (determined by 5-minute cron)
  • Limit reached: Immediate email at 100%

Notification emails include the API key name so you can immediately identify which key hit its limit.

Checking current usage

The budget modal shows daily and monthly progress bars so you can check current usage in real time.

Use case examples

  • Per-agent limits: 100/day for Cursor key, 200/day for Claude Code key
  • Per-project management: Separate budget management for Project A and Project B keys
  • Overnight safety: Set daily limit + auto-stop on autonomous agent keys to prevent runaway
💡 Combining per-key budgets with user-level budgets enables fine-grained controls like "max 10,000 total/month," "Agent A: max 500/day," "Agent B: max 200/day."

14. Access Control (Per-Key)

Whitelist-based controls on which providers and models each API key can access. Control "what can be used" per agent within a team, preventing unintended provider usage or expensive model misuse.

ℹ️ Access control is available on Pro plans and above. Combined with per-key budgets (Section 13), you can implement a double guard of "what can be used" + "how much can be spent."

Supported plans

Plan | Access control
Free | —
Starter | —
Pro | ✓
Business | ✓
Scale | ✓

How to configure

  1. Open Dashboard → API Keys
  2. Click the 🔐 Access Restrictions button for the target key (visible on Pro and above)
  3. Provider restriction: Select "Allow all providers" or specific providers
    • Options: openai / anthropic / google
  4. Model restriction: Select "Allow all models" or enter specific models
    • Example: gpt-4o, gpt-4o-mini, claude-sonnet-4-20250514
  5. Click Save to apply settings
⚠️ Once restrictions are set, requests to disallowed providers or models are immediately rejected with a 403 error. Verify the models your AI agent uses before configuring.

Check priority order

Access control checks are performed in this order:

Request received
  │
  ├─ 1. Provider restriction check (evaluated first)
  │     If allowed_providers is set:
  │     → Target provider not in list → 403 provider_not_allowed
  │
  ├─ 2. Model restriction check (after provider passes)
  │     If allowed_models is set:
  │     → Target model not in list → 403 model_not_allowed
  │
  └─ 3. Both pass → Forward to normal proxy processing

Error responses

Requests that violate access controls return errors in the following format:

Error code | HTTP status | Trigger
provider_not_allowed | 403 | Request to a disallowed provider
model_not_allowed | 403 | Request with a disallowed model

Response examples:

// Provider restriction error
{
  "error": {
    "message": "Provider 'google' is not allowed for this API key",
    "code": "provider_not_allowed"
  }
}

// Model restriction error
{
  "error": {
    "message": "Model 'o1-mini' is not allowed for this API key",
    "code": "model_not_allowed"
  }
}

Use cases

Scenario | Provider setting | Model setting | Effect
Cursor-only key | openai only | All models | Block all non-OpenAI providers
Claude Code-only key | anthropic only | All models | Block all non-Anthropic providers
Cost-restricted key | All providers | gpt-4o-mini, gemini-2.0-flash | Allow low-cost models only
Production-validated key | openai only | gpt-4o, gpt-4o-mini | Limit to validated models only

Default behavior

  • No restrictions set (default): All providers and models are accessible (same as before)
  • Provider restriction only: All models under allowed providers are accessible
  • Model restriction only: Specified models accessible from any provider
  • Both restricted: Only requests satisfying both provider and model conditions pass
💡 Removing restrictions: Click "Remove All Restrictions" to clear all provider and model restrictions, returning to the default (allow all) state.
💡 Combined with budgets: Using access control together with per-key budgets (Section 13) enables precise controls like "OpenAI gpt-4o-mini only, max 500 requests/month" per agent.

15. Semantic Cache

Semantic cache uses AI vector search to detect semantically similar requests and reuse cached responses. Unlike response cache (Section 12) which requires exact matches, semantic cache can hit on requests with different wording but the same intent.

ℹ️ Semantic cache is available on Business plans and above. Using it together with response cache (Section 12) maximizes token savings.

Benefits

  • Handles phrasing variations: Recognizes "What are the benefits of TypeScript?" and "Why should I use TypeScript?" as the same intent
  • Significant cost reduction: Catches requests that exact match misses, reducing token consumption
  • Two-layer optimization: On semantic hits, also saved to exact cache (response cache) automatically
  • Safety-first design: PII detection, short text, and multi-turn requests are automatically excluded

Downsides & caveats

  • False positives possible (mitigated by setting threshold ≥ 0.95)
  • Embedding costs apply (approx. $0.30 for 200K queries/month — very low)
  • Non-streaming, single-turn requests only
  • Vector storage has limits (Business: 10,000 / Scale: 50,000)
  • Eventual consistency (newly cached items may take minutes to become searchable)

Good fit vs. poor fit

Good fit | Poor fit
FAQ / helpdesk (same question, different phrasing) | Unique creative requests each time
Education / learning apps (similar questions recur) | Multi-turn conversations (chatbots)
Repetitive code generation (similar patterns) | Streaming-required agents
Best results when combined with response cache | Requests with heavy PII

Response cache vs. semantic cache

Item | Response Cache (Section 12) | Semantic Cache (this section)
Match method | SHA-256 exact match | AI vector search (cosine similarity)
Supported plans | Pro and above | Business and above
Streaming | Not supported | Not supported
Multi-turn | Supported | Not supported (single-turn only)
Latency | Very low (KV get only) | Slightly higher (embedding generation + vector search)
Header | X-Cache: EXACT_HIT | X-Cache: SEMANTIC_HIT
Storage | KV only | Vectorize + KV

Processing flow

Semantic cache is evaluated after response cache (exact match). If an exact match hits, semantic cache processing is skipped.

Request received
  ↓
Auth + plan check (Business+?) + Feature Toggle ON?
  │ NO → Normal flow (skip semantic cache)
  ↓ YES
Streaming? → YES → Skip (X-Cache-Skip-Reason: streaming)
  ↓ NO
① Exact Cache Lookup (SHA-256, KV get ×1)
  ├─ EXACT_HIT → Return immediately (no embedding needed, fastest)
  └─ MISS ↓
② Exclusion check (PII / short text / multi-turn)
  ├─ Excluded → Normal flow (X-Cache-Skip-Reason shows reason)
  └─ OK ↓
③ Generate embedding (@cf/baai/bge-small-en-v1.5, 384 dims)
  ↓
④ Vectorize search (user_scope + meta_group filter)
  ├─ Score ≥ threshold → Fetch response from KV
  │   ├─ KV success → SEMANTIC_HIT (also save to exact cache)
  │   └─ KV failure → Treat as MISS (X-Cache-Skip-Reason: kv_miss)
  └─ Score < threshold → MISS
       ↓
⑤ Send request to provider
  ↓
⑥ Save response (Vectorize + KV + Exact Cache)
💡 When a semantic cache hit occurs, the response is automatically saved to exact cache as well. If the same request comes again, it will be served from exact cache (KV get only) immediately — even faster.
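The threshold comparison in step ④ is a cosine-similarity check over embedding vectors. The sketch below illustrates only that final decision in pure Python; the real gateway embeds with bge-small-en-v1.5 (384 dimensions) and searches Vectorize.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_semantic_hit(query_vec, cached_vec, threshold: float = 0.92) -> bool:
    # Hit only when similarity reaches the configured threshold (0.92 is the recommended default)
    return cosine_similarity(query_vec, cached_vec) >= threshold

print(is_semantic_hit([1.0, 0.0], [1.0, 0.0]))  # identical vectors → True
print(is_semantic_hit([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → False
```

Raising the threshold toward 0.99 shrinks the set of vectors that pass this check, which is why higher thresholds trade hit rate for precision.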

Supported plans and storage limits

Item | Business ($29/mo) | Scale ($69/mo)
Max vectors | 10,000 | 50,000
Threshold range | 0.90 – 0.99 | 0.85 – 0.99
TTL | 24 hours | 7 days
Embedding model | @cf/baai/bge-small-en-v1.5 (384 dims) | (same)
Metric | Cosine similarity | (same)
ℹ️ When the vector count reaches the limit, new request caching is skipped (existing cache can still be searched). Check current usage on the "Semantic" page in the dashboard.

Automatic cleanup (CRON)

Old vectors past their TTL are deleted by a daily automatic cleanup.

  • Schedule: Daily UTC 15:00
  • Action: Detect and delete expired vectors per user
  • Deletion limit: Max 500 vectors/user/day (5 iterations × 100)
  • Count sync: Deleted vector count is automatically decremented from D1 counters

To manually delete all vectors, use the "Delete all" button on the "Semantic" page in the dashboard.

Dashboard settings

Click "Semantic" in the sidebar to open the semantic cache settings page.

Vector storage

A progress bar at the top shows vector storage usage — current count, limit, and usage percentage. TTL and auto-cleanup info is also shown.

Cache settings

  1. Semantic cache ON/OFF: Toggle to enable/disable
  2. Similarity threshold: Adjust with the slider (recommended: 0.92)
    • Higher → More precise (lower hit rate but fewer false positives)
    • Lower → Higher hit rate (more matches but higher false positive risk)
  3. Click "Save" to apply

Cache statistics

When enabled, cache utilization statistics appear at the bottom of the page: total hits, tokens saved, this month's hits, unique prompt count, and recent semantic hit details.

Threshold guide

Threshold | Characteristics | Recommended for
0.95–0.99 | High precision; only near-identical sentences hit | Mission-critical workloads, finance/medical
0.92 (recommended) | Balanced; hits on sufficiently similar requests | General development, code generation, QA
0.85–0.91 | Wide catch; maximizes cost savings | Batch processing with many similar requests (Scale only)

Auto-exclusion conditions

Condition | Reason | X-Cache-Skip-Reason
Streaming requests | Response is sent incrementally; full response cannot be cached | streaming
Multi-turn conversations | Conversations with assistant/tool roles are highly context-dependent | multi_turn
Short text | Prompts under 30 characters are insufficient for semantic comparison | too_short
PII detected | Requests containing email addresses, phone numbers, API keys, etc. are not cached | pii_detected
Feature Toggle OFF | Semantic cache disabled in features settings | feature_off
Plan not eligible | Free / Starter / Pro plans do not support semantic cache | plan_not_allowed

Response headers

Header | Value | Description
X-Cache | SEMANTIC_HIT | Response returned from semantic cache
X-Semantic-Score | 0.00 – 1.00 | Matched similarity score (on SEMANTIC_HIT only)
X-Cache-Skip-Reason | See exclusion conditions above | Reason caching was skipped
X-Cache-Meta | 8-char hash | meta_group identifier (model/system/temperature combo)

Matching conditions

For a semantic cache hit, all of the following must be true:

  • Same user
  • Same meta_group (model name + system prompt + temperature combination)
  • User message cosine similarity ≥ configured threshold
  • Vector TTL still valid

In other words, a hit only occurs for semantically similar questions using the same model, system prompt, and temperature. Cached responses from a different model or system prompt will never be incorrectly returned.

⚠️ Multi-turn conversation limitation: Semantic cache uses only the last user message in the messages array for similarity matching. The full conversation context is not considered. The same question in different conversation flows may have different intents, so the cache may incorrectly hit. For precision, set the threshold to 0.95 or higher, or consider disabling semantic cache for multi-turn-heavy workloads.

Responses that are not saved

  • HTTP 4xx / 5xx error responses
  • Responses where finish_reason is "length" (truncated output)
  • Responses shorter than 80 characters
💡 Using with Smart Routing: When Smart Routing is active, semantic cache matching uses the post-routing model name. For example, if gpt-4o is routed to claude-sonnet-4-20250514, cache matching is done on claude-sonnet-4-20250514.
ℹ️ Vectorize behavior: It may take up to a few minutes for newly cached vectors to become searchable (eventual consistency). This is a Cloudflare Vectorize specification.

16. Smart Routing NEW

Smart Routing automatically routes requested models to the optimal provider model (Scale plan only). Efficiently leverage multiple providers for cost optimization or latency improvement.

⚠️ Smart Routing is disabled by default. If your AI agent depends on model-specific behavior (JSON output format, tool calling specs, etc.), switching models may cause unexpected results. Understand the use case thoroughly before enabling.
ℹ️ How routing works: Smart Routing selects an alternative provider based on a predefined model mapping table. It does not automatically assess model quality (tier classification). Failover (Section 17) also follows the same mapping table order. If a model not in the mapping is specified, routing does not trigger and the request goes directly to the original provider.

Three routing strategies

Strategy | Description | Recommended for
Cost optimization (default) | Auto-selects the cheapest provider among equivalent models | Batch processing, high-volume requests, cost-sensitive workloads
Latency optimization | Prioritizes the fastest-responding provider | Real-time responses, chatbots, user-facing apps
Round-robin | Distributes requests evenly across all active providers | Rate limit distribution, provider load balancing

Benefits

  • Cost optimization: Auto-switch to cheaper equivalent providers (up to 30-50% savings possible)
  • Rate limit avoidance: Distributing across providers reduces throttling
  • Zero code changes: Just toggle ON in the dashboard

Downsides & caveats

  • Agents dependent on model-specific behavior (output format, tool calling specs) may behave unexpectedly
  • Response quality may vary between providers
  • When used with semantic cache, matching uses the post-routing model name

How to configure

  1. Open the "Feature Settings" page in the dashboard
  2. Toggle "Smart Routing" to ON
  3. Select routing strategy (Cost optimization / Latency optimization / Round-robin)
  4. Confirm at least 2 providers have registered and enabled API keys
💡 Difference from failover: Failover is an emergency switch when a provider goes down. Smart Routing is optimal distribution during normal operation. Enabling both means failover kicks in if Smart Routing's chosen provider fails.

17. Failover Details NEW

Failover automatically switches to an equivalent model on another provider when a provider returns an error, then resends the request (Starter and above).

⚠️ Failover is disabled by default. Model switching may affect AI agent behavior. Register and enable multiple provider API keys, then explicitly enable in the dashboard.

Retry count by plan

Plan | Retries | 429 handling | Notes
Free | 0 (disabled) | — | No failover
Starter | 2 | — | 5xx errors only
Pro | 3 | ✓ | 5xx + 429
Business | 3 | ✓ | 5xx + 429
Scale | 5 | ✓ | 5xx + 429

How it works

Request → Provider A (5xx/429 error)
→ Retry 1: Auto-switch to equivalent model on Provider B
→ Retry 2: Auto-switch to equivalent model on Provider C
→ All providers fail: Return last error response

Check the switch status in response headers:

  • X-Gateway-Failover-From: Original provider
  • X-Gateway-Failover-To: Switched-to provider
  • X-Gateway-Retries: Retry count
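These headers can be surfaced in your own monitoring, for example to log when a failover occurred. The helper name and output format below are ours; the header names come from this section.

```python
def failover_info(headers: dict):
    """Return a human-readable note if the response was served via failover, else None."""
    src = headers.get("X-Gateway-Failover-From")
    dst = headers.get("X-Gateway-Failover-To")
    if not src or not dst:
        return None  # headers absent: no failover happened
    retries = headers.get("X-Gateway-Retries", "0")
    return f"failover {src} -> {dst} (retries={retries})"

print(failover_info({"X-Gateway-Failover-From": "openai",
                     "X-Gateway-Failover-To": "anthropic",
                     "X-Gateway-Retries": "1"}))
# → failover openai -> anthropic (retries=1)
```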

Failover exclusions

  • 4xx errors (Bad Request, auth errors, etc.): Client-side issues produce the same result on another provider
  • 404 model not found: Specified model does not exist
  • Only one active provider: No fallback target available

How to configure

  1. Register and enable API keys for at least 2 providers
  2. Open the "Feature Settings" page in the dashboard
  3. Toggle "Failover" to ON
💡 Model tiers: Failover switches to the same class of model. For example, if gpt-4o (flagship) fails, it switches to claude-sonnet-4-20250514 (flagship). It does not switch to gpt-4o-mini (economy).

18. Agent Trace NEW

Agent Trace visualizes every processing step of an API request in a Chrome DevTools-style waterfall timeline. You can see exactly how long each stage took — from auth, PII Shield, cache lookup, provider selection, to the final response — making debugging and performance optimization of AI agents straightforward.

Available Plans

Agent Trace is available on the Scale plan only.

Processing Steps

Step | Status examples | Description
auth | success / error | API key authentication
pii_shield | success / skip | Secret Shield PII masking
cache_check | exact_hit / semantic_hit / miss | Response cache lookup
provider_select | routed / default | Smart Routing decision
provider_request | success / error / streaming | Request to AI provider
failover | success | Automatic provider failover
response | success / streaming | Final response delivered

Custom Trace ID

Add the x-qzira-trace-id header to your request to correlate traces with your own application logs:

curl https://api.qzira.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_QZIRA_KEY" \
  -H "x-qzira-trace-id: my-agent-session-001" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}'

The assigned trace ID is returned in the X-Trace-Id response header.

Data Retention

Trace logs are automatically deleted after 24 hours by a daily CRON job. Traces are not available for export.

Dashboard

  1. Open Dashboard → Agent Trace
  2. The waterfall timeline for each request is displayed in the list
  3. Click a row to expand the step-by-step breakdown with timing
  4. Rows marked EXT indicate a custom trace ID was supplied
  5. Rows marked CACHE indicate a cache hit occurred

Notes

  • Trace recording uses waitUntil() — zero impact on response latency
  • Streaming requests are recorded with provider_request: streaming status
  • If Secret Shield is disabled, the pii_shield step shows skip

19. Secret Shield NEW

Secret Shield automatically detects and masks sensitive information (PII) in requests before they are sent to AI providers. Define custom regex rules to redact API keys, credit card numbers, personal IDs, and more — the provider never sees the original values.

Available Plans

Secret Shield is available on the Scale plan only.

How It Works

User registers regex rules in the Dashboard
  ↓
API request received (proxy.routes.ts)
  ↓
If Secret Shield is ON → scan & mask request body before sending
  ↓
Masked request sent to AI provider
  ↓
Usage log saved with masked content (original never stored)

Masking Modes

Mode | Example input | Example output
Full mask | [email protected] | [REDACTED]
Prefix (keep first N chars) | sk-ant-api03-xxxx | sk-ant-***
Suffix (keep last N chars) | 4111-1111-1111-1111 | ****-1111
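The three modes amount to regex substitutions. The sketch below is illustrative only; the patterns and helper names are ours, and qzira's actual server-side rules are not visible after saving.

```python
import re

def mask_full(text: str, pattern: str) -> str:
    # Full mask: replace every match entirely
    return re.sub(pattern, "[REDACTED]", text)

def mask_prefix(text: str, pattern: str, keep: int) -> str:
    # Prefix mode: keep the first `keep` characters of each match
    return re.sub(pattern, lambda m: m.group(0)[:keep] + "***", text)

def mask_suffix(text: str, pattern: str, keep: int) -> str:
    # Suffix mode: keep the last `keep` characters (card-style output shown here)
    return re.sub(pattern, lambda m: "****-" + m.group(0)[-keep:], text)

print(mask_prefix("sk-ant-api03-xxxx", r"sk-ant-[A-Za-z0-9-]+", 7))   # → sk-ant-***
print(mask_suffix("4111-1111-1111-1111", r"\d{4}-\d{4}-\d{4}-\d{4}", 4))  # → ****-1111
```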

Built-in Presets (11 rules)

Preset | Masking mode
AWS Access Key | Prefix (AKIA***)
AWS Secret Key | Prefix
OpenAI API Key | Prefix (sk-***)
Anthropic API Key | Prefix (sk-ant-***)
GCP API Key | Prefix (AIza***)
Credit Card Number | Suffix (****-1234)
Phone Number (JP) | Suffix
Email Address | Full ([REDACTED])
JWT Token | Full ([REDACTED])
My Number (JP) | Full ([REDACTED])
Postal Code (JP) | Full ([REDACTED])

AI-Assisted Regex Generation

Describe what you want to mask in plain English (or Japanese), and qzira will generate the regex automatically using Cloudflare Workers AI (Llama 3.1 8B). You can review and edit the generated pattern before saving.

Rule Limits

Up to 20 rules per user. Rules are evaluated in order, and every matching rule is applied to each request.

Dashboard Setup

  1. Open Dashboard → Secret Shield
  2. Click Add Rule and describe what to mask (e.g. "employee ID starting with EMP")
  3. Click ✨ Generate Regex or enter the pattern manually
  4. Choose masking mode (Full / Prefix / Suffix) and save
  5. Toggle Secret Shield ON at the top of the page
  6. Use the Test field to verify masking behavior before going live

Security Notes

  • Regex patterns are stored server-side and are not displayed after saving — delete and recreate to update a pattern
  • Secret Shield operates as fail-open: if an error occurs during masking, the original request continues unmodified (masking failure does not block the API call)
  • Masked content is logged in usage_logs — the original value is never stored

FAQ

Can I use Secret Shield without Smart Routing?
Yes. Secret Shield and Smart Routing are independent features that can be toggled separately.
Does Secret Shield affect latency?
Masking adds minimal overhead (typically under 5ms) and does not block the request on error.
Can I view a saved rule's regex pattern?
No. For security reasons, patterns are not displayed after saving. To update a pattern, delete the rule and create a new one.

Responsibility & Disclaimer

qzira is a BYOK (Bring Your Own Key) API gateway. We clearly define the responsibility boundaries for each area.

Responsibility matrix

Area | qzira's responsibility | User's responsibility
API key management | Encrypted storage, safe handling during proxy | Obtaining, managing, and renewing each provider's API keys
Request relay | Accurate proxy processing, format conversion | Appropriateness of request content, compliance with provider terms
Cost management | Usage visibility, budget alerts, auto-stop feature provision | Budget configuration, proper operation of cost management features
AI-generated content | — (not involved) | Reviewing generated content, usage decisions, third-party impact
Provider outages | Impact mitigation via failover (supported plans only) | Provider SLA and outages are the provider's responsibility
Data protection | Encrypted communications, proper log management | Managing transmitted data content (PII, confidential info handling)
AI agents | — (not involved) | Configuring, monitoring, and controlling AI agents
Provider terms | — (not involved) | Compliance with each provider's Terms of Service and policies
🚨 Important: qzira is a request relay service and assumes no responsibility for AI-generated content. Users are responsible for verifying the accuracy and appropriateness of generated content.

Limitations of cost management features

qzira's budget alerts and auto-stop features are designed to assist cost management. Please note:

  • Budget alerts are request-count-based and may differ from actual provider billing amounts
  • Auto-stop is determined on the next incoming request — requests already in progress are not stopped
  • Alert/stop triggering may be delayed during network delays or system failures
  • Cost management features are provided on a "best-effort" basis
  • Usage aggregation updates at intervals — spending may exceed the configured limit depending on timing

Protection level by plan

Protection feature | Free | Starter | Pro | Business | Scale
Monthly request limit
Auto retry
Failover
Budget alert notifications
Daily request limit
Budget limit & auto-stop
Response cache
Semantic cache
Per-key budget
Access control (Per-Key) NEW
Smart Routing NEW
Priority support

Notes for AI agent usage

AI agents (Cursor, Claude Code, Devin, etc.) autonomously send API requests. Agent behavior is the user's responsibility. Recommended safeguards:

  • Set a daily request limit (Starter and above)
  • Enable auto-stop (Pro and above)
  • Check usage in the dashboard regularly
  • Be especially careful with overnight automated tasks

SLA (Service Level)

qzira strives to maintain availability but does not currently offer SLA guarantees. The service may be temporarily suspended without prior notice to ensure infrastructure safety. See the Terms of Service for details.


FAQ

Pricing

Q. Is there a time limit on the Free plan?

No, the Free plan has no time limit. Up to 1,000 requests/month with 1 active provider.

Q. Are there costs beyond qzira's fees?

Yes. qzira is BYOK, so usage fees for each AI provider (OpenAI, Anthropic, Google AI, DeepSeek) are separate. qzira's monthly fee covers the gateway service itself.

Q. Can I change plans at any time?

Yes, you can upgrade or downgrade at any time from the dashboard. Upgrades take effect immediately; downgrades take effect at the end of the current billing period. Note that downgrades are limited to once per month.

Security

Q. How are my registered API keys protected?

API keys are stored encrypted. All communication is encrypted over HTTPS (TLS 1.3), and qzira staff cannot view API keys in plaintext.

Q. Does qzira use request content for model training?

No. qzira does not use request content for AI model training or service improvement. Data is used solely for request relay and usage measurement.

Q. Where is my data stored?

qzira's infrastructure runs on Cloudflare's global network. User data is stored in Cloudflare data centers.

Q. What should I do if my API key is leaked?

Immediately rotate (regenerate) the key from the "API Keys" page in the dashboard. The old key is invalidated instantly. See Section 11 for details.

Service

Q. If qzira goes down, will I lose API access?

Requests via qzira will be affected. You can fall back immediately by reverting the Base URL and API key to their original values, switching to direct provider access.

Q. What happens if I specify an unsupported model?

You will receive a 400 Invalid model error. Check the supported model list in the Code Migration section.

Q. Is streaming available for all providers?

Yes. SSE-format streaming is supported for OpenAI, Anthropic, Google AI, and DeepSeek. Available on all plans.

Q. What should I do if I hit a rate limit?

If a qzira rate limit (429 error) occurs, wait a moment and retry. Starter plans and above have automatic retry handled by qzira.
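If you want client-side protection as well (for example on the Free plan, which has no auto retry), a simple exponential-backoff wrapper works. This is a generic sketch, not a qzira API; `send` stands for any function that performs the HTTP request and returns a response with a `status_code`.

```python
import random
import time

def with_backoff(send, max_attempts: int = 4, base: float = 0.5):
    """Call send() repeatedly, backing off exponentially while it returns HTTP 429."""
    resp = None
    for attempt in range(max_attempts):
        resp = send()
        if resp.status_code != 429:
            break
        if attempt + 1 < max_attempts:
            # Sleep base, 2*base, 4*base, ... plus a little jitter before retrying
            time.sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
    return resp
```

Pair this with a check for 403 budget errors, which will not clear on retry.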

Q. Are blocked requests from access control billed?

No. Access control checks happen before sending to the provider API, so blocked requests do not incur provider charges and are not counted toward qzira usage.

Q. Does access control apply to both Chat Completions API and Responses API?

Yes. Access control is checked before request routing, so it applies to both Chat Completions API (/v1/chat/completions) and Responses API (/v1/responses).

Q. What's the difference between semantic cache and response cache?

Response cache (Section 12) only hits when requests are completely identical. Semantic cache (Section 15) uses AI vector search to detect and reuse cached responses for semantically similar requests even if phrasing differs. Using both maximizes token savings.

Q. Can semantic cache return incorrect responses?

Setting the similarity threshold appropriately minimizes false hits. The recommended value is 0.92. For precision-critical use cases, set 0.95 or higher. Additionally, streaming requests and multi-turn conversations are automatically excluded, so it can be used safely in agentic interactive scenarios.

Q. How long are request logs retained?

Logs are automatically retained based on your plan: Free (3 days), Starter (30 days), Pro (30 days), Business (90 days), Scale (365 days). Logs past the retention period are automatically deleted and cannot be recovered. Use CSV export for long-term storage.

Q. Can I export usage data?

Yes. You can export as CSV from the "Request Logs" section of the dashboard. Includes model, provider, token counts, latency, and more. Available on all plans.

Cancellation

Q. What happens to my data when I cancel?

Deleting your account removes all registered API keys, usage history, and settings. This action cannot be undone.

Q. Will I be charged after cancellation?

For paid plans, cancellation downgrades you to the Free plan at the end of the current billing period. Pro-rated refunds are not available.

Smart Routing & Failover

Q. What's the difference between Smart Routing and failover?

Smart Routing is optimal distribution during normal operation — it routes requests to the best provider based on cost, latency, or load balancing (Scale only). Failover is an emergency switch during outages — it automatically switches to an equivalent model on another provider when 5xx or 429 errors occur (Starter and above). Enabling both means failover kicks in if Smart Routing's chosen provider fails.

Q. Will failover affect my AI agent's behavior?

It might. Because JSON output formats and tool calling specs differ between providers, agents dependent on model-specific behavior may behave unexpectedly. This is why failover is disabled by default. We recommend verifying behavior across multiple providers before enabling.

Tool Calling

Q. Does qzira automatically convert Tool Calling formats between providers?

Yes. qzira accepts OpenAI-format tools / tool_choice parameters and auto-converts to each provider's native format. Tool Calling requests from AI coding tools like Cursor, Cline, and Roo Code work with OpenAI, Anthropic, and Google AI without additional configuration.

Q. Are there unsupported formats for Tool Calling auto-conversion?

Current conversion is based on OpenAI-compatible format (tools array + function definition). If you send Anthropic-native tool_use or Google-native function_declarations format directly, it passes through to that provider, but when switching providers via failover or Smart Routing, conversion may not work correctly. We recommend standardizing on OpenAI-compatible format.

Images & multimodal

Q. Can I send requests with images through qzira?

Yes. qzira passes through request bodies, so image input (Base64 encoded / URL) supported by each model can be sent as-is.

Q. Does qzira auto-convert image formats between providers?

No. There is no auto-conversion of image formats between providers. Sending an OpenAI-format (image_url) request with images to an Anthropic model will result in a format mismatch error. When sending image-containing requests, use the format specified by the target provider. Note that images are also not converted when providers are switched by failover.

Emergency fallback (rollback)

Q. What's the fastest way to recover if qzira has an outage?

Change just these two things to switch to direct provider communication in 10 seconds:

  1. Base URL: Delete (revert to default) or change to the provider's official URL
  2. API Key: Replace gw_... with your original provider key (sk-..., etc.)
⚠️ Because of BYOK, always keep your original provider API keys accessible. Even after registering keys in qzira, you can always check or regenerate them from the provider's dashboard.

qzira is designed so that just changing the Base URL enables introduction or rollback — no major SDK or code rewrites required.


Troubleshooting

Common errors

| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid or missing API key | Verify the gw_-prefixed qzira API key; after a rotation, update to the new key |
| 400 Provider not configured | Provider API key not registered | Register the provider API key in the dashboard |
| 400 Invalid model | Specified model name not recognized | Check the supported models list and specify a valid model name |
| 403 Provider not enabled | Provider not enabled | Enable the provider in the dashboard |
| 403 provider_not_allowed | Request violates the API key's provider restriction | Check the access restriction settings in the dashboard |
| 403 model_not_allowed | Request violates the API key's model restriction | Confirm the requested model is in the allowed models list |
| 403 Budget Exceeded | Budget limit reached | Raise the limit in the dashboard, or wait for the reset |
| 403 Plan limit exceeded | Monthly request limit for the plan reached | Upgrade to a higher plan, or wait for next month's reset |
| 429 Rate Limited | Rate limit reached | Wait a moment and retry (Starter+ plans auto-retry) |
| 502 Bad Gateway | Invalid response from provider | Wait and retry; failover auto-triggers (Starter and above) |
| 503 Service Unavailable | Provider temporarily unavailable | Failover auto-triggers (Starter+); if it persists, check the provider's status page |
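On plans without automatic retry, your app can retry the retryable statuses (429, 502, 503) itself. A minimal sketch with exponential backoff; the helper name and the `send()` callable are illustrative, not part of qzira:

```python
import time

RETRYABLE = {429, 502, 503}  # statuses worth retrying, per the table above

def call_with_retry(send, max_attempts: int = 4, base_delay: float = 1.0):
    """Call send() -> (status, body), retrying retryable statuses with
    exponential backoff. Illustrative helper, not part of qzira."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body
```

Non-retryable statuses (401, 400, 403) indicate configuration problems and should surface immediately rather than be retried.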

Immediate rollback

If you run into problems going through qzira, reverting the Base URL and API key switches you back to direct provider calls instantly.

# Rollback: qzira → direct call
from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxx",  # Revert to your original provider key
    # base_url omitted: the SDK falls back to the official OpenAI endpoint
)

If managing via environment variables, just switch the values in your .env file:

# QZIRA_API_KEY=gw_xxxxxxxx          ← Comment out to roll back
# QZIRA_BASE_URL=https://api.qzira.com/v1
OPENAI_API_KEY=sk-xxxxxxxx            # ← Original key works as-is
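With keys managed this way, the selection logic in your app can stay small. A sketch assuming the variable names above (the function name is hypothetical):

```python
import os

def resolve_client_config() -> dict:
    """Prefer qzira when its variables are set; otherwise fall back to a
    direct OpenAI call. Sketch assuming the QZIRA_* / OPENAI_API_KEY names."""
    gw_key = os.environ.get("QZIRA_API_KEY")
    if gw_key:
        return {
            "api_key": gw_key,
            "base_url": os.environ.get("QZIRA_BASE_URL", "https://api.qzira.com/v1"),
        }
    # Rolled back: direct call with the original provider key
    return {"api_key": os.environ["OPENAI_API_KEY"]}
```

Commenting out the two QZIRA_* lines in .env is then the only change needed to roll back; no code edit or redeploy of the selection logic itself.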


Support

For questions and bug reports, please contact us through the following: