AI Model Speed Benchmark: Time-to-First-Token and Throughput

Updated 2026-03-10

Speed matters. Whether you are building a chatbot that needs instant responses or processing thousands of documents, the latency and throughput of your AI model directly affect user experience and cost. We benchmarked the major models on both metrics.

AI model comparisons are based on publicly available benchmarks and editorial testing. Results may vary by use case.

Speed Metrics Explained

Time-to-First-Token (TTFT): How long until the model starts generating output. This is what users “feel” in interactive applications. Lower is better.

Throughput (tokens/second): How fast the model generates output once it starts. Higher is better. This determines how quickly you get a complete response.

Total Response Time: The time to deliver a complete response. It equals TTFT plus generation time (output length divided by throughput), so it depends on both metrics and on how long the response is.
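The relationship between these metrics is simple arithmetic. A minimal sketch, using figures from the benchmark table below rather than live measurements:

```python
def total_response_time(ttft_s: float, tokens_per_sec: float, output_tokens: int = 500) -> float:
    """Estimate total response time: time-to-first-token plus generation time."""
    return ttft_s + output_tokens / tokens_per_sec

# Reproduces the "Total Time (500 tokens)" column, e.g. Gemini Flash:
print(round(total_response_time(0.2, 180), 1))  # ~3.0 seconds
```

This is why a model with a low TTFT but modest throughput can still lose to a faster generator on long outputs: for short responses TTFT dominates, for long ones throughput does.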

Speed Benchmark Results

| Model | TTFT (median) | Throughput (tokens/sec) | Total Time (500 tokens) | Tier |
| --- | --- | --- | --- | --- |
| Gemini Flash | 0.2s | 180 | ~3.0s | Budget |
| Claude Haiku 4 | 0.3s | 150 | ~3.6s | Budget |
| GPT-4o mini | 0.3s | 140 | ~3.9s | Budget |
| GPT-4o | 0.5s | 90 | ~6.1s | Mid |
| Claude Sonnet 4 | 0.5s | 85 | ~6.4s | Mid |
| Gemini Pro | 0.6s | 80 | ~6.9s | Mid |
| Mistral Large | 0.5s | 75 | ~7.2s | Mid |
| Claude Opus 4 | 0.8s | 55 | ~9.9s | Premium |
| Gemini Ultra | 0.9s | 50 | ~10.9s | Premium |
| o3 | 2-15s | 45 | 15-60s+ | Reasoning |

Benchmarks measured using standard API endpoints under normal load. Results vary by time of day, prompt complexity, and server load.

Key Observations

Budget Models Are Remarkably Fast

Gemini Flash and Claude Haiku 4 respond almost instantly (0.2-0.3s TTFT) and generate text at 150-180 tokens per second. For interactive chatbots and real-time applications, these models provide the snappiest user experience.

Reasoning Models Are Slow by Design

o3 and similar reasoning models are intentionally slow because they generate internal “thinking” tokens before responding. A simple question might take 5 seconds; a complex math problem might take 60+ seconds. This is a feature, not a bug. The thinking time is what enables superior accuracy.

Premium Models Are 3-4x Slower Than Budget

Claude Opus 4 and Gemini Ultra generate output at roughly one-third the speed of their budget counterparts. This is the tradeoff for higher quality: more parameters mean more computation per token.

Streaming Masks Latency

Most applications use streaming (showing tokens as they are generated). With streaming, users perceive the model as responsive even when total generation takes 10+ seconds, because they see output appearing after just 0.5-0.9 seconds (TTFT).
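The two numbers a user actually experiences, TTFT and total time, can be measured from any token iterator. A sketch using a simulated stream (the timings and generator are hypothetical stand-ins for a real streaming API response):

```python
import time

def simulated_stream(n_tokens=20, ttft=0.05, per_token=0.005):
    """Hypothetical stand-in for a streaming API response."""
    time.sleep(ttft)               # model "prefill" before the first token
    for i in range(n_tokens):
        yield f"token{i} "
        time.sleep(per_token)      # steady per-token generation

def measure(stream):
    """Measure TTFT and total time for any token iterator."""
    start = time.perf_counter()
    first = next(stream)           # blocks until the first token arrives
    ttft = time.perf_counter() - start
    text = first + "".join(stream)
    total = time.perf_counter() - start
    return ttft, total, text

ttft, total, text = measure(simulated_stream())
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```

The gap between `ttft` and `total` is exactly what streaming hides from the user: output starts appearing at `ttft`, even though the full response is not done until `total`.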

Speed by Use Case

| Use Case | Key Speed Metric | Recommended Models |
| --- | --- | --- |
| Chatbot | TTFT (<0.5s ideal) | Haiku 4, Flash, GPT-4o mini |
| Real-time suggestions | TTFT (<0.3s ideal) | Flash, Haiku 4 |
| Document processing (batch) | Throughput | Any (batch = not time-sensitive) |
| Coding assistant | TTFT + throughput | Sonnet 4, GPT-4o |
| Complex analysis | Quality > speed | Opus 4, o3 (speed less important) |
| Autocomplete | TTFT (<0.2s ideal) | Flash, Haiku 4 |

Factors That Affect Speed

  1. Prompt length. Longer prompts increase TTFT because the model must process more input before generating output. A 100K-token prompt has significantly higher TTFT than a 100-token prompt.

  2. Server load. Response times vary by time of day and overall demand. Peak hours (US business hours) tend to be slower.

  3. Region. API endpoints closer to your location provide lower latency. Most providers have multi-region deployments.

  4. Streaming vs. non-streaming. Streaming starts delivering tokens immediately but may have slightly lower throughput.

  5. Max tokens setting. Setting a lower max_tokens limit does not speed up generation but prevents unexpectedly long responses.

Optimizing for Speed

  1. Use the right model tier. Do not use Opus for tasks that Haiku can handle.
  2. Minimize prompt size. Only include necessary context. Use retrieval rather than stuffing all information into the prompt.
  3. Implement streaming. Users perceive streaming responses as faster even when total time is the same.
  4. Use prompt caching. Cached prompts reduce TTFT because the prefill computation is reused.
  5. Consider parallel requests. For batch processing, send multiple requests simultaneously to increase overall throughput.
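For the last point, API calls are I/O-bound, so a thread pool is enough to overlap the waiting time. A minimal sketch, where `summarize_document` is a hypothetical placeholder for a real per-document model call:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_document(doc: str) -> str:
    """Placeholder for a per-document model call; a real version would hit the API."""
    return f"summary of {doc}"

docs = ["report-q1", "report-q2", "report-q3", "report-q4"]

# Send requests concurrently; overall throughput scales with the pool size
# until you hit the provider's rate limits.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(summarize_document, docs))
```

Keep `max_workers` below your provider's rate limit; many providers also offer dedicated batch endpoints that trade latency for lower cost on exactly this kind of workload.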

Key Takeaways

  • Budget models (Gemini Flash, Claude Haiku 4) are 3-4x faster than premium models, making them ideal for interactive applications.
  • Time-to-first-token (TTFT) is the most important metric for chatbots and real-time applications. Budget models achieve 0.2-0.3 second TTFT.
  • Reasoning models (o3) are intentionally slow, trading speed for accuracy. They are not suitable for latency-sensitive applications.
  • Streaming is essential for any user-facing application. It masks total generation time by showing partial output immediately.
  • Prompt length significantly affects TTFT. Minimizing context improves responsiveness.

This content is for informational purposes only and reflects independently researched comparisons. AI model capabilities change frequently — verify current specs with providers. Not professional advice.