
Groq — User Guide
Groq provides extremely fast LLM inference.
Strengths
- Industry-leading inference speed from custom LPU chips, up to 500+ tokens/second
- Generous free quota: 30 requests per minute
- Supports mainstream open-source models such as Llama 3, Mixtral, and Gemma
- OpenAI-compatible API, so migration is easy (see the sketch after this list)
- Extremely low latency, well suited to real-time conversational applications
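Because the API is OpenAI-compatible, you can even point the official OpenAI SDK at Groq. A minimal sketch, assuming you only swap the base URL (Groq's documented OpenAI-compatible endpoint) and the API key placeholder:

```python
# Minimal sketch: using the official OpenAI SDK against Groq.
# Only the base_url and API key differ from a normal OpenAI setup;
# "your-groq-api-key" is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
    api_key="your-groq-api-key",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello from the OpenAI SDK!"}],
)
print(response.choices[0].message.content)
```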
Best for
- Real-time AI applications with strict latency requirements
- Voice AI assistants (low latency is key)
- Live code completion and suggestions
- Highly concurrent AI services
- Free rapid prototyping with open source models
Quick start
Groq's API is OpenAI-compatible, so you can experience extremely fast inference with just a few lines of code.
Experience extremely fast inference
```python
from groq import Groq
import time

client = Groq(api_key="your-groq-api-key")

start = time.time()

# Stream a long completion and time it end to end
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Write a 500-word article about artificial intelligence"}
    ],
    stream=True
)

# Print each chunk as soon as it arrives
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

elapsed = time.time() - start
print(f"\n\nTime taken: {elapsed:.2f} seconds")
```

A 500-word article is generated in about 2-3 seconds.
That is roughly 5-10x the speed of the OpenAI API, and streaming output (stream=True) lets users see text almost in real time, which makes for an excellent experience.
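If you want to check the tokens/second figure yourself, a rough sketch is to time a non-streaming call and divide the completion token count from the response's usage field by the elapsed wall-clock time. Note that this includes network and queueing overhead, so it understates raw generation speed:

```python
# Rough throughput check (sketch): time a non-streaming call and divide
# the completion tokens reported in `usage` by the wall-clock time.
# The result understates raw generation speed, since it includes
# network and queueing overhead.
import time
from groq import Groq

client = Groq(api_key="your-groq-api-key")

start = time.time()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a 500-word article about artificial intelligence"}],
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} tokens/s")
```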
Free quota description
Groq free tier limits (2025):

Llama 3.3 70B:
- Per minute: 30 requests, 6,000 tokens
- Per day: 14,400 requests, 500,000 tokens

Llama 3.1 8B:
- Per minute: 30 requests, 20,000 tokens
- Per day: 14,400 requests, 500,000 tokens

Suitable scenarios:
- Personal projects and prototype development
- Learning and testing
- Low-frequency tools
The free tier is sufficient for personal projects, but its strict per-minute limits make it unsuitable for production. Production applications should use the paid tier, which is billed per token at reasonable prices, or combine Groq with other platforms.
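When you do hit the per-minute cap, the API returns HTTP 429. A minimal backoff sketch, assuming the Groq SDK raises RateLimitError on 429 the way the OpenAI SDK's error classes do:

```python
# Sketch: exponential backoff around the free tier's per-minute cap.
# Assumes the Groq SDK raises RateLimitError on HTTP 429, mirroring
# the OpenAI SDK's error classes.
import time
from groq import Groq, RateLimitError

client = Groq(api_key="your-groq-api-key")

def chat_with_backoff(messages, retries=5):
    delay = 2.0
    for _ in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(delay)  # wait before retrying
            delay *= 2         # double the wait each attempt
    raise RuntimeError("Still rate-limited after all retries")

reply = chat_with_backoff([{"role": "user", "content": "ping"}])
print(reply.choices[0].message.content)
```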
Real-time voice AI applications
Groq's low-latency nature makes it particularly suitable for voice AI applications.
Building a low-latency voice assistant
```python
# Build a voice assistant using Groq + Whisper
from groq import Groq

client = Groq(api_key="your-key")

# 1. Speech to text (Whisper)
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-large-v3",
        language="zh"
    )
user_text = transcription.text
print(f"User said: {user_text}")

# 2. Extremely fast LLM reply
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": user_text}]
)
ai_reply = response.choices[0].message.content
print(f"AI reply: {ai_reply}")

# 3. Text-to-speech (can be connected to ElevenLabs, etc.)
```

The entire speech recognition + AI reply process takes about 1-2 seconds.
Groq's low latency makes voice conversations feel close to talking with a real person, which suits voice assistant products well. Since Groq also hosts Whisper speech recognition, the complete voice AI pipeline can run on a single platform.
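As a sketch of how the pieces fit together, one conversational turn can be packaged into a reusable function that streams the reply, so a TTS engine could start speaking before the full text is ready. The speak() hook below is hypothetical, not a real API:

```python
# Sketch: one conversational turn as a reusable function. The reply is
# streamed so a TTS engine could start speaking before the full text
# is ready; speak() is a hypothetical TTS hook, not a real API.
from groq import Groq

client = Groq(api_key="your-groq-api-key")

def voice_turn(audio_path: str) -> str:
    # 1. Transcribe the user's speech with Whisper on Groq
    with open(audio_path, "rb") as f:
        user_text = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
        ).text

    # 2. Stream the LLM reply, forwarding each fragment as it arrives
    parts = []
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            # speak(delta)  # hypothetical: hand the fragment to a TTS engine

    return "".join(parts)

print(voice_turn("audio.mp3"))
```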
Comparison with similar tools
| Tool | Strength | Best for | Pricing |
|---|---|---|---|
| Groq (this tool) | Industry-leading inference speed, generous free quota, low latency | Real-time applications, voice AI, latency-critical scenarios | Free tier; paid tier billed per token |
| Together AI | Lower prices, more model choices | Cost-sensitive, high-volume workloads | Per token |
| OpenAI API | Highest model quality | When top quality is required | Per token |
| Ollama | Fully local, no network latency | On-premises deployment, data privacy | Completely free |
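Because several of these providers share the OpenAI-compatible request shape, one pragmatic pattern is to prefer Groq for speed and fall back to another provider when it fails or is rate-limited. A sketch, where the model names are only examples:

```python
# Sketch: prefer Groq for speed, fall back to another provider when Groq
# fails or is rate-limited. This works because both clients expose the
# same chat-completions interface; the model names are only examples.
from groq import Groq
from openai import OpenAI

groq_client = Groq(api_key="your-groq-api-key")
openai_client = OpenAI(api_key="your-openai-api-key")

def chat(messages):
    try:
        response = groq_client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=messages,
        )
    except Exception:  # e.g. rate limit or outage; fall back
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
    return response.choices[0].message.content

print(chat([{"role": "user", "content": "Hello!"}]))
```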
Sources & references:
- Groq official website (2025-03)
- Groq API documentation (2025-03)