
Together AI — User Guide
Cheap, fast cloud inference.
Strengths
- Extremely low prices: many models cost roughly a tenth of comparable OpenAI models
- Supports mainstream open-source models such as Llama 3, Qwen, Mistral, and DeepSeek
- Fast inference and low latency
- OpenAI compatible API, extremely low migration cost
- Free $1 credit, enough for testing
Best for
- Reducing API call costs for AI applications
- Using open-source models as an alternative to OpenAI
- Backend inference for highly concurrent AI applications
- Testing and comparing different open-source models
- Building cost-sensitive AI products
Quick start
Together AI’s API is fully compatible with OpenAI’s, so existing code needs little modification.
Calling with the OpenAI-compatible API
```python
from openai import OpenAI

# Only base_url and api_key differ from standard OpenAI usage
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "user", "content": "Explain what RAG technology is"}
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)
```
This calls the Llama 3.1 70B model at roughly a tenth of the price of OpenAI’s GPT-4o, with answer quality close to GPT-4 level.
To migrate existing OpenAI code to Together AI, just modify the base_url and api_key lines.
Choose the right model
Together AI main models and prices (2025):

High-quality models:
- Llama-3.1-70B-Instruct: $0.88 / million tokens
- Qwen2.5-72B-Instruct: $1.20 / million tokens
- DeepSeek-V3: $0.27 / million tokens

Fast, lightweight models:
- Llama-3.2-3B-Instruct: $0.06 / million tokens
- Llama-3.1-8B-Instruct: $0.18 / million tokens

For comparison, OpenAI:
- GPT-4o: $2.50 / million input tokens
- GPT-4o-mini: $0.15 / million input tokens
For cost-sensitive applications, DeepSeek-V3 is one of the most cost-effective options available: quality close to GPT-4o at roughly a tenth of the price, well suited to production applications with a large call volume.
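To make the savings concrete, here is a minimal sketch that estimates monthly spend from the per-million-token prices listed above. The figures are the illustrative 2025 prices quoted in this guide, not live quotes:

```python
# Per-million-token prices as listed in this guide (2025 figures)
PRICE_PER_MILLION = {
    "DeepSeek-V3": 0.27,
    "GPT-4o": 2.50,
    "Llama-3.1-8B-Instruct": 0.18,
}

def estimate_cost(model: str, tokens: int) -> float:
    """Return the estimated USD cost of processing `tokens` tokens."""
    return PRICE_PER_MILLION[model] * tokens / 1_000_000

# Example: 50 million tokens per month
for model in ("DeepSeek-V3", "GPT-4o"):
    print(f"{model}: ${estimate_cost(model, 50_000_000):.2f}")
# DeepSeek-V3 comes to about $13.50 versus $125.00 for GPT-4o
```

At that volume, the difference is more than $100 per month for a single model swap.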
Batch inference optimization
Together AI supports batch inference, further reducing costs and increasing throughput.
Process text in batches
```python
import asyncio
from together import AsyncTogether

# Build prompts from a list of documents
texts = ["Please summarize: " + text for text in long_texts_list]

async def process_batch(texts):
    async_client = AsyncTogether(api_key="your-key")
    tasks = [
        async_client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
            messages=[{"role": "user", "content": text}],
        )
        for text in texts
    ]
    # Fire all requests concurrently and wait for every response
    return await asyncio.gather(*tasks)

results = asyncio.run(process_batch(texts))
```
Concurrent calls greatly increase processing speed, which makes this pattern well suited to batch text-processing tasks; Together AI’s concurrency limits are also looser than OpenAI’s. Asynchronous calls can speed up batch processing by a factor of 10–50.
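The bounded-concurrency pattern behind this can be sketched without the SDK: a semaphore caps how many requests are in flight at once, so a large batch stays within rate limits. In this sketch, `fake_api_call` is a stand-in for the real `async_client.chat.completions.create(...)` call, and the limit of 10 is an illustrative value to tune against your account’s actual limits:

```python
import asyncio

async def fake_api_call(text: str) -> str:
    """Stand-in for a real async API request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return text.upper()

async def process_batch(texts, call=fake_api_call, max_concurrency=10):
    # The semaphore admits at most `max_concurrency` calls at a time;
    # the rest wait until a slot frees up.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(text):
        async with sem:
            return await call(text)

    return await asyncio.gather(*(one(t) for t in texts))

results = asyncio.run(process_batch(["hello", "world"]))
print(results)  # ['HELLO', 'WORLD']
```

Swap `fake_api_call` for the real client call to get rate-limited batch processing without unbounded bursts.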
Compared with similar tools
| Tool | Strength | Best for | Pricing |
|---|---|---|---|
| Together AI (this tool) | Lowest prices, wide choice of open-source models, OpenAI-compatible | Cost-sensitive production applications with high API call volume | Pay per token (10x+ cheaper than OpenAI) |
| OpenAI API | Highest model quality, most complete ecosystem | When top quality is required and budget allows | Pay per token |
| Groq | Fastest inference (LPU chips) | Real-time applications with extremely low latency requirements | Free quota / paid tiers |
| Ollama | Fully local, zero cost | Local GPU available and strict data-privacy requirements | Completely free |
Sources & references:
- Together AI official website (2025-03)
- Together AI Documentation (2025-03)