3 Ways to Save Big on LLM Token Usage in Claude and OpenAI

As large language models (LLMs) become increasingly integral to applications ranging from chatbots to code generation, optimizing token usage has become a priority for developers. Efficient token management not only reduces costs but also improves performance and user experience. Both Anthropic’s Claude and OpenAI have introduced features aimed at helping developers save on token usage. In this blog post, we’ll explore three of them: Batch Processing, Predicted Outputs, and Prompt Caching.

Let's take a look at these features.

1. Batch Processing

Claude’s Message Batches API (Beta)

What is it?

Anthropic’s Message Batches API allows developers to process large volumes of message requests asynchronously. By batching multiple requests together, you can reduce costs by up to 50% while increasing throughput.

How does it work?

  • Batch Creation: Create a batch containing multiple message requests.
  • Asynchronous Processing: Each request in the batch is processed independently.
  • Result Retrieval: Retrieve results once the entire batch has been processed.

Creating a Message Batch in Python

import anthropic
from anthropic.types.beta.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.beta.messages.batch_create_params import Request

client = anthropic.Anthropic()

# Create a batch with two message requests
message_batch = client.beta.messages.batches.create(
    requests=[
        Request(
            custom_id="request-1",
            params=MessageCreateParamsNonStreaming(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[
                    {"role": "user", "content": "Hello, world"}
                ]
            )
        ),
        Request(
            custom_id="request-2",
            params=MessageCreateParamsNonStreaming(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[
                    {"role": "user", "content": "How are you today?"}
                ]
            )
        )
    ]
)

print(f"Batch ID: {message_batch.id}")

Benefits:

  • Cost Reduction: Batched requests are billed at a 50% discount compared to standard API pricing.
  • Efficiency: Ideal for tasks like content moderation, data analysis, or bulk content generation.

Limitations:

  • Processing Time: Results are available only after the entire batch is processed (up to 24 hours).
  • Size Restrictions: Limited to 10,000 requests or 32 MB per batch.

OpenAI’s Batch API

What is it?

OpenAI’s Batch API allows you to create large batches of API requests for asynchronous processing, returning completions within 24 hours at a discounted rate.

How does it work?

  • Batch Creation: Upload a JSONL file containing your requests.
  • Endpoint Specification: Specify the endpoint to be used for all requests in the batch.
  • Processing Window: Set a completion window (currently only 24 hours is supported).

Example: Creating a Batch via cURL

curl https://api.openai.com/v1/batches \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'
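
The input_file_id above refers to the JSONL file of requests that you upload beforehand, one request per line. Below is a rough Python sketch of that flow; it assumes the openai Python SDK v1+, and the file name batch_requests.jsonl and the example request line are placeholders:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of the JSONL file is one request, e.g.:
# {"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello, world"}]}}

# Upload the JSONL file for batch processing
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch"
)

# Create the batch against the chat completions endpoint
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")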

Benefits:

  • Cost Savings: Offers a 50% discount on batch processing.
  • Large Volume Processing: Suitable for extensive tasks like data analysis or content generation.
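
When the batch finishes, its results are written to an output file you can download. A hedged sketch of polling for completion and reading that file (same SDK assumption; the batch ID placeholder stands in for the value returned at creation time):

import time
from openai import OpenAI

client = OpenAI()
batch_id = "batch_abc123"  # ID returned by the creation step above (placeholder)

# Poll until the batch reaches a terminal state (can take up to 24 hours)
batch = client.batches.retrieve(batch_id)
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch = client.batches.retrieve(batch_id)

# Download and print the JSONL output once the batch has completed
if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    print(output.text)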

2. Predicted Outputs (OpenAI only)

What is it?

Predicted Outputs is a feature by OpenAI that allows you to reduce latency and cost when much of the response is known ahead of time, such as when making minor modifications to text or code.

How does it work?

  • Include Prediction: Provide a prediction parameter with your anticipated output in the API request.
  • Model Processing: The model uses the prediction to speed up the response, accepting matching tokens.

Example: Using Predicted Outputs to Refactor Code

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

code_content = '''
class User:
    first_name = ""
    last_name = ""
    username = ""
'''.strip()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Replace the 'username' property with an 'email' property. Respond only with code."
        },
        {
            "role": "user",
            "content": code_content
        }
    ],
    # Supply the existing code as the prediction; matching tokens are accepted
    # without being regenerated, which cuts latency and cost.
    prediction={
        "type": "content",
        "content": code_content
    }
)

print(response.choices[0].message.content)

Benefits:

  • Latency Reduction: Speeds up responses, especially for large outputs.
  • Cost Efficiency: Saves costs by reducing the number of new tokens generated.

Limitations:

  • Model Support: Only available with GPT-4o and GPT-4o-mini models.
  • Parameter Restrictions: Not compatible with certain parameters like n > 1, logprobs, or function calling.

3. Prompt Caching

What is it?

Prompt Caching allows frequently used prompts or context to be cached between API calls. When the same prompt is used repeatedly, the system retrieves the cached processing results, reducing the need to process the entire prompt from scratch.

Implementation in Claude

How it works:

  • Cache Write: Writing to the cache costs 25% more than the base input token price.
  • Cache Read: Using cached content costs only 10% of the base input token price.

Use Case Example:

Suppose you have a lengthy system prompt or background information for a conversational agent. By caching this prompt, subsequent interactions become faster and cheaper.

import anthropic

client = anthropic.Anthropic()

# Static content (e.g., a long system prompt or reference material)
system_prompt = "You are an assistant that helps users with math problems. ..."
# Dynamic content (e.g., the user question)
user_question = "What is the derivative of sin(x)?"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": system_prompt,
        # Mark the static prefix as cacheable for subsequent calls
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_question}]
)
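
To confirm that the cache is actually being used, the response’s usage object reports cache activity. The field names below are those exposed by the Anthropic Python SDK; treat this as a sketch:

# Tokens written to the cache on this call (non-zero on the first request)
print("Cache creation tokens:", response.usage.cache_creation_input_tokens)
# Tokens served from the cache (non-zero on later requests reusing the same prefix)
print("Cache read tokens:", response.usage.cache_read_input_tokens)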

Benefits:

  • Cost Reduction: Saves up to 90% in costs for long prompts.
  • Latency Improvement: Decreases latency by up to 85% for lengthy prompts.

Implementation in OpenAI

Automatic Caching:

  • Enabled for prompts that are 1,024 tokens or longer.
  • Cache hits occur in increments of 128 tokens.

Monitoring Cache Usage:

The usage object in the API response provides details about cached tokens.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Long static prompt (assume this is over 1,024 tokens so caching applies)
long_prompt = "..."
# Dynamic content
user_input = "Explain the significance of the Monte Carlo method."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_prompt},
        {"role": "user", "content": user_input}
    ]
)

# The usage object reports how many prompt tokens were served from the cache
print("Cached Tokens:", response.usage.prompt_tokens_details.cached_tokens)
print(response.choices[0].message.content)

Benefits:

  • Cost Reduction: Cached input tokens are billed at a 50% discount.
  • Latency Improvement: Decreases latency by up to 80% for lengthy prompts.

Conclusion

Optimizing token usage is crucial for developers looking to build efficient and cost-effective applications using large language models. Batch Processing, Predicted Outputs, and Prompt Caching are three powerful methods to achieve this goal on both the Claude and OpenAI platforms.

By leveraging Batch Processing, you can handle large volumes of requests asynchronously, significantly reducing costs. Predicted Outputs allow you to reduce latency and save on costs when much of the response is predictable. Prompt Caching enables you to reuse frequently used prompts, cutting down on both processing time and expenses.

Implementing these strategies not only enhances performance but also leads to substantial cost savings. As models continue to evolve, these methods will become even more integral to maximizing efficiency in AI-driven applications.

Links
https://platform.openai.com/docs/guides/predicted-outputs
https://platform.openai.com/docs/guides/prompt-caching
https://platform.openai.com/docs/api-reference/batch
https://www.anthropic.com/news/prompt-caching
