Amazon has been questioned repeatedly by analysts and its own customers about its AI efforts. A quick listen to its Q3 earnings call tells you they’re listening. Much of Amazon’s effort around Generative AI has culminated in two important releases: Bedrock and Titan.
But shipping AI and winning real-world customers are two very different things. How does Bedrock stack up against OpenAI? In two scenarios, we found that the Azure OpenAI models cost 17% and 378% more than the corresponding Amazon Bedrock models (read scenarios).
Editor’s Note: OpenAI recently released new pricing and models but these are not yet available in Azure OpenAI.
Types of Generative AI Models Available in the Cloud
Generative AI encompasses several categories of models available via OpenAI and Bedrock:
- Image Generation: As the name suggests, image generation models such as Stable Diffusion and DALL-E are designed to create images based on textual descriptions or other image inputs.
- Large Language Model (LLM): These models are extensively trained on vast text data and understand/produce textual content. LLMs power ChatGPT, the most popular consumer-facing Generative AI model.
- Base Model: A Foundation Model that serves as a starting point for further fine-tuning.
- Embeddings: Embeddings models are not focused on generating text but on representing text or data numerically in a way that captures semantic relationships, e.g. “king” − “man” + “woman” ≈ “queen”.
- Text Generation: Specializes in producing human-like text.
- Transcription: These models convert audio into text.
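To make the embeddings idea above concrete, here is a minimal sketch of the classic word-analogy arithmetic. The three-dimensional vectors are invented purely for illustration (real embeddings models such as Titan Embeddings or Ada return far larger vectors), and the helper names are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings models return vectors with hundreds or thousands of dimensions.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


print(analogy("king", "man", "woman", TOY_VECTORS))  # queen
```

With real embeddings the nearest neighbor is only approximately the expected word, which is why the relationship is usually written with "≈" rather than "=".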
Bedrock is a fully managed, serverless service that provides users access to foundation models (FMs) from several third-party providers and from Amazon through a single API. After you select an FM to use, you can privately customize it and connect your proprietary data sources and knowledge bases.
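As a rough sketch of what calling a model through that single API can look like, the snippet below uses the AWS SDK for Python (boto3), which the article itself does not cover. The model ID, region, and generation settings are assumptions chosen for illustration, and the request body follows the Titan Text schema; other providers' models expect different body fields.

```python
import json


def build_titan_request(prompt, max_tokens=256, temperature=0.5):
    """Build the JSON body for a Titan Text request.

    The default values are illustrative assumptions, not recommendations.
    """
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "temperature": temperature,
        },
    })


def invoke_titan(prompt, model_id="amazon.titan-text-express-v1", region="us-east-1"):
    """Send the prompt to Bedrock and return the generated text.

    Requires AWS credentials and model access in the chosen region.
    """
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=model_id,
        body=build_titan_request(prompt),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["results"][0]["outputText"]


# Inspect the request body without making a network call.
print(build_titan_request("Summarize our return policy in one sentence."))
```

Switching to another provider's model generally means changing the model ID and the body format, while the client and the `invoke_model` call stay the same.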
Bedrock Supported Models
|Model Family||Provider||Type||Functionalities||Max Request Tokens|
|Claude||Anthropic||LLM||Editing, outlining, searching, summarizing, and writing text. Automated workflows, coding, complex reasoning, content generation, dialogue, and detailed instruction.||100K|
|Claude Instant||Anthropic||LLM||Casual dialogue, comprehension, document comprehension, summarization, and text analysis.||100K|
|Command||Cohere||LLM||Chat, Q&A, summarization, and text generation.||4,096|
|Jurassic-2 Mid||AI21 Labs||LLM||Advanced information extraction, draft generation, ideation, Q&A, and summarization.||8,192|
|Jurassic-2 Ultra||AI21 Labs||LLM||Advanced information extraction, draft generation, intricate ideation, Q&A, and summarization.||8,192|
|SDXL0.8 (Stable Diffusion)||Stability AI||Image Generation||Generate, edit, or update images from text.||77|
|Titan Text – Lite||Amazon||LLM||Chain of thought, chat, code generation, data formatting, extraction, paraphrasing, Q&A, rewrite, summarization, table creation, and text generation.||4,096|
|Titan Text – Express||Amazon||LLM||Chain of thought, chat, code generation, data formatting, extraction, paraphrasing, Q&A, rewrite, retrieval augmented generation, summarization, table creation, and text generation.||8,192|
|Titan Embeddings||Amazon||Embeddings||Clustering, semantic similarity, and text retrieval.||8,192|
Azure OpenAI is a partnership between Azure and OpenAI that enables Azure users to use OpenAI via an API, Python SDK, or their web-based interface while authenticated with their Azure cloud credentials. Azure OpenAI distinguishes itself from OpenAI by offering co-developed APIs, enhanced security, and private networking. Throughout this article, the term “OpenAI” refers exclusively to Azure OpenAI for the sake of brevity.
OpenAI Supported Models
|Model Family||Type||Functionalities||Max Request|
|Ada (Embeddings)||Embeddings||Anomaly detection, classification tasks, clustering, recommendations, and searching.||8,191 tokens|
|DALL-E||Image Generation||Generate, edit, or update images from text.||1,000 characters|
|Babbage-002 (GPT Base)||Base||Generate and understand text or code. Not trained with instruction following.||16,384 tokens|
|Davinci-002 (GPT Base)||Base||Generate and understand text or code. Not trained with instruction following.||16,384 tokens|
|GPT-3.5-Turbo 4k||LLM||Advanced complex reasoning and chat.||4,096 tokens|
|GPT-3.5-Turbo 16k||LLM||Advanced complex reasoning and chat.||16,384 tokens|
|GPT-4 8k||LLM||Chat.||8,192 tokens|
|GPT-4 32k||LLM||Chat.||32,768 tokens|
|Whisper||Transcription||Convert audio to text.||25 MB audio size|
Amazon Bedrock vs Azure OpenAI Functionality
OpenAI certainly has a lot of name recognition. Because of this, many people assume it is leaps and bounds ahead of other Generative AI services. However, as Randall Hunt, VP of Cloud Strategy and Innovation at Caylent, relayed on Yan Cui’s Real-World Serverless podcast, “There wasn’t anything crazy great about what OpenAI did anything it just happened to be one of the first times we could see the power of these LLMs through an interface.” Still, GPT-4 is generally recognized as the leader in terms of pure quality.
Let’s compare some functionality on the service and model level to see how they fare. As this is the cloud, when comparing Bedrock to OpenAI we need to consider things like supported regions and security.
- Documentation/Community: Documentation and community support are challenging to quantify precisely, but based on anecdotal assessments, it’s fair to say that the documentation for both services is satisfactory at best. A lot of information is missing, and the instructions are scattered. This is likely because both services and the models within them are so new and constantly changing.
- No-Code Playgrounds: While both services are accessible via APIs and SDKs, a no-code playground can be a helpful interface for trying out some of the models.
- Provisioned Throughput: Bedrock offers a Provisioned Throughput payment plan for some model types that is advantageous for large workloads.
Similarly, there are several factors to take into consideration when comparing models within their respective categories such as max tokens, supported languages, and training data date. Later, we’ll go in-depth on pricing and performance.
- Max Tokens: Max tokens vary for model categories and types.
- Embeddings Models: Both models have 8,192 tokens.
- Image Generation Models: The character-to-token ratio varies, but 1,000 characters correspond to roughly 250 tokens. So DALL-E (1,000 characters) accepts longer prompts than Stable Diffusion (77 tokens).
- LLMs: Both Bedrock and OpenAI provide models with 4k and 8k token options. OpenAI extends the range with 16k and 32k token options. However, the Claude model takes the trophy with an impressive maximum capacity of 100k tokens. That corresponds to 75,000 words or hundreds of pages.
- Supported Regions: Bedrock is available in Asia Pacific (Singapore), Asia Pacific (Tokyo), Europe (Frankfurt), US East (N. Virginia), and US West (Oregon). OpenAI regions vary per model as follows:
|Region||Models|
|Australia East||Ada, GPT-3.5, GPT-4|
|Canada East||Ada, GPT-3.5, GPT-4|
|East US||Ada, DALL-E, GPT-3.5, GPT-4|
|East US 2||Ada, GPT-3.5|
|France Central||Ada, GPT-3.5, GPT-4|
|Japan East||Ada, GPT-3.5, GPT-4|
|North Central US||Ada, Base Models, GPT-3.5|
|South Central US||Ada, Whisper|
|Sweden Central||Base Models, GPT-3.5, GPT-4|
|Switzerland North||Ada, GPT-3.5, GPT-4|
|UK South||Ada, GPT-3.5, GPT-4|
|West Europe||Ada, Whisper|
- Supported Languages: Bedrock language capacity is model-specific. Command, Llama, Stable Diffusion, and Titan Text – Lite support only English. Jurassic supports 7 languages, Claude supports 12+, Titan Embeddings supports 25+, while Titan Text – Express supports over 100. See the model pages for the specific languages. OpenAI publishes less information on which languages are supported; however, this response claims it is available for use in a variety of languages.
- Training Data Date: OpenAI’s models Ada, GPT-3.5, GPT-4, and the Base Models are trained on data up to September 2021. Bedrock’s training data dates were a little harder to find; we had to go to the respective providers’ sites, where the only publicly available dates are Claude’s (December 2022) and Jurassic’s (current up to mid-2022).
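The Max Tokens comparison above leans on two rules of thumb: 1,000 characters are roughly 250 tokens, and 100K tokens are roughly 75,000 words. A small sketch of those conversions follows; the ratios are approximations taken from this article, and actual token counts depend on each model's tokenizer and on the language of the text.

```python
# Approximate ratios taken from the rules of thumb in this article.
CHARS_PER_TOKEN = 4      # 1,000 characters ~ 250 tokens
WORDS_PER_TOKEN = 0.75   # 100,000 tokens ~ 75,000 words


def chars_to_tokens(characters):
    """Estimate how many tokens a prompt of the given character length uses."""
    return round(characters / CHARS_PER_TOKEN)


def tokens_to_words(tokens):
    """Estimate how many English words fit in the given token budget."""
    return round(tokens * WORDS_PER_TOKEN)


print(chars_to_tokens(1_000))    # 250 (DALL-E's 1,000-character limit)
print(tokens_to_words(100_000))  # 75000 (Claude's 100K-token limit)
```

By this estimate, DALL-E's 1,000-character limit is roughly three times Stable Diffusion's 77-token limit.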
Charges for Bedrock are applied for model inference and customization. There are two plans available for model inference, On-Demand and Provisioned Throughput. Model customization and Provisioned Throughput are not available for all models.
On-Demand is the non-committal, pay-by-usage option. Charges vary depending on the model type. Text generation models incur charges per input token processed and output token generated. Embeddings models charge per input token processed. Image generation models charge per image generated.
|Model||Price per 1000 Input Tokens||Price per 1000 Output Tokens|
|Llama 2 Chat (13B)||$0.00075||$0.001|
|Titan Text – Lite||$0.0003||$0.0004|
|Titan Text – Express||$0.0013||$0.0017|
|Image Resolution||Standard Quality (<51 steps)||Premium Quality (>51 steps)|
|512x512 or smaller||$0.018 per image||$0.036 per image|
|Larger than 512x512||$0.036 per image||$0.072 per image|
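The image pricing table above can be read as a simple lookup on resolution and step count. The sketch below encodes it that way; the function names are hypothetical, and because the table leaves exactly 51 steps undefined, this sketch treats 51 or more steps as premium quality.

```python
# On-demand Stable Diffusion prices from the table above (USD per image).
PRICES = {
    ("small", "standard"): 0.018,
    ("small", "premium"): 0.036,
    ("large", "standard"): 0.036,
    ("large", "premium"): 0.072,
}


def price_per_image(width, height, steps):
    """Look up the per-image price for a resolution and step count."""
    size = "small" if width <= 512 and height <= 512 else "large"
    # Assumption: the table does not define exactly 51 steps; treated as premium here.
    quality = "standard" if steps < 51 else "premium"
    return PRICES[(size, quality)]


def monthly_image_cost(images, width, height, steps):
    """Estimated monthly cost for a number of images with identical settings."""
    return round(images * price_per_image(width, height, steps), 2)


print(monthly_image_cost(1_000, 512, 512, 30))    # 18.0
print(monthly_image_cost(1_000, 1024, 1024, 60))  # 72.0
```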
With Provisioned Throughput, you buy model units (specific throughput measured by the maximum number of input/output tokens processed per minute) for a specific model (including custom models). Pricing is charged hourly and you can choose a one-month or six-month term. This pricing model is best suited for “large consistent inference workloads that need guaranteed throughput.”
|Model||Price per Hour per Model Unit with No Commitment (Max 1 Custom Model Unit Inference)||Price per Hour per Model Unit with a 1-month Commitment (Includes Inference)||Price per Hour per Model Unit with a 6-month Commitment (Includes Inference)|
|Llama 2 Chat (13B)||N/A||$21.20||$13.20|
|SDXL1.0 (Stable Diffusion)||N/A||$49.86||$46.18|
|Titan Text – Lite||$7.10||$6.40||$5.10|
|Titan Text – Express||$20.50||$18.40||$14.80|
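Because Provisioned Throughput is billed per hour, it helps to translate the table above into a monthly figure. The sketch below does that for a few of the listed models. The 730-hour month is an assumption used for estimation, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average month length, for estimates only

# Hourly price per model unit from the table above (USD). None means not offered.
PROVISIONED = {
    "Titan Text - Lite":    {"none": 7.10,  "1-month": 6.40,  "6-month": 5.10},
    "Titan Text - Express": {"none": 20.50, "1-month": 18.40, "6-month": 14.80},
    "Llama 2 Chat (13B)":   {"none": None,  "1-month": 21.20, "6-month": 13.20},
}


def monthly_provisioned_cost(model, commitment, model_units=1):
    """Estimated monthly cost of running model units around the clock."""
    hourly = PROVISIONED[model][commitment]
    if hourly is None:
        raise ValueError(f"{model} is not offered with commitment '{commitment}'")
    return round(hourly * model_units * HOURS_PER_MONTH, 2)


print(monthly_provisioned_cost("Titan Text - Express", "1-month"))  # 13432.0
print(monthly_provisioned_cost("Titan Text - Express", "6-month"))  # 10804.0
```

At five-figure monthly costs per model unit, this plan only makes sense for the large, consistent workloads described above.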
You’re charged for text generation model customization based on the number of processed tokens and model storage. Keep in mind inference on more than one model unit is only available for Provisioned Throughput.
|Model||Price to Train 1000 Tokens||Price for Storage per Custom Model per Month||Price to Infer from a Custom Model for One Model Unit per Hour with No Commitment|
|Titan Text – Lite||$0.0004||$1.95||$7.10|
|Titan Text – Express||$0.0080||$1.95||$20.50|
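A rough way to budget a customization job from the table above is to add the training charge to the storage charge. The sketch below assumes every training token is billed once and leaves out inference on the custom model, which is billed separately per model unit; the workload figures are hypothetical.

```python
def customization_cost(training_tokens, price_per_1k_tokens, storage_months,
                       storage_price_per_month=1.95):
    """Training plus storage cost for one custom model (USD).

    Assumption: every training token is billed once; inference is excluded.
    """
    training = training_tokens / 1000 * price_per_1k_tokens
    storage = storage_months * storage_price_per_month
    return round(training + storage, 2)


# Hypothetical job: 10 million training tokens, model stored for 3 months.
print(customization_cost(10_000_000, 0.0080, 3))  # Titan Text - Express: 85.85
print(customization_cost(10_000_000, 0.0004, 3))  # Titan Text - Lite: 9.85
```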
Azure OpenAI Pricing
Charges for OpenAI are fairly simple. It is pay-as-you-go, with no commitment. There are additional customization charges. Price varies per region.
Charges vary by model type and, where applicable, context length. Text generation models charge per prompt token and completion token. Embeddings models and base models charge per usage token. Image generation models charge per 100 images generated.
|Model||Context||Price per 1000 Input Tokens||Price per 1000 Output Tokens|
|Babbage-002 (GPT Base)||N/A||$0.0004||$0.0004|
|Davinci-002 (GPT Base)||N/A||$0.002||$0.002|
|Models||Price per 100 images|
Model customization charges are based on training time and hosting time with slightly different pricing per region.
|Model||Price for training per Compute Hour||Price for Hosting per Hour|
|Babbage-002 (GPT Base)||$34||$1.70|
|Davinci-002 (GPT Base)||$68||$3|
In a model-by-model comparison, Bedrock is cheaper than OpenAI. However, cost does not tell the full story and the scenarios below are based purely on an analysis of pricing. The Titan and Ada embeddings models have the same price and max tokens, so we will skip them.
Image Generation Models: Stable Diffusion vs DALL-E
There is more flexibility in cost for image generation with Stable Diffusion since you can choose both the resolution and the quality. With DALL-E you can still choose between 256x256, 512x512, or 1024x1024 pixels. Stable Diffusion also charges per image, whereas DALL-E charges per 100 images. Therefore, unless you are working with standard quality 512x512 images or lesser quantities, DALL-E is the more cost-effective solution.
Standard Context Window: Command, Titan Text vs GPT-3.5-Turbo 4k
For a lower-capacity model where we want to perform tasks such as chat, summarization of an article-length passage, Q&A, etc., we can consider one of the models with a 4,096 token max. One model from OpenAI fits the criteria, GPT-3.5-Turbo 4k, along with several from Bedrock: Command, Titan Text – Lite, and Titan Text – Express. The pricing for Command and GPT-3.5-Turbo 4k is the same, at $0.0015 per 1000 input tokens and $0.0020 per 1000 output tokens.
Titan Text – Lite, which offers many of the same capabilities, is much cheaper at $0.0003 per 1000 input tokens and $0.0004 per 1000 output tokens. Another option is Titan Text – Express, which differs from the Lite version in that it supports retrieval augmented generation and has a maximum of 8,192 tokens. The price is $0.0013 per 1000 input tokens and $0.0017 per 1000 output tokens, cheaper than GPT-3.5-Turbo 4k.
Chatbot Scenario: Titan Text - Express vs GPT-3.5-Turbo 4k
Consider a scenario where you want to develop a simple customer service chatbot. The chatbot will need to be able to handle customer inquiries, provide assistance, and answer questions on a range of topics related to your products and services. The model will need to handle short sentences as well as more detailed discussions.
A standard question could be about 15 tokens and the answer about 85. If your chatbot answers 250,000 questions of similar length a month, the estimated price would be:
15 tokens X 250,000 questions = 3,750,000 input tokens
85 tokens X 250,000 answers = 21,250,000 output tokens
Titan Text - Express: 3,750,000 input tokens / 1000 X $0.0013 + 21,250,000 output tokens / 1000 X $0.0017 = $41
GPT-3.5-Turbo 4k: 3,750,000 input tokens / 1000 X $0.0015 + 21,250,000 output tokens / 1000 X $0.0020 = $48.13
GPT-3.5-Turbo 4k costs 17% more than Titan Text – Express, making Bedrock the cheaper option for a lower-capacity model.
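The chatbot estimate above can be reproduced with a few lines of Python. This is a minimal sketch using the on-demand prices quoted in this article; the function name is hypothetical, and `Decimal` is used so that currency rounding behaves predictably.

```python
from decimal import Decimal, ROUND_HALF_UP


def on_demand_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Monthly on-demand cost in USD, rounded to the nearest cent."""
    cost = (Decimal(input_tokens) * Decimal(input_price_per_1k)
            + Decimal(output_tokens) * Decimal(output_price_per_1k)) / 1000
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


questions = 250_000
input_tokens = 15 * questions    # 3,750,000
output_tokens = 85 * questions   # 21,250,000

titan = on_demand_cost(input_tokens, output_tokens, "0.0013", "0.0017")
gpt35 = on_demand_cost(input_tokens, output_tokens, "0.0015", "0.0020")
premium = ((gpt35 - titan) / titan * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

print(f"Titan Text - Express: ${titan}")          # $41.00
print(f"GPT-3.5-Turbo 4k: ${gpt35}")              # $48.13
print(f"GPT-3.5-Turbo 4k costs {premium}% more")  # 17% more
```

Changing the per-question token counts or the monthly volume shows how sensitive the gap is to your own traffic profile.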
Long Context Window: Jurassic-2 vs GPT-4 8k
For more advanced tasks such as advanced information extraction, draft generation, and summarization of larger passages, let’s compare some of the models with an 8,192 token max. Among the Jurassic-2 models, the Ultra model stands apart because of its intricate ideation and compares well with GPT-4 8k. Jurassic-2 Ultra is much cheaper at $0.0188 per 1000 input tokens and $0.0188 per 1000 output tokens, compared to GPT-4 8k’s $0.03 per 1000 input tokens and $0.06 per 1000 output tokens.
Long Context Window: Claude Instant vs GPT-3.5-Turbo 16k
For even larger tasks consider Claude Instant (100K token max) and GPT-3.5-Turbo 16k (16,384 token max). The capabilities and pricing are relatively similar. However, the choice is much more case-dependent: with Claude Instant you are charged $0.00163 per 1000 input tokens and $0.00551 per 1000 output tokens, compared to $0.003 per 1000 input tokens and $0.004 per 1000 output tokens for GPT-3.5-Turbo 16k. Claude Instant’s input tokens are cheaper but its output tokens are more expensive, so it is a great choice for workloads with many input tokens and relatively few output tokens.
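Since one model is cheaper on input and the other on output, there is a break-even ratio of output to input tokens at which both cost the same. The sketch below derives it from the prices quoted above; the function names are hypothetical.

```python
from decimal import Decimal

# Prices per 1000 tokens (USD), as quoted in this article.
CLAUDE_INSTANT = {"input": Decimal("0.00163"), "output": Decimal("0.00551")}
GPT35_16K = {"input": Decimal("0.003"), "output": Decimal("0.004")}


def cost(prices, input_tokens, output_tokens):
    """On-demand cost in USD for a given token mix."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1000


def breakeven_output_per_input(a, b):
    """Output/input token ratio at which models a and b cost the same.

    Only meaningful when a is cheaper on input and more expensive on output.
    Derived from: a_in*i + a_out*o = b_in*i + b_out*o.
    """
    return (b["input"] - a["input"]) / (a["output"] - b["output"])


ratio = breakeven_output_per_input(CLAUDE_INSTANT, GPT35_16K)
print(round(ratio, 2))  # 0.91
```

In other words, under these prices Claude Instant comes out cheaper whenever a workload produces fewer than about 0.91 output tokens per input token, which covers summarization and extraction but not long-form generation from short prompts.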
Extra Long Context Window: Claude vs GPT-4 32k
For high-capacity models with very advanced tasks such as content generation and complex reasoning consider Claude vs GPT-4 32k. Claude has an impressive maximum of 100K tokens, while GPT-4 32k provides 32,768 tokens. Claude is a great choice since it is way cheaper at $0.01102 per 1000 input tokens and $0.03268 per 1000 output tokens. GPT-4 32k is $0.06 per 1000 input tokens and $0.12 per 1000 output tokens.
Text Summarization Scenario: Claude vs GPT-4 32k
You work at a content creation agency and need to summarize lengthy articles and reports for clients. You want to process articles at around 25,000 tokens and summarize them to about 5,000 tokens. If you process 300 articles a month consider the estimated prices:
25,000 tokens X 300 articles = 7,500,000 input tokens
5,000 tokens X 300 responses = 1,500,000 output tokens
Claude: 7,500,000 input tokens / 1000 X $0.01102 + 1,500,000 output tokens / 1000 X $0.03268 = $131.67
GPT-4 32k: 7,500,000 input tokens / 1000 X $0.06 + 1,500,000 output tokens / 1000 X $0.12 = $630
GPT-4 32k costs 378% more than Claude, making Bedrock (Anthropic) the cheaper option in this scenario.
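Before comparing prices for long documents, it is worth confirming that each request fits the model's limit. The sketch below checks that and then reproduces the summarization estimate. It assumes the prompt and the summary share one token budget, and the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Limits and prices per 1000 tokens (USD), as quoted in this article.
MODELS = {
    "Claude":    {"max_tokens": 100_000, "input": Decimal("0.01102"), "output": Decimal("0.03268")},
    "GPT-4 32k": {"max_tokens": 32_768,  "input": Decimal("0.06"),    "output": Decimal("0.12")},
}


def fits_context(model, input_tokens, output_tokens):
    """Assumption: prompt and completion count against the same token limit."""
    return input_tokens + output_tokens <= MODELS[model]["max_tokens"]


def monthly_cost(model, input_tokens, output_tokens, requests):
    """Monthly on-demand cost in USD for identical requests, rounded to cents."""
    prices = MODELS[model]
    per_request = (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1000
    return (per_request * requests).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for name in MODELS:
    print(name, fits_context(name, 25_000, 5_000), monthly_cost(name, 25_000, 5_000, 300))
# Claude True 131.67
# GPT-4 32k True 630.00

claude = monthly_cost("Claude", 25_000, 5_000, 300)
gpt4 = monthly_cost("GPT-4 32k", 25_000, 5_000, 300)
premium = ((gpt4 - claude) / claude * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
print(f"GPT-4 32k costs {premium}% more")  # 378% more
```

Under that assumption, an article of roughly 28,000 tokens or more plus a 5,000-token summary would no longer fit GPT-4 32k, while Claude would still have ample headroom.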
In terms of functionality, this article offers a great in-depth comparison of the two models and concludes that GPT-4 32k performs slightly better. Some takeaways are similar performance for code generation and conversion, better performance from GPT-4 32k in dataset analysis and math, and Claude’s distinct ability to summarize text over 32k tokens.
There are a number of dimensions to consider when comparing Bedrock and OpenAI, such as region availability, tokens, model quality, and price. Based on the variety of models, lower price, and large token max from the Claude model, we think Bedrock is increasingly competitive for applications where the absolute best performance is not required.