Foundation Model Capabilities and Limitations

Overview

Not all foundation models are created equal. They vary in context window size, speed, reasoning ability, and what types of input they accept. Picking the right model means understanding these tradeoffs.

Amazon Bedrock gives you access to models from Anthropic, Meta, Amazon, Mistral, and others. Each has strengths and weaknesses. The exam expects you to know when to use what.

The Core Tradeoff

Every model trades off between capability, cost, speed, and context size. There's no "best" model. There's only the best model for your specific use case.

Exam Tip

Know which models have larger context windows, faster latency, or better reasoning. Hallucination mitigation (especially RAG) comes up constantly.

Architecture Diagram

Foundation Model Capability Dimensions — Figure 1: Key capability dimensions for evaluating foundation models

Token Limits and Context Windows

Context Windows

The context window is the maximum number of tokens a model can handle in one request. Input and output combined.

Why this matters:

Limits how much conversation history you can include
Determines whether a document fits in one request or needs chunking
Affects RAG chunk sizes
Larger windows let the model reason over more information

Current context windows: | Model | Context Window | |-------|----------------| | Claude 3.5 Sonnet/Opus | 200K tokens | | Llama 3.1/3.2/3.3 | 128K tokens | | Llama 3 (8B, 70B) | 8K tokens | | Amazon Nova Pro | 300K tokens | | Amazon Titan Text | 8K tokens | | Mistral Large | 128K tokens | | Qwen3-Coder | 256K-1M tokens |

Token Limits

Token limits cap both input and output sizes.

What you need to know:

Input tokens: what you send (prompt, context, history)
Output tokens: what the model generates
Max tokens parameter: caps response length
Combined limit: input + output can't exceed the context window

Common limits:

Output usually capped at 4K-8K tokens per request
Input limits vary wildly (8K to 300K)
Embedding models often limited to 8K input
Exceed the limit and you get a ValidationException

Cost impact:

You pay per token, input and output priced separately
Output tokens typically cost more than input
Larger context = higher cost per request

Context Windows by Model

Provider	Model	Context Window	Max Output
Anthropic	Claude 3.5 Sonnet	200K tokens	8K tokens
Anthropic	Claude 3 Haiku	200K tokens	4K tokens
Meta	Llama 3.1 70B	128K tokens	8K tokens
Meta	Llama 3 70B	8K tokens	2K tokens
Amazon	Nova Pro	300K tokens	5K tokens
Amazon	Titan Text G1	8K tokens	4K tokens
Mistral	Mistral Large	128K tokens	8K tokens

Model Strengths by Task

Different models excel at different things. Know the general patterns.

Task Strengths

Claude (Anthropic):

Complex reasoning and analysis
Long-form content
Code review and explanation
Nuanced conversation
Strong safety alignment

Llama (Meta):

Multilingual (8+ languages)
Code generation
General-purpose tasks
Open weights (you can customize)
Cheaper inference

Amazon Titan:

Native AWS integration
Embeddings for RAG
Enterprise compliance features
Image generation with watermarks

Mistral:

Efficient inference
Strong multilingual
Mixture-of-experts architecture
Good at code and reasoning

Model Recommendations by Use Case

Use Case	Model	Why
Complex reasoning	Claude 3.5 Sonnet/Opus	Best analytical capabilities
Real-time chat	Claude 3 Haiku, Llama 3 8B	Low latency
Code generation	Claude 3.5 Sonnet, Codestral	Strong code understanding
Embeddings/RAG	Titan Embeddings V2	Native integration, solid quality
Image generation	Titan Image, Stable Diffusion	High-quality outputs
Multilingual	Mistral Large, Llama 3	Strong non-English support
Cost-sensitive	Llama 3 8B, Claude Haiku	Lower per-token costs

Hallucinations

Models make things up. They do it confidently. This is the biggest risk in production GenAI.

What Hallucinations Look Like

Hallucinations = plausible-sounding but factually wrong or fabricated content.

Types:

Factual errors: wrong facts, stats, dates
Fabrication: made-up citations, names, events
Conflation: mixing info from different sources
Logical inconsistencies: contradicting itself

Why it happens:

Models predict probable tokens, not verified facts
Training data contains errors
No real-time fact-checking
Models don't know what they don't know

Mitigation Strategies

How to reduce hallucinations:

RAG (Retrieval-Augmented Generation)
- Ground responses in retrieved documents
- Model cites sources instead of making things up
- Primary strategy for knowledge-intensive apps
Bedrock Guardrails
- Contextual grounding checks
- Validates responses against provided sources
- AWS claims 99% accuracy for policy checks
Prompt engineering
- Tell the model to say "I don't know" when uncertain
- Request citations
- Use chain-of-thought reasoning
Human review
- Review high-stakes outputs
- Set confidence thresholds for escalation
- Build feedback loops

Exam Focus

RAG is the primary hallucination mitigation strategy. Know that Bedrock Guardrails with contextual grounding checks validate responses against provided sources.

Latency and Throughput

Speed matters for user experience. Know what affects it.

Latency

Latency = time from request to response.

Metrics that matter:

Time to First Token (TTFT): how long until the first word appears
Time to Last Token (TTLT): total generation time
Tokens per second: generation speed

What affects latency:

Model size (bigger = slower)
Input length (more tokens = more processing)
Output length (longer responses take longer)
On-demand vs provisioned throughput
Region (network distance)

Rough latency by model: | Model | Relative Speed | |-------|---------------| | Claude 3 Haiku | Fast (~500ms TTFT) | | Claude 3.5 Sonnet | Medium (~1s TTFT) | | Claude 3 Opus | Slow (~2s TTFT) | | Llama 3 8B | Fast | | Llama 3 70B | Medium |

Throughput

Throughput = how many requests you can handle.

Bedrock options:

On-Demand
- Pay per token, shared capacity
- Can get throttled under load
- Good for variable workloads
Provisioned Throughput
- Reserved capacity, guaranteed model units
- Lower latency, consistent performance
- Required for custom/fine-tuned models
- Higher fixed cost

Limits to know:

Requests per minute (RPM)
Tokens per minute (input/output)
Varies by model and region
You can request quota increases

Latency vs Capability

Priority	Best Choice	Tradeoff
Lowest latency	Claude 3 Haiku, Llama 3 8B	Less complex reasoning
Best reasoning	Claude 3 Opus	Higher latency and cost
Balanced	Claude 3.5 Sonnet, Llama 3 70B	Good for most use cases
Highest throughput	Provisioned throughput	Fixed cost commitment

Some models handle more than text. Images, video, audio.

Multi-Modal Models

Vision (image input):

Claude 3 models accept images
Llama 3.2 Vision processes images
Amazon Nova handles images and video
Use cases: document analysis, visual Q&A, diagram understanding

Image generation:

Amazon Titan Image Generator
Stable Diffusion XL
Amazon Nova Canvas
Outputs include watermarks for provenance

Video:

Amazon Nova Reel generates video
Nova Pro understands video input

Multimodal embeddings:

Amazon Titan Multimodal Embeddings
Text and images in the same vector space
Enables searching images with text queries

Multi-Modal Capabilities

Model	Text In	Image In	Image Out	Video
Claude 3.5	Yes	Yes	No	No
Llama 3.2 Vision	Yes	Yes	No	No
Amazon Nova Pro	Yes	Yes	No	Yes (input)
Titan Image Generator	Yes (prompt)	No	Yes	No
Stable Diffusion	Yes (prompt)	Yes (img2img)	Yes	No
Titan Multimodal Embeddings	Yes	Yes	No	No

Model Selection Flow

Model Selection Decision Flow — Figure 2: Decision tree for selecting the right model

Context Window Management — Figure 3: Strategies when hitting context window limits

Use Cases

Long Document Summarization

Scenario: Summarize 100-page legal contracts. Need full document context.

Model choice: Claude 3.5 Sonnet (200K context) or Nova Pro (300K context)

Why:

Large context window fits the entire document
Strong reasoning for complex legal language
Good at extracting key terms

Watch out for:

Cost scales with tokens
May still need chunking for very large docs
Always verify with legal review

Real-Time Customer Chat

Scenario: Support chatbot that needs sub-second responses.

Model choice: Claude 3 Haiku or Llama 3 8B

Why:

Fast time-to-first-token
Sufficient for FAQ-style responses
Cost-effective at scale

Watch out for:

May need to route complex queries to a bigger model
Use streaming for perceived speed

Scenario: Extract info from scanned documents with images and tables.

Model choice: Claude 3.5 Sonnet with vision or Nova Pro

Why:

Vision capabilities handle images
Can reason about visual elements
Good at structured extraction

Watch out for:

Image tokens cost more
Accuracy drops on handwritten content
May need preprocessing for image quality

Best Practices

Model Selection

Start with requirements. Define latency, accuracy, and cost constraints first.
Right-size the model. Don't use Opus when Haiku will do.
Test with real data. Benchmark on your actual use cases, not generic tests.
Plan for context limits. Implement chunking or summarization before you hit walls.
Monitor hallucinations. Use guardrails and measure hallucination rates.
Consider multi-model routing. Send simple queries to small models, complex ones to big models.

Exam Scenarios

What AWS Wants You to Know

Scenario	Answer	Why
Need to process 150K token document	Claude 3.5 or Nova Pro	Only models with big enough context
Real-time chat, <1s latency	Claude 3 Haiku or Llama 3 8B	Smallest/fastest models
App experiencing hallucinations	RAG + Guardrails grounding	Ground responses in facts
ValidationException errors	Check token limits	Probably exceeding context window
Need image understanding	Claude 3.5 vision or Nova Pro	Multi-modal input support
Cost-sensitive batch processing	Smaller model + batch inference	Lower per-token cost, 50% batch discount

Common Mistakes

Ignoring Context Window Limits

The mistake: Sending prompts that exceed the context window without checking.

What happens: ValidationException errors or truncated context, leading to bad outputs.

Fix: Count tokens before sending. Implement chunking for long documents. Use summarization for conversation history. Pick models with appropriate context windows.

Over-Provisioning

The mistake: Using Claude 3 Opus for simple FAQ responses.

What happens: You waste money and add latency for no benefit.

Fix: Match model capability to task complexity. Use tiered routing. Benchmark multiple models. Optimize for cost-performance ratio.

Trusting Outputs Without Verification

The mistake: Deploying GenAI for critical applications without hallucination mitigation.

What happens: Models confidently state incorrect information. Users trust it. Bad things follow.

Fix: Implement RAG for knowledge-grounded responses. Use Guardrails contextual grounding. Add human review for high-stakes decisions. Monitor hallucination rates.

Test Your Knowledge

A company needs to summarize legal documents that are 180,000 tokens long. Which Amazon Bedrock model should they use?

AAmazon Titan Text G1 Express

BClaude 3 Haiku

CClaude 3.5 Sonnet

DLlama 3 70B

What is the PRIMARY strategy for reducing hallucinations in GenAI applications?

AUsing larger models with more parameters

BImplementing Retrieval-Augmented Generation (RAG)

CIncreasing the temperature parameter

DUsing synchronous instead of asynchronous inference

An application requires sub-second response times for customer chat. Which model characteristic should be prioritized?

ALargest context window

BLowest time-to-first-token latency

CHighest parameter count

DMulti-modal capabilities

A model keeps returning ValidationException errors when processing long documents. What's the most likely cause?

When should you use provisioned throughput instead of on-demand?

AFor development and testing

BWhen you need guaranteed capacity and consistent latency

CWhen cost is the primary concern

DFor infrequent batch processing

AI/ML

Amazon Bedrock

Managed access to foundation models with varying capabilities.

AI/ML

Amazon Bedrock Guardrails

Content filtering and contextual grounding checks for hallucination mitigation.

AI/ML

Amazon Bedrock Knowledge Bases

Managed RAG service for grounding responses in your data.

Monitoring

Amazon CloudWatch

Monitor latency, token usage, and error rates.

Quick Reference

Context Window Cheat Sheet

TEXTContext Windows

Claude 3.5 Sonnet/Opus:  200,000 tokens
Claude 3 Haiku:         200,000 tokens
Amazon Nova Pro:        300,000 tokens
Amazon Nova Lite:        40,000 tokens
Llama 3.1 (all sizes):  128,000 tokens
Llama 3 (8B, 70B):        8,000 tokens
Mistral Large:          128,000 tokens
Amazon Titan Text:        8,000 tokens
Qwen3-Coder:        256K-1M tokens

Latency Optimization

Latency Strategies

Strategy	Impact	Tradeoff
Use smaller model	High	Reduced capability
Reduce input tokens	Medium	Less context
Limit output tokens	Medium	Shorter responses
Provisioned throughput	High	Fixed cost
Regional proximity	Low-Medium	May limit model availability
Enable streaming	Perceived improvement	Same total time

Key Concepts

Foundation Model Capabilities and Limitations

Overview

The Core Tradeoff

Architecture Diagram

Token Limits and Context Windows

Context Windows

Token Limits

Context Windows by Model

Model Strengths by Task

Task Strengths

Model Recommendations by Use Case

Hallucinations

What Hallucinations Look Like

Mitigation Strategies

Exam Focus

Latency and Throughput

Latency

Throughput

Latency vs Capability

Multi-Modal Capabilities

Multi-Modal Models

Multi-Modal Capabilities

Model Selection Flow

Use Cases

Long Document Summarization

Real-Time Customer Chat

Multi-Modal Document Processing

Best Practices

Model Selection

Exam Scenarios

What AWS Wants You to Know

Common Mistakes

Ignoring Context Window Limits

Over-Provisioning

Trusting Outputs Without Verification

Test Your Knowledge

Related Services

Amazon Bedrock

Amazon Bedrock Guardrails

Amazon Bedrock Knowledge Bases

Amazon CloudWatch

Quick Reference

Context Window Cheat Sheet

Latency Optimization

Latency Strategies

Further Reading

Related AWS Services