GenAI Architecture Fundamentals

Overview

GenAI architecture is different from traditional ML. With traditional ML, you train a model for a specific task. With GenAI, you're calling pre-trained Foundation Models (FMs) that already know how to do things. You just need to ask correctly.

Amazon Bedrock gives you access to dozens of these models as a managed service. No GPUs to provision, no training pipelines to build. You call an API, you get a response.

The Big Shift

You're not training models anymore. You're orchestrating them: managing tokens, context windows, latency, and prompts. Different skillset.

Exam Tip

Expect questions on sync vs async processing and stateless design. AWS loves testing whether you understand that FMs don't remember anything between calls.

Architecture Diagram

GenAI Architecture Overview — Figure 1: High-level GenAI architecture showing client applications, Amazon Bedrock, and supporting services

Foundation Models vs Traditional ML

Foundation Models (FMs)

Foundation Models are massive pre-trained neural networks. They already know how to write, reason, code, and generate images. You don't train them. You prompt them.

What makes them different:

Billions of parameters, already trained
One model handles text, code, reasoning, images (depending on the model)
You access them through APIs
Adaptable via prompts or fine-tuning

On Bedrock, you get:

Claude (Anthropic): reasoning, conversation, long context
Titan (Amazon): text, embeddings, image generation
Llama (Meta): open-weight, good for customization
Stable Diffusion: image generation

Traditional ML

Traditional ML means training your own model for a specific task. Fraud detection, demand forecasting, recommendations. Each needs its own model.

What that involves:

Collecting and labeling training data
Feature engineering (often the hard part)
Training infrastructure (SageMaker, typically)
Separate model for each task

Still the right choice for structured data problems. But for text, code, and conversation? FMs are usually faster to deploy.

Foundation Models vs Traditional ML

Aspect	Foundation Models	Traditional ML
Training	Already done, inference only	You train it yourself
Data needed	Prompts and context	Large labeled datasets
Skills required	Prompt engineering	ML engineering, feature work
Flexibility	One model, many tasks	One model per task
Cost	Pay per token	Training + inference
Time to production	Hours to days	Weeks to months
AWS Service	Bedrock	SageMaker

Inference vs Training

For this exam, focus on inference. Most questions are about building apps that call models, not training them.

Inference

Inference = calling a trained model to get output. This is what you'll do 99% of the time with Bedrock.

What it looks like:

Latency matters (users are waiting)
Stateless: each call is independent
Scales automatically
Pay per request/token

Bedrock handles:

Load balancing across model instances
Auto-scaling based on demand
Regional endpoints for lower latency

Training (Custom Models)

Training = teaching a model new patterns. With Bedrock, this means fine-tuning or continued pre-training.

When you'd do this:

Model needs domain-specific knowledge
You want consistent style/format
Base model isn't performing well enough

Bedrock options:

Supervised Fine-Tuning (SFT): labeled examples
Continued Pre-Training (CPT): unlabeled domain text
Takes hours, model stored in your account

Exam Focus

Most exam questions involve inference architectures. Know the tradeoffs: on-demand vs provisioned throughput vs batch. Training questions are less common.

Sync vs Async Processing

The choice is simple: is someone waiting for the response?

Synchronous

User is staring at a loading spinner. They need an answer now.

When to use it:

Chatbots
Search interfaces
Code completion in IDEs
Any interactive UI

Bedrock APIs:

InvokeModel: single request, single response
InvokeModelWithResponseStream: tokens stream back as generated
Converse: multi-turn conversation helper

Constraints:

Need sub-10-second responses (ideally faster)
Can hit rate limits if you're not careful
More expensive per token than batch

Asynchronous (Batch)

Nobody's waiting. You've got 50,000 documents to process overnight.

When to use it:

Bulk content generation
Document classification at scale
Data extraction from archives
Anything where "done by morning" is fine

Bedrock Batch Inference:

Submit prompts in bulk, results land in S3
Up to 50% cheaper than sync (finally, a discount that matters)
No latency requirements

Typical architecture:

S3 for input/output
Step Functions for orchestration
SQS if you need queuing
SNS for completion notifications

Sync vs Async

Aspect	Synchronous	Asynchronous
Latency	Real-time (seconds)	Minutes to hours
Cost	Standard rates	Up to 50% cheaper
Use case	Interactive apps	Bulk processing
Scaling	Bedrock handles it	Queue-based
Error handling	Retry immediately	DLQ + retry policies

Stateless Design (This Is Important)

Foundation models don't remember you. Every API call starts fresh. That conversation you had 30 seconds ago? The model has no idea.

Stateless Architecture

FMs are stateless. Each request is independent. If you want conversation memory, you have to build it yourself.

What this means:

Pass the full conversation history with every request
Store that history somewhere (DynamoDB works well)
Use session IDs to track who's who
Make your operations idempotent. Retries will happen.

The upside? Scaling is trivial. Any instance can handle any request. No sticky sessions, no state synchronization headaches.

PYPassing Conversation History

import boto3

bedrock = boto3.client('bedrock-runtime')

def chat(user_message, history):
    """
    Every call includes the full conversation history.
    The model doesn't remember anything on its own.
    """
    messages = history + [{"role": "user", "content": user_message}]

    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=messages,
        inferenceConfig={"maxTokens": 1024}
    )

    return response['output']['message']['content'][0]['text']

# History stored in DynamoDB, fetched per request
history = [
    {"role": "user", "content": "What is AWS?"},
    {"role": "assistant", "content": "AWS is Amazon Web Services..."}
]

# This call includes everything above
response = chat("Tell me about Bedrock", history)

Context Management

Context = everything you send to the model. System instructions, conversation history, retrieved documents, user preferences. All of it counts against the context window.

Context Window

Context window is how much the model can "see" at once. Think of it like RAM: there's a limit.

What counts:

System prompt
Conversation history
Retrieved documents (RAG)
Your actual question
The model's response (yes, output counts too)

Limits vary:

Some models: 8K tokens
Claude on Bedrock: up to 200K tokens
Exceeding the limit = errors or truncation

Context Strategies

Strategy	How it works	Good for

Bedrock Memory Options

Bedrock now has Session Management APIs and AgentCore Memory. AgentCore handles both short-term (session) and long-term (user profile) memory. Worth checking if it fits your use case. Saves you from building memory infrastructure.

Request Flow

Architecture Decision Flow — Figure 3: Choosing between sync and async

Real-World Examples

Customer Support Chatbot

The problem: Users expect instant responses. Conversations need to feel continuous.

Architecture:

API Gateway WebSocket API (bidirectional, supports streaming)
Lambda for request handling
DynamoDB for conversation history
Bedrock Converse API with streaming

Why this works:

WebSocket lets you stream tokens to the UI as they're generated
DynamoDB stores history; Lambda stays stateless
Sliding window keeps context manageable

Bulk Document Processing

The problem: 100,000 legal documents need classification and summaries. Nobody's waiting.

Architecture:

Documents in S3
Step Functions orchestrates the workflow
Bedrock Batch Inference does the work
Results land back in S3
SNS notifies when done

Why this works:

Batch inference = 50% cost savings
Step Functions handles retries and parallelism
No latency pressure means you can optimize for cost

IDE Code Assistant

The problem: Developers want suggestions as they type. Speed matters.

Architecture:

Direct HTTPS from IDE plugin to Bedrock
Streaming API for fast first-token display
Lambda@Edge for auth
CloudWatch for monitoring

Why this works:

Streaming shows results before generation completes
Regional endpoints minimize latency
Stateless: context passed each request

Best Practices

What Actually Matters

Assume statelessness. Store history externally, pass it every time.
Pick the right mode. Sync for interactive, batch for bulk.
Watch your context window. Implement sliding window or summarization before you hit limits.
Handle throttling. Bedrock will throttle you eventually. Exponential backoff, maybe a fallback model.
Track tokens. Input + output tokens = your bill. Monitor this.
Stream for UX. Users perceive streaming as faster, even when it isn't.

Exam Scenarios

What AWS Wants You to Know

Scenario	Answer	Why

Common Mistakes

Assuming the Model Remembers

The mistake: Expecting the model to remember your last message.

Reality: It doesn't. Every call is independent.

Fix: Store history in DynamoDB (or similar), include it in every request. Or use Bedrock's memory APIs if they fit your use case.

Using Sync for Bulk Work

The mistake: Processing 10,000 documents with individual InvokeModel calls.

Reality: You'll pay more, hit rate limits, and probably timeout somewhere.

Fix: Batch inference. It exists for this exact reason.

Ignoring Context Limits

The mistake: Appending to conversation history forever until something breaks.

Reality: You'll get 400 errors or truncated responses. The error message won't always tell you why.

Fix: Track token counts. Implement sliding window or summarization before you hit the wall.

Test Your Knowledge

A company wants to build a chatbot that maintains conversation context across multiple interactions. What is the BEST approach for managing conversation state with Amazon Bedrock?

ARely on Bedrock to automatically maintain conversation state

BStore conversation history in DynamoDB and include it in each request

CUse a single long-running Lambda function to maintain state

DEnable session persistence in the Bedrock console

An organization needs to process 100,000 documents for classification. Which Amazon Bedrock inference pattern should they use?

ASynchronous InvokeModel API calls

BBedrock Batch Inference with S3

CProvisioned Throughput with dedicated capacity

DReal-time streaming API

What is the PRIMARY difference between foundation models and traditional ML models?

AFoundation models require more training data

BFoundation models can perform multiple tasks without task-specific training

CTraditional ML models are more accurate

DFoundation models only work with text data

A GenAI application is returning errors when conversations exceed 20 messages. What's the MOST LIKELY cause?

When should you use synchronous inference over batch inference?

AI/ML

Amazon Bedrock

Managed access to foundation models. No infrastructure to manage.

AI/ML

Amazon SageMaker

For when you need to train your own models. Different use case.

Integration

AWS Step Functions

Orchestrate multi-step workflows. Handles retries, parallel execution, state management.

Database

Amazon DynamoDB

Store conversation history and session state. Fast, scales automatically.

Integration

Amazon SQS

Queue requests when you need controlled throughput or async processing.

Key Concepts

GenAI Architecture Fundamentals

Overview

The Big Shift

Architecture Diagram

Foundation Models vs Traditional ML

Foundation Models (FMs)

Traditional ML

Foundation Models vs Traditional ML

Inference vs Training

Inference

Training (Custom Models)

Exam Focus

Sync vs Async Processing

Synchronous

Asynchronous (Batch)

Sync vs Async

Stateless Design (This Is Important)

Stateless Architecture

Context Management

Context Window

Context Strategies

Bedrock Memory Options

Request Flow

Real-World Examples

Customer Support Chatbot

Bulk Document Processing

IDE Code Assistant

Best Practices

What Actually Matters

Exam Scenarios

What AWS Wants You to Know

Common Mistakes

Assuming the Model Remembers

Using Sync for Bulk Work

Ignoring Context Limits

Test Your Knowledge

Related Services

Amazon Bedrock

Amazon SageMaker

AWS Step Functions

Amazon DynamoDB

Amazon SQS

Further Reading

Related AWS Services