AIP-C01 Study Guide

Foundation Model Integration, Data Management, and Compliance > GenAI Solution Design and Architecture > GenAI Architecture Fundamentals

Key Concepts

  • Foundation models vs traditional ML

  • Inference vs training architectures

  • Synchronous vs asynchronous processing

  • Stateless application design for GenAI

  • Context management strategies

GenAI Architecture Fundamentals

Overview

GenAI architecture is different from traditional ML. With traditional ML, you train a model for a specific task. With GenAI, you're calling pre-trained Foundation Models (FMs) that already know how to do things. You just need to ask correctly.

Amazon Bedrock gives you access to dozens of these models as a managed service. No GPUs to provision, no training pipelines to build. You call an API, you get a response.
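
Here is what "call an API, get a response" looks like in practice, as a minimal boto3 sketch (the Titan model ID is just one example; each model family has its own request and response shape):

import json
import boto3

# Bedrock runtime client -- no GPUs to provision, no training pipeline to build
bedrock = boto3.client("bedrock-runtime")

# One request, one response (Amazon Titan Text shown as an example)
response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({
        "inputText": "Summarize what Amazon Bedrock does in one sentence.",
        "textGenerationConfig": {"maxTokenCount": 200}
    })
)

result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])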

The Big Shift

You're not training models anymore. You're orchestrating them: managing tokens, context windows, latency, and prompts. Different skillset.

Exam Tip

Expect questions on sync vs async processing and stateless design. AWS loves testing whether you understand that FMs don't remember anything between calls.


Architecture Diagram

GenAI Architecture Overview
Figure 1: High-level GenAI architecture showing client applications, Amazon Bedrock, and supporting services

Foundation Models vs Traditional ML

Foundation Models (FMs)

Foundation Models are massive pre-trained neural networks. They already know how to write, reason, code, and generate images. You don't train them. You prompt them.

What makes them different:

  • Billions of parameters, already trained
  • One model handles text, code, reasoning, images (depending on the model)
  • You access them through APIs
  • Adaptable via prompts or fine-tuning

On Bedrock, you get:

  • Claude (Anthropic): reasoning, conversation, long context
  • Titan (Amazon): text, embeddings, image generation
  • Llama (Meta): open-weight, good for customization
  • Stable Diffusion: image generation

Traditional ML

Traditional ML means training your own model for a specific task. Fraud detection, demand forecasting, recommendations. Each needs its own model.

What that involves:

  • Collecting and labeling training data
  • Feature engineering (often the hard part)
  • Training infrastructure (SageMaker, typically)
  • Separate model for each task

Still the right choice for structured data problems. But for text, code, and conversation? FMs are usually faster to deploy.

Foundation Models vs Traditional ML

Aspect | Foundation Models | Traditional ML
Training | Already done, inference only | You train it yourself
Data needed | Prompts and context | Large labeled datasets
Skills required | Prompt engineering | ML engineering, feature work
Flexibility | One model, many tasks | One model per task
Cost | Pay per token | Training + inference
Time to production | Hours to days | Weeks to months
AWS Service | Bedrock | SageMaker

Inference vs Training

For this exam, focus on inference. Most questions are about building apps that call models, not training them.

Inference

Inference = calling a trained model to get output. This is what you'll do 99% of the time with Bedrock.

What it looks like:

  • Latency matters (users are waiting)
  • Stateless: each call is independent
  • Scales automatically
  • Pay per request/token

Bedrock handles:

  • Load balancing across model instances
  • Auto-scaling based on demand
  • Regional endpoints for lower latency

Training (Custom Models)

Training = teaching a model new patterns. With Bedrock, this means fine-tuning or continued pre-training.

When you'd do this:

  • Model needs domain-specific knowledge
  • You want consistent style/format
  • Base model isn't performing well enough

Bedrock options:

  • Supervised Fine-Tuning (SFT): labeled examples
  • Continued Pre-Training (CPT): unlabeled domain text
  • Takes hours, model stored in your account
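
A hedged sketch of submitting a fine-tuning job through the Bedrock control-plane API (bucket names, role ARN, model ID, and hyperparameters below are placeholders, not recommendations):

import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

job = bedrock.create_model_customization_job(
    jobName="support-tone-sft-v1",
    customModelName="support-tone-titan",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",  # or "CONTINUED_PRE_TRAINING"
    trainingDataConfig={"s3Uri": "s3://example-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://example-bucket/custom-model-output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)

print(job["jobArn"])  # the customized model lands in your account when the job finishes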

Exam Focus

Most exam questions involve inference architectures. Know the tradeoffs: on-demand vs provisioned throughput vs batch. Training questions are less common.


Sync vs Async Processing

The choice is simple: is someone waiting for the response?

Synchronous

User is staring at a loading spinner. They need an answer now.

When to use it:

  • Chatbots
  • Search interfaces
  • Code completion in IDEs
  • Any interactive UI

Bedrock APIs:

  • InvokeModel: single request, single response
  • InvokeModelWithResponseStream: tokens stream back as generated
  • Converse: multi-turn conversation helper

Constraints:

  • Need sub-10-second responses (ideally faster)
  • Can hit rate limits if you're not careful
  • More expensive per token than batch
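
A minimal streaming sketch using ConverseStream, the streaming counterpart of the Converse API listed above (model ID is an example):

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse_stream(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": [{"text": "Explain context windows briefly."}]}],
    inferenceConfig={"maxTokens": 512},
)

# Print each text delta as it arrives -- this is what makes chat UIs
# feel fast even though total generation time is unchanged
for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)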

Asynchronous (Batch)

Nobody's waiting. You've got 50,000 documents to process overnight.

When to use it:

  • Bulk content generation
  • Document classification at scale
  • Data extraction from archives
  • Anything where "done by morning" is fine

Bedrock Batch Inference:

  • Submit prompts in bulk, results land in S3
  • Up to 50% cheaper than sync (finally, a discount that matters)
  • No latency requirements

Typical architecture:

  • S3 for input/output
  • Step Functions for orchestration
  • SQS if you need queuing
  • SNS for completion notifications
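
A hedged sketch of submitting a batch job (bucket names, role ARN, and model ID are placeholders; prompts go in as JSONL, results land back in S3):

import boto3

bedrock = boto3.client("bedrock")  # control-plane client handles batch jobs

job = bedrock.create_model_invocation_job(
    jobName="classify-documents-nightly",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://example-bucket/batch-input/"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://example-bucket/batch-output/"}
    },
)

print(job["jobArn"])  # poll with get_model_invocation_job, or notify via SNS when done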

Sync vs Async

Aspect | Synchronous | Asynchronous
Latency | Real-time (seconds) | Minutes to hours
Cost | Standard rates | Up to 50% cheaper
Use case | Interactive apps | Bulk processing
Scaling | Bedrock handles it | Queue-based
Error handling | Retry immediately | DLQ + retry policies

Stateless Design (This Is Important)

Foundation models don't remember you. Every API call starts fresh. That conversation you had 30 seconds ago? The model has no idea.

Stateless Architecture

FMs are stateless. Each request is independent. If you want conversation memory, you have to build it yourself.

What this means:

  • Pass the full conversation history with every request
  • Store that history somewhere (DynamoDB works well)
  • Use session IDs to track who's who
  • Make your operations idempotent. Retries will happen.

The upside? Scaling is trivial. Any instance can handle any request. No sticky sessions, no state synchronization headaches.

Python: Passing Conversation History
import boto3

bedrock = boto3.client('bedrock-runtime')

def chat(user_message, history):
    """
    Every call includes the full conversation history.
    The model doesn't remember anything on its own.
    """
    # The Converse API expects content as a list of content blocks, not a bare string
    messages = history + [{"role": "user", "content": [{"text": user_message}]}]

    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=messages,
        inferenceConfig={"maxTokens": 1024}
    )

    return response['output']['message']['content'][0]['text']

# History stored in DynamoDB, fetched per request
history = [
    {"role": "user", "content": [{"text": "What is AWS?"}]},
    {"role": "assistant", "content": [{"text": "AWS is Amazon Web Services..."}]}
]

# This call includes everything above
response = chat("Tell me about Bedrock", history)
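
And a hedged sketch of the "store that history somewhere" half, assuming a DynamoDB table named ConversationHistory keyed by session_id (both names are placeholders):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ConversationHistory")  # assumed table, partition key: session_id

def load_history(session_id):
    """Fetch prior turns for this session; empty list if this is a new session."""
    item = table.get_item(Key={"session_id": session_id}).get("Item")
    return item["messages"] if item else []

def save_history(session_id, messages):
    """Persist the updated turns so the next stateless request can replay them."""
    table.put_item(Item={"session_id": session_id, "messages": messages})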

Context Management

Context = everything you send to the model. System instructions, conversation history, retrieved documents, user preferences. All of it counts against the context window.

Context Window

Context window is how much the model can "see" at once. Think of it like RAM: there's a limit.

What counts:

  • System prompt
  • Conversation history
  • Retrieved documents (RAG)
  • Your actual question
  • The model's response (yes, output counts too)

Limits vary:

  • Some models: 8K tokens
  • Claude on Bedrock: up to 200K tokens
  • Exceeding the limit = errors or truncation

Context Strategies

Strategy | How it works | Good for
Full history | Send every prior turn with each request | Short conversations
Sliding window | Keep only the most recent N turns | Long chats where older turns matter less
Summarization | Compress older turns into a running summary, keep recent turns verbatim | Very long sessions approaching the context limit
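
A naive sliding-window trim, as a sketch (a production version would count tokens rather than messages):

def trim_to_window(messages, max_messages=20):
    """Keep only the most recent turns so the request stays under the context limit."""
    return messages[-max_messages:]
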
Bedrock Memory Options

Bedrock now has Session Management APIs and AgentCore Memory. AgentCore handles both short-term (session) and long-term (user profile) memory. Worth checking if it fits your use case. Saves you from building memory infrastructure.


Request Flow

GenAI Request Flow
Figure 2: Request flow from client through API Gateway, Lambda, and Bedrock
Architecture Decision Flow
Figure 3: Choosing between sync and async

Real-World Examples

Customer Support Chatbot

The problem: Users expect instant responses. Conversations need to feel continuous.

Architecture:

  • API Gateway WebSocket API (bidirectional, supports streaming)
  • Lambda for request handling
  • DynamoDB for conversation history
  • Bedrock Converse API with streaming

Why this works:

  • WebSocket lets you stream tokens to the UI as they're generated
  • DynamoDB stores history; Lambda stays stateless
  • Sliding window keeps context manageable
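
A hedged sketch of the Lambda piece of this pattern: a hypothetical WebSocket message handler that streams Bedrock output back to the connected client (event fields follow the API Gateway WebSocket format; model ID is an example):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    connection_id = event["requestContext"]["connectionId"]
    domain = event["requestContext"]["domainName"]
    stage = event["requestContext"]["stage"]

    # Client for pushing frames back over the caller's WebSocket connection
    gateway = boto3.client(
        "apigatewaymanagementapi",
        endpoint_url=f"https://{domain}/{stage}",
    )

    # In the full pattern, prior turns would be loaded from DynamoDB and prepended here
    body = json.loads(event["body"])
    response = bedrock.converse_stream(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": body["message"]}]}],
    )

    # Forward each token to the browser as soon as it is generated
    for chunk in response["stream"]:
        if "contentBlockDelta" in chunk:
            gateway.post_to_connection(
                ConnectionId=connection_id,
                Data=chunk["contentBlockDelta"]["delta"]["text"].encode("utf-8"),
            )

    return {"statusCode": 200}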

Bulk Document Processing

The problem: 100,000 legal documents need classification and summaries. Nobody's waiting.

Architecture:

  • Documents in S3
  • Step Functions orchestrates the workflow
  • Bedrock Batch Inference does the work
  • Results land back in S3
  • SNS notifies when done

Why this works:

  • Batch inference = 50% cost savings
  • Step Functions handles retries and parallelism
  • No latency pressure means you can optimize for cost

IDE Code Assistant

The problem: Developers want suggestions as they type. Speed matters.

Architecture:

  • Direct HTTPS from IDE plugin to Bedrock
  • Streaming API for fast first-token display
  • Lambda@Edge for auth
  • CloudWatch for monitoring

Why this works:

  • Streaming shows results before generation completes
  • Regional endpoints minimize latency
  • Stateless: context passed each request

Best Practices

What Actually Matters
  1. Assume statelessness. Store history externally, pass it every time.
  2. Pick the right mode. Sync for interactive, batch for bulk.
  3. Watch your context window. Implement sliding window or summarization before you hit limits.
  4. Handle throttling. Bedrock will throttle you eventually. Use exponential backoff, maybe a fallback model (see the sketch after this list).
  5. Track tokens. Input + output tokens = your bill. Monitor this.
  6. Stream for UX. Users perceive streaming as faster, even when it isn't.
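
A hedged backoff sketch for point 4 (the Converse response also returns response["usage"] with input and output token counts, which covers point 5):

import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

def converse_with_backoff(messages, model_id, max_retries=5):
    """Retry throttled Bedrock calls with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return bedrock.converse(
                modelId=model_id,
                messages=messages,
                inferenceConfig={"maxTokens": 1024},
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # only retry throttling; surface everything else
            # 1s, 2s, 4s, ... plus jitter so retries don't synchronize
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("Bedrock still throttling after retries")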

Exam Scenarios

What AWS Wants You to Know

Scenario | Answer | Why
Chatbot must remember earlier turns | Store history in DynamoDB and send it with every request | FMs are stateless
100,000 documents to classify overnight | Bedrock Batch Inference with S3 | Nobody is waiting, and batch is up to 50% cheaper
Interactive app feels slow | Stream responses (InvokeModelWithResponseStream / Converse streaming) | First tokens appear while generation continues

Common Mistakes

Assuming the Model Remembers

The mistake: Expecting the model to remember your last message.

Reality: It doesn't. Every call is independent.

Fix: Store history in DynamoDB (or similar), include it in every request. Or use Bedrock's memory APIs if they fit your use case.

Using Sync for Bulk Work

The mistake: Processing 10,000 documents with individual InvokeModel calls.

Reality: You'll pay more, hit rate limits, and probably time out somewhere.

Fix: Batch inference. It exists for this exact reason.

Ignoring Context Limits

The mistake: Appending to conversation history forever until something breaks.

Reality: You'll get 400 errors or truncated responses. The error message won't always tell you why.

Fix: Track token counts. Implement sliding window or summarization before you hit the wall.


Test Your Knowledge

Q1. A company wants to build a chatbot that maintains conversation context across multiple interactions. What is the BEST approach for managing conversation state with Amazon Bedrock?

A. Rely on Bedrock to automatically maintain conversation state
B. Store conversation history in DynamoDB and include it in each request
C. Use a single long-running Lambda function to maintain state
D. Enable session persistence in the Bedrock console

Q2. An organization needs to process 100,000 documents for classification. Which Amazon Bedrock inference pattern should they use?

A. Synchronous InvokeModel API calls
B. Bedrock Batch Inference with S3
C. Provisioned Throughput with dedicated capacity
D. Real-time streaming API

Q3. What is the PRIMARY difference between foundation models and traditional ML models?

A. Foundation models require more training data
B. Foundation models can perform multiple tasks without task-specific training
C. Traditional ML models are more accurate
D. Foundation models only work with text data

Q4. A GenAI application is returning errors when conversations exceed 20 messages. What's the MOST LIKELY cause?


Q5. When should you use synchronous inference over batch inference?



Further Reading

Related AWS Services

  • Amazon Bedrock
  • SageMaker AI
  • Lambda