AIF-C01 Exam Study Guide

Preparing for the AIF-C01 exam means covering a wide range of topics, from basic machine learning theory and generative AI to AWS services, responsible AI, and governance frameworks.

The real challenge is not just understanding the material, but also figuring out which service to choose when several seem similar.

This guide helps by offering clear summaries, direct service comparisons, and warnings about common exam pitfalls based on real question trends.

Exam Domain Overview

Domain Approx. Weight
Fundamentals of AI and ML ~20%
Fundamentals of Generative AI ~24%
Applications of Foundation Models ~28%
Guidelines for Responsible AI ~14%
Security, Compliance, and Governance ~14%

Domain 1: Fundamentals of AI and ML

1.1 ML Paradigms

Paradigm How It Works Example Use Cases
Supervised Labeled data, learns input→output mapping Classification, regression, fraud detection
Unsupervised Unlabeled data, finds hidden patterns Clustering, customer segmentation, anomaly detection
Semi-supervised Mix of labeled + unlabeled data When labeling is expensive
Reinforcement Agent takes actions, receives rewards/penalties Robotics, game AI, AWS DeepRacer

The deciding factor is the data, not the task.

Exam scenarios describe a business use case and ask which learning paradigm applies. Candidates often default to supervised learning because the task sounds familiar, even when the scenario explicitly states there are no labels.

Always identify whether the training data is labeled or unlabeled first, then map to the paradigm. If data has no labels, unsupervised is correct regardless of the application domain.

1.2 ML Algorithms — When to Use What

Algorithm Type Use Case
Linear regression Supervised Numeric prediction (prices, demand)
Logistic regression Supervised Binary/multi-class classification; interpretable
Decision tree Supervised Classification & regression; interpretable
K-means Unsupervised Clustering customers/data into groups
K-nearest neighbors (k-NN) Supervised Classification based on proximity to labeled examples
SVM Supervised Classification
Random forest / Ensemble Supervised Combining models for higher accuracy
CNN Supervised Image classification, object detection
RNN Supervised Sequential/time-series data
GAN Generative Synthetic data generation
Autoencoder Unsupervised Anomaly detection with no labeled data
ARIMA Statistical Time-series forecasting
BERT Transformer Contextual NLP, text fill-in

K-means and K-NN share a letter but nothing else.

K-means is an unsupervised clustering algorithm — it groups data points by similarity without any labels.

K-nearest neighbors is a supervised classification algorithm — it assigns a label to a new data point by looking at the labels of its closest neighbors in the training set.

Because the names are visually similar, this is a reliable exam trap. When you see either algorithm as an option, immediately check whether the scenario involves labeled or unlabeled data to eliminate the wrong one.
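A toy sketch makes the distinction concrete (hand-made 2-D points and labels, purely illustrative): k-NN needs labels to vote, while k-means only needs distances to centroids.

```python
from collections import Counter
import math

def knn_predict(labeled_points, query, k=3):
    """k-NN (supervised): vote among the k closest *labeled* neighbors."""
    nearest = sorted(labeled_points, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans_assign(points, centroids):
    """One k-means assignment step (unsupervised): distances only, no labels."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

labeled = [((1, 1), "fraud"), ((1, 2), "fraud"), ((8, 8), "ok"), ((9, 8), "ok")]
print(knn_predict(labeled, (2, 1)))   # "fraud" -- the label comes from neighbors

print(kmeans_assign([(1, 1), (9, 9)], centroids=[(0, 0), (10, 10)]))  # [0, 1]
```

The supervised function cannot run without labels; the unsupervised one never sees them. That is exactly the distinction the exam tests.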

When interpretability is a requirement, deep learning is wrong.

Certain domains — regulated industries, financial decisions, medical risk scoring — require that a model's reasoning be traceable and explainable to non-technical stakeholders.

Neural networks and deep learning architectures are black boxes: they can produce accurate outputs but cannot clearly articulate why.

Logistic regression and decision trees make their decision boundaries explicit and auditable.

If a scenario mentions regulatory requirements, auditability, or the need to explain a decision to a human, interpretable models are always the correct choice.

1.3 The ML Lifecycle (in order)

  1. Business goal identification — define objectives; compliance requirements determined here
  2. Data collection
  3. Data preprocessing — filtering, cleaning, handling missing values (imputation), normalization
  4. Exploratory Data Analysis (EDA) — correlation matrices, statistics, visualizations, pattern discovery
  5. Feature engineering — creating/selecting input variables
  6. Model training
  7. Model evaluation — test against metrics
  8. Deployment — inference begins here
  9. Monitoring — detect drift, retrain

EDA and data preprocessing are neighboring steps with entirely different purposes.

Preprocessing is a corrective step: it resolves known data quality problems such as missing values, inconsistent formats, outliers, and unscaled features.

EDA is a discovery step: it uses statistics and visualizations to uncover patterns, distributions, and correlations that inform later decisions.

A scenario describing correlation analysis, distribution plots, or anomaly discovery is describing EDA.

A scenario describing normalization, imputation, or data cleaning is describing preprocessing.

Conflating the two leads to selecting the wrong lifecycle phase.
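A minimal sketch of two preprocessing steps, using made-up values: imputation fills the gaps, then normalization rescales the result.

```python
def impute_mean(values):
    """Imputation: replace missing entries (None) with the column mean."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Normalization: rescale values into [0, 1] so features contribute equally."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40]
print(impute_mean(ages))                  # [20, 30.0, 40]
print(min_max_normalize([20, 30.0, 40]))  # [0.0, 0.5, 1.0]
```

If a scenario instead called for plotting the distribution of `ages` or checking its correlation with another column, that would be EDA, not preprocessing.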

Compliance and regulatory requirements belong in the business goal phase, not later stages.

Many candidates assume compliance surfaces during data handling or model evaluation. In the ML lifecycle, determining which legal or regulatory frameworks apply to a solution is part of scoping the problem — it is the first step.

Questions that ask when compliance considerations are identified should point to business goal identification.

1.4 Evaluation Metrics

Metric Best For Description
Accuracy Balanced classes % of total correct predictions
Precision Minimizing false positives Of all flagged positives, how many are actually positive?
Recall Minimizing false negatives Of all actual positives, how many did the model catch?
F1 Score Imbalanced classes Harmonic mean of precision and recall
ROC / AUC Binary classifiers Trade-off between sensitivity and specificity
Confusion matrix Multi-class Shows all correct + misclassification patterns
MSE / RMSE Regression Numeric prediction error
R-squared Regression Variance explained by model
BLEU Translation quality Compare machine vs. human translation
ROUGE Summarization quality Recall-based; compare generated vs. reference summaries
BERTScore Semantic text similarity For style/coherence evaluation

Precision and recall require reading the scenario carefully for which type of error matters most.

These two metrics measure opposite risks.

Precision answers: "When the model flags something as positive, how often is it actually positive?" It matters when acting on a false positive is costly — wasted resources, unnecessary interventions, eroded trust.

Recall answers: "Of all the things that are actually positive, how many did the model catch?" It matters when missing a true positive is costly — an undetected threat, a missed diagnosis.

Identifying which error is described in the scenario determines the correct metric.

Accuracy is a misleading metric on imbalanced datasets.

A model that always predicts the majority class can achieve high accuracy while failing entirely at its task. When class distribution is skewed — such as rare events or minority categories — F1 score, precision, recall, or AUC give a more honest picture.

Accuracy should only be treated as a reliable standalone metric when classes are roughly balanced.
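A quick sketch shows why, using a hypothetical dataset of 990 negatives and 10 positives where the model always predicts the majority class:

```python
def metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 990 negatives, 10 positives; the model always predicts "negative"
acc, prec, rec, f1 = metrics(tp=0, fp=0, fn=10, tn=990)
print(acc)  # 0.99 -- looks great
print(rec)  # 0.0  -- catches zero positives
print(f1)   # 0.0
```

The 99% accuracy hides a model that never detects the event it was built to detect; recall and F1 expose the failure immediately.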

BLEU and ROUGE are task-specific and not interchangeable.

BLEU was designed for machine translation evaluation: it measures n-gram overlap between a machine-generated translation and a human reference translation.

ROUGE was designed for summarization evaluation: it measures recall of key content from a reference summary.

Neither is appropriate for regression or classification tasks. Applying BLEU to summarization or ROUGE to translation is a common wrong answer.
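The core idea behind BLEU can be sketched in a few lines. This is a simplified modified unigram precision; real BLEU combines several n-gram sizes plus a brevity penalty, and the example sentences are made up.

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped word-overlap between a candidate and a human reference:
    each candidate word counts at most as often as it appears in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

ref = "the cat is on the mat"
score = modified_unigram_precision("the cat sat on the mat", ref)
print(round(score, 2))  # 0.83 (5 of 6 words overlap the reference)
```

Note that the score only says how closely the candidate resembles this particular reference, which is why BLEU is a comparative metric rather than an absolute quality score.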

1.5 Bias, Variance, Overfitting, Underfitting

Problem Bias Variance Fix
Underfitting High Low More data, more features, more epochs, less regularization
Overfitting Low High More data, regularization, fewer features, less training
Ideal model Low Low

The fix for overfitting requires increasing regularization, which is counterintuitive.

Overfitting means the model has learned the training data too specifically and fails to generalize. The instinctive response is to train more or increase model complexity — both of which make the problem worse. The correct remedies are to increase the regularization parameter (adding a penalty for complexity), reduce the number of features, or supply more diverse training data.

Decreasing the regularization parameter relaxes constraints on the model and exacerbates overfitting.
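The effect of the regularization parameter can be sketched directly on the loss function (toy weights and a made-up lambda, for illustration):

```python
def mse(preds, targets):
    """Mean squared error over predictions."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def regularized_loss(preds, targets, weights, lam):
    """L2 regularization: the penalty grows with weight magnitude, so a
    larger lam pushes the optimizer toward simpler, smaller-weight models."""
    return mse(preds, targets) + lam * sum(w ** 2 for w in weights)

weights = [3.0, -4.0]                                   # sum of squares = 25
print(regularized_loss([1.0], [1.0], weights, lam=0.0))  # 0.0 -- complexity is free
print(regularized_loss([1.0], [1.0], weights, lam=0.1))  # 2.5 -- large weights cost
```

With lam at 0 the model can fit training noise with no penalty; raising lam makes large weights expensive, which is why increasing regularization combats overfitting.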

1.6 Key ML Concepts

  • Epoch: One complete pass through the entire training dataset
  • Gradient descent: Optimization algorithm to minimize the loss function
  • Backpropagation: Algorithm for updating neural network weights based on error
  • Normalization: Scaling features so they contribute equally to the model
  • Imputation: Technique for handling missing data
  • Transfer learning: Reusing a pre-trained model for a new related task
  • Ensemble learning: Combining multiple models to improve performance
  • Tokenization (NLP): Breaking text into smaller units for processing
  • Embeddings: Numerical vector representations that capture semantic meaning; enable mathematical comparison of texts
  • Context window: Maximum text an LLM can process in one operation
  • Inference: Using a trained model to make predictions on new data

1.7 Inference Types

Type Latency Use Case
Real-time Low (immediate) Patient check-ins, live predictions, immediate response required
Serverless Medium (cold starts) Intermittent workloads, no infrastructure management
Asynchronous Higher (queued) Large payloads up to 1 GB, processing up to 1 hour, near-real-time OK
Batch transform High (scheduled) Large datasets, once-per-day inference, no immediacy needed

Asynchronous and batch inference are not the same despite both being non-immediate.

Asynchronous inference is designed for large individual requests — single inputs up to 1 GB that take a long time to process — where the result is retrieved shortly after completion rather than immediately.

Batch transform is designed for bulk processing of entire datasets at a scheduled time, with no real-time component at all.

Serverless inference is optimized for sporadic, low-frequency workloads where cold starts are acceptable; it is not the same as real-time, which requires consistently low latency.

Reading the scenario for payload size, frequency, and urgency is the key to differentiating these.

Domain 2: Fundamentals of Generative AI

2.1 Key Concepts

  • Foundation Model (FM): Large model pre-trained on massive data; broad generalized capabilities; base for many AI applications
  • Large Language Model (LLM): A type of FM specialized for language understanding and generation
  • LLMs are non-deterministic: Same input can produce different outputs; this is expected behavior
  • Tokens: Basic units (words/subwords) that LLMs process; inference cost is driven by token count
  • Embeddings: High-dimensional vectors that capture semantic relationships; enable similarity searches
  • Context window: Max tokens in a single prompt+response; exceeding it causes input truncation or a failed request
  • Hallucinations: Model generates plausible but false information; reduce by lowering temperature or using RAG/guardrails

2.2 Inference Parameters

Parameter Controls Effect
Temperature Randomness/creativity Higher = more creative/diverse; Lower = more deterministic/consistent
Top K Number of candidate tokens considered Controls vocabulary breadth
Top P Cumulative probability of candidates Only tokens within the cumulative probability mass P are considered
Max tokens Output length Hard cap on response length
Stop sequences Where generation stops Specific strings that terminate output

Temperature is about output consistency.

Temperature controls the probability distribution over possible next tokens at each generation step. Setting it to 0 makes the model deterministically select the most probable token every time, producing stable and repeatable output.

Setting it high allows lower-probability tokens to be selected, increasing variety and creativity at the cost of reliability.

For any scenario that requires consistent, repeatable, or predictable outputs, temperature should be set as close to 0 as possible.

Top K and Top P both narrow token selection but operate on different principles.

Top K sets an absolute limit: at each step, only the K most probable tokens are eligible, regardless of how their probabilities are distributed.

Top P sets a probabilistic threshold: tokens are added to the candidate pool in descending probability order until their cumulative probability reaches P, meaning the actual number of candidates varies by context.

Both reduce randomness, but Top P is more adaptive to the shape of the probability distribution.

The exam may present both as options — Top K controls a count, Top P controls a cumulative percentage.
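A simplified sampling sketch shows how the three parameters interact. The logits are toy values and the implementation is illustrative; real LLM samplers differ in detail.

```python
import math, random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Token sampling sketch: temperature rescales logits, Top K keeps a fixed
    candidate count, Top P keeps a cumulative-probability mass of candidates."""
    if temperature == 0:                       # greedy: always the argmax token
        return max(logits, key=logits.get)
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(l) for l in scaled.values())
    probs = sorted(((t, math.exp(l) / z) for t, l in scaled.items()),
                   key=lambda tp: -tp[1])
    if top_k is not None:                      # absolute candidate count
        probs = probs[:top_k]
    if top_p is not None:                      # cumulative probability cutoff
        kept, cum = [], 0.0
        for t, p in probs:
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    total = sum(p for _, p in probs)           # renormalize the survivors
    return rng.choices([t for t, _ in probs], [p / total for _, p in probs])[0]

logits = {"the": 3.0, "a": 2.0, "zebra": -1.0}
print(sample_next(logits, temperature=0))   # "the" -- deterministic, repeatable
```

With temperature 0 or top_k=1 the output is always the most probable token; raising temperature or widening Top K / Top P lets unlikely tokens like "zebra" occasionally appear.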

LLM inference cost is determined by token volume.

What drives cost is the total number of tokens processed: input tokens (the prompt, any examples, injected context) plus output tokens (the generated response). To reduce inference cost, reduce prompt length, eliminate unnecessary examples, or constrain maximum output tokens.
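A back-of-the-envelope sketch of that arithmetic (the per-1K-token prices here are hypothetical; actual Bedrock pricing varies by model):

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Total cost = input tokens + output tokens, each at its own rate.
    Prices are illustrative placeholders, not real model pricing."""
    return (input_tokens / 1000) * price_in_per_1k + \
           (output_tokens / 1000) * price_out_per_1k

# Trimming a 2,000-token prompt to 500 tokens cuts the input-side cost by 75%.
print(round(estimate_cost(2000, 400, 0.003, 0.015), 4))  # 0.012
print(round(estimate_cost(500, 400, 0.003, 0.015), 4))   # 0.0075
```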

2.3 Prompt Engineering Techniques

Technique Description When to Use
Zero-shot No examples provided Simple tasks, general instructions
One-shot One example provided When a single example clarifies format
Few-shot Multiple examples (2–10+) Format matching, style alignment, classification
Chain-of-thought (CoT) Ask model to reason step-by-step Complex reasoning, math, multi-step problems
Negative prompting Specify what to exclude Image generation, content control
Prompt chaining Break complex task into sequential subtasks Multi-step workflows
ReAct prompting Reasoning + acting with real-time tool calls Chatbots that query live data (e.g., inventory)
Least-to-most Decompose problem from simple to complex Complex problems built from simpler subproblems
Directional stimulus Guide model with hints about desired output Steering output toward specific keywords or content

Chain-of-thought and few-shot are frequently confused because both involve additional content in the prompt.

Few-shot prompting provides labeled input-output examples that demonstrate a desired format, style, or classification pattern — the model learns by imitation.

Chain-of-thought prompting asks the model to work through intermediate reasoning steps before arriving at a final answer — the model learns to reason, not just imitate.

If the scenario involves showing the model examples of what good output looks like, that is few-shot. If the scenario involves asking the model to explain its reasoning or solve a problem step by step, that is chain-of-thought.
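The difference is visible in the prompts themselves. A sketch with made-up review examples and hypothetical helper names:

```python
def few_shot_prompt(instruction, examples, query):
    """Few-shot: labeled input->output pairs demonstrate the desired format;
    the model imitates the mapping shown in the examples."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f"Review: {text!r} -> {label}")
    lines.append(f"Review: {query!r} ->")
    return "\n".join(lines)

def cot_prompt(problem):
    """Chain-of-thought: an instruction to reason through intermediate steps."""
    return f"{problem}\nThink step by step, then give the final answer."

print(few_shot_prompt("Classify the sentiment.",
                      [("Loved it", "positive"), ("Terrible service", "negative")],
                      "Works as expected"))
print(cot_prompt("A train travels 120 km in 2 h, then 60 km in 1 h. "
                 "What is its average speed?"))
```

The first prompt shows the model what good output looks like; the second asks it to reason. That mapping is exactly how the exam distinguishes the two techniques.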

Prompt engineering is always the first and lowest-cost intervention before any model customization.

When a model produces output in the wrong format, language, or structure, the correct first response is to refine the prompt — not to resize the model, adjust the architecture, or initiate fine-tuning.

Prompt changes require no training, no infrastructure, and no cost beyond the token count.

Fine-tuning and retraining are appropriate only after prompt engineering has been genuinely exhausted.

2.4 Model Customization Options (Least → Most Effort)

  1. Prompt engineering — no training, cheapest, try first
  2. RAG (Retrieval Augmented Generation) — inject external knowledge at query time; best for frequently changing data
  3. Fine-tuning — supervised training on labeled prompt/completion pairs; adapts style/domain
  4. Continued pre-training — training on unlabeled domain text; for domain vocabulary adaptation
  5. Training from scratch — maximum control, maximum cost

RAG and fine-tuning address fundamentally different problems and cannot substitute for each other.

Fine-tuning reshapes how the model behaves — its tone, vocabulary, output structure, and domain-specific response patterns — but it does not efficiently update the model's factual knowledge over time.

RAG does not change the model at all; instead it retrieves current, relevant information at query time and injects it into the prompt, keeping responses grounded in up-to-date sources.

When knowledge changes frequently, fine-tuning is the wrong tool: you would need to retrain the model every time facts change. RAG handles dynamic knowledge without any retraining.

Fine-tuning and continued pre-training require different data formats and serve different goals.

Fine-tuning requires labeled pairs of prompts and expected completions — structured input-output examples that teach the model to behave in a specific way for a specific task.

Continued pre-training uses large volumes of raw, unlabeled domain text to expose the model to domain-specific terminology and writing patterns, without teaching it to produce any particular output format.

Providing unlabeled text when fine-tuning is expected, or providing labeled pairs when continued pre-training is described, will not produce the intended result.

2.5 Model Types

Model Type Purpose
Text generation / GPT Generate text, code, SQL from natural language
Text embedding Convert text to vectors for similarity search/RAG
Multi-modal embedding Handle text AND images in queries
Multi-modal generation (large multi-modal LLM) Accept text/image input; produce text/image output
Diffusion model Image generation via iterative denoising
GAN Two competing networks; synthetic data
Transformer Self-attention mechanism; basis for most modern LLMs
WaveNet Audio/speech synthesis
VAE Generative model using latent space compression

Embedding models and generation models serve opposite purposes and cannot be swapped.

An embedding model converts text or images into dense numerical vectors for use in similarity comparisons, semantic search, and RAG retrieval pipelines — it does not generate readable output.

A generation model produces new content: text, images, or audio.

When a use case involves searching, matching, or ranking content by semantic similarity, an embedding model is needed. When it involves producing new content from a prompt, a generation model is needed.

Multi-modal variants exist for both — they accept or produce combinations of text and images, but the embedding vs. generation distinction still applies.

2.6 Evaluation Metrics for Generative AI

  • BLEU — translation quality (relative comparison to human reference)
  • ROUGE — summarization quality (recall-oriented)
  • BERTScore — semantic similarity; for style/coherence tasks
  • F1 score — precision+recall balance; classification accuracy after fine-tuning

BLEU is a comparative metric, not an absolute quality score.

A BLEU score only has meaning when comparing two systems translating the same source content against the same human reference. It cannot tell you whether a translation is "good" in isolation — only whether it resembles the human reference more or less than another system does.

BLEU also does not measure fluency, meaning, or style directly. It is specifically a translation evaluation tool and should not be applied to summarization, classification, or other generative tasks.

Domain 3: Applications of Foundation Models

3.1 Amazon Bedrock

Core platform for accessing and customizing foundation models without managing infrastructure.

Feature Purpose
Foundation Models Access models from Anthropic, Meta, Amazon, etc.
Knowledge Bases Fully managed RAG; default vector store = OpenSearch Serverless
Agents Orchestrate multi-step tasks; call APIs, query databases, take actions
Guardrails Filter harmful content, block topics, protect PII
Fine-tuning Customize models with your labeled data (prompt/completion pairs)
Provisioned Throughput For steady, predictable workloads (cost-effective at scale)
On-Demand Throughput Pay-per-use; best for experimentation or unpredictable usage
Invocation Logging Log model inputs/outputs for monitoring
Watermark detection Identifies images created by Amazon Titan Image Generator
PartyRock Free playground for experimenting with generative AI
bedrock-runtime API Makes inference requests
bedrock-agent-runtime API Invokes agents and queries knowledge bases

Guardrails and watermark detection are completely separate features that do not overlap in purpose.

Guardrails are a runtime content control mechanism: they inspect model inputs and outputs and enforce rules around harmful content categories, blocked topics, sensitive information, and specific words.

Watermark detection is a provenance tool: it determines whether a given image was generated by the Amazon Titan Image Generator, helping identify AI-created images after the fact.

One filters what the model says; the other identifies the origin of an image. They are not interchangeable and serve unrelated use cases.

Amazon Bedrock does not expose user data to the underlying model providers.

A common assumption is that using a third-party model through Bedrock means the model vendor can see your prompts and responses. This is not the case: Amazon Bedrock does not share user inputs or model outputs with any third-party model provider. This data privacy assurance is frequently tested and is a key differentiator from direct API access to those same model vendors.

A fine-tuned model in Bedrock cannot serve traffic until Provisioned Throughput is purchased.

Fine-tuning in Bedrock creates a custom model artifact, but that artifact is not automatically deployable. Unlike base models, which can be invoked on demand, custom fine-tuned models require a Provisioned Throughput commitment before they can receive inference requests. Skipping this step means the custom model has no serving capacity and cannot be used in production.

3.2 Amazon Bedrock Guardrails — Content Filter Categories

Filters for: Violence, Hate speech, Sexual content, Insults, Misconduct

Built-in content filters and configured denied topics are not the same mechanism, and the default content filter categories are narrower than most candidates expect.

Content filters target universally harmful content: violence, hate, sexual content, insults, and misconduct. They do not, by default, block topics that are sensitive but not inherently harmful — such as politics, religion, competitor products, or gambling. Those types of restrictions require explicitly configuring a denied topics list.

Assuming that content filters cover all unwanted content is a common error; topic-based restrictions require a separate and deliberate configuration step.

3.3 SageMaker Services Reference

Service Purpose
SageMaker Canvas No-code ML model building
SageMaker Ground Truth Data labeling with human annotators
SageMaker Ground Truth Plus Fully managed labeling (no app development needed)
SageMaker Data Wrangler Data preparation, transformation, feature engineering
SageMaker Feature Store Centralized feature repository; share features across teams/models
SageMaker Experiments Track and compare ML experiments
SageMaker Clarify Bias detection and model explainability
SageMaker Model Monitor Detect data/model drift in production
SageMaker Model Registry Store, version, and manage ML models
SageMaker Model Cards Document model purpose, metrics, limitations
SageMaker Model Dashboard Monitor and manage multiple deployed models
SageMaker JumpStart Pre-built models and solutions; accelerate development
SageMaker Autopilot Automated model building, training, and tuning (AutoML)
SageMaker Debugger Real-time training metrics
SageMaker HyperPod Distributed training; reduces training time up to 40%
SageMaker Studio Lab Free environment for ML experimentation
Amazon A2I (Augmented AI) Human review workflows for ML predictions

SageMaker Clarify and SageMaker Model Monitor are both quality tools but address different problems at different lifecycle stages.

Clarify is used during development: it statistically evaluates training data and model outputs for bias across demographic groups, and it generates feature-level explanations (using Shapley values) for why a model made specific predictions.

Model Monitor is used in production: it continuously compares live inference data against a baseline to detect when the model's input distribution or output behavior has drifted from its original state.

Clarify does not detect drift; Model Monitor does not explain predictions or detect bias in training data.

SageMaker Model Cards and SageMaker Model Registry are commonly confused because both involve recording information about models.

Model Cards are documentation artifacts for transparency and compliance: they describe the model's intended use, training methodology, performance characteristics, limitations, and ethical considerations.

Model Registry is a versioning and lifecycle management system: it stores the trained model artifacts, tracks versions, and controls the promotion workflow from development to production.

One is a human-readable document; the other is a software catalog and deployment pipeline.

3.4 AWS AI Services (Managed, No ML Expertise Needed)

Service Function
Amazon Textract Extract text and data from documents/PDFs/scanned images
Amazon Transcribe Speech-to-text; subtitles
Amazon Transcribe Medical Speech-to-text with healthcare compliance
Amazon Comprehend NLP: sentiment analysis, entity recognition, PII detection, toxicity detection
Amazon Comprehend Medical Extract medical info from clinical notes
Amazon Rekognition Computer vision: object detection, face recognition, image/video analysis
Amazon Translate Language translation
Amazon Polly Text-to-speech
Amazon Lex Build conversational chatbots
Amazon Kendra ML-powered enterprise search
Amazon Personalize Real-time personalization and recommendations
Amazon Forecast Time-series demand/traffic forecasting
Amazon Macie Detect and protect sensitive/PII data in S3
AWS HealthScribe Generative AI that drafts clinical notes from patient-clinician conversations

Amazon Rekognition is a computer vision service with no language or translation capabilities.

Rekognition analyzes images and video frames to detect objects, faces, scenes, and text embedded within visual content. It does not translate, understand natural language, or handle multilingual content.

Any scenario involving multiple spoken or written languages should point to Amazon Translate for text conversion between languages, or Amazon Polly for synthesizing speech in a target language — not Rekognition.

Performing sentiment analysis on audio is a two-service pipeline, not a one-service task.

Amazon Transcribe converts audio to text; it does not analyze sentiment.

Amazon Comprehend analyzes text for sentiment, entities, and key phrases; it cannot process audio.

Neither service covers the full workflow alone. Selecting only Transcribe leaves the analysis undone; selecting only Comprehend ignores the fact that the input is audio. The correct answer always requires both services in sequence.

Amazon Textract and Amazon Comprehend both deal with text but operate at completely different layers.

Textract is an extraction tool: it reads scanned documents, PDFs, and images and pulls out the raw text and structured data embedded within them, such as tables and form fields.

Comprehend is an analysis tool: it processes text that has already been extracted and derives meaning from it — sentiment, named entities, key phrases, PII, language.

Textract sees pixels and produces text; Comprehend sees text and produces insight. They are complementary services, not alternatives.

3.5 Amazon Q Services

Service Use Case
Amazon Q Business Enterprise AI assistant; answers questions from internal data
Amazon Q Developer AI coding assistant in IDE (code, test, document)
Amazon Q in QuickSight BI dashboards via natural language
Amazon Q in Connect Customer service agent assistance
Amazon Q Apps Create and share generative AI-powered apps within Q Business

The Amazon Q variants target completely different user personas and are not interchangeable.

Q Business serves general enterprise employees querying internal company knowledge bases.

Q Developer serves software engineers within an IDE for coding, testing, and documentation tasks.

Q in QuickSight serves business analysts creating data visualizations through natural language.

Q in Connect supports live customer service agents during active customer interactions.

Selecting Q Business for a developer productivity scenario, or Q Developer for an enterprise knowledge question, is a common mistake caused by treating "Amazon Q" as a single product rather than a family of purpose-specific tools.

3.6 RAG (Retrieval Augmented Generation)

RAG retrieves relevant content from a knowledge base at query time and injects it as context into the prompt.

When to use RAG:

  • Knowledge base changes frequently
  • Large documentation base
  • Cost-effective alternative to fine-tuning
  • Need factual, grounded responses

RAG Pipeline — offline batch processing steps (done ahead of query time):

  • Generation of content embeddings
  • Creation of the search index

RAG Pipeline — online (done at query time):

  • Generation of embeddings for user queries
  • Retrieval of relevant content
  • Response generation

Vector databases for RAG: Amazon OpenSearch Service, Amazon Aurora PostgreSQL (with pgvector), Amazon Redshift
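The offline and online steps above can be sketched end to end. The 3-dimensional embeddings below are hand-made stand-ins; a real pipeline would call an embedding model (such as Titan Embeddings) and a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Offline batch step: embed documents and build the search index.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
}

def retrieve(query_embedding, index, top_n=1):
    """Online step: rank indexed documents by similarity to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

query_embedding = [0.85, 0.15, 0.05]   # pretend-embedding of the user's question
context = retrieve(query_embedding, index)
print(context)                          # ['refund policy']

# Online step: inject the retrieved content into the prompt for generation.
prompt = f"Answer using this context: {context}\nQuestion: how do I get my money back?"
```

The model itself is never modified; updating the knowledge base simply means re-embedding the changed documents, which is why RAG suits frequently changing data.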

RAG addresses knowledge gaps; fine-tuning addresses behavioral gaps. Applying the wrong one wastes time and money.

If the issue is that the model does not know certain facts, lacks access to specific documents, or produces outdated information, RAG is the appropriate solution — it retrieves the relevant knowledge at query time without modifying the model.

If the issue is that the model produces output in the wrong format, tone, structure, or domain-specific style despite having the necessary knowledge, fine-tuning is the appropriate solution — it shapes behavior through additional training.

Attempting to fine-tune a model on factual documents to keep it current requires retraining every time facts change, which is expensive and operationally impractical.

Domain 4: Guidelines for Responsible AI

4.1 Core Responsible AI Principles

Principle Description
Fairness Model treats all groups equitably; diverse, balanced training data
Transparency Stakeholders can understand how the system works
Explainability Model can provide rationale for individual decisions (e.g., Shapley values, PDPs)
Privacy & Security Protect personal data; prevent exposure
Governance Policies, guidelines, auditing, compliance frameworks
Safety Prevent harm; human oversight

Fairness and explainability are related concepts that test different things.

Fairness is a population-level property: it asks whether the model produces equitable outcomes across different demographic or social groups, and whether certain groups are systematically advantaged or disadvantaged by the model's predictions.

Explainability is a decision-level property: it asks whether the reasoning behind a specific prediction can be articulated in a way that a human can understand, verify, and act on. A model can be explainable but unfair if its transparent reasoning is based on biased patterns.

The exam differentiates them by whether the scenario describes a group-level outcome disparity (fairness) or the need to justify a specific individual decision (explainability).

4.2 Types of AI Bias

Bias Type Description Example
Sampling bias Training data not representative of population Security camera model trained mainly on one demographic
Measurement bias Inconsistent data collection Systematic errors in how data is recorded
Confirmation bias Model reinforces existing assumptions Recommender that only surfaces content matching past behavior
Observer bias Human annotator introduces personal bias Labelers scoring ambiguous text according to their own views

How to address bias:

  • Use diverse, balanced training datasets
  • Data augmentation for underrepresented groups
  • Apply fairness metrics during evaluation
  • Use SageMaker Clarify for bias detection

Sampling bias and measurement bias are both data problems but have different root causes and different solutions.

Sampling bias is a coverage problem: the data collection process failed to represent certain groups, conditions, or scenarios in proportion to how they appear in the real world. The model then learns a skewed view of reality.

Measurement bias is an accuracy problem: data was collected from the right population but recorded, labeled, or quantified inconsistently — the same real-world condition is captured differently across subgroups or time periods.

Fixing sampling bias requires collecting more representative data; fixing measurement bias requires standardizing collection and labeling procedures.

4.3 AWS Tools for Responsible AI

Tool Purpose
SageMaker Clarify Bias detection + explainability (Shapley values)
SageMaker Model Cards Document model purpose, risks, performance for transparency
AWS AI Service Cards Transparency documentation for AWS-managed AI services
Amazon Bedrock Guardrails Filter harmful content, block topics, protect PII in generative AI
Amazon A2I Human review of ML predictions at defined confidence thresholds
RLHF Incorporate human preferences during model training
Amazon Comprehend PII detection and redaction; toxicity detection

Pitfall — RLHF and Amazon A2I both involve human feedback but operate at completely different points in the AI lifecycle and should never be confused.

RLHF (Reinforcement Learning from Human Feedback) is a training-time technique: human raters evaluate model-generated outputs during the training process, and those preferences are used as reward signals to steer the model toward better behavior before deployment.

Amazon A2I is an inference-time mechanism: it routes individual production predictions to human reviewers when the model's confidence falls below a defined threshold, providing a safety net around live outputs without modifying the model.

RLHF improves the model itself during training; A2I supplements the deployed model with human oversight in production.
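The A2I pattern can be illustrated with a short sketch: predictions below a confidence threshold are diverted to a human review queue instead of being returned directly. This is not the A2I API, just the inference-time routing idea it implements; the threshold and field names are invented.

```python
CONFIDENCE_THRESHOLD = 0.80  # assumed threshold for this example

def route_prediction(label, confidence, human_queue):
    """Auto-accept confident predictions; queue the rest for human review."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"label": label, "source": "model"}
    human_queue.append({"label": label, "confidence": confidence})
    return {"label": None, "source": "pending-human-review"}

queue = []
print(route_prediction("approved", 0.95, queue))  # handled by the model
print(route_prediction("approved", 0.55, queue))  # sent to human reviewers
print(len(queue))  # 1
```

Note that the model itself is untouched: the human acts as a safety net around live outputs, which is exactly what distinguishes A2I from RLHF.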

4.4 Guardrails for Amazon Bedrock — Filter Types

Filter Blocks
Content filters Violence, hate, sexual content (harmful categories)
Denied topics Specific topics you define (e.g., politics, competitor products)
Sensitive information filters PII and other sensitive data
Word filters Specific words or phrases
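The difference between denied topics and word filters can be seen in a toy input screen. This is purely illustrative pseudologic, not the Bedrock Guardrails API; the topic and word lists are invented.

```python
DENIED_TOPICS = {"politics", "competitor pricing"}  # topics you define
BLOCKED_WORDS = {"foo-secret"}                      # exact words/phrases

def check_prompt(prompt):
    lowered = prompt.lower()
    if any(topic in lowered for topic in DENIED_TOPICS):
        return "BLOCKED: denied topic"
    if any(word in lowered for word in BLOCKED_WORDS):
        return "BLOCKED: word filter"
    return "ALLOWED"

print(check_prompt("Summarize our politics coverage"))  # BLOCKED: denied topic
print(check_prompt("What is RAG?"))                     # ALLOWED
```

Real guardrails classify topics semantically rather than by substring match, but the layering is the same: category filters, topic filters, sensitive-information filters, and word filters are evaluated against every input and output.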

4.5 Hallucinations

  • LLMs generate confident but false information
  • Reduce by: lowering temperature, using RAG to ground responses in facts, using Bedrock Guardrails
  • Amazon Q Business reduces hallucinations by confining responses to existing enterprise data
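The "ground responses in facts" idea behind RAG can be sketched in a few lines: retrieve a relevant passage, then force the model to answer from that passage rather than from parametric memory. Retrieval here is naive word overlap; a real system such as Bedrock Knowledge Bases uses vector embeddings. The documents are invented.

```python
import re

DOCS = [
    "Provisioned Throughput offers a fixed rate for steady Bedrock workloads.",
    "On-Demand pricing bills per token and suits unpredictable usage.",
]

def tokens(text):
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def retrieve(question, docs):
    """Return the document sharing the most words with the question."""
    return max(docs, key=lambda d: len(tokens(question) & tokens(d)))

def build_grounded_prompt(question):
    context = retrieve(question, DOCS)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"

print(build_grounded_prompt("When should I use On-Demand pricing?"))
```

The instruction "using ONLY this context" is what grounds the generation: the model is steered toward the retrieved facts instead of free recall, which is why RAG reduces hallucinations without retraining.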

Domain 5: Security, Compliance, and Governance

5.1 Shared Responsibility Model for AI

  • AWS secures: Infrastructure, physical hardware, underlying cloud services
  • Customer secures: Data, access controls (IAM), model inputs/outputs, application logic

Security responsibility increases as you build more yourself:

  • Using third-party SaaS with embedded AI → lowest customer responsibility
  • Building application on existing FM → moderate
  • Fine-tuning an existing FM → more responsibility
  • Building and training FM from scratch → maximum customer responsibility

Customer security responsibility scales with how much of the stack the customer owns, and the exam tests this spectrum directly.

When a company consumes AI through a fully managed third-party application, the vendor and AWS handle nearly all infrastructure and model security; the customer is responsible only for access management and appropriate usage.

As customers move toward building, fine-tuning, or training their own models, they absorb increasing responsibility for training data security, model artifact protection, output validation, and infrastructure configuration.

The exam will describe an implementation approach and ask where responsibility lies — always map the approach to its position on the spectrum from fully managed to fully custom.

5.2 IAM and Access Control

  • Use IAM policies to restrict which foundation models employees can access
  • Use custom service roles per team to restrict data access in Bedrock (e.g., each team only sees their S3 data)
  • Use AWS IAM Identity Center to securely integrate Bedrock into enterprise systems

Data access isolation across multiple teams in Bedrock requires separate, scoped service roles — not a single shared role filtered at the application layer.

A single Bedrock service role with broad S3 permissions satisfies basic functionality but violates least privilege and creates no enforcement boundary between teams. Each team should be assigned a role scoped exclusively to their own data resources.

Relying on application logic or user self-reporting to limit data access is not a security control — it is an assumption, and it fails the moment the application has a bug or a user makes a mistake.
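The scoped-role pattern amounts to generating one IAM policy per team, each restricted to that team's S3 prefix. The sketch below builds such a policy document; the bucket and team names are invented for illustration.

```python
import json

def team_s3_policy(bucket, team):
    """IAM policy allowing read access to one team's S3 prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Enforcement boundary: only this team's prefix is reachable
            "Resource": f"arn:aws:s3:::{bucket}/{team}/*",
        }],
    }

policy = team_s3_policy("example-training-data", "team-alpha")
print(json.dumps(policy, indent=2))
```

Because the restriction lives in the role's policy, it is enforced by AWS regardless of what the application does, which is the difference between a security control and an application-layer assumption.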

5.3 Network Security for AI

Service/Feature Purpose
AWS PrivateLink Private connectivity between VPC and Bedrock/SageMaker — no public internet
VPC with S3 endpoint Manage secure data flow from S3 to SageMaker
SageMaker network isolation Run training/inference jobs without internet access
AWS Shield DDoS protection

AWS PrivateLink is the correct and only answer when traffic between a VPC and an AWS service must not traverse the public internet.

CloudFront is a content delivery network that accelerates public-facing traffic — it does not create private network paths. Internet gateways route traffic through the public internet by definition, which is the opposite of what private connectivity requires.

A VPN connects on-premises networks to AWS but does not create a private path between AWS services within the same cloud environment.

PrivateLink creates an internal network endpoint within the AWS backbone, ensuring that data between the VPC and the service never leaves the AWS network and never touches the public internet.

5.4 Monitoring and Audit

Service Use
AWS CloudTrail Log API calls; identify unauthorized access attempts; audit trail
Amazon CloudWatch Operational monitoring, metrics, alarms for model performance
AWS Config Track configuration compliance against rules
AWS Audit Manager Assess compliance with frameworks; generate audit reports
AWS Artifact Access AWS compliance reports and certifications
Amazon Macie Detect sensitive/PII data in S3; automated alerts
Amazon Inspector Security vulnerability assessment

CloudTrail, CloudWatch, Config, and Audit Manager all relate to monitoring but serve non-overlapping purposes, and selecting the wrong one is a common exam error.

CloudTrail records who made API calls, from where, and when — it is the authoritative source for access auditing and investigating unauthorized activity.

CloudWatch monitors what the system is doing right now — collecting operational metrics, logs, and events to trigger alarms and dashboards.

Config evaluates whether AWS resource configurations comply with defined rules over time, answering whether infrastructure is set up according to policy.

Audit Manager aggregates evidence across services to support formal compliance assessments against frameworks like ISO or SOC, producing structured audit reports.

Each serves a distinct governance layer; the scenario's question — who accessed what, what is the system doing, is the configuration compliant, or what does our compliance evidence show — determines the correct service.

Amazon Macie is the purpose-built service for automated sensitive data detection across S3.

Comprehend can detect and redact PII within text that is already extracted and passed to it programmatically, but it requires integration work and does not natively scan S3 buckets.

Macie continuously monitors S3 objects for sensitive data patterns — including PII, credentials, and financial information — and generates automated findings and alerts without requiring application code to orchestrate it.

When a scenario describes automated, ongoing discovery of sensitive content in S3 with minimal development effort, Macie is the answer.

5.5 Encryption and Data Protection

  • AWS KMS — Customer-managed encryption keys for model artifacts and data
  • SSE-S3 — S3-managed encryption; decryption is transparent to any principal with object read access (with KMS-encrypted objects, the Bedrock service role also needs kms:Decrypt on the key)
  • Federated learning — Train on distributed data without centralizing it; preserves privacy/compliance

Encrypting training data protects it at rest and in transit, but does not prevent a trained model from learning and reproducing that information.

A common misconception is that if sensitive data is encrypted before being used for training, the model cannot expose it. Encryption governs access to the raw data — it does not affect what patterns the model learns. If personally identifiable, confidential, or regulated information is present in training data, the model may reproduce or infer that information in its outputs regardless of how the source data was stored. The correct approach to preventing a model from generating responses based on sensitive training content is to remove or de-identify that data before training ever begins.
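A minimal sketch of that de-identification step: strip obvious PII patterns from text before it enters a training set. A real pipeline would use a managed capability such as Amazon Comprehend's PII detection; the regexes and placeholders below are illustrative only.

```python
import re

# Illustrative patterns: email addresses and US-style SSNs
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def deidentify(text):
    """Replace recognizable PII with placeholders before training."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(deidentify("Contact jane@example.com, SSN 123-45-6789."))
# Contact [EMAIL], SSN [SSN].
```

The key point survives the simplification: the sensitive values are gone before the model ever sees them, so no amount of prompting can make the model reproduce them, which encryption alone cannot guarantee.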

5.6 Data Governance Concepts

Concept Description
Data residency Where data is physically stored geographically
Data retention Policies for how long data is kept before deletion
Data lineage Tracking data flow for compliance and audit
Data de-identification Removing PII from datasets
Data quality Standards ensuring accuracy and reliability

Data residency and data lineage are distinct governance concerns that are frequently conflated.

Data residency is a geographic constraint: regulations in many jurisdictions prohibit certain types of data — patient records, financial data, citizen data — from being stored or processed outside a defined geographic boundary. It is a where question.

Data lineage is an audit and traceability concern: it tracks how data has moved, been transformed, and been used throughout its lifecycle. It is a history question.

A scenario about regulatory requirements preventing data from crossing a national border is a data residency concern; a scenario about tracing the origin of a dataset used to train a model is a data lineage concern.

5.7 Model Documentation and Governance

  • SageMaker Model Cards — Standardize documentation: intended use, training details, performance, limitations, risks
  • Infrastructure as Code (IaC) — Enables consistent, scalable, repeatable ML deployments
  • MLflow with SageMaker — Manage and track ML experiments collaboratively

5.8 Prompt Injection and Security

  • Prompt injection — Attacker crafts inputs to override a model's instructions or extract system prompt contents
  • Extracting the prompt template — Specific attack that exposes the configured system behavior of an LLM
  • Defense: Use Bedrock Guardrails, dynamic context-aware prompt templates, denied topics

Prompt injection is an input-layer attack and cannot be mitigated by changing the underlying model.

The vulnerability exists because LLMs process user-supplied input and system instructions in the same text stream, making it possible for a crafted user message to override, neutralize, or extract the system's configured behavior.

Switching to a different base model or fine-tuning does not close this vulnerability because the architecture of how instructions and inputs are combined remains unchanged.

Defenses must be applied at the input handling and prompt construction layer: using Guardrails to inspect inputs, structuring prompts to separate system instructions from user content, and using denied topics to block extraction attempts.
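Two of those defenses can be sketched together: screen inputs for known injection phrases, and keep system instructions structurally separate from user content instead of concatenating them into one text stream. The marker list is illustrative and deliberately incomplete; real guardrails classify intent rather than match substrings.

```python
INJECTION_MARKERS = [
    "ignore previous instructions",
    "reveal your system prompt",
]

def build_messages(system_prompt, user_input):
    """Construct role-separated messages, rejecting obvious injection attempts."""
    lowered = user_input.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise ValueError("possible prompt injection detected")
    # Separate roles instead of pasting user text into the system prompt
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("You are a support bot.", "How do I reset my password?")
print(len(msgs))  # 2
```

Both defenses live entirely in the input-handling layer, which is the point of the pitfall above: swapping the base model would change neither of them.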

High-Frequency Tricky Scenarios

"Most cost-effective" questions

Scenario Answer
Improve FM accuracy Prompt engineering first (cheapest)
Frequently changing knowledge base RAG (not fine-tuning)
Steady request rate for custom model Provisioned Throughput
Unpredictable/experimental usage On-Demand Throughput
Reduce token costs Decrease tokens in prompt
Monthly reports, not immediate Batch transform

"Least effort / least operational overhead" questions

Scenario Answer
Apply safeguards to LLM Amazon Bedrock Guardrails
Add subtitles/voice-overs to film Transcribe + Translate + Polly
Detect sensitive data in S3 Amazon Macie
Fine-tune open-source LLM SageMaker JumpStart
Build ML model without code SageMaker Canvas
Human data labeling without managing workforce SageMaker Ground Truth Plus
Experiment with generative AI for free PartyRock

Model performance problems

Symptom Diagnosis Fix
Good on training, bad on new data Overfitting Increase regularization, more training data
Bad on training and new data Underfitting More epochs, features, reduce regularization
Model performance degrades over time Data/concept drift Retrain with fresh data; SageMaker Model Monitor
Disproportionate outcomes for groups Bias SageMaker Clarify; diverse training data

Quick Service Disambiguation

Textract vs. Transcribe vs. Translate vs. Comprehend

  • Textract = extract text FROM documents/images (OCR)
  • Transcribe = convert speech/audio TO text
  • Translate = convert text from one language TO another language
  • Comprehend = understand/analyze text content (NLP)

Clarify vs. Model Monitor vs. Model Cards

  • Clarify = detect bias + explain predictions (Shapley values)
  • Model Monitor = detect drift in production
  • Model Cards = document model for transparency/compliance

RAG vs. Agents vs. Knowledge Bases

  • RAG = the technique (retrieve + generate)
  • Knowledge Bases = AWS managed RAG implementation in Bedrock
  • Agents = orchestrate multi-step tasks (retrieve + act + loop)

CloudTrail vs. CloudWatch vs. Config vs. Audit Manager

  • CloudTrail = API call logging (who did what)
  • CloudWatch = metrics, alarms, operational monitoring
  • Config = resource configuration compliance rules
  • Audit Manager = compliance framework reporting and evidence collection

Ready to put this knowledge to the test? CertVista AIF-C01 offers a realistic test environment that mirrors the real exam experience — along with domain breakdowns and the latest updates to the question format.

If you're still weighing whether this certification is right for you, our AWS Certified AI Practitioner overview covers what the credential entails, who it's designed for, and where it fits in the broader AWS certification path.


Last updated: Sunday, 08 March 2026
