MLA-C01 AWS Machine Learning Engineer – Associate Study Guide

Getting ready for the MLA-C01 exam requires a deep, hands-on understanding of the full ML lifecycle on AWS - from raw data ingestion and feature engineering, through model training and evaluation, to production deployment, monitoring, and security. The real challenge is not just knowing the services, but knowing which service to use, when, and why - especially when several options seem equally valid.

This guide is built from real exam patterns and covers the four core domains with clear summaries, direct service comparisons, decision-making frameworks, and warnings about common exam traps.

Exam Domain Overview

Domain | Approx. Weight
Data Preparation for Machine Learning | ~28%
ML Model Development | ~26%
Deployment and Orchestration of ML Workflows | ~22%
ML Solution Monitoring, Maintenance, and Security | ~24%

Domain 1: Data Preparation for Machine Learning

1.1 Data Ingestion and Aggregation

Before any ML model can be trained, data must be collected from disparate sources and unified into a consistent format. The exam frequently presents multi-source scenarios and tests whether you can identify the right aggregation tool.

Tool | Best For | Key Characteristic
AWS Lake Formation | Centralized data lake from S3, RDS, on-premises databases | Governed ingestion + access control; ideal for heterogeneous sources
Amazon EMR (Spark) | Large-scale distributed data processing | Heavy compute; requires cluster management
AWS Glue | ETL jobs, schema discovery, Data Catalog | Serverless; best for structured data transformation
Amazon Kinesis Data Streams | Real-time streaming data ingestion | Low latency; not for batch aggregation from static sources
Amazon DynamoDB | NoSQL key-value store | A database, not a data aggregation service

Lake Formation is the aggregation answer; Kinesis is the streaming answer — do not swap them.

When a scenario describes pulling data from Amazon S3, on-premises MySQL databases, and other static sources into a unified data lake, the answer is AWS Lake Formation. It is purpose-built to govern, catalog, and aggregate data from multiple heterogeneous sources.

Amazon Kinesis Data Streams is for real-time event streaming — not for batch ingestion from relational databases or S3. Selecting Kinesis when the data is static is a frequent trap.

Amazon DynamoDB is a database and stores data but does not perform aggregation from multiple source systems.

1.2 Data Transformation and ETL

Once data is ingested, it must be transformed into a model-ready format. The exam tests whether you can choose the right ETL tool for the job's complexity and operational overhead.

Tool | Use Case | Code Required?
AWS Glue (ETL jobs) | Schema discovery, format conversion, Data Catalog updates | PySpark / Python
AWS Glue DataBrew | Visual, no-code data cleaning and transformation | No (recipe-based)
AWS Glue Relationalize | Flatten nested/JSON data into tabular format | PySpark transform
SageMaker Data Wrangler | Interactive data prep inside SageMaker Studio | No (UI-based)
Amazon Athena | SQL queries over S3 data | SQL
Amazon EMR | Large-scale Spark/Hadoop jobs | Spark/Scala/Python

Format conversions that appear on the exam:

  • CSV → Apache Parquet: The standard optimization for query performance; use an AWS Glue ETL job. Parquet is columnar and dramatically faster for analytical queries and ML feature generation.
  • JSON (nested) → tabular: Use the AWS Glue Relationalize transform — it flattens nested JSON automatically with minimal code.
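  • DeepAR time series format: DeepAR reads time series as JSON Lines (one JSON object per line), optionally gzip-compressed. The DeepAR documentation also lists Parquet as an accepted input format, but RecordIO is not, and exam answers typically pair JSON Lines with gzip.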
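
The JSON Lines plus gzip layout for DeepAR can be sketched with the Python standard library alone. This is a minimal illustration, not the author's pipeline: the two toy series, the helper names, and the train.json.gz file name are assumptions, while "start" and "target" are the record fields DeepAR documents.

```python
import gzip
import json
import os
import tempfile

# Hypothetical toy series; each line of the file is one JSON object.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.2, 6.1, 8.4]},
    {"start": "2024-01-01 00:00:00", "target": [1.0, 0.5, 0.9, 1.3]},
]


def write_jsonl_gz(records, path):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read the file back, parsing each non-empty line as JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


out_path = os.path.join(tempfile.mkdtemp(), "train.json.gz")
write_jsonl_gz(series, out_path)
round_trip = read_jsonl_gz(out_path)
```

The resulting file would then be uploaded to the S3 prefix used as the training channel; that upload step is left out here.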

No-code vs. low-code tools are not interchangeable in exam answers.

AWS Glue DataBrew is a fully visual, no-code tool. You build "recipes" through a point-and-click interface — no scripting. It is the answer whenever the requirement specifies "no-code transformation."

AWS Glue ETL jobs require writing PySpark or Python code. They are more flexible but carry more implementation effort.

SageMaker Data Wrangler is the no-code/low-code option within SageMaker Studio specifically. It is the right answer when the question is scoped to the SageMaker ecosystem and asks for data preparation with the least effort.

Selecting a code-based tool when the requirement specifies "least development effort" or "no-code" will cost you the point.

The correct compression format for DeepAR is gzip, not Snappy or XZ.

Amazon SageMaker DeepAR expects JSON Lines (.jsonl) format for time series data, compressed with gzip. Snappy is common with Parquet but is not the correct answer for DeepAR. XZ offers high compression ratios but is not supported. Mixing format and compression (e.g., Parquet + XZ, RecordIO + gzip for DeepAR) is a common distractor pattern.

1.3 Data Sources and Connections in SageMaker

Connection Type | What It Does | When to Use
SageMaker Data Wrangler, direct connection | Live query at import time; always pulls latest data | When data must always be up to date
SageMaker Data Wrangler, cataloged connection | Uses a snapshot registered in AWS Glue Data Catalog | When consistency matters more than recency
SageMaker Feature Store | Persistent, versioned feature repository for training and inference | When features must be shared across models and stay synchronized

Direct connections — not cataloged connections — ensure always-current data.

When the exam states that "ingested data must always be up to date with the latest changes in the source systems," the answer is a direct connection in SageMaker Data Wrangler.

Cataloged connections reference a snapshot in the Glue Data Catalog, which may lag behind the source. If freshness is the requirement, direct connections are correct.

1.4 Feature Engineering

Feature engineering transforms raw data into the numerical representations that ML algorithms can learn from. The exam heavily tests encoding choices and their implications.

Encoding Techniques:

Technique | When to Apply | Result
One-hot encoding | Nominal/categorical features with no inherent order | Adds one binary column per category; increases dimensionality
Label (ordinal) encoding | Ordinal categories with a natural ranking | Maps categories to integers; preserves order
Binary encoding | High-cardinality nominal categories | Fewer new columns than one-hot; compromise approach
Tokenization | Text data for NLP models | Splits text into sub-word or word tokens

One-hot encoding increases dimensionality; label encoding does not — this distinction drives the correct answer.

When a feature has a clear ordinal relationship (e.g., job_seniority_level: Junior < Senior < Principal), label encoding is appropriate because the integer ordering carries meaning.

When a feature is nominal with no ordering (e.g., location: Berlin, Paris, Tokyo), one-hot encoding is appropriate because you do not want the model to assume Berlin < Paris < Tokyo.

When the question explicitly states "must not increase the dimensionality of the dataset" and the values are binary (yes/no, true/false), label encoding or binary encoding is correct — one-hot encoding would add columns.
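
The dimensionality difference between the two encodings can be seen in a few lines of plain Python. This is a minimal sketch using the example values from this section; the helper names are hypothetical and a real pipeline would normally use a library encoder instead.

```python
def label_encode(values, order):
    """Map each category to its position in an explicit ordinal ranking."""
    mapping = {category: rank for rank, category in enumerate(order)}
    return [mapping[v] for v in values]


def one_hot_encode(values):
    """Return one binary column per distinct category, plus the column names."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


seniority = ["Junior", "Principal", "Senior", "Junior"]
location = ["Berlin", "Tokyo", "Paris", "Berlin"]

# Ordinal feature: one integer column, order preserved.
seniority_encoded = label_encode(seniority, ["Junior", "Senior", "Principal"])

# Nominal feature: one column becomes three.
location_encoded, location_columns = one_hot_encode(location)
```

Here the ordinal feature stays a single column, while the nominal feature expands to as many columns as it has distinct values.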

1.5 Handling Class Imbalance

Class imbalance occurs when one class in the target variable is far more common than another (e.g., 99% legitimate transactions vs. 1% fraudulent). This is one of the most tested topics in the data preparation domain.

Technique | What It Does | When to Use
Random oversampling | Duplicates minority class examples | When you cannot generate new data
SMOTE (Synthetic Minority Oversampling Technique) | Synthesizes new minority class examples | When you need more diversity in minority class without losing data
Random undersampling | Removes majority class examples | Only when you can afford to lose data
Class weights / scale_pos_weight | Adjusts loss function to penalize minority class misses | Algorithmic fix; no data modification

Difference in Proportions of Labels (DPL):

DPL is a pre-training bias metric used to quantify representation imbalance across demographic groups. A DPL value near 0 indicates balanced representation. A high positive DPL (e.g., +0.9) for a specific subgroup means that subgroup is dramatically over-represented in the positive class compared to the rest of the dataset.

When DPL is high for a subgroup, you undersample that subgroup — not the entire class.

A DPL of +0.9 for the age range 40–45 in the positive class means that age group is over-represented relative to all other age ranges. The correct remedy is to undersample the positive class for the age range 40–45 to bring it in line with other groups.

Oversampling the positive class for that age group would worsen the imbalance. Undersampling all other age ranges would lose data unnecessarily. The fix is targeted at the over-represented group only.
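
The effect of targeted undersampling on DPL can be checked with a small calculation. This is a sketch under stated assumptions: DPL is computed as the positive-label rate inside the subgroup minus the rate in the rest of the dataset, the toy records are invented, and the helper names are hypothetical. SageMaker Clarify computes the metric for you in practice.

```python
def positive_rate(labels):
    """Share of records carrying a positive (1) label."""
    return sum(labels) / len(labels)


def dpl(records, in_group):
    """Positive rate inside the group minus the positive rate outside it."""
    group = [r["label"] for r in records if in_group(r)]
    rest = [r["label"] for r in records if not in_group(r)]
    return positive_rate(group) - positive_rate(rest)


def undersample_group_positives(records, in_group, keep):
    """Keep only the first `keep` positive records of the group; keep all others."""
    kept, seen = [], 0
    for r in records:
        if in_group(r) and r["label"] == 1:
            seen += 1
            if seen > keep:
                continue
        kept.append(r)
    return kept


def is_40_to_45(record):
    return 40 <= record["age"] <= 45


# Hypothetical data: 18 of 20 people aged 40-45 are positive, 2 of 20 others are.
records = (
    [{"age": 42, "label": 1}] * 18 + [{"age": 42, "label": 0}] * 2
    + [{"age": 30, "label": 1}] * 2 + [{"age": 30, "label": 0}] * 18
)

dpl_before = dpl(records, is_40_to_45)
balanced = undersample_group_positives(records, is_40_to_45, keep=2)
dpl_after = dpl(balanced, is_40_to_45)
```

In this toy case DPL drops from about 0.8 to about 0.4 after removing positives only from the over-represented age range, and every record outside that range is kept.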

SMOTE is for balancing classes without losing data; undersampling discards existing data.

When the requirement explicitly states "without losing any existing training data," the correct technique is oversampling (e.g., SMOTE), not undersampling.

Use SageMaker Clarify to detect the imbalance (the CI metric ranges from -1 to +1, and values far from 0 indicate imbalance) and SageMaker Data Wrangler to apply SMOTE.
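
The idea behind SMOTE, creating new minority points by interpolating between existing ones, can be sketched as follows. This is a deliberately simplified illustration: real SMOTE interpolates toward one of the k nearest minority neighbors, whereas this version picks any other minority point at random. The data and the function name are hypothetical.

```python
import random


def smote_like(minority, n_new, rng):
    """Create n_new synthetic points on line segments between minority pairs."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct minority points
        t = rng.random()                 # position along the segment a -> b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 10.0], [12.0, 11.0], [11.0, 13.0]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_like(minority, len(majority) - len(minority), rng)

# Original minority rows are all retained; only new rows are added.
balanced_minority = minority + synthetic
```

No existing row is dropped, which is why oversampling satisfies a "without losing any existing training data" requirement.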

1.6 PII Detection and Data Masking

Service | Capability | Limitation
AWS Glue, Detect PII transform | Find and mask PII in Glue jobs using Spark | Requires writing a Glue job but uses built-in transform
Amazon Macie | Discover sensitive data in S3 | Identifies PII but does not mask it in place
Amazon Comprehend | Detect entities (names, addresses) in text | NLP-based; does not natively mask
Custom Spark/regex on EC2 | Maximum flexibility | Maximum development effort

Macie discovers PII but does not mask it — it is a detection service, not a transformation service.

Amazon Macie will classify S3 objects containing sensitive data and alert you, but it does not modify the data. If the requirement is to "find and mask" PII, the correct answer involves AWS Glue with the Detect PII transform (Glue's built-in sensitive data detection) or a custom Glue job with regex. Macie alone does not satisfy the masking requirement.

1.7 Data Labeling

Service | Workforce Type | Best For
SageMaker Ground Truth | Private, public (Mechanical Turk), or vendor | Flexible labeling with custom UIs; requires setup
SageMaker Ground Truth Plus | Managed vendor workforce | Fully managed; no app development required
Amazon Augmented AI (A2I) | Human review of ML predictions | Reviewing and correcting model outputs; not initial labeling
Amazon Mechanical Turk (AMT) | Public crowdworkers via AMT website | Not the AWS ML-integrated path

Ground Truth and A2I solve different problems in the labeling workflow.

SageMaker Ground Truth is for creating initial labels on raw data — it is used before a model exists, to build the training dataset.

Amazon Augmented AI (A2I) is for human review of predictions that a model has already made — used in production to catch low-confidence outputs and improve them.

When a scenario involves unlabeled data that needs to be annotated for the first time, the answer is Ground Truth. When it involves reviewing model outputs in production, the answer is A2I.

A private workforce in Ground Truth is the correct answer for employee-only access.

When the requirement states that only employees should perform labeling tasks, the answer is to create a private workforce in SageMaker Ground Truth. Amazon Mechanical Turk is a public crowdsourcing platform where random members of the public complete tasks — this does not satisfy any requirement for restricted access.

1.8 Data Quality and Profiling

Service | Purpose
AWS Glue Data Quality | Rule-based data quality checks on Glue Data Catalog tables
AWS Glue DataBrew (Profile job) | Statistical profiling: distributions, nulls, correlations
AWS Glue DataBrew (Recipe job) | Applying transformations and cleaning steps
SageMaker Data Wrangler | Interactive data quality exploration within SageMaker

DataBrew profile jobs analyze data quality; recipe jobs transform it — they are not the same job type.

A profile job in DataBrew generates statistical summaries: column distributions, null counts, duplicate rates, and correlation matrices. It is a read-only analysis step.

A recipe job applies a sequence of transformation steps (the "recipe") to actually clean, normalize, encode, or filter the data and write the output.

When the requirement is to "clean and normalize" data, the answer is a recipe job, not a profile job.

1.9 SageMaker Feature Store

Feature Store provides a centralized repository for storing, sharing, and serving ML features consistently between training and inference.

Store Type | Latency | Use Case
Online store | Low (milliseconds) | Real-time inference; always returns latest feature value
Offline store | Higher (batch) | Training jobs; historical snapshots in S3

Enabling both online and offline stores is required when you need real-time inference AND batch training consistency.

When a scenario requires that the online store is updated with the most recent data as it arrives AND a complete historical record is maintained for batch processing, you must enable both stores simultaneously.

Using only the online store loses the historical record. Using only the offline store cannot serve real-time inference at low latency. The Feature Store Spark connector supports ingesting data into both stores in a single operation.
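
Enabling both stores comes down to supplying both store configurations when the feature group is created. The sketch below only builds the request dictionary for the SageMaker CreateFeatureGroup API and does not call AWS; the feature names, bucket, and role ARN are hypothetical placeholders.

```python
def feature_group_request(name, offline_s3_uri, role_arn):
    """Build CreateFeatureGroup arguments with online AND offline stores enabled."""
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": "customer_id",
        "EventTimeFeatureName": "event_time",
        "FeatureDefinitions": [
            {"FeatureName": "customer_id", "FeatureType": "String"},
            {"FeatureName": "event_time", "FeatureType": "String"},
            {"FeatureName": "avg_basket_value", "FeatureType": "Fractional"},
        ],
        # Online store: low-latency reads of the latest value per record.
        "OnlineStoreConfig": {"EnableOnlineStore": True},
        # Offline store: historical records written to S3 for training.
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": offline_s3_uri}},
        "RoleArn": role_arn,
    }


request = feature_group_request(
    "customer-features",                                        # hypothetical name
    "s3://example-bucket/feature-store/",                       # hypothetical bucket
    "arn:aws:iam::123456789012:role/ExampleFeatureStoreRole",   # hypothetical role
)
```

With credentials configured, the dictionary could be passed as boto3.client("sagemaker").create_feature_group(**request). Omitting either store configuration gives the single-store behavior described above.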

Domain 2: ML Model Development

2.1 Choosing the Right Algorithm

The exam presents business scenarios and asks you to select the appropriate SageMaker built-in algorithm. The key is mapping problem type to algorithm family.

Algorithm | Problem Type | Key Signal Words
XGBoost | Classification, regression, ranking | Tabular data; "predict likelihood"; "loan ranking"
Linear Learner | Binary/multi-class classification, regression | Linear relationships; interpretability
DeepAR | Time series forecasting | "Historical measurements at regular intervals"; "predict future values"
Random Cut Forest (RCF) | Anomaly detection | Unlabeled data; "detect irregularities"; "unexpected values"
K-means | Clustering (unsupervised) | "Group similar items"; no labels
K-nearest neighbors (k-NN) | Classification/regression (supervised) | "Based on similar examples"; labeled training data
PCA | Dimensionality reduction | "Reduce features"; "high dimensionality"
Factorization Machines | Recommendations on sparse data | "High-dimensional sparse data"; collaborative filtering
Neural Topic Model (NTM) | Topic modeling | "Discover topics in text documents"
Seq2Seq | Sequence-to-sequence | Machine translation, text summarization
BlazingText | Word2Vec / text classification | Word embeddings; fast text classification

DeepAR requires specific hyperparameters — context_length and prediction_length must match your forecasting horizon.

The context_length controls how much historical data the model looks at to make each prediction. The prediction_length controls how far into the future the model predicts. Both must be set to match the time granularity and horizon of your problem. Setting scale_pos_weight (an XGBoost hyperparameter) on a DeepAR job, or applying k-means clustering to a forecasting problem, are common wrong answers the exam uses as distractors.

RCF is the anomaly detection algorithm; k-means is a clustering algorithm — they are not interchangeable even though both are unsupervised.

Random Cut Forest is explicitly designed to score data points by their anomaly likelihood. It works on unlabeled data and assigns an anomaly score to each input — higher scores indicate anomalies.

K-means groups data into clusters based on similarity. It does not produce an anomaly score and is not designed to flag outliers.

When the scenario involves an unlabeled dataset and asks to "detect irregularities," "identify unusual patterns," or "find outliers," the answer is RCF. The num_trees hyperparameter controls model stability (more trees = more stable scores).

Factorization Machines, not k-NN, is correct for high-dimensional sparse recommendation data.

Recommendation systems typically involve user-item interaction matrices that are extremely sparse — most users have interacted with only a tiny fraction of all items. Factorization Machines are designed for exactly this use case and handle sparse, high-dimensional data efficiently.

K-NN struggles with sparse high-dimensional spaces (the curse of dimensionality) and is not the correct choice for recommendation problems at scale.

2.2 Overfitting, Underfitting, and Regularization

Problem | Symptom | Root Cause | Solution
Overfitting | High training accuracy, low validation accuracy | Model too complex; memorizing noise | Regularization (L1/L2), dropout, k-fold CV, more data
Underfitting | Low accuracy on both training and validation | Model too simple; not learning patterns | More features, more epochs, less regularization, more complex model

L1 vs. L2 Regularization:

Regularization | Effect | Key Benefit
L1 (Lasso) | Drives some feature weights to exactly zero | Automatic feature selection; removes irrelevant features
L2 (Ridge) | Shrinks all feature weights toward zero but not to zero | Distributes weight reduction across all features

L1 regularization is the correct answer when the problem is both overfitting AND unnecessary features.

When a scenario states that the model is overfitting AND the training data contains unnecessary features that should be removed, L1 regularization is the correct choice. L1 performs implicit feature selection by zeroing out the weights of irrelevant features, addressing both problems simultaneously.

L2 reduces weights but does not eliminate features. Increasing training iterations worsens overfitting. Decreasing iterations may underfit. SageMaker Debugger can help profile training but does not apply regularization to a live model.
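
Why L1 zeroes weights while L2 only shrinks them can be illustrated with the closed-form updates that hold in a simplified setting. The sketch assumes orthonormal features, where the L1 solution is a soft-threshold of the unregularized weights and the L2 solution is a uniform rescaling. The weights and penalty strength are made-up numbers.

```python
def l1_shrink(weights, lam):
    """Soft-thresholding: pull each weight toward zero by lam, clipping at zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - lam, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)           # small weights are removed entirely
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


def l2_shrink(weights, lam):
    """Ridge-style shrinkage: scale every weight by the same factor."""
    return [w / (1.0 + lam) for w in weights]


weights = [2.0, -1.5, 0.3, -0.2]   # hypothetical unregularized weights
lam = 0.5                          # hypothetical penalty strength

l1_weights = l1_shrink(weights, lam)
l2_weights = l2_shrink(weights, lam)
```

With these numbers, L1 yields [1.5, -1.0, 0.0, 0.0], dropping the two weak features, while L2 keeps all four weights nonzero.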

2.3 Evaluation Metrics

Metric | Problem Type | When It Matters Most
Accuracy | Classification | Balanced class distributions only
Precision | Classification | Cost of false positives is high (e.g., spam detection)
Recall | Classification | Cost of false negatives is high (e.g., fraud, disease detection)
F1 Score | Classification | Imbalanced classes; need balance of precision and recall
AUC-ROC | Binary classification | Evaluating classifier performance across all thresholds
RMSE | Regression | Numeric prediction; penalizes large errors more
MAE | Regression | Numeric prediction; equal weight to all errors

Recall is the correct metric when the business cannot afford to miss positive cases.

Recall = (True Positives) / (True Positives + False Negatives). It measures how many of the actual positives the model correctly identified.

In fraud detection, the cost of missing a fraudulent transaction (false negative) is far higher than incorrectly flagging a legitimate one (false positive). Therefore, maximizing recall — catching as many fraudulent transactions as possible — is the correct objective. Optimizing for precision would minimize false alarms but allow more fraud to pass through undetected.

The same logic applies to medical diagnosis: missing a positive case (disease) is typically far more costly than over-investigating a healthy patient.
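
A short worked example makes the precision and recall trade-off concrete. The confusion-matrix counts below are invented for illustration and the function names are hypothetical.

```python
def precision(tp, fp):
    """Of everything flagged positive, how much was truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything truly positive, how much was caught."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical fraud model: 80 frauds caught, 40 false alarms, 20 frauds missed.
tp, fp, fn = 80, 40, 20

p = precision(tp, fp)   # 80 / 120
r = recall(tp, fn)      # 80 / 100
f1 = f1_score(p, r)
```

Here recall is 0.8 while precision is about 0.67: the model catches most fraud at the price of more false alarms, which is the trade-off a fraud team usually accepts.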

For continuous numeric predictions, RMSE is correct — not accuracy or AUC.

Accuracy, F1, AUC, and precision/recall are all classification metrics. They require a discrete class label prediction. When the model output is a continuous number (home price, sales forecast, patient measurement), these metrics are inapplicable.

RMSE (Root Mean Square Error) is the square root of the mean squared difference between predictions and actual values, so larger errors are penalized more heavily than small ones. For models that "make continuous numeric predictions," RMSE is the correct evaluation metric, not accuracy.
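
The way RMSE weights large errors can be seen by computing it next to MAE on the same predictions. The values below are invented, with one deliberately large miss.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices (in thousands); the last prediction misses by 60.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 300.0, 460.0]

rmse_value = rmse(actual, predicted)   # sqrt((100 + 100 + 0 + 3600) / 4)
mae_value = mae(actual, predicted)     # (10 + 10 + 0 + 60) / 4
```

MAE is 20 while RMSE is roughly 30.8, because the single 60-unit miss dominates the squared-error sum.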

2.4 Bias Detection with SageMaker Clarify

SageMaker Clarify provides pre-training and post-training bias metrics and model explainability reports.

Pre-training Bias Metrics:

Metric | Measures
Class Imbalance (CI) | Whether classes in the target variable are proportionally balanced
Difference in Proportions of Labels (DPL) | Whether the positive class rate differs across demographic groups
KL Divergence / JS Divergence | Statistical difference between label distributions of subgroups

DPL, not MSE or SSIM, is the correct metric to confirm class imbalance across demographic groups.

DPL (Difference in Proportions of Labels) is specifically designed as a pre-training bias metric. A DPL value significantly different from 0 indicates that the positive class is not evenly distributed across subgroups — confirming bias in the dataset.

MSE (Mean Square Error) is a regression error metric — it has nothing to do with bias detection.

SSIM (Structural Similarity Index Measure) is an image quality metric — also unrelated.

Silhouette score is a clustering quality metric — unrelated to bias.

When the question asks for a pre-training bias metric to confirm class imbalance, the answer is DPL.

2.5 Hyperparameter Tuning and Training Optimization

Key learning rate guidance:

  • If accuracy is not increasing and loss is decreasing very slowly with SGD, the learning rate is too low — increase it to escape the slow convergence plateau.
  • If loss is oscillating wildly and not converging, the learning rate is too high — decrease it.
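
Both symptoms can be reproduced with plain gradient descent on a toy loss. This sketch minimizes f(x) = x², whose gradient is 2x; the three learning rates are arbitrary values chosen to show each regime.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 and return the loss recorded after each step."""
    x = x0
    losses = []
    for _ in range(steps):
        x -= lr * 2 * x          # gradient of x**2 is 2x
        losses.append(x * x)
    return losses


too_low = gradient_descent(lr=0.001)   # loss barely moves
reasonable = gradient_descent(lr=0.3)  # loss converges quickly
too_high = gradient_descent(lr=1.1)    # x flips sign each step and grows
```

With the smallest rate the loss is still above 0.9 after 20 steps; with the largest, each update overshoots the minimum and the loss increases instead of converging.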

Distributed training with SageMaker:

Technique | When to Use
Distributed Data Parallel (DDP) | Large datasets that don't fit on one GPU; split data across instances
Model Parallel | Models too large to fit in one GPU's memory; split model layers
Increase number of instances | Primary lever to reduce wall-clock training time for DDP jobs

To reduce DDP training time, increase the number of instances — not epochs, neurons, or layers.

When a training job uses distributed data parallelism across instances and the goal is to decrease training time, the correct action is to increase the number of instances. More instances means each instance processes a smaller shard of the data, so the total time falls roughly in proportion, less the communication overhead of synchronizing gradients between instances.

Adding epochs increases training time. Adding neurons or layers increases model complexity and per-instance compute time but does not leverage distributed parallelism.

2.6 SageMaker Training Methods

Method | Purpose
estimator.fit() | Initiates a training job with a dataset
estimator.deploy() | Deploys a trained model to an endpoint
predictor.predict() | Calls inference on a deployed endpoint
estimator.create_model() | Creates a model object without deploying it

fit() starts training; deploy() starts hosting — they are sequential, not interchangeable.

The SageMaker SDK workflow is: (1) configure an Estimator, (2) call .fit() to train the model, (3) call .deploy() to host it, (4) call .predict() on the resulting predictor.

Calling .deploy() before .fit() will fail because there is no trained model artifact yet. Calling .predict() without a deployed endpoint will also fail. Questions that ask how to "initiate a training job" are asking for .fit().

2.7 Experiment Tracking and Model Management

Service | Purpose
SageMaker Experiments | Track and compare training runs: hyperparameters, metrics, artifacts
SageMaker Debugger / Profiler | Identify bottlenecks in training: GPU utilization, CPU usage, memory
SageMaker Model Registry | Version, stage, and deploy approved models
SageMaker Profiler annotations | Mark specific code sections to measure resource usage during training

SageMaker Experiments is the correct answer for logging training run characteristics — not CloudWatch, S3, or Model Registry.

When an ML engineer needs to track hundreds of training iterations with different hyperparameters, features, and algorithms and compare their results, the answer is SageMaker Experiments. It is purpose-built for this workflow and requires minimal implementation effort — no custom metrics, no S3 log parsing, and no extra infrastructure.

CloudWatch can surface metrics but is not designed for systematic experiment comparison. Model Registry manages approved models for deployment — it is not the right tool for logging in-progress experiment results.

SageMaker Profiler annotations are the right tool to pinpoint where in the training script resources are underutilized.

When GPU utilization is low and the engineer needs to find the specific bottleneck in the training script, the answer is to add SageMaker Profiler annotations to the code and generate a profiler report. This instruments the script at the function level and shows exactly where compute time is being spent.

CloudWatch provides coarser-grained metrics at the endpoint or instance level, and CloudTrail records API activity rather than resource utilization. Neither can pinpoint specific lines of Python training code.

Domain 3: Deployment and Orchestration of ML Workflows

3.1 SageMaker Inference Types

Choosing the right inference type is one of the highest-weighted topics in this domain. The decision depends on four factors: latency requirement, payload size, traffic pattern, and cost sensitivity.

Inference Type | Latency | Max Payload | Traffic Pattern | Idle Cost
Real-time | Milliseconds | 6 MB | Consistent, predictable | Always running
Serverless | Seconds (cold start) | 4 MB | Sporadic, unpredictable | Zero when idle
Asynchronous | Minutes (queued) | 1 GB | Large, intermittent | Auto-scales to zero
Batch Transform | Hours (scheduled) | Entire dataset | Bulk, offline | Runs only during job

Asynchronous inference is the correct answer for large payloads (>6 MB) with processing times up to 60 minutes.

The exam frequently presents scenarios with payload sizes of 100–300 MB and processing times measured in minutes. Asynchronous inference is the only SageMaker inference type designed for this use case. It queues requests, processes them, and stores results in S3 — the client polls for completion.

Real-time inference has a hard 6 MB payload limit. Serverless inference has a 4 MB limit and is designed for sporadic low-volume workloads, not large payloads. Batch transform is for processing entire datasets at once, not individual requests.

Serverless inference is the most cost-effective option for intermittent CPU-based workloads with periods of zero traffic.

When traffic is intermittent (e.g., only during business hours) and the model runs on CPU, Serverless Inference bills only for the compute used to process requests, so there is no cost during idle periods. It is superior to a real-time endpoint that bills by the hour even when idle.

Asynchronous inference can also scale to zero, but it is designed for large payloads and long processing times — not for low-latency interactive predictions.
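
The selection logic in this section can be condensed into a small decision helper. This is a study aid rather than an AWS API: the function and parameter names are hypothetical, the thresholds are the payload limits quoted in the table above, and real decisions involve more factors (GPU needs, cold-start tolerance, cost targets).

```python
MB = 1024 * 1024


def choose_inference_type(payload_bytes, needs_immediate_response,
                          traffic_is_sporadic, whole_dataset=False):
    """Pick an inference type from payload size, latency need, and traffic shape."""
    if whole_dataset:
        return "batch transform"       # offline scoring of an entire dataset
    if payload_bytes > 6 * MB or not needs_immediate_response:
        return "asynchronous"          # large payloads or long processing times
    if traffic_is_sporadic and payload_bytes <= 4 * MB:
        return "serverless"            # no charge while idle
    return "real-time"                 # steady traffic, low latency


large_request = choose_inference_type(200 * MB, False, True)
business_hours = choose_inference_type(1 * MB, True, True)
steady_api = choose_inference_type(1 * MB, True, False)
nightly_scoring = choose_inference_type(0, False, False, whole_dataset=True)
```

The ordering of the checks mirrors how exam scenarios are usually eliminated: dataset-wide jobs first, then payload size, then traffic pattern.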

3.2 Endpoint Strategies for Multiple Models

Strategy | Use Case | Cost Model
Multi-model endpoint (MME) | Many models sharing one container (same framework) | Pay for one endpoint; models loaded on-demand
Multi-container endpoint (MCE) | A few models with different frameworks; sequential or direct invocation | Pay for one endpoint; containers always loaded
Single-model endpoint | One model per endpoint | One endpoint per model
Inference pipeline | Sequential preprocessing → model → postprocessing steps | Chained containers in one endpoint

Multi-model endpoints and multi-container endpoints solve different problems.

A multi-model endpoint hosts many models (potentially hundreds) that share the same framework and container. Models are dynamically loaded and unloaded into memory based on traffic. This is the most cost-effective option when you have many models built on the same framework.

A multi-container endpoint hosts a small number of models with different frameworks in separate containers within one endpoint. Use this when models cannot share a container because they need different runtimes, but you still want to consolidate onto one endpoint.

When the question says "multiple models, same framework," MME is the answer. When it says "different frameworks," MCE is the answer.

3.3 Auto Scaling for SageMaker Endpoints

High-resolution metrics + appropriate cooldown periods are required for rapid scaling response.

For endpoints that must scale quickly in response to sudden traffic changes:

  • Use high-resolution CloudWatch metrics (10-second intervals) rather than standard 60-second intervals — this gives the scaling policy faster feedback.
  • Use the InvocationsPerInstance metric (the predefined SageMakerVariantInvocationsPerInstance metric type) as the target tracking metric.
  • Set a longer scale-in cooldown (e.g., 600 seconds) to prevent premature scaling down after a traffic spike subsides.

A shorter scale-in cooldown combined with standard metrics results in unstable scaling behavior — the endpoint may scale in too aggressively before traffic has truly subsided.
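
A target tracking policy with a long scale-in cooldown can be sketched as the argument dictionary for the Application Auto Scaling PutScalingPolicy API. The sketch builds the request only and does not call AWS; the endpoint name, variant name, target value, and 60-second scale-out cooldown are hypothetical choices, and the high-resolution metric option is not shown.

```python
def target_tracking_policy(endpoint_name, variant_name, target_invocations):
    """Build PutScalingPolicy arguments for a SageMaker endpoint variant."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,    # react quickly to a spike (assumed value)
            "ScaleInCooldown": 600,    # wait before removing instances
        },
    }


policy = target_tracking_policy("churn-endpoint", "AllTraffic", 100)
```

With credentials configured, this could be passed as boto3.client("application-autoscaling").put_scaling_policy(**policy), after the variant has been registered as a scalable target.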

When endpoints scale to zero instances overnight, apply a scheduled scale-out policy to pre-warm before business hours.

Auto scaling with a target tracking policy can scale the minimum instance count to zero during idle periods, which reduces cost. However, when traffic arrives at the start of business hours, there are zero instances available to handle it — causing delays while new instances spin up.

The solution is to apply a scheduled scaling action in Application Auto Scaling that raises the minimum instance count from 0 to 1 (or more) before business hours begin, ensuring the endpoint is ready when traffic arrives. A CloudWatch alarm with a step scaling policy is reactive: it fires only after requests have already started to arrive.
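
A pre-warming schedule of this kind can be sketched as the arguments for the Application Auto Scaling PutScheduledAction API. As before, the code only builds the request; the endpoint name, the 07:30 UTC weekday cron expression, and the capacity values are hypothetical.

```python
def prewarm_action(endpoint_name, variant_name, cron_fields,
                   min_capacity, max_capacity):
    """Build PutScheduledAction arguments that raise capacity on a schedule."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": f"{endpoint_name}-prewarm",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": f"cron({cron_fields})",
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,   # lift the floor above zero
            "MaxCapacity": max_capacity,
        },
    }


# Hypothetical schedule: 07:30 UTC, Monday to Friday.
action = prewarm_action("churn-endpoint", "AllTraffic",
                        "30 7 ? * MON-FRI *", min_capacity=1, max_capacity=4)
```

A matching evening action that sets MinCapacity back to 0 would restore the overnight cost saving.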

3.4 Deployment Strategies

Strategy | Purpose | When to Use
Blue/Green deployment | Zero-downtime cutover; full traffic shift | When you want a clean switch with rollback capability
Shadow variant | Test a new model on live traffic without affecting user responses | When comparing new model performance to current model before promoting
Canary deployment | Gradually shift a percentage of traffic to new model | When you want a controlled incremental rollout
A/B test (production variant) | Split traffic between model versions with metrics | When comparing multiple model variants in production

Shadow variants are the least-effort path to comparing a new model against a production model.

A shadow variant receives a copy of live production traffic and processes it, but its responses are not returned to users. This allows you to measure the new model's performance on real traffic with zero risk to the user experience.

To deploy a shadow variant: deploy the new model as a shadow variant on the same endpoint as the current model. Route a configured percentage of live traffic to it. Evaluate its outputs against the production model's outputs.

This requires less operational effort than deploying to a separate endpoint, managing DNS routing, or writing custom Lambda routing logic.

3.5 Model Registry and Versioning

Service | Purpose
SageMaker Model Registry | Version control and lifecycle management for trained models; supports approval workflows
SageMaker Inference Recommender | Benchmarks model performance across EC2 instance types; recommends best-fit instance
Amazon ECR | Container image registry; stores Docker images but does not manage ML model versions

SageMaker Inference Recommender, not Autopilot or Compute Optimizer, provides instance type recommendations for hosting.

When the requirement is to determine the best EC2 instance type for hosting an ML model on SageMaker, the answer is SageMaker Inference Recommender. It runs load tests against multiple instance configurations and ranks them by performance and cost.

SageMaker Autopilot is an AutoML service for automatically building and training models — it does not recommend hosting instance types.

AWS Compute Optimizer makes EC2 sizing recommendations for general workloads based on CloudWatch metrics — not for SageMaker model hosting.

3.6 Amazon Bedrock Deployment

Concept | Detail
Custom model import | Import Hugging Face or other models into Bedrock for API access
S3 URI requirement | Model files must be in S3 in the same AWS account as the Bedrock import job
VPC endpoint (PrivateLink) | Connect EC2 in private subnets to Bedrock without traversing the internet
Bedrock Knowledge Bases | Fully managed RAG; supports S3 data sources; default vector store is OpenSearch Serverless

PrivateLink (interface VPC endpoint) is the correct solution for private subnet EC2 to Bedrock connectivity.

When EC2 instances are in a private subnet and must remain private while calling the Amazon Bedrock API, the answer is AWS PrivateLink — specifically an interface VPC endpoint for Bedrock. This routes all API traffic through the AWS private network without requiring internet access.

A NAT gateway would route traffic to the internet (against the requirement). AWS Direct Connect connects on-premises networks, not VPC-to-service traffic. Modifying Bedrock to use a private subnet is not possible — Bedrock is a managed service that the user cannot configure at the network level.
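
A minimal sketch of what the interface endpoint request could look like, assuming the EC2 `create_vpc_endpoint` API and the `bedrock-runtime` endpoint service (the one used for model invocation). The VPC, subnet, and security group IDs are made-up placeholders.

```python
def bedrock_interface_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for ec2.create_vpc_endpoint (interface type)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.bedrock-runtime",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Private DNS lets the default Bedrock hostname resolve to the endpoint.
        "PrivateDnsEnabled": True,
    }


# Hypothetical identifiers for illustration only.
params = bedrock_interface_endpoint_params(
    region="us-east-1",
    vpc_id="vpc-0example",
    subnet_ids=["subnet-0private"],
    security_group_ids=["sg-0example"],
)

# With credentials configured:
# import boto3
# boto3.client("ec2", region_name="us-east-1").create_vpc_endpoint(**params)
```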

RAG via Bedrock Knowledge Bases is the least-operational-overhead path — not fine-tuning, not Neptune.

When the requirement is to supplement an LLM with documents from S3 using RAG, the answer is to create a Knowledge Base for Amazon Bedrock and configure the S3 bucket as a data source. Bedrock handles chunking, embedding, vector storage (OpenSearch Serverless by default), and retrieval automatically.

Fine-tuning (AutoML or SageMaker Pipelines) updates model weights — it does not inject external documents at query time. Amazon Neptune is a graph database — not a managed vector store for RAG.
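
Once a Knowledge Base exists, querying it can be sketched as below. The example builds the request for the `retrieve_and_generate` call on the `bedrock-agent-runtime` client; the knowledge base ID and model ARN are placeholders, not real resources.

```python
def build_rag_request(question, knowledge_base_id, model_arn):
    """Build keyword arguments for bedrock-agent-runtime retrieve_and_generate."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


# Hypothetical identifiers for illustration only.
rag_request = build_rag_request(
    question="What is the refund policy?",
    knowledge_base_id="EXAMPLEKBID",
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/example-model-id",
)

# With credentials configured:
# import boto3
# response = boto3.client("bedrock-agent-runtime").retrieve_and_generate(**rag_request)
# print(response["output"]["text"])
```

Note that no model weights are touched: the documents are retrieved at query time, which is the distinction from fine-tuning drawn above.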

3.7 Networking and Privacy

Requirement | Solution
Network isolation from internet during training | Private subnet + S3 gateway VPC endpoint
Private IP addresses only | Private subnet + VPC endpoint (no NAT gateway)
Opt out of SageMaker metadata collection | Set OPT_OUT_TRACKING environment variable or opt-out parameter per training job
Cross-account S3 access | VPC peering or S3 bucket policies with cross-account permissions

Network isolation requires a private subnet AND an S3 VPC gateway endpoint — not a NAT gateway.

To ensure SageMaker training jobs have no internet connectivity while still accessing S3:

  1. Run training jobs in private subnets (no route to internet gateway).
  2. Create an S3 gateway VPC endpoint so traffic to S3 stays on the AWS private network.

A NAT gateway routes traffic to the internet — this breaks the network isolation requirement. A public subnet with security group inbound rules still exposes the training job to the internet on the outbound path.
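
One way to sanity-check the isolation requirement is to inspect a subnet's route table for default routes that lead to an internet gateway or NAT gateway. The sketch below assumes route dictionaries shaped like the `Routes` entries returned by EC2 `describe_route_tables`; the sample routes and IDs are invented, and only IPv4 default routes are checked.

```python
def internet_routes(routes):
    """Return the routes that give a subnet a path to the internet (IPv4 only)."""
    flagged = []
    for route in routes:
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        via_igw = route.get("GatewayId", "").startswith("igw-")
        via_nat = "NatGatewayId" in route
        if is_default and (via_igw or via_nat):
            flagged.append(route)
    return flagged


# Hypothetical route tables for illustration.
isolated_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    # An S3 gateway endpoint appears as a prefix-list route to a vpce- target.
    {"DestinationPrefixListId": "pl-0example", "GatewayId": "vpce-0example"},
]
routes_with_nat = isolated_routes + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example"},
]

print(internet_routes(isolated_routes))       # nothing flagged
print(len(internet_routes(routes_with_nat)))  # the NAT default route is flagged
```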

3.8 Custom Containers and Infrastructure

Approach | When to Use | Provisioning Method
BYOC (Bring Your Own Container) | Custom libraries, frameworks not in SageMaker containers | Docker image in ECR
SageMaker pre-built containers | Standard frameworks (PyTorch, TensorFlow, sklearn) | Built-in; no Docker work needed
EKS with CDK | Kubernetes; managed control plane; Python provisioning | AWS CDK (Python)
EKS with CloudFormation | Kubernetes; managed control plane; declarative templates | CloudFormation (YAML/JSON)

AWS CDK is the correct answer when the infrastructure must be provisioned using Python specifically.

When requirements specify (1) no managing the Kubernetes control plane, (2) using Kubernetes, and (3) Python for provisioning, the answer is AWS CDK to provision an Amazon EKS cluster. CDK supports Python natively and handles the Kubernetes control plane automatically via EKS (managed).

AWS CloudFormation can also provision EKS but uses YAML/JSON templates, not Python. The AWS CLI is not a repeatable provisioning method — it is imperative, not declarative. EC2-based self-managed Kubernetes requires managing the control plane.

Domain 4: ML Solution Monitoring, Maintenance, and Security

4.1 SageMaker Model Monitor

Model Monitor continuously evaluates deployed models against a baseline and raises alerts when violations are detected.

Monitor Type | What It Detects | Baseline Required
Data Quality Monitor | Shifts in input feature distributions (data drift) | Statistical baseline from training data
Model Quality Monitor | Degradation in prediction accuracy/performance | Ground truth labels + prediction logs
Bias Drift Monitor | Changes in model fairness metrics over time | Clarify bias baseline
Feature Attribution Drift | Changes in which features drive predictions | Clarify SHAP baseline

Correct sequence for setting up Model Monitor:

  1. Create a baseline from training/validation data (generates constraints and statistics)
  2. Create a monitoring job that compares live inference data against the baseline

The baseline must be created BEFORE the monitoring job — not after.

A monitoring job in SageMaker Model Monitor compares live inference data against a pre-established baseline. If the monitoring job is configured before the baseline exists, it has nothing to compare against and will not produce meaningful alerts.

The correct order is always: (1) establish the baseline from training data, (2) configure the monitoring schedule.
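
The ordering can be illustrated with a toy stand-in. This is not Model Monitor's actual algorithm or API: it is a simplified mean-shift check on a single feature, written only to show why the comparison step has nothing to work with until a baseline exists. The function names and the 3-standard-deviation threshold are assumptions made for this sketch.

```python
import statistics


def create_baseline(training_values):
    """Step 1: summarize the training data for one feature."""
    return {
        "mean": statistics.mean(training_values),
        "stdev": statistics.stdev(training_values),
    }


def check_for_drift(baseline, live_values, max_stdevs=3.0):
    """Step 2: compare live data against the baseline created in step 1."""
    if baseline is None:
        raise ValueError("Create the baseline before running the monitoring check")
    shift = abs(statistics.mean(live_values) - baseline["mean"])
    return shift > max_stdevs * baseline["stdev"]


baseline = create_baseline([10.0, 11.0, 9.0, 10.5, 9.5])
print(check_for_drift(baseline, [10.2, 9.8, 10.1]))   # live data close to training data
print(check_for_drift(baseline, [15.0, 16.0, 14.0]))  # live data far from training data
```

Calling `check_for_drift(None, ...)` raises an error, which mirrors the point that a monitoring job without a baseline cannot produce meaningful alerts.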

Data drift requires a Data Quality Monitor with a data quality baseline — not a model quality baseline.

When the requirement is to "detect changes in the input data distribution of model features" (i.e., data drift), the answer is to configure a Data Quality Monitor and establish a data quality baseline. The baseline captures the statistical distribution of features at training time, and the monitor alerts when live data deviates.

A model quality baseline monitors output accuracy — it requires ground truth labels to compare predictions against. It does not detect input distribution shifts.

4.2 Automated Retraining Pipelines

Model Monitor + Lambda is the standard pattern for detecting drift and automatically triggering retraining.

When a pipeline must automatically initiate retraining upon detecting data drift:

  1. SageMaker Model Monitor detects the drift and emits constraint-violation metrics to CloudWatch, where a CloudWatch alarm watches them.
  2. An AWS Lambda function (triggered when the alarm fires) initiates a new SageMaker training job or kicks off a SageMaker Pipeline.

AWS Glue, Apache Flink, and QuickSight are not designed for model drift detection. Step Functions can orchestrate retraining but adds complexity — Lambda is the simplest trigger mechanism.
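
A minimal sketch of the Lambda side of this pattern follows. It assumes the alarm state change reaches the function through an EventBridge rule (so the event carries `detail.state.value`), and that a SageMaker Pipeline named `retraining-pipeline` exists; both the event routing and the pipeline name are assumptions for illustration. The client is passed in as a parameter so the logic can be exercised without AWS access.

```python
def start_retraining(sagemaker_client, event, pipeline_name):
    """Start a SageMaker Pipeline execution when the alarm enters ALARM state."""
    detail = event.get("detail", {})
    if detail.get("state", {}).get("value") != "ALARM":
        return None  # ignore OK and INSUFFICIENT_DATA transitions
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDescription=f"Triggered by alarm {detail.get('alarmName')}",
    )
    return response["PipelineExecutionArn"]


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    # "retraining-pipeline" is a hypothetical pipeline name.
    return start_retraining(boto3.client("sagemaker"), event, "retraining-pipeline")
```

Keeping the decision logic in `start_retraining` and the client creation in `lambda_handler` makes the trigger easy to unit test with a stub client.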

After retraining on new data, the Model Monitor baseline must also be updated.

When a model is retrained with new data, the input distribution the model was trained on changes. Using the original baseline from the old training data will cause Model Monitor to flag legitimate input patterns as violations.

The sequence is: (1) retrain with new data, (2) reestablish the baseline from the new training data, (3) update the monitoring job to use the new baseline. Reusing the original baseline with a newly trained model produces false alerts and undermines the monitoring system.

4.3 Identifying the Root Cause of Performance Degradation

Symptom | Root Cause | Solution
Model accuracy degrades gradually over time | Data drift: production data distribution shifted | Retrain on recent data; update monitoring baseline
Model accuracy degrades suddenly after a deployment | Bug in new code or data pipeline | Roll back deployment; debug pipeline
High training accuracy, low validation accuracy | Overfitting | Regularization, more data, cross-validation
Low accuracy on both training and validation | Underfitting | More complex model, more features, more training
Accuracy suddenly drops after months of stability | Concept drift or data pipeline failure | Investigate data freshness; check upstream pipelines

Data drift in production — not overfitting — is the correct cause when a well-performing model degrades over time.

A model that has been deployed and performing well for months with metrics above thresholds is, by definition, not overfitting at deployment time. Overfitting is a training-time phenomenon.

When performance degrades after a period of stable production operation, the cause is almost always drift in the production data distribution — the real-world inputs have changed in ways that were not represented in the original training data. The fix is retraining on current data, not debugging the original model.

4.4 Content Moderation and Real-Time Video Analysis

Amazon Rekognition + Lambda is the least-overhead solution for real-time video frame content moderation.

Amazon Rekognition includes built-in content moderation capabilities: it can detect inappropriate content categories (explicit nudity, violence, etc.) in images and video frames with a single API call.

Using a custom SageMaker model requires training, hosting, and maintaining a computer vision model — significant operational overhead. Using Transcribe + Comprehend analyzes audio/text, not visual content. Rekognition with Lambda extracts and analyzes image frames in a serverless, managed fashion.
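
A minimal sketch of the per-frame check, assuming frames have already been extracted as JPEG or PNG bytes (frame extraction is outside this example). It uses the Rekognition `detect_moderation_labels` call; the 80 percent confidence threshold is an arbitrary choice for illustration.

```python
def moderation_labels_for_frame(rekognition_client, frame_bytes, min_confidence=80.0):
    """Return the names of moderation labels Rekognition finds in one frame."""
    response = rekognition_client.detect_moderation_labels(
        Image={"Bytes": frame_bytes},
        MinConfidence=min_confidence,
    )
    return [label["Name"] for label in response["ModerationLabels"]]


def lambda_handler(event, context):
    import base64
    import boto3  # provided by the Lambda Python runtime

    # Assumes the caller sends the frame as base64 text under a "frame" key
    # (a hypothetical event shape chosen for this sketch).
    frame = base64.b64decode(event["frame"])
    labels = moderation_labels_for_frame(boto3.client("rekognition"), frame)
    return {"flagged": bool(labels), "labels": labels}
```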

4.5 Security and Access Control

Requirement | Solution
Block traffic from a specific IP | Network ACL inbound rule (deny rule); security groups only support allow rules
Prevent misuse of presigned URLs from outside a VPN | IAM condition aws:sourceVpc validation
Restrict presigned URLs to specific IP range | IAM condition aws:sourceIp validation
SageMaker training jobs with KMS-encrypted S3 | IAM execution role must have kms:Encrypt and kms:Decrypt permissions
Opt out of SageMaker metadata collection | OPT_OUT_TRACKING env variable or per-job opt-out parameter

Security groups cannot deny traffic — only Network ACLs can create deny rules.

AWS security groups are stateful and support allow rules only. You cannot create a deny rule in a security group. To block traffic from a specific IP address, you must create a Network ACL (NACL) deny rule for the specific IP and apply it to the subnet.

VPC route tables do not support deny rules either — they determine routing paths, not access control decisions. Creating a shadow variant to redirect traffic is not a security mechanism.
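
The deny rule can be sketched as the parameters for the EC2 `create_network_acl_entry` call. The ACL ID, IP address, and rule number are placeholders; the one detail worth noticing is that NACL rules are evaluated in ascending rule-number order, so the deny entry needs a lower number than any allow rule that would otherwise match.

```python
import ipaddress


def deny_ip_entry(network_acl_id, ip_address, rule_number=90):
    """Build keyword arguments for an inbound NACL rule that denies one IPv4 address."""
    # IPv4Address raises a ValueError subclass if the address is malformed.
    cidr = f"{ipaddress.IPv4Address(ip_address)}/32"
    return {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,  # must be lower than the matching allow rule
        "Protocol": "-1",           # all protocols
        "RuleAction": "deny",
        "Egress": False,            # inbound rule
        "CidrBlock": cidr,
    }


# Hypothetical identifiers for illustration only.
entry = deny_ip_entry("acl-0example", "203.0.113.7")

# With credentials configured:
# import boto3
# boto3.client("ec2").create_network_acl_entry(**entry)
```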

KMS-encrypted S3 access failures after switching from SSE-S3 to SSE-KMS require adding KMS permissions to the execution role.

When S3 buckets use SSE-S3, SageMaker training jobs can access them with standard S3 permissions. When you switch to SSE-KMS, the SageMaker execution role must also have permissions on the specific KMS key. The exam answer is kms:Encrypt and kms:Decrypt; in practice S3 also calls kms:GenerateDataKey when writing encrypted objects, so include it as well. Without these permissions, S3 reads and writes of the encrypted objects fail with AccessDenied.

Adding s3:ListBucket does not resolve KMS access errors. Updating the aws:SecureTransport bucket policy condition enforces HTTPS but does not grant KMS permissions.
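
A sketch of the identity policy that could be attached to the execution role, using the KMS actions named above. The key ARN is a placeholder, and scoping `Resource` to a single key (rather than `*`) is a choice made for this example.

```python
import json


def kms_access_policy(key_arn):
    """Build an IAM policy document granting S3 SSE-KMS related key usage."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": key_arn,
            }
        ],
    }


# Hypothetical key ARN for illustration only.
policy = kms_access_policy(
    "arn:aws:kms:us-east-1:111122223333:key/00000000-0000-0000-0000-000000000000"
)
print(json.dumps(policy, indent=2))
```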

aws:sourceVpc validates the VPC origin of a request; aws:sourceIp validates the IP address.

When the company shares SageMaker Studio notebooks over a VPN and needs to ensure presigned URLs can only be used from within that VPN (the VPC), the correct IAM policy condition is aws:sourceVpc. This ensures that even if a presigned URL is leaked externally, it cannot be used from outside the authorized VPC.

aws:sourceIp restricts access to specific IP ranges — appropriate when the restriction is IP-based rather than VPC-based. aws:PrincipalTag is for attribute-based access control tied to IAM principals, not network origin.
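
The difference between the two condition keys can be sketched as two variants of one policy statement. This assumes the restriction is applied to the `sagemaker:CreatePresignedDomainUrl` action with `Resource` set to `*`; both are simplifications for illustration, and the VPC ID and CIDR range are placeholders.

```python
def presigned_url_policy(vpc_id=None, ip_range=None):
    """Allow presigned Studio URL creation only from one VPC or one IP range."""
    if (vpc_id is None) == (ip_range is None):
        raise ValueError("Provide exactly one of vpc_id or ip_range")
    if vpc_id is not None:
        # VPC-based restriction: the request must originate from this VPC.
        condition = {"StringEquals": {"aws:sourceVpc": vpc_id}}
    else:
        # IP-based restriction: the request must come from this CIDR range.
        condition = {"IpAddress": {"aws:sourceIp": ip_range}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": "*",
                "Condition": condition,
            }
        ],
    }


# Hypothetical values for illustration only.
vpc_policy = presigned_url_policy(vpc_id="vpc-0example")
ip_policy = presigned_url_policy(ip_range="203.0.113.0/24")
```

Note the operator changes along with the key: a VPC ID is compared as a string, while an IP range uses the `IpAddress` operator so that CIDR matching applies.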

4.6 Monitoring Tools Reference

Service | What It Monitors
SageMaker Model Monitor | Data quality, model quality, bias drift, feature attribution drift
SageMaker Clarify | Pre/post-training bias metrics; SHAP-based feature explanations
SageMaker Debugger / Profiler | Training job resource utilization (GPU, CPU, memory); training anomalies
Amazon CloudWatch | Infrastructure metrics (CPU, memory, invocations); alarm-based notifications
AWS CloudTrail | API call audit logs; who did what and when
Amazon CloudWatch Logs Insights | Query and analyze log data for error patterns

Model Monitor captures bias metrics; CloudWatch visualizes them — use both together for dashboards.

SageMaker Clarify bias metrics generated by a monitoring job are emitted as Amazon CloudWatch metrics. To display these on a dashboard, the ML engineer captures the CloudWatch metrics from SageMaker Clarify/Model Monitor and builds a CloudWatch dashboard.

CloudTrail captures API activity logs — it does not capture bias metrics. EventBridge can trigger actions based on events but is not the metrics collection path. SNS is for sending notifications, not for capturing or displaying metrics.

Quick-Reference: Key Service Decision Trees

Which data transformation tool?

  • No-code, visual interface → AWS Glue DataBrew (recipe job)
  • Serverless ETL with Python/PySpark → AWS Glue ETL job
  • Interactive data prep in SageMaker → SageMaker Data Wrangler
  • Large-scale Spark → Amazon EMR
  • SQL over S3 → Amazon Athena

Which inference type?

  • Millisecond latency, small payload → Real-time
  • Sporadic traffic, no idle cost, CPU → Serverless
  • Large payload (>6 MB), long processing, notify on complete → Asynchronous
  • Bulk dataset, once daily → Batch Transform
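
The inference decision tree above can be written as a small function, which makes the order of the checks explicit. The parameter names and the order of precedence are assumptions of this sketch; the 6 MB threshold comes from the list above.

```python
def choose_inference_type(payload_mb, sporadic_traffic=False, bulk_dataset=False):
    """Map workload traits to a SageMaker inference option (study-guide heuristic)."""
    if bulk_dataset:
        return "Batch Transform"   # whole dataset, scheduled, no endpoint needed
    if payload_mb > 6:
        return "Asynchronous"      # large payloads or long processing
    if sporadic_traffic:
        return "Serverless"        # no idle cost when traffic stops
    return "Real-time"             # small payloads, millisecond latency


print(choose_inference_type(payload_mb=200))                        # Asynchronous
print(choose_inference_type(payload_mb=1, sporadic_traffic=True))   # Serverless
print(choose_inference_type(payload_mb=0.5))                        # Real-time
print(choose_inference_type(payload_mb=50, bulk_dataset=True))      # Batch Transform
```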

Which encoding?

  • Nominal category, few values, no ordering → One-hot
  • Ordinal category with ranking → Label/ordinal
  • Binary yes/no, no dimensionality increase → Label (binary)
  • High-cardinality nominal, minimize columns → Binary encoding
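
To make the column-count trade-off concrete, the sketch below encodes one value three ways in plain Python. The category lists are invented, and the binary encoder numbers categories from zero, which is one of several conventions (some libraries start at one).

```python
def one_hot(value, categories):
    """One column per category; exactly one column is 1."""
    return [1 if value == category else 0 for category in categories]


def ordinal(value, ordered_categories):
    """A single integer column that preserves the ranking."""
    return ordered_categories.index(value)


def binary_encode(value, categories):
    """Write the category index in base 2, using about log2(n) columns."""
    width = max(1, (len(categories) - 1).bit_length())
    index = categories.index(value)
    return [int(bit) for bit in format(index, f"0{width}b")]


colors = ["red", "green", "blue", "yellow", "purple"]  # nominal, no ordering
sizes = ["low", "medium", "high"]                      # ordinal, ranked

print(one_hot("blue", colors))        # [0, 0, 1, 0, 0] -> 5 columns
print(binary_encode("blue", colors))  # [0, 1, 0]       -> 3 columns
print(ordinal("high", sizes))         # 2               -> 1 column
```

With five categories the saving is small, but for a feature with 1,000 distinct values one-hot needs 1,000 columns while this binary scheme needs 10.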

Which anomaly detection approach?

  • Unlabeled data → Random Cut Forest (RCF)
  • Labeled data, classification → XGBoost or Linear Learner

Which monitoring tool?

  • Input data drift → Model Monitor (Data Quality)
  • Prediction accuracy over time → Model Monitor (Model Quality)
  • Bias in live predictions → Model Monitor (Bias Drift) + Clarify
  • Training bottleneck → SageMaker Debugger / Profiler
  • Security audit → AWS CloudTrail

Common Exam Traps: Summary

Trap (scenario → tempting wrong answer) | Correct Answer
"Aggregate from S3 + on-prem MySQL" → EMR | AWS Lake Formation
"Data always up to date in Data Wrangler" → cataloged | Direct connection
"Large payload 100–300 MB, 60 min processing" → batch transform | Asynchronous inference
"Sporadic, no traffic overnight, CPU model" → real-time | Serverless inference
"Detect irregularities, unlabeled" → k-means | Random Cut Forest (RCF)
"Recommendation, high-dimensional sparse" → k-NN | Factorization Machines
"Catch all fraud, minimize misses" → precision | Recall
"Continuous numeric predictions" → accuracy | RMSE
"Overfitting + unnecessary features" → L2 | L1 regularization
"Block specific IP" → security group deny | Network ACL deny rule
"KMS AccessDenied after SSE-KMS switch" → add S3 permissions | Add kms:Encrypt / kms:Decrypt to execution role
"Model Monitor setup order" → schedule first, then baseline | Baseline first, then monitoring job
"Retrain on new data" → reuse old baseline | Establish new baseline after retraining
"Compare new model to production, least effort" → separate endpoint | Shadow variant on same endpoint
"No-code transformation" → Glue ETL job | Glue DataBrew (recipe job)
"Label employees-only data" → Mechanical Turk | Private workforce in Ground Truth
"RAG from S3 docs, least overhead" → fine-tuning | Bedrock Knowledge Bases
"Private subnet EC2 → Bedrock" → NAT gateway | AWS PrivateLink (interface VPC endpoint)
"Python provisioning of EKS" → CloudFormation | AWS CDK

Last updated: Monday, 01 June 2026
