MLA-C01 AWS Machine Learning Engineer

To get ready for the MLA-C01 exam, you should have hands-on experience with every part of the ML lifecycle on AWS. This means collecting and transforming data, training and evaluating models, and then deploying, monitoring, and securing them. The biggest challenge is not just learning the services, but knowing which one to use, when to use it, and why. This can be tricky when several services seem like good choices.

This guide uses real exam scenarios. It covers the four main domains with clear summaries, service comparisons, practical tips for making decisions, and warnings about common mistakes to avoid on the test.

Exam Domain Overview

Domain	Approx. Weight
Data Preparation for Machine Learning	~28%
ML Model Development	~26%
Deployment and Orchestration of ML Workflows	~22%
ML Solution Monitoring, Maintenance, and Security	~24%

Domain 1: Data Preparation for Machine Learning

1.1 Data Ingestion and Aggregation

Before any ML model can be trained, data must be collected from disparate sources and unified into a consistent format. The exam frequently presents multi-source scenarios and tests whether you can identify the right aggregation tool.

Tool	Best For	Key Characteristic
AWS Lake Formation	Centralized data lake from S3, RDS, on-premises databases	Governed ingestion + access control; ideal for heterogeneous sources
Amazon EMR (Spark)	Large-scale distributed data processing	Heavy compute; requires cluster management
AWS Glue	ETL jobs, schema discovery, Data Catalog	Serverless; best for structured data transformation
Amazon Kinesis Data Streams	Real-time streaming data ingestion	Low latency; not for batch aggregation from static sources
Amazon DynamoDB	NoSQL key-value store	A database, not a data aggregation service

Lake Formation is the aggregation answer; Kinesis is the streaming answer - do not swap them.

When a scenario describes pulling data from Amazon S3, on-premises MySQL databases, and other static sources into a unified data lake, the answer is AWS Lake Formation. It is purpose-built to govern, catalog, and aggregate data from multiple heterogeneous sources.

Amazon Kinesis Data Streams is for real-time event streaming — not for batch ingestion from relational databases or S3. Selecting Kinesis when the data is static is a frequent trap.

Amazon DynamoDB is a database and stores data but does not perform aggregation from multiple source systems.

1.2 Data Transformation and ETL

Once data is ingested, it must be transformed into a model-ready format. The exam tests whether you can choose the right ETL tool for the job's complexity and operational overhead.

Tool	Use Case	Code Required?
AWS Glue (ETL jobs)	Schema discovery, format conversion, Data Catalog updates	PySpark / Python
AWS Glue DataBrew	Visual, no-code data cleaning and transformation	No (recipe-based)
AWS Glue Relationalize	Flatten nested/JSON data into tabular format	PySpark transform
SageMaker Data Wrangler	Interactive data prep inside SageMaker Studio	No (UI-based)
Amazon Athena	SQL queries over S3 data	SQL
Amazon EMR	Large-scale Spark/Hadoop jobs	Spark/Scala/Python

Format conversions that appear on the exam:

CSV → Apache Parquet: The standard optimization for query performance; use an AWS Glue ETL job. Parquet is columnar and dramatically faster for analytical queries and ML feature generation.
JSON (nested) → tabular: Use the AWS Glue Relationalize transform — it flattens nested JSON automatically with minimal code.
DeepAR time series format: The expected exam answer is JSON Lines (.jsonl) with gzip compression. DeepAR can also read Parquet, but it does not accept RecordIO.
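
The nested-JSON flattening mentioned in the list above can be illustrated in plain Python. This is only a sketch of the idea, not the Glue Relationalize transform itself, which runs on DynamicFrames inside a Glue job and also handles arrays. The helper name `flatten_record` and the sample record are hypothetical.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys into dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened columns.
            flat.update(flatten_record(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical nested record, for illustration only.
nested = {"id": 7, "customer": {"name": "Ana", "address": {"city": "Lisbon"}}}
flat = flatten_record(nested)
```

Each nested field becomes its own column, which is the tabular shape most ML tooling expects.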

No-code vs. low-code tools are not interchangeable in exam answers.

AWS Glue DataBrew is a fully visual, no-code tool. You build "recipes" through a point-and-click interface — no scripting. It is the answer whenever the requirement specifies "no-code transformation."

AWS Glue ETL jobs require writing PySpark or Python code. They are more flexible but carry more implementation effort.

SageMaker Data Wrangler is the no-code/low-code option within SageMaker Studio specifically. It is the right answer when the question is scoped to the SageMaker ecosystem and asks for data preparation with the least effort.

Selecting a code-based tool when the requirement specifies "least development effort" or "no-code" will cost you the point.

The correct compression format for DeepAR is gzip, not Snappy or XZ.

Amazon SageMaker DeepAR expects JSON Lines (.jsonl) format for time series data, compressed with gzip. Snappy is common with Parquet, but it is not the correct answer for DeepAR. XZ offers high compression ratios but is not supported. Mixing format and compression (e.g., Parquet + XZ, RecordIO + gzip for DeepAR) is a common distractor pattern.
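
As a minimal sketch of what "JSON Lines compressed with gzip" looks like in practice, the snippet below writes one time series per line using only the standard library. The `start` and `target` fields follow the DeepAR input format; the helper name `to_deepar_jsonl_gz` and the sample values are made up for illustration.

```python
import gzip
import io
import json


def to_deepar_jsonl_gz(series):
    """Serialize time series to gzip-compressed JSON Lines, one series per line."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for item in series:
            line = json.dumps({"start": item["start"], "target": item["target"]})
            gz.write((line + "\n").encode("utf-8"))
    return buffer.getvalue()


# Hypothetical sample data: two short series starting at the same timestamp.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.5, 6.1]},
    {"start": "2024-01-01 00:00:00", "target": [1.0, 0.0, 2.0]},
]
payload = to_deepar_jsonl_gz(series)

# Round trip: decompress and parse each line back into a dict.
decoded = [json.loads(line) for line in gzip.decompress(payload).decode("utf-8").splitlines()]
```

The resulting bytes would typically be uploaded to S3 as a `.jsonl.gz` object before training.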

1.3 Data Sources and Connections in SageMaker

Connection Type	What It Does	When to Use
SageMaker Data Wrangler — Direct connection	Live query at import time; always pulls latest data	When data must always be up to date
SageMaker Data Wrangler — Cataloged connection	Uses a snapshot registered in AWS Glue Data Catalog	When consistency matters more than recency
SageMaker Feature Store	Persistent, versioned feature repository for training and inference	When features must be shared across models and stay synchronized

Direct connections — not cataloged connections — ensure always-current data.

When the exam states that "ingested data must always be up to date with the latest changes in the source systems," the answer is a direct connection in SageMaker Data Wrangler.

Cataloged connections reference a snapshot in the Glue Data Catalog, which may lag behind the source. If freshness is the requirement, direct connections are correct.

1.4 Feature Engineering

Feature engineering transforms raw data into the numerical representations that ML algorithms can learn from. The exam heavily tests encoding choices and their implications.

Encoding Techniques:

Technique	When to Apply	Result
One-hot encoding	Nominal/categorical features with no inherent order	Adds one binary column per category; increases dimensionality
Label (ordinal) encoding	Ordinal categories with a natural ranking	Maps categories to integers; preserves order
Binary encoding	High-cardinality nominal categories	Fewer new columns than one-hot; compromise approach
Tokenization	Text data for NLP models	Splits text into sub-word or word tokens

One-hot encoding increases dimensionality; label encoding does not — this distinction drives the correct answer.

When a feature has a clear ordinal relationship (e.g., job_seniority_level: Junior < Senior < Principal), label encoding is appropriate because the integer ordering carries meaning.

When a feature is nominal with no ordering (e.g., location: Berlin, Paris, Tokyo), one-hot encoding is appropriate because you do not want the model to assume Berlin < Paris < Tokyo.

When the question explicitly states "must not increase the dimensionality of the dataset" and the values are binary (yes/no, true/false), label encoding or binary encoding is correct — one-hot encoding would add columns.
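
The dimensionality difference is easy to see in a small pure-Python sketch. The helper names `one_hot_encode` and `label_encode` are hypothetical; in practice a library or SageMaker Data Wrangler would apply these transforms.

```python
def one_hot_encode(values):
    """Return one binary column per distinct category, plus the column order."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return rows, categories


def label_encode(values, order):
    """Map each category to its integer rank in the given order."""
    rank = {category: index for index, category in enumerate(order)}
    return [rank[value] for value in values]


# Nominal feature: no ordering, so each city gets its own column.
locations = ["Berlin", "Paris", "Tokyo", "Paris"]
one_hot_rows, one_hot_columns = one_hot_encode(locations)

# Ordinal feature: a single integer column that preserves the ranking.
seniority = ["Junior", "Principal", "Senior"]
seniority_codes = label_encode(seniority, order=["Junior", "Senior", "Principal"])
```

One-hot encoding turns one column into three here, while label encoding keeps a single column.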

1.5 Handling Class Imbalance

Class imbalance occurs when one class in the target variable is far more common than another (e.g., 99% legitimate transactions vs. 1% fraudulent). This is one of the most tested topics in the data preparation domain.

Technique	What It Does	When to Use
Random oversampling	Duplicates minority class examples	When you cannot generate new data
SMOTE (Synthetic Minority Oversampling Technique)	Synthesizes new minority class examples	When you need more diversity in minority class without losing data
Random undersampling	Removes majority class examples	Only when you can afford to lose data
Class weights / scale_pos_weight	Adjusts loss function to penalize minority class misses	Algorithmic fix; no data modification

Difference in Proportions of Labels (DPL):

DPL is a pre-training bias metric that quantifies representation imbalance across demographic groups. A DPL value near 0 indicates balanced representation. A high positive DPL (e.g., +0.9) for a specific subgroup means that the subgroup is dramatically over-represented in the positive class compared to the rest of the dataset.

When DPL is high for a subgroup, you undersample that subgroup — not the entire class.

A DPL of +0.9 for the age range 40–45 in the positive class indicates that this age group is over-represented relative to other age ranges. The correct remedy is to undersample the positive class for the age range 40–45 to bring it in line with other groups.

Oversampling the positive class for that age group would worsen the imbalance. Undersampling across all other age ranges would unnecessarily lose data. The fix is targeted only at the over-represented group.
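
To make the metric concrete, here is a small sketch that computes DPL as the positive-label rate inside a subgroup minus the rate in the rest of the dataset. The sign convention (subgroup minus rest) is an assumption chosen to match the example above, and the function name and rows are hypothetical.

```python
def difference_in_proportions_of_labels(rows, in_group):
    """DPL sketch: positive-label rate in the subgroup minus the rate elsewhere."""
    group = [label for facet, label in rows if in_group(facet)]
    rest = [label for facet, label in rows if not in_group(facet)]
    return sum(group) / len(group) - sum(rest) / len(rest)


# Hypothetical (age, label) pairs; label 1 is the positive class.
rows = [(42, 1), (44, 1), (41, 1), (43, 1), (25, 0), (31, 0), (58, 1), (63, 0)]
dpl = difference_in_proportions_of_labels(rows, lambda age: 40 <= age <= 45)
```

In this toy data every row aged 40 to 45 is positive while only a quarter of the others are, so the subgroup is the one to undersample.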

SMOTE is for balancing classes without losing data; undersampling discards existing data.

When the requirement explicitly states "without losing any existing training data," the correct technique is oversampling (e.g., SMOTE), not undersampling.

Use SageMaker Clarify to detect the imbalance (the CI metric ranges from -1 to +1, and values far from 0 indicate imbalance) and SageMaker Data Wrangler to apply SMOTE.
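
The sketch below shows the idea behind both steps: a class imbalance ratio, then synthetic minority points created by interpolation. It is a simplification and an assumption on my part: real SMOTE interpolates toward k-nearest neighbors, whereas this version picks random minority pairs. All names and data are hypothetical.

```python
import random


def class_imbalance(n_majority, n_minority):
    """Normalized difference in counts: 0 means balanced, near 1 means heavily skewed."""
    return (n_majority - n_minority) / (n_majority + n_minority)


def smote_like_oversample(minority, n_new, seed=0):
    """Create n_new synthetic points on the line segment between random minority pairs."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical 2-D data: 8 majority points, 2 minority points.
majority = [(0.1 * i, 1.0) for i in range(8)]
minority = [(5.0, 5.0), (6.0, 5.5)]

ci_before = class_imbalance(len(majority), len(minority))
balanced_minority = minority + smote_like_oversample(minority, len(majority) - len(minority))
ci_after = class_imbalance(len(majority), len(balanced_minority))
```

No original row is removed: the minority class grows to match the majority, which is why oversampling satisfies a "without losing data" requirement.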

1.6 PII Detection and Data Masking

Service	Capability	Limitation
AWS Glue - Detect PII transform	Find and mask PII in Glue jobs using Spark	Requires writing a Glue job but uses built-in transform
Amazon Macie	Discover sensitive data in S3	Identifies PII but does not mask it in place
Amazon Comprehend	Detect entities (names, addresses) in text	NLP-based; does not natively mask
Custom Spark/regex on EC2	Maximum flexibility	Maximum development effort

Macie discovers PII but does not mask it - it is a detection service, not a transformation service.

Amazon Macie will classify S3 objects containing sensitive data and alert you, but it does not modify the data. If the requirement is to "find and mask" PII, the correct answer involves AWS Glue with the Sensitive Data Detection transform (or a custom Glue job with regex). Macie alone does not satisfy the masking requirement.
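
To show the difference between detecting and masking, here is a minimal regex-based masking sketch. It is a stand-in for illustration, not the Glue transform: the two patterns (email addresses and US Social Security numbers) and the helper name `mask_pii` are assumptions, and real PII detection covers far more entity types.

```python
import re

# Illustrative patterns only; production detection needs many more entity types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_pii(text, placeholder="****"):
    """Replace every detected email address and SSN with a placeholder."""
    text = EMAIL.sub(placeholder, text)
    return US_SSN.sub(placeholder, text)


masked = mask_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

A detection-only service would stop at finding the matches; masking requires the substitution step that rewrites the data.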

1.7 Data Labeling

Service	Workforce Type	Best For
SageMaker Ground Truth	Private, public (Mechanical Turk), or vendor	Flexible labeling with custom UIs; requires setup
SageMaker Ground Truth Plus	Managed vendor workforce	Fully managed; no app development required
Amazon Augmented AI (A2I)	Human review of ML predictions	Reviewing and correcting model outputs; not initial labeling
Amazon Mechanical Turk (AMT)	Public crowdworkers via AMT website	Not the AWS ML-integrated path

Ground Truth and A2I solve different problems in the labeling workflow.

SageMaker Ground Truth is for creating initial labels on raw data - it is used before a model exists, to build the training dataset.

Amazon Augmented AI (A2I) is for human review of predictions that a model has already made - used in production to catch low-confidence outputs and improve them.

When a scenario involves unlabeled data that needs to be annotated for the first time, the answer is Ground Truth. When it involves reviewing model outputs in production, the answer is A2I.

A private workforce in Ground Truth is the correct answer for employee-only access.

When the requirement states that only employees should perform labeling tasks, the answer is to create a private workforce in SageMaker Ground Truth. Amazon Mechanical Turk is a public crowdsourcing platform where random members of the public complete tasks; this does not meet any requirement for restricted access.

1.8 Data Quality and Profiling

Service	Purpose
AWS Glue Data Quality	Rule-based data quality checks on Glue Data Catalog tables
AWS Glue DataBrew (Profile job)	Statistical profiling - distributions, nulls, correlations
AWS Glue DataBrew (Recipe job)	Applying transformations and cleaning steps
SageMaker Data Wrangler	Interactive data quality exploration within SageMaker

DataBrew profile jobs analyze data quality; recipe jobs transform it - they are not the same job type.

A profile job in DataBrew generates statistical summaries: column distributions, null counts, duplicate rates, and correlation matrices. It is a read-only analysis step.

A recipe job applies a sequence of transformation steps (the "recipe") to actually clean, normalize, encode, or filter the data and write the output.

When the requirement is to "clean and normalize" data, the answer is a recipe job, not a profile job.

1.9 SageMaker Feature Store

Feature Store provides a centralized repository for storing, sharing, and serving ML features consistently between training and inference.

Store Type	Latency	Use Case
Online store	Low (milliseconds)	Real-time inference; always returns latest feature value
Offline store	Higher (batch)	Training jobs; historical snapshots in S3

Enabling both online and offline stores is required when you need real-time inference AND batch training consistency.

When a scenario requires that the online store is updated with the most recent data as it arrives AND a complete historical record is maintained for batch processing, you must enable both stores simultaneously.

Using only the online store loses the historical record. Using only the offline store cannot serve real-time inference at low latency. The Feature Store Spark connector supports ingesting data into both stores in a single operation.

Domain 2: ML Model Development

2.1 Choosing the Right Algorithm

The exam presents business scenarios and asks you to select the appropriate SageMaker built-in algorithm for each scenario. The key is mapping problem type to algorithm family.

Algorithm	Problem Type	Key Signal Words
XGBoost	Classification, regression, ranking	Tabular data; "predict likelihood"; "loan ranking"
Linear Learner	Binary/multi-class classification, regression	Linear relationships; interpretability
DeepAR	Time series forecasting	"Historical measurements at regular intervals"; "predict future values"
Random Cut Forest (RCF)	Anomaly detection	Unlabeled data; "detect irregularities"; "unexpected values"
K-means	Clustering (unsupervised)	"Group similar items"; no labels
K-nearest neighbors (k-NN)	Classification/regression (supervised)	"Based on similar examples"; labeled training data
PCA	Dimensionality reduction	"Reduce features"; "high dimensionality"
Factorization Machines	Recommendations on sparse data	"High-dimensional sparse data"; collaborative filtering
Neural Topic Model (NTM)	Topic modeling	"Discover topics in text documents"
Seq2Seq	Sequence-to-sequence	Machine translation, text summarization
BlazingText	Word2Vec / text classification	Word embeddings; fast text classification

DeepAR requires specific hyperparameters - context_length and prediction_length must match your forecasting horizon.

The context_length parameter controls how much historical data the model considers for each prediction. The prediction_length controls how far into the future the model predicts. Both must be set to match the time granularity and horizon of your problem. Setting scale_pos_weight (an XGBoost hyperparameter) on a DeepAR job, or applying k-means clustering to a forecasting problem, are common wrong answers the exam uses as distractors.

RCF is the anomaly detection algorithm; k-means is a clustering algorithm — they are not interchangeable even though both are unsupervised.

Random Cut Forest is explicitly designed to score data points by their anomaly likelihood. It works on unlabeled data and assigns an anomaly score to each input — higher scores indicate anomalies.

K-means groups data into clusters based on similarity. It does not produce an anomaly score and is not designed to flag outliers.

When the scenario involves an unlabeled dataset and asks to "detect irregularities," "identify unusual patterns," or "find outliers," the answer is RCF. The num_trees hyperparameter controls model stability (more trees = more stable scores).

Factorization Machines, not k-NN, is correct for high-dimensional sparse recommendation data.

Recommendation systems typically involve user-item interaction matrices that are extremely sparse — most users have interacted with only a tiny fraction of all items. Factorization Machines are designed for exactly this use case and handle sparse, high-dimensional data efficiently.

K-NN struggles with sparse high-dimensional spaces (the curse of dimensionality) and is not the correct choice for recommendation problems at scale.

2.2 Overfitting, Underfitting, and Regularization

Problem	Symptom	Root Cause	Solution
Overfitting	High training accuracy, low validation accuracy	Model too complex; memorizing noise	Regularization (L1/L2), dropout, k-fold CV, more data
Underfitting	Low accuracy on both training and validation	Model too simple; not learning patterns	More features, more epochs, less regularization, more complex model

L1 vs. L2 Regularization:

Regularization	Effect	Key Benefit
L1 (Lasso)	Drives some feature weights to exactly zero	Automatic feature selection; removes irrelevant features
L2 (Ridge)	Shrinks all feature weights toward zero but not to zero	Distributes weight reduction across all features

L1 regularization is the correct answer when the problem is both overfitting AND unnecessary features.

When a scenario states that the model is overfitting AND the training data contains unnecessary features that should be removed, L1 regularization is the correct choice. L1 performs implicit feature selection by setting the weights of irrelevant features to zero, addressing both problems simultaneously.

L2 reduces weights but does not eliminate features. Increasing training iterations worsens overfitting. Decreasing iterations may underfit. SageMaker Debugger can help profile training, but does not apply regularization to a live model.
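
One way to see why L1 removes features while L2 only shrinks them is to compare the proximal update associated with each penalty. This is a sketch of the mechanism, not of any SageMaker algorithm's internals; the function names and weights are hypothetical.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding: subtract a fixed amount from each magnitude, clipping at zero."""
    shrunk = []
    for weight in weights:
        magnitude = max(abs(weight) - strength, 0.0)
        shrunk.append(magnitude if weight >= 0 else -magnitude)
    return shrunk


def l2_shrink(weights, strength):
    """Multiplicative shrinkage: scale every weight by the same factor below 1."""
    return [weight / (1.0 + strength) for weight in weights]


# Two strong features and two near-irrelevant ones.
weights = [2.0, -0.05, 0.5, 0.02]
l1_weights = l1_shrink(weights, strength=0.1)
l2_weights = l2_shrink(weights, strength=0.1)
```

With L1, the two small weights land exactly on zero (the features drop out); with L2, all four weights get smaller but none disappears.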

2.3 Evaluation Metrics

Metric	Problem Type	When It Matters Most
Accuracy	Classification	Balanced class distributions only
Precision	Classification	Cost of false positives is high (e.g., spam detection)
Recall	Classification	Cost of false negatives is high (e.g., fraud, disease detection)
F1 Score	Classification	Imbalanced classes; need balance of precision and recall
AUC-ROC	Binary classification	Evaluating classifier performance across all thresholds
RMSE	Regression	Numeric prediction; penalizes large errors more
MAE	Regression	Numeric prediction; equal weight to all errors

Recall is the correct metric when the business cannot afford to miss positive cases.

Recall = (True Positives) / (True Positives + False Negatives). It measures how many of the actual positives the model correctly identified.

In fraud detection, the cost of missing a fraudulent transaction (false negative) is far higher than the cost of incorrectly flagging a legitimate transaction (false positive). Therefore, maximizing recall — catching as many fraudulent transactions as possible — is the correct objective. Optimizing for precision would minimize false alarms but allow more fraud to pass through undetected.

The same logic applies to medical diagnosis: missing a positive case (disease) is typically far more costly than over-investigating a healthy patient.
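
The trade-off can be checked by hand with a short sketch that counts true positives, false positives, and false negatives. The function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy fraud example: 4 real fraud cases, 3 caught, 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here recall is 3 of 4 fraud cases caught, while precision is 3 of 5 flags being correct; the one missed case is the false negative that recall penalizes.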

For continuous numeric predictions, RMSE is correct — not accuracy or AUC.

Accuracy, F1, AUC, and precision/recall are all classification metrics. They compare predictions against discrete class labels. When the model output is a continuous value (e.g., home price, sales forecast, patient measurement), these metrics are inapplicable.

RMSE (Root Mean Square Error) is the square root of the mean squared difference between predictions and actual values, so larger errors are penalized more heavily. For models that "make continuous numeric predictions," RMSE is the correct evaluation metric, not accuracy.
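
A short sketch makes the difference between RMSE and MAE visible on the same predictions. The function names and the numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: squaring makes large errors dominate."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Three small errors and one large one (40).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 300.0, 360.0]
rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

The single large error pulls RMSE well above MAE, which is the behavior the table describes as "penalizes large errors more."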

2.4 Bias Detection with SageMaker Clarify

SageMaker Clarify provides pre-training and post-training bias metrics and model explainability reports.

Pre-training Bias Metrics:

Metric	Measures
Class Imbalance (CI)	Whether classes in the target variable are proportionally balanced
Difference in Proportions of Labels (DPL)	Whether the positive class rate differs across demographic groups
KL Divergence / JS Divergence	Statistical difference between label distributions of subgroups

DPL, not MSE or SSIM, is the correct metric to confirm class imbalance across demographic groups.

DPL (Difference in Proportions of Labels) is specifically designed as a pre-training bias metric. A DPL value significantly different from 0 indicates that the positive class is not evenly distributed across subgroups — confirming bias in the dataset.

MSE (Mean Square Error) is a regression error metric — it has nothing to do with bias detection.

SSIM (Structural Similarity Index Measure) is an image quality metric — also unrelated.

Silhouette score is a clustering quality metric — unrelated to bias.

When the question asks for a pre-training bias metric to confirm class imbalance, the answer is DPL.

2.5 Hyperparameter Tuning and Training Optimization

Key learning rate guidance:

If accuracy is not increasing and loss is decreasing very slowly with SGD, the learning rate is too low — increase it to escape the slow convergence plateau.
If loss is oscillating wildly and not converging, the learning rate is too high — decrease it.
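
Both symptoms can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2, which is an assumption chosen for simplicity; real training losses are noisier, but the pattern is the same. The function name and rates are hypothetical.

```python
def gradient_descent(learning_rate, steps=50, start=10.0):
    """Minimize f(x) = x**2 (gradient 2x) and return the loss after each step."""
    x = start
    losses = []
    for _ in range(steps):
        x -= learning_rate * 2 * x
        losses.append(x * x)
    return losses


too_low = gradient_descent(0.001)    # loss barely moves
reasonable = gradient_descent(0.1)   # loss converges toward zero
too_high = gradient_descent(1.1)     # x overshoots, flips sign, and diverges
```

After 50 steps the low rate has hardly reduced the loss, the moderate rate has essentially converged, and the high rate has made the loss larger than where it started.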

Distributed training with SageMaker:

Technique	When to Use
Distributed Data Parallel (DDP)	Large datasets that don't fit on one GPU; split data across instances
Model Parallel	Models too large to fit in one GPU's memory; split model layers
Increase number of instances	Primary lever to reduce wall-clock training time for DDP jobs

To reduce DDP training time, increase the number of instances — not epochs, neurons, or layers.

When a training job uses distributed data parallelism across instances and the goal is to decrease training time, the correct action is to increase the number of instances. More instances mean each instance processes a smaller shard of the data, so wall-clock time falls roughly in proportion to the instance count, less the overhead of synchronizing gradients between instances.

Adding epochs increases training time. Adding neurons or layers increases model complexity and per-instance compute time but does not leverage distributed parallelism.
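
The sharding idea can be sketched without any training framework. The round-robin assignment below is one common scheme and an assumption here; the SageMaker distributed libraries handle this for you. The function name is hypothetical.

```python
def shard_indices(num_samples, num_workers):
    """Assign sample indices to workers round-robin, one shard per worker."""
    return [list(range(rank, num_samples, num_workers)) for rank in range(num_workers)]


# The same 1,000-sample dataset split across 2 and then 8 workers.
shards_2 = shard_indices(1000, 2)
shards_8 = shard_indices(1000, 8)
```

Going from 2 to 8 workers cuts the per-worker share from 500 to 125 samples per epoch, which is where the wall-clock saving comes from.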

2.6 SageMaker Training Methods

Method	Purpose
`estimator.fit()`	Initiates a training job with a dataset
`estimator.deploy()`	Deploys a trained model to an endpoint
`predictor.predict()`	Calls inference on a deployed endpoint
`estimator.create_model()`	Creates a model object without deploying it

fit() starts training; deploy() starts hosting - they are sequential, not interchangeable.

The SageMaker SDK workflow is: (1) configure an Estimator, (2) call .fit() to train the model, (3) call .deploy() to host it, (4) call .predict() on the resulting predictor.

Calling .deploy() before .fit() will fail because there is no trained model artifact yet. Calling .predict() without a deployed endpoint will also fail. Questions that ask how to "initiate a training job" are asking for .fit().
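
The ordering can be captured in a few lines. The function below accepts any object exposing `fit`, `deploy`, and `predict` in the way the SageMaker Python SDK's Estimator and Predictor do. `FakeEstimator` and `FakePredictor` are hypothetical stand-ins so the sketch runs without an AWS account, and the S3 URI is a made-up placeholder.

```python
def train_deploy_predict(estimator, train_input, sample, instance_type="ml.m5.large"):
    """Run the workflow in order: fit, then deploy, then predict."""
    estimator.fit(train_input)
    predictor = estimator.deploy(initial_instance_count=1, instance_type=instance_type)
    return predictor.predict(sample)


class FakePredictor:
    """Stand-in for a deployed endpoint; echoes its input."""

    def __init__(self, calls):
        self.calls = calls

    def predict(self, data):
        self.calls.append("predict")
        return {"echo": data}


class FakeEstimator:
    """Stand-in estimator that refuses to deploy before training."""

    def __init__(self):
        self.calls = []
        self.trained = False

    def fit(self, inputs):
        self.calls.append("fit")
        self.trained = True

    def deploy(self, initial_instance_count, instance_type):
        if not self.trained:
            raise RuntimeError("no trained model artifact to deploy")
        self.calls.append("deploy")
        return FakePredictor(self.calls)


estimator = FakeEstimator()
result = train_deploy_predict(estimator, "s3://example-bucket/train/", [1.0, 2.0])
```

With a real Estimator the same three calls apply; only the stand-in classes would be replaced.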

2.7 Experiment Tracking and Model Management

Service	Purpose
SageMaker Experiments	Track and compare training runs: hyperparameters, metrics, artifacts
SageMaker Debugger / Profiler	Identify bottlenecks in training: GPU utilization, CPU usage, memory
SageMaker Model Registry	Version, stage, and deploy approved models
SageMaker Profiler annotations	Mark specific code sections to measure resource usage during training

SageMaker Experiments is the correct answer for logging training run characteristics — not CloudWatch, S3, or Model Registry.

When an ML engineer needs to track hundreds of training iterations with different hyperparameters, features, and algorithms and compare their results, the answer is SageMaker Experiments. It is purpose-built for this workflow and requires minimal implementation effort — no custom metrics, no S3 log parsing, and no extra infrastructure.

CloudWatch can surface metrics, but it is not designed for systematic experiment comparison. Model Registry manages approved models for deployment — it is not the right tool for logging in-progress experiment results.

SageMaker Profiler annotations are the right tool to pinpoint where in the training script resources are underutilized.

When GPU utilization is low and the engineer needs to find the specific bottleneck in the training script, the answer is to add SageMaker Profiler annotations to the code and generate a profiler report. This instruments the script at the function level and shows exactly where compute time is being spent.

CloudWatch reports coarser-grained metrics at the training-job or instance level, and CloudTrail records API activity rather than resource usage. Neither can pinpoint specific lines of Python training code.

Domain 3: Deployment and Orchestration of ML Workflows

3.1 SageMaker Inference Types

Choosing the right inference type is one of the highest-weighted topics in this domain. The decision depends on four factors: latency requirement, payload size, traffic pattern, and cost sensitivity.

Inference Type	Latency	Max Payload	Traffic Pattern	Idle Cost
Real-time	Milliseconds	6 MB	Consistent, predictable	Always running
Serverless	Seconds (cold start)	4 MB	Sporadic, unpredictable	Zero when idle
Asynchronous	Minutes (queued)	1 GB	Large, intermittent	Auto-scales to zero
Batch Transform	Hours (scheduled)	Entire dataset	Bulk, offline	Runs only during job

Asynchronous inference is the correct answer for large payloads (>6 MB) with processing times up to 60 minutes.

The exam frequently presents scenarios with payload sizes of 100–300 MB and processing times measured in minutes. Asynchronous inference is the only SageMaker inference type designed for this use case. It queues requests, processes them, and stores results in S3; the client then polls for completion or subscribes to an Amazon SNS notification.

Real-time inference has a hard limit of 6 MB per payload. Serverless inference has a 4 MB limit and is designed for sporadic low-volume workloads, not large payloads. Batch transform processes entire datasets at once rather than processing them one request at a time.

Serverless inference is the most cost-effective option for intermittent CPU-based workloads with periods of zero traffic.

When traffic is intermittent (e.g., only during business hours) and the model runs on CPU, Serverless Inference bills only for the compute time used to process requests and the data processed, so there is no cost during idle periods. It is cheaper than a real-time endpoint that bills by the hour even when idle.

Asynchronous inference can also scale to zero, but it is designed for large payloads and long processing times — not for low-latency interactive predictions.
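
The decision rules in this section can be written as a small helper. The thresholds (6 MB and 4 MB) are taken from the table above and should be checked against current service quotas; the function name and its parameters are hypothetical.

```python
def choose_inference_type(payload_mb, interactive, steady_traffic, whole_dataset=False):
    """Pick a SageMaker inference type using the decision factors from this section."""
    if whole_dataset:
        # Bulk, offline scoring of an entire dataset.
        return "batch transform"
    if payload_mb > 6 or not interactive:
        # Too large for real-time, or the caller can wait for a queued result.
        return "asynchronous"
    if not steady_traffic and payload_mb <= 4:
        # Sporadic traffic with small payloads: pay nothing while idle.
        return "serverless"
    return "real-time"


large_job = choose_inference_type(payload_mb=200, interactive=False, steady_traffic=False)
sporadic = choose_inference_type(payload_mb=1, interactive=True, steady_traffic=False)
steady = choose_inference_type(payload_mb=1, interactive=True, steady_traffic=True)
```

Real scenarios add constraints such as GPU requirements, but walking the factors in this order resolves most exam questions.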

3.2 Endpoint Strategies for Multiple Models

Strategy	Use Case	Cost Model
Multi-model endpoint (MME)	Many models sharing one container (same framework)	Pay for one endpoint; models loaded on-demand
Multi-container endpoint (MCE)	A few models with different frameworks; sequential or direct invocation	Pay for one endpoint; containers always loaded
Single-model endpoint	One model per endpoint	One endpoint per model
Inference pipeline	Sequential preprocessing → model → postprocessing steps	Chained containers in one endpoint

Multi-model endpoints and multi-container endpoints solve different problems.

A multi-model endpoint hosts many models (potentially hundreds) that share the same framework and container. Models are dynamically loaded and unloaded into memory based on traffic. This is the most cost-effective option when you have many models built on the same framework.

A multi-container endpoint hosts a small number of models with different frameworks in separate containers within one endpoint. Use this when models cannot share a container because they require different runtimes, but you still want to consolidate onto a single endpoint.

When the question says "multiple models, same framework," MME is the answer. When it says "different frameworks," MCE is the answer.

3.3 Auto Scaling for SageMaker Endpoints

High-resolution metrics + appropriate cooldown periods are required for rapid scaling response.

For endpoints that must scale quickly in response to sudden traffic changes:

Use high-resolution CloudWatch metrics (10-second intervals) rather than standard 60-second intervals — this gives the scaling policy faster feedback.
Use the InvocationsPerInstance metric as the target tracking metric.
Set a longer scale-in cooldown (e.g., 600 seconds) to prevent premature scaling down after a traffic spike subsides.

A shorter scale-in cooldown combined with standard metrics results in unstable scaling behavior — the endpoint may scale in too aggressively before traffic has truly subsided.
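
As a sketch, a target-tracking policy with a long scale-in cooldown could be described by the request parameters below, which follow the shape of the Application Auto Scaling `PutScalingPolicy` API. The endpoint and variant names are hypothetical. This sketch uses the standard predefined invocations-per-instance metric and does not show a high-resolution metric specification.

```python
def target_tracking_policy(endpoint_name, variant_name, invocations_per_instance,
                           scale_in_cooldown=600, scale_out_cooldown=60):
    """Build PutScalingPolicy parameters for a SageMaker endpoint variant."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Long scale-in cooldown: wait before removing capacity after a spike.
            "ScaleInCooldown": scale_in_cooldown,
            # Short scale-out cooldown: add capacity quickly.
            "ScaleOutCooldown": scale_out_cooldown,
        },
    }


# Hypothetical endpoint and variant names.
policy = target_tracking_policy("churn-endpoint", "AllTraffic", invocations_per_instance=100)
```

With boto3, a dictionary like this would typically be passed as keyword arguments to the `application-autoscaling` client's `put_scaling_policy` call, after the variant has been registered as a scalable target.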

When endpoints scale to zero instances overnight, apply a scheduled scale-out policy to pre-warm before business hours.

With auto scaling, the minimum instance count can be set to zero during idle periods, reducing costs. However, when traffic arrives at the start of business hours, there are zero instances available to handle it, causing delays while new instances spin up.

The solution is to apply a scheduled scaling action (through Application Auto Scaling) that increases the minimum instance count from 0 to 1 (or more) before business hours begin, ensuring the endpoint is ready when traffic arrives. A CloudWatch alarm with a step scaling policy is a different, reactive mechanism: it responds only after requests have already arrived.
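
A pre-warm schedule could be sketched as the parameters for the Application Auto Scaling `PutScheduledAction` API. The endpoint name, variant name, capacities, and cron expression (07:30 UTC on weekdays) are hypothetical values chosen for illustration.

```python
def prewarm_action(endpoint_name, variant_name, cron_expression,
                   min_capacity=1, max_capacity=4):
    """Build PutScheduledAction parameters that raise the minimum instance count."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": f"{endpoint_name}-prewarm",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": f"cron({cron_expression})",
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    }


# Hypothetical schedule: 07:30 UTC, Monday to Friday.
action = prewarm_action("churn-endpoint", "AllTraffic", "30 7 ? * MON-FRI *")
```

A matching evening action that sets the minimum back to 0 would complete the daily cycle.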

3.4 Deployment Strategies

Strategy	Purpose	When to Use
Blue/Green deployment	Zero-downtime cutover; full traffic shift	When you want a clean switch with rollback capability
Shadow variant	Test a new model on live traffic without affecting user responses	When comparing new model performance to current model before promoting
Canary deployment	Gradually shift a percentage of traffic to new model	When you want a controlled incremental rollout
A/B test (production variant)	Split traffic between model versions with metrics	When comparing multiple model variants in production

Shadow variants are the least-effort path to comparing a new model against a production model.

A shadow variant receives a copy of live production traffic and processes it, but its responses are not returned to users. This allows you to measure the new model's performance on real traffic with zero risk to the user experience.

To deploy a shadow variant, deploy the new model on the same endpoint as the current model. Configure the percentage of live requests to copy to it, then evaluate its outputs against the production model's outputs.

This requires less operational effort than deploying to a separate endpoint, managing DNS routing, or writing custom Lambda routing logic.
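
As a rough sketch, an endpoint configuration with a shadow variant could look like the parameters below, modeled on the SageMaker `CreateEndpointConfig` API and its `ShadowProductionVariants` field. Model names, the instance type, and the weights are hypothetical, and the exact way variant weights translate into a mirrored traffic percentage should be checked in the API reference.

```python
def shadow_endpoint_config(config_name, production_model, shadow_model,
                           instance_type="ml.m5.large", shadow_weight=0.5):
    """Build CreateEndpointConfig parameters with one production and one shadow variant."""

    def variant(name, model_name, weight):
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,
        }

    return {
        "EndpointConfigName": config_name,
        # Responses from this variant are returned to callers.
        "ProductionVariants": [variant("production", production_model, 1.0)],
        # This variant receives mirrored requests; its responses are not returned.
        "ShadowProductionVariants": [variant("shadow", shadow_model, shadow_weight)],
    }


# Hypothetical model names.
config = shadow_endpoint_config("fraud-shadow-test", "fraud-model-v1", "fraud-model-v2")
```

Both variants live in one endpoint configuration, which is why no separate endpoint or custom routing layer is needed.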

3.5 Model Registry and Versioning

Service	Purpose
SageMaker Model Registry	Version control and lifecycle management for trained models; supports approval workflows
SageMaker Inference Recommender	Benchmarks model performance across EC2 instance types; recommends best-fit instance
Amazon ECR	Container image registry; stores Docker images but does not manage ML model versions

SageMaker Inference Recommender, not Autopilot or Compute Optimizer, provides instance type recommendations for hosting.

When the requirement is to determine the best EC2 instance type for hosting an ML model on SageMaker, the answer is SageMaker Inference Recommender. It runs load tests against multiple instance configurations and ranks them by performance and cost.

SageMaker Autopilot is an AutoML service for automatically building and training models — it does not recommend hosting instance types.

AWS Compute Optimizer makes EC2 sizing recommendations for general workloads based on CloudWatch metrics — not for SageMaker model hosting.

3.6 Amazon Bedrock Deployment

Concept	Detail
Custom model import	Import Hugging Face or other models into Bedrock for API access
S3 URI requirement	Model files must be in S3 in the same AWS account as the Bedrock import job
VPC endpoint (PrivateLink)	Connect EC2 in private subnets to Bedrock without traversing the internet
Bedrock Knowledge Bases	Fully managed RAG; supports S3 data sources; default vector store is OpenSearch Serverless

PrivateLink (interface VPC endpoint) is the correct solution for private subnet EC2 to Bedrock connectivity.

When EC2 instances are in a private subnet and must remain private while calling the Amazon Bedrock API, the answer is AWS PrivateLink — specifically an interface VPC endpoint for Bedrock. This routes all API traffic through the AWS private network, eliminating the need for internet access.

A NAT gateway would route traffic to the internet (against the requirement). AWS Direct Connect connects on-premises networks, not VPC-to-service traffic. Modifying Bedrock to use a private subnet is not possible — Bedrock is a managed service that the user cannot configure at the network level.
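As a rough sketch of the PrivateLink setup described above, the request for an interface endpoint can be assembled as a plain dictionary. The helper name and all resource IDs are hypothetical placeholders. The sketch assumes the Bedrock runtime API is the one being called; it only builds the arguments and does not call AWS.

```python
def bedrock_interface_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for the EC2 CreateVpcEndpoint call (interface type)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumption: the instances call the runtime API (model invocation).
        "ServiceName": f"com.amazonaws.{region}.bedrock-runtime",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Private DNS lets the default service hostname resolve to private IPs.
        "PrivateDnsEnabled": True,
    }


# Hypothetical IDs, for illustration only.
params = bedrock_interface_endpoint_params(
    region="us-east-1",
    vpc_id="vpc-EXAMPLE",
    subnet_ids=["subnet-EXAMPLE-A", "subnet-EXAMPLE-B"],
    security_group_ids=["sg-EXAMPLE"],
)
print(params["ServiceName"])
```

With credentials configured, these arguments could be passed to boto3 as `boto3.client("ec2").create_vpc_endpoint(**params)`. No NAT gateway or internet gateway appears anywhere in the request, which is the point of the exam answer.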

RAG via Bedrock Knowledge Bases is the least-operational-overhead path — not fine-tuning, not Neptune.

When the requirement is to supplement an LLM with documents from S3 using RAG, the answer is to create a Knowledge Base for Amazon Bedrock and configure the S3 bucket as a data source. Bedrock handles chunking, embedding, vector storage (OpenSearch Serverless by default), and retrieval automatically.

Fine-tuning (AutoML or SageMaker Pipelines) updates model weights — it does not inject external documents at query time. Amazon Neptune is a graph database — not a managed vector store for RAG.
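To illustrate how little application code the managed RAG path needs, here is a minimal sketch of a query against an existing Knowledge Base. The helper name, the Knowledge Base ID, and the model ARN are hypothetical placeholders. The sketch only builds the request; the call itself is shown in the note after the block.

```python
def build_rag_request(question, knowledge_base_id, model_arn):
    """Build keyword arguments for a RetrieveAndGenerate call against a Knowledge Base."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


# Hypothetical identifiers, for illustration only.
request = build_rag_request(
    question="What is our refund policy?",
    knowledge_base_id="KBEXAMPLE01",
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/EXAMPLE-MODEL-ID",
)
print(request["retrieveAndGenerateConfiguration"]["type"])
```

With credentials configured, the request could be sent with `boto3.client("bedrock-agent-runtime").retrieve_and_generate(**request)`. Retrieval and generation happen in that one call, with no custom chunking, embedding, or vector-store code.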

3.7 Networking and Privacy

Requirement	Solution
Network isolation from internet during training	Private subnet + S3 gateway VPC endpoint
Private IP addresses only	Private subnet + VPC endpoint (no NAT gateway)
Opt out of SageMaker metadata collection	Set `OPT_OUT_TRACKING` environment variable or opt-out parameter per training job
Cross-account S3 access	VPC peering or S3 bucket policies with cross-account permissions

Network isolation requires a private subnet AND an S3 gateway VPC endpoint, not a NAT gateway.

To ensure SageMaker training jobs have no internet connectivity while still accessing S3:

Run training jobs in private subnets (no route to internet gateway).
Create an S3 gateway VPC endpoint so traffic to S3 stays on the AWS private network.

A NAT gateway routes traffic to the internet, which breaks the network isolation requirement. Placing the training job in a public subnet and relying on security group inbound rules does not help either, because the subnet still has an outbound route to the internet.
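The two pieces above can be sketched as request fragments: one for the S3 gateway endpoint and one for the training job's VPC settings. The helper names and resource IDs are hypothetical. The sketch only builds dictionaries and does not call AWS.

```python
def s3_gateway_endpoint_params(region, vpc_id, route_table_ids):
    """Build keyword arguments for EC2 CreateVpcEndpoint (gateway type, S3)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Gateway endpoints attach to route tables, not to subnets or ENIs.
        "RouteTableIds": list(route_table_ids),
    }


def training_vpc_config(private_subnet_ids, security_group_ids):
    """Build the VpcConfig fragment of a SageMaker CreateTrainingJob request."""
    return {
        "Subnets": list(private_subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


# Hypothetical IDs, for illustration only.
endpoint = s3_gateway_endpoint_params("us-east-1", "vpc-EXAMPLE", ["rtb-EXAMPLE"])
vpc_config = training_vpc_config(["subnet-PRIVATE-A"], ["sg-EXAMPLE"])
print(endpoint["ServiceName"], vpc_config["Subnets"])
```

The route tables passed to the endpoint should be the ones associated with the private subnets used for training, so S3 traffic from those subnets resolves to the endpoint rather than to an internet path.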

3.8 Custom Containers and Infrastructure

Approach	When to Use	Provisioning Method
BYOC (Bring Your Own Container)	Custom libraries, frameworks not in SageMaker containers	Docker image in ECR
SageMaker pre-built containers	Standard frameworks (PyTorch, TensorFlow, sklearn)	Built-in; no Docker work needed
EKS with CDK	Kubernetes; managed control plane; Python provisioning	AWS CDK (Python)
EKS with CloudFormation	Kubernetes; managed control plane; declarative templates	CloudFormation (YAML/JSON)

AWS CDK is the correct answer when the infrastructure must be provisioned using Python specifically.

When requirements specify (1) no managing the Kubernetes control plane, (2) using Kubernetes, and (3) Python for provisioning, the answer is AWS CDK to provision an Amazon EKS cluster. CDK supports Python natively, and EKS is a managed service that runs the Kubernetes control plane for you.

AWS CloudFormation can also provision EKS, but uses YAML/JSON templates, not Python. The AWS CLI is not a repeatable provisioning method — it is imperative, not declarative. EC2-based self-managed Kubernetes requires managing the control plane.

Domain 4: ML Solution Monitoring, Maintenance, and Security

4.1 SageMaker Model Monitor

Model Monitor continuously evaluates deployed models against a baseline and raises alerts when violations are detected.

Monitor Type	What It Detects	Baseline Required
Data Quality Monitor	Shifts in input feature distributions (data drift)	Statistical baseline from training data
Model Quality Monitor	Degradation in prediction accuracy/performance	Ground truth labels + prediction logs
Bias Drift Monitor	Changes in model fairness metrics over time	Clarify bias baseline
Feature Attribution Drift	Changes in which features drive predictions	Clarify SHAP baseline

Correct sequence for setting up Model Monitor:

Create a baseline from training/validation data (generates constraints and statistics)
Create a monitoring job that compares live inference data against the baseline

The baseline must be created BEFORE the monitoring job — not after.

A monitoring job in SageMaker Model Monitor compares live inference data against a pre-established baseline. If the monitoring job is configured before the baseline exists, it has nothing to compare against and will not produce meaningful alerts.

The correct order is always: (1) establish the baseline from training data, (2) configure the monitoring schedule.

Data drift requires a Data Quality Monitor with a data quality baseline — not a model quality baseline.

When the requirement is to "detect changes in the input data distribution of model features" (i.e., data drift), the answer is to configure a Data Quality Monitor and establish a data quality baseline. The baseline captures the statistical distribution of features at training time, and the monitor alerts when live data deviates from it.

A model quality baseline monitors output accuracy — it requires ground truth labels to compare predictions against. It does not detect input distribution shifts.
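The baseline-then-monitor order can be illustrated with a deliberately simplified, standard-library sketch. It is a stand-in for the idea only: the statistics and the z-score threshold here are assumptions chosen for illustration, not the statistics or constraints that Model Monitor itself computes.

```python
import statistics


def create_baseline(training_rows):
    """Step 1: summarize each feature column of the training data."""
    columns = list(zip(*training_rows))
    return [{"mean": statistics.mean(c), "std": statistics.stdev(c)} for c in columns]


def check_drift(baseline, live_rows, max_z=3.0):
    """Step 2: flag features whose live mean sits far from the baseline mean."""
    violations = []
    for index, column in enumerate(zip(*live_rows)):
        stats = baseline[index]
        shift = abs(statistics.mean(column) - stats["mean"])
        if stats["std"] > 0 and shift / stats["std"] > max_z:
            violations.append(index)
    return violations


# Made-up feature rows: (feature_0, feature_1).
training = [(1.0, 10.0), (2.0, 11.0), (3.0, 9.0), (2.0, 10.0)]
live = [(2.1, 30.0), (1.9, 31.0), (2.0, 29.0)]

baseline = create_baseline(training)   # must exist before monitoring
drifted = check_drift(baseline, live)  # compares live inputs to the baseline
print(drifted)  # → [1]
```

Note that `check_drift` needs no labels: it looks only at input features, which is why data drift maps to a data quality baseline rather than a model quality baseline.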

4.2 Automated Retraining Pipelines

Model Monitor + Lambda is the standard pattern for detecting drift and automatically triggering retraining.

When a pipeline must automatically initiate retraining upon detecting data drift:

SageMaker Model Monitor detects the drift and publishes a CloudWatch metric violation.
An AWS Lambda function (triggered by the CloudWatch alarm) initiates a new SageMaker training job or kicks off a SageMaker Pipeline.

AWS Glue, Apache Flink, and QuickSight are not designed for detecting model drift. Step Functions can orchestrate retraining, but it adds complexity; Lambda is the simplest trigger mechanism.
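A minimal sketch of the Lambda side of this pattern follows. It assumes the CloudWatch alarm invokes the function directly and that the event carries the alarm state under `alarmData.state.value`; if the alarm is routed through SNS or EventBridge, the payload shape differs. The pipeline name is a hypothetical placeholder, and the optional `sagemaker_client` argument exists only so the handler can be exercised without AWS access.

```python
PIPELINE_NAME = "retraining-pipeline"  # hypothetical pipeline name


def lambda_handler(event, context, sagemaker_client=None):
    """Start the retraining pipeline when the drift alarm is in the ALARM state."""
    # Assumed event shape for a direct alarm-to-Lambda invocation.
    state = event.get("alarmData", {}).get("state", {}).get("value")
    if state != "ALARM":
        return {"started": False}

    if sagemaker_client is None:
        import boto3  # imported lazily so the handler can be tested with a stub

        sagemaker_client = boto3.client("sagemaker")

    response = sagemaker_client.start_pipeline_execution(PipelineName=PIPELINE_NAME)
    return {"started": True, "executionArn": response["PipelineExecutionArn"]}
```

The handler does nothing when the alarm returns to OK, so a recovering metric does not trigger a second retraining run.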

After retraining on new data, the Model Monitor baseline must also be updated.

A model is retrained on new data because the distribution of inputs has changed. If Model Monitor keeps using the original baseline from the old training data, it will flag input patterns that are now legitimate as violations.

After retraining: (1) retrain with new data, (2) reestablish the baseline from the new training data, (3) update the monitoring job to use the new baseline. Reusing the original baseline with newly trained models introduces false alerts and undermines the monitoring system.

4.3 Identifying the Root Cause of Performance Degradation

Symptom	Root Cause	Solution
Model accuracy degrades gradually over time	Data drift — production data distribution shifted	Retrain on recent data; update monitoring baseline
Model accuracy degrades suddenly after a deployment	Bug in new code or data pipeline	Roll back deployment; debug pipeline
High training accuracy, low validation accuracy	Overfitting	Regularization, more data, cross-validation
Low accuracy on both training and validation	Underfitting	More complex model, more features, more training
Accuracy suddenly drops after months of stability	Concept drift or data pipeline failure	Investigate data freshness; check upstream pipelines

Data drift in production — not overfitting — is the correct cause when a well-performing model degrades over time.

A model that has been deployed and has performed well for months, with metrics above thresholds, was by definition not overfitting at deployment time. Overfitting is a training-time phenomenon.

When performance degrades after a period of stable production operation, the cause is almost always drift in the production data distribution — the real-world inputs have changed in ways that were not represented in the original training data. The fix is retraining on current data, not debugging the original model.

4.4 Content Moderation and Real-Time Video Analysis

Amazon Rekognition + Lambda is the least-overhead solution for real-time video frame content moderation.

Amazon Rekognition includes built-in content moderation capabilities: it can detect inappropriate content categories (e.g., explicit nudity, violence) in images and video frames with a single API call.

Using a custom SageMaker model requires training, hosting, and maintaining a computer vision model, which incurs significant operational overhead. Using Transcribe + Comprehend analyzes audio/text, not visual content. Rekognition with Lambda extracts and analyzes image frames in a serverless, managed fashion.
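As a sketch of the "single API call" per frame, the function below checks one extracted frame with Rekognition content moderation. The function name and the confidence threshold are illustrative assumptions, and frame extraction from the video is out of scope. The client is passed in so the logic can be exercised with a stub.

```python
def frame_is_unsafe(frame_bytes, rekognition_client, min_confidence=80.0):
    """Return (flagged, label_names) for one image frame."""
    response = rekognition_client.detect_moderation_labels(
        Image={"Bytes": frame_bytes},
        MinConfidence=min_confidence,
    )
    # Each returned label names a moderation category detected in the frame.
    names = [label["Name"] for label in response["ModerationLabels"]]
    return bool(names), names
```

Inside a Lambda function, the client would typically be `boto3.client("rekognition")`, created once outside the handler and reused across invocations.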

4.5 Security and Access Control

Requirement	Solution
Block traffic from a specific IP	Network ACL inbound rule (deny rule); security groups only support allow rules
Prevent misuse of presigned URLs from outside a VPN	IAM condition `aws:SourceVpc` validation
Restrict presigned URLs to specific IP range	IAM condition `aws:SourceIp` validation
SageMaker training jobs with KMS-encrypted S3	IAM execution role must have `kms:Encrypt` and `kms:Decrypt` permissions
Opt out of SageMaker metadata collection	`OPT_OUT_TRACKING` env variable or per-job opt-out parameter

Security groups cannot deny traffic — only Network ACLs can create deny rules.

AWS security groups are stateful and support allow rules only. You cannot create a deny rule in a security group. To block traffic from a specific IP address, you must create a Network ACL (NACL) deny rule for the specific IP and apply it to the subnet.

VPC route tables do not support deny rules either; they determine routing paths, not access-control decisions. Creating a shadow variant to redirect traffic is not a security mechanism.
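The difference is easier to see with a small simulation of how a network ACL evaluates rules: in ascending rule-number order, with the first match deciding the outcome. This is an illustrative model written with the standard library, not an AWS API; the rule numbers and addresses (taken from documentation-reserved ranges) are made up.

```python
import ipaddress


def nacl_allows(rules, source_ip):
    """Evaluate NACL-style rules: lowest rule number first, first match wins."""
    address = ipaddress.ip_address(source_ip)
    for _number, cidr, action in sorted(rules):
        if address in ipaddress.ip_network(cidr):
            return action == "allow"
    return False  # implicit final deny when no rule matches


# The deny rule needs a lower number than the broad allow rule.
inbound_rules = [
    (200, "0.0.0.0/0", "allow"),
    (100, "203.0.113.7/32", "deny"),
]

print(nacl_allows(inbound_rules, "203.0.113.7"))    # → False
print(nacl_allows(inbound_rules, "198.51.100.20"))  # → True
```

A security group has no equivalent of the rule-100 entry: it can only list what is allowed, so carving a single address out of a broad allow rule is not possible there.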

KMS-encrypted S3 access failures after switching from SSE-S3 to SSE-KMS require adding KMS permissions to the execution role.

When S3 buckets use SSE-S3, SageMaker training jobs can access them with standard S3 permissions. When you switch to SSE-KMS, the SageMaker execution role must also have permissions on the specific KMS key: kms:Encrypt and kms:Decrypt in the exam's wording, along with kms:GenerateDataKey, which S3 calls when it writes an SSE-KMS object. Without these, S3 reads and writes of the encrypted objects fail with AccessDenied.

Adding s3:ListBucket does not resolve KMS access errors. Updating the aws:SecureTransport bucket policy condition enforces HTTPS but does not grant KMS permissions.
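A minimal sketch of the missing statement, written as a Python dictionary that serializes to an IAM policy document. The key ARN and helper name are hypothetical placeholders. The action list combines the two permissions named in the exam answer with `kms:GenerateDataKey`, which S3 uses when writing SSE-KMS objects.

```python
import json

# Hypothetical key ARN, for illustration only.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def kms_statement_for_sse_kms(key_arn):
    """Build an IAM statement granting KMS use on one specific key."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
        # Scope to the one key that encrypts the bucket, not "*".
        "Resource": key_arn,
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [kms_statement_for_sse_kms(KEY_ARN)],
}
print(json.dumps(policy, indent=2))
```

This statement is attached to the SageMaker execution role in addition to its existing S3 permissions; it does not replace them.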

aws:SourceVpc validates the VPC origin of a request; aws:SourceIp validates the IP address.

When the company shares SageMaker Studio notebooks over a VPN and needs to ensure presigned URLs can only be used from within the VPC that the VPN connects to, the correct IAM policy condition is aws:SourceVpc. This ensures that even if a presigned URL is leaked externally, it cannot be used from outside the authorized VPC. The key is present only on requests that arrive through a VPC endpoint, so the VPC must reach SageMaker through one.

aws:SourceIp restricts access to specific IP ranges, which is appropriate when the restriction is IP-based rather than VPC-based. aws:PrincipalTag is for attribute-based access control tied to IAM principals, not network origin.
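The two condition keys can be compared side by side with a small policy builder. This is a sketch under stated assumptions: the helper name, VPC ID, and CIDR range are hypothetical, and the policy is scoped to the `sagemaker:CreatePresignedDomainUrl` action as one plausible place to enforce the restriction.

```python
def presigned_url_policy(vpc_id=None, cidr=None):
    """Allow presigned Studio URL creation only from one VPC or one IP range."""
    if (vpc_id is None) == (cidr is None):
        raise ValueError("pass exactly one of vpc_id or cidr")

    if vpc_id is not None:
        # VPC-based restriction: request must originate in this VPC.
        condition = {"StringEquals": {"aws:SourceVpc": vpc_id}}
    else:
        # IP-based restriction: request must come from this address range.
        condition = {"IpAddress": {"aws:SourceIp": cidr}}

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": "*",
                "Condition": condition,
            }
        ],
    }


# Hypothetical values, for illustration only.
by_vpc = presigned_url_policy(vpc_id="vpc-EXAMPLE")
by_ip = presigned_url_policy(cidr="203.0.113.0/24")
print(by_vpc["Statement"][0]["Condition"])
print(by_ip["Statement"][0]["Condition"])
```

The only difference between the two policies is the condition block, which mirrors how the exam distinguishes the scenarios: "from within the VPC" points to the first, "from a specific IP range" to the second.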

4.6 Monitoring Tools Reference

Service	What It Monitors
SageMaker Model Monitor	Data quality, model quality, bias drift, feature attribution drift
SageMaker Clarify	Pre/post-training bias metrics; SHAP-based feature explanations
SageMaker Debugger / Profiler	Training job resource utilization (GPU, CPU, memory); training anomalies
Amazon CloudWatch	Infrastructure metrics (CPU, memory, invocations); alarm-based notifications
AWS CloudTrail	API call audit logs; who did what and when
Amazon CloudWatch Logs Insights	Query and analyze log data for error patterns

Model Monitor captures bias metrics; CloudWatch visualizes them — use both together for dashboards.

SageMaker Clarify bias metrics generated by a monitoring job are emitted as Amazon CloudWatch metrics. To display these on a dashboard, the ML engineer captures the CloudWatch metrics from SageMaker Clarify/Model Monitor and builds a CloudWatch dashboard.

CloudTrail captures API activity logs — it does not capture bias metrics. EventBridge can trigger actions based on events, but it is not the path for metric collection. SNS is for sending notifications, not for capturing or displaying metrics.

Quick-Reference: Key Service Decision Trees

Which data transformation tool?

No-code, visual interface → AWS Glue DataBrew (recipe job)
Serverless ETL with Python/PySpark → AWS Glue ETL job
Interactive data prep in SageMaker → SageMaker Data Wrangler
Large-scale Spark → Amazon EMR
SQL over S3 → Amazon Athena

Which inference type?

Millisecond latency, small payload → Real-time
Sporadic traffic, no idle cost, CPU → Serverless
Large payload (>6 MB), long processing, notify on complete → Asynchronous
Bulk dataset, once daily → Batch Transform
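The inference decision tree above can be written as a short function, which makes the order of the checks explicit. The parameter names and the traffic labels are assumptions made for this sketch; the 6 MB threshold is the one given in the list.

```python
def choose_inference_type(payload_mb, traffic, long_processing=False):
    """Pick a SageMaker inference option from the guide's decision tree.

    traffic is one of "steady", "sporadic", or "bulk" (labels assumed here).
    """
    if traffic == "bulk":
        return "Batch Transform"
    if payload_mb > 6 or long_processing:
        return "Asynchronous"
    if traffic == "sporadic":
        return "Serverless"
    return "Real-time"


print(choose_inference_type(200, "steady", long_processing=True))  # → Asynchronous
print(choose_inference_type(1, "sporadic"))                        # → Serverless
print(choose_inference_type(1, "steady"))                          # → Real-time
print(choose_inference_type(500, "bulk"))                          # → Batch Transform
```

Payload size and processing time are checked before traffic pattern because a large or slow request rules out both real-time and serverless regardless of how often it arrives.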

Which encoding?

Nominal category, few values, no ordering → One-hot
Ordinal category with ranking → Label/ordinal
Binary yes/no, no dimensionality increase → Label (binary)
High-cardinality nominal, minimize columns → Binary encoding
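A small pure-Python sketch shows why these encodings differ in column count. The category lists are made-up examples, and the functions are illustrative rather than taken from any particular library.

```python
def one_hot(value, categories):
    """One column per category; exactly one column is 1."""
    return [1 if value == category else 0 for category in categories]


def ordinal(value, ordered_categories):
    """A single integer column that preserves the ranking."""
    return ordered_categories.index(value)


def binary_encode(value, categories):
    """Write the category index in binary: about log2(n) columns."""
    width = max(1, (len(categories) - 1).bit_length())
    index = categories.index(value)
    return [int(bit) for bit in format(index, f"0{width}b")]


colors = ["red", "green", "blue"]
sizes = ["low", "medium", "high"]
cities = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]

print(one_hot("green", colors))     # → [0, 1, 0]
print(ordinal("medium", sizes))     # → 1
print(binary_encode("c5", cities))  # → [1, 0, 1]
```

With eight city values, one-hot encoding would add eight columns while binary encoding adds three, which is the trade-off behind the "high-cardinality, minimize columns" rule.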

Which anomaly detection approach?

Unlabeled data → Random Cut Forest (RCF)
Labeled data, classification → XGBoost or Linear Learner

Which monitoring tool?

Input data drift → Model Monitor (Data Quality)
Prediction accuracy over time → Model Monitor (Model Quality)
Bias in live predictions → Model Monitor (Bias Drift) + Clarify
Training bottleneck → SageMaker Debugger / Profiler
Security audit → AWS CloudTrail

Common Exam Traps: Summary

Trap	Correct Answer
"Aggregate from S3 + on-prem MySQL" → EMR	AWS Lake Formation
"Data always up to date in Data Wrangler" → cataloged	Direct connection
"Large payload 100–300 MB, 60 min processing" → batch transform	Asynchronous inference
"Sporadic, no traffic overnight, CPU model" → real-time	Serverless inference
"Detect irregularities, unlabeled" → k-means	Random Cut Forest (RCF)
"Recommendation, high-dimensional sparse" → k-NN	Factorization Machines
"Catch all fraud, minimize misses" → precision	Recall
"Continuous numeric predictions" → accuracy	RMSE
"Overfitting + unnecessary features" → L2	L1 regularization
"Block specific IP" → security group deny	Network ACL deny rule
"KMS AccessDenied after SSE-KMS switch" → add s3 permissions	Add `kms:Encrypt` / `kms:Decrypt` to execution role
"Monitor creates baseline → schedule" → reverse order	Baseline first, then monitoring job
"Retrain on new data → reuse old baseline"	Establish new baseline after retraining
"Compare new model to production, least effort" → separate endpoint	Shadow variant on same endpoint
"No-code transformation" → Glue ETL job	Glue DataBrew (recipe job)
"Label employees-only data" → Mechanical Turk	Private workforce in Ground Truth
"RAG from S3 docs, least overhead" → fine-tuning	Bedrock Knowledge Bases
"Private subnet EC2 → Bedrock" → NAT gateway	AWS PrivateLink (interface VPC endpoint)
"Python provisioning of EKS" → CloudFormation	AWS CDK

Last updated: Wednesday, 03 June 2026
