MLA-C01 AWS Machine Learning Engineer – Associate Study Guide
Getting ready for the MLA-C01 exam requires a deep, hands-on understanding of the full ML lifecycle on AWS - from raw data ingestion and feature engineering, through model training and evaluation, to production deployment, monitoring, and security. The real challenge is not just knowing the services, but knowing which service to use, when, and why - especially when several options seem equally valid.
This guide is built from real exam patterns and covers the four core domains with clear summaries, direct service comparisons, decision-making frameworks, and warnings about common exam traps.
Exam Domain Overview
| Domain | Approx. Weight |
|---|---|
| Data Preparation for Machine Learning | ~28% |
| ML Model Development | ~26% |
| Deployment and Orchestration of ML Workflows | ~22% |
| ML Solution Monitoring, Maintenance, and Security | ~24% |
Domain 1: Data Preparation for Machine Learning
1.1 Data Ingestion and Aggregation
Before any ML model can be trained, data must be collected from disparate sources and unified into a consistent format. The exam frequently presents multi-source scenarios and tests whether you can identify the right aggregation tool.
| Tool | Best For | Key Characteristic |
|---|---|---|
| AWS Lake Formation | Centralized data lake from S3, RDS, on-premises databases | Governed ingestion + access control; ideal for heterogeneous sources |
| Amazon EMR (Spark) | Large-scale distributed data processing | Heavy compute; requires cluster management |
| AWS Glue | ETL jobs, schema discovery, Data Catalog | Serverless; best for structured data transformation |
| Amazon Kinesis Data Streams | Real-time streaming data ingestion | Low latency; not for batch aggregation from static sources |
| Amazon DynamoDB | NoSQL key-value store | A database, not a data aggregation service |
Lake Formation is the aggregation answer; Kinesis is the streaming answer — do not swap them.
When a scenario describes pulling data from Amazon S3, on-premises MySQL databases, and other static sources into a unified data lake, the answer is AWS Lake Formation. It is purpose-built to govern, catalog, and aggregate data from multiple heterogeneous sources.
Amazon Kinesis Data Streams is for real-time event streaming — not for batch ingestion from relational databases or S3. Selecting Kinesis when the data is static is a frequent trap.
Amazon DynamoDB is a database and stores data but does not perform aggregation from multiple source systems.
1.2 Data Transformation and ETL
Once data is ingested, it must be transformed into a model-ready format. The exam tests whether you can choose the right ETL tool for the job's complexity and operational overhead.
| Tool | Use Case | Code Required? |
|---|---|---|
| AWS Glue (ETL jobs) | Schema discovery, format conversion, Data Catalog updates | PySpark / Python |
| AWS Glue DataBrew | Visual, no-code data cleaning and transformation | No (recipe-based) |
| AWS Glue Relationalize | Flatten nested/JSON data into tabular format | PySpark transform |
| SageMaker Data Wrangler | Interactive data prep inside SageMaker Studio | No (UI-based) |
| Amazon Athena | SQL queries over S3 data | SQL |
| Amazon EMR | Large-scale Spark/Hadoop jobs | Spark/Scala/Python |
Format conversions that appear on the exam:
- CSV → Apache Parquet: The standard optimization for query performance; use an AWS Glue ETL job. Parquet is columnar and dramatically faster for analytical queries and ML feature generation.
- JSON (nested) → tabular: Use the AWS Glue Relationalize transform — it flattens nested JSON automatically with minimal code.
- DeepAR time series format: The expected answer is JSON Lines (.jsonl) with gzip compression, not RecordIO. DeepAR can also read Parquet input, so check the compression half of each answer option carefully.
No-code vs. low-code tools are not interchangeable in exam answers.
AWS Glue DataBrew is a fully visual, no-code tool. You build "recipes" through a point-and-click interface — no scripting. It is the answer whenever the requirement specifies "no-code transformation."
AWS Glue ETL jobs require writing PySpark or Python code. They are more flexible but carry more implementation effort.
SageMaker Data Wrangler is the no-code/low-code option within SageMaker Studio specifically. It is the right answer when the question is scoped to the SageMaker ecosystem and asks for data preparation with the least effort.
Selecting a code-based tool when the requirement specifies "least development effort" or "no-code" will cost you the point.
The correct compression format for DeepAR is gzip, not Snappy or XZ.
Amazon SageMaker DeepAR expects JSON Lines (.jsonl) format for time series data, compressed with gzip. Snappy is common with Parquet but is not the correct answer for DeepAR. XZ offers high compression ratios but is not supported. Mixing format and compression (e.g., Parquet + XZ, RecordIO + gzip for DeepAR) is a common distractor pattern.
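As a rough illustration of that format, the sketch below serializes two made-up series as gzip-compressed JSON Lines in memory. The start and target field names follow the DeepAR input format; the series values and the helper name to_gzip_jsonl are assumptions made for this example.

```python
import gzip
import io
import json


def to_gzip_jsonl(series):
    """Serialize a list of dicts as gzip-compressed JSON Lines (one object per line)."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for record in series:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
    return buffer.getvalue()


# Hypothetical hourly series; the values are invented for illustration.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.5, 6.1, 8.2]},
    {"start": "2024-01-01 00:00:00", "target": [1.0, 0.0, 2.0, 1.5]},
]

payload = to_gzip_jsonl(series)

# Round trip: decompress and parse each line back into a dict.
lines = gzip.decompress(payload).decode("utf-8").splitlines()
decoded = [json.loads(line) for line in lines]
```

In practice the bytes would be written to a .json.gz object in S3 rather than kept in memory.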
1.3 Data Sources and Connections in SageMaker
| Connection Type | What It Does | When to Use |
|---|---|---|
| SageMaker Data Wrangler — Direct connection | Live query at import time; always pulls latest data | When data must always be up to date |
| SageMaker Data Wrangler — Cataloged connection | Uses a snapshot registered in AWS Glue Data Catalog | When consistency matters more than recency |
| SageMaker Feature Store | Persistent, versioned feature repository for training and inference | When features must be shared across models and stay synchronized |
Direct connections — not cataloged connections — ensure always-current data.
When the exam states that "ingested data must always be up to date with the latest changes in the source systems," the answer is a direct connection in SageMaker Data Wrangler.
Cataloged connections reference a snapshot in the Glue Data Catalog, which may lag behind the source. If freshness is the requirement, direct connections are correct.
1.4 Feature Engineering
Feature engineering transforms raw data into the numerical representations that ML algorithms can learn from. The exam heavily tests encoding choices and their implications.
Encoding Techniques:
| Technique | When to Apply | Result |
|---|---|---|
| One-hot encoding | Nominal/categorical features with no inherent order | Adds one binary column per category; increases dimensionality |
| Label (ordinal) encoding | Ordinal categories with a natural ranking | Maps categories to integers; preserves order |
| Binary encoding | High-cardinality nominal categories | Fewer new columns than one-hot; compromise approach |
| Tokenization | Text data for NLP models | Splits text into sub-word or word tokens |
One-hot encoding increases dimensionality; label encoding does not — this distinction drives the correct answer.
When a feature has a clear ordinal relationship (e.g., job_seniority_level: Junior < Senior < Principal), label encoding is appropriate because the integer ordering carries meaning.
When a feature is nominal with no ordering (e.g., location: Berlin, Paris, Tokyo), one-hot encoding is appropriate because you do not want the model to assume Berlin < Paris < Tokyo.
When the question explicitly states "must not increase the dimensionality of the dataset" and the values are binary (yes/no, true/false), label encoding or binary encoding is correct — one-hot encoding would add columns.
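The dimensionality difference is easy to see in a few lines of plain Python. This is a minimal sketch using the seniority and location examples above; the helper names label_encode and one_hot_encode are illustrative and not part of any AWS SDK.

```python
def label_encode(values, order):
    """Map ordinal categories to integers using an explicit ranking."""
    rank = {category: index for index, category in enumerate(order)}
    return [rank[value] for value in values]


def one_hot_encode(values):
    """Return (column names, rows) with one binary column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


seniority = ["Junior", "Principal", "Senior", "Junior"]
locations = ["Berlin", "Tokyo", "Paris", "Berlin"]

# Ordinal feature: still one column, and the integers preserve the ranking.
seniority_codes = label_encode(seniority, order=["Junior", "Senior", "Principal"])

# Nominal feature: one column becomes three, with no implied ordering.
location_columns, location_rows = one_hot_encode(locations)
```

Label encoding leaves the column count unchanged, while one-hot encoding adds a column for every distinct category.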
1.5 Handling Class Imbalance
Class imbalance occurs when one class in the target variable is far more common than another (e.g., 99% legitimate transactions vs. 1% fraudulent). This is one of the most tested topics in the data preparation domain.
| Technique | What It Does | When to Use |
|---|---|---|
| Random oversampling | Duplicates minority class examples | When you cannot generate new data |
| SMOTE (Synthetic Minority Oversampling Technique) | Synthesizes new minority class examples | When you need more diversity in minority class without losing data |
| Random undersampling | Removes majority class examples | Only when you can afford to lose data |
| Class weights / scale_pos_weight | Adjusts loss function to penalize minority class misses | Algorithmic fix; no data modification |
Difference in Proportions of Labels (DPL):
DPL is a pre-training bias metric used to quantify representation imbalance across demographic groups. A DPL value near 0 indicates balanced representation. A high positive DPL (e.g., +0.9) for a specific subgroup means that subgroup is dramatically over-represented in the positive class compared to the rest of the dataset.
When DPL is high for a subgroup, you undersample that subgroup — not the entire class.
A DPL of +0.9 for the age range 40–45 in the positive class means that age group is over-represented relative to all other age ranges. The correct remedy is to undersample the positive class for the age range 40–45 to bring it in line with other groups.
Oversampling the positive class for that age group would worsen the imbalance. Undersampling all other age ranges would lose data unnecessarily. The fix is targeted at the over-represented group only.
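To make the +0.9 figure concrete, the sketch below computes DPL as the difference between the positive-label rates of two groups. The label counts are invented so that the result lands near 0.9; the function names are illustrative.

```python
def positive_rate(labels):
    """Fraction of labels equal to 1."""
    return sum(labels) / len(labels)


def dpl(labels_group_a, labels_group_d):
    """Difference in proportions of positive labels between two groups."""
    return positive_rate(labels_group_a) - positive_rate(labels_group_d)


# Hypothetical labels (1 = positive outcome).
age_40_45 = [1] * 19 + [0] * 1     # 95% positive
all_others = [1] * 5 + [0] * 95    # 5% positive

dpl_value = dpl(age_40_45, all_others)  # roughly 0.9
```

Removing positive examples from the 40–45 group lowers its positive rate and pulls the value back toward 0.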
SMOTE is for balancing classes without losing data; undersampling discards existing data.
When the requirement explicitly states "without losing any existing training data," the correct technique is oversampling (e.g., SMOTE), not undersampling.
Use SageMaker Clarify to detect the imbalance (the CI metric ranges from -1 to 1, and values far from 0 indicate imbalance) and SageMaker Data Wrangler to apply SMOTE.
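The following sketch shows the detect-then-rebalance idea in plain Python. The imbalance score is a normalized difference of counts, the same shape as Clarify's CI. The rebalancing step is random oversampling by duplication, which is simpler than SMOTE and is used here only because it fits in a few lines. The function names and data are assumptions.

```python
import random


def class_imbalance(n_majority, n_minority):
    """Normalized count difference in [-1, 1]; 0 means perfectly balanced."""
    return (n_majority - n_minority) / (n_majority + n_minority)


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(rows, labels) if label == minority_label]
    majority_count = len(labels) - len(minority)
    extra = [rng.choice(minority) for _ in range(majority_count - len(minority))]
    return rows + extra, labels + [minority_label] * len(extra)


rows = list(range(10))
labels = [0] * 8 + [1] * 2  # 8 legitimate, 2 fraudulent

ci_before = class_imbalance(8, 2)
balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
ci_after = class_imbalance(balanced_labels.count(0), balanced_labels.count(1))
```

Every original row survives the rebalancing, which is the property the "without losing any existing training data" wording is testing.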
1.6 PII Detection and Data Masking
| Service | Capability | Limitation |
|---|---|---|
| AWS Glue — Detect PII transform | Find and mask PII in Glue jobs using Spark | Requires writing a Glue job but uses built-in transform |
| Amazon Macie | Discover sensitive data in S3 | Identifies PII but does not mask it in place |
| Amazon Comprehend | Detect entities (names, addresses) in text | NLP-based; does not natively mask |
| Custom Spark/regex on EC2 | Maximum flexibility | Maximum development effort |
Macie discovers PII but does not mask it — it is a detection service, not a transformation service.
Amazon Macie will classify S3 objects containing sensitive data and alert you, but it does not modify the data. If the requirement is to "find and mask" PII, the correct answer involves AWS Glue with the Sensitive Data Detection transform (or a custom Glue job with regex). Macie alone does not satisfy the masking requirement.
1.7 Data Labeling
| Service | Workforce Type | Best For |
|---|---|---|
| SageMaker Ground Truth | Private, public (Mechanical Turk), or vendor | Flexible labeling with custom UIs; requires setup |
| SageMaker Ground Truth Plus | Managed vendor workforce | Fully managed; no app development required |
| Amazon Augmented AI (A2I) | Human review of ML predictions | Reviewing and correcting model outputs; not initial labeling |
| Amazon Mechanical Turk (AMT) | Public crowdworkers via AMT website | Not the AWS ML-integrated path |
Ground Truth and A2I solve different problems in the labeling workflow.
SageMaker Ground Truth is for creating initial labels on raw data — it is used before a model exists, to build the training dataset.
Amazon Augmented AI (A2I) is for human review of predictions that a model has already made — used in production to catch low-confidence outputs and improve them.
When a scenario involves unlabeled data that needs to be annotated for the first time, the answer is Ground Truth. When it involves reviewing model outputs in production, the answer is A2I.
A private workforce in Ground Truth is the correct answer for employee-only access.
When the requirement states that only employees should perform labeling tasks, the answer is to create a private workforce in SageMaker Ground Truth. Amazon Mechanical Turk is a public crowdsourcing platform where random members of the public complete tasks — this does not satisfy any requirement for restricted access.
1.8 Data Quality and Profiling
| Service | Purpose |
|---|---|
| AWS Glue Data Quality | Rule-based data quality checks on Glue Data Catalog tables |
| AWS Glue DataBrew (Profile job) | Statistical profiling — distributions, nulls, correlations |
| AWS Glue DataBrew (Recipe job) | Applying transformations and cleaning steps |
| SageMaker Data Wrangler | Interactive data quality exploration within SageMaker |
DataBrew profile jobs analyze data quality; recipe jobs transform it — they are not the same job type.
A profile job in DataBrew generates statistical summaries: column distributions, null counts, duplicate rates, and correlation matrices. It is a read-only analysis step.
A recipe job applies a sequence of transformation steps (the "recipe") to actually clean, normalize, encode, or filter the data and write the output.
When the requirement is to "clean and normalize" data, the answer is a recipe job, not a profile job.
1.9 SageMaker Feature Store
Feature Store provides a centralized repository for storing, sharing, and serving ML features consistently between training and inference.
| Store Type | Latency | Use Case |
|---|---|---|
| Online store | Low (milliseconds) | Real-time inference; always returns latest feature value |
| Offline store | Higher (batch) | Training jobs; historical snapshots in S3 |
Enabling both online and offline stores is required when you need real-time inference AND batch training consistency.
When a scenario requires that the online store is updated with the most recent data as it arrives AND a complete historical record is maintained for batch processing, you must enable both stores simultaneously.
Using only the online store loses the historical record. Using only the offline store cannot serve real-time inference at low latency. The Feature Store Spark connector supports ingesting data into both stores in a single operation.
Domain 2: ML Model Development
2.1 Choosing the Right Algorithm
The exam presents business scenarios and asks you to select the appropriate SageMaker built-in algorithm. The key is mapping problem type to algorithm family.
| Algorithm | Problem Type | Key Signal Words |
|---|---|---|
| XGBoost | Classification, regression, ranking | Tabular data; "predict likelihood"; "loan ranking" |
| Linear Learner | Binary/multi-class classification, regression | Linear relationships; interpretability |
| DeepAR | Time series forecasting | "Historical measurements at regular intervals"; "predict future values" |
| Random Cut Forest (RCF) | Anomaly detection | Unlabeled data; "detect irregularities"; "unexpected values" |
| K-means | Clustering (unsupervised) | "Group similar items"; no labels |
| K-nearest neighbors (k-NN) | Classification/regression (supervised) | "Based on similar examples"; labeled training data |
| PCA | Dimensionality reduction | "Reduce features"; "high dimensionality" |
| Factorization Machines | Recommendations on sparse data | "High-dimensional sparse data"; collaborative filtering |
| Neural Topic Model (NTM) | Topic modeling | "Discover topics in text documents" |
| Seq2Seq | Sequence-to-sequence | Machine translation, text summarization |
| BlazingText | Word2Vec / text classification | Word embeddings; fast text classification |
DeepAR requires specific hyperparameters — context_length and prediction_length must match your forecasting horizon.
The context_length controls how much historical data the model looks at to make each prediction. The prediction_length controls how far into the future the model predicts. Both must be set to match the time granularity and horizon of your problem. Setting scale_pos_weight (an XGBoost hyperparameter) on a DeepAR job, or applying k-means clustering to a forecasting problem, are common wrong answers the exam uses as distractors.
RCF is the anomaly detection algorithm; k-means is a clustering algorithm — they are not interchangeable even though both are unsupervised.
Random Cut Forest is explicitly designed to score data points by their anomaly likelihood. It works on unlabeled data and assigns an anomaly score to each input — higher scores indicate anomalies.
K-means groups data into clusters based on similarity. It does not produce an anomaly score and is not designed to flag outliers.
When the scenario involves an unlabeled dataset and asks to "detect irregularities," "identify unusual patterns," or "find outliers," the answer is RCF. The num_trees hyperparameter controls model stability (more trees = more stable scores).
Factorization Machines, not k-NN, is correct for high-dimensional sparse recommendation data.
Recommendation systems typically involve user-item interaction matrices that are extremely sparse — most users have interacted with only a tiny fraction of all items. Factorization Machines are designed for exactly this use case and handle sparse, high-dimensional data efficiently.
K-NN struggles with sparse high-dimensional spaces (the curse of dimensionality) and is not the correct choice for recommendation problems at scale.
2.2 Overfitting, Underfitting, and Regularization
| Problem | Symptom | Root Cause | Solution |
|---|---|---|---|
| Overfitting | High training accuracy, low validation accuracy | Model too complex; memorizing noise | Regularization (L1/L2), dropout, k-fold CV, more data |
| Underfitting | Low accuracy on both training and validation | Model too simple; not learning patterns | More features, more epochs, less regularization, more complex model |
L1 vs. L2 Regularization:
| Regularization | Effect | Key Benefit |
|---|---|---|
| L1 (Lasso) | Drives some feature weights to exactly zero | Automatic feature selection; removes irrelevant features |
| L2 (Ridge) | Shrinks all feature weights toward zero but not to zero | Distributes weight reduction across all features |
L1 regularization is the correct answer when the problem is both overfitting AND unnecessary features.
When a scenario states that the model is overfitting AND the training data contains unnecessary features that should be removed, L1 regularization is the correct choice. L1 performs implicit feature selection by zeroing out the weights of irrelevant features, addressing both problems simultaneously.
L2 reduces weights but does not eliminate features. Increasing training iterations worsens overfitting. Decreasing iterations may underfit. SageMaker Debugger can help profile training but does not apply regularization to a live model.
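The zeroing behavior can be seen in a single shrinkage step. This is a simplified sketch, not a training loop: it applies soft-thresholding (the proximal step for an L1 penalty) and uniform scaling (the proximal step for a squared L2 penalty) to an invented weight vector.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding: move each weight toward zero by `strength`, clamping at zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


def l2_shrink(weights, strength):
    """Ridge-style shrinkage: scale every weight by the same factor 1 / (1 + strength)."""
    return [w / (1.0 + strength) for w in weights]


weights = [2.0, -0.3, 0.1, -1.5]  # invented values; two are small
l1_result = l1_shrink(weights, strength=0.5)
l2_result = l2_shrink(weights, strength=0.5)
```

With these numbers the two small weights become exactly zero under L1, while L2 leaves all four weights nonzero.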
2.3 Evaluation Metrics
| Metric | Problem Type | When It Matters Most |
|---|---|---|
| Accuracy | Classification | Balanced class distributions only |
| Precision | Classification | Cost of false positives is high (e.g., spam detection) |
| Recall | Classification | Cost of false negatives is high (e.g., fraud, disease detection) |
| F1 Score | Classification | Imbalanced classes; need balance of precision and recall |
| AUC-ROC | Binary classification | Evaluating classifier performance across all thresholds |
| RMSE | Regression | Numeric prediction; penalizes large errors more |
| MAE | Regression | Numeric prediction; equal weight to all errors |
Recall is the correct metric when the business cannot afford to miss positive cases.
Recall = (True Positives) / (True Positives + False Negatives). It measures how many of the actual positives the model correctly identified.
In fraud detection, the cost of missing a fraudulent transaction (false negative) is far higher than incorrectly flagging a legitimate one (false positive). Therefore, maximizing recall — catching as many fraudulent transactions as possible — is the correct objective. Optimizing for precision would minimize false alarms but allow more fraud to pass through undetected.
The same logic applies to medical diagnosis: missing a positive case (disease) is typically far more costly than over-investigating a healthy patient.
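A short worked example helps fix the formulas. The confusion-matrix counts below are invented for a hypothetical fraud model; the functions are the standard definitions of precision, recall, and F1.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model caught."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts: 50 fraudulent transactions, 40 caught, 60 false alarms.
tp, fn, fp = 40, 10, 60

model_precision = precision(tp, fp)  # 0.4
model_recall = recall(tp, fn)        # 0.8
model_f1 = f1_score(tp, fp, fn)
```

This model raises many false alarms (low precision) but misses few fraud cases (high recall), which is the trade-off a fraud scenario usually favors.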
For continuous numeric predictions, RMSE is correct — not accuracy or AUC.
Accuracy, F1, AUC, and precision/recall are all classification metrics. They require a discrete class label prediction. When the model output is a continuous number (home price, sales forecast, patient measurement), these metrics are inapplicable.
RMSE (Root Mean Square Error) measures the average deviation of predictions from actual values, with larger errors penalized more heavily. For models that "make continuous numeric predictions," RMSE is the correct evaluation metric, not accuracy.
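The "penalizes large errors more" property can be checked with two invented prediction sets that have the same MAE. This is a minimal sketch with made-up house prices in thousands.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [200.0, 300.0, 400.0, 500.0]
small_errors = [210.0, 290.0, 410.0, 490.0]   # every prediction off by 10
one_big_error = [200.0, 300.0, 400.0, 540.0]  # one prediction off by 40

mae_small, mae_big = mae(actual, small_errors), mae(actual, one_big_error)
rmse_small, rmse_big = rmse(actual, small_errors), rmse(actual, one_big_error)
```

Both sets have an MAE of 10, but the RMSE doubles from 10 to 20 when the same total error is concentrated in one prediction.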
2.4 Bias Detection with SageMaker Clarify
SageMaker Clarify provides pre-training and post-training bias metrics and model explainability reports.
Pre-training Bias Metrics:
| Metric | Measures |
|---|---|
| Class Imbalance (CI) | Whether classes in the target variable are proportionally balanced |
| Difference in Proportions of Labels (DPL) | Whether the positive class rate differs across demographic groups |
| KL Divergence / JS Divergence | Statistical difference between label distributions of subgroups |
DPL, not MSE or SSIM, is the correct metric to confirm class imbalance across demographic groups.
DPL (Difference in Proportions of Labels) is specifically designed as a pre-training bias metric. A DPL value significantly different from 0 indicates that the positive class is not evenly distributed across subgroups — confirming bias in the dataset.
MSE (Mean Square Error) is a regression error metric — it has nothing to do with bias detection.
SSIM (Structural Similarity Index Measure) is an image quality metric — also unrelated.
Silhouette score is a clustering quality metric — unrelated to bias.
When the question asks for a pre-training bias metric to confirm class imbalance, the answer is DPL.
2.5 Hyperparameter Tuning and Training Optimization
Key learning rate guidance:
- If accuracy is not increasing and loss is decreasing very slowly with SGD, the learning rate is too low — increase it to escape the slow convergence plateau.
- If loss is oscillating wildly and not converging, the learning rate is too high — decrease it.
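Both failure modes can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2, which stands in for a real loss surface; the three learning rates are chosen only to show the contrast.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimize f(x) = x^2 (gradient 2x) and return the loss after each step."""
    x = start
    losses = []
    for _ in range(steps):
        x -= learning_rate * 2 * x
        losses.append(x * x)
    return losses


too_low = gradient_descent(0.001)    # loss barely moves
reasonable = gradient_descent(0.1)   # loss converges toward 0
too_high = gradient_descent(1.1)     # x overshoots and flips sign; loss grows
```

With the lowest rate the loss is still above 90 after 20 steps, and with the highest rate each step overshoots the minimum so the loss climbs instead of converging.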
Distributed training with SageMaker:
| Technique | When to Use |
|---|---|
| Distributed Data Parallel (DDP) | Large datasets that don't fit on one GPU; split data across instances |
| Model Parallel | Models too large to fit in one GPU's memory; split model layers |
| Increase number of instances | Primary lever to reduce wall-clock training time for DDP jobs |
To reduce DDP training time, increase the number of instances — not epochs, neurons, or layers.
When a training job uses distributed data parallelism across instances and the goal is to decrease training time, the correct action is to increase the number of instances. More instances means each instance processes a smaller shard of the data, so wall-clock time drops roughly in proportion to the instance count, less the overhead of synchronizing gradients between instances.
Adding epochs increases training time. Adding neurons or layers increases model complexity and per-instance compute time but does not leverage distributed parallelism.
2.6 SageMaker Training Methods
| Method | Purpose |
|---|---|
| estimator.fit() | Initiates a training job with a dataset |
| estimator.deploy() | Deploys a trained model to an endpoint |
| predictor.predict() | Calls inference on a deployed endpoint |
| estimator.create_model() | Creates a model object without deploying it |
fit() starts training; deploy() starts hosting — they are sequential, not interchangeable.
The SageMaker SDK workflow is: (1) configure an Estimator, (2) call .fit() to train the model, (3) call .deploy() to host it, (4) call .predict() on the resulting predictor.
Calling .deploy() before .fit() will fail because there is no trained model artifact yet. Calling .predict() without a deployed endpoint will also fail. Questions that ask how to "initiate a training job" are asking for .fit().
2.7 Experiment Tracking and Model Management
| Service | Purpose |
|---|---|
| SageMaker Experiments | Track and compare training runs: hyperparameters, metrics, artifacts |
| SageMaker Debugger / Profiler | Identify bottlenecks in training: GPU utilization, CPU usage, memory |
| SageMaker Model Registry | Version, stage, and deploy approved models |
| SageMaker Profiler annotations | Mark specific code sections to measure resource usage during training |
SageMaker Experiments is the correct answer for logging training run characteristics — not CloudWatch, S3, or Model Registry.
When an ML engineer needs to track hundreds of training iterations with different hyperparameters, features, and algorithms and compare their results, the answer is SageMaker Experiments. It is purpose-built for this workflow and requires minimal implementation effort — no custom metrics, no S3 log parsing, and no extra infrastructure.
CloudWatch can surface metrics but is not designed for systematic experiment comparison. Model Registry manages approved models for deployment — it is not the right tool for logging in-progress experiment results.
SageMaker Profiler annotations are the right tool to pinpoint where in the training script resources are underutilized.
When GPU utilization is low and the engineer needs to find the specific bottleneck in the training script, the answer is to add SageMaker Profiler annotations to the code and generate a profiler report. This instruments the script at the function level and shows exactly where compute time is being spent.
CloudWatch reports coarser-grained metrics at the instance or endpoint level, and CloudTrail records API calls rather than performance data. Neither can pinpoint specific lines of Python training code.
Domain 3: Deployment and Orchestration of ML Workflows
3.1 SageMaker Inference Types
Choosing the right inference type is one of the highest-weighted topics in this domain. The decision depends on four factors: latency requirement, payload size, traffic pattern, and cost sensitivity.
| Inference Type | Latency | Max Payload | Traffic Pattern | Idle Cost |
|---|---|---|---|---|
| Real-time | Milliseconds | 6 MB | Consistent, predictable | Always running |
| Serverless | Seconds (cold start) | 4 MB | Sporadic, unpredictable | Zero when idle |
| Asynchronous | Minutes (queued) | 1 GB | Large, intermittent | Auto-scales to zero |
| Batch Transform | Hours (scheduled) | Entire dataset | Bulk, offline | Runs only during job |
Asynchronous inference is the correct answer for large payloads (>6 MB) with processing times up to 60 minutes.
The exam frequently presents scenarios with payload sizes of 100–300 MB and processing times measured in minutes. Asynchronous inference is the only SageMaker inference type designed for this use case. It queues requests, processes them, and stores results in S3 — the client polls for completion.
Real-time inference has a hard 6 MB payload limit. Serverless inference has a 4 MB limit and is designed for sporadic low-volume workloads, not large payloads. Batch transform is for processing entire datasets at once, not individual requests.
Serverless inference is the most cost-effective option for intermittent CPU-based workloads with periods of zero traffic.
When traffic is intermittent (e.g., only during business hours) and the model runs on CPU, Serverless Inference bills only for the compute time used to process requests, so there is no cost during idle periods. It is more cost-effective than a real-time endpoint, which bills by the hour even when idle.
Asynchronous inference can also scale to zero, but it is designed for large payloads and long processing times — not for low-latency interactive predictions.
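The selection logic in this section can be condensed into a small decision function. This is a study aid built from the payload limits in the table above, not an AWS API; the function name and parameters are hypothetical.

```python
def choose_inference_type(payload_mb, per_request=True, sporadic_traffic=False):
    """Pick a SageMaker inference option from payload size and traffic pattern."""
    if not per_request:
        # Scoring a whole dataset offline rather than answering individual requests.
        return "batch transform"
    if payload_mb > 1024:
        raise ValueError("payload exceeds the 1 GB asynchronous limit")
    if payload_mb > 6:
        # Too large for real-time (6 MB) and serverless (4 MB).
        return "asynchronous"
    if sporadic_traffic and payload_mb <= 4:
        return "serverless"
    return "real-time"


large_document = choose_inference_type(200)
business_hours_only = choose_inference_type(1, sporadic_traffic=True)
steady_traffic = choose_inference_type(1)
nightly_scoring = choose_inference_type(0, per_request=False)
```

Real exam questions add latency and cost wording on top of these limits, so treat the function as a first-pass filter.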
3.2 Endpoint Strategies for Multiple Models
| Strategy | Use Case | Cost Model |
|---|---|---|
| Multi-model endpoint (MME) | Many models sharing one container (same framework) | Pay for one endpoint; models loaded on-demand |
| Multi-container endpoint (MCE) | A few models with different frameworks; sequential or direct invocation | Pay for one endpoint; containers always loaded |
| Single-model endpoint | One model per endpoint | One endpoint per model |
| Inference pipeline | Sequential preprocessing → model → postprocessing steps | Chained containers in one endpoint |
Multi-model endpoints and multi-container endpoints solve different problems.
A multi-model endpoint hosts many models (potentially hundreds) that share the same framework and container. Models are dynamically loaded and unloaded into memory based on traffic. This is the most cost-effective option when you have many models built on the same framework.
A multi-container endpoint hosts a small number of models with different frameworks in separate containers within one endpoint. Use this when models cannot share a container because they need different runtimes, but you still want to consolidate onto one endpoint.
When the question says "multiple models, same framework," MME is the answer. When it says "different frameworks," MCE is the answer.
3.3 Auto Scaling for SageMaker Endpoints
High-resolution metrics + appropriate cooldown periods are required for rapid scaling response.
For endpoints that must scale quickly in response to sudden traffic changes:
- Use high-resolution CloudWatch metrics (10-second intervals) rather than standard 60-second intervals — this gives the scaling policy faster feedback.
- Use the InvocationsPerInstance metric as the target tracking metric.
- Set a longer scale-in cooldown (e.g., 600 seconds) to prevent premature scaling down after a traffic spike subsides.
A shorter scale-in cooldown combined with standard metrics results in unstable scaling behavior — the endpoint may scale in too aggressively before traffic has truly subsided.
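The cooldown settings above map onto the target tracking configuration that Application Auto Scaling accepts for a SageMaker variant. The sketch below only builds the configuration dictionaries, so it runs without AWS credentials. The endpoint and variant names and the target value are hypothetical. It uses the predefined standard-resolution metric type; a high-resolution metric would need a different metric specification that is not shown here.

```python
def variant_resource_id(endpoint_name, variant_name):
    """Resource ID format that Application Auto Scaling uses for an endpoint variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def target_tracking_config(target_invocations, scale_in_cooldown=600, scale_out_cooldown=60):
    """Build a TargetTrackingScalingPolicyConfiguration dict for a SageMaker variant."""
    return {
        "TargetValue": float(target_invocations),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": scale_in_cooldown,    # seconds to wait before removing capacity
        "ScaleOutCooldown": scale_out_cooldown,  # seconds to wait before adding more capacity
    }


# Hypothetical names and target value, for illustration only.
resource_id = variant_resource_id("demo-endpoint", "AllTraffic")
config = target_tracking_config(target_invocations=100)
```

With boto3, these values would typically be passed to the application-autoscaling client's put_scaling_policy call, along with the sagemaker:variant:DesiredInstanceCount scalable dimension.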
When endpoints scale to zero instances overnight, apply a scheduled scale-out policy to pre-warm before business hours.
Auto scaling can reduce the instance count to zero during idle periods, which reduces cost. However, when traffic arrives at the start of business hours, there are zero instances available to handle it, causing delays while new instances spin up.
The solution is a scheduled scaling action (configured through Application Auto Scaling) that raises the minimum instance count from 0 to 1 or more before business hours begin, so the endpoint is ready when traffic arrives. A CloudWatch alarm with a step scaling policy reacts only after requests have already arrived, so it cannot pre-warm the endpoint.
3.4 Deployment Strategies
| Strategy | Purpose | When to Use |
|---|---|---|
| Blue/Green deployment | Zero-downtime cutover; full traffic shift | When you want a clean switch with rollback capability |
| Shadow variant | Test a new model on live traffic without affecting user responses | When comparing new model performance to current model before promoting |
| Canary deployment | Gradually shift a percentage of traffic to new model | When you want a controlled incremental rollout |
| A/B test (production variant) | Split traffic between model versions with metrics | When comparing multiple model variants in production |
Shadow variants are the least-effort path to comparing a new model against a production model.
A shadow variant receives a copy of live production traffic and processes it, but its responses are not returned to users. This allows you to measure the new model's performance on real traffic with zero risk to the user experience.
To deploy a shadow variant: deploy the new model as a shadow variant on the same endpoint as the current model. Route a configured percentage of live traffic to it. Evaluate its outputs against the production model's outputs.
This requires less operational effort than deploying to a separate endpoint, managing DNS routing, or writing custom Lambda routing logic.
3.5 Model Registry and Versioning
| Service | Purpose |
|---|---|
| SageMaker Model Registry | Version control and lifecycle management for trained models; supports approval workflows |
| SageMaker Inference Recommender | Benchmarks model performance across EC2 instance types; recommends best-fit instance |
| Amazon ECR | Container image registry; stores Docker images but does not manage ML model versions |
SageMaker Inference Recommender, not Autopilot or Compute Optimizer, provides instance type recommendations for hosting.
When the requirement is to determine the best EC2 instance type for hosting an ML model on SageMaker, the answer is SageMaker Inference Recommender. It runs load tests against multiple instance configurations and ranks them by performance and cost.
SageMaker Autopilot is an AutoML service for automatically building and training models — it does not recommend hosting instance types.
AWS Compute Optimizer makes EC2 sizing recommendations for general workloads based on CloudWatch metrics — not for SageMaker model hosting.
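A minimal sketch of what an Inference Recommender request might look like, assuming the model is already registered as a model package. The job name and both ARNs are hypothetical; the dictionary matches the shape accepted by the SageMaker `create_inference_recommendations_job` API and is only built here, not submitted.

```python
def recommender_job_request(job_name, role_arn, model_package_arn):
    """Build kwargs for SageMaker create_inference_recommendations_job."""
    return {
        "JobName": job_name,
        "JobType": "Default",  # "Advanced" allows custom load tests
        "RoleArn": role_arn,
        "InputConfig": {"ModelPackageVersionArn": model_package_arn},
    }


# Hypothetical identifiers, for illustration only.
request = recommender_job_request(
    "fraud-model-instance-sizing",
    "arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    "arn:aws:sagemaker:us-east-1:111122223333:model-package/fraud/1",
)
```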
3.6 Amazon Bedrock Deployment
| Concept | Detail |
|---|---|
| Custom model import | Import Hugging Face or other models into Bedrock for API access |
| S3 URI requirement | Model files must be in S3 in the same AWS account as the Bedrock import job |
| VPC endpoint (PrivateLink) | Connect EC2 in private subnets to Bedrock without traversing the internet |
| Bedrock Knowledge Bases | Fully managed RAG; supports S3 data sources; default vector store is OpenSearch Serverless |
PrivateLink (interface VPC endpoint) is the correct solution for private subnet EC2 to Bedrock connectivity.
When EC2 instances are in a private subnet and must remain private while calling the Amazon Bedrock API, the answer is AWS PrivateLink — specifically an interface VPC endpoint for Bedrock. This routes all API traffic through the AWS private network without requiring internet access.
A NAT gateway would route traffic to the internet (against the requirement). AWS Direct Connect connects on-premises networks, not VPC-to-service traffic. Modifying Bedrock to use a private subnet is not possible — Bedrock is a managed service that the user cannot configure at the network level.
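The interface endpoint described above can be sketched as a request to the EC2 `create_vpc_endpoint` API. The VPC, subnet, and security group IDs are hypothetical; the service name follows the `com.amazonaws.<region>.bedrock-runtime` pattern used for Bedrock model invocation. The dictionary is only built, not sent.

```python
def bedrock_interface_endpoint(region, vpc_id, subnet_ids, security_group_ids):
    """Build kwargs for EC2 create_vpc_endpoint (interface endpoint for Bedrock)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.bedrock-runtime",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Lets the default Bedrock hostname resolve to the endpoint's private IPs.
        "PrivateDnsEnabled": True,
    }


# Hypothetical resource IDs, for illustration only.
endpoint_request = bedrock_interface_endpoint(
    "us-east-1", "vpc-0abc", ["subnet-0aaa", "subnet-0bbb"], ["sg-0ccc"]
)
```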
RAG via Bedrock Knowledge Bases is the least-operational-overhead path — not fine-tuning, not Neptune.
When the requirement is to supplement an LLM with documents from S3 using RAG, the answer is to create a Knowledge Base for Amazon Bedrock and configure the S3 bucket as a data source. Bedrock handles chunking, embedding, vector storage (OpenSearch Serverless by default), and retrieval automatically.
Fine-tuning (AutoML or SageMaker Pipelines) updates model weights — it does not inject external documents at query time. Amazon Neptune is a graph database — not a managed vector store for RAG.
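Once a Knowledge Base exists, querying it is a single call. The sketch below builds the request for the `retrieve_and_generate` operation of the `bedrock-agent-runtime` client; the knowledge base ID and model ARN are hypothetical placeholders, and nothing is sent to AWS.

```python
def rag_request(question, knowledge_base_id, model_arn):
    """Build kwargs for bedrock-agent-runtime retrieve_and_generate."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


# Hypothetical identifiers, for illustration only.
request = rag_request(
    "What is the refund policy?",
    "KBEXAMPLE01",
    "arn:aws:bedrock:us-east-1::foundation-model/example-model-id",
)
```

Retrieval happens at query time, so updating the S3 documents and re-syncing the data source changes the answers without touching model weights.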
3.7 Networking and Privacy
| Requirement | Solution |
|---|---|
| Network isolation from internet during training | Private subnet + S3 gateway VPC endpoint |
| Private IP addresses only | Private subnet + VPC endpoint (no NAT gateway) |
| Opt out of SageMaker metadata collection | Set OPT_OUT_TRACKING environment variable or opt-out parameter per training job |
| Cross-account S3 access | S3 bucket policy granting the other account access, plus matching IAM permissions on the role that reads the data |
Network isolation requires a private subnet AND an S3 VPC gateway endpoint — not a NAT gateway.
To ensure SageMaker training jobs have no internet connectivity while still accessing S3:
- Run training jobs in private subnets (no route to internet gateway).
- Create an S3 gateway VPC endpoint so traffic to S3 stays on the AWS private network.
A NAT gateway routes traffic to the internet — this breaks the network isolation requirement. A public subnet with security group inbound rules still exposes the training job to the internet on the outbound path.
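The two pieces above can be sketched as request fragments: one for the EC2 `create_vpc_endpoint` API (gateway type, attached to the private subnets' route tables) and one for the `VpcConfig` field of a SageMaker training job. All resource IDs are hypothetical, and the dictionaries are only built, not sent.

```python
def s3_gateway_endpoint(region, vpc_id, route_table_ids):
    """Build kwargs for EC2 create_vpc_endpoint (S3 gateway endpoint)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Gateway endpoints attach to route tables, not to subnets directly.
        "RouteTableIds": list(route_table_ids),
    }


def training_vpc_config(private_subnet_ids, security_group_ids):
    """Build the VpcConfig block of a SageMaker create_training_job request."""
    return {
        "Subnets": list(private_subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


# Hypothetical resource IDs, for illustration only.
endpoint = s3_gateway_endpoint("us-east-1", "vpc-0abc", ["rtb-0private"])
vpc_config = training_vpc_config(["subnet-0private"], ["sg-0training"])
```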
3.8 Custom Containers and Infrastructure
| Approach | When to Use | Provisioning Method |
|---|---|---|
| BYOC (Bring Your Own Container) | Custom libraries, frameworks not in SageMaker containers | Docker image in ECR |
| SageMaker pre-built containers | Standard frameworks (PyTorch, TensorFlow, sklearn) | Built-in; no Docker work needed |
| EKS with CDK | Kubernetes; managed control plane; Python provisioning | AWS CDK (Python) |
| EKS with CloudFormation | Kubernetes; managed control plane; declarative templates | CloudFormation (YAML/JSON) |
AWS CDK is the correct answer when the infrastructure must be provisioned using Python specifically.
When requirements specify (1) no managing the Kubernetes control plane, (2) using Kubernetes, and (3) Python for provisioning, the answer is AWS CDK to provision an Amazon EKS cluster. CDK supports Python natively, and EKS supplies the managed control plane, so there are no control plane nodes to operate.
AWS CloudFormation can also provision EKS but uses YAML/JSON templates, not Python. The AWS CLI is not a repeatable provisioning method — it is imperative, not declarative. EC2-based self-managed Kubernetes requires managing the control plane.
Domain 4: ML Solution Monitoring, Maintenance, and Security
4.1 SageMaker Model Monitor
Model Monitor continuously evaluates deployed models against a baseline and raises alerts when violations are detected.
| Monitor Type | What It Detects | Baseline Required |
|---|---|---|
| Data Quality Monitor | Shifts in input feature distributions (data drift) | Statistical baseline from training data |
| Model Quality Monitor | Degradation in prediction accuracy/performance | Ground truth labels + prediction logs |
| Bias Drift Monitor | Changes in model fairness metrics over time | Clarify bias baseline |
| Feature Attribution Drift | Changes in which features drive predictions | Clarify SHAP baseline |
Correct sequence for setting up Model Monitor:
- Create a baseline from training/validation data (generates constraints and statistics)
- Create a monitoring job that compares live inference data against the baseline
The baseline must be created BEFORE the monitoring job — not after.
A monitoring job in SageMaker Model Monitor compares live inference data against a pre-established baseline. If the monitoring job is configured before the baseline exists, it has nothing to compare against and will not produce meaningful alerts.
The correct order is always: (1) establish the baseline from training data, (2) configure the monitoring schedule.
Data drift requires a Data Quality Monitor with a data quality baseline — not a model quality baseline.
When the requirement is to "detect changes in the input data distribution of model features" (i.e., data drift), the answer is to configure a Data Quality Monitor and establish a data quality baseline. The baseline captures the statistical distribution of features at training time, and the monitor alerts when live data deviates.
A model quality baseline monitors output accuracy — it requires ground truth labels to compare predictions against. It does not detect input distribution shifts.
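To make the baseline-then-monitor order concrete, here is a deliberately simplified stand-in for a data quality check. It is not Model Monitor's actual algorithm (which computes richer statistics and constraints); the mean-shift rule and the threshold of three standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def build_baseline(training_values):
    """Step 1: summarize one feature's distribution at training time."""
    return {"mean": mean(training_values), "std": stdev(training_values)}


def drift_violation(baseline, live_values, max_shift_in_std=3.0):
    """Step 2: flag live data whose mean moved too far from the baseline."""
    shift = abs(mean(live_values) - baseline["mean"])
    return shift > max_shift_in_std * baseline["std"]


# The baseline must exist before any live batch can be checked against it.
baseline = build_baseline([10.0, 11.0, 9.0, 10.5, 9.5])
stable = drift_violation(baseline, [10.2, 9.8, 10.1])    # False
drifted = drift_violation(baseline, [25.0, 26.0, 24.5])  # True
```

Note that the check uses only input features. No ground truth labels are involved, which is what separates data quality monitoring from model quality monitoring.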
4.2 Automated Retraining Pipelines
Model Monitor + Lambda is the standard pattern for detecting drift and automatically triggering retraining.
When a pipeline must automatically initiate retraining upon detecting data drift:
- SageMaker Model Monitor detects the drift and publishes a CloudWatch metric violation.
- An AWS Lambda function (triggered by the CloudWatch alarm) initiates a new SageMaker training job or kicks off a SageMaker Pipeline.
AWS Glue, Apache Flink, and QuickSight are not designed for model drift detection. Step Functions can orchestrate retraining but adds complexity — Lambda is the simplest trigger mechanism.
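A minimal sketch of the Lambda side of this pattern follows. It assumes the function receives an EventBridge "CloudWatch Alarm State Change" event (state under `detail.state.value`) and that a SageMaker Pipeline with the hypothetical name `retraining-pipeline` already exists. The client is injectable so the logic can be exercised without AWS access.

```python
def handler(event, context, sagemaker_client=None,
            pipeline_name="retraining-pipeline"):
    """Start a retraining pipeline when the drift alarm enters ALARM state."""
    state = event.get("detail", {}).get("state", {}).get("value")
    if state != "ALARM":
        # Ignore OK and INSUFFICIENT_DATA transitions.
        return {"started": False}

    if sagemaker_client is None:
        import boto3  # imported lazily; available in the Lambda runtime
        sagemaker_client = boto3.client("sagemaker")

    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name
    )
    return {"started": True, "execution_arn": response["PipelineExecutionArn"]}
```

Lambda calls the handler with `event` and `context` only, so the extra keyword arguments fall back to their defaults in production.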
After retraining on new data, the Model Monitor baseline must also be updated.
When a model is retrained on new data, the reference distribution that describes "normal" inputs changes with it. If Model Monitor keeps using the baseline from the old training data, it will flag input patterns that are legitimate for the retrained model as violations.
After retraining: (1) retrain with new data, (2) reestablish the baseline from the new training data, (3) update the monitoring job to use the new baseline. Reusing the original baseline with newly trained models introduces false alerts and undermines the monitoring system.
4.3 Identifying the Root Cause of Performance Degradation
| Symptom | Root Cause | Solution |
|---|---|---|
| Model accuracy degrades gradually over time | Data drift — production data distribution shifted | Retrain on recent data; update monitoring baseline |
| Model accuracy degrades suddenly after a deployment | Bug in new code or data pipeline | Roll back deployment; debug pipeline |
| High training accuracy, low validation accuracy | Overfitting | Regularization, more data, cross-validation |
| Low accuracy on both training and validation | Underfitting | More complex model, more features, more training |
| Accuracy suddenly drops after months of stability | Concept drift or data pipeline failure | Investigate data freshness; check upstream pipelines |
Data drift in production — not overfitting — is the correct cause when a well-performing model degrades over time.
A model that has been deployed and performing well for months with metrics above thresholds is, by definition, not overfitting at deployment time. Overfitting is a training-time phenomenon.
When performance degrades after a period of stable production operation, the cause is almost always drift in the production data distribution — the real-world inputs have changed in ways that were not represented in the original training data. The fix is retraining on current data, not debugging the original model.
4.4 Content Moderation and Real-Time Video Analysis
Amazon Rekognition + Lambda is the least-overhead solution for real-time video frame content moderation.
Amazon Rekognition includes built-in content moderation capabilities: it can detect inappropriate content categories (explicit nudity, violence, etc.) in images and video frames with a single API call.
Using a custom SageMaker model requires training, hosting, and maintaining a computer vision model, which is significant operational overhead. Transcribe + Comprehend analyzes audio and text, not visual content. With Rekognition + Lambda, a Lambda function extracts frames and passes them to Rekognition for analysis, so the whole path stays serverless and managed.
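The per-frame call can be sketched as below, using the Rekognition `detect_moderation_labels` operation. Frame extraction is assumed to happen elsewhere, the 80% confidence threshold is an arbitrary choice, and the client is passed in so the function can be tried with a stub.

```python
def moderate_frame(frame_bytes, rekognition_client, min_confidence=80.0):
    """Return the names of moderation labels detected in one image frame."""
    response = rekognition_client.detect_moderation_labels(
        Image={"Bytes": frame_bytes},
        MinConfidence=min_confidence,
    )
    return [label["Name"] for label in response["ModerationLabels"]]


def is_frame_allowed(frame_bytes, rekognition_client):
    """A frame passes moderation only when no labels come back."""
    return not moderate_frame(frame_bytes, rekognition_client)
```

In a Lambda function, `rekognition_client` would be `boto3.client("rekognition")`.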
4.5 Security and Access Control
| Requirement | Solution |
|---|---|
| Block traffic from a specific IP | Network ACL inbound rule (deny rule); security groups only support allow rules |
| Prevent misuse of presigned URLs from outside a VPN | IAM condition aws:sourceVpc validation |
| Restrict presigned URLs to specific IP range | IAM condition aws:sourceIp validation |
| SageMaker training jobs with KMS-encrypted S3 | IAM execution role must have kms:Encrypt and kms:Decrypt permissions |
| Opt out of SageMaker metadata collection | OPT_OUT_TRACKING env variable or per-job opt-out parameter |
Security groups cannot deny traffic — only Network ACLs can create deny rules.
AWS security groups are stateful and support allow rules only. You cannot create a deny rule in a security group. To block traffic from a specific IP address, you must create a Network ACL (NACL) deny rule for the specific IP and apply it to the subnet.
VPC route tables do not support deny rules either — they determine routing paths, not access control decisions. Creating a shadow variant to redirect traffic is not a security mechanism.
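A sketch of the deny rule as a request to the EC2 `create_network_acl_entry` API. The NACL ID, IP address, and rule number are hypothetical. NACL rules are evaluated in ascending rule-number order and the first match wins, so the deny entry needs a lower number than any allow rule that would otherwise match.

```python
def deny_ip_entry(network_acl_id, ip_address, rule_number=90):
    """Build kwargs for EC2 create_network_acl_entry that block one IPv4 address."""
    return {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,      # must sort before the matching allow rules
        "Protocol": "-1",               # all protocols
        "RuleAction": "deny",
        "Egress": False,                # inbound rule
        "CidrBlock": f"{ip_address}/32",
    }


# Hypothetical values, for illustration only.
entry = deny_ip_entry("acl-0example", "203.0.113.25")
```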
KMS-encrypted S3 access failures after switching from SSE-S3 to SSE-KMS require adding KMS permissions to the execution role.
When S3 buckets use SSE-S3, SageMaker training jobs can access them with standard S3 permissions. When you switch to SSE-KMS, the SageMaker execution role must also have kms:Encrypt and kms:Decrypt permissions on the specific KMS key, along with kms:GenerateDataKey, which S3 calls when it writes SSE-KMS objects. Without these, S3 reads and writes fail with AccessDenied.
Adding s3:ListBucket does not resolve KMS access errors. Updating the aws:SecureTransport bucket policy condition enforces HTTPS but does not grant KMS permissions.
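A sketch of the policy statement that would be attached to the execution role, scoped to a single key. The key ARN is a hypothetical placeholder, and the action list follows the permissions named above.

```python
import json


def kms_access_policy(key_arn):
    """Build an IAM policy document granting KMS use on one specific key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:GenerateDataKey",
                ],
                "Resource": key_arn,  # scope to the bucket's key, not "*"
            }
        ],
    }


# Hypothetical key ARN, for illustration only.
policy = kms_access_policy(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
)
policy_json = json.dumps(policy, indent=2)
```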
aws:sourceVpc validates the VPC origin of a request; aws:sourceIp validates the IP address.
When a company reaches SageMaker Studio notebooks over a VPN into its VPC and needs presigned URLs to work only for requests that originate from that VPC, the correct IAM policy condition is aws:sourceVpc. This ensures that even if a presigned URL is leaked externally, it cannot be used from outside the authorized VPC.
aws:sourceIp restricts access to specific IP ranges — appropriate when the restriction is IP-based rather than VPC-based. aws:PrincipalTag is for attribute-based access control tied to IAM principals, not network origin.
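The difference between the two keys shows up in the condition block of the policy. The sketch below assumes the restriction is placed on the `sagemaker:CreatePresignedDomainUrl` action, and the VPC ID and CIDR range are hypothetical. IAM condition key names are case-insensitive, so `aws:SourceVpc` and `aws:sourceVpc` refer to the same key.

```python
def network_condition(vpc_id=None, cidr=None):
    """Return an IAM Condition block for a VPC-based or IP-based restriction."""
    if (vpc_id is None) == (cidr is None):
        raise ValueError("Provide exactly one of vpc_id or cidr")
    if vpc_id is not None:
        return {"StringEquals": {"aws:SourceVpc": vpc_id}}
    return {"IpAddress": {"aws:SourceIp": cidr}}


def presigned_url_policy(condition):
    """Allow presigned Studio URLs only when the network condition holds."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:CreatePresignedDomainUrl",
                "Resource": "*",
                "Condition": condition,
            }
        ],
    }


# Hypothetical identifiers, for illustration only.
vpc_policy = presigned_url_policy(network_condition(vpc_id="vpc-0abc"))
ip_policy = presigned_url_policy(network_condition(cidr="198.51.100.0/24"))
```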
4.6 Monitoring Tools Reference
| Service | What It Monitors |
|---|---|
| SageMaker Model Monitor | Data quality, model quality, bias drift, feature attribution drift |
| SageMaker Clarify | Pre/post-training bias metrics; SHAP-based feature explanations |
| SageMaker Debugger / Profiler | Training job resource utilization (GPU, CPU, memory); training anomalies |
| Amazon CloudWatch | Infrastructure metrics (CPU, memory, invocations); alarm-based notifications |
| AWS CloudTrail | API call audit logs; who did what and when |
| Amazon CloudWatch Logs Insights | Query and analyze log data for error patterns |
Model Monitor captures bias metrics; CloudWatch visualizes them — use both together for dashboards.
SageMaker Clarify bias metrics generated by a monitoring job are emitted as Amazon CloudWatch metrics. To display these on a dashboard, the ML engineer captures the CloudWatch metrics from SageMaker Clarify/Model Monitor and builds a CloudWatch dashboard.
CloudTrail captures API activity logs — it does not capture bias metrics. EventBridge can trigger actions based on events but is not the metrics collection path. SNS is for sending notifications, not for capturing or displaying metrics.
Quick-Reference: Key Service Decision Trees
Which data transformation tool?
- No-code, visual interface → AWS Glue DataBrew (recipe job)
- Serverless ETL with Python/PySpark → AWS Glue ETL job
- Interactive data prep in SageMaker → SageMaker Data Wrangler
- Large-scale Spark → Amazon EMR
- SQL over S3 → Amazon Athena
Which inference type?
- Millisecond latency, small payload → Real-time
- Sporadic traffic, no idle cost, CPU → Serverless
- Large payload (>6 MB), long processing, notify on complete → Asynchronous
- Bulk dataset, once daily → Batch Transform
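The list above can be read as a small decision function. This is only a study aid under simplifying assumptions: the three inputs and the order of the checks are choices made for this sketch, and the 6 MB cutoff is the payload threshold quoted in the list.

```python
def choose_inference_type(payload_mb, sporadic_traffic, bulk_offline):
    """Map the decision list to a SageMaker inference option."""
    if bulk_offline:
        return "Batch Transform"   # whole dataset, scheduled, no endpoint
    if payload_mb > 6:
        return "Asynchronous"      # large payloads, long processing
    if sporadic_traffic:
        return "Serverless"        # scales to zero between bursts
    return "Real-time"             # steady traffic, millisecond latency


choices = [
    choose_inference_type(200, sporadic_traffic=False, bulk_offline=False),
    choose_inference_type(1, sporadic_traffic=True, bulk_offline=False),
    choose_inference_type(1, sporadic_traffic=False, bulk_offline=False),
    choose_inference_type(500, sporadic_traffic=False, bulk_offline=True),
]
```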
Which encoding?
- Nominal category, few values, no ordering → One-hot
- Ordinal category with ranking → Label/ordinal
- Binary yes/no, no dimensionality increase → Label (binary)
- High-cardinality nominal, minimize columns → Binary encoding
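A small pure-Python sketch shows why the column counts differ between these encodings. The helper names are hypothetical and the implementations are intentionally minimal; in practice a library encoder would be used.

```python
def one_hot(values):
    """One column per category; no ordering implied."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def ordinal(values, order):
    """One integer column that preserves the given ranking."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def binary(values):
    """About log2(n) columns: each category index written in base 2."""
    categories = sorted(set(values))
    width = max(1, (len(categories) - 1).bit_length())
    index = {category: i for i, category in enumerate(categories)}
    return [
        [(index[v] >> bit) & 1 for bit in reversed(range(width))]
        for v in values
    ]


colors = one_hot(["red", "blue", "red"])       # [[0, 1], [1, 0], [0, 1]]
sizes = ordinal(["low", "high", "medium"], ["low", "medium", "high"])  # [0, 2, 1]
codes = binary(["a", "b", "c", "d", "e"])      # 3 columns instead of 5
```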
Which anomaly detection approach?
- Unlabeled data → Random Cut Forest (RCF)
- Labeled data, classification → XGBoost or Linear Learner
Which monitoring tool?
- Input data drift → Model Monitor (Data Quality)
- Prediction accuracy over time → Model Monitor (Model Quality)
- Bias in live predictions → Model Monitor (Bias Drift) + Clarify
- Training bottleneck → SageMaker Debugger / Profiler
- Security audit → AWS CloudTrail
Common Exam Traps: Summary
| Trap | Correct Answer |
|---|---|
| "Aggregate from S3 + on-prem MySQL" → EMR | AWS Lake Formation |
| "Data always up to date in Data Wrangler" → cataloged | Direct connection |
| "Large payload 100–300 MB, 60 min processing" → batch transform | Asynchronous inference |
| "Sporadic, no traffic overnight, CPU model" → real-time | Serverless inference |
| "Detect irregularities, unlabeled" → k-means | Random Cut Forest (RCF) |
| "Recommendation, high-dimensional sparse" → k-NN | Factorization Machines |
| "Catch all fraud, minimize misses" → precision | Recall |
| "Continuous numeric predictions" → accuracy | RMSE |
| "Overfitting + unnecessary features" → L2 | L1 regularization |
| "Block specific IP" → security group deny | Network ACL deny rule |
| "KMS AccessDenied after SSE-KMS switch" → add s3 permissions | Add kms:Encrypt / kms:Decrypt to execution role |
| "Set up Model Monitor" → monitoring schedule first, then baseline | Baseline first, then monitoring job |
| "Retrain on new data" → reuse old baseline | Establish new baseline after retraining |
| "Compare new model to production, least effort" → separate endpoint | Shadow variant on same endpoint |
| "No-code transformation" → Glue ETL job | Glue DataBrew (recipe job) |
| "Label employees-only data" → Mechanical Turk | Private workforce in Ground Truth |
| "RAG from S3 docs, least overhead" → fine-tuning | Bedrock Knowledge Bases |
| "Private subnet EC2 → Bedrock" → NAT gateway | AWS PrivateLink (interface VPC endpoint) |
| "Python provisioning of EKS" → CloudFormation | AWS CDK |
Last updated: Monday, 01 June 2026