AWS Certified Data Engineer - Associate (DEA-C01)

- 204 exam-style questions
- Detailed explanations and references
- Simulation and custom modes
- Custom exam settings to drill down into specific topics
- 180-day access period
- Pass or money back guarantee
What is in the package
The tone and tenor of our questions mimic the actual exam. Along with the detailed description and exam tips provided within the explanations, we have extensively referenced AWS documentation to get you up to speed on all domain areas tested for the DEA-C01 exam.
Beyond exam preparation, our practice exams serve as a lasting reference guide. From setting up secure data pipelines to monitoring and troubleshooting operations, you will find the knowledge and skills needed for real-world scenarios. Whether you're an aspiring data engineer or a seasoned professional, CertVista DEA-C01 will equip you with the tools to advance your career and excel in the ever-growing field of data engineering.
Complete DEA-C01 domains coverage
CertVista DEA-C01 is organized into four domains that closely follow the domains of the AWS Certified Data Engineer - Associate exam.
Data Ingestion and Transformation
CertVista DEA-C01 covers selecting and implementing appropriate AWS services (e.g., Kinesis, DMS, Glue, Snow Family) to ingest data from various sources based on velocity and volume requirements. It assesses the ability to design, build, and optimize data transformation pipelines using services like AWS Glue, EMR, and Lambda to process, partition, format (e.g., Parquet, Avro), and validate data for analytical use cases, including schema management.
Data Store Management
Here, we focus on selecting, configuring, and managing suitable AWS data storage solutions (e.g., S3, Redshift, DynamoDB, RDS) aligned with access patterns, data structure, and cost considerations. It includes implementing data lifecycle policies, applying optimal partitioning and file formats, managing data catalogs using AWS Glue Data Catalog, and optimizing storage for query performance and cost-efficiency.
Data Operations and Monitoring
This domain covers automating, orchestrating, monitoring, and troubleshooting data processing workflows using services like Step Functions, Glue Workflows, and Managed Workflows for Apache Airflow. Key skills include implementing monitoring, logging, and alerting with CloudWatch, diagnosing pipeline failures and performance issues, managing pipeline deployments, and ensuring operational resilience and efficiency.
Data Security and Governance
CertVista DEA-C01 covers implementing security controls and data governance practices across data pipelines and storage. It includes applying security mechanisms like IAM for access control, KMS for encryption, and network security configurations. Candidates must demonstrate proficiency in managing data permissions (potentially with Lake Formation), ensuring compliance, implementing data cataloging, monitoring data quality, and applying data privacy techniques.
What's in the DEA-C01 exam
The AWS Certified Data Engineer—Associate exam validates the skills and expertise required to build, maintain, and optimize data processing systems on the AWS platform. It assesses the candidate's ability to manage data throughout its lifecycle, including ingestion, transformation, storage, and analysis. The exam is particularly relevant for professionals creating robust, scalable, secure data infrastructure critical for data-driven decision-making and AI-powered solutions.
This certification covers a range of topics, including streaming and batch data ingestion, automated data pipeline construction, data transformation techniques, storage services, and database management. Security and governance, such as encryption, masking, and access controls, are also critical components. Additionally, the exam tests knowledge of advanced topics like data cataloging, monitoring, auditing, and troubleshooting data operations, ensuring candidates can maintain efficient systems.
DEA-C01 Stats
The AWS Certified Data Engineer—Associate (DEA-C01) exam has a pass or fail designation. It is scored against a minimum standard established by AWS professionals who follow certification industry best practices and guidelines. Your results for the exam are reported as a scaled score of 100–1,000. The minimum passing score is 720.
What are the DEA-C01 Questions Like?
There are two types of questions on the exam: multiple choice, which has one correct response and three incorrect responses (distractors), and multiple response, which has two or more correct responses out of five or more response options.
During the exam, you will be asked to choose the best answer for scenarios that involve designing and implementing systems on AWS. The questions are longer and more complex than you might expect from an associate-level certification exam. Most questions involve lengthy scenarios, usually several sentences to a couple of paragraphs.
Most of the answer choices will be several sentences long as well. So, take your time reading through these longer questions, and be sure to process every word you read in detail. Be on the lookout for repeated sentences across the possible answers with just a word or two changed.
Those one or two words can make all the difference when determining which answer is correct and which might be a distractor. Always do your best to eliminate these distractors as early as possible so you can focus more on the plausible answers and select the best answer to each question.
Like all exams, the AWS Certified Data Engineer – Associate certification from AWS is updated periodically and may eventually be retired or replaced. At some point, after AWS no longer offers this exam, the old editions of our practice exams will be retired.
AWS Certified Data Engineer – Associate Exam (DEA-C01) Objectives
The following table provides a breakdown of the DEA-C01 exam coverage, showing the weight of each domain:
| Subject Area | % of Exam |
| --- | --- |
| Domain 1: Data Ingestion and Transformation | 34% |
| Domain 2: Data Store Management | 26% |
| Domain 3: Data Operations and Support | 22% |
| Domain 4: Data Security and Governance | 18% |
Domain 1: Data Ingestion and Transformation (34%)
This domain focuses on getting data into AWS and preparing it for analysis or storage. It covers selecting appropriate AWS services (like Kinesis for streaming, DMS for database migration, Glue ETL, Snow Family for bulk transfer, SFTP, DataSync) for various data sources (databases, streams, files, APIs) and velocities (batch, near-real-time, real-time). Key tasks include configuring ingestion pipelines, handling different data formats (JSON, CSV, Parquet, Avro), managing schemas, and validating incoming data.
Furthermore, this domain emphasizes data transformation techniques using AWS Glue, EMR, Lambda, and Kinesis Data Analytics services. This involves cleaning, normalizing, enriching, partitioning, and converting data into optimized formats (like Parquet or ORC) suitable for data lakes or warehouses. Understanding ETL (Extract, Transform, Load) vs. ELT (Extract, Load, Transform) patterns and optimizing transformation jobs for performance and cost are crucial.
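To ground this with a quick example, the sketch below converts a CSV object to Parquet with pandas and PyArrow, the kind of transformation a Glue Python shell job or Lambda function might perform. It is a minimal sketch; all bucket and key names are placeholders.

```python
# Minimal CSV-to-Parquet conversion sketch (placeholder bucket/key names).
# In practice this logic would run inside an AWS Glue Python shell job or a
# Lambda function with the pandas, pyarrow, and boto3 libraries available.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def csv_object_to_parquet(bucket: str, src_key: str, dest_key: str) -> None:
    """Read a CSV object from S3 and write it back as Parquet."""
    obj = s3.get_object(Bucket=bucket, Key=src_key)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # uses the pyarrow engine
    s3.put_object(Bucket=bucket, Key=dest_key, Body=buffer.getvalue())

# Example call (hypothetical names):
# csv_object_to_parquet("my-raw-bucket", "orders/2024-01-01.csv",
#                       "orders-parquet/2024-01-01.parquet")
```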
Domain 2: Data Store Management (26%)
This domain covers the effective storage and management of data within the AWS ecosystem. It requires understanding the characteristics and use cases of various AWS storage options, including object storage (S3), data warehouses (Redshift), NoSQL databases (DynamoDB), relational databases (RDS/Aurora), and data lake storage patterns. Candidates need to know how to choose the right storage solution based on data structure, access patterns, query requirements, durability, and cost. Key activities include managing data lifecycle policies (e.g., S3 Intelligent-Tiering, Glacier), implementing partitioning strategies in data lakes and data warehouses for query optimization, defining and managing data catalogs (using AWS Glue Data Catalog), understanding data modeling concepts, and optimizing storage for cost and performance (e.g., compression, file formats). Ensuring data is organized, accessible, and efficiently stored is the core focus.
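As a small illustration of the lifecycle management this domain covers, here is a hedged boto3 sketch that tiers raw data to Glacier and eventually expires it; the bucket name, prefix, and day counts are placeholders.

```python
# Hedged sketch: apply an S3 lifecycle rule that tiers raw data to Glacier
# after 90 days and expires it after roughly 7 years.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 2555},  # approximately 7 years
            }
        ]
    },
)
```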
Domain 3: Data Operations and Support (22%)
This domain focuses on the operational aspects of maintaining, monitoring, and ensuring the reliability of data pipelines and workflows. It involves using AWS services to orchestrate data processing jobs, manage dependencies, and schedule executions (e.g., Step Functions, Glue Workflows, and Managed Workflows for Apache Airflow). Monitoring pipeline health, performance, and data quality is critical, utilizing services like CloudWatch (Logs, Metrics, Alarms) and potentially AWS Glue Data Quality.
Candidates must be proficient in troubleshooting pipeline failures, diagnosing performance bottlenecks, and implementing logging and alerting mechanisms. This domain also covers automating deployment and operational tasks, managing pipeline versions, optimizing resource utilization, and implementing strategies for handling failures and ensuring data processing resilience and efficiency.
Domain 4: Data Security and Governance (18%)
This domain covers the critical aspects of securing data assets and implementing data governance practices within AWS. It includes applying security best practices such as managing authentication and authorization using IAM (roles, policies), implementing encryption both at rest (using KMS, S3 server-side encryption) and in transit (TLS/SSL), and configuring network security (VPCs, security groups, endpoints). Securing access to data stores and processing services is paramount. Governance aspects involve managing data access controls at granular levels (e.g., using Lake Formation permissions), implementing data cataloging for discovery and lineage, defining and monitoring data quality rules, and ensuring compliance with relevant regulations and organizational policies. Data privacy techniques (like masking or tokenization), data retention policies, and audit logging are essential components of this domain.
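As one concrete example of encryption at rest, the hedged sketch below sets SSE-KMS as a bucket's default encryption; the bucket name and KMS key ARN are placeholders.

```python
# Hedged sketch: enforce SSE-KMS default encryption on a bucket.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="my-secure-data-bucket",  # placeholder
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    # Placeholder key ARN
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
                },
                "BucketKeyEnabled": True,  # reduces KMS request costs
            }
        ]
    },
)
```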
Sample DEA-C01 questions
Get a taste of the AWS Certified Data Engineer - Associate (DEA-C01) exam with our carefully curated sample questions below. These questions mirror the actual exam's style, complexity, and subject matter, giving you a realistic preview of what to expect. Each question comes with comprehensive explanations, relevant AWS documentation references, and valuable test-taking strategies from our expert instructors.
While these sample questions provide excellent study material, we encourage you to try our free demo for the complete exam preparation experience. The demo features our state-of-the-art test engine that simulates the real exam environment, helping you build confidence and familiarity with the exam format. You'll experience timed testing, question marking, and review capabilities – just like the actual AWS certification exam.
A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket.
Which solution will meet these requirements with the least operational overhead?
Create an S3 event notification that has an event type of `s3:ObjectTagging:*` for objects that have a tag set to .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
Create an S3 event notification that has an event type of `s3:ObjectCreated:*`. Use a filter rule to generate notifications only when the suffix includes .csv. Set an Amazon Simple Notification Service (Amazon SNS) topic as the destination for the event notification. Subscribe the Lambda function to the SNS topic.
Create an S3 event notification that has an event type of `s3:*`. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
Create an S3 event notification that has an event type of `s3:ObjectCreated:*`. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
Creating an S3 event notification configured with the `s3:ObjectCreated:*` event type ensures that the notification triggers whenever a new object is successfully uploaded to the bucket (including Put, Post, Copy, and Multipart Upload completion). Using a suffix filter rule for `.csv` ensures that the notification is generated, and thus the Lambda function is invoked, only when the uploaded object's key name ends with `.csv`. Setting the Lambda function's ARN directly as the destination for the event notification provides a direct invocation mechanism without intermediate services. This combination precisely meets the requirements (trigger on CSV upload only) with the minimum number of AWS resources and configuration steps, resulting in the least operational overhead.
Configuring the event type `s3:ObjectTagging:*` is incorrect because this event triggers when tags are added or modified on an S3 object, not when the object is initially created or uploaded. The requirement is to trigger the Lambda function upon file upload.
Using the event type `s3:*` is inefficient and overly broad. While it includes object creation events, it also includes many other events like object deletion (`s3:ObjectRemoved:*`), tagging (`s3:ObjectTagging:*`), and so on. Although the suffix filter would prevent the Lambda from processing non-CSV files, the event notification system would still generate notifications for non-creation events involving CSV files (like deletion or tagging), potentially causing unnecessary triggers or requiring more complex handling logic within the Lambda function if it were invoked for events other than creation. Using `s3:ObjectCreated:*` is more specific and efficient.
Introducing an Amazon SNS topic as an intermediary between the S3 event notification and the Lambda function adds an extra component to manage (the SNS topic itself, its access policies, and the Lambda subscription to the topic). While this architecture is valid and useful for scenarios requiring fan-out to multiple subscribers or decoupling, it increases the operational overhead compared to directly invoking the Lambda function from the S3 event notification. The requirement specifically asks for the solution with the least operational overhead.
For 'least operational overhead' questions involving AWS service integrations, always favor the most direct path allowed by the services. Avoid introducing intermediate services like SQS or SNS unless they are explicitly required for features like decoupling, buffering, or fan-out, which are not mentioned in this scenario. Also, be precise with event types (e.g., `s3:ObjectCreated:*` vs. `s3:*`) to minimize unnecessary triggers.
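To make the winning option concrete, here is a minimal boto3 sketch of that notification configuration; the bucket name and Lambda ARN are placeholders, and the Lambda resource policy that allows S3 to invoke the function is assumed to already be in place.

```python
# Hedged sketch: S3 event notification that invokes a Lambda function directly
# for newly created .csv objects. Bucket name and function ARN are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-upload-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "csv-to-parquet-trigger",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:csv-to-parquet",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "suffix", "Value": ".csv"},
                        ]
                    }
                },
            }
        ]
    },
)
```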
A marketing company collects clickstream data. The company sends the clickstream data to Amazon Kinesis Data Firehose and stores the clickstream data in Amazon S3. The company wants to build a series of dashboards that hundreds of users from multiple departments will use.
The company will use Amazon QuickSight to develop the dashboards. The company wants a solution that can scale and provide daily updates about clickstream activity.
Which combination of steps will meet these requirements most cost-effectively? (Choose two.)
Access the query data through a QuickSight direct SQL query.
Use Amazon Athena to query the clickstream data.
Use Amazon S3 analytics to query the clickstream data.
Access the query data through QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine). Configure a daily refresh for the dataset.
Use Amazon Redshift to store and query the clickstream data.
First, consider the query engine. The data resides in Amazon S3. Amazon Athena is a serverless query service that allows you to analyze data directly in Amazon S3 using standard SQL. It scales automatically and you pay only for the queries you run, making it highly cost-effective for querying data stored in S3, especially when query patterns might vary. Loading the data into Amazon Redshift would incur additional ETL effort and the costs associated with running a Redshift cluster (provisioned or serverless), which might not be the most cost-effective approach compared to querying directly from S3 with Athena. Amazon S3 Analytics is used for analyzing storage access patterns, not querying the data content itself.
Second, consider the QuickSight data access method. QuickSight offers two modes: Direct Query and SPICE (Super-fast, Parallel, In-memory Calculation Engine).
- Direct Query mode runs queries against the underlying data source (Athena, in this case) in real-time as users interact with the dashboard. With hundreds of users, this could lead to a high volume of Athena queries, potentially increasing costs significantly (Athena charges per query/data scanned) and potentially impacting dashboard performance.
- SPICE imports a copy of the data into a highly optimized, in-memory cache within QuickSight. Queries from dashboards then hit the SPICE layer, providing fast performance and reducing the query load on the underlying source. SPICE datasets can be scheduled to refresh periodically (e.g., daily). For a large number of users accessing the same dataset, SPICE is generally more cost-effective and provides better performance than Direct Query mode, especially when the underlying source is pay-per-query like Athena. The requirement for daily updates aligns perfectly with SPICE's scheduled refresh capability.
Therefore, the most cost-effective and scalable solution involves using Amazon Athena to query the clickstream data directly from S3 and accessing this data in QuickSight through SPICE, configured with a daily refresh.
When designing QuickSight solutions with many users and pay-per-query data sources like Athena, strongly consider SPICE. SPICE optimizes for dashboard performance and cost by caching data and reducing queries to the source. Match the SPICE refresh frequency to the required data freshness (e.g., daily updates = daily SPICE refresh).
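For a sense of what the Athena side might look like, the hedged sketch below runs a daily clickstream aggregation through the Athena API; the QuickSight dataset built on top of it would then use SPICE import mode with a daily refresh schedule. The database, table, and output location are placeholders.

```python
# Hedged sketch: run a daily clickstream aggregation in Athena. The QuickSight
# dataset built on this query would use SPICE import mode with a daily refresh.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT date_trunc('day', event_time) AS event_day,
               count(*)                      AS clicks
        FROM clickstream_events
        GROUP BY 1
        ORDER BY 1
    """,
    QueryExecutionContext={"Database": "clickstream_db"},      # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/clickstream/"},
)
print(response["QueryExecutionId"])
```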
Files from multiple data sources arrive in an Amazon S3 bucket on a regular basis. A data engineer wants to ingest new files into Amazon Redshift in near real time when the new files arrive in the S3 bucket.
Which solution will meet these requirements?
Use the query editor v2 to schedule a `COPY` command to load new files into Amazon Redshift.
Use S3 Event Notifications to invoke an AWS Lambda function that loads new files into Amazon Redshift.
Use the zero-ETL integration between Amazon Aurora and Amazon Redshift to load new files into Amazon Redshift.
Use AWS Glue job bookmarks to extract, transform, and load (ETL) new files into Amazon Redshift.
Amazon S3 Event Notifications provide a mechanism to automatically trigger downstream actions when specific events occur within an S3 bucket, such as the creation of a new object (`s3:ObjectCreated:*`). You can configure S3 Event Notifications to send a message to various targets, including an AWS Lambda function.
When a new file arrives in the designated S3 bucket, S3 will automatically invoke the configured Lambda function, passing event details that include the bucket name and the object key (file name). The Lambda function can then use this information to connect to the Amazon Redshift cluster (e.g., using the Redshift Data API or a standard database driver) and execute a `COPY` command to load the specific new file into the target Redshift table. This approach provides a highly responsive, serverless, and near real-time ingestion pipeline.
Scheduling a `COPY` command using the query editor v2 is time-based, not event-based. It runs on a fixed schedule (e.g., every hour), leading to potential delays between file arrival and ingestion, thus not meeting the near real-time requirement.
The zero-ETL integration is specifically designed for replicating data changes from Amazon Aurora or Amazon RDS for MySQL databases to Amazon Redshift, not for ingesting files from S3.
AWS Glue job bookmarks are used to track processed data, enabling Glue jobs to process only new data since the last run. While Glue jobs can be triggered by S3 events (typically via EventBridge), running a full Glue ETL job for each arriving file might introduce latency and overhead compared to a lightweight Lambda function executing a `COPY` command, making it less suitable for near real-time, per-file ingestion.
For near real-time processing triggered by S3 object creation, the combination of S3 Event Notifications and AWS Lambda is a standard and effective pattern. Lambda's low startup time and event-driven nature make it ideal for reacting quickly to file arrivals.
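A minimal sketch of such a Lambda handler, using the Redshift Data API to run the COPY for the newly arrived object, might look like the following; the workgroup, database, table, and IAM role names are placeholders.

```python
# Hedged sketch of the Lambda handler: when S3 invokes the function for a new
# object, issue a COPY for that file through the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    copy_sql = (
        f"COPY staging.orders FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

    # Runs asynchronously; use describe_statement to poll for completion if needed.
    resp = redshift_data.execute_statement(
        WorkgroupName="analytics-serverless",  # or ClusterIdentifier for provisioned clusters
        Database="dev",
        Sql=copy_sql,
    )
    return {"statement_id": resp["Id"]}
```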
A retail company uses an Amazon Redshift data warehouse and an Amazon S3 bucket. The company ingests retail order data into the S3 bucket every day.
The company stores all order data at a single path within the S3 bucket. The data has more than 100 columns. The company ingests the order data from a third-party application that generates more than 30 files in CSV format every day. Each CSV file is between 50 and 70 MB in size.
The company uses Amazon Redshift Spectrum to run queries that select sets of columns. Users aggregate metrics based on daily orders. Recently, users have reported that the performance of the queries has degraded. A data engineer must resolve the performance issues for the queries.
Which combination of steps will meet this requirement with LEAST developmental effort? (Choose two.)
Load the JSON data into the Amazon Redshift table in a SUPER type column.
Configure the third-party application to create the files in JSON format.
Partition the order data in the S3 bucket based on order date.
Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.
Configure the third-party application to create the files in a columnar format.
The scenario describes performance degradation for Amazon Redshift Spectrum queries against CSV files stored in Amazon S3. The queries select subsets of columns and aggregate metrics based on daily orders. The data includes many columns (>100), and all files are stored in a single path. To improve query performance with minimal development effort, we need to address the inefficiencies in data format and layout.
Columnar Format: CSV is a row-based format. When Spectrum queries select only a subset of columns from a wide table (100+ columns), it must still read entire rows, including data from columns not requested. Switching to a columnar format like Apache Parquet or ORC allows Spectrum to read data only for the columns specified in the query, significantly reducing the amount of data scanned and improving performance. Configuring the third-party application to generate files directly in a columnar format, if supported, would be the most efficient way to achieve this, minimizing downstream processing effort.
Partitioning: The queries aggregate metrics based on daily orders, implying frequent filtering or grouping by date. Storing all files in a single path forces Spectrum to list and potentially scan all files, even if the query only targets a specific date range. Partitioning the data in the S3 bucket based on the order date (e.g., creating prefixes like `s3://your-bucket/orders/order_date=YYYY-MM-DD/`) allows Spectrum to perform partition pruning. Spectrum can identify and scan only the partitions (and files within them) that match the date predicates in the `WHERE` clause, drastically reducing the data scanned for date-specific queries.
Combining these two approaches addresses the key performance bottlenecks:
- Using a columnar format optimizes queries that select specific columns.
- Partitioning by date optimizes queries that filter or aggregate by date.
Let's consider the other options:
- Configuring the application to create files in JSON format does not solve the performance issue for selecting subsets of columns, as JSON is also row-oriented.
- Loading JSON data into a Redshift SUPER column involves moving data from S3 into Redshift's local storage, which changes the architecture away from Redshift Spectrum and adds complexity.
- Developing an AWS Glue ETL job to consolidate multiple daily files into one large file might offer minor benefits by reducing S3 listing overhead, but it doesn't address the core inefficiencies of the row-based format or lack of partitioning. It also adds development effort.
Therefore, partitioning the data by order date in S3 and configuring the source application to use a columnar format are the most effective steps to improve Redshift Spectrum query performance with potentially the least development effort.
When optimizing query performance for Redshift Spectrum or Athena against data in S3, always consider data format and partitioning. Columnar formats (Parquet, ORC) are best for queries selecting subsets of columns, especially with wide tables. Partitioning is crucial for filtering data based on low-cardinality columns frequently used in `WHERE` clauses (like date, region, etc.). Aim to implement these optimizations as close to the data source as possible to minimize downstream ETL effort.
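To illustrate the target layout, here is a hedged sketch that defines a Parquet-backed, date-partitioned Spectrum external table and registers one daily partition through the Redshift Data API; all object names are placeholders and the column list is abridged.

```python
# Hedged sketch: define a partitioned, Parquet-backed external table for
# Redshift Spectrum and register one daily partition. Schema, table, bucket,
# and workgroup names are placeholders; the column list is abridged.
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    """
    CREATE EXTERNAL TABLE spectrum.orders (
        order_id    VARCHAR(32),
        customer_id VARCHAR(32),
        order_total DECIMAL(10,2)
    )
    PARTITIONED BY (order_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-retail-bucket/orders/';
    """,
    """
    ALTER TABLE spectrum.orders
    ADD IF NOT EXISTS PARTITION (order_date = '2024-01-01')
    LOCATION 's3://my-retail-bucket/orders/order_date=2024-01-01/';
    """,
]

for sql in statements:
    redshift_data.execute_statement(
        WorkgroupName="analytics-serverless",
        Database="dev",
        Sql=sql,
    )
```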
A company stores customer records in Amazon S3. The company must not delete or modify the customer record data for 7 years after each record is created. The root user also must not have the ability to delete or modify the data.
A data engineer wants to use S3 Object Lock to secure the data.
Which solution will meet these requirements?
Enable governance mode on the S3 bucket. Use a default retention period of 7 years.
Place a legal hold on individual objects in the S3 bucket. Set the retention period to 7 years.
Set the retention period for individual objects in the S3 bucket to 7 years.
Enable compliance mode on the S3 bucket. Use a default retention period of 7 years.
The requirement is to ensure customer records in Amazon S3 cannot be deleted or modified for 7 years, even by the root user. Amazon S3 Object Lock provides two retention modes to achieve Write-Once-Read-Many (WORM) storage:
- Governance Mode: Protects objects from deletion and modification by most users. However, users with the `s3:BypassGovernanceRetention` permission, including the root user, can override or remove the retention settings.
- Compliance Mode: Provides a higher level of protection. Once an object version is locked in compliance mode, its retention mode cannot be changed, and its retention period cannot be shortened. Critically, no user, including the root user of the AWS account, can overwrite or delete an object version protected by compliance mode during its retention period.
A retention period specifies the length of time the object version remains locked. This period can be set explicitly for an object or automatically applied using a bucket's default retention settings.
To meet the requirement that even the root user cannot delete or modify the data for 7 years, Compliance Mode must be used. Setting a default retention period of 7 years on the bucket ensures that all new objects automatically inherit this lock configuration.
Placing a legal hold prevents deletion/modification but does not have a fixed duration; it remains until explicitly removed and doesn't automatically apply a 7-year term. Using governance mode does not satisfy the requirement of preventing deletion by the root user. Setting only the retention period without specifying compliance mode does not guarantee the required level of protection.
Therefore, enabling compliance mode on the bucket with a default 7-year retention period is the correct solution.
Pay close attention to the differences between S3 Object Lock's Governance and Compliance modes. Compliance mode offers stricter protection, preventing deletion even by the root user, which is often required for regulatory compliance scenarios like the one described.
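A minimal boto3 sketch of the compliance-mode setup might look like the following; the bucket name is a placeholder.

```python
# Hedged sketch: create a bucket with Object Lock enabled and set a default
# compliance-mode retention of 7 years. Bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")

# Object Lock is enabled on the bucket (here, at creation time).
s3.create_bucket(
    Bucket="customer-records-worm",
    ObjectLockEnabledForBucket=True,
)

s3.put_object_lock_configuration(
    Bucket="customer-records-worm",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                "Mode": "COMPLIANCE",  # cannot be shortened or removed, even by root
                "Years": 7,
            }
        },
    },
)
```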
A data engineer needs to create a new empty table in Amazon Athena that has the same schema as an existing table named old_table.
Which SQL statement should the data engineer use to meet this requirement?
CREATE TABLE new_table (LIKE old_table);
CREATE TABLE new_table AS (SELECT * FROM old_table) WITH NO DATA;
INSERT INTO new_table SELECT * FROM old_table;
CREATE TABLE new_table AS SELECT * FROM old_table;
Amazon Athena supports the `CREATE TABLE AS SELECT` (CTAS) statement, which creates a new table based on the results of a `SELECT` query. The schema of the new table is derived from the columns and data types returned by the `SELECT` statement.
To create the table with the same schema but without copying the data, Athena provides the `WITH NO DATA` clause for CTAS statements. The statement `CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA;` instructs Athena to:
- Determine the schema by evaluating the `SELECT * FROM old_table` query.
- Create `new_table` with that derived schema.
- Skip the execution of the `SELECT` query for data population due to the `WITH NO DATA` clause.
This results in a new, empty table (`new_table`) with the identical schema as `old_table`.
The statement `CREATE TABLE new_table (LIKE old_table);` is not supported syntax in Athena for creating a table based on another table's schema. The `LIKE` clause in Athena's `CREATE TABLE` is used differently, typically related to SerDe properties when creating tables based on files.
The statement `INSERT INTO new_table SELECT * FROM old_table;` is used to copy data into an existing table (`new_table`) and does not create the table itself.
The statement `CREATE TABLE new_table AS SELECT * FROM old_table;` is a standard CTAS statement that does copy all data from `old_table` into `new_table`, which violates the requirement for the new table to be empty.
Remember the `CREATE TABLE AS SELECT` (CTAS) pattern in Athena. Use the `WITH NO DATA` clause when you need to duplicate a table's structure without duplicating its contents.
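If you want to run the winning statement programmatically, a hedged sketch using the Athena API could look like this; the database and result output location are placeholders.

```python
# Hedged sketch: run the CTAS statement with WITH NO DATA through the Athena API.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString=(
        "CREATE TABLE new_table AS "
        "(SELECT * FROM old_table) "
        "WITH NO DATA"
    ),
    QueryExecutionContext={"Database": "analytics_db"},            # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/ddl/"},
)
```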
A data engineer needs to create an Amazon Athena table based on a subset of data from an existing Athena table named `cities_world`. The `cities_world` table contains cities that are located around the world. The data engineer must create a new table named `cities_us` to contain only the cities from `cities_world` that are located in the US.
Which SQL statement should the data engineer use to meet this requirement?
UPDATE cities_usa SET (city, state) = (SELECT city, state FROM cities_world WHERE country=’usa’);
INSERT INTO cities_usa SELECT city, state FROM cities_world WHERE country=’usa’;
MOVE city, state FROM cities_world TO cities_usa WHERE country=’usa’;
INSERT INTO cities_usa (city,state) SELECT city, state FROM cities_world WHERE country=’usa’;
The requirement is to populate a new table, `cities_us`, with a specific subset of data (cities located in the US) from an existing table, `cities_world`. While the question uses the word "create," the provided SQL options focus on data manipulation (UPDATE, INSERT, MOVE) rather than table creation (`CREATE TABLE`). The most direct way to create a table based on a query result in Athena is using `CREATE TABLE AS SELECT` (CTAS). Since CTAS is not offered as an option, the question likely assumes the table `cities_us` already exists with the appropriate columns (`city`, `state`), and the task is to insert the relevant data into it.
The standard SQL syntax for inserting data into a table based on a selection from another table is `INSERT INTO ... SELECT ...`.
The statement `INSERT INTO cities_usa (city,state) SELECT city, state FROM cities_world WHERE country=’usa’;` correctly performs this operation:
- `INSERT INTO cities_usa (city,state)`: Specifies the target table (`cities_usa`) and the columns (`city`, `state`) into which data will be inserted.
- `SELECT city, state FROM cities_world`: Selects the required columns from the source table (`cities_world`).
- `WHERE country=’usa’`: Filters the rows from `cities_world` to include only those where the country is 'usa'.
This statement effectively copies the city and state for all US cities from the source table into the target table.
The statement `UPDATE cities_usa ...` is incorrect because `UPDATE` modifies existing rows; it does not insert new rows.
The statement `MOVE ...` is not valid SQL syntax for this purpose.
The statement `INSERT INTO cities_usa SELECT city, state FROM cities_world WHERE country=’usa’;` is also syntactically valid if the columns `city` and `state` are the first two columns (or the only columns) in the `cities_usa` table and match the order in the `SELECT` list. However, explicitly listing the target columns, as done in the correct answer, is generally considered better practice as it is more robust to changes in table structure.
Understand the difference between `CREATE TABLE AS SELECT` (CTAS), which creates and populates a table in one step, and `INSERT INTO ... SELECT`, which populates an existing table. If CTAS is not an option and the goal is to have a table with a subset of data, look for the `INSERT INTO ... SELECT` pattern, assuming the table structure exists.
A company implements a data mesh that has a central governance account. The company needs to catalog all data in the governance account. The governance account uses AWS Lake Formation to centrally share data and grant access permissions.
The company has created a new data product that includes a group of Amazon Redshift Serverless tables. A data engineer needs to share the data product with a marketing team. The marketing team must have access to only a subset of columns. The data engineer needs to share the same data product with a compliance team. The compliance team must have access to a different subset of columns than the marketing team needs access to.
Which combination of steps should the data engineer take to meet these requirements? (Choose two.)
Create an Amazon Redshift managed VPC endpoint in the marketing team’s account. Grant the marketing team access to the views.
Create an Amazon Redshift data share that includes the tables that need to be shared.
Share the Amazon Redshift data share to the Lake Formation catalog in the governance account.
Share the Amazon Redshift data share to the Amazon Redshift Serverless workgroup in the marketing team's account.
Create views of the tables that need to be shared. Include only the required columns.
The company needs to share specific columns from Amazon Redshift Serverless tables with different teams (marketing, compliance) using AWS Lake Formation for central governance. Each team requires access to a unique subset of columns.
To achieve column-level access control, a standard approach within databases like Amazon Redshift is to create SQL views. Views act as virtual tables defined by a query. The data engineer can create specific views for each team, selecting only the columns required for that team from the underlying base tables. This encapsulates the column filtering logic directly within Redshift.
Once the views tailored for each team are created, they need to be shared. Since the company uses Lake Formation for central governance, the sharing mechanism should integrate with it. Amazon Redshift Data Sharing allows sharing tables and views between Redshift clusters or workgroups. To leverage Lake Formation, the Redshift data share (containing the relevant views) needs to be registered with or shared to the Lake Formation catalog in the central governance account. Lake Formation can then be used to grant permissions to the consuming accounts (marketing and compliance) to access the specific views shared via the data share.
Therefore, the key steps are:
- Create views in the source Redshift Serverless workgroup that expose only the required columns for each team (marketing and compliance).
- Create an Amazon Redshift data share that includes these newly created views.
- Share this Redshift data share to the AWS Lake Formation catalog in the central governance account. Lake Formation will then manage access grants to these shared views for the respective consumer accounts.
Considering the provided options, creating views to define the column subsets is a crucial step. Sharing the Redshift data share (which would contain these views) to the Lake Formation catalog is the necessary step to integrate with the central governance model.
Creating an Amazon Redshift managed VPC endpoint relates to network connectivity for consumers, not the sharing mechanism or permission model itself. Sharing the data share directly to the consumer's Redshift workgroup bypasses the central Lake Formation governance specified in the requirement. Creating the data share itself is a prerequisite but sharing it to Lake Formation is the integration step needed.
When dealing with column-level access control in shared data scenarios, consider using SQL views within the source database (like Redshift) to define the specific column subsets. When using Lake Formation for central governance with Redshift, remember that Redshift data shares are the mechanism to package objects for sharing, and these shares need to be registered with Lake Formation to allow central permission management.
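The Redshift side of that flow might look like the hedged sketch below, which creates a column-limited view and adds it to a data share; all object names are placeholders, and the final step of sharing to and granting through Lake Formation is performed in the governance account and is not shown.

```python
# Hedged sketch: create a column-limited view and add it to a Redshift data
# share via the Redshift Data API. Object names are placeholders; registering
# the share with Lake Formation happens in the governance account separately.
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    # Expose only the columns the marketing team is allowed to see.
    """
    CREATE VIEW marketing.orders_marketing_v AS
    SELECT order_id, order_date, campaign_id, order_total
    FROM sales.orders;
    """,
    "CREATE DATASHARE marketing_share;",
    "ALTER DATASHARE marketing_share ADD SCHEMA marketing;",
    "ALTER DATASHARE marketing_share ADD TABLE marketing.orders_marketing_v;",
]

for sql in statements:
    redshift_data.execute_statement(
        WorkgroupName="data-product-serverless",  # placeholder workgroup
        Database="dev",
        Sql=sql,
    )
```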
A company has an application that uses a microservice architecture. The company hosts the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
The company wants to set up a robust monitoring system for the application. The company needs to analyze the logs from the EKS cluster and the application. The company needs to correlate the cluster's logs with the application's traces to identify points of failure in the whole application request flow.
Which combination of steps will meet these requirements with the least development effort? (Choose two.)
Use AWS Glue to correlate the logs and traces.
Use Amazon CloudWatch to collect logs. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to collect traces.
Use Amazon CloudWatch to collect logs. Use Amazon Kinesis to collect traces.
Use FluentBit to collect logs. Use OpenTelemetry to collect traces.
Use Amazon OpenSearch to correlate the logs and traces.
The goal is to set up monitoring for an application on Amazon EKS, collecting and correlating logs and traces with minimal development effort.
Log and Trace Collection: For applications running on Kubernetes, standard open-source tools are often used for observability data collection. FluentBit is a lightweight and efficient log processor and forwarder commonly deployed as a DaemonSet in Kubernetes to collect logs from containers and nodes. OpenTelemetry (OTel) is the industry standard framework for instrumenting applications to generate traces (and metrics/logs). Using OTel SDKs in the microservices and deploying an OTel Collector within the EKS cluster provides a standard way to gather traces. This combination (FluentBit for logs, OTel for traces) represents a best practice for collecting observability data from EKS with relatively low integration effort compared to building custom solutions.
Correlation and Analysis: Once logs and traces are collected, they need to be sent to a backend system capable of storing, querying, and correlating them. Amazon OpenSearch Service is a managed service suitable for indexing and analyzing large volumes of log and trace data. OpenSearch and its visualization tool, OpenSearch Dashboards, provide features for searching logs and visualizing traces. Crucially, if trace IDs generated by OpenTelemetry are injected into the corresponding logs (a common practice enabled by OTel instrumentation libraries), OpenSearch can be used to correlate specific traces with the log entries generated during that trace's execution path, helping to identify points of failure.
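As a small illustration of that correlation technique, the hedged sketch below writes the active OpenTelemetry trace ID into an application log line so the log and trace records can be joined in OpenSearch; the service and span names are placeholders, and exporter/collector configuration is omitted.

```python
# Hedged sketch: emit a log line that carries the active OpenTelemetry trace ID
# so a backend such as OpenSearch can join logs to traces. Service and span
# names are placeholders; tracer provider/exporter setup is not shown.
import logging

from opentelemetry import trace

logger = logging.getLogger("orders-service")
tracer = trace.get_tracer("orders-service")

def handle_request(order_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        ctx = span.get_span_context()
        # Include the trace ID in the log record so log and trace data can be
        # correlated downstream (formatted as a 32-character hex string).
        logger.info("processing order %s trace_id=%032x", order_id, ctx.trace_id)
```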
Using FluentBit and OpenTelemetry for collection leverages widely adopted open standards, minimizing custom development for data gathering. Using Amazon OpenSearch Service provides a managed backend optimized for log and trace analysis and correlation, reducing operational overhead.
Other options are less suitable:
- AWS Glue is an ETL service, not designed for real-time log/trace correlation.
- Using Amazon MSK or Amazon Kinesis directly for traces is inappropriate; they are streaming platforms, not trace backends, and would require significant development to build a tracing solution on top.
- While Amazon CloudWatch (for logs) and AWS X-Ray (for traces, often integrated with OTel) are viable AWS native alternatives, the combination presented involves FluentBit/OTel for collection and OpenSearch for the backend, which is a common and effective pattern.
Therefore, using FluentBit and OpenTelemetry for collection, combined with Amazon OpenSearch Service for storage, analysis, and correlation, meets the requirements with the least development effort among the choices provided.
For observability on Kubernetes (like EKS), look for solutions leveraging standard tools: FluentBit/Fluentd for logs, OpenTelemetry for traces/metrics, and a suitable backend like OpenSearch, CloudWatch/X-Ray, or other commercial offerings. Prioritize managed services (like Amazon OpenSearch Service) to reduce operational overhead.
References:
- Setting up Fluent Bit as a DaemonSet to send logs to CloudWatch Logs - Amazon EKS (Shows FluentBit usage in EKS, though destination here is CloudWatch)
- OpenTelemetry Documentation
- What is Amazon OpenSearch Service?
- Trace analytics in Amazon OpenSearch Service
A retail company is expanding its operations globally. The company needs to use Amazon QuickSight to accurately calculate currency exchange rates for financial reports. The company has an existing dashboard that includes a visual that is based on an analysis of a dataset that contains global currency values and exchange rates.
A data engineer needs to ensure that exchange rates are calculated with a precision of four decimal places. The calculations must be precomputed. The data engineer must materialize results in QuickSight super-fast, parallel, in-memory calculation engine (SPICE).
Which solution will meet these requirements?
Define and create the calculated field in the visual.
Define and create the calculated field in the dataset.
Define and create the calculated field in the analysis.
Define and create the calculated field in the dashboard.
Amazon QuickSight allows calculated fields to be created at different stages:
- Dataset Level: When you create a calculated field during dataset preparation (before publishing or saving), the calculation is performed as the data is ingested into SPICE. The results of this calculation are then stored within SPICE along with the original data. This meets the requirement for precomputing the calculation and materializing it in SPICE.
- Analysis/Visual/Dashboard Level: Calculated fields created within an analysis (and subsequently available in visuals and published dashboards) are computed dynamically when the analysis or dashboard is viewed or interacted with. These calculations are performed on the data retrieved from SPICE (or via Direct Query), but the results of the calculation itself are not stored back into SPICE.
To ensure the exchange rate calculations are precomputed and materialized in SPICE, the calculated field must be defined and created at the dataset level. This guarantees that the computation happens during data preparation/ingestion into SPICE, and the resulting values (with the desired precision) are stored efficiently within SPICE for fast retrieval during analysis.
Defining the calculated field in the visual, analysis, or dashboard would mean the calculation happens dynamically at query time, not meeting the requirement for precomputation and SPICE materialization.
Understand the difference between creating calculated fields at the dataset level versus the analysis level in QuickSight. Dataset-level calculations are precomputed and stored in SPICE (if used), improving performance for complex calculations and ensuring consistent results. Analysis-level calculations offer more flexibility during exploration but are computed dynamically.