Free AWS Certified Data Engineer - Associate (DEA-C01) Practice Questions
Test your knowledge with 20 free exam-style questions
DEA-C01 Exam Facts
- Questions: 65
- Passing score: 720/1000
- Duration: 130 minutes
Frequently Asked Questions
These 20 sample questions let you experience the exact format, difficulty, and question styles you'll encounter on exam day. Use them to identify knowledge gaps and decide if our full practice exam package is right for your preparation strategy.
Our questions mirror the actual exam format, difficulty level, and topic distribution. Each question includes detailed explanations to help you understand the concepts.
The full package includes 7 complete practice exams with 455+ unique questions, detailed explanations, progress tracking, and lifetime access.
Our DEA-C01 practice questions are regularly updated to reflect the latest exam objectives and question formats. All questions align with the current 2026 exam blueprint.
Sample DEA-C01 Practice Questions
Browse all 20 free AWS Certified Data Engineer - Associate practice questions below.
A company needs to build an ETL pipeline that extracts data from multiple relational databases, transforms it, and loads it into an Amazon S3 data lake. The data sources include Amazon RDS for MySQL, Amazon RDS for PostgreSQL, and an on-premises Oracle database. The pipeline should run daily and handle schema changes automatically. Which AWS service should the data engineer use to implement this ETL pipeline?
- Amazon EMR with Apache Sqoop for data extraction and Spark for transformation
- AWS Data Pipeline with EC2 instances running custom extraction scripts
- AWS Glue with JDBC connections for all databases and Glue crawlers for schema discovery
- Amazon Kinesis Data Firehose with Lambda transformations
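For context on the Glue option above, this is roughly how a crawler covering all three JDBC sources with a daily schedule could be defined via boto3. All names, paths, and the role ARN are hypothetical placeholders.

```python
# Sketch: parameters for glue.create_crawler() covering the three JDBC
# sources in the scenario. Connection names, database paths, and the IAM
# role ARN are hypothetical.
crawler_params = {
    "Name": "daily-etl-source-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "etl_source_catalog",
    "Targets": {
        "JdbcTargets": [
            {"ConnectionName": "rds-mysql-conn", "Path": "salesdb/%"},
            {"ConnectionName": "rds-postgres-conn", "Path": "ordersdb/%"},
            {"ConnectionName": "onprem-oracle-conn", "Path": "ERP/%"},
        ]
    },
    # Re-crawl daily so added or changed columns are picked up automatically.
    "Schedule": "cron(0 2 * * ? *)",
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
}
# boto3.client("glue").create_crawler(**crawler_params)
```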
A data engineering team is designing a data lake on Amazon S3 to store petabytes of data from various sources. They need to optimize query performance for analytics workloads while minimizing storage costs. The data will be queried using Amazon Athena. Which combination of practices should the team implement? (Select TWO)
- Use S3 Standard storage class with versioning enabled for all objects
- Disable S3 server-side encryption to improve query performance
- Store data in Apache Parquet format with Snappy compression and partition by frequently filtered columns
- Configure S3 Intelligent-Tiering for automatic cost optimization based on access patterns
- Store data in JSON format for maximum flexibility and human readability
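"Partition by frequently filtered columns" in practice means writing Hive-style `key=value` prefixes that Athena can prune at query time. A minimal sketch of building such a key (bucket layout and column names are illustrative):

```python
from datetime import date

def partitioned_key(event_day: date, region: str, filename: str) -> str:
    """Build a Hive-style partitioned S3 key so Athena can prune partitions."""
    return (
        f"events/year={event_day.year}/month={event_day.month:02d}/"
        f"day={event_day.day:02d}/region={region}/{filename}"
    )

key = partitioned_key(date(2024, 3, 7), "eu-west-1", "part-0000.snappy.parquet")
print(key)
# events/year=2024/month=03/day=07/region=eu-west-1/part-0000.snappy.parquet
```

Queries filtering on `year`, `month`, `day`, or `region` then scan only the matching prefixes instead of the whole table.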
A company uses Amazon Athena to query data stored in their S3 data lake. Users are complaining that queries on large tables are slow and expensive. The tables contain billions of rows and are stored in CSV format without any organization. What should the data engineer do to improve query performance and reduce costs?
- Convert the data to Parquet format, partition by date, and use bucketing on frequently joined columns
- Enable Athena workgroups with query result caching and increase concurrent query limits
- Split the large CSV files into smaller files and enable S3 Transfer Acceleration
- Migrate the data to Amazon Redshift for better query performance
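One common way to perform the conversion described in the first option is an Athena CTAS statement. A sketch, with hypothetical table and column names:

```python
# Sketch: a CTAS statement that rewrites a CSV-backed table as partitioned,
# bucketed Parquet. Table, column, and bucket names are hypothetical.
# Note: in Athena CTAS, partition columns must come last in the SELECT list.
ctas_sql = """
CREATE TABLE analytics.orders_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://example-bucket/orders_parquet/',
  partitioned_by = ARRAY['order_date'],
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 32
)
AS SELECT order_id, customer_id, amount, order_date
FROM analytics.orders_csv;
"""
# boto3.client("athena").start_query_execution(QueryString=ctas_sql, ...)
```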
A retail company wants to implement a data warehouse solution on AWS to analyze sales data from multiple regions. They require the ability to run complex analytical queries, support concurrent users, and automatically scale based on workload demands. The solution should minimize operational overhead. Which Amazon Redshift feature should the data engineer configure?
- Amazon Redshift DC2 nodes with snapshot-based backup and restore for scaling
- Amazon Redshift provisioned cluster with Concurrency Scaling enabled
- Amazon Redshift RA3 nodes with managed storage and elastic resize
- Amazon Redshift Serverless with automatic scaling based on workload
A media company needs to ingest clickstream data from their website into a data lake. The data arrives at a rate of 100,000 events per second with occasional bursts up to 500,000 events per second. The data should be available for analytics within 60 seconds of arrival. Which AWS service should the data engineer use for data ingestion?
- Amazon Kinesis Data Firehose with direct PUT operations
- Amazon MSK (Managed Streaming for Apache Kafka) with single partition topic
- Amazon Kinesis Data Streams with enhanced fan-out and multiple shards
- Amazon SQS with multiple queues for parallel processing
A company needs to ingest real-time clickstream data from their e-commerce website. The data arrives at approximately 100,000 events per second with peak loads of up to 300,000 events per second. The data must be available for real-time analytics within 1 second of arrival. Which AWS service should the data engineer use?
- Amazon Kinesis Data Firehose with direct PUT operations
- Amazon Kinesis Data Streams with on-demand capacity mode to automatically handle variable throughput
- Amazon MSK with a single partition topic
- Amazon SQS Standard queue with multiple consumers
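For reference, the on-demand capacity mode named in the second option is a single parameter at stream creation; no shard planning is needed. Stream name is hypothetical:

```python
# Sketch: create a Kinesis Data Stream in on-demand mode, which scales
# automatically with variable throughput instead of using fixed shards.
stream_params = {
    "StreamName": "clickstream-events",  # hypothetical name
    "StreamModeDetails": {"StreamMode": "ON_DEMAND"},
}
# boto3.client("kinesis").create_stream(**stream_params)
```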
A data engineer is configuring Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data should be converted from JSON to Parquet format and partitioned by year, month, and day based on the event timestamp in the data. Which configuration should the data engineer implement?
- Use a Lambda transformation to convert JSON to Parquet and set the S3 prefix
- Enable record format conversion with Parquet output format, configure dynamic partitioning with JQ expressions to extract the event timestamp
- Configure the S3 prefix with timestamp placeholders (!{timestamp:yyyy/MM/dd}) for partitioning
- Write JSON to S3 and use a Glue ETL job to convert to Parquet after delivery
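To make the second option concrete, here is a sketch of the relevant pieces of a Firehose `ExtendedS3DestinationConfiguration`: record format conversion to Parquet plus dynamic partitioning driven by a JQ `MetadataExtraction` processor. The bucket ARN is a placeholder, and the JQ expression assumes an `event_ts` field formatted like `2024-03-07T12:00:00Z`.

```python
# Sketch of the ExtendedS3DestinationConfiguration pieces for
# firehose.create_delivery_stream(). Field names follow the boto3 API;
# the SchemaConfiguration (Glue table supplying the Parquet schema) is
# omitted for brevity.
extended_s3_config = {
    "BucketARN": "arn:aws:s3:::example-clickstream",  # placeholder bucket
    "DataFormatConversionConfiguration": {
        "Enabled": True,
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    },
    "DynamicPartitioningConfiguration": {"Enabled": True},
    "ProcessingConfiguration": {
        "Enabled": True,
        "Processors": [
            {
                "Type": "MetadataExtraction",
                "Parameters": [
                    {
                        "ParameterName": "MetadataExtractionQuery",
                        # JQ slices the event timestamp into partition keys.
                        "ParameterValue": "{year: .event_ts[0:4], month: .event_ts[5:7], day: .event_ts[8:10]}",
                    },
                    {"ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6"},
                ],
            }
        ],
    },
    # Partition keys extracted above are referenced in the S3 prefix.
    "Prefix": "year=!{partitionKeyFromQuery:year}/month=!{partitionKeyFromQuery:month}/day=!{partitionKeyFromQuery:day}/",
    "ErrorOutputPrefix": "errors/",
}
```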
A company is building a real-time event streaming platform and has chosen Amazon MSK (Managed Streaming for Apache Kafka). They need to ensure high availability and durability for their critical business events. What configuration should the data engineer implement?
- Configure acks=0 on producers for maximum throughput with EBS encryption enabled
- Use MSK Serverless which automatically handles replication and availability
- Deploy MSK across 3 Availability Zones, configure topics with replication factor of 3 and min.insync.replicas of 2
- Deploy MSK in a single AZ with replication factor of 3 for faster performance
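The durable configuration described in the third option maps to a handful of Kafka settings. A minimal summary of where each one lives:

```python
# Illustrative Kafka settings for the 3-AZ durability setup described above.
# Topic-level settings (applied via kafka-topics.sh or an AdminClient):
topic_config = {
    "replication.factor": 3,    # one replica per Availability Zone
    "min.insync.replicas": 2,   # a write needs 2 in-sync replicas to succeed
}
# Producer-side setting: wait for all in-sync replicas to acknowledge,
# unlike acks=0, which trades durability for throughput.
producer_config = {"acks": "all"}
```

With this combination the cluster tolerates the loss of one broker (or one AZ) without losing acknowledged writes.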
A data engineer needs to process 10 TB of log data daily using Amazon EMR. The processing involves complex aggregations and joins using Apache Spark. The data should be processed within a 4-hour batch window. Which EMR configuration should the data engineer use for cost optimization?
- Use EMR on EKS with Fargate for containerized Spark execution
- Use all On-Demand instances with Reserved Instance pricing for consistent capacity
- Use EMR on EC2 with Spot Instances for task nodes and On-Demand instances for core nodes, with instance fleet for diverse instance types
- Use EMR Serverless for automatic scaling without managing infrastructure
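The Spot-for-task / On-Demand-for-core pattern with instance fleets looks roughly like this in a `run_job_flow` call. Instance types and capacities are illustrative only (a MASTER fleet is also required but omitted here):

```python
# Sketch: EMR instance fleets mixing On-Demand core nodes (durable HDFS)
# with diverse Spot task nodes (cheap, interruption-tolerant compute).
instance_fleets = [
    {
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 4,
        "InstanceTypeConfigs": [{"InstanceType": "r5.2xlarge"}],
    },
    {
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 16,
        "InstanceTypeConfigs": [   # diverse types improve Spot availability
            {"InstanceType": "r5.2xlarge"},
            {"InstanceType": "r5a.2xlarge"},
            {"InstanceType": "r4.2xlarge"},
        ],
    },
]
# boto3.client("emr").run_job_flow(..., Instances={"InstanceFleets": instance_fleets})
```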
A company has a streaming data pipeline that receives JSON events from IoT sensors. The JSON schema has evolved over time: new fields have been added and some field types have changed. The data engineer needs to handle these schema changes gracefully while maintaining backward compatibility. Which approach should be used?
- Create a new Kinesis stream for each schema version
- Use AWS Glue Schema Registry with compatibility mode set to BACKWARD to validate and manage schema versions
- Convert all data to Avro format which inherently supports schema evolution
- Store raw JSON without schema validation and handle variations at query time
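Registering a schema with BACKWARD compatibility (the second option) is a single Glue API call; the registry then rejects new versions that would break existing consumers. Registry, schema, and field names below are hypothetical:

```python
# Sketch: register a JSON schema in the AWS Glue Schema Registry with
# BACKWARD compatibility, so new versions stay readable by old consumers.
schema_params = {
    "RegistryId": {"RegistryName": "iot-events"},    # hypothetical registry
    "SchemaName": "sensor-event",
    "DataFormat": "JSON",
    "Compatibility": "BACKWARD",
    "SchemaDefinition": (
        '{"type": "object", "properties": {"device_id": {"type": "string"}}}'
    ),
}
# boto3.client("glue").create_schema(**schema_params)
```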
A company is designing a data lake on Amazon S3 to store sensor data from IoT devices. The data will be queried using Amazon Athena for both recent operational queries (last 24 hours) and historical trend analysis (last 2 years). The data is 500 GB per day. Which S3 storage strategy optimizes both cost and query performance?
- Store recent data (last 30 days) in S3 Standard and use lifecycle policies to transition older data to S3 Standard-IA, partitioning by date
- Store all data in S3 Glacier Instant Retrieval with date partitioning
- Use S3 Intelligent-Tiering for all data with lifecycle policies to transition to Glacier after 2 years
- Store all data in S3 Standard with no partitioning to simplify management
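The lifecycle transition described in the first option is declared as a bucket lifecycle rule. A sketch with a hypothetical bucket and prefix:

```python
# Sketch: lifecycle rule transitioning objects older than 30 days from
# S3 Standard to Standard-IA. Prefix and bucket name are hypothetical.
lifecycle_config = {
    "Rules": [
        {
            "ID": "sensor-data-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "sensor-data/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
            ],
        }
    ]
}
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake", LifecycleConfiguration=lifecycle_config)
```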
A data engineer is designing an Amazon Redshift data warehouse for a retail company. The fact table contains 5 billion rows of transaction data with columns: transaction_id, store_id, product_id, customer_id, transaction_date, quantity, and amount. Most queries filter by store_id and aggregate by transaction_date. Which distribution and sort key strategy provides optimal query performance?
- KEY distribution on store_id with interleaved sort key (store_id, transaction_date)
- EVEN distribution with compound sort key (store_id, transaction_date)
- ALL distribution with compound sort key (transaction_date, store_id)
- KEY distribution on transaction_id with compound sort key (store_id, transaction_date)
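For readers unfamiliar with the DDL involved, here is how one of the options above (EVEN distribution with a compound sort key on `store_id`, `transaction_date`) would be declared. This illustrates the syntax only and is not an answer key; the column list is taken from the question.

```python
# Illustrative Redshift DDL: DISTSTYLE sets row distribution across slices,
# COMPOUND SORTKEY orders rows on disk for zone-map pruning.
fact_ddl = """
CREATE TABLE sales_fact (
    transaction_id   BIGINT,
    store_id         INTEGER,
    product_id       INTEGER,
    customer_id      BIGINT,
    transaction_date DATE,
    quantity         INTEGER,
    amount           DECIMAL(12, 2)
)
DISTSTYLE EVEN
COMPOUND SORTKEY (store_id, transaction_date);
"""
```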
A company stores customer profile data that is accessed by multiple applications. The data has the following characteristics: 50 million records, each record averages 4 KB, read-heavy workload with 10,000 reads per second and 100 writes per second, queries are primarily key-value lookups by customer_id with occasional queries by email address. Which AWS database service best meets these requirements?
- Amazon Aurora Serverless v2 with query caching
- Amazon ElastiCache for Redis as the primary data store
- Amazon RDS for PostgreSQL with read replicas
- Amazon DynamoDB with a global secondary index on email
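The DynamoDB option maps directly to a table with `customer_id` as the partition key and a global secondary index on `email` for the occasional lookups. Table and index names are hypothetical:

```python
# Sketch: DynamoDB table keyed on customer_id with a GSI enabling
# lookups by email. Names are hypothetical.
table_params = {
    "TableName": "customer-profiles",
    "BillingMode": "PAY_PER_REQUEST",
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "email", "AttributeType": "S"},
    ],
    "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "email-index",
            "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
}
# boto3.client("dynamodb").create_table(**table_params)
```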
A data engineer is implementing a star schema in Amazon Redshift. The product dimension table has 2 million rows and is frequently joined with the 10 billion row sales fact table. The product table includes a surrogate key (product_sk), natural key (product_code), and attributes (name, category, subcategory, brand). Queries often filter products by category. What is the optimal table design for the product dimension?
- KEY distribution on product_sk with sort key on category
- ALL distribution with sort key on product_sk
- ALL distribution with sort key on category
- EVEN distribution with interleaved sort key on (category, subcategory, brand)
A company has a data lake on Amazon S3 with log data partitioned by year/month/day. Each day's partition contains 10,000 small JSON files averaging 1 MB each. Queries using Amazon Athena are slow and expensive. What should the data engineer do to improve query performance and reduce costs?
- Enable S3 Transfer Acceleration to speed up data access
- Create an Athena workgroup with increased concurrent query limit
- Use AWS Glue ETL to compact files into larger Parquet files with Snappy compression
- Enable S3 Select to push down filtering to the storage layer
A data engineer is building an ETL pipeline that processes files uploaded to Amazon S3. The pipeline must: (1) validate the file format, (2) transform the data using AWS Glue, (3) load the data into Amazon Redshift, (4) send a notification on completion or failure. Which service should orchestrate this workflow?
- AWS Lambda with S3 event trigger chaining Lambda functions
- AWS Step Functions with service integrations for Glue, Redshift, and SNS
- AWS Glue Workflow with triggers for each job
- Amazon EventBridge Scheduler to trigger each step sequentially
A company has multiple data pipelines that need to be triggered when specific business events occur, such as 'order completed', 'inventory updated', or 'customer created'. The pipelines are implemented in different services (Lambda, Step Functions, and Glue). Which approach decouples event producers from pipeline consumers?
- Create direct API integrations between services for event notification
- Use Amazon EventBridge with custom event bus and rules to route events to different targets
- Configure SNS topics for each event type with subscribers for each pipeline
- Have each service poll an SQS queue for relevant events
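With the EventBridge approach, producers publish to a custom bus and rules fan events out to Lambda, Step Functions, or Glue targets; neither side knows about the other. A sketch with hypothetical bus, source, and detail values:

```python
import json

# Rule pattern on a custom bus matching 'order completed' events.
event_pattern = {
    "source": ["com.example.orders"],
    "detail-type": ["order completed"],
}
# events = boto3.client("events")
# events.put_rule(Name="order-completed", EventBusName="business-events",
#                 EventPattern=json.dumps(event_pattern))

# A producer publishes without knowing which pipelines consume the event.
event_entry = {
    "EventBusName": "business-events",
    "Source": "com.example.orders",
    "DetailType": "order completed",
    "Detail": json.dumps({"order_id": "o-123", "total": 42.50}),
}
# events.put_events(Entries=[event_entry])
```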
A data engineer is implementing a data quality framework using AWS Glue Data Quality. The framework should validate that: (1) customer_email column has valid email format, (2) order_total is always positive, (3) no duplicate order_ids exist. Which DQDL rules implement these validations?
- ColumnValues "customer_email" matches "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
- ColumnValues "order_total" > 0
- IsUnique "order_id"
- RowCount > 0
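The first three rules above would be combined into a single DQDL ruleset string when configuring AWS Glue Data Quality, for example:

```python
# The three validations from the question as one DQDL ruleset string
# (regex kept identical to the option above). Ruleset name is hypothetical.
dqdl_ruleset = """Rules = [
    ColumnValues "customer_email" matches "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    ColumnValues "order_total" > 0,
    IsUnique "order_id"
]"""
# boto3.client("glue").create_data_quality_ruleset(
#     Name="order-quality", Ruleset=dqdl_ruleset)
```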
A company uses Amazon MWAA (Managed Workflows for Apache Airflow) to orchestrate their data pipelines. They need to pass large amounts of data (100 MB) between DAG tasks. Currently, using XCom causes performance issues and DAG failures. What is the recommended approach?
- Use Airflow's TaskFlow API to directly pass data between tasks
- Store data in Amazon S3 and pass only the S3 path through XCom
- Increase the MWAA environment size to handle larger XCom values
- Split the 100 MB data into smaller chunks and pass multiple XCom values
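The S3-plus-XCom pattern from the second option keeps only a short key in Airflow's metadata database while the payload lives in S3. A pure-Python sketch (bucket and key layout are hypothetical; the boto3 calls are commented out so the shape of the pattern stands alone):

```python
# Pattern sketch: persist the large payload to S3 and pass only the key
# through XCom instead of the 100 MB payload itself.
def produce(run_id: str, payload: bytes) -> str:
    key = f"airflow/intermediate/{run_id}/data.bin"
    # boto3.client("s3").put_object(
    #     Bucket="example-mwaa-data", Key=key, Body=payload)
    return key  # a few bytes in XCom instead of 100 MB

def consume(key: str) -> bytes:
    # return boto3.client("s3").get_object(
    #     Bucket="example-mwaa-data", Key=key)["Body"].read()
    ...

xcom_value = produce("manual__2024-03-07", b"...")
print(xcom_value)  # airflow/intermediate/manual__2024-03-07/data.bin
```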
A data engineer is designing error handling for an AWS Step Functions workflow that processes customer orders. If the payment processing step fails, the workflow should retry 3 times with exponential backoff. If it still fails, the order should be marked as 'payment_failed' and the workflow should continue to send a notification. Which Step Functions feature implements this?
- Configure Retry with exponential backoff and Catch with ResultPath to continue the workflow
- Use a Choice state after payment processing to check for errors
- Implement retry logic within the payment processing Lambda function
- Create a parallel state that runs payment and notification simultaneously
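The Retry-plus-Catch behavior described in the first option is expressed directly in the state definition. A sketch of the payment task state (state names and the Lambda ARN are placeholders), shown here as a Python dict that would be serialized into the Amazon States Language JSON:

```python
# Sketch: Step Functions task state with 3 exponential-backoff retries and
# a Catch that records the error and routes to a failure-handling state.
payment_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",  # payment Lambda (placeholder)
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,   # exponential backoff: 2s, 4s, 8s
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            # Keep the original input; append the error under $.paymentError
            # so downstream states can mark the order as payment_failed.
            "ResultPath": "$.paymentError",
            "Next": "MarkPaymentFailed",
        }
    ],
    "Next": "SendNotification",
}
```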