AWS Certified Big Data Specialty Tips

By Sergio Deras

November 27, 2020

In this article, you will find relevant information about the topics covered in the exam, as well as the strategy I used to prepare for and pass the AWS Certified Big Data - Specialty exam.

The road to certification

I passed the AWS Certified Big Data - Specialty exam on July 11, 2018 with a score of 74%. I began studying in March, so it took me four months to get comfortable with the technologies covered in the exam. I did not start from zero, since I had previous experience with Big Data analysis and visualization, but it was a good challenge anyway.

To study, I went through the A Cloud Guru Big Data - Specialty course about three times: the first to get familiar with all the terms, services and technologies, the second to take notes on key points, and the third to evaluate my knowledge. I did not have much time to look at the FAQs, but when I had a doubt, I used them as my main reference.

If I were your mentor on this topic, I would say that this course gives you a great baseline for understanding the services, but you will also need some experience with architectural solutions. You may not pass the exam if you only know the limits, costs, main features and limitations of the key services; this time you need to know how to combine two or more services to get the right answer. My suggestion is that you study the presented use cases in more depth and look for blogs, white papers and other advanced material to increase your knowledge.

As for practice exams and questions, I could not find a free set of sample questions on the web other than those offered by AWS, so I bought the Udemy practice exams and, toward the end, the Whizlabs ones too. If you are serious about taking the exam, I recommend that you pay at least for the Whizlabs tests. The questions were pretty much the same in both sources; I got the impression that the same people write them. The main advantage of Whizlabs is that the answers are explained in great detail, whereas Udemy gives you no more than three lines to justify the answer. Besides justifying the right answer thoroughly, Whizlabs provides supplementary diagrams and explains why the other options are wrong or invalid.

I got pretty good scores on my Whizlabs practice exams, but that was because I did the Udemy exams first. I took the Udemy exams twice, and on the first attempt I barely passed two of the four exams. I learned from my mistakes, and on the second attempt I scored above the passing mark on them.

Deep dive into the exam

This is by far the most complicated certification exam I have ever taken. As I mentioned in the previous section, you will need to select the best answer for scenarios that involve two to five services, and many questions try to cover the whole spectrum of topics (from collection to visualization). So if you do not have experience with solution architecture in the AWS ecosystem or with Big Data solutions, read as many use cases and blogs as you can. You will have around two minutes to answer each question. I followed the advice to read the last part of the question first and note the main concern (lowest cost, increased reliability, most secure way, etc.), then take a quick glance at the answers to eliminate some of them or check the similarities between them. The questions are rarely short, so do not expect quick sentences like: "What's the retention period for Shards in Kinesis?" or "How do you set backups on RedShift instances?" or even "What's the maximum DynamoDB record size?". You will not get more than two of those, but you still have to know these numbers to solve the scenario-based questions.

The services that I found in the exam

I want to mention the services that appeared in my exam to help you know where to focus your study. The following points are just the tip of the iceberg and should be used as a high-level overview to guide your prioritization. Services that are heavily covered and used in the scenarios:

Kinesis Stream and Firehose
- Around 1/3 of the questions were related to these services, or listed them as one of the incorrect answer options
- At a minimum, practice with the KPL and KCL and observe how they work, to complement the theory
- KPL concepts:
  - Batching - Perform a single action on multiple items (records) instead of repeating the action for each one
  - Aggregation - Use one record to carry multiple records that will share the same partition key
  - Collection - Use one request to add multiple records
- KCL: Remember that it uses DynamoDB to keep track of its state
- AWS SDK: Remember the differences from the KPL; for example, it can create and delete streams, but you have to manage aggregation yourself
- Kinesis Agent: Good choice for delivering logs to a Kinesis stream
- Kinesis Firehose is a fully managed service; you can run Lambda functions on the incoming data, or reuse existing functions to ingest logs
- Concepts
  - Shards: Main storage container
    - Split and merge: Split a shard when you need to increase throughput because a "hot" shard is receiving a lot of records with the same partition key; merge adjacent shards once they "cool" down a bit
    - Capacity of a shard: 1 MB/s for writes, 2 MB/s for reads, up to 1,000 PUT records per second
  - Streams: Container of shards
  - Records
    - Partition Key: ID used to assign the record to a shard
    - Sequence number: Order of arrival
    - BLOB or data: The information that we want to collect, up to 1 MB before Base64-encoding
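The shard numbers above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal Python sketch, not an official sizing tool: `shards_needed` uses only the per-shard limits listed above, and `shard_index` illustrates why records with the same partition key always land on the same shard. Kinesis hashes the partition key with MD5 into a 128-bit integer; the assumption here (mine, for illustration) is that the stream's hash key ranges are split evenly across shards. Both function names are hypothetical.

```python
import hashlib
import math

# Per-shard limits as listed above.
WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0
WRITE_RECORDS_PER_SHARD = 1000


def shards_needed(write_mb_s, read_mb_s, records_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
    )


def shard_index(partition_key, shard_count):
    """Map a partition key to a shard, ASSUMING evenly split hash key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * shard_count // 2**128


# 3.5 MB/s of writes is the tightest constraint here, so 4 shards are needed.
print(shards_needed(write_mb_s=3.5, read_mb_s=5, records_s=2500))  # 4

# The same key always maps to the same shard, which is how a shard gets "hot".
print(shard_index("device-42", 4) == shard_index("device-42", 4))  # True
```

If one partition key dominates the traffic, adding shards does not spread that key's records; a more varied partition key is the usual remedy.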
S3
- Minimum object size for IA
- Share data between regions
- Encryption types
- Lifecycle
EMR
- Type of clusters
  - Transient: Performs the work and terminates; good if you can store the data in EMRFS or it can be aggregated
  - Long-running: Stays on to handle processes that can take a long time
- Instance types
  - Spot: Cheapest but not available all the time
  - Reserved: Cheaper and good if you process your data on regular hours/days
- Presto
- Hive
- HBase
- Spark
DynamoDB Indexes
- GSI: Use Global Secondary Indexes when you need to use a different partition key
- LSI: Use Local Secondary Indexes when you need to change the sort key
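To make the GSI/LSI distinction concrete, here is a small Python sketch of a table definition. Everything in it is hypothetical (the `Orders` table, the attribute names, the index names); the dictionary follows the parameter shape that, as far as I know, boto3's DynamoDB `create_table` call accepts, but nothing is sent to AWS here.

```python
# Hypothetical "Orders" table; all names are illustrative only.
table_definition = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "order_total", "AttributeType": "N"},
        {"AttributeName": "product_id", "AttributeType": "S"},
    ],
    # Base table: partition key customer_id, sort key order_date.
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},
        {"AttributeName": "order_date", "KeyType": "RANGE"},
    ],
    # LSI: same partition key as the table, different sort key.
    "LocalSecondaryIndexes": [
        {
            "IndexName": "by_total",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "order_total", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    # GSI: a different partition key altogether.
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "by_product",
            "KeySchema": [
                {"AttributeName": "product_id", "KeyType": "HASH"},
                {"AttributeName": "order_date", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}


def partition_key(key_schema):
    """Return the name of the HASH (partition) key in a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


table_pk = partition_key(table_definition["KeySchema"])
lsi_pk = partition_key(table_definition["LocalSecondaryIndexes"][0]["KeySchema"])
gsi_pk = partition_key(table_definition["GlobalSecondaryIndexes"][0]["KeySchema"])
print(table_pk, lsi_pk, gsi_pk)  # customer_id customer_id product_id
```

A query such as "all orders for a product" needs the GSI, while "this customer's orders sorted by total" can be served by the LSI.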
Data Pipeline
- Use cases
RedShift
- Star configuration: How your records are stored in the cluster
- Commands
  - Copy: Load data from S3
  - Unload: Put your data into S3
- Use S3 to load multiple files into RedShift: split the file into smaller files whose count is a multiple of the number of nodes in the cluster, keep them between 1GB and
- I got two questions related to how to hide info with queries (views)
- Manual Backup vs. Automatic Backup: If you need to keep your data for more than a month, you need to make a manual backup. Automatic backups are incremental
- Remember that RedShift is not for OLTP
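The file-splitting tip above comes down to simple arithmetic. Here is a rough Python sketch, assuming you already know the total size of the data and the number of nodes. `file_count` and its `max_file_mb` parameter are hypothetical names of mine; choose the size cap from the current RedShift loading guidance rather than from the example value used here.

```python
import math


def file_count(total_size_mb, node_count, max_file_mb):
    """Smallest multiple of node_count that keeps every file at or below max_file_mb."""
    min_files = math.ceil(total_size_mb / max_file_mb)
    return max(1, math.ceil(min_files / node_count)) * node_count


# 10,000 MB over a 4-node cluster with a 1,000 MB cap (arbitrary example value):
# at least 10 files are needed, rounded up to the next multiple of 4.
print(file_count(total_size_mb=10_000, node_count=4, max_file_mb=1_000))  # 12
```

With evenly sized files whose count is a multiple of the node count, every node gets the same share of the COPY work.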
Machine Learning Models
- Binary: Answers direct questions (yes or no)
- Multiclass: Picks one answer from a set of options (a cat, a dog, a human)
- Regression: Predicts a numeric value (a number)
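For the "What type of ML model will you use to...?" style of question, it helps to look at the shape of the value you want to predict. The following is a deliberately rough heuristic in Python, not how any AWS service decides; `model_type` is a hypothetical helper for illustration.

```python
def model_type(target_values):
    """Guess the model type from example values of the column to predict."""
    distinct = set(target_values)
    if len(distinct) == 2:
        return "binary"  # two possible outcomes, e.g. yes/no
    if all(isinstance(v, (int, float)) for v in distinct):
        return "regression"  # a number on a continuous scale
    return "multiclass"  # one label out of several


print(model_type([True, False, True]))      # binary
print(model_type(["cat", "dog", "human"]))  # multiclass
print(model_type([12.5, 300.0, 41.2]))      # regression
```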
Encryption
- At rest for
  - S3
  - RedShift
  - EMR
- KMS
ElasticSearch

Other tools and services that are covered on a regular basis in the exam:

S3
- EMRFS
- Consistent view: Allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS.
Glacier Vault - Keep your data safe forever
DynamoDB
EMR
- Instance types to use
- HUE - UI to manage EMR
RDS
- Parameter groups - Organize your clusters
AWS SDK
- PutRecord and PutRecords
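PutRecord sends one record per request, while PutRecords sends several in a single call. Below is a minimal Python sketch of how records could be grouped before calling PutRecords. The limits in the constants (500 records and 5 MB per request, partition keys included) are what I understand the AWS documentation to state, so verify them before relying on them. `batch_records` is a hypothetical helper and no AWS call is made here.

```python
# Assumed PutRecords limits; check the current AWS documentation.
MAX_RECORDS_PER_REQUEST = 500
MAX_BYTES_PER_REQUEST = 5 * 1024 * 1024


def batch_records(records):
    """Group (partition_key, data_bytes) pairs into PutRecords-sized batches."""
    batches, current, current_bytes = [], [], 0
    for partition_key, data in records:
        size = len(partition_key.encode("utf-8")) + len(data)
        # Start a new batch when the current one is full by count or by size.
        if current and (
            len(current) == MAX_RECORDS_PER_REQUEST
            or current_bytes + size > MAX_BYTES_PER_REQUEST
        ):
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"PartitionKey": partition_key, "Data": data})
        current_bytes += size
    if current:
        batches.append(current)
    return batches


records = [(f"device-{i % 10}", b"x" * 100) for i in range(1200)]
batches = batch_records(records)
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch has the shape of the `Records` argument of boto3's Kinesis `put_records` call. PutRecords can partially fail, so a real producer would also need to inspect the response and retry the failed records.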
IoT components
- Device Gateway
- Rule Config
D3 - Helps to create web graphics of your data
Quicksight - Display your data in different formats
Lambda
SQS as an alternative to Kinesis
- When you need to decouple the storage tier, need to retain your data for a longer period of time, or your consumer is not available all the time
STS - Security Token Service

Some of the services and tools that I was expecting but did not appear in the exam

R language
Security
- CloudHSM
RedShift
- Commands
  - VACUUM
  - EXPLAIN
  - ENCODE
- Constraints
EMR
- ZooKeeper
RDS / DynamoDB data types
Nothing about manifest files in S3
Encryption in transit

Tips about the questions

Do not expect questions where the answer has fewer than three words; the answers will be two lines long, with several services as part of the solution
You will not find questions of the type:
- "What's the read throughput for a Kinesis shard?"
- "What's the default data retention of a Kinesis shard?"
The closest to these were the ML questions that asked "What type of ML model will you use to...?"

Scenarios and combinations that I remember as (almost) specific questions

How to update records on an existing table in RedShift?
- Use staging tables to add the new information and replace the old table with the new data
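One common way to read the staging-table answer is as a delete-then-insert merge: load the new rows into a staging table, delete the matching rows from the target, then insert everything from staging in a single transaction. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so the steps can be run locally. The `sales` table and its columns are hypothetical, and on RedShift the staging table would typically be loaded with COPY from S3 instead of INSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO sales VALUES (1, 10.0), (2, 20.0);

    -- Staging table holds the incoming batch: one update (id 2), one new row (id 3).
    CREATE TEMP TABLE sales_staging (id INTEGER, amount REAL);
    INSERT INTO sales_staging VALUES (2, 25.0), (3, 30.0);
    """
)

# Merge inside one transaction so readers never see a half-applied update.
with conn:
    conn.execute("DELETE FROM sales WHERE id IN (SELECT id FROM sales_staging)")
    conn.execute("INSERT INTO sales SELECT id, amount FROM sales_staging")

conn.execute("DROP TABLE sales_staging")

rows = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(rows)  # [(1, 10.0), (2, 25.0), (3, 30.0)]
```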
How to encrypt the data that ends up in Kinesis?
- It is already encrypted
Troubleshooting cases of DynamoDB + KCL
- Remember that the KCL uses a DynamoDB table to keep track of the data that has been read; you might need to increase the provisioned throughput of that table in case of high throughput
When to upload files from S3 to RedShift with a time limit
An enterprise wants to migrate data between different regions to run better data analysis, as the new region lacks them. The data is encrypted in S3, but the country's law states that this information must not leave the country
- In this case you need to anonymize the data so that the information you need can be migrated