Experiences from my Google Cloud Professional Data Engineer Exam

Narayan Sharma
Nov 13, 2020

Data engineering, and some of the products it involves (eg. Machine Learning, Hadoop, etc), are not part of my day-to-day job. Still, I wanted to explore a few amazing services by Google (eg. BigQuery, Dataflow, PubSub, etc), so I spent a few months on them and was finally able to crack this difficult exam.

Long story short, let me point out a few things that are very important from the exam perspective.

For exam preparation, I created a repo that covers most of the topics that are very important for the exam.

Important Topics

  1. BigQuery, BigQuery, BigQuery: (I hope you have understood how important BigQuery is)
    - Partitions (all types of partitions)
    - Federated queries (use cases, limitations, etc)
    - Batch vs streaming operations (limits, de-duplication)
    - Standard SQL vs Legacy SQL
    - BigQuery table decorators (for legacy SQL)
    - User-defined functions (we can write functions in JavaScript)
    - Analytic & geography functions (eg. OVER, RANK, LEAD, LAG, etc)
    - Authorized views (Very Important)
    - Type of query processing (Interactive vs Batch)
    - Best practices while working with BigQuery
    -> Avoid SELECT *
    -> Denormalize whenever possible
    -> Use STRUCT & ARRAY for faster performance and lower cost.
    -> LIMIT doesn’t reduce the cost (on a non-clustered table the query still scans the full columns)
    -> Partitioning
    -> Clustering
    -> Use APPROX_COUNT_DISTINCT over COUNT(DISTINCT)
    -> Put the largest table on the left while joining
    - Pricing model (eg. on-demand vs slot-based)
    -> Should know which is best for a particular scenario.
    - Warehouse migrations
    - Finally, access control (this is everywhere)
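To make the "Avoid SELECT *" and "LIMIT doesn’t impact the cost" points concrete, here is a minimal sketch of how on-demand cost follows the bytes scanned per column. The table, its per-row column sizes, and the price per TiB are all made-up assumptions for illustration; check the current pricing page for real numbers.

```python
# Hypothetical table: column name -> average stored bytes per row (made-up numbers).
COLUMN_BYTES_PER_ROW = {"user_id": 8, "event_time": 8, "country": 2, "payload": 2000}

TIB = 2 ** 40  # bytes in one tebibyte


def bytes_scanned(columns, row_count, limit=None):
    """Estimate bytes read by a full scan of the referenced columns.

    `limit` is accepted but deliberately ignored: in this simplified model
    (an unpartitioned, unclustered table) LIMIT trims the result, not the scan.
    """
    per_row = sum(COLUMN_BYTES_PER_ROW[c] for c in columns)
    return per_row * row_count


def on_demand_cost(scanned, price_per_tib):
    """Cost of a query under on-demand pricing; the price is an assumption."""
    return scanned / TIB * price_per_tib


ROWS = 1_000_000_000
ASSUMED_PRICE_PER_TIB = 5.0  # assumption, not an official figure

select_star = bytes_scanned(list(COLUMN_BYTES_PER_ROW), ROWS)
two_columns = bytes_scanned(["user_id", "country"], ROWS)
select_star_limit = bytes_scanned(list(COLUMN_BYTES_PER_ROW), ROWS, limit=10)

for label, scanned in [
    ("SELECT *", select_star),
    ("SELECT * ... LIMIT 10", select_star_limit),
    ("SELECT user_id, country", two_columns),
]:
    print(f"{label}: {on_demand_cost(scanned, ASSUMED_PRICE_PER_TIB):.2f}")
```

In this model the only thing that lowers the bill is reading fewer columns (or, in real BigQuery, fewer partitions and clustered blocks), which is why the exam keeps coming back to these practices.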
  2. Dataproc: This is also very important from the exam perspective; you can expect around 5 questions on it.
    - Should be familiar with Big Data tools (eg. HDFS, MapReduce, YARN, Pig, Hive, Spark, Sqoop, Oozie, etc)
    - Should know it’s a managed service to run Hadoop and Spark workloads on the cloud.
    - Should know the advantages of storing data in HDFS vs Cloud Storage.
    - Should know the ways to save cost (eg. use preemptible VMs, store data in a bucket, etc)
    - Should know the advantages/disadvantages of using auto-scaling. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
    - Should know the limits (number of VMs) and limitations of using preemptible VMs.
    - Uses of SSD/HDD (use case)
  3. Dataflow: Another very important one (from the exam perspective)
    - You should know about Apache Beam (in a little more detail), including:
    -> Pipelines: https://beam.apache.org/documentation/pipelines/design-your-pipeline/
    -> Transform:
    -> Runner:
    -> Side Input:
    -> Various transform functions (eg. element-wise ParDo, Map, FlatMap, plus GroupByKey, Combine, etc)
    - Should know how to deal with streaming data (eg. windowing, triggers, watermarks, etc)
    - How to handle late-arriving data
    - You should know what other features Dataflow provides
    -> Horizontal autoscaling(adding VM automatically based on demand)
    -> Predefined data processing pipeline templates managed by Google
    -> Dynamic work rebalancing
    - Access Control: (Developer vs worker)
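The windowing, watermark and late-data ideas above can be sketched in plain Python. This is a toy model and not Apache Beam code: the watermark here is simply the largest event time seen so far (a real runner estimates it differently), and the function name, window size and allowed lateness are assumptions chosen for the example.

```python
from collections import defaultdict


def run_fixed_windows(events, window_size=60, allowed_lateness=30):
    """Sum values per fixed event-time window, dropping data that is too late.

    `events` is a list of (event_time_seconds, value) in ARRIVAL order.
    The watermark is modelled as the maximum event time seen so far.
    """
    sums = defaultdict(int)
    dropped = []
    watermark = 0
    for event_time, value in events:
        start = event_time - (event_time % window_size)  # window this event belongs to
        window_end = start + window_size
        if watermark >= window_end + allowed_lateness:
            # The window closed and its lateness budget is used up.
            dropped.append((event_time, value))
        else:
            # On time, or late but still within the allowed lateness.
            sums[start] += value
        watermark = max(watermark, event_time)
    return dict(sums), dropped


# Arrival order differs from event-time order: 55 and 40 arrive late.
EVENTS = [(5, 1), (50, 2), (70, 3), (55, 4), (130, 5), (40, 6)]

sums, dropped = run_fixed_windows(EVENTS)
print(sums)     # {0: 7, 60: 3, 120: 5}
print(dropped)  # [(40, 6)]
```

The event at second 55 arrives after the watermark passed its window but within the allowed lateness, so it is still counted; the event at second 40 arrives too late and is discarded. In Beam, triggers decide when (and how often) each window's result is emitted, which this sketch leaves out.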
  4. Bigtable: Another big chunk of your exam
    - Should know when to use BigQuery vs Bigtable
    - Should know how Bigtable stores the data
    - Choose HDD/SSD (also should know how to change if required)
    - Choose the best row key (very important) (how to design row key)
    - How to prevent hot-spotting
    - Table design for time-series data vs other
    - What is the role of the app profile (also single routing policy vs multi-cluster routing).
    - Should know how to query the Bigtable data (HBase shell, cbt tool)
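As an illustration of row key design for time-series data, here is a minimal sketch of one commonly recommended pattern: lead with an identifier instead of the timestamp (so writes spread across the key space) and append a reversed timestamp (so the newest row of each device sorts first). The key layout, the `make_row_key` helper, and the `MAX_TS` bound are my own assumptions for the example, not something prescribed by the exam.

```python
MAX_TS = 10**13 - 1  # assumed upper bound for millisecond timestamps (13 digits)


def make_row_key(device_id: str, ts_millis: int) -> str:
    """Build a key like 'sensor-a#9999999998999'.

    Bigtable sorts rows lexicographically by key, so:
    - the device id prefix keeps each device's rows together and avoids
      every new write landing at the end of the table (hot-spotting);
    - the zero-padded reversed timestamp makes newer rows sort first.
    """
    reversed_ts = MAX_TS - ts_millis
    return f"{device_id}#{reversed_ts:013d}"


keys = [
    make_row_key("sensor-a", 1000),
    make_row_key("sensor-b", 1500),
    make_row_key("sensor-a", 2000),
]

for key in sorted(keys):
    print(key)
```

Sorting the keys shows sensor-a's reading from t=2000 ahead of its reading from t=1000, with sensor-b's rows in their own contiguous range. A key that starts with the raw timestamp would instead send all current writes to the same node.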
  5. PubSub:
    - Should know the push and pull mechanisms (and which is best for real-time analysis).
    - You should know in detail how publishers and subscriptions work.
    - Should know the maximum retention period.
    - Should know the limitations, eg. ordering, possible duplicate delivery, etc.
    - Should know why we need to acknowledge a message once it is received (important)
    - Should know how to process/transform PubSub messages using Dataflow.
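The acknowledgement and duplication points can be sketched without the client library. The function below is a hypothetical, simplified subscriber loop (not the PubSub API): it assumes each delivery carries a message ID, processes each ID only once, and acknowledges every delivery so that it stops being redelivered.

```python
def handle_deliveries(deliveries, process):
    """Process at-least-once deliveries idempotently.

    `deliveries` is a list of (message_id, payload) pairs; the same
    message_id may appear more than once, as it can with redelivery.
    Returns the list of message IDs that were acknowledged.
    """
    seen_ids = set()
    acked = []
    for message_id, payload in deliveries:
        if message_id not in seen_ids:
            process(payload)          # do the real work only once per message
            seen_ids.add(message_id)
        # Ack duplicates too; an unacked message is redelivered after its
        # acknowledgement deadline expires.
        acked.append(message_id)
    return acked


processed = []
deliveries = [("m1", "a"), ("m2", "b"), ("m1", "a")]  # m1 is delivered twice
acked = handle_deliveries(deliveries, processed.append)

print(processed)  # ['a', 'b']
print(acked)      # ['m1', 'm2', 'm1']
```

In a real pipeline the set of seen IDs would need to live somewhere durable, or the de-duplication would be left to Dataflow; the in-memory set here is only for illustration.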
  6. AI/ML: Lots of questions come from this too. Consider this a very important part of your exam. (I’m also very new to this topic)
    - I have created a list of topics in this repo (please follow if required): https://github.com/narayansharma91/GoogleCloudPlatform/tree/master/CertificationsGuide/Professional%20Data%20Engineer#MACHINE-LEARNING
    (The definitions might be incorrect; I have written them as per my understanding)
    - I also want to refer you to this official course (very important): https://developers.google.com/machine-learning/crash-course
  7. Cloud Storage:
    - Lifecycle management
    - How to securely share bucket data for a limited period (signed URL)
    - Should know how to encrypt using a customer-supplied encryption key (configured in the .boto file)
    - Various transfer services use cases (gsutil vs transfer service vs transfer appliance, etc)
    - Resumable upload vs multi-part uploads.
    - Access control
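For lifecycle management, here is a small sketch that builds a lifecycle configuration in the JSON shape used by `gsutil lifecycle set`. The `lifecycle_config` helper and the 30/365 day thresholds are assumptions for the example; the rule simply moves objects to Nearline after a while and deletes them later.

```python
import json


def lifecycle_config(nearline_after_days, delete_after_days):
    """Return a bucket lifecycle configuration as a plain dict."""
    return {
        "rule": [
            {
                # Move objects to a cheaper storage class once they are old enough.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                # Delete objects once they pass the retention threshold.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


config = lifecycle_config(30, 365)
print(json.dumps(config, indent=2))
```

Saving the printed JSON to a file and applying it with `gsutil lifecycle set <file> gs://<bucket>` is one way to attach it to a bucket; the file and bucket names are placeholders.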
  8. Cloud SQL, Spanner, Datastore: A few questions come from these services
    - Cloud SQL (read-replica, size limit, failover replica, etc)
    - Cloud Spanner (should know when to use, prevent hot-spotting, secondary index, etc)
    - Datastore: Should know indexes, strong vs eventual consistency, export and import capabilities/limitations, etc.
  9. Dataprep & Data Fusion: You should know the use cases of these services.
  10. Cloud Composer: You should know when to use this service, and that it’s a managed Airflow service.

NOTE: this is just my experience. If you want more details on all of the services, please follow the repo URL above.

The courses that I followed while preparing

  1. Coursera (8 out of 10): this is a one-stop course that covers everything related to data engineering:
  2. A Cloud Guru: (5 out of 10):
  3. And the golden rule is to read the documentation:
  4. The extensive list of URLs maintained by sathish vj: https://github.com/sathishvj/awesome-gcp-certifications/blob/master/professional-data-engineer.md

My Certification

Verification Link:
Other Certifications:

Would you like to connect with me to share or discuss any concerns, or do you need any help? Connect with me on LinkedIn: https://www.linkedin.com/in/narayansharma91/

