Experiences from my Google Cloud Professional Data Engineer Exam

Narayan Sharma
4 min read · Nov 13, 2020

Data engineer roles and a few of the products (e.g. machine learning, Hadoop, etc.) are not part of my day-to-day job, but I still wanted to explore some of Google's amazing services (e.g. BigQuery, Dataflow, Pub/Sub, etc.), so I decided to spend a few months on it and was finally able to crack this difficult exam.

Long story short, let me point out a few things that are very important from the exam perspective.

For exam preparation, I created a repo that covers most of the topics that are important for the exam.

Important Topics

  1. BigQuery, BigQuery, BigQuery: (I hope you can tell how important BigQuery is)
    - Partitions (all types of partitions)
    - Federated queries (use cases, limitations, etc)
    - Batch loads vs. streaming inserts (limits, de-duplication)
    - Standard SQL vs Legacy SQL
    - BigQuery decorators (for legacy SQL)
    - User-defined functions (you can write functions in JavaScript)
    - Analytic & geography functions (e.g. OVER, RANK, LEAD, LAG, etc.)
    - Authorized views (Very Important)
    - Types of query priority (interactive vs. batch)
    - Best practices while working with BigQuery (a short client sketch follows after this list):
    -> Avoid SELECT *
    -> Denormalize whenever possible
    -> Use STRUCT and ARRAY types for faster performance and lower cost
    -> LIMIT doesn't reduce the cost (the bytes scanned stay the same)
    -> Partitioning
    -> Clustering
    -> Use APPROX_COUNT_DISTINCT over COUNT(DISTINCT)
    -> Put the largest table on the left while joining
    - Pricing model (e.g. on-demand vs. slot-based)
    -> Should know which is best for a particular scenario
    - Warehouse migrations
    - Finally, access control (this comes up everywhere)
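To make the best-practice bullets above a little more concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and column names are hypothetical, and the query is only for illustration: it creates a day-partitioned, clustered table and then selects only the needed columns with APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT).

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project from the environment

# Hypothetical dataset/table for illustration.
table_id = "my-project.analytics.events"

# Partitioned + clustered table: queries that filter on event_ts and
# user_id scan (and bill) far fewer bytes.
table = bigquery.Table(
    table_id,
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
table.clustering_fields = ["user_id"]
client.create_table(table, exists_ok=True)

# Select only the columns you need (no SELECT *) and prefer
# APPROX_COUNT_DISTINCT when an estimate is acceptable.
sql = """
    SELECT event_name, APPROX_COUNT_DISTINCT(user_id) AS users
    FROM `my-project.analytics.events`
    WHERE event_ts >= TIMESTAMP('2020-11-01')
    GROUP BY event_name
"""
for row in client.query(sql).result():
    print(row.event_name, row.users)
```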
  2. Dataproc: This is also very important from the exam perspective; you can expect around 5 questions on it.
    - Should have good familiarity with big data tools (e.g. HDFS, MapReduce, YARN, Pig, Hive, Spark, Sqoop, Oozie, etc.)
    - Should know it is a managed service for running Hadoop and Spark workloads on the cloud.
    - Should know the advantages of storing data in HDFS vs. Cloud Storage.
    - Should know the ways to save cost (e.g. use preemptible VMs, store data in Cloud Storage buckets, etc.); see the sketch after this list.
    - Should know the advantages/disadvantages of using auto-scaling. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
    - Should know the limits (number of VMs) and limitations of using preemptible VMs.
    https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms
    - Use cases for SSD vs. HDD disks
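As a rough illustration of the cost-saving point above, here is a minimal sketch assuming the google-cloud-dataproc Python client; the project, region, cluster name, and machine types are hypothetical, and the exact cluster shape is only for illustration.

```python
from google.cloud import dataproc_v1

# Hypothetical project/region/cluster names for illustration.
project_id, region, cluster_name = "my-project", "us-central1", "etl-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible by default: cheaper, but they can
        # be reclaimed at any time, so keep persistent data in Cloud Storage.
        "secondary_worker_config": {"num_instances": 4},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)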
  3. Dataflow: Another very important one from the exam perspective.
    - You should know about Apache Beam in a bit more detail, including:
    -> Pipelines: https://beam.apache.org/documentation/pipelines/design-your-pipeline/
    -> PCollections
    -> Transforms
    -> Runners
    -> Side inputs
    -> Various transform functions (e.g. ParDo, Map, FlatMap, GroupByKey, Combine, etc.)
    - Should know how to deal with streaming data (e.g. windowing, triggers, watermarks, etc.)
    - How to handle late-arriving data (allowed lateness); a streaming sketch follows after this list
    - You should know what other features Dataflow provides:
    -> Horizontal autoscaling (adding VMs automatically based on demand)
    -> Pre-defined data processing pipeline templates managed by Google
    -> Dynamic work rebalancing
    - Access control (developer vs. worker roles)
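Here is a minimal Apache Beam (Python SDK) streaming sketch for the windowing/trigger/watermark bullets above. The Pub/Sub topic names, the 1-minute window, and the 2-minute allowed lateness are hypothetical choices for illustration, not values from the exam or the original article.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    AfterWatermark,
)

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "KeyByUser" >> beam.Map(lambda line: (line.split(",")[0], 1))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),  # 1-minute fixed windows in event time
            trigger=AfterWatermark(late=AfterProcessingTime(30)),  # re-fire for late data
            allowed_lateness=120,  # accept data arriving up to 2 minutes late
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}".encode("utf-8"))
        | "WriteCounts" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/counts")
    )
```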
  4. Bigtable: Another big chunk of your exam
    - Should know when to use BigQuery vs. Bigtable
    - Should know how Bigtable stores the data
    - Choosing HDD vs. SSD (and how to change it later if required)
    - Choosing the best row key (very important): how to design a row key; see the sketch after this list
    - How to prevent hotspotting
    - Table design for time-series data vs. other data
    - The role of the app profile (single-cluster vs. multi-cluster routing)
    - Should know how to query Bigtable data (HBase shell, cbt tool)
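Here is a sketch of row-key design for time-series data, assuming the google-cloud-bigtable Python client; the instance, table, column-family names, and values are hypothetical. The idea is to lead the key with a device ID rather than a raw timestamp, so writes spread across tablets instead of hotspotting on one node.

```python
from google.cloud import bigtable

# Hypothetical instance/table/column-family names and values for illustration.
client = bigtable.Client(project="my-project")
table = client.instance("iot-instance").table("sensor_readings")

device_id, timestamp_ms, temp_c = "device-42", 1605264000000, 21.5

# Row key: lead with the device id so writes are spread across tablets
# (no hotspotting on a monotonically increasing timestamp), and use a
# reversed timestamp so the most recent reading sorts first.
row_key = f"{device_id}#{10**13 - timestamp_ms}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("readings", "temp_c", str(temp_c).encode("utf-8"))
row.commit()
```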
  5. PubSub:
    - Should know the push and pull delivery mechanisms (and which is best for real-time analysis).
    - Should know how publishers and subscriptions work (in more detail).
    - Should know the maximum retention period.
    - Should know the limitations (e.g. no ordering guarantee, duplicates may happen, etc.).
    - Should know why a message must be acknowledged once it is received (important); see the sketch after this list.
    - Should know how to process/transform Pub/Sub messages using Dataflow.
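A minimal pull-subscriber sketch, assuming the google-cloud-pubsub Python client, showing where the acknowledgement happens; the project and subscription names are hypothetical.

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# Hypothetical project and subscription names for illustration.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "events-sub")

def callback(message):
    print(f"Received: {message.data!r}")
    # Ack only after processing succeeds; an unacked message is redelivered
    # once its ack deadline expires (one reason duplicates can happen).
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)  # listen for 30 seconds
    except TimeoutError:
        streaming_pull_future.cancel()
```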
  6. AI/ML: Lots of questions from this too. Consider this a very important part of your exam. (I'm also very new to this topic.)
    - I have put together a list of topics in this repo (please follow it if required): https://github.com/narayansharma91/GoogleCloudPlatform/tree/master/CertificationsGuide/Professional%20Data%20Engineer#MACHINE-LEARNING
    (The definitions might be incorrect; I have written them as per my understanding.)
    - I also want to recommend this official course (very important): https://developers.google.com/machine-learning/crash-course
  7. Cloud Storage:
    - Lifecycle management
    - How to securely share bucket data for a limited period (signed URLs); see the sketch after this list
    - Should know how to encrypt using a customer-supplied encryption key (configured in the .boto file)
    - Various transfer service use cases (gsutil vs. Storage Transfer Service vs. Transfer Appliance, etc.)
    - Resumable uploads vs. multipart uploads
    - Access control
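A minimal signed-URL sketch, assuming the google-cloud-storage Python client; the bucket and object names are hypothetical, and the 15-minute expiry is just an example.

```python
import datetime
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-reports-bucket").blob("exports/2020/q3.csv")  # hypothetical names

# A V4 signed URL lets anyone holding the link read this object for
# 15 minutes, without being granted any IAM permission on the bucket.
url = blob.generate_signed_url(
    version="v4",
    expiration=datetime.timedelta(minutes=15),
    method="GET",
)
print(url)
```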
  8. Cloud SQL, Spanner, Datastore: A few questions from these services
    - Cloud SQL (read-replica, size limit, failover replica, etc)
    - Cloud Spanner (should know when to use it, preventing hotspotting, secondary indexes, etc.)
    - Datastore: Should know indexes, strong vs. eventual consistency, export and import capabilities/limitations, etc.
  9. Dataprep & Data Fusion: You should know the use cases of these services.
  10. Cloud Composer: You should know when to use this service, and that it is a managed Apache Airflow service; a minimal DAG sketch follows below.
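Since Cloud Composer is managed Apache Airflow, a minimal Airflow DAG sketch may help; it uses Airflow 1.x-style imports (the version Composer shipped around 2020), and the DAG and task names are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Hypothetical DAG/task names; Composer picks this file up from the
# environment's dags/ folder in Cloud Storage.
with DAG(
    dag_id="daily_export",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    extract >> load
```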

NOTE: This is just my experience. If you want more detailed visibility into all the services, please follow the above repo URL.

The courses that I followed while preparing

  1. Coursera (8 out of 10): this is a one-stop course that covers everything related to the data engineer exam:
    https://www.coursera.org/professional-certificates/gcp-data-engineering
  2. A Cloud Guru (5 out of 10):
    https://acloudguru.com/course/google-cloud-certified-professional-data-engineer-la
  3. And the golden rule: read the documentation.
  4. The extensive list of URLs maintained by Sathish VJ: https://github.com/sathishvj/awesome-gcp-certifications/blob/master/professional-data-engineer.md

My Certification

Verification Link:
https://www.credential.net/95a182a9-b37a-43cf-ab72-db865c09e44e
Other Certifications:
credential.net/profile/narayansharma349/wallet

Would you like to connect with me to share or discuss any concerns, or do you need any help? Connect with me on LinkedIn: https://www.linkedin.com/in/narayansharma91/
