Experiences from my Google Cloud Professional Data Engineer Exam
Data engineering and a few of its products (eg. Machine Learning, Hadoop, etc) aren’t part of my day-to-day job, but I still wanted to explore a few amazing Google services (eg. BigQuery, Dataflow, PubSub, etc). So I spent a few months preparing and finally managed to crack this difficult exam.
Long story short, let me point out the topics that are very important from the exam's perspective.
For exam preparation, I have created a repo that covers most of these topics.
Important Topics
- BigQuery, BigQuery, BigQuery: (I hope you have understood how important BigQuery is)
- Partitions (all types of partitions)
- Federated queries (use cases, limitations, etc)
- Batch operations vs streaming operations (limits, de-duplication)
- Standard SQL vs Legacy SQL
- BigQuery table decorators (for legacy SQL)
- User-defined functions (we can write functions in JavaScript)
- Analytical & Geographical functions (eg. OVER, RANK, LEAD, LAG, etc)
- Authorized views (Very Important)
- Type of query processing (Interactive vs Batch)
- Best practice while working with BigQuery
-> Avoid SELECT *
-> De-normalize whenever possible
-> Use STRUCT & ARRAY for faster performance and lower cost.
-> LIMIT generally doesn’t reduce the bytes scanned, so it doesn’t lower the cost
-> Partition
-> Clustering
-> Use APPROX_COUNT_DISTINCT over COUNT(DISTINCT) when an approximate result is acceptable
-> Put the largest table on the left while joining
- Pricing Model (eg. on-demand vs slots based)
-> Should know which is best for a particular scenario.
-> Warehouse Migrations
-> Finally, access control (this is everywhere)
- Dataproc: This is also very important from the exam's perspective; you can expect around 5 questions on it.
- Should be familiar with big data tools (eg. HDFS, MapReduce, YARN, Pig, Hive, Spark, Sqoop, Oozie, etc)
- Should know it’s a managed service for running Hadoop and Spark workloads on the cloud.
- Should know the advantages of storing data in HDFS vs Cloud Storage.
- Should know the ways to save cost (eg. use preemptible VMs, store data in a bucket, etc)
- Should know the advantages/disadvantages of using auto-scaling. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
- Should know the limit (number of VMs) and the limitations of using preemptible VMs.
https://cloud.google.com/dataproc/docs/concepts/compute/secondary-vms
- Uses of SSD/HDD (use cases)
- Dataflow: Another very important service (from the exam's perspective)
- You should know Apache Beam in a little more detail, including:
-> Pipelines: https://beam.apache.org/documentation/pipelines/design-your-pipeline/
-> PCollection:
-> Transform:
-> Runner:
-> Side Input:
-> Various transform functions (eg. Element, ParDo, Map, FlatMap, GroupByKey, Combine, etc)
- Should know how to deal with streaming data (eg. windowing, triggers, watermarks, etc)
- How to handle late-arriving data
- You should know what other features Dataflow provides
-> Horizontal autoscaling (adding VMs automatically based on demand)
-> Pre-defined data processing pipeline templates managed by Google
-> Dynamic work rebalancing
- Access control: (Developer vs worker)
- Bigtable: Another big chunk of your exam
- Should know when to use BigQuery vs Bigtable
- Should know how Bigtable stores the data
- Choose HDD/SSD (also should know how to change if required)
- Choose the best row key (very important) (how to design row key)
- How to prevent hot-spotting
- Table design for time-series data vs other
- What is the role of the app profile (also single-cluster routing vs multi-cluster routing).
- Should know how to query Bigtable data (HBase, cbt tool)
- PubSub:
- Should know the push and pull mechanisms (and which is best for real-time analysis)
- You should know in detail how publishers and subscriptions work
- Should know the maximum retention periods.
- Should know the limitations, eg. ordering, possible duplicate delivery, etc.
- Should know why we need to acknowledge a message once it is received (Important)
- Should know how to process/transform PubSub messages using Dataflow.
- AI/ML: Lots of questions come from this too. Consider it a very important part of your exam. (I’m also very new to this topic)
- I have created a list of topics in this repo (please follow if required): https://github.com/narayansharma91/GoogleCloudPlatform/tree/master/CertificationsGuide/Professional%20Data%20Engineer#MACHINE-LEARNING
(The definitions might be incorrect; I have written them as per my understanding)
- I also want to refer you to this official course (very important): https://developers.google.com/machine-learning/crash-course
- Cloud Storage:
- Lifecycle management
- How to securely share bucket data for a limited period (signed URL)
- Should know how to encrypt using a customer-supplied encryption key (configured in the .boto file)
- Various transfer services use cases (gsutil vs transfer service vs transfer appliance, etc)
- Resumable upload vs multi-part uploads.
- Access control
- Cloud SQL, Spanner, Datastore: A few questions come from these services
- Cloud SQL (read-replica, size limit, failover replica, etc)
- Cloud Spanner (should know when to use, prevent hot-spotting, secondary index, etc)
- Datastore: Should know indexes, strong vs eventual consistency, export and import capabilities/limitations, etc.
- Dataprep & Data Fusion: You should know the use cases of these services.
- Cloud Composer: You should know when to use this service, and that it’s a managed Airflow service.
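To make the Bigtable row-key and hot-spotting points from the list above a bit more concrete, here is a minimal Python sketch of the idea as I understand it. It does not call Bigtable at all. The function names, the `sensor-N` ids, and the `MAX_TS` bound are made up for illustration, and counting shared key prefixes is only a rough stand-in for how rows end up grouped, since the real splitting is managed by the service.

```python
from collections import Counter

# Assumed upper bound on millisecond timestamps, used only to reverse them.
MAX_TS = 10**13


def timestamp_first_key(device_id: str, ts_ms: int) -> str:
    """Anti-pattern: the key starts with the timestamp, so new rows are adjacent."""
    return f"{ts_ms:013d}#{device_id}"


def entity_first_key(device_id: str, ts_ms: int) -> str:
    """Key starts with the device id; the reversed timestamp sorts newest first."""
    return f"{device_id}#{MAX_TS - ts_ms:013d}"


def leading_prefix_spread(keys, prefix_len=8):
    """Count how many keys share each leading prefix (a rough proxy for locality)."""
    return Counter(key[:prefix_len] for key in keys)


# 1000 hypothetical readings from 4 sensors, one millisecond apart.
readings = [(f"sensor-{i % 4}", 1_700_000_000_000 + i) for i in range(1000)]

hot = leading_prefix_spread(timestamp_first_key(d, t) for d, t in readings)
spread = leading_prefix_spread(entity_first_key(d, t) for d, t in readings)

print("timestamp-first prefixes:", len(hot))     # every write shares one prefix
print("entity-first prefixes:  ", len(spread))   # writes are split per sensor
```

With timestamp-first keys every new write sorts next to the previous one, which is the hot-spotting pattern to avoid. Leading with the entity id spreads writes across key ranges, and the reversed timestamp keeps the latest reading for each sensor first in a scan.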
NOTE: This is just my experience. If you want detailed coverage of all the services, please follow the repo URL above.
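The PubSub points above (acknowledgement, possible duplicates, de-duplication) are easier to remember with a toy example. The sketch below is plain Python and does not use the real Pub/Sub client: `ToySubscription` and `process_batch` are hypothetical names, and the toy redelivers every unacknowledged message on each pull, whereas the real service redelivers after the acknowledgement deadline passes.

```python
from dataclasses import dataclass, field


@dataclass
class ToySubscription:
    """Hypothetical in-memory stand-in for a pull subscription (not the real client)."""
    pending: dict = field(default_factory=dict)  # message_id -> payload, not yet acked

    def publish(self, message_id: str, payload: str) -> None:
        self.pending[message_id] = payload

    def pull(self):
        # Every unacknowledged message is delivered again on each pull.
        return list(self.pending.items())

    def ack(self, message_id: str) -> None:
        self.pending.pop(message_id, None)


def process_batch(sub, seen_ids, results, fail_ack_for=()):
    """Process one pull, skipping ids already seen; optionally 'lose' some acks."""
    for message_id, payload in sub.pull():
        if message_id not in seen_ids:       # de-duplicate on the message id
            seen_ids.add(message_id)
            results.append(payload.upper())  # the actual "work"
        if message_id not in fail_ack_for:   # simulate an ack that never arrives
            sub.ack(message_id)


sub = ToySubscription()
sub.publish("m1", "hello")
sub.publish("m2", "world")

seen, out = set(), []
process_batch(sub, seen, out, fail_ack_for={"m2"})  # the ack for m2 is lost
redelivered = [message_id for message_id, _ in sub.pull()]
process_batch(sub, seen, out)                       # m2 comes back, de-dup skips it

print(out, redelivered)
```

The message whose ack was lost is delivered a second time, which is why the subscriber has to acknowledge and also be ready to ignore duplicates.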
The courses that I followed while preparing
- Coursera (8 out of 10): this is a one-stop course that covers everything related to data engineering:
https://www.coursera.org/professional-certificates/gcp-data-engineering
- A Cloud Guru (5 out of 10):
https://acloudguru.com/course/google-cloud-certified-professional-data-engineer-la
- And the golden rule is to read the documentation:
- The extensive list of URLs maintained by sathish vj: https://github.com/sathishvj/awesome-gcp-certifications/blob/master/professional-data-engineer.md
My Certification
Verification Link:
https://www.credential.net/95a182a9-b37a-43cf-ab72-db865c09e44e
Other Certifications:
credential.net/profile/narayansharma349/wallet
Other Certifications Guides:
Associate Cloud Engineer:
https://medium.com/@narayansharma91/how-i-was-able-to-clear-my-google-cloud-engineer-exam-c8553835fbb0
Professional Cloud Architect: https://medium.com/@narayansharma91/important-topics-which-need-to-be-gear-up-in-order-to-passed-google-cloud-architect-certification-2cea4e3f3534
Professional Security Engineer:
https://medium.com/@narayansharma91/important-topics-to-passed-google-cloud-professional-cloud-security-engineer-certification-8f2dfcf9ff44
Professional Cloud Engineer:
https://narayansharma91.medium.com/notes-guides-from-my-gcp-professional-cloud-engineer-exam-ec6581607a36
Would you like to connect with me to share, discuss any concerns, or get help? Connect with me on LinkedIn: https://www.linkedin.com/in/narayansharma91/