Professional-Data-Engineer Google Professional Data Engineer Exam Free Practice Exam Questions (2025 Updated)

Prepare for the Google Professional Data Engineer certification exam with this collection of free practice questions. Each question is designed to mirror the actual exam format and objectives and comes with an answer and a detailed explanation. The materials are updated for 2025 so that you can study from current resources and build confidence before your first attempt.

Page: 3 / 4
Total 376 questions

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

A.

Add a SideInput that returns a Boolean if the element is corrupt.

B.

Add a ParDo transform in Cloud Dataflow to discard corrupt elements.

C.

Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.

D.

Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
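
As a rough illustration of option B, the sketch below mimics a per-element transform in plain Python rather than with the Apache Beam SDK: a function is applied to each element and emits nothing for a corrupt one. The is_corrupt rule, the discard_corrupt helper, and the sample readings are all hypothetical; what counts as corrupt depends on the actual IoT payload.

```python
import json
from typing import Iterable, Iterator


def is_corrupt(raw: str) -> bool:
    """Hypothetical rule: a record is corrupt if it is not a JSON object
    or lacks a numeric 'temperature' field."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return True
    if not isinstance(record, dict):
        return True
    return not isinstance(record.get("temperature"), (int, float))


def discard_corrupt(elements: Iterable[str]) -> Iterator[dict]:
    """Per-element processing: emit one parsed record for a valid element
    and nothing at all for a corrupt one."""
    for raw in elements:
        if not is_corrupt(raw):
            yield json.loads(raw)


sample = [
    '{"device": "a1", "temperature": 21.5}',
    '{"device": "a2", "temperature": ',   # truncated payload
    '{"device": "a3"}',                   # missing field
    '{"device": "a4", "temperature": 19}',
]
clean = list(discard_corrupt(sample))
print([record["device"] for record in clean])
```

In a real pipeline the same per-element logic would live inside the pipeline's own transform; the generator here only stands in for that idea.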

You work for a large ecommerce company. You are using Pub/Sub to ingest clickstream data into Google Cloud for analytics. You observe that when a new subscriber connects to an existing topic to analyze data, they are unable to subscribe to older data. For an upcoming yearly sale event in two months, you need a solution that, once implemented, will enable any new subscriber to read the last 30 days of data. What should you do?

A.

Create a new topic, and publish the last 30 days of data each time a new subscriber connects to an existing topic.

B.

Set the topic retention policy to 30 days.

C.

Set the subscriber retention policy to 30 days.

D.

Ask the source system to re-push the data to Pub/Sub, and subscribe to it.

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. What should you do?

A.

Use Analytics Hub to control data access, and provide third-party companies with access to the dataset.

B.

Create a Dataflow job that reads the data in frequent time intervals and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.

C.

Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.

D.

Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.

You are administering a BigQuery on-demand environment. Your business intelligence tool is submitting hundreds of queries each day that aggregate a large (50 TB) sales history fact table at the day and month levels. These queries have a slow response time and are exceeding cost expectations. You need to decrease response time, lower query costs, and minimize maintenance. What should you do?

A.

Build materialized views on top of the sales table to aggregate data at the day and month level.

B.

Build authorized views on top of the sales table to aggregate data at the day and month level.

C.

Enable BI Engine and add your sales table as a preferred table.

D.

Create a scheduled query to build sales day and sales month aggregate tables on an hourly basis.

You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?

A.

Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.

B.

Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.

C.

Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.

D.

Develop an algorithm through simulation to predict variance of data output from the last MapReduce job based on calibration factors, and apply the correction to all data.

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?

A.

Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.

B.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Cloud Storage bucket.

C.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.

D.

Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

An online brokerage company requires a high-volume trade processing architecture. You need to create a secure queuing system that triggers jobs. The jobs will run in Google Cloud and call the company's Python API to execute trades. You need to efficiently implement a solution. What should you do?

A.

Use Cloud Composer to subscribe to a Pub/Sub topic and call the Python API.

B.

Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.

C.

Write an application that makes a queue in a NoSQL database.

D.

Write an application hosted on a Compute Engine instance that makes a push subscription to the Pub/Sub topic.

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

A.

Cloud Scheduler

B.

Cloud Dataflow

C.

Cloud Functions

D.

Cloud Composer
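
The requirements in this question (steps run in a fixed order, each retried a fixed number of times on failure) can be sketched in plain Python. This is only a toy model of workflow orchestration and not the API of any of the listed services; the task names, the run_workflow and run_with_retries helpers, and the simulated transient failure are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_with_retries(task, max_retries):
    """Run task(); on failure retry up to max_retries more times.
    Returns the number of attempts that were needed."""
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return attempts
        except Exception:
            if attempts > max_retries:
                raise


def run_workflow(tasks, dependencies, max_retries=2):
    """Execute tasks in an order that respects dependencies.
    dependencies maps each task name to the set of tasks it waits for."""
    order = list(TopologicalSorter(dependencies).static_order())
    attempts = {name: run_with_retries(tasks[name], max_retries) for name in order}
    return order, attempts


calls = {"hadoop_job": 0}


def flaky_hadoop_job():
    # Hypothetical step that fails once before succeeding.
    calls["hadoop_job"] += 1
    if calls["hadoop_job"] < 2:
        raise RuntimeError("transient failure")


tasks = {
    "shell_script": lambda: None,
    "hadoop_job": flaky_hadoop_job,
    "bigquery_query": lambda: None,
}
dependencies = {
    "hadoop_job": {"shell_script"},
    "bigquery_query": {"hadoop_job"},
}
order, attempts = run_workflow(tasks, dependencies)
print(order, attempts)
```

A managed orchestrator adds scheduling, monitoring, and operators for external systems on top of this ordering-plus-retry core.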

You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys. How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?

A.

Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.

B.

Redact all PII data, and store a version of the unredacted data in a locked-down bucket.

C.

Scan every table in BigQuery, and mask the data it finds that has PII.

D.

Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
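
The referential-integrity requirement in this question comes down to determinism: the same input must always map to the same token so that masked columns can still be joined. The sketch below shows that property with a keyed hash from the Python standard library. It does not call the DLP API and its tokens are not format-preserving; the key, the pseudonymize helper, and the sample rows are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic keyed token: equal inputs give equal tokens,
    and the original value is not recoverable without the key."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:16]


orders = [
    {"email": "ana@example.com", "total": 40},
    {"email": "bo@example.com", "total": 15},
]
profiles = [{"email": "ana@example.com", "tier": "gold"}]

# Mask the join key in both tables independently.
masked_orders = [{**row, "email": pseudonymize(row["email"])} for row in orders]
masked_profiles = [{**row, "email": pseudonymize(row["email"])} for row in profiles]

# The join still works on the tokens.
tier_by_token = {row["email"]: row["tier"] for row in masked_profiles}
joined = [(row["total"], tier_by_token.get(row["email"])) for row in masked_orders]
print(joined)
```

Plain redaction would break this join, because every redacted value collapses to the same placeholder.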

Your startup has a web application that currently serves customers out of a single region in Asia. You are targeting funding that will allow your startup to serve customers globally. Your current goal is to optimize for cost, and your post-funding goal is to optimize for global presence and performance. You must use a native JDBC driver. What should you do?

A.

Use Cloud Spanner to configure a single-region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.

B.

Use a Cloud SQL for PostgreSQL highly available instance first, and Bigtable with US, Europe, and Asia replication after securing funding.

C.

Use a Cloud SQL for PostgreSQL zonal instance first, and Bigtable with US, Europe, and Asia after securing funding.

D.

Use a Cloud SQL for PostgreSQL zonal instance first, and Cloud SQL for PostgreSQL with highly available configuration after securing funding.

Your organization has two Google Cloud projects, project A and project B. In project A, you have a Pub/Sub topic that receives data from confidential sources. Only the resources in project A should be able to access the data in that topic. You want to ensure that project B and any future project cannot access data in the project A topic. What should you do?

A.

Configure VPC Service Controls in the organization with a perimeter around the VPC of project A.

B.

Add firewall rules in project A so only traffic from the VPC in project A is permitted.

C.

Configure VPC Service Controls in the organization with a perimeter around project A.

D.

Use Identity and Access Management conditions to ensure that only users and service accounts in project A can access resources in project A.

You recently deployed several data processing jobs into your Cloud Composer 2 environment. You notice that some tasks are failing in Apache Airflow. On the monitoring dashboard, you see an increase in the total workers’ memory usage, and there were worker pod evictions. You need to resolve these errors. What should you do?

Choose 2 answers

A.

Increase the directed acyclic graph (DAG) file parsing interval.

B.

Increase the memory available to the Airflow workers.

C.

Increase the maximum number of workers and reduce worker concurrency.

D.

Increase the memory available to the Airflow triggerer.

E.

Increase the Cloud Composer 2 environment size from medium to large.

You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service. What should you do?

A.

Deploy a Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://

B.

Deploy a Cloud Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://

C.

Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://

D.

Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://

Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and has asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?

A.

Increase the CPU size on your server.

B.

Increase the size of the Google Persistent Disk on your server.

C.

Increase your network bandwidth from your datacenter to GCP.

D.

Increase your network bandwidth from Compute Engine to Cloud Storage.

You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses. What should you do?

A.

Ensure that the firewall rules allow access to Cloud Storage and BigQuery. Use Dataflow with only internal IPs.

B.

Ensure that your workers have network tags to access Cloud Storage and BigQuery. Use Dataflow with only internal IP addresses.

C.

Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow, Cloud Storage, and BigQuery as allowed services in the perimeter. Use Dataflow with only internal IP addresses.

D.

Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses.

You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)

A.

Get more training examples

B.

Reduce the number of training examples

C.

Use a smaller set of features

D.

Use a larger set of features

E.

Increase the regularization parameters

F.

Decrease the regularization parameters
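
To make the regularization options concrete, here is a small sketch of what the regularization parameter does in the simplest possible setting. It assumes a one-feature linear model without intercept and an L2 (ridge) penalty, which is not something the question specifies; the data points and the fit_ridge_1d helper are made up for illustration.

```python
def fit_ridge_1d(xs, ys, lam):
    """Weight w minimizing sum((y - w*x)**2) + lam * w**2.
    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]

# A larger penalty pulls the fitted weight toward zero.
weights = {lam: fit_ridge_1d(xs, ys, lam) for lam in (0.0, 1.0, 10.0)}
for lam, w in weights.items():
    print(f"lambda={lam:>4}: w={w:.3f}")
```

Shrinking weights this way limits how closely the model can chase noise in the training set, at the cost of some bias.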

A web server sends click events to a Pub/Sub topic as messages. The web server includes an eventTimestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages shows no more than 1 second of lag between their eventTimestamp and publishTime. What is the problem and what should you do?

A.

The advertising department is causing delays when consuming the messages. Work with the advertising department to fix this.

B.

Messages in your Dataflow job are processed in less than 30 seconds, but your job cannot keep up with the backlog in the Pub/Sub subscription. Optimize your job or increase the number of workers to fix this.

C.

The web server is not pushing messages fast enough to Pub/Sub. Work with the web server team to fix this.

D.

Messages in your Dataflow job are taking more than 30 seconds to process. Optimize your job or increase the number of workers to fix this.

MJTelco is building a custom interface to share data. They have these requirements:

    They need to do aggregations over their petabyte-scale datasets.

    They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

A.

Cloud Datastore and Cloud Bigtable

B.

Cloud Bigtable and Cloud SQL

C.

BigQuery and Cloud Bigtable

D.

BigQuery and Cloud Storage

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

A.

Ensure all the tables are included in a global dataset.

B.

Ensure each table is included in a dataset for a region.

C.

Adjust the settings for each table to allow a related region-based security group view access.

D.

Adjust the settings for each view to allow a related region-based security group view access.

E.

Adjust the settings for each dataset to allow a related region-based security group view access.

MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

A.

Rowkey: date#device_id; Column data: data_point

B.

Rowkey: date; Column data: device_id, data_point

C.

Rowkey: device_id; Column data: date, data_point

D.

Rowkey: data_point; Column data: device_id, date

E.

Rowkey: date#data_point; Column data: device_id
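
Because Bigtable keeps rows sorted lexicographically by row key, the choice of key decides which queries become contiguous reads. The sketch below models that with a sorted in-memory mapping and uses the composite layout from option A purely as an example; it is not the Bigtable client API and does not model write hotspotting or other trade-offs. The sample readings and the build_rows and scan_prefix helpers are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical readings: (date, device_id, data_point).
readings = [
    ("2024-03-01", "dev-002", 17),
    ("2024-03-01", "dev-001", 12),
    ("2024-03-02", "dev-001", 15),
    ("2024-03-01", "dev-001", 13),
]


def build_rows(items):
    """Group data points under a composite 'date#device_id' row key
    and keep the keys in lexicographic order, as a sorted store would."""
    rows = defaultdict(list)
    for date, device_id, data_point in items:
        rows[f"{date}#{device_id}"].append(data_point)
    return dict(sorted(rows.items()))


def scan_prefix(rows, prefix):
    """Return all rows whose key starts with prefix by jumping to the first
    candidate key and reading forward until the prefix no longer matches."""
    keys = list(rows)
    start = bisect_left(keys, prefix)
    result = {}
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        result[key] = rows[key]
    return result


rows = build_rows(readings)
one_device_day = scan_prefix(rows, "2024-03-01#dev-001")
whole_day = scan_prefix(rows, "2024-03-01#")
print(one_device_day)
print(list(whole_day))
```

With this layout a device-and-day lookup touches a single key, while a key that omits either field would force the query to filter inside rows or across the whole table.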

Copyright © 2014-2025 Solution2Pass. All Rights Reserved