Google Professional Data Engineer (Professional-Data-Engineer) Exam: Free Practice Questions (2025 Updated)

Prepare effectively for the Google Professional Data Engineer (Professional-Data-Engineer) certification with our extensive collection of free, high-quality practice questions. Each question is designed to mirror the actual exam format and objectives, complete with comprehensive answers and detailed explanations. Our materials are regularly updated for 2025, ensuring you have the most current resources to build confidence and succeed on your first attempt.

Page: 3 / 4
Total 387 questions

You are developing a software application using Google's Dataflow SDK and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?

A.

PCollection

B.

Transform

C.

Pipeline

D.

Sink API
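
For context, here is a minimal sketch in the Apache Beam Python SDK (the successor to the Dataflow SDK) showing how Transforms carry the processing logic, and how applying several Transforms to the same PCollection produces a branching pipeline; the element values and step names are illustrative.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    events = pipeline | "Create" >> beam.Create(["a", "bb", "ccc"])

    # Branch 1: a Transform that applies conditional logic to each element.
    short = events | "KeepShort" >> beam.Filter(lambda s: len(s) < 3)

    # Branch 2: a second Transform applied to the same PCollection.
    lengths = events | "Lengths" >> beam.Map(len)

    short | "PrintShort" >> beam.Map(print)
    lengths | "PrintLengths" >> beam.Map(print)
```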

When using Cloud Dataproc clusters, you can access the YARN web interface by configuring a browser to connect through a ____ proxy.

A.

HTTPS

B.

VPN

C.

SOCKS

D.

HTTP
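
As a rough sketch of the documented pattern (cluster name, zone, and port are placeholder assumptions): open an SSH tunnel to the Dataproc master node with dynamic port forwarding, then point a browser at the resulting SOCKS proxy to reach the YARN web UI.

```python
import subprocess

# Open the tunnel: -D starts a dynamic (SOCKS) proxy, -N skips a remote shell.
subprocess.Popen([
    "gcloud", "compute", "ssh", "my-cluster-m",  # hypothetical master node name
    "--zone=us-central1-a",
    "--", "-D", "1080", "-N",
])

# Then launch a browser through the proxy, e.g.:
#   chrome --proxy-server="socks5://localhost:1080" http://my-cluster-m:8088
```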

How would you query specific partitions in a BigQuery table?

A.

Use the DAY column in the WHERE clause

B.

Use the EXTRACT(DAY) clause

C.

Use the _PARTITIONTIME pseudo-column in the WHERE clause

D.

Use DATE BETWEEN in the WHERE clause
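
For illustration, a minimal sketch with the google-cloud-bigquery client that filters an ingestion-time-partitioned table on the _PARTITIONTIME pseudo-column, so BigQuery scans only the targeted partitions; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT *
    FROM `my_project.my_dataset.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2025-01-01') AND TIMESTAMP('2025-01-02')
"""
for row in client.query(query).result():
    print(row)
```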

You have a job that you want to cancel. It is a streaming pipeline, and you want to ensure that any data that is in-flight is processed and written to the output. Which of the following commands can you use on the Dataflow monitoring console to stop the pipeline job?

A.

Cancel

B.

Drain

C.

Stop

D.

Finish
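
Draining can also be requested programmatically. A hedged sketch against the Dataflow REST API (project, region, and job ID are placeholders): setting requestedState to JOB_STATE_DRAINED asks the streaming job to finish processing in-flight data before it stops.

```python
from googleapiclient import discovery

dataflow = discovery.build("dataflow", "v1b3")
dataflow.projects().locations().jobs().update(
    projectId="my-project",
    location="us-central1",
    jobId="my-job-id",
    body={"requestedState": "JOB_STATE_DRAINED"},  # drain rather than cancel
).execute()
```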

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

A.

Store the common data in BigQuery as partitioned tables.

B.

Store the common data in BigQuery and expose authorized views.

C.

Store the common data encoded as Avro in Google Cloud Storage.

D.

Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

A.

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

B.

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

C.

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

D.

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

A.

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

B.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

C.

Use the NOW() function in BigQuery to record the event’s time.

D.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
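
To make the trade-off concrete, here is a minimal sketch of attaching event time and package ID as Pub/Sub message attributes on the publisher side, so downstream analysis can use the device's event time rather than arrival time; the project, topic, payload, and attribute names are hypothetical.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "package-tracking")

publisher.publish(
    topic_path,
    b'{"lat": 37.42, "lng": -122.08}',
    timestamp="2025-01-15T12:00:00Z",  # event time recorded by the device
    package_id="PKG-00042",
).result()
```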

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

A.

Export the data into a Google Sheet for visualization.

B.

Create an additional table with only the necessary columns.

C.

Create a view on the table to present to the visualization tool.

D.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.
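
A view-based approach can be sketched as follows with the google-cloud-bigquery client: the view exposes only the columns the sales team needs, so the visualization tool never scans the rest of the table. All project, dataset, and column names here are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("my_project.sales_views.customer_summary")
view.view_query = """
    SELECT customer_id, region, last_purchase_date
    FROM `my_project.analytics.customers`
"""
client.create_table(view)  # the tool queries this view, not the wide table
```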

To give a user read permission for only the first three columns of a table, which access control method would you use?

A.

Primitive role

B.

Predefined role

C.

Authorized view

D.

It's not possible to give access to only the first three columns of a table.
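
For reference, a hedged sketch of wiring up an authorized view: the view selects only the three permitted columns, and granting it access to the source dataset lets users query the view without any access to the underlying table. Dataset and view names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
source = client.get_dataset("my_project.analytics")

entry = bigquery.AccessEntry(
    role=None,  # authorized views carry no role of their own
    entity_type="view",
    entity_id={
        "projectId": "my_project",
        "datasetId": "restricted_views",
        "tableId": "first_three_columns",
    },
)
source.access_entries = list(source.access_entries) + [entry]
client.update_dataset(source, ["access_entries"])
```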

You are building a new data pipeline to share data between two different types of applications: job generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do?

A.

Create an API using App Engine to receive and send messages to the applications

B.

Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them

C.

Create a table on Cloud SQL, and insert and delete rows with the job information

D.

Create a table on Cloud Spanner, and insert and delete rows with the job information
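
A Pub/Sub-based design can be sketched like this (project, topic, and subscription names are hypothetical): generators publish to one topic, each runner type gets its own subscription, and new applications attach by adding subscriptions without touching existing ones.

```python
from google.cloud import pubsub_v1

project = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# One topic for all job generators.
topic_path = publisher.topic_path(project, "jobs")
publisher.create_topic(request={"name": topic_path})

# One subscription per runner application; adding more does not affect others.
sub_path = subscriber.subscription_path(project, "runner-a")
subscriber.create_subscription(request={"name": sub_path, "topic": topic_path})

publisher.publish(topic_path, b'{"job_id": 1, "action": "resize"}').result()
```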

Your neural network model is taking days to train. You want to increase the training speed. What can you do?

A.

Subsample your test dataset.

B.

Subsample your training dataset.

C.

Increase the number of input features to your model.

D.

Increase the number of layers in your neural network.
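
To illustrate the idea behind subsampling, a minimal NumPy sketch (shapes and the 10% sampling rate are arbitrary assumptions): fewer training examples means less work per epoch, trading some accuracy for speed.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
X_train = rng.random((1_000_000, 50))  # stand-in for the real training features
y_train = rng.random(1_000_000)        # stand-in for the real labels

idx = rng.choice(len(X_train), size=100_000, replace=False)  # keep a 10% sample
X_small, y_small = X_train[idx], y_train[idx]
```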

Your company's data platform ingests CSV file dumps of booking and user profile data from upstream sources into Cloud Storage. The data analyst team wants to join these datasets on the email field available in both the datasets to perform analysis. However, personally identifiable information (PII) should not be accessible to the analysts. You need to de-identify the email field in both the datasets before loading them into BigQuery for analysts. What should you do?

A.

1. Create a pipeline to de-identify the email field by using recordTransformations in Cloud Data Loss Prevention (Cloud DLP) with masking as the de-identification transformation type.
2. Load the booking and user profile data into a BigQuery table.

B.

1. Create a pipeline to de-identify the email field by using recordTransformations in Cloud DLP with format-preserving encryption with FFX as the de-identification transformation type.
2. Load the booking and user profile data into a BigQuery table.

C.

1. Load the CSV files from Cloud Storage into a BigQuery table, and enable dynamic data masking.
2. Create a policy tag with the email mask as the data masking rule.
3. Assign the policy to the email field in both tables.
4. Assign the Identity and Access Management bigquerydatapolicy.maskedReader role for the BigQuery tables to the analysts.

D.

1. Load the CSV files from Cloud Storage into a BigQuery table, and enable dynamic data masking.
2. Create a policy tag with the default masking value as the data masking rule.
3. Assign the policy to the email field in both tables.
4. Assign the Identity and Access Management bigquerydatapolicy.maskedReader role for the BigQuery tables to the analysts.
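
The reason format-preserving encryption matters here is that, unlike masking, it maps the same email to the same token in both files, so the analysts can still join the de-identified datasets. A heavily hedged sketch of a Cloud DLP recordTransformations call with FFX (the project path, KMS key, wrapped key bytes, and sample row are all placeholders):

```python
import google.cloud.dlp_v2 as dlp_v2

client = dlp_v2.DlpServiceClient()
response = client.deidentify_content(
    request={
        "parent": "projects/my-project/locations/global",
        "deidentify_config": {
            "record_transformations": {
                "field_transformations": [{
                    "fields": [{"name": "email"}],
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": {
                            "crypto_key": {"kms_wrapped": {
                                "wrapped_key": b"...",  # elided: KMS-wrapped data key
                                "crypto_key_name": "projects/my-project/locations/global/keyRings/kr/cryptoKeys/dlp",
                            }},
                            "common_alphabet": "ALPHA_NUMERIC",
                        }
                    },
                }]
            }
        },
        "item": {"table": {
            "headers": [{"name": "email"}],
            "rows": [{"values": [{"string_value": "user@example.com"}]}],
        }},
    }
)
print(response.item)  # the same input email yields the same token on every run
```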

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

A.

cron

B.

Cloud Composer

C.

Cloud Scheduler

D.

Workflow Templates on Cloud Dataproc
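
For a sense of what this looks like in practice, a hedged Cloud Composer (Airflow) DAG sketch, assuming apache-airflow-providers-google is installed; project, bucket, cluster, and template names are hypothetical, and operator arguments vary slightly across Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

with DAG("daily_pipeline", start_date=datetime(2025, 1, 1), schedule="@daily") as dag:
    prepare = DataprocSubmitJobOperator(
        task_id="dataproc_prepare",
        project_id="my-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/prepare.py"},
        },
    )
    transform = DataflowTemplatedJobStartOperator(
        task_id="dataflow_transform",
        project_id="my-project",
        location="us-central1",
        template="gs://my-bucket/templates/transform",
    )
    prepare >> transform  # Dataflow runs only after the Dataproc job succeeds
```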

You have a BigQuery dataset named "customers". All tables will be tagged by using a Data Catalog tag template named "gdpr". The template contains one mandatory field, "has sensitive data", with a boolean value. All employees must be able to do a simple search and find tables in the dataset that have either true or false in the "has sensitive data" field. However, only the Human Resources (HR) group should be able to see the data inside the tables for which "has sensitive data" is true. You give the all employees group the bigquery.metadataViewer and bigquery.connectionUser roles on the dataset. You want to minimize configuration overhead. What should you do next?

A.

Create the "gdpr" tag template with private visibility. Assign the bigquery -dataViewer role to the HR group on the tables that contain sensitive data.

B.

Create the ~gdpr" tag template with private visibility. Assign the datacatalog. tagTemplateViewer role on this tag to the all employeesgroup, and assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.

C.

Create the "gdpr" tag template with public visibility. Assign the bigquery. dataViewer role to the HR group on the tables that containsensitive data.

D.

Create the "gdpr" tag template with public visibility. Assign the datacatalog. tagTemplateViewer role on this tag to the all employees.group, and assign the bijquery.dataViewer role to the HR group on the tables that contain sensitive data.

Your company needs to ingest and transform streaming data from IoT devices and store it for analysis. The data is sensitive and requires encryption with your own key in transit and at rest. The volume of data is expected to fluctuate significantly throughout the day. You need to identify a solution that is managed and elastic. What should you do?

A.

Write data directly into BigQuery by using the Storage Write API, and process it in BigQuery by using SQL functions, selecting a Google-managed encryption key for each service.

B.

Publish data to Pub/Sub, process it with Dataflow and store it in Cloud SQL, selecting your key from Cloud HSM for each service.

C.

Publish data to Pub/Sub, process it with Dataflow and store it in BigQuery, selecting your key from Cloud KMS for each service.

D.

Write data directly into Cloud Storage, process it with Dataproc, and store it in BigQuery, selecting a customer-managed encryption key (CMEK) for each service.
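
Each service in a Pub/Sub, Dataflow, and BigQuery pipeline can be pointed at a customer-managed Cloud KMS key. As one hedged example, here is CMEK applied to the BigQuery destination table; the key path and table name are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table("my_project.iot.telemetry")
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name="projects/my-project/locations/us/keyRings/iot/cryptoKeys/telemetry-key"
)
client.create_table(table)  # data at rest is encrypted with your KMS key
```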

You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness. What should you do?

A.

Create authorized datasets to publish shared data in the subscribing team's project.

B.

Create a new dataset for sharing in each individual team's project. Grant the subscribing team the bigquery.dataViewer role on the dataset.

C.

Use BigQuery Data Transfer Service to copy datasets to a centralized BigQuery project for sharing.

D.

Use Analytics Hub to facilitate data sharing.

An aerospace company uses a proprietary data format to store its flight data. You need to connect this new data source to BigQuery and stream the data into BigQuery. You want to efficiently import the data into BigQuery while consuming as few resources as possible. What should you do?

A.

Use a standard Dataflow pipeline to store the raw data in BigQuery and then transform the format later when the data is used.

B.

Write a shell script that triggers a Cloud Function that performs periodic ETL batch jobs on the new data source

C.

Use Apache Hive to write a Dataproc job that streams the data into BigQuery in CSV format

D.

Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format
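
To make the Dataflow option concrete, a heavily hedged Beam sketch: a placeholder parser stands in for the custom connector that decodes the proprietary format, and rows are streamed into BigQuery. The topic, table, schema, and the parser itself are illustrative assumptions.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_proprietary(record: bytes) -> dict:
    # Placeholder for the custom-format decoder.
    return {"flight_id": record.decode("utf-8"), "altitude_m": 0}

with beam.Pipeline(options=PipelineOptions(streaming=True)) as pipeline:
    (
        pipeline
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/flight-data")
        | beam.Map(parse_proprietary)
        | beam.io.WriteToBigQuery(
            "my_project:aero.flight_data",
            schema="flight_id:STRING,altitude_m:INTEGER",
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )
```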

You are designing a real-time system for a ride hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources to Pub/Sub, processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization. What should you do?

A.

Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore.

B.

Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore.

C.

Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.

D.

Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.
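
The "last 30 seconds, every 2 seconds" requirement maps directly onto a hopping window, which Beam calls SlidingWindows. A minimal sketch (keys, values, and timestamps are synthetic stand-ins for the real events; the Memorystore write is omitted):

```python
import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows, TimestampedValue

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("cell-42", 1), ("cell-42", 1), ("cell-7", 1)])
        | beam.Map(lambda kv: TimestampedValue(kv, 0))  # event time in epoch seconds
        | beam.WindowInto(SlidingWindows(size=30, period=2))  # 30s window, 2s hop
        | beam.CombinePerKey(sum)  # ride demand per map cell, per window
        | beam.Map(print)
    )
```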

After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You’ve loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.

What should you do?

A.

Select random samples from the tables using the RAND() function and compare the samples.

B.

Select random samples from the tables using the HASH() function and compare the samples.

C.

Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.

D.

Create stratified random samples using the OVER() function and compare equivalent samples from each table.
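
The same sort-and-hash idea can also be sketched directly in BigQuery SQL, for illustration only: an order-independent fingerprint over the non-timestamp columns of each table, which matches when the contents match. Table and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      (SELECT BIT_XOR(FARM_FINGERPRINT(CONCAT(CAST(col_a AS STRING), '|', CAST(col_b AS STRING))))
       FROM `my_project.etl.original_output`) AS original_hash,
      (SELECT BIT_XOR(FARM_FINGERPRINT(CONCAT(CAST(col_a AS STRING), '|', CAST(col_b AS STRING))))
       FROM `my_project.etl.migrated_output`) AS migrated_hash
"""
print(list(client.query(query).result()))  # equal hashes suggest identical contents
```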

Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII). Your company requires PII to be access controlled, encrypted, and compliant with major data protection standards. In addition to using Cloud Data Loss Prevention (Cloud DLP), you want to follow Google-recommended practices and use service accounts to control access to PII. What should you do?

A.

Assign the required Identity and Access Management (IAM) roles to every employee, and create a single service account to access protected resources

B.

Use one service account to access a Cloud SQL database and use separate service accounts for each human user

C.

Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users

D.

Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group
