Google Professional Data Engineer (Professional-Data-Engineer) Exam: Free Practice Exam Questions (2025 Updated)

Prepare for the Google Professional Data Engineer certification exam with our collection of free practice questions. Each question is designed to mirror the actual exam format and objectives, and comes with an answer and a detailed explanation. The materials are regularly updated for 2025.

Page: 3 / 4
Total 383 questions

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?

A.

Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory

B.

Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS

C.

Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up

D.

Allocate an additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

You have a table that contains millions of rows of sales data, partitioned by date. Various applications and users query this data many times a minute. The query requires aggregating values by using AVG, MAX, and SUM, and does not require joining to other tables. The required aggregations are only computed over the past year of data, though you need to retain full historical data in the base tables. You want to ensure that the query results always include the latest data from the tables, while also reducing computation cost, maintenance overhead, and duration. What should you do?

A.

Create a materialized view to aggregate the base table data. Configure a partition expiration on the base table to retain only the last one year of partitions.

B.

Create a materialized view to aggregate the base table data. Include a filter clause to specify the last one year of partitions.

C.

Create a new table that aggregates the base table data. Include a filter clause to specify the last year of partitions. Set up a scheduled query to recreate the new table every hour.

D.

Create a view to aggregate the base table data. Include a filter clause to specify the last year of partitions.

You are running your BigQuery project in the on-demand billing model and are executing a change data capture (CDC) process that ingests data. The CDC process loads 1 GB of data every 10 minutes into a temporary table, and then performs a merge into a 10 TB target table. This process is very scan intensive and you want to explore options to enable a predictable cost model. You need to create a BigQuery reservation based on utilization information gathered from BigQuery Monitoring and apply the reservation to the CDC process. What should you do?

A.

Create a BigQuery reservation for the job.

B.

Create a BigQuery reservation for the service account running the job.

C.

Create a BigQuery reservation for the dataset.

D.

Create a BigQuery reservation for the project.
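
To get a feel for why this workload is described as scan intensive, the figures in the scenario can be multiplied out. This is only a back-of-the-envelope sketch: the assumption that every merge scans the full 10 TB target table is a worst case introduced here for illustration and is not stated in the question, and the function names are hypothetical.

```python
# Worst-case scan volume for the CDC scenario above.
# Assumption (not from the question): each merge scans the whole target table.

MINUTES_PER_DAY = 24 * 60


def merges_per_day(interval_minutes):
    """Number of merge runs per day for a fixed load interval."""
    return MINUTES_PER_DAY // interval_minutes


def daily_scan_tb(interval_minutes, target_tb, staged_gb):
    """Terabytes scanned per day if every merge reads the target plus the staged batch."""
    per_merge_tb = target_tb + staged_gb / 1000  # decimal units: 1 TB = 1000 GB
    return merges_per_day(interval_minutes) * per_merge_tb


runs = merges_per_day(10)
scanned = daily_scan_tb(10, target_tb=10, staged_gb=1)
print(f"{runs} merges/day, roughly {scanned:.0f} TB scanned/day in the worst case")
```

Under on-demand billing, cost follows bytes scanned, so a volume of this order is what makes a capacity-based reservation worth evaluating against the utilization data.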

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

A.

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

B.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

C.

Use the NOW() function in BigQuery to record the event’s time.

D.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
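
As an illustration of what attaching a timestamp and package ID at the publishing device could look like, the sketch below models a message as a payload plus string attributes and then orders messages by the device's event time instead of by arrival order. It uses plain dictionaries only, not the Pub/Sub client library; the attribute names `package_id` and `event_time`, the payload fields, and the sample values are all hypothetical.

```python
import json
from datetime import datetime, timezone


def build_message(package_id, event_time):
    """Model a published message: a bytes payload plus string attributes."""
    payload = {"package_id": package_id, "status": "scanned"}
    return {
        "data": json.dumps(payload).encode("utf-8"),
        "attributes": {
            "package_id": package_id,
            "event_time": event_time.isoformat(),  # set on the device at send time
        },
    }


def order_by_event_time(messages):
    """Sort messages by the device-side event time, not by arrival order."""
    return sorted(
        messages,
        key=lambda m: datetime.fromisoformat(m["attributes"]["event_time"]),
    )


# Two messages that reach the subscriber out of order.
first_to_arrive = build_message("PKG-1", datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
second_to_arrive = build_message("PKG-2", datetime(2025, 1, 1, 11, 59, tzinfo=timezone.utc))

ordered = order_by_event_time([first_to_arrive, second_to_arrive])
print([m["attributes"]["package_id"] for m in ordered])  # ['PKG-2', 'PKG-1']
```

A timestamp assigned only on receipt would record the arrival order, which can differ from the order in which the events actually happened.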

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

A.

Store the common data in BigQuery as partitioned tables.

B.

Store the common data in BigQuery and expose authorized views.

C.

Store the common data encoded as Avro in Google Cloud Storage.

D.

Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

An aerospace company uses a proprietary data format to store its flight data. You need to connect this new data source to BigQuery and stream the data into BigQuery. You want to efficiently import the data into BigQuery while consuming as few resources as possible. What should you do?

A.

Use a standard Dataflow pipeline to store the raw data in BigQuery and then transform the format later when the data is used

B.

Write a shell script that triggers a Cloud Function that performs periodic ETL batch jobs on the new data source

C.

Use Apache Hive to write a Dataproc job that streams the data into BigQuery in CSV format

D.

Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

A.

Export the data into a Google Sheet for visualization.

B.

Create an additional table with only the necessary columns.

C.

Create a view on the table to present to the visualization tool.

D.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

A.

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

B.

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

C.

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

D.

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.

A.

Blaze

B.

Spark

C.

Fire

D.

Ignite

Cloud Dataproc charges you only for what you really use with _____ billing.

A.

month-by-month

B.

minute-by-minute

C.

week-by-week

D.

hour-by-hour

Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?

A.

You expect to store at least 10 TB of data.

B.

You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.

C.

You need to integrate with Google BigQuery.

D.

You will not use the data to back a user-facing or latency-sensitive application.

Dataproc clusters contain many configuration files. To update these files, you will need to use the --properties option. The format for the option is: file_prefix:property=_____.

A.

details

B.

value

C.

null

D.

id
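
The `file_prefix:property=...` shape of the `--properties` option can be made concrete with a small helper that assembles the flag from a mapping. This is a minimal sketch: the helper name is hypothetical and the two settings are only examples (the `spark` prefix targets spark-defaults.conf and the `core` prefix targets core-site.xml).

```python
def format_properties(props):
    """Join {(file_prefix, property): setting} pairs into one --properties argument."""
    return ",".join(
        f"{prefix}:{name}={setting}" for (prefix, name), setting in props.items()
    )


# Example settings only; pick the properties your cluster actually needs.
props = {
    ("spark", "spark.executor.memory"): "4g",
    ("core", "io.file.buffer.size"): "65536",
}

flag = "--properties=" + format_properties(props)
print(flag)
# --properties=spark:spark.executor.memory=4g,core:io.file.buffer.size=65536
```

The resulting string would be passed to `gcloud dataproc clusters create` along with the cluster name and region.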

How would you query specific partitions in a BigQuery table?

A.

Use the DAY column in the WHERE clause

B.

Use the EXTRACT(DAY) clause

C.

Use the _PARTITIONTIME pseudo-column in the WHERE clause

D.

Use DATE BETWEEN in the WHERE clause
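
For reference, ingestion-time partitioned tables in BigQuery expose a `_PARTITIONTIME` pseudo-column that can be filtered on to limit which partitions are scanned. The sketch below only builds the query text; the table name is hypothetical, and plain string formatting like this is suitable for trusted values only.

```python
from datetime import date


def partition_filter_query(table, start, end):
    """Return a query that restricts the scan to partitions between start and end."""
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONTIME BETWEEN TIMESTAMP('{start.isoformat()}') "
        f"AND TIMESTAMP('{end.isoformat()}')"
    )


# Hypothetical table name used purely for illustration.
sql = partition_filter_query(
    "my_project.my_dataset.events", date(2025, 1, 1), date(2025, 1, 7)
)
print(sql)
```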

Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?

A.

Weights

B.

Biases

C.

Continuous features

D.

Input values
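
A toy example can show which numbers change during training. The sketch below fits a single linear neuron with plain gradient descent; it is a simplified illustration rather than a real neural network library, and the data, learning rate, and epoch count are arbitrary choices made for this example.

```python
def train_neuron(samples, epochs=500, lr=0.05):
    """Fit y = w * x + b by per-sample gradient descent on squared error."""
    w, b = 0.0, 0.0  # the learned parameters
    for _ in range(epochs):
        for x, y in samples:
            error = (w * x + b) - y
            w -= lr * error * x  # the weight is adjusted
            b -= lr * error      # the bias is adjusted
            # x and y are only read; the training data itself never changes
    return w, b


samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1
snapshot = list(samples)

w, b = train_neuron(samples)
print(f"w = {w:.3f}, b = {b:.3f}")  # values close to 2 and 1
print("inputs unchanged:", samples == snapshot)
```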

The _________ for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.

A.

Cloud Dataflow connector

B.

Dataflow SDK

C.

BigQuery API

D.

BigQuery Data Transfer Service

What Dataflow concept determines when a Window's contents should be output based on certain criteria being met?

A.

Sessions

B.

OutputCriteria

C.

Windows

D.

Triggers
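
The idea of emitting a window's contents once a criterion is met can be modelled in a few lines. This is a toy simulation in plain Python, not the Dataflow or Apache Beam SDK: the function name is hypothetical, and the criterion shown (a fixed element count) is just one possible condition.

```python
def fire_after_count(elements, count):
    """Yield a pane each time `count` elements have accumulated in the window."""
    pane = []
    for element in elements:
        pane.append(element)
        if len(pane) == count:  # criterion met: output what has been collected
            yield pane
            pane = []
    if pane:
        yield pane  # remaining elements are emitted when the window closes


window_contents = [1, 2, 3, 4, 5]
panes = list(fire_after_count(window_contents, count=2))
print(panes)  # [[1, 2], [3, 4], [5]]
```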

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.

A.

zone

B.

node

C.

label

D.

type

Cloud Bigtable is Google's ______ Big Data database service.

A.

Relational

B.

mySQL

C.

NoSQL

D.

SQL Server

Which TensorFlow function can you use to configure a categorical column if you don't know all of the possible values for that column?

A.

categorical_column_with_vocabulary_list

B.

categorical_column_with_hash_bucket

C.

categorical_column_with_unknown_values

D.

sparse_column_with_keys
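
The hashing approach to categorical values can be sketched without TensorFlow: every string, including one never seen before, maps deterministically to one of a fixed number of buckets. SHA-256 is used here purely as a stand-in, on the assumption that the exact hash is not important for the illustration; TensorFlow uses its own hash function, and the helper name and sample values are hypothetical.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map any string to a bucket index in the range [0, num_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


for city in ["Paris", "Lagos", "a-city-never-seen-in-training"]:
    print(city, "->", hash_bucket(city, num_buckets=100))
```

Because no vocabulary is stored, distinct values can collide in the same bucket; a larger bucket count reduces how often that happens.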

What are all of the BigQuery operations that Google charges for?

A.

Storage, queries, and streaming inserts

B.

Storage, queries, and loading data from a file

C.

Storage, queries, and exporting data

D.

Queries and streaming inserts

Copyright © 2014-2025 Solution2Pass. All Rights Reserved