Databricks-Certified-Professional-Data-Engineer new test camp sheet

NEW QUESTION 34
What are the advantages of hashing features?

Requires less memory

Fewer passes through the training data

Easily reverse engineer vectors to determine which original feature mapped to a vector location
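
As a minimal sketch of the idea behind hashed features, the snippet below maps tokens into a fixed-length count vector using a hash function. The function name `hash_features`, the bucket count, and the sample tokens are all hypothetical. No vocabulary dictionary is built, so memory is bounded by the vector length and one pass over the data suffices. The mapping is not practical to invert, because different tokens can collide in the same bucket.

```python
import hashlib


def hash_features(tokens, n_buckets=8):
    """Map tokens into a fixed-length count vector via hashing (illustrative only)."""
    vector = [0] * n_buckets
    for token in tokens:
        # A stable hash picks the bucket; no token-to-index dictionary is stored.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_buckets
        vector[index] += 1
    return vector


# Hypothetical documents: the vector length stays fixed regardless of vocabulary size.
doc_a = hash_features(["spark", "delta", "spark"])
doc_b = hash_features(["a", "completely", "different", "much", "longer", "vocabulary"])
```

Both vectors have the same length even though the two documents share no tokens, which is what keeps the memory footprint predictable.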

NEW QUESTION 35
A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data
engineer suggests that this is inefficient and the table should simply be overwritten instead.
Which of the following reasons to overwrite the table instead of deleting and recreating the table is incorrect?

Overwriting a table is an atomic operation and will not leave the table in an unfinished state

Overwriting a table maintains the old version of the table for Time Travel

Overwriting a table is efficient because no files need to be deleted

Overwriting a table results in a clean table history for logging and audit purposes

Overwriting a table allows for concurrent queries to be completed while in progress

NEW QUESTION 36
A data engineer has set up a notebook to run automatically using a Job. The data engineer’s manager wants
to version control the schedule due to its complexity.
Which of the following approaches can the data engineer use to obtain a version-controllable configuration of
the Job’s schedule?

They can download the JSON description of the Job from the Job’s page

They can submit the Job once on an all-purpose cluster

They can link the Job to notebooks that are a part of a Databricks Repo

They can submit the Job once on a Job cluster

They can download the XML description of the Job from the Job’s page

NEW QUESTION 37
A table customerLocations exists with the following schema:
    id STRING,
    date STRING,
    city STRING,
    country STRING
A senior data engineer wants to create a new table from this table using the following command:
    CREATE TABLE customersPerCountry AS
    SELECT country,
           COUNT(*) AS customers
    FROM customerLocations
    GROUP BY country;
A junior data engineer asks why the schema is not being declared for the new table. Which of the following
responses explains why declaring the schema is not necessary?

CREATE TABLE AS SELECT statements result in tables that do not support schemas

CREATE TABLE AS SELECT statements assign all columns the type STRING

CREATE TABLE AS SELECT statements adopt schema details from the source table and query

CREATE TABLE AS SELECT statements infer the schema by scanning the data

CREATE TABLE AS SELECT statements result in tables where schemas are optional
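
The behaviour of a CREATE TABLE AS SELECT statement can be tried out locally. The sketch below uses SQLite as a stand-in for Spark SQL purely so that it runs anywhere. That substitution is an assumption of this example, and type handling differs between the two engines. The sample rows are hypothetical. The new table is created without a declared schema, and its column names come from the query.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customerLocations (id TEXT, date TEXT, city TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customerLocations VALUES (?, ?, ?, ?)",
    [
        ("1", "2023-01-01", "Paris", "FR"),
        ("2", "2023-01-02", "Lyon", "FR"),
        ("3", "2023-01-03", "Berlin", "DE"),
    ],
)

# No column list is declared for the new table; it is derived from the SELECT.
conn.execute(
    "CREATE TABLE customersPerCountry AS "
    "SELECT country, COUNT(*) AS customers "
    "FROM customerLocations GROUP BY country"
)

# Inspect the resulting column names and contents.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customersPerCountry)")]
counts = dict(conn.execute("SELECT country, customers FROM customersPerCountry"))
```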

NEW QUESTION 38
A data engineer needs to create a database called customer360 at the location /customer/customer360. The
data engineer is unsure if one of their colleagues has already created the database.
Which of the following commands should the data engineer run to complete this task?

CREATE DATABASE customer360 DELTA LOCATION '/customer/customer360';

CREATE DATABASE customer360 LOCATION '/customer/customer360';

CREATE DATABASE IF NOT EXISTS customer360 DELTA LOCATION '/customer/customer360';

CREATE DATABASE IF NOT EXISTS customer360;

CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';

NEW QUESTION 39
You are asked to create a model to predict the total number of monthly subscribers for a specific magazine.
You are provided with 1 year’s worth of subscription and payment data, user demographic data, and 10 years’
worth of content of the magazine (articles and pictures). Which algorithm is the most appropriate for building
a predictive model for subscribers?

Linear regression

Logistic regression

Decision trees

TF-IDF

NEW QUESTION 40
A junior data engineer needs to create a Spark SQL table my_table for which Spark manages both the data and
the metadata. The metadata and data should also be stored in the Databricks Filesystem (DBFS).
Which of the following commands should a senior data engineer share with the junior data engineer to
complete this task?

CREATE MANAGED TABLE my_table (id STRING, value STRING) USING
    org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");

CREATE TABLE my_table (id STRING, value STRING) USING DBFS;

CREATE TABLE my_table (id STRING, value STRING) USING
    org.apache.spark.sql.parquet OPTIONS (PATH "storage-path")

CREATE TABLE my_table (id STRING, value STRING);

CREATE MANAGED TABLE my_table (id STRING, value STRING);

NEW QUESTION 41
A data engineering team has been using a Databricks SQL query to monitor the performance of an ELT job.
The ELT job is triggered by a specific number of input records being ready to process. The Databricks SQL
query returns the number of minutes since the job’s most recent runtime.
Which of the following approaches can enable the data engineering team to be notified if the ELT job has not
been run in an hour?

They can set up an Alert for the query to notify when the ELT job fails

They can set up an Alert for the accompanying dashboard to notify them if the returned value is greater
than 60

They can set up an Alert for the accompanying dashboard to notify when it has not refreshed in 60
minutes

They can set up an Alert for the query to notify them if the returned value is greater than 60

This type of alerting is not possible in Databricks

NEW QUESTION 42
A junior data engineer has ingested a JSON file into a table raw_table with the following schema:
    cart_id STRING,
    items ARRAY<item_id:STRING>
The junior data engineer would like to unnest the items column in raw_table to result in a new table with the
following schema:
    cart_id STRING,
    item_id STRING
Which of the following commands should the junior data engineer run to complete this task?

SELECT cart_id, flatten(items) AS item_id
FROM raw_table;

SELECT cart_id, reduce(items) AS item_id
FROM raw_table;

SELECT cart_id, slice(items) AS item_id
FROM raw_table;

SELECT cart_id, filter(items) AS item_id
FROM raw_table;

SELECT cart_id, explode(items) AS item_id
FROM raw_table;
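
To make the unnesting concrete, here is a plain-Python sketch of what an explode-style operation does to rows: one output row per array element, with the other columns repeated. The helper `explode_rows` and the sample carts are hypothetical. This mimics the semantics only and is not Spark code. As with Spark's `explode`, a row whose array is empty produces no output rows.

```python
def explode_rows(rows, array_key, out_key):
    """Yield one output row per array element, repeating the other columns."""
    for row in rows:
        for element in row[array_key]:
            # Copy every column except the array, then add the single element.
            new_row = {k: v for k, v in row.items() if k != array_key}
            new_row[out_key] = element
            yield new_row


# Hypothetical contents of raw_table.
raw_table = [
    {"cart_id": "c1", "items": ["i1", "i2"]},
    {"cart_id": "c2", "items": ["i3"]},
    {"cart_id": "c3", "items": []},
]

exploded = list(explode_rows(raw_table, "items", "item_id"))
```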

NEW QUESTION 43
A data engineer has three notebooks in an ELT pipeline. The notebooks need to be executed in a specific order
for the pipeline to complete successfully. The data engineer would like to use Delta Live Tables to manage this
process.
Which of the following steps must the data engineer take as part of implementing this pipeline using Delta
Live Tables?

They need to create a Delta Live Tables pipeline from the Jobs page

They need to refactor their notebook to use Python and the dlt library

They need to create a Delta Live Tables pipeline from the Compute page

They need to create a Delta Live Tables pipeline from the Data page

They need to refactor their notebook to use SQL and CREATE LIVE TABLE keyword

NEW QUESTION 44
Projecting a multi-dimensional dataset onto which vector yields the greatest variance?

first principal component

first eigenvector

not enough information given to answer

second eigenvector

second principal component
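
The relationship between projection and variance can be checked numerically. The sketch below, which uses hypothetical 2-D points and helper names, builds the sample covariance matrix and runs power iteration to approximate the eigenvector with the largest eigenvalue. That eigenvector is the first principal component. It then computes the variance of the data projected onto a unit direction d as d^T C d.

```python
import math


def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_principal_component(cov, iterations=200):
    """Power iteration: approximates the eigenvector with the largest eigenvalue."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(len(vec))) for i in range(len(cov))]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def projected_variance(cov, direction):
    """Variance of the data projected onto a unit vector d, i.e. d^T C d."""
    size = len(cov)
    return sum(
        direction[i] * cov[i][j] * direction[j] for i in range(size) for j in range(size)
    )


# Hypothetical data lying roughly along the diagonal.
points = [(2.0, 1.9), (0.5, 0.7), (3.1, 3.0), (1.2, 0.9), (4.0, 4.2)]
cov = covariance_matrix(points)
pc1 = first_principal_component(cov)
```

Comparing `projected_variance(cov, pc1)` with the value for either coordinate axis shows the projection onto `pc1` carrying more variance than either axis.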

NEW QUESTION 45
A data analyst has noticed that their Databricks SQL queries are running too slowly. They claim that this issue
is affecting all of their sequentially run queries. They ask the data engineering team for help. The data
engineering team notices that each of the queries uses the same SQL endpoint, but the SQL endpoint is not
used by any other user.
Which of the following approaches can the data engineering team use to improve the latency of the data
analyst’s queries?

They can increase the maximum bound of the SQL endpoint’s scaling range

They can increase the cluster size of the SQL endpoint

They can turn on the Auto Stop feature for the SQL endpoint

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to
“Reliability Optimized”

They can turn on the Serverless feature for the SQL endpoint

NEW QUESTION 46
A data architect is designing a data model that works for both video-based machine learning workloads and
highly audited batch ETL/ELT workloads.
Which of the following describes how using a data lakehouse can help the data architect meet the needs of
both workloads?

A data lakehouse requires very little data modeling

A data lakehouse combines compute and storage for simple governance

A data lakehouse fully exists in the cloud

A data lakehouse stores unstructured data and is ACID-compliant

A data lakehouse provides autoscaling for compute clusters

NEW QUESTION 47
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also
used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The
data engineer needs to identify which files are new since the previous run in the pipeline, and set up the
pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?

Unity Catalog

Auto Loader

Data Explorer

Delta Lake

Databricks SQL

NEW QUESTION 48
A data engineering team is in the process of converting their existing data pipeline to utilize Auto Loader for
incremental processing in the ingestion of JSON files. One data engineer comes across the following code
block in the Auto Loader documentation:
    streaming_df = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schemaLocation)
        .load(sourcePath))
Assuming that schemaLocation and sourcePath have been set correctly, which of the following changes does
the data engineer need to make to convert this code block to use Auto Loader to ingest the data?

There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader

There is no change required. Databricks automatically uses Auto Loader for streaming reads

The data engineer needs to change the format("cloudFiles") line to format("autoLoader")

The data engineer needs to add the .autoLoader line before the .load(sourcePath) line

There is no change required. The data engineer needs to ask their administrator to turn on Auto Loader

NEW QUESTION 49
A data engineering team needs to query a Delta table to extract rows that all meet the same condition.
However, the team has noticed that the query is running slowly. The team has already tuned the size of the
data files. Upon investigating, the team has concluded that the rows meeting the condition are sparsely located
throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query?

Tuning the file size

Bin-packing

Data skipping

Write as a Parquet file

Z-Ordering
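
The intuition behind a Z-order layout can be sketched with a toy example. It interleaves the bits of two integer columns into a Morton code, sorts records by that code, and splits them into fixed-size "files". Everything here is an assumption for illustration: the helper names, the 4x4 grid of records, and the four-row file size. It is not how Delta Lake implements the feature internally. The point is that each file ends up with a narrower min/max range on both columns, so per-file statistics can rule out more files.

```python
def interleave_bits(x, y, bits=8):
    """Return the Morton (Z-order) code of two non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return code


def split_into_files(records, rows_per_file=4):
    """Chunk an ordered list of records into consecutive 'files'."""
    return [records[i:i + rows_per_file] for i in range(0, len(records), rows_per_file)]


def files_matching(files, column, value):
    """Count files whose min/max range on one column could contain the value."""
    hits = 0
    for f in files:
        values = [record[column] for record in f]
        if min(values) <= value <= max(values):
            hits += 1
    return hits


# Hypothetical records (x, y) written in x-major order.
records = [(x, y) for x in range(4) for y in range(4)]
plain_files = split_into_files(records)

# The same records, sorted by Morton code before being split into files.
z_files = split_into_files(sorted(records, key=lambda r: interleave_bits(r[0], r[1])))

plain_hits = files_matching(plain_files, column=1, value=3)
z_hits = files_matching(z_files, column=1, value=3)
```

In this toy layout a filter on the second column has to consider every plain file but only half of the Z-ordered ones. A filter on the first column would fare slightly worse than in the x-major layout, which is the usual trade-off of clustering on several columns at once.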

NEW QUESTION 50
Which of the following describes a scenario in which a data engineer will want to use a Job cluster instead of
an all-purpose cluster?

An ad-hoc analytics report needs to be developed while minimizing compute costs

A data engineer needs to manually investigate a production error

An automated workflow needs to be run every 30 minutes

A data team needs to collaborate on the development of a machine learning model

A Databricks SQL query needs to be scheduled for upward reporting

NEW QUESTION 51
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then
perform a streaming write into a new table. The code block used by the data engineer is below:
    (spark.table("sales")
        .withColumn("avg_price", col("sales") / col("units"))
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("complete")
        ._____
        .table("new_sales")
    )
If the data engineer only wants the query to execute a single micro-batch to process all of the available data,
which of the following lines of code should the data engineer use to fill in the blank?

.processingTime(1)

.processingTime("once")

.trigger(processingTime="once")

.trigger(once=True)

.trigger(continuous="once")

NEW QUESTION 52
Let A denote the event ‘student is female’ and let B denote the event ‘student is French’. In a class of 100 students,
suppose 60 are French, and suppose that 10 of the French students are female. Find the probability that if I
pick a French student, that student will be female; that is, find P(A|B).

1/3

2/3

1/6

2/6
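
The calculation follows from the definition P(A|B) = P(A and B) / P(B). A short check with exact fractions, using the counts given in the question (the variable names are only illustrative):

```python
from fractions import Fraction

total_students = 100
french = 60              # students for whom event B holds
french_and_female = 10   # students for whom both A and B hold

p_b = Fraction(french, total_students)
p_a_and_b = Fraction(french_and_female, total_students)

# Conditional probability: restrict attention to French students only.
p_a_given_b = p_a_and_b / p_b
```

The class size cancels out, so the result is simply 10 out of 60.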
