Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question Answer

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

Use a Pandas UDF:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()
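The applyInPandas and mapInPandas options above call a mean_func that the question never defines. As a rough sketch only, here is one shape such a function could take for the applyInPandas option, where Spark passes each group to the function as a pandas DataFrame and expects a pandas DataFrame matching the declared schema in return. The sample data and the helper name grouped_mean_on_spark are hypothetical and not part of the original question; the column names are taken from the schema string in the options.

```python
import pandas as pd


def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receive all rows of one user_id group and return one summary row."""
    return pd.DataFrame(
        {
            "user_id": [int(pdf["user_id"].iloc[0])],
            "value": [float(pdf["value"].mean())],
        }
    )


def grouped_mean_on_spark(pandas_df: pd.DataFrame):
    """Hypothetical helper: convert pandas data to Spark, then average per user."""
    # Imported here so mean_func can be tried without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(pandas_df)
    # Each user_id group is handed to mean_func as a pandas DataFrame.
    return df.groupBy("user_id").applyInPandas(
        mean_func, schema="user_id long, value double"
    )


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({"user_id": [1, 1, 2], "value": [1.0, 3.0, 10.0]})

# Local check of the per-group function on a single group, no Spark needed.
local_result = mean_func(sample[sample["user_id"] == 1])
```

With a Spark session available, grouped_mean_on_spark(sample).show() would run the same function once per user_id group. Checking mean_func on a plain pandas slice first is a cheap way to confirm that its output columns line up with the schema string.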
