

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows that are duplicated across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate the transactions using PySpark?

A.

df = df.dropDuplicates()

B.

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

C.

df = df.filter(F.col("transaction_id").isNotNull())

D.

df = df.dropDuplicates(["transaction_amount"])
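The difference between deduplicating on every column (as in option A) and on a single column (as in option D) can be illustrated with a small plain-Python model. This is a sketch only, not Spark code: the sample rows and the helper drop_duplicates are invented for illustration, and the helper keeps the first row it sees for each key, whereas Spark does not promise which of several matching rows survives.

```python
# Plain-Python model of row deduplication; no Spark session is needed.
# The sample rows below are invented for illustration.
COLUMNS = ("transaction_id", "account_number", "transaction_amount", "timestamp")

rows = [
    ("t1", "acc-1", 100.0, "2024-01-01 09:00:00"),
    ("t1", "acc-1", 100.0, "2024-01-01 09:00:00"),  # exact duplicate of the row above
    ("t2", "acc-2", 100.0, "2024-01-01 09:05:00"),  # different transaction, same amount
    ("t3", "acc-1", 250.0, "2024-01-02 14:30:00"),
]


def drop_duplicates(rows, subset=None):
    """Hypothetical helper: keep the first row seen for each distinct key.

    With subset=None the key is the whole row (all fields must match).
    Otherwise the key is built only from the named columns.
    """
    if subset is None:
        indices = list(range(len(COLUMNS)))
    else:
        indices = [COLUMNS.index(name) for name in subset]

    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in indices)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# All-column key: only the exact repeat of t1 is removed.
full_row = drop_duplicates(rows)

# Single-column key: t2 is also removed because it shares an amount with t1.
by_amount = drop_duplicates(rows, subset=["transaction_amount"])

print([r[0] for r in full_row])   # ['t1', 't2', 't3']
print([r[0] for r in by_amount])  # ['t1', 't3']
```

In this model, keying on every field removes only the accidental repeat, while keying on transaction_amount alone also discards a legitimate transaction that happens to share an amount with another one.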
