Streaming with Spark as a model deployment strategy means applying a machine learning model to data streams that are processed incrementally and continuously by Spark Structured Streaming. Spark Structured Streaming is a scalable and fault-tolerant stream processing engine that enables complex analytics on live data streams using the Dataset/DataFrame API [1]. It supports a variety of sources and sinks for streaming data, such as Kafka, Kinesis, TCP sockets, and Delta tables [2], as well as operations on streaming data such as aggregations, windowing, joins, and stateful transformations [3]. To deploy a machine learning model on streaming data, you can use the MLflow Model Registry to manage the model lifecycle and versioning [4]. You can also use MLflow Model Serving to expose the model as a REST API endpoint that can be invoked from Spark Structured Streaming [5]. Alternatively, you can load the model as a UDF (user-defined function) and apply it to the streaming data directly within Spark Structured Streaming [6], as in the sketch below.
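For illustration, here is a minimal sketch of the UDF approach, assuming a model registered in the MLflow Model Registry under the hypothetical name churn_model, a Delta table events as the streaming source, and hypothetical feature column names:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()

# Load a registered model version from the MLflow Model Registry as a
# Spark UDF. The model name, version, and feature columns are hypothetical.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/1")

# Read a live stream; here a Delta table serves as the streaming source.
stream_df = spark.readStream.format("delta").table("events")

# Apply the model to each record as it is processed incrementally.
scored_df = stream_df.withColumn(
    "prediction", predict_udf(struct("feature_a", "feature_b"))
)
```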
The inference of incrementally processed records as soon as a trigger is hit describes streaming with Spark as a model deployment strategy. A trigger defines when the results of a streaming query should be written to the output sink. A trigger can be based on a fixed processing-time interval, a one-time run over all currently available data, or an experimental continuous mode that writes results with very low latency. The trigger ensures that the streaming query executes incrementally and continuously, so model inference is applied to the latest available data (see the sketch below). The other options are incorrect because:
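A hedged sketch of setting the trigger when writing out the scored stream, continuing the scored_df example above; the checkpoint path and output table name are hypothetical:

```python
# Write the scored stream, firing a micro-batch every minute.
# Alternative trigger types are shown as comments.
query = (
    scored_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/scoring")
    .trigger(processingTime="1 minute")  # fixed-interval micro-batches
    # .trigger(availableNow=True)        # one-time run over available data
    # .trigger(continuous="1 second")    # experimental; limited ops/sinks
    .toTable("predictions")
)
query.awaitTermination()
```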
Option A: The inference of batch-processed records as soon as a trigger is hit does not describe streaming with Spark, but rather batch processing with Spark. Batch processing means applying a machine learning model to a finite set of data that is processed as a single job. Batch processing does not require a trigger, as the results are written to the output sink when the job completes (see the contrast sketch below).
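For contrast, a sketch of batch inference under the same assumptions: the same hypothetical UDF applied once to a finite table in a single job, with no trigger involved:

```python
# Batch scoring: read a finite snapshot, score it, and write the result
# when the single job completes. Table names are hypothetical.
batch_df = spark.read.table("events_snapshot")
batch_scored = batch_df.withColumn(
    "prediction", predict_udf(struct("feature_a", "feature_b"))
)
batch_scored.write.mode("overwrite").saveAsTable("predictions_batch")
```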
Option B: The inference of all types of records in real time does not describe streaming with Spark, but rather a generic definition of real-time processing. Real-time processing means applying a machine learning model to data streams as soon as the data arrives, with minimal latency. Real-time processing does not necessarily involve Spark Structured Streaming, as other frameworks and tools, such as Apache Flink and Apache Storm, can support it as well.
Option C: The inference of batch-processed records as soon as a Spark job is run does not describe streaming with Spark, but rather batch processing with Spark. Batch processing means applying a machine learning model to a finite set of data that is processed as a single job. Batch inference also does not require a Spark job at all, as the model can be invoked outside of Spark, for example through a REST API endpoint or a command-line tool, as in the sketch below.
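As an illustration of scoring outside of Spark, here is a hedged sketch of calling an MLflow model serving REST endpoint using the MLflow 2.x dataframe_split input format; the host, port, and feature payload are hypothetical:

```python
import requests

# Score a single record against a locally served MLflow model
# (e.g., started with `mlflow models serve`). Fields are hypothetical.
payload = {
    "dataframe_split": {
        "columns": ["feature_a", "feature_b"],
        "data": [[1.0, 2.0]],
    }
}
response = requests.post(
    "http://localhost:5000/invocations",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())
```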
Option E: The inference of incrementally processed records as soon as a Spark job is run does not describe streaming with Spark, but rather a contradiction: incrementally processed records imply stream processing, while a manually run Spark job implies batch processing. Stream processing and batch processing are different data processing paradigms and cannot be mixed in this way.
References: [1] Structured Streaming Programming Guide; [2] Input Sources and Output Sinks; [3] Operations on streaming DataFrames/Datasets; [4] MLflow Model Registry; [5] MLflow Model Serving; [6] Apply machine learning models; Triggers; Trigger Types; Batch Processing; Real-time Processing; Real-time Data Processing Frameworks; Deploy machine learning models; Batch vs Streaming Processing.