A publishing company is developing a chat assistant that uses a containerized large language model (LLM) that runs on Amazon SageMaker AI. The architecture consists of an Amazon API Gateway REST API that routes user requests to an AWS Lambda function. The Lambda function invokes a SageMaker AI real-time endpoint that hosts the LLM.
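For context, the request path described above might be implemented with a Lambda handler along the lines of the following minimal sketch. The endpoint name, environment variable, and payload shape are illustrative assumptions; the real contract depends on the model container.

```python
import json
import os

import boto3

# Assumed configuration: the endpoint name would normally be supplied
# through the Lambda function's environment variables.
ENDPOINT_NAME = os.environ.get("SAGEMAKER_ENDPOINT_NAME", "chat-llm-endpoint")

sagemaker_runtime = boto3.client("sagemaker-runtime")


def lambda_handler(event, context):
    """Forward a prompt from the API Gateway REST API to the SageMaker AI endpoint."""
    body = json.loads(event["body"])

    # Synchronous, per-request invocation of the real-time endpoint.
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": body["prompt"]}),
    )

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": response["Body"].read().decode("utf-8"),
    }
```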

Users report uneven response times, and analytics show that many chats are abandoned after 2 seconds of waiting for the first token. The company wants to ensure that p95 latency for interactive requests to the chat assistant stays under 800 ms.

Which combination of solutions will meet this requirement? (Select TWO.)

A. Enable model preload upon container startup. Implement dynamic batching to process multiple user requests together in a single inference pass.

B. Select a larger GPU instance type for the SageMaker AI endpoint. Set the minimum number of instances to 0. Continue to perform per-request processing. Lazily load model weights on the first request.

C. Switch to a multi-model endpoint. Use lazy loading without request batching.

D. Set the minimum number of instances to greater than 0. Enable response streaming.

E. Switch to Amazon SageMaker Asynchronous Inference for all requests. Store requests in an Amazon S3 bucket. Set the minimum number of instances to 0.
