A publishing company is developing a chat assistant that uses a containerized large language model (LLM) that runs on Amazon SageMaker AI. The architecture consists of an Amazon API Gateway REST API that routes user requests to an AWS Lambda function. The Lambda function invokes a SageMaker AI real-time endpoint that hosts the LLM.
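For context, the request path described above might be implemented with a Lambda handler along the lines of the following minimal sketch. The endpoint name, environment variable, and payload shape are illustrative assumptions; the real contract depends on the model container.

```python
import json
import os

import boto3

# Assumed configuration: the endpoint name would normally be supplied
# through the Lambda function's environment variables.
ENDPOINT_NAME = os.environ.get("SAGEMAKER_ENDPOINT_NAME", "chat-llm-endpoint")

sagemaker_runtime = boto3.client("sagemaker-runtime")


def lambda_handler(event, context):
    """Forward a prompt from the API Gateway REST API to the SageMaker AI endpoint."""
    body = json.loads(event["body"])

    # Synchronous, per-request invocation of the real-time endpoint.
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": body["prompt"]}),
    )

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": response["Body"].read().decode("utf-8"),
    }
```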

Users report uneven response times, and analytics show that many chats are abandoned after 2 seconds of waiting for the first token. The company wants to ensure that p95 latency for interactive requests to the chat assistant stays under 800 ms.

Which combination of solutions will meet this requirement? (Select TWO.)

A. Enable model preload upon container startup. Implement dynamic batching to process multiple user requests together in a single inference pass.

B. Select a larger GPU instance type for the SageMaker AI endpoint. Set the minimum number of instances to 0. Continue to perform per-request processing. Lazily load model weights on the first request.

C. Switch to a multi-model endpoint. Use lazy loading without request batching.

D. Set the minimum number of instances to greater than 0. Enable response streaming.

E. Switch to Amazon SageMaker Asynchronous Inference for all requests. Store requests in an Amazon S3 bucket. Set the minimum number of instances to 0.
