The correct answer is C because tokenization is the NLP process of breaking text down into smaller units, such as words, subwords, or characters, which machine learning (ML) models can then process.
From AWS documentation:
"Tokenization is the process of splitting text into meaningful units, such as words or subwords, that are used as input tokens in NLP tasks. Tokenization is an essential step in preparing text data for models."
This is a foundational concept in language models, including those available on Amazon Bedrock and Amazon SageMaker.
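To make the idea concrete, here is a minimal Python sketch of word-level, character-level, and naive subword tokenization. It is purely illustrative and does not reflect the specific tokenizer used by any Bedrock or SageMaker model; production foundation models typically rely on learned subword schemes such as BPE or WordPiece.

```python
# Minimal tokenization sketch (illustrative only; not a real model tokenizer).
import re

text = "Tokenization splits text into smaller units."

# Word-level tokenization: keep words and punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)
# ['Tokenization', 'splits', 'text', 'into', 'smaller', 'units', '.']

# Character-level tokenization: every character becomes a token.
char_tokens = list(text)

# Naive "subword" illustration: split long words into fixed-size chunks.
# (Hypothetical chunking; real subword tokenizers learn their vocabulary.)
def naive_subwords(token, size=4):
    return [token[i:i + size] for i in range(0, len(token), size)]

print(naive_subwords("Tokenization"))
# ['Toke', 'niza', 'tion']
```

Each token is then mapped to an ID in the model's vocabulary before being fed to the model, which is why tokenization is the first preprocessing step in most NLP pipelines.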
Explanation of other options:
A. Encryption is a data security technique and is not an NLP process.
B. Compression reduces file size but is unrelated to language processing.
D. Translation is a separate NLP task that uses tokenization as a preprocessing step; it is not the definition of tokenization itself.
Referenced AWS AI/ML Documents and Study Guides:
AWS NLP on Amazon SageMaker Documentation
AWS Machine Learning Specialty Guide – NLP Fundamentals
Amazon Bedrock Foundation Model Documentation – Tokenization and Input Formatting