Option B is the most appropriate solution because it uses AWS-native, purpose-built data engineering and governance services to address data quality validation, metadata creation, monitoring, and transformation with minimal custom development. AWS Glue is designed specifically for large-scale data preparation and integrates seamlessly with Amazon S3, making it ideal for preprocessing unstructured datasets for downstream GenAI applications.
AWS Glue crawlers automatically infer schemas and populate the AWS Glue Data Catalog, creating auditable, queryable metadata for all datasets. This satisfies the requirement for traceability and governance, which is especially critical in financial services environments. Glue ETL jobs allow teams to implement customizable transformation logic, including text normalization and chunking strategies optimized for foundation model context windows.
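The transformation logic described above can be sketched in plain Python of the kind a Glue ETL job (which runs Python or PySpark) might apply. This is a minimal illustration, not Glue-specific API usage; the function names, chunk sizes, and overlap value are hypothetical choices, and real jobs would tune them to the target foundation model's context window:

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Overlap preserves context across chunk boundaries, which helps
    retrieval quality when chunks are embedded independently.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]
```

In a real Glue job this logic would typically be applied per record via a Spark UDF or a DynamicFrame map, with the resulting chunks written back to Amazon S3 for ingestion.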
AWS Glue Data Quality provides built-in rulesets for validating completeness, accuracy, and consistency. It also publishes quality metrics that can be monitored over time, meeting the requirement for ongoing data quality monitoring without building custom validation frameworks.
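Glue Data Quality rules are expressed in the Data Quality Definition Language (DQDL). A hedged sketch of what a ruleset for an ingested document table might look like follows; the column names (`document_id`, `source_uri`) and thresholds are hypothetical and would be tailored to the actual dataset:

```
Rules = [
    IsComplete "source_uri",
    Completeness "document_id" > 0.95,
    Uniqueness "document_id" > 0.99,
    RowCount > 0
]
```

Each evaluation produces pass/fail results and metrics that Glue publishes, which is what enables the ongoing monitoring described above without custom tooling.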
Because AWS Glue is fully managed, it eliminates the need to manage infrastructure, scaling, or orchestration. This significantly reduces development and operational effort compared to custom Lambda pipelines or EC2-based processing. The processed and validated data can then be safely ingested into Amazon Bedrock workflows or knowledge bases.
Options A and C require custom logic for validation, monitoring, and chunking, increasing development complexity. Option D introduces unnecessary infrastructure management and relies on services that are not optimized for data preprocessing.
Therefore, Option B best meets the requirements while minimizing development effort and aligning with AWS Generative AI data preparation best practices.