In PMI-CPMAI’s treatment of data for AI, especially in sensitive domains like healthcare, the first responsibility of the project and data science teams is to understand and assess data quality and suitability before model development. The guidance states that AI teams should “systematically profile candidate data sources to evaluate completeness, consistency, validity, and coverage of key populations and variables relevant to the use case.” Data profiling tools are highlighted as a practical means to inspect distributions, missing values, outliers, and anomalies across structured clinical, administrative, and claims data.
For a patient readmission prediction use case, PMI-CPMAI stresses that teams must identify which sources (EHR, discharge summaries, lab results, prior admissions, demographics, social determinants, etc.) are available and then “quantify data quality metrics such as completeness and timeliness to determine whether the dataset is fit for training and deployment.” While techniques such as augmentation or real-time validation might be valuable later, they build upon an initial understanding obtained via profiling. Operationalizing a catalog supports governance and discovery but does not directly satisfy the immediate need to measure data quality.
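As a rough illustration of what quantifying completeness and timeliness can look like, the sketch below profiles a handful of made-up discharge records using only the Python standard library. The field names, the sample values, the 90-day freshness window, and the helper functions `completeness` and `timeliness` are all assumptions chosen for the example; they are not prescribed by PMI-CPMAI, and a real project would more likely rely on a dedicated profiling tool.

```python
from datetime import date

# Hypothetical discharge records; None marks a missing value.
records = [
    {"patient_id": "p1", "age": 71, "hba1c": 7.2, "last_updated": date(2024, 5, 30)},
    {"patient_id": "p2", "age": None, "hba1c": 6.1, "last_updated": date(2024, 5, 28)},
    {"patient_id": "p3", "age": 64, "hba1c": None, "last_updated": date(2023, 11, 2)},
    {"patient_id": "p4", "age": 58, "hba1c": None, "last_updated": date(2024, 6, 1)},
]


def completeness(rows, fields):
    """Return the fraction of non-missing values for each field."""
    if not rows:
        raise ValueError("cannot profile an empty dataset")
    return {f: sum(r.get(f) is not None for r in rows) / len(rows) for f in fields}


def timeliness(rows, date_field, as_of, max_age_days):
    """Return the fraction of rows updated within max_age_days of as_of."""
    if not rows:
        raise ValueError("cannot profile an empty dataset")
    fresh = sum(
        r.get(date_field) is not None
        and (as_of - r[date_field]).days <= max_age_days
        for r in rows
    )
    return fresh / len(rows)


# Assumed reference date and freshness window, for illustration only.
completeness_by_field = completeness(records, ["patient_id", "age", "hba1c"])
fresh_share = timeliness(
    records, "last_updated", as_of=date(2024, 6, 1), max_age_days=90
)

print(completeness_by_field)
print(fresh_share)
```

Per-field ratios like these give the team a concrete basis for judging whether a source is fit for training, for example by flagging a lab value that is missing in half of the records before any imputation or feature engineering is attempted.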
Therefore, the method that best meets the objective of identifying data sources and ensuring data quality is to use data profiling tools to assess data completeness and other quality dimensions, providing an evidence-based foundation for subsequent preprocessing, feature engineering, and model training.