Increasing Efficiency of SVMp+ for Handling Missing Values in Healthcare Prediction

Yufeng Zhang, Zijun Gao, Emily Wittrup, Jonathan Gryak, Kayvan Najarian

Abstract

Missing data presents a challenge for machine learning applications, particularly when electronic health records are used to develop clinical decision support systems. Values are missing in part because clinical data are complex and their content is personalized to each patient. Several methods have been developed to handle this issue, such as imputation and complete case analysis, but their limitations restrict the robustness of findings. Recent studies, however, have explored how treating some features as fully available privileged information can increase model performance, including in SVM. Building on this insight, we propose a computationally efficient kernel SVM-based framework (l2-SVMp+) that leverages partially available privileged information to guide model construction. Our experiments validated the superiority of l2-SVMp+ over common approaches for handling missingness and over previous implementations of SVMp+ in digit recognition, disease classification, and patient readmission prediction tasks.

Introduction

Clinical Decision Support Systems (CDSS) rely heavily on machine learning algorithms to provide accurate predictions, but these algorithms are often challenged by missing values in Electronic Health Records (EHR) [1–3]. Given the complex nature of medical data, this is a common hurdle to algorithmic development. For example, a patient with one condition might undergo multiple radiology scans while another patient receives specific lab tests, leading to missingness in both data modalities. In other cases, information crucial for predicting patient outcomes is missing for reasons such as changes in protocol or limited data access between institutions. Traditionally, researchers have resorted to methods like complete case analysis, dropping features, and imputation to handle missing data. Complete case analysis refers to discarding samples that have any missing values and restricting the study cohort to those with complete data [4]. Alternatively, all the features with missing values can be dropped from the analysis entirely.
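The three traditional strategies named above can be illustrated on a toy table. The following is a minimal sketch, not the procedure used in this study: the column names and values are hypothetical, and mean imputation is assumed here as just one simple form of imputation.

```python
from statistics import mean

# Toy records; None marks a missing value. Column names and values are hypothetical.
columns = ["age", "lactate", "spo2"]
rows = [
    {"age": 61, "lactate": 2.1, "spo2": 94},
    {"age": 47, "lactate": None, "spo2": 97},
    {"age": 70, "lactate": 3.4, "spo2": None},
    {"age": 55, "lactate": 1.8, "spo2": 96},
]

def complete_cases(rows):
    """Complete case analysis: keep only samples with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_features(rows, columns):
    """Feature dropping: keep only features observed in every sample."""
    kept = [c for c in columns if all(r[c] is not None for r in rows)]
    return [{c: r[c] for c in kept} for r in rows]

def mean_impute(rows, columns):
    """Mean imputation (assumed variant): fill gaps with the observed mean of each feature."""
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]

cc = complete_cases(rows)                     # shrinks the cohort
df = drop_incomplete_features(rows, columns)  # shrinks the feature set
mi = mean_impute(rows, columns)               # keeps both, but invents values
```

Each strategy gives something up: the first discards samples, the second discards features, and the third keeps the table's shape at the cost of filling in values that were never measured.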

Materials and method

In the LUPAPI paradigm, the additional privileged information is contained only in a subset of the training samples and is not available in the testing stage. We represent the training data as triplets {(xi, xi*, yi)} for i = 1, …, m and pairs {(xi, yi)} for i = m + 1, …, n, where n and m denote the sizes of the training dataset and the subset with privileged information respectively. For the ith training sample, xi represents the main features, xi* represents the privileged features when they are available, and yi = ±1 is the sample’s label.

Results

The original MNIST+ dataset was split into training, validation, and testing sets as shown in Fig 1A, and this same split scheme was used for all experiments. The best hyperparameter combination for each experiment was selected based on performance on the validation dataset, and model performance was evaluated on the testing dataset. In the standard SVM models, only the main features were used for training, while both the main and privileged features were used in the SVMp+ and l2-SVMp+ models. To compare the performance of SVMp+ and l2-SVMp+ when privileged information (PI) is only partially available in training, PI was randomly sampled under a specific seed to provide availability ranging from 50% to 90%. This sampling was repeated independently five times, yielding five sets of PI at each availability level; the repetition ensures robustness and provides statistical measures for the reported results.
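The seeded sampling of PI availability can be sketched as follows. This is a minimal illustration under stated assumptions: the 10% step between availability levels, the seed values, the training-set size, and the function name sample_pi_availability are all hypothetical, since the text specifies only the 50% to 90% range and the five independent repetitions.

```python
import random

def sample_pi_availability(n_train, fraction, seed):
    """Pick which training samples keep their privileged information.

    Returns the sorted indices of the samples whose PI stays available;
    all other training samples are treated as having no PI.
    """
    rng = random.Random(seed)        # private generator, so each seed is reproducible
    k = round(n_train * fraction)    # number of samples that retain PI
    return sorted(rng.sample(range(n_train), k))

n_train = 100                                  # hypothetical training-set size
availability_levels = [0.5, 0.6, 0.7, 0.8, 0.9]  # assumed 10% steps
seeds = [0, 1, 2, 3, 4]                        # five independent repetitions

pi_index_sets = {
    (level, seed): sample_pi_availability(n_train, level, seed)
    for level in availability_levels
    for seed in seeds
}
```

Training one model per (level, seed) pair then gives five scores at each availability level, from which a mean and a spread can be reported.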

Discussion

There are several strategies to address missingness in EHR data analysis, including discarding patients with missing values, dropping features, and imputing data; however, each can compromise the reliability of a study when the missingness mechanism is ignored. SVMp+, which was originally designed to utilize partially accessible privileged information, achieved success on the ARDS detection task. Nevertheless, its computational inefficiency poses a challenge to its application on larger datasets.

Conclusion

In this study, we introduced a highly efficient algorithm for solving the kernel SVMp+ problem. Our approach adds an l2-regularizer to the original formulation, thereby converting the problem into a one-class SVM. This enables efficient and accurate optimization using an embedded SMO solver. We conducted extensive experiments on three different tasks to evaluate the performance of our approach. Our results demonstrated that our method outperforms other common approaches for handling missing values and achieves superior efficiency and accuracy. In summary, the proposed algorithm presents a novel and highly effective solution for kernel SVMp+ in the context of missing values.

Citation: Zhang Y, Gao Z, Wittrup E, Gryak J, Najarian K (2023) Increasing efficiency of SVMp+ for handling missing values in healthcare prediction. PLOS Digit Health 2(6): e0000281. https://doi.org/10.1371/journal.pdig.0000281

Editor: Mecit Can Emre Simsekler, Khalifa University of Science and Technology, UNITED ARAB EMIRATES

Received: February 20, 2023; Accepted: May 29, 2023; Published: June 29, 2023

Copyright: © 2023 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The MNIST+ data were collected from https://github.com/wenli-vision/svmplus_matlab, which is public and well preprocessed. The heart failure dataset is part of the PhysioNet Restricted Health Data, a freely available medical research data platform. The dataset is available to qualified investigators who have been formally approved and have agreed to the terms of a data use agreement; see https://physionet.org/content/heart-failure-zigong/1.3/ or contact contact@physionet.org for more information. The UCI heart disease dataset is publicly accessible at https://archive.ics.uci.edu/ml/datasets/heart+disease.

Funding: This material is based upon work supported by the National Science Foundation under Grant No. 1722801 and Grant No. 2014003. YZ received funding from National Science Foundation (NSF) Grant No. 2014003, ZG received funding from NSF Grant No. 1722801. EW and KN both received funding from NSF Grant No. 1722801 and Grant No. 2014003. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Source: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000281#sec019