Analysis of Lung Cancer Risk Factors From Medical Records in Ethiopia Using Machine Learning

Demeke Endalie, Wondmagegn Taye Abebe


Cancer is a broad term for a wide range of diseases that can affect any part of the human body. Scientifically supported knowledge of cancer causes is critical for reducing cancer deaths and for preparing appropriate health policy on cancer mitigation. In this study, we analyzed the lung cancer risk factors that lead to highly severe cancer cases using a decision tree-based ranking algorithm. This feature relevance ranking algorithm computes the weight of each feature of the dataset by using split points to improve detection accuracy, and each risk factor is weighted based on the number of observations that occur for it on the decision tree. Of the nine risk factors considered, coughing of blood, air pollution, and obesity are the most severe, with weights of 39%, 21%, and 14%, respectively. We also propose a machine learning model that uses Extreme Gradient Boosting (XGBoost) to detect the severity level of lung cancer in patients. To assess the performance of the proposed model, we used a dataset of 1000 lung cancer patients and 465 individuals free from lung cancer from Tikur Ambesa (Black Lion) Hospital in Addis Ababa, Ethiopia. On the testing dataset, the proposed model achieved 98.9% accuracy, 99% precision, and 98.9% recall. The findings can assist governments and non-governmental organizations in making lung cancer-related policy decisions.


Cancer is a complex and diverse disease; its occurrence patterns vary with underlying risk factors, such as environmental and lifestyle factors [1]. According to studies, cancer is on the rise in economically transitional countries due to rapid population growth, higher life expectancy, the adoption of unhealthy lifestyles, and changes in reproductive patterns [2]. The prevalence of cancer in Ethiopia is rapidly increasing, with an estimated 77,352 new cancer cases in 2022 [3]. The cancer burden was estimated using the Addis Ababa population-based cancer registry. According to this registry, breast cancer (31.5%) and cervical cancer (14.1%) are the two most prevalent cancers among females, whereas colorectal cancers (10.6%) and non-Hodgkin lymphomas (10.2%) are the most common malignancies among males [4].

Materials and method

The process of lung cancer risk factor analysis and cancer severity detection includes data collection, model evaluation, and model validation using various evaluation metrics. Fig 1 shows a high-level description of the proposed lung cancer severity detection model. The architecture comprises a dataset of cancer patients’ demographics, medical history, and habits; preprocessing components for missing value filling and for feature relevance calculation and selection; a model training component; and evaluation components.
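To make the feature relevance calculation concrete, the following is a minimal sketch of decision tree-based feature weighting, not the authors' implementation. It assumes a small CART-style tree with Gini impurity, where each split credits its feature with the impurity decrease multiplied by the number of observations reaching that node. The toy rows, the column names, and the helper names (`gini`, `best_split`, `accumulate`, `feature_weights`) are hypothetical and do not come from the hospital dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best binary split, or None."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate split points are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best


def accumulate(rows, labels, importance, depth, max_depth):
    """Grow the tree recursively, crediting the feature used at each split."""
    if depth >= max_depth or len(set(labels)) < 2:
        return
    split = best_split(rows, labels)
    if split is None:
        return
    gain, f, t = split
    # Weight the impurity decrease by the number of observations at this node.
    importance[f] += len(rows) * gain
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    for part in (left, right):
        accumulate([r for r, _ in part], [y for _, y in part],
                   importance, depth + 1, max_depth)


def feature_weights(rows, labels, names, max_depth=4):
    """Normalized feature weights that sum to 1 when any split was made."""
    importance = [0.0] * len(names)
    accumulate(rows, labels, importance, 0, max_depth)
    total = sum(importance) or 1.0
    return {name: w / total for name, w in zip(names, importance)}


# Hypothetical toy data: three risk-factor scores per patient and a severity label.
names = ["coughing_of_blood", "air_pollution", "obesity"]
rows = [
    [1, 2, 5], [2, 1, 1], [1, 3, 4],   # low
    [2, 7, 2], [1, 8, 5], [2, 6, 1],   # medium
    [8, 2, 3], [9, 7, 4], [8, 5, 2],   # high
]
labels = ["low"] * 3 + ["medium"] * 3 + ["high"] * 3

weights = feature_weights(rows, labels, names)
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {w:.2f}")
```

In this toy data the first column separates the high-severity rows and the second separates low from medium, so those two features share the credit while the third, which carries no signal, receives none. Features with near-zero weight would be candidates for removal in the selection step.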

Results and discussion

The main goal of the experiment was to detect the severity level of cancer from the risk factors that cause it. All experiments in this study were carried out on a computer with 16 GB of RAM, a Core i5 processor, and the Windows 10 operating system. The source code for reading files, modeling, and presenting results was written in Python, and the hyperparameters of the machine learning algorithms were tuned using the grid search strategy.
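As an illustration of grid search tuning, the sketch below exhaustively scores every combination in a hyperparameter grid and keeps the best one. It is a simplified stand-in, not the study's code: the grid values are hypothetical (the text does not list the values searched), the parameter names merely mirror common XGBoost settings, and `toy_score` is a placeholder for what would in practice be a cross-validated accuracy of the trained model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values searched in the study are not given in the text.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200],
}


def toy_score(params):
    """Placeholder for validation accuracy; highest at depth 5, rate 0.1, 200 trees."""
    return (
        1.0
        - 0.01 * abs(params["max_depth"] - 5)
        - 0.5 * abs(params["learning_rate"] - 0.1)
        - 0.0001 * abs(params["n_estimators"] - 200)
    )


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, round(best_score, 3))
```

The cost of this strategy grows with the product of the grid sizes (here 3 × 3 × 2 = 18 evaluations), which is why grids are usually kept small when each evaluation requires training a model.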


In this paper, we analyzed lung cancer risk factors and proposed a new predictive model for lung cancer severity level. The data for this study came from Tikur Ambesa Hospital’s medical records repository, which included lung cancer patients and 465 healthy people who were tested for lung cancer as a control. We used a decision tree-based feature weighting strategy to determine which risk factors are dominant in the study area, and the XGBoost machine learning algorithm to build a model that detects the severity level of lung cancer patients at the hospital. The results of the experiments suggest that dust allergies, obesity, fatigue, alcohol use, and passive smoking are the most prevalent risk factors in the study area. In addition, the proposed cancer severity level detection model achieves high detection accuracy. The findings of this study could therefore be used in applications that rely on the severity level of cancer or in making cancer-related health policies. In the future, the study will be expanded with more data, and the model will become one component of a system that reports the severity level of lung cancer based on risk factors.


The authors would like to sincerely thank Dr. Dagmawi Solomon for his assistance in obtaining the dataset used in this study. In addition, we thank Jimma University for its support with various resources.

Citation: Endalie D, Abebe WT (2023) Analysis of lung cancer risk factors from medical records in Ethiopia using machine learning. PLOS Digit Health 2(7): e0000308.

Editor: Henry Horng-Shing Lu, National Yang Ming Chiao Tung University, TAIWAN

Received: February 28, 2023; Accepted: June 23, 2023; Published: July 19, 2023

Copyright: © 2023 Endalie, Abebe. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: This research work’s data set and source code are publicly available online on GitHub (

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.
