Optimizing Lung Cancer Risk Prediction with Advanced Machine Learning Algorithms and Techniques
Abstract
Lung cancer is among the leading causes of cancer death in the United States and worldwide, causing more deaths than breast, prostate, and colorectal cancers combined. It presents a significant global health burden, with an estimated 2.2 million new cases and 1.8 million deaths annually. Given the complex etiology of lung cancer, there is an urgent need for more accurate and reliable prediction models that can integrate diverse risk factors. Although current screening and imaging modalities are effective, they are often costly and invasive. The main objective of this study was to develop and evaluate machine learning models that use integrated demographic, environmental, and lifestyle variables to predict lung cancer risk. The dataset was compiled from multiple sources, in particular Cleveland hospital records and public health databases in the U.S., together with large-scale epidemiological studies such as the National Lung Screening Trial (NLST) or the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. These sources supplied the demographic data, past medical history, lifestyle habits, and clinical symptoms on which the machine learning models were developed. Three machine learning algorithms were evaluated: Logistic Regression, XGBoost, and Random Forest. Accuracy, precision, recall, and F1 score were used as performance metrics. Overall, the Logistic Regression model outperformed the Random Forest and XGBoost models, achieving the highest scores on all four metrics. This indicates that Logistic Regression was slightly better at balancing true positives against false positives and false negatives.
The Random Forest model showed intermediate performance, ranking second to Logistic Regression. A substantial body of empirical work has established that machine learning techniques such as Logistic Regression and Random Forest considerably improve the detection of lung cancer. Although Logistic Regression remains very useful because of its simplicity and interpretability, Random Forest and XGBoost are better able to model complex nonlinear interactions in high-dimensional data. Advanced models such as these can provide more accurate, personalized risk estimates and have the potential to contribute substantially to early detection and better clinical decision-making for lung cancer.
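For readers less familiar with the four performance metrics named in the abstract, the following minimal Python sketch shows how accuracy, precision, recall, and F1 score are derived from the counts of true positives, false positives, false negatives, and true negatives. The class name `ConfusionCounts`, the function name `classification_metrics`, and the example counts are hypothetical and chosen purely for illustration; they are not results or code from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container for illustration)."""
    tp: int  # true positives: cases correctly flagged as high risk
    fp: int  # false positives: non-cases incorrectly flagged as high risk
    fn: int  # false negatives: cases the model missed
    tn: int  # true negatives: non-cases correctly left unflagged


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, precision, recall, and F1; undefined ratios fall back to 0.0."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only; these are NOT figures reported in the article.
example = ConfusionCounts(tp=40, fp=10, fn=20, tn=30)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision penalizes false positives and recall penalizes false negatives, so a model with a higher F1 score is balancing the two error types more evenly, which is the sense in which the abstract compares the three models.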
Article information
Journal
Journal of Medical and Health Studies
Volume (Issue)
5 (4)
Pages
35-48
Published
Copyright
Open access
This work is licensed under a Creative Commons Attribution 4.0 International License.