Research Article

Comparative analysis of explainable machine learning models for cancer classification using cytological features

Authors

  • Md Ismail Hossain Siddiqui, Master of Science in Engineering/Industrial Management, Westcliff University, Irvine, California, USA
  • Md. Soebur Rahman, Master of Business Administration in Management Information Systems, International American University, Los Angeles, California, USA
  • Abdul Aziz Kabir, Master of Business Administration in Data Analytics, Westcliff University, Irvine, California, USA
  • Farhad Uddin Mahmud, Master of Business Administration in Management Information Systems, International American University, Los Angeles, California, USA
  • Saeed Ur Rashid, Master of Business Administration in Data Analytics, Westcliff University, Irvine, California, USA
  • Ramisa Samin Shammah, College of Technology and Engineering, Westcliff University, Irvine, USA

Abstract

Breast cancer is among the leading causes of cancer-related deaths globally, with the greatest impact in low-resource, high-volume health care facilities where timely and accurate screening is paramount. This study presents an explainable machine learning framework for breast cancer diagnosis based on quantitative features computed from fine needle aspirate images of breast masses. The dataset contains 569 samples and 30 real-valued predictors describing cell nuclei morphology; it has no missing values, and its class distribution is moderately imbalanced. A structured preprocessing pipeline is applied, comprising a split into training and held-out test sets, feature normalization, and careful handling of class imbalance. Several classification models, including Random Forest and Gaussian Naive Bayes, are compared in terms of predictive accuracy and reliability. Experimental results show that the Random Forest model performs best, with an accuracy of 0.96 on the held-out test set and balanced precision and recall across the benign and malignant classes. Its confusion matrix shows a low misclassification rate, with only three false positives and three false negatives. In contrast, Gaussian Naive Bayes reaches a lower accuracy of 0.93 and is less sensitive to malignant cases, because its feature-independence assumption is not fully met in the dataset, as confirmed by correlation analysis. These results are further supported by receiver operating characteristic analysis, with an area under the curve of 1.00 for Random Forest and 0.99 for Gaussian Naive Bayes. The findings emphasize the role of model selection in clinical decision support systems, especially where false negatives must be minimized. The proposed framework emphasizes interpretability and practical usability, making it suitable for deployment-oriented screening workflows in resource-constrained settings.
This paper shows that explainable machine learning models trained on structured cytological features can provide effective and reliable support for the early detection of breast cancer.
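
As a rough illustration of how the reported error counts relate to accuracy, precision, and recall, the following minimal Python sketch computes these metrics from binary confusion-matrix counts, treating malignant as the positive class. Only the three false positives and three false negatives are taken from the abstract. The true positive and true negative counts, the `ConfusionCounts` class, and the assumed 143-case test set (roughly a 25% hold-out of 569 samples) are hypothetical and are not stated by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, with malignant as the positive class."""

    tp: int  # malignant cases predicted malignant
    fn: int  # malignant cases predicted benign (missed cancers)
    fp: int  # benign cases predicted malignant (false alarms)
    tn: int  # benign cases predicted benign

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    @property
    def accuracy(self) -> float:
        # Fraction of all cases that were classified correctly.
        return (self.tp + self.tn) / self.total

    @property
    def precision(self) -> float:
        # Fraction of predicted-malignant cases that are truly malignant.
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        # Fraction of truly malignant cases that were detected (sensitivity).
        return self.tp / (self.tp + self.fn)


# FP = 3 and FN = 3 come from the abstract. TP = 50 and TN = 87 are
# HYPOTHETICAL values chosen to give an assumed 143-case test set.
rf_counts = ConfusionCounts(tp=50, fn=3, fp=3, tn=87)

print(f"test cases = {rf_counts.total}")
print(f"accuracy   = {rf_counts.accuracy:.3f}")
print(f"precision  = {rf_counts.precision:.3f}")
print(f"recall     = {rf_counts.recall:.3f}")
```

Under these assumed counts, six errors in 143 cases give an accuracy of about 0.958, which rounds to the reported 0.96. Recall is the quantity to watch when false negatives must be minimized, since each missed malignant case lowers it directly.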

Article information

Journal

Journal of Medical and Health Studies

Volume (Issue)

4 (5)

Pages

110-150

Published

2023-10-29

Keywords:

Breast cancer classification, explainable machine learning, fine needle aspirate cytology, cytological feature analysis, Random Forest, Gaussian Naive Bayes