Breast Cancer Classification: Features Investigation using Machine Learning Approaches
Keywords: Breast cancer, classification, machine learning
Breast cancer is the second most common cancer after lung cancer and one of the main causes of death worldwide. Women have a higher risk of breast cancer than men. Early diagnosis with an accurate and reliable system is therefore critical in breast cancer treatment. Machine learning techniques are widely used by researchers, especially for classification and prediction. This study evaluated the performance of breast cancer classification of malignant and benign tumors using several machine learning techniques, namely k-Nearest Neighbors (k-NN), Random Forest, and Support Vector Machine (SVM), together with ensemble techniques, to predict breast cancer survival using 10-fold cross validation. The study used the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, with 23 selected features measured from 569 patients, of whom 212 have malignant tumors and 357 have benign tumors. The analysis investigated the tumor features in three groups: mean, standard error, and worst. Each group has ten properties: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Feature selection was considered to have a significant influence on breast cancer classification. The analysis was compared and evaluated against all thirty features to determine which features to use for breast cancer classification. The results show that AdaBoost obtained the highest accuracy, at 98.95% for thirty features, 98.07% for the ten mean features, and 98.77% for the ten worst features, with the lowest error rate. Additionally, the proposed methods were evaluated using 2-fold, 3-fold, and 5-fold cross validation to find the best accuracy rate. Comparison of all methods shows that the AdaBoost ensemble method gave the highest accuracy, at 98.77% for 10-fold cross validation, and at 98.41% and 98.24% for 2-fold and 3-fold cross validation, respectively. With 5-fold cross validation, however, SVM produced the best accuracy rate at 98.60%, with the lowest error rate.
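The k-fold cross validation procedure referred to above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `stratified_folds` and `cross_validated_accuracy`, the `fit_predict` callback, and the choice of stratified splitting are assumptions made for illustration. Only the class counts (212 malignant, 357 benign) and the fold counts (2, 3, 5, 10) come from the abstract.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def cross_validated_accuracy(features, labels, k, fit_predict, seed=0):
    """Mean accuracy over k folds.

    fit_predict(train_X, train_y, test_X) is a hypothetical callback that
    trains any classifier and returns predicted labels for test_X.
    """
    folds = stratified_folds(labels, k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Class counts from the abstract: 212 malignant ("M") and 357 benign ("B").
labels = ["M"] * 212 + ["B"] * 357
for k in (2, 3, 5, 10):
    sizes = [len(fold) for fold in stratified_folds(labels, k)]
    print(k, sizes)
```

Each sample is held out exactly once, so the reported accuracy is an average over k test folds, and the error rate is one minus that accuracy. Any classifier (k-NN, Random Forest, SVM, or AdaBoost) could be plugged in through the `fit_predict` callback.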
Open access licenses
Open access is provided by licensing the content under a Creative Commons (CC) license.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.