Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm
Jue Zhang, Li Chen*, Jian-xue Tian, Fazeel Abid, Wusi Yang, and Xiao-fen Tang
International Journal of Control, Automation, and Systems, vol. 19, no. 5, pp. 1998-2008, 2021
Abstract : Learning from imbalanced data sets is a relatively new challenge in breast cancer diagnosis, where disease cases are often quite rare relative to the normal population. Traditional algorithms are accuracy-oriented, which biases them towards the majority class. Combinations of sampling methods with ensemble classifiers have shown good performance. In this paper, a hybrid of cluster-based undersampling and boosted C5.0 is proposed. The proposed classification model consists of two phases: cluster analysis and classification. In the cluster analysis phase, the affinity propagation algorithm is used to determine the number of clusters, and k-means clustering is then used to select the border and informative samples. In the classification phase, the C5.0 algorithm is used in conjunction with a boosting technique to leverage the strength of the individual classifiers. The proposed algorithm is assessed on 14 benchmark imbalanced data sets taken from the UCI repository. Extensive experimental results on different imbalanced data sets demonstrate that the proposed algorithm achieves better classification performance in terms of Matthews’ Correlation Coefficient (MCC) than other existing imbalanced data set classification algorithms.
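The abstract's point that accuracy favors the majority class, and its choice of MCC as the evaluation metric, can be illustrated with a small sketch. The code below is not taken from the paper: the function names (`confusion_counts`, `accuracy`, `mcc`), the toy labels, and the 5-in-100 class ratio are assumptions chosen for illustration, and returning 0 when the MCC denominator is zero is a common convention rather than something the abstract specifies.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred, positive=1):
    """Matthews' Correlation Coefficient computed from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced toy set: 5 disease cases (1), 95 normal cases (0).
y_true = [1] * 5 + [0] * 95

# A degenerate classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that finds 4 of the 5 disease cases at the cost of 5 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 5 + [0] * 90

for name, y_pred in [("always_negative", always_negative), ("detector", detector)]:
    print(name, "accuracy =", accuracy(y_true, y_pred), "MCC =", round(mcc(y_true, y_pred), 3))
```

On this toy data the majority-only classifier scores higher on accuracy than the classifier that actually detects disease cases, while MCC ranks them the other way round. This is the kind of behavior that makes MCC a reasonable metric for imbalanced data sets.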
Keywords : Breast cancer diagnosis, cluster analysis, imbalanced data classification, sample selection, undersampling.