The impact of training data selection on the software defect prediction performance and data complexity

Benyamin Langgu Sinaga, Sabrina Ahmad, Zuraida Abal Abas, Antasena Wahyu Anggarajati


Learning a defect prediction model directly from cross-project datasets yields a model with poor performance, so training data selection is a feasible remedy. The few comparative studies that investigate the effect of training data selection on prediction performance report contradictory results, and they do not analyze why a given selection method underperforms. This study investigates the impact of training data selection on both defect prediction performance and data complexity measures. The method is an empirical comparison of prediction performance and data complexity before and after selection. We compared 13 training data selection methods on 61 projects using six classification algorithms, and measured data complexity with six complexity measures covering class overlap, noise level, and class imbalance ratio. Experimental results indicate that no single method is best: the best-performing method varies with the dataset and the classifier. Training data selection affects the noise rate and the class imbalance most strongly. We conclude that carefully choosing the training data selection method can improve the performance of the prediction model, and we recommend addressing noise and class imbalance when designing training data selection methods.


Comparative study; Cross-project defect prediction; Data complexity measure; Software defect prediction; Training data selection
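
The before-and-after comparison described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the nearest-neighbor selection below is a generic stand-in and is not confirmed to be one of the 13 methods studied, and the function names `imbalance_ratio` and `nn_select` and the toy data are hypothetical. It shows only the idea of measuring one complexity measure, the class imbalance ratio, on the cross-project training data before and after selection.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (1.0 = balanced)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return float("inf")  # only one class present
    return max(counts.values()) / min(counts.values())


def nn_select(source_X, source_y, target_X, k=2):
    """Hypothetical selection: keep the k source rows nearest to each target row."""
    keep = set()
    for t in target_X:
        ranked = sorted(range(len(source_X)), key=lambda i: dist(source_X[i], t))
        keep.update(ranked[:k])
    idx = sorted(keep)
    return [source_X[i] for i in idx], [source_y[i] for i in idx]


# Toy cross-project (source) data: label 1 = defective, 0 = clean.
source_X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 9.0), (9.2, 8.9)]
source_y = [0, 0, 0, 1, 0, 0]
# Unlabeled instances from the target project.
target_X = [(5.0, 4.9), (1.1, 1.0)]

before = imbalance_ratio(source_y)
sel_X, sel_y = nn_select(source_X, source_y, target_X, k=2)
after = imbalance_ratio(sel_y)

print(f"imbalance ratio before selection: {before}")
print(f"imbalance ratio after selection:  {after}")
```

On this toy data the selection discards the two source rows far from the target project, which lowers the imbalance ratio. A full comparison in the spirit of the study would repeat this for each selection method, dataset, and complexity measure, alongside the classifier performance.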


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of EEI Stats