Utilizing Machine Learning Techniques for Cancer Prediction and Classification based on Gene Expression Data
DOI:
https://doi.org/10.21928/uhdjst.v9n1y2025.pp135-148
Keywords:
Cancer Classification, Gene Expression Data, RNA-Seq, DNA Microarray, Bidirectional Encoder Representations from Transformers Model, Machine Learning, Pan-cancer, The Cancer Genome Atlas, DistilBERT
Abstract
Cancer classification based on gene expression analysis has become an active research topic. It promises systematic, precise, and scientifically grounded diagnoses for different types of cancer. Several recent studies have approached cancer classification with data mining techniques, machine learning algorithms, and statistical methods for analyzing high-dimensional datasets. Detecting cancer early by examining gene expression data is vital for effective patient care. Each sample in a gene expression dataset typically comprises many features, each representing a specific gene. In this paper, we propose an approach that uses DistilBERT, a distilled version of Bidirectional Encoder Representations from Transformers (BERT), for cancer classification and prediction. The model integrates a self-attention mechanism in the transformer layers to sharpen its focus on key features and employs an embedding layer for dimensionality reduction, which improves the processing of gene expression data, helps prevent overfitting, and improves generalization. We used datasets from two major resources: the Gene Expression Omnibus (GEO), which provided microarray data for lung and ovarian cancers, and The Cancer Genome Atlas (TCGA), which provided RNA-Seq data covering multiple cancer types (breast invasive carcinoma, kidney renal clear cell carcinoma, colon adenocarcinoma, lung adenocarcinoma, and prostate adenocarcinoma). Our approach achieved high accuracy across all datasets and improved on the results reported in previous studies: 97.56% for lung cancer (DS1), 100% for ovarian cancer (DS2), and 99.504% for the TCGA dataset (DS3). The results underscore the potential of transformer-based architectures for robust cancer-type prediction and classification.
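The abstract names two building blocks, an embedding layer that reduces dimensionality and a self-attention mechanism, without giving implementation details. The NumPy sketch below illustrates the general idea only and is not the authors' DistilBERT pipeline. Everything in it is an assumption made for illustration: the toy data sizes, the choice to split each sample's genes into fixed-size chunks that act as tokens, the random (untrained) weight matrices, and the use of a single attention head.

```python
import numpy as np

def zscore(x):
    # Standardize each gene (column) to zero mean and unit variance across samples.
    std = x.std(axis=0)
    return (x - x.mean(axis=0)) / np.where(std == 0, 1.0, std)

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, wq, wk, wv):
    # Single-head scaled dot-product self-attention.
    # h has shape (batch, tokens, d_model).
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)

# Hypothetical toy sizes; real expression matrices have far more genes.
n_samples, n_genes = 8, 1200
n_tokens, d_model = 12, 16
chunk = n_genes // n_tokens  # genes per token (assumed grouping)

# Random stand-in for an expression matrix (samples x genes).
x = zscore(rng.normal(size=(n_samples, n_genes)))

# Embedding step: group genes into chunks and project each chunk to d_model,
# so each sample goes from 1200 raw features to 12 tokens of width 16.
tokens = x.reshape(n_samples, n_tokens, chunk)
w_embed = rng.normal(scale=0.1, size=(chunk, d_model))
h = tokens @ w_embed

# Self-attention step with random, untrained projection matrices.
wq, wk, wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(h, wq, wk, wv)

# Mean-pool the tokens into one vector per sample for a downstream classifier.
pooled = out.mean(axis=1)
```

In a trained model the projection and attention weights would be learned, and the attention weights would indicate which gene groups the classifier attends to for each sample. DistilBERT itself stacks several multi-head attention layers with feed-forward blocks, which this sketch omits.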
License
Copyright (c) 2025 Mariwan Mahmood Hama Aziz, Sozan Abdullah Mahmood

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.