Evaluating Machine Learning Algorithms for Detecting Online Text-based Fake News Content

Deni Kurnianto Nugroho, Marwan Noor Fauzy, Kardilah Rohmat Hidayat

Abstract


The rapid spread of disinformation and fabricated news across online platforms poses a critical risk to informed public engagement and the foundations of democratic governance. This study examines how well different machine learning techniques classify fake news, using textual features extracted with the Term Frequency–Inverse Document Frequency (TF-IDF) method. The analysis covers five commonly used algorithms: Logistic Regression, Support Vector Machine (SVM), Naive Bayes, Random Forest, and XGBoost. A publicly accessible dataset of annotated real and fake news articles served as the basis for training and testing these models. The dataset underwent extensive preprocessing, including tokenization, stopword removal, and TF-IDF vectorization, resulting in a sparse, high-dimensional matrix of 5,068 documents and 39,978 features. Performance was evaluated on multiple metrics: train and test accuracy, misclassification rate, false positives and false negatives, mean cross-validation score, and execution time. SVM and Logistic Regression achieved the highest test accuracy (93.61% and 92.27%, respectively) and robust cross-validation scores, indicating strong generalization ability. In contrast, Naive Bayes ran faster but suffered from a high false positive rate and lower accuracy (84.77%). Random Forest and XGBoost demonstrated good predictive power but showed signs of overfitting and moderate misclassification rates. These findings suggest that SVM and Logistic Regression are well suited to fake news detection in textual datasets using TF-IDF features. While traditional models remain effective, future work may explore deep learning approaches and context-aware language models to improve detection accuracy on more complex and multilingual datasets. This study contributes to ongoing efforts to combat misinformation through automated, scalable, and interpretable machine learning techniques.
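The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: the toy corpus, the one-word stopword list, the TF-IDF variant (term count divided by document length, times ln(N/df)), the function names, and the convention that label 1 means "fake". The abstract does not state which library, weighting variant, or label encoding the study used.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the study's dataset is not reproduced here.
DOCS = [
    "the senate passed the budget bill",
    "aliens secretly control the senate",
    "the budget bill funds schools",
]
STOPWORDS = {"the"}  # illustrative stopword list (assumption)


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, misclassification rate, and false positive/negative counts.

    `positive` is the label treated as "fake" (assumption).
    """
    pairs = list(zip(y_true, y_pred))
    errors = sum(1 for t, p in pairs if t != p)
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    false_neg = sum(1 for t, p in pairs if p != positive and t == positive)
    total = len(pairs)
    return {
        "accuracy": (total - errors) / total,
        "misclassification_rate": errors / total,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


WEIGHTS = tfidf(DOCS)
# Hypothetical labels and predictions, only to exercise the metrics.
METRICS = evaluate([1, 0, 1, 0], [1, 1, 0, 0])
```

In this sketch a term that appears in fewer documents receives a larger weight than an equally frequent term that appears in more of them, which is the property that makes TF-IDF features sparse and discriminative. A library vectorizer would typically add smoothing and normalization that this version omits.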

DOI: https://doi.org/10.29040/ijcis.v6i3.253


This work is licensed under a Creative Commons Attribution 4.0 International License.