AI-Driven Web Content Scraping Focusing on Transformer-Based: A Systematic Literature Review

Pengikisan Kandungan Web Didorong Kepintaran Buatan (AI) dengan Tumpuan kepada Model Berasaskan Transformer: Ulasan Literatur Sistematik

Authors

  • Khirulnizam Abd Rahman, Universiti Islam Selangor
  • Syed Arbaz Ahmed, Universiti Islam Selangor
  • Sazanah binti Md Ali, Universiti Islam Selangor

DOI:

https://doi.org/10.53840/ejpi.v12i4.291

Keywords:

web scraping, AI-driven, transformer-based, information retrieval, systematic literature review

Abstract

Web scraping has become a key technique for harvesting large volumes of data from varied online platforms, underpinning applications in business intelligence, market research, academic studies, and social media monitoring. Traditional scraping strategies are usually rule-based and depend on fixed pattern-matching techniques such as XPath, CSS selectors, and regular expressions; they tend to fail when web pages undergo structural modification or when content is rendered dynamically via JavaScript. This brittleness limits the longevity and flexibility of standard solutions, especially on diverse and frequently updated sites. Recent advances in artificial intelligence, particularly transformer architectures such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), have paved the way for smarter, context-sensitive content extraction. These models capture contextual semantics and can interpret complex language constructs, yielding higher precision in information retrieval and greater resilience to changes in page layout. This systematic literature review (SLR) synthesizes existing research on integrating transformer-based models into web scraping pipelines. It surveys current system designs, integration methods, evaluation metrics, and comparative performance results, and it highlights open research gaps and directions for future work. The findings are intended to give researchers and practitioners a thorough understanding that can help them improve scraping systems with transformer-based AI technologies.
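The brittleness of rule-based extraction described in the abstract can be illustrated with a minimal sketch. It is not taken from the reviewed studies: the rule `PRICE_RULE`, the helper `extract_price`, and both HTML snippets are hypothetical, chosen only to show how a fixed regular expression stops matching after a small markup change.

```python
import re
from typing import Optional

# Hypothetical fixed rule: assumes the price always sits in
# <span class="price">...</span>. Real scrapers often hard-code
# similar assumptions as XPath or CSS selectors.
PRICE_RULE = re.compile(r'<span class="price">([^<]+)</span>')


def extract_price(html: str) -> Optional[str]:
    """Return the first price matched by the fixed rule, or None."""
    match = PRICE_RULE.search(html)
    return match.group(1).strip() if match else None


# Two made-up versions of the same page. The second renames the class
# and changes the tag, as a site redesign might.
page_v1 = '<div><span class="price">RM 19.90</span></div>'
page_v2 = '<div><p class="product-cost">RM 19.90</p></div>'

result_v1 = extract_price(page_v1)  # the rule matches the old layout
result_v2 = extract_price(page_v2)  # the rule silently finds nothing

print("old layout:", result_v1)
print("new layout:", result_v2)
```

The visible text is identical in both snippets, yet the rule depends on the markup rather than on the meaning of the content. That dependence is the gap the abstract says context-aware models aim to close.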

Published

20-12-2025

How to Cite

Abd Rahman, K., Ahmed, S. A., & Md Ali, S. (2025). AI-Driven Web Content Scraping Focusing on Transformer-Based: A Systematic Literature Review: Pengikisan Kandungan Web Didorong Kepintaran Buatan (AI) dengan Tumpuan kepada Model Berasaskan Transformer: Ulasan Literatur Sistematik. E-Jurnal Penyelidikan Dan Inovasi, 12(4), 200–212. https://doi.org/10.53840/ejpi.v12i4.291
