AI-Driven Web Content Scraping Focusing on Transformer-Based Models: A Systematic Literature Review
DOI:
https://doi.org/10.53840/ejpi.v12i4.291
Keywords:
web scraping, AI-driven, transformer-based, information retrieval, systematic literature review
Abstract
Web scraping has become a key technique for harvesting large volumes of data from varied online platforms, underpinning applications in business intelligence, market research, academic studies, and social media monitoring. Traditional scraping strategies are usually rule-based and depend on fixed pattern-matching techniques such as XPath, CSS selectors, and regular expressions; they tend to fail when web pages undergo structural modification or when content is rendered dynamically via JavaScript. This brittleness compromises the longevity and flexibility of standard solutions, especially on diverse and frequently updated sites. Recent advances in artificial intelligence, particularly transformer architectures such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), have paved the way for context-sensitive content extraction. These models capture contextual semantics and can interpret complex language constructs, yielding higher precision in information retrieval and greater resilience to changes in page layout. This systematic literature review (SLR) synthesizes the existing research on integrating transformer-based models into web scraping pipelines. It surveys current system designs, integration methods, evaluation metrics, and comparative performance results, and identifies open research gaps and directions for future work. The results are intended to give researchers and practitioners a thorough understanding that can help them improve scraping systems by incorporating transformer-based AI technologies.
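The brittleness of fixed pattern matching described in the abstract can be illustrated with a minimal sketch. It is not taken from the reviewed studies: the two HTML snippets, the `PRICE_RULE` pattern, and the `extract_price` helper are hypothetical names introduced only for illustration, and a regular expression stands in for any rule-based selector (XPath and CSS selectors would be expected to break in a similar way).

```python
import re
from typing import Optional

# Hypothetical page snippets: the same product before and after a redesign.
PAGE_V1 = '<div class="price">USD 19.99</div>'
PAGE_V2 = '<span data-role="amount">USD 19.99</span>'

# Rule-based extractor tied to one specific tag and class name (an assumption
# made for this sketch, not a pattern from any particular scraper).
PRICE_RULE = re.compile(r'<div class="price">([^<]+)</div>')


def extract_price(html: str) -> Optional[str]:
    """Return the price text if the fixed pattern matches, else None."""
    match = PRICE_RULE.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_price(PAGE_V1))  # the rule matches the original layout
    print(extract_price(PAGE_V2))  # same content, new markup: no match
```

The visible content is identical in both snippets, yet the rule returns nothing for the second one because it encodes the markup rather than the meaning of the text. This is the failure mode that context-aware extraction is meant to reduce.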
Copyright (c) 2025 e-Jurnal Penyelidikan dan Inovasi

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.