
Performance Analysis of Transformer-Enabled Semantic Crawlers for Scalable Text Retrieval

Abstract

With the exponential growth of web-based content, efficiently retrieving contextually relevant text starting from seed URLs has become a critical challenge in web content mining and information retrieval. Traditional crawling and search methods, such as breadth-first search (BFS), depth-first search (DFS), best-first (focused) crawling, topic-sensitive PageRank, and context-graph models, typically suffer from parameter-tuning overhead, a lack of contextual understanding, the need for large training datasets, high computational cost, and dependence on specialised infrastructure. This research presents a comprehensive comparative study of multiple search and crawling models applied to text retrieval from seed URLs, with a particular focus on their performance across diverse web structures (static vs. dynamic) and content types. Using a unified experimental framework implemented in Python with a MySQL backend, we evaluate each algorithm with standard performance metrics (precision, recall, F1-score) alongside newer metrics such as coverage, relevance score, search time, memory usage, throughput, and harvest rate. Machine-learning-enabled variants (for example, Semantic-BFS and Semantic-DFS, which use transformer-based embeddings) are also included to assess their value over purely structural methods. Our results show that while semantic-enhanced BFS (Semantic-BFS) yields higher coverage, better relevance, and faster response time in many scenarios, it shows limitations on the classical metrics (precision, recall, F1) when the ground-truth labels do not adequately capture semantic relevance. The study provides insights into algorithmic trade-offs and suitability for different web architectures, and proposes hybrid strategies for next-generation crawlers and retrieval systems. The findings contribute toward the design of more adaptive, semantic-aware, and scalable web content mining frameworks.
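
To make the idea of a semantically guided BFS and the evaluation metrics concrete, the following is a minimal sketch and not the implementation evaluated in the study. Everything in it is an assumption made for illustration: the in-memory `TOY_WEB` graph, the function names, the similarity threshold, and the ground-truth labels are hypothetical. The bag-of-words `embed` function is only a stand-in for the transformer-based embeddings named in the abstract. Harvest rate is taken here as relevant pages divided by pages fetched, a common definition that the abstract does not spell out.

```python
import math
from collections import Counter, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("web crawling and information retrieval overview", ["a", "b"]),
    "a": ("semantic crawling with transformer embeddings for retrieval", ["c"]),
    "b": ("cooking recipes and kitchen tips", ["d"]),
    "c": ("focused crawling improves retrieval of relevant web pages", []),
    "d": ("more recipes for dinner", []),
}


def embed(text):
    """Stand-in embedding: bag-of-words counts (NOT a transformer model)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_bfs(seed, query, threshold=0.1, max_pages=10):
    """BFS from the seed; only pages similar to the query are expanded."""
    query_vec = embed(query)
    queue, seen = deque([seed]), {seed}
    visited, relevant = [], []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        visited.append(url)
        if cosine(query_vec, embed(text)) < threshold:
            continue  # off-topic page: fetched, but its links are not followed
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, relevant


def harvest_rate(visited, relevant):
    """Fraction of fetched pages judged relevant (assumed definition)."""
    return len(relevant) / len(visited) if visited else 0.0


def precision_recall_f1(retrieved, ground_truth):
    """Classical set-based metrics against labelled relevant pages."""
    tp = len(retrieved & ground_truth)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


visited, relevant = semantic_bfs("seed", "web crawling retrieval")
print("visited:", visited)
print("relevant:", relevant)
print("harvest rate:", harvest_rate(visited, relevant))
# Hypothetical labels that omit page "a" even though it is on-topic.
print("P/R/F1:", precision_recall_f1(set(relevant), {"seed", "c"}))
```

In this toy run the crawler never follows links out of the off-topic page, which is what raises the harvest rate relative to plain BFS. The hypothetical label set deliberately leaves out one on-topic page, so precision drops even though the retrieved page is semantically relevant; this mirrors the limitation the abstract reports for the classical metrics when labels do not capture semantic relevance.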

Keywords

Web Content Mining, Information Retrieval, Seed URL, Text Search Models, Link Analysis, Context Graph, BFS, DFS, Semantic Search, Algorithm Comparison, Machine Learning

