A Python-based web crawler and spell checker that crawls all accessible pages of a website, extracts visible text content, and identifies spelling mistakes with suggestions for corrections.
This project demonstrates automated web data extraction combined with simple spell checking to help improve content quality on websites.
- Website Crawling: Starting from the homepage, traverses all paginated product listing pages and extracts individual product URLs.
- Visible Text Extraction: Parses HTML to extract meaningful text content such as product titles, descriptions, and product info, excluding scripts and styles.
- Clean and Modular: Cleans text before spell checking, filtering out punctuation and non-alphabetic tokens.
- Spell Checking: Uses the spellchecker library to detect misspelled words and provide up to 5 suggestions for each misspelling.
- Data Export: Saves all extracted links, product details, and spelling reports as CSV files for easy review and further processing.
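The text extraction and cleaning features can be illustrated with a minimal sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup4 so that it runs without extra dependencies. The names `VisibleTextParser`, `extract_visible_text`, and `alphabetic_tokens` are hypothetical and are not taken from the project's code.

```python
import string
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>.

    Illustrative stand-in for a BeautifulSoup4-based extractor.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside scripts and styles.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def alphabetic_tokens(text):
    """Strip surrounding punctuation and keep purely alphabetic words."""
    stripped = (word.strip(string.punctuation).lower() for word in text.split())
    return [word for word in stripped if word.isalpha()]
```

For example, `alphabetic_tokens("Price: 51.77 GBP, in stock.")` returns `['price', 'gbp', 'in', 'stock']`: the number is dropped because it is not alphabetic, and trailing punctuation is stripped from the remaining words.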
By combining web crawling with spell checking, this solution helps website owners and content teams identify and fix spelling errors across all pages, improving user experience and professionalism. All data and reports can be exported as CSV files.
Foundation for NLP Applications
The extracted, cleaned, and validated product data can be used as the basis for building advanced NLP-powered applications, like a book recommendation system.
| Tool | Purpose |
|---|---|
| Python | Core programming language |
| Requests | HTTP requests and page fetching |
| BeautifulSoup4 | HTML parsing and content extraction |
| urllib.parse | URL management |
| spellchecker | Spell checking and suggestion engine |
| Pandas | Data handling and CSV export |
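As an illustration of the URL management role listed for `urllib.parse`, the sketch below resolves relative links against the page they were found on and checks that a link stays on the same site. The helper names `resolve_link` and `same_site`, and the `example.com` URLs in the usage note, are assumptions made for this example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_link(page_url, href):
    """Turn a possibly relative href into an absolute URL without a #fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url


def same_site(url, base_url):
    """True when both URLs share a host, so a crawler stays on one website."""
    return urlparse(url).netloc == urlparse(base_url).netloc
```

Calling `resolve_link("https://example.com/catalogue/page-1.html", "page-2.html")` gives `https://example.com/catalogue/page-2.html`, which is the kind of resolution a crawler needs when pagination links are relative.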
- Start Crawling: Begins at the homepage URL and traverses through all product listing pages using pagination links.
- Extract URLs: Collects URLs of all individual product pages.
- Parse Content: Visits each product page and extracts key visible content: title, description, product details, and image URLs.
- Spell Check: Cleans extracted text and identifies misspelled words with suggested corrections.
- Reporting: Compiles and exports comprehensive CSV reports of links, products, and spelling issues.
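The workflow above can be sketched end to end under a few loudly stated assumptions. To stay dependency-free, this version swaps in `html.parser` for BeautifulSoup4, `difflib` for the spellchecker library, and the `csv` module for Pandas. The CSS classes `product` and `next`, the injected `fetch` callable, and all function names are hypothetical; the real site markup and project code may differ.

```python
import csv
import difflib
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (css_class, href) pairs from every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            if attributes.get("href"):
                self.links.append((attributes.get("class"), attributes["href"]))


def crawl_listing_pages(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url and collect product page URLs.

    `fetch` is any callable mapping a URL to HTML text (for example a thin
    wrapper around requests.get). The 'product' and 'next' class names are
    assumptions about the markup.
    """
    product_urls, visited, url = [], set(), start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        next_url = None
        for css_class, href in collector.links:
            absolute = urljoin(url, href)
            if css_class == "product" and absolute not in product_urls:
                product_urls.append(absolute)
            elif css_class == "next":
                next_url = absolute
        url = next_url
    return product_urls


def spelling_report(text, known_words, max_suggestions=5):
    """List unknown words with up to max_suggestions close matches.

    difflib stands in for a real spell checker; known_words is a plain
    iterable of correctly spelled lowercase words.
    """
    vocabulary = sorted(set(known_words))
    rows = []
    for word in sorted(set(re.findall(r"[a-z]+", text.lower()))):
        if word not in vocabulary:
            matches = difflib.get_close_matches(word, vocabulary, n=max_suggestions)
            rows.append({"word": word, "suggestions": ", ".join(matches)})
    return rows


def write_csv(rows, fieldnames, handle):
    """Write a list of dicts to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

Passing `fetch` in as a parameter keeps the crawl loop testable against in-memory pages, and the `visited` set together with `max_pages` guards against pagination loops.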
Leverage this clean, structured dataset to build a Book Recommendation NLP app that can provide personalized suggestions based on content and metadata scraped from the website.
Vaisakh Nirupam
LinkedIn