04 – Real-World Python Projects – Web Scraper

🎯 Project Objective

Build a web scraper that automatically extracts data from websites for analysis, monitoring, or reporting.

Skills Demonstrated:

  • Sending HTTP requests
  • Parsing HTML and XML with BeautifulSoup
  • Handling dynamic content with Selenium
  • Storing scraped data in CSV or Excel
  • Automating repetitive data collection tasks

Project: Web Scraper App

Project Description

The Web Scraper app allows users to collect information from websites, such as:

  • Product prices from e-commerce sites
  • News headlines or articles
  • Job postings
  • Stock prices or cryptocurrency rates

Real-Life Example: Scrape the Books to Scrape demo site (books.toscrape.com), collecting each book's title, price, and availability.


Python Example Code – Basic Scraper

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to scrape
url = "https://books.toscrape.com/"
response = requests.get(url)

# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# Extract book titles, prices, and availability
books = soup.find_all("h3")
prices = soup.find_all("p", class_="price_color")
availability = soup.find_all("p", class_="instock availability")

data = []
for book, price, avail in zip(books, prices, availability):
    data.append({
        "Title": book.a["title"],
        "Price": price.text,
        "Availability": avail.text.strip()
    })

# Save data to CSV
df = pd.DataFrame(data)
df.to_csv("books.csv", index=False)
print("Scraping completed. Data saved to books.csv")

✅ Outputs: CSV file with book title, price, and availability.
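The zip-based extraction above works because the three lists happen to align on this page, but it can silently misalign if any book lacks one of the fields. A more defensive sketch walks each product container instead, and adds a timeout and a status check to the request. The `fetch_page` and `parse_books` names are illustrative, not part of the original script:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url, timeout=10):
    """Fetch a page with a browser-like User-Agent and basic error handling."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; book-scraper/1.0)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text

def parse_books(html):
    """Extract title/price/availability rows from a catalogue page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # On books.toscrape.com each book sits inside <article class="product_pod">,
    # so title, price, and availability are read from the same container.
    for article in soup.find_all("article", class_="product_pod"):
        rows.append({
            "Title": article.h3.a["title"],
            "Price": article.find("p", class_="price_color").text,
            "Availability": article.find("p", class_="instock availability").text.strip(),
        })
    return rows
```

Scoping the lookups to one `article` at a time means a missing price breaks only that row, not the pairing of every row after it.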


Advanced Scraping – Pagination

base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []

for page in range(1, 6):  # First 5 pages
    url = base_url.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    books = soup.find_all("h3")
    prices = soup.find_all("p", class_="price_color")
    
    for book, price in zip(books, prices):
        all_books.append({"Title": book.a["title"], "Price": price.text})

df = pd.DataFrame(all_books)
df.to_csv("books_paginated.csv", index=False)
print("Paginated scraping completed.")
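Hard-coding `range(1, 6)` works when the page count is known in advance. A hedged sketch of an alternative: stop at the first 404 (books.toscrape.com returns 404 past its last catalogue page) and pause between requests to stay polite. `scrape_all_pages` is an illustrative helper name:

```python
import time
import requests

def scrape_all_pages(base_url, max_pages=50, delay=1.0):
    """Follow numbered pages until the server says there are no more."""
    pages = []
    for page in range(1, max_pages + 1):
        response = requests.get(base_url.format(page), timeout=10)
        if response.status_code == 404:   # past the last page on this site
            break
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(delay)                 # be polite: pause between requests
    return pages
```

Each returned HTML string can then be fed to the same parsing code used for a single page.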

Scraping Dynamic Websites – Selenium Example

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4.6+ downloads a matching ChromeDriver automatically
driver.get("https://quotes.toscrape.com/js/")

quotes = driver.find_elements(By.CLASS_NAME, "quote")
data = []
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    data.append({"Quote": text, "Author": author})

driver.quit()

import pandas as pd
df = pd.DataFrame(data)
df.to_csv("quotes_dynamic.csv", index=False)
print("Dynamic scraping completed.")

✅ Key Features

  • Extract data from static and dynamic websites
  • Handle pagination
  • Store data in CSV or Excel
  • Automate repetitive scraping tasks
  • Optional: Integrate with APIs for JSON scraping
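On the optional JSON route: when a site exposes an API, requests can decode the response body directly, skipping HTML parsing entirely. A minimal sketch; the endpoint URL below is hypothetical and stands in for whatever real API you have access to:

```python
import requests

def fetch_json(url):
    """Fetch a JSON endpoint and return the decoded Python object."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # requests decodes the JSON body

# Usage with a hypothetical endpoint (substitute a real API):
# rows = fetch_json("https://api.example.com/v1/prices")
# pd.DataFrame(rows).to_csv("prices.csv", index=False)
```

JSON responses are usually far more stable than HTML class names, so prefer an API whenever one exists.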
