# Web Scraping with Python
Welcome to Chapter 26! Web scraping lets you extract data from websites automatically — collecting prices, news headlines, research data, and more.
---
1. Learning Objectives
- Understand web scraping concepts and ethics.
- Use `requests` and `BeautifulSoup` to scrape HTML.
- Navigate and extract elements from HTML.
- Handle pagination and dynamic content.
- Build a news headline scraper.
---
2. Setup
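The examples in this chapter import `requests` and `bs4`, which are published on PyPI as `requests` and `beautifulsoup4`. A minimal sketch for confirming that both are importable before running the examples; the helper name `missing_packages` is hypothetical and not part of either library:

```python
import importlib.util

# Maps the import name used in code to the PyPI package that provides it.
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of required packages that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All scraping dependencies are available.")
```

If anything is reported missing, the printed `pip install` line can be run in the same environment.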
---
3. Basic Web Scraping
```python
import requests
from bs4 import BeautifulSoup

# Fetch the web page
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title
print(f"Page Title: {soup.title.string}")

# Find all quotes and print the first five
quotes = soup.find_all("span", class_="text")
for i, quote in enumerate(quotes[:5], 1):
    print(f"  {i}. {quote.get_text()}")
```
```python
from bs4 import BeautifulSoup

# Inline HTML used to practice locating elements
html = """
<html>
<body>
  <div class="container">
    <h1 id="title">My Page</h1>
    <ul class="items">
      <li class="item">Apple</li>
      <li class="item">Banana</li>
      <li class="item">Cherry</li>
    </ul>
    <a href="https://python.org">Python</a>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find by tag
print(soup.find("h1").text)        # My Page

# Find by ID
print(soup.find(id="title").text)  # My Page

# Find by class (class_ avoids the reserved word "class")
items = soup.find_all("li", class_="item")
for item in items:
    print(f"  - {item.text}")

# Find a link, then read its href attribute and its text
link = soup.find("a")
print(f"URL: {link['href']}")  # URL: https://python.org
print(f"Text: {link.text}")    # Text: Python

# CSS selectors
items = soup.select("ul.items li.item")
print([item.text for item in items])
```
```python
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract quotes with author and tags
quote_divs = soup.find_all("div", class_="quote")

for div in quote_divs[:3]:
    text = div.find("span", class_="text").get_text()
    author = div.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in div.find_all("a", class_="tag")]
    print(f"  \"{text[:50]}...\"")
    print(f"  — {author}")
    print(f"  Tags: {', '.join(tags)}\n")
```
```python
import requests
from bs4 import BeautifulSoup

all_quotes = []
base_url = "https://quotes.toscrape.com/page/{}/"

for page in range(1, 4):  # First 3 pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = soup.find_all("span", class_="text")
    for q in quotes:
        all_quotes.append(q.get_text())

print(f"Total quotes collected: {len(all_quotes)}")
for i, q in enumerate(all_quotes[:5], 1):
    print(f"  {i}. {q[:60]}...")
```
```python
import csv
import json

# Sample records standing in for scraped results
data = [
    {"title": "Article 1", "author": "Alice", "date": "2025-01-15"},
    {"title": "Article 2", "author": "Bob", "date": "2025-01-16"},
]

# Save to CSV (newline="" prevents blank rows on Windows)
with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "date"])
    writer.writeheader()
    writer.writerows(data)

# Save to JSON
with open("scraped_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

print("✅ Data saved to CSV and JSON!")
```
```python
import time

# Polite scraping with delays
for page in range(1, 5):
    # response = requests.get(url)
    time.sleep(2)  # Wait 2 seconds between requests
```
```python
import requests
from bs4 import BeautifulSoup


def scrape_quotes_as_news():
    """Scrape quotes site as a demo (reliable for learning)."""
    url = "https://quotes.toscrape.com/"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise on 4xx/5xx responses
        soup = BeautifulSoup(response.text, "html.parser")

        print("=" * 50)
        print("  📰 HEADLINE SCRAPER (Demo)")
        print("=" * 50)

        quotes = soup.find_all("div", class_="quote")
        for i, quote_div in enumerate(quotes, 1):
            text = quote_div.find("span", class_="text").get_text()
            author = quote_div.find("small", class_="author").get_text()
            tags = [t.get_text() for t in quote_div.find_all("a", class_="tag")]
            print(f"\n  [{i}] {text[:70]}...")
            print(f"      Author: {author}")
            print(f"      Tags: {', '.join(tags[:3])}")

        print(f"\n  {'=' * 50}")
        print(f"  Total items scraped: {len(quotes)}")
    except requests.exceptions.RequestException as e:
        print(f"  ❌ Error: {e}")


scrape_quotes_as_news()
```
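The fixed `time.sleep(2)` in the polite-scraping loop above always waits the full two seconds, even when fetching and parsing already took a while. One possible refinement is to wait only for whatever remains of the minimum interval. The sketch below uses a hypothetical `Throttle` class, which is not part of `requests` or BeautifulSoup, and leaves the actual request commented out just as the original loop does:

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # time.monotonic() value of the previous call

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


if __name__ == "__main__":
    throttle = Throttle(min_interval=2.0)
    for page in range(1, 3):
        throttle.wait()  # the first call returns immediately
        # response = requests.get(url, timeout=10)
        print(f"Would fetch page {page} now")
```

`time.monotonic()` is used instead of `time.time()` so that system clock adjustments cannot shorten or lengthen the pause.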
---
10. MCQs with Answers
Q1: BeautifulSoup is for:
A) Making requests B) Parsing HTML C) Building websites D) Sending emails
Answer: B
Q2: soup.find_all() returns:
A) First match B) List of matches C) Boolean D) String
Answer: B
Q3: robots.txt specifies:
A) Passwords B) Scraping rules C) Server config D) CSS styles
Answer: B
Q4: soup.select() uses:
A) XPath B) CSS selectors C) Regex D) JSON
Answer: B
Q5: To get attribute from tag <a href="url">:
A) tag.href B) tag["href"] C) tag.getattr() D) tag.href()
Answer: B
Q6: Why add delays between requests?
A) Speed up B) Be polite to server C) Debug D) Save data
Answer: B
Q7: get_text() returns:
A) HTML B) Plain text without tags C) Attributes D) Tag name
Answer: B
Q8: Which parser can BeautifulSoup use without an extra install?
A) html.parser B) json.parser C) xml.parser D) text.parser
Answer: A. It is built into Python's standard library.
Q9: Dynamic content (JavaScript) needs:
A) BeautifulSoup B) Selenium or Playwright C) csv module D) regex
Answer: B
Q10: Ethical scraping includes:
A) Ignoring robots.txt B) Overwhelming servers C) Respecting rate limits D) Scraping everything
Answer: C
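
Several of the answers above (Q2, Q4, Q5, Q7) can be checked directly against a small inline document. A minimal sketch, assuming `beautifulsoup4` is installed; the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<ul class="items">
  <li class="item">Apple</li>
  <li class="item"><b>Banana</b> split</li>
</ul>
<a href="https://python.org">Python</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Q2: find() returns the first match; find_all() returns a list of matches.
first = soup.find("li")
matches = soup.find_all("li")
print(first.get_text())                         # Apple
print(isinstance(matches, list), len(matches))  # True 2

# Q4: select() takes a CSS selector.
selected = soup.select("ul.items li.item")
print(len(selected))  # 2

# Q5: attributes are read with dictionary-style indexing.
link = soup.find("a")
print(link["href"])  # https://python.org

# Q7: get_text() drops the tags and keeps the text.
print(matches[1].get_text())  # Banana split
```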
---
11. Interview Questions
- 1. What is web scraping? Automatically extracting data from websites, typically by fetching pages and parsing their HTML.
- 2. BeautifulSoup vs Scrapy? BeautifulSoup is a parsing library suited to small tasks. Scrapy is a full framework for large-scale scraping.
- 3. How to handle dynamic content? Use a browser automation tool such as Selenium or Playwright, which can drive a headless browser to render JavaScript-driven pages.
- 4. Is web scraping legal? It depends on the site's terms of service, the jurisdiction, and the type of data. Always check robots.txt and the terms before scraping.
- 5. How to avoid getting blocked? Rotate user agents, use proxies, add delays between requests, and respect robots.txt.
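
Checking robots.txt does not have to be manual. A minimal sketch using the standard library's `urllib.robotparser`; the rules and the `example.com` URLs are invented so the example runs offline, and the user-agent string `my-scraper` is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content. A real file would be downloaded with
# parser.set_url(...) followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # placeholder identifier for this sketch

allowed = parser.can_fetch(USER_AGENT, "https://example.com/page/1/")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)

print(allowed)  # True
print(blocked)  # False
print(delay)    # 2
```

When a site declares a crawl delay, that value is a natural choice for the pause between requests.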
---
12. Summary
- `requests` fetches web pages; `BeautifulSoup` parses HTML.
- Use `find()`, `find_all()`, and `select()` to navigate HTML.
- Save scraped data to CSV or JSON.
- Always practice ethical scraping — check robots.txt, add delays.
- Use Selenium for JavaScript-rendered content.
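
For the JavaScript case, the HTML has to come from a real browser before BeautifulSoup can parse it. A hedged sketch, assuming the `selenium` package and a local Chrome installation are available; the function name `fetch_rendered_html` and the fixed pause are illustrative choices, not something this chapter prescribes:

```python
import time


def fetch_rendered_html(url, wait_seconds=2.0):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude pause so scripts can finish rendering
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` in the same way `response.text` is used in the `requests` examples.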
---
13. Next Chapter Recommendation
In Chapter 27: Python Automation Projects, you'll automate file management, emails, and more! 🚀