Python for Beginners
CHAPTER 26 Beginner

Updated: May 17, 2026

# Web Scraping with Python

Welcome to Chapter 26! Web scraping lets you extract data from websites automatically — collecting prices, news headlines, research data, and more.

---

## 1. Learning Objectives

  • Understand web scraping concepts and ethics.
  • Use requests and BeautifulSoup to scrape HTML.
  • Navigate and extract elements from HTML.
  • Handle pagination and dynamic content.
  • Build a news headline scraper.

---

## 2. Setup

```bash
pip install requests beautifulsoup4
```

---

## 3. Basic Web Scraping

```python id="py26ex1"
import requests
from bs4 import BeautifulSoup

# Fetch the web page
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title
print(f"Page Title: {soup.title.string}")

# Find all quotes and print the first five
quotes = soup.find_all("span", class_="text")
for i, quote in enumerate(quotes[:5], 1):
    print(f"  {i}. {quote.get_text()}")
```

---

## 4. Navigating HTML

```python id="py26ex2"
from bs4 import BeautifulSoup

html = """
<html>
<body>
  <div class="container">
    <h1 id="title">My Page</h1>
    <ul class="items">
      <li class="item">Apple</li>
      <li class="item">Banana</li>
      <li class="item">Cherry</li>
    </ul>
    <a href="https://python.org">Python</a>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find by tag
print(soup.find("h1").text)  # My Page

# Find by ID
print(soup.find(id="title").text)  # My Page

# Find by class (class_ avoids the reserved word "class")
items = soup.find_all("li", class_="item")
for item in items:
    print(f"  - {item.text}")

# Find a link and read its attribute and text
link = soup.find("a")
print(f"URL: {link['href']}")  # https://python.org
print(f"Text: {link.text}")    # Python

# CSS selectors
items = soup.select("ul.items li.item")
print([item.text for item in items])
```

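A few more navigation helpers are worth knowing once `find()` and `select()` feel familiar. The sketch below is illustrative only: the HTML fragment and the `"(no href)"` fallback are made up for this example. It shows `select_one()`, the difference between `tag["href"]` and `tag.get("href")`, and walking up the tree with `find_parent()`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link deliberately has no href attribute.
html = """
<ul class="items">
  <li class="item"><a href="/apple">Apple</a></li>
  <li class="item"><a>Banana</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element matching a CSS selector, or None.
first_link = soup.select_one("ul.items a")
print(first_link["href"])  # /apple

# tag["href"] raises KeyError when the attribute is missing;
# tag.get("href", default) returns the default instead.
links = soup.select("li.item a")
hrefs = [a.get("href", "(no href)") for a in links]
print(hrefs)  # ['/apple', '(no href)']

# find_parent() walks up to the nearest matching ancestor.
parent = first_link.find_parent("li")
print(parent["class"])  # ['item'] (class is returned as a list)
```

Using `.get()` is usually the safer choice when scraping real pages, where some elements may lack the attribute you expect.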
---

## 5. Extracting Data

```python id="py26ex3"
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract quotes with author and tags
quote_divs = soup.find_all("div", class_="quote")

for div in quote_divs[:3]:
    text = div.find("span", class_="text").get_text()
    author = div.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in div.find_all("a", class_="tag")]
    print(f"  \"{text[:50]}...\"")
    print(f"  - {author}")
    print(f"  Tags: {', '.join(tags)}\n")
```

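Real pages are not always as tidy as the demo site: `find()` returns `None` when an element is missing, and calling `.get_text()` on `None` raises `AttributeError`. Here is a minimal sketch of defensive extraction. The sample HTML and the helper names `text_or_default` and `parse_quotes` are hypothetical, chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second quote block has no author element.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Alice</small>
  <a class="tag" href="/tag/life/">life</a>
  <a class="tag" href="/tag/books/">books</a>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
</div>
"""


def text_or_default(parent, name, css_class, default=""):
    """Return the stripped text of the first matching child, or a default."""
    element = parent.find(name, class_=css_class)
    return element.get_text(strip=True) if element is not None else default


def parse_quotes(html):
    """Turn each div.quote into a dict, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for div in soup.find_all("div", class_="quote"):
        records.append({
            "text": text_or_default(div, "span", "text"),
            "author": text_or_default(div, "small", "author", default="Unknown"),
            "tags": [a.get_text(strip=True) for a in div.find_all("a", class_="tag")],
        })
    return records


records = parse_quotes(SAMPLE_HTML)
for record in records:
    print(record)
```

Keeping the parsing in a function that takes an HTML string also means you can test it without making any network requests.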
---

## 6. Scraping Multiple Pages

```python id="py26ex4"
import requests
from bs4 import BeautifulSoup

all_quotes = []
base_url = "https://quotes.toscrape.com/page/{}/"

for page in range(1, 4):  # First 3 pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = soup.find_all("span", class_="text")
    for q in quotes:
        all_quotes.append(q.get_text())

print(f"Total quotes collected: {len(all_quotes)}")
for i, q in enumerate(all_quotes[:5], 1):
    print(f"  {i}. {q[:60]}...")
```

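When you do not know the page count in advance, an alternative to a fixed `range()` is to follow the page's "Next" link until it disappears. The sketch below runs on inline HTML so it needs no network access. The `li.next a` selector is an assumption about the page's markup and `find_next_url` is a hypothetical helper, so inspect the real page and adjust the selector before relying on it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed markup: <li class="next"><a href="...">Next</a></li>
    next_link = soup.select_one("li.next a")
    if next_link is None or not next_link.get("href"):
        return None
    # urljoin() resolves relative links such as "/page/2/" against the current URL.
    return urljoin(current_url, next_link["href"])


# Hypothetical pager fragments for illustration.
PAGE_WITH_NEXT = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
LAST_PAGE = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(find_next_url(PAGE_WITH_NEXT, "https://quotes.toscrape.com/"))
print(find_next_url(LAST_PAGE, "https://quotes.toscrape.com/page/10/"))
```

In a real scraper you would call this in a `while` loop, fetching each returned URL (with a delay between requests) until the function returns `None`.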
---

## 7. Saving Scraped Data

```python id="py26ex5"
import csv
import json

# Sample data standing in for scraped results
data = [
    {"title": "Article 1", "author": "Alice", "date": "2025-01-15"},
    {"title": "Article 2", "author": "Bob", "date": "2025-01-16"},
]

# Save to CSV
with open("scraped_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "date"])
    writer.writeheader()
    writer.writerows(data)

# Save to JSON
with open("scraped_data.json", "w") as f:
    json.dump(data, f, indent=2)

print("✅ Data saved to CSV and JSON!")
```

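Scraped text often contains non-ASCII characters such as curly quotes or accented names. A small sketch, assuming you want the files stored as UTF-8: pass `encoding="utf-8"` when opening them, and `ensure_ascii=False` to `json.dump()` so the characters stay readable instead of being escaped. The sample row, file names, and temporary directory are made up for this example.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical scraped row containing non-ASCII characters.
rows = [
    {"text": "“Curly quotes” and café", "author": "André"},
]

# Write into a temporary directory so the example leaves no files behind
# in your project folder.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "quotes.csv"
json_path = out_dir / "quotes.json"

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(rows)

with open(json_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps "é" as-is rather than writing "\u00e9".
    json.dump(rows, f, indent=2, ensure_ascii=False)

# Read both files back to confirm nothing was lost.
with open(csv_path, newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
with open(json_path, encoding="utf-8") as f:
    json_rows = json.load(f)

print(csv_rows == rows, json_rows == rows)
```

Without an explicit `encoding`, Python uses the platform default, which can differ between machines.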
---

## 8. Ethical Scraping Guidelines

1. **Check `robots.txt`**: the file at `website.com/robots.txt` lists which paths crawlers may fetch.
2. **Respect rate limits**: add delays between requests.
3. **Don't overload servers**: pause between requests with `time.sleep()`.
4. **Check terms of service**: some sites prohibit scraping.
5. **Use APIs when available**: they're the intended data access method.

```python id="py26_ex6"
import time

# Polite scraping with delays
for page in range(1, 5):
    # response = requests.get(url)
    time.sleep(2)  # Wait 2 seconds between requests
```

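The `robots.txt` check from guideline 1 can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` string so that it runs offline. The rules, the `example.com` URLs, and the `build_parser` helper are illustrative assumptions. Against a real site you would typically call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch() reports whether the given user agent may request the URL.
allowed_public = parser.can_fetch("*", "https://example.com/page/1/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

# crawl_delay() returns the requested delay in seconds, or None if unspecified.
delay = parser.crawl_delay("*")

print(allowed_public, allowed_private, delay)
```

If a crawl delay is specified, a polite scraper can pass it to `time.sleep()` in place of a hard-coded value.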
---

## 9. Mini Project: News Headline Scraper

```python id="py26project"
import requests
from bs4 import BeautifulSoup


def scrape_quotes_as_news():
    """Scrape the quotes site as a demo (reliable for learning)."""
    url = "https://quotes.toscrape.com/"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an error for 4xx/5xx responses
        soup = BeautifulSoup(response.text, "html.parser")

        print("=" * 50)
        print("  📰 HEADLINE SCRAPER (Demo)")
        print("=" * 50)

        quotes = soup.find_all("div", class_="quote")
        for i, quote_div in enumerate(quotes, 1):
            text = quote_div.find("span", class_="text").get_text()
            author = quote_div.find("small", class_="author").get_text()
            tags = [t.get_text() for t in quote_div.find_all("a", class_="tag")]
            print(f"\n  [{i}] {text[:70]}...")
            print(f"      Author: {author}")
            print(f"      Tags: {', '.join(tags[:3])}")

        print(f"\n  {'=' * 50}")
        print(f"  Total items scraped: {len(quotes)}")
    except requests.exceptions.RequestException as e:
        print(f"  ❌ Error: {e}")


scrape_quotes_as_news()
```

---

## 10. MCQs with Answers

Q1: BeautifulSoup is for:
  A) Making requests
  B) Parsing HTML
  C) Building websites
  D) Sending emails
Answer: B

Q2: soup.find_all() returns:
  A) First match
  B) List of matches
  C) Boolean
  D) String
Answer: B

Q3: robots.txt specifies:
  A) Passwords
  B) Scraping rules
  C) Server config
  D) CSS styles
Answer: B

Q4: soup.select() uses:
  A) XPath
  B) CSS selectors
  C) Regex
  D) JSON
Answer: B

Q5: To get the href attribute from the tag <a href="url">:
  A) tag.href
  B) tag["href"]
  C) tag.getattr()
  D) tag.href()
Answer: B

Q6: Why add delays between requests?
  A) Speed up
  B) Be polite to server
  C) Debug
  D) Save data
Answer: B

Q7: get_text() returns:
  A) HTML
  B) Plain text without tags
  C) Attributes
  D) Tag name
Answer: B

Q8: Which parser can BeautifulSoup use without installing anything extra?
  A) html.parser
  B) json.parser
  C) xml.parser
  D) text.parser
Answer: A. It is built into Python, so no extra install is needed.

Q9: Dynamic content (JavaScript) needs:
  A) BeautifulSoup
  B) Selenium or Playwright
  C) csv module
  D) regex
Answer: B

Q10: Ethical scraping includes:
  A) Ignoring robots.txt
  B) Overwhelming servers
  C) Respecting rate limits
  D) Scraping everything
Answer: C

---

## 11. Interview Questions

  1. What is web scraping? Automatically extracting data from websites by parsing their HTML.
  2. BeautifulSoup vs Scrapy? BeautifulSoup (BS4) is a simple parsing library for small tasks. Scrapy is a framework for large-scale scraping.
  3. How do you handle dynamic content? Use a browser automation tool such as Selenium or Playwright, which can drive a headless browser to render JavaScript-driven pages.
  4. Is web scraping legal? It depends on the site's terms of service, the jurisdiction, and the type of data. Always check robots.txt as well.
  5. How do you avoid getting blocked? Rotate user agents, use proxies, add delays, and respect robots.txt.

---

## 12. Summary

  • requests fetches web pages; BeautifulSoup parses HTML.
  • Use find(), find_all(), and select() to navigate HTML.
  • Save scraped data to CSV or JSON.
  • Always practice ethical scraping — check robots.txt, add delays.
  • Use Selenium for JavaScript-rendered content.

---

## 13. Next Chapter Recommendation

In Chapter 27: Python Automation Projects, you'll automate file management, emails, and more! 🚀
