CHAPTER 26 Beginner

Web Scraping with Python

Updated: May 17, 2026

25 min read

# Web Scraping with Python

Welcome to Chapter 26! Web scraping lets you extract data from websites automatically — collecting prices, news headlines, research data, and more.

---

## 1. Learning Objectives

- Understand web scraping concepts and ethics.
- Use `requests` and `BeautifulSoup` to scrape HTML.
- Navigate and extract elements from HTML.
- Handle pagination and dynamic content.
- Build a news headline scraper.

---

## 2. Setup

```bash
pip install requests beautifulsoup4
```

---

## 3. Basic Web Scraping

```python id="py26ex1"
import requests
from bs4 import BeautifulSoup

# Fetch the web page
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title
print(f"Page Title: {soup.title.string}")

# Find all quotes and print the first five
quotes = soup.find_all("span", class_="text")
for i, quote in enumerate(quotes[:5], 1):
    print(f"  {i}. {quote.get_text()}")
```

---

## 4. Navigating HTML

```python id="py26ex2"
from bs4 import BeautifulSoup

html = """
<html>
<body>
  <div class="container">
    <h1 id="title">My Page</h1>
    <ul class="items">
      <li class="item">Apple</li>
      <li class="item">Banana</li>
      <li class="item">Cherry</li>
    </ul>
    <a href="https://python.org">Python</a>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find by tag
print(soup.find("h1").text)  # My Page

# Find by ID
print(soup.find(id="title").text)  # My Page

# Find by class
items = soup.find_all("li", class_="item")
for item in items:
    print(f"  - {item.text}")

# Find a link, then read its href attribute and its text
link = soup.find("a")
print(f"URL: {link['href']}")  # URL: https://python.org
print(f"Text: {link.text}")  # Text: Python

# CSS selectors
items = soup.select("ul.items li.item")
print([item.text for item in items])  # ['Apple', 'Banana', 'Cherry']
```
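The lookups above all succeed because the sample HTML contains every element they ask for. On real pages, `find()` and `select_one()` return `None` when nothing matches, and `tag["attr"]` raises `KeyError` when the attribute is missing. The sketch below shows one way to guard against both, using a trimmed copy of the sample HTML; the `footer` element and the `title` attribute are deliberately absent to show the failure case.

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1 id="title">My Page</h1>
  <a href="https://python.org">Python</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first CSS match, or None if nothing matches
heading = soup.select_one("div.container h1#title")
heading_text = heading.get_text(strip=True) if heading else None

# This element does not exist in the sample, so find() returns None
footer = soup.find("footer")
footer_text = footer.get_text(strip=True) if footer else "(no footer)"

# tag.get() returns None (or the given default) instead of raising KeyError
link = soup.find("a")
href = link.get("href")
tooltip = link.get("title", "(no title attribute)")

print(heading_text)
print(footer_text)
print(href)
print(tooltip)
```

Calling `.text` or `.get_text()` directly on a `None` result raises `AttributeError`, which is the most common crash in a first scraper.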

---

## 5. Extracting Data

```python id="py26ex3"
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract quotes with author and tags
quote_divs = soup.find_all("div", class_="quote")

for div in quote_divs[:3]:
    text = div.find("span", class_="text").get_text()
    author = div.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in div.find_all("a", class_="tag")]
    print(f'  "{text[:50]}..."')
    print(f"  - {author}")
    print(f"  Tags: {', '.join(tags)}\n")
```
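Printing is fine for a first look, but saving or filtering the results is easier with structured records. The sketch below wraps the same lookups in a function that returns a list of dictionaries. It runs on an inline HTML sample that imitates the markup those lookups expect (`div.quote`, `span.text`, `small.author`, `a.tag`); the sample content and the `parse_quotes` name are illustrative, not taken from the site.

```python
from bs4 import BeautifulSoup

# Made-up sample that mirrors the structure the selectors rely on
sample_html = """
<div class="quote">
  <span class="text">First sample quote.</span>
  <small class="author">Author One</small>
  <a class="tag" href="/tag/sample/">sample</a>
  <a class="tag" href="/tag/demo/">demo</a>
</div>
<div class="quote">
  <span class="text">Second sample quote.</span>
  <small class="author">Author Two</small>
</div>
"""


def parse_quotes(html):
    """Return one dict per quote block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for div in soup.find_all("div", class_="quote"):
        records.append({
            "text": div.find("span", class_="text").get_text(strip=True),
            "author": div.find("small", class_="author").get_text(strip=True),
            # find_all() returns an empty list when a quote has no tags
            "tags": [a.get_text(strip=True) for a in div.find_all("a", class_="tag")],
        })
    return records


records = parse_quotes(sample_html)
for record in records:
    print(record)
```

Keeping parsing in its own function also means it can be tested against saved HTML without making a network request.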

---

## 6. Scraping Multiple Pages

```python id="py26ex4"
import requests
from bs4 import BeautifulSoup

all_quotes = []
base_url = "https://quotes.toscrape.com/page/{}/"

for page in range(1, 4):  # First 3 pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = soup.find_all("span", class_="text")
    for q in quotes:
        all_quotes.append(q.get_text())

print(f"Total quotes collected: {len(all_quotes)}")
for i, q in enumerate(all_quotes[:5], 1):
    print(f"  {i}. {q[:60]}...")
```
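The loop above hard-codes three pages. When the page count is unknown, one common pattern is to keep requesting pages until one comes back empty, pausing between requests. The sketch below illustrates that control flow only: `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a function that would download and parse a single page.

```python
import time


def collect_pages(fetch_page, max_pages=50, delay=1.0):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no items.

    fetch_page is any callable that takes a page number and returns a list.
    max_pages is a safety cap so a misbehaving site cannot cause an endless loop.
    """
    items = []
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        if not page_items:
            break  # An empty page means we ran past the last one
        items.extend(page_items)
        time.sleep(delay)  # Pause between requests to stay polite
    return items


# Stand-in for a real fetcher: pretends the site has exactly 3 pages
def fake_fetch(page):
    if page > 3:
        return []
    return [f"quote {page}-{n}" for n in (1, 2)]


all_quotes = collect_pages(fake_fetch, delay=0)
print(len(all_quotes))  # 6
print(all_quotes[0])    # quote 1-1
```

In a real scraper, `fetch_page` would wrap the `requests.get()` and `find_all()` calls from the loop above, and `delay` would stay at a second or more.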

---

## 7. Saving Scraped Data

```python id="py26ex5"
import csv
import json

# Sample records standing in for scraped results
data = [
    {"title": "Article 1", "author": "Alice", "date": "2025-01-15"},
    {"title": "Article 2", "author": "Bob", "date": "2025-01-16"},
]

# Save to CSV
with open("scraped_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "date"])
    writer.writeheader()
    writer.writerows(data)

# Save to JSON
with open("scraped_data.json", "w") as f:
    json.dump(data, f, indent=2)

print("✅ Data saved to CSV and JSON!")
```
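A quick way to confirm a save worked is to read the files back. The sketch below writes sample records to a temporary directory and reloads them; the records, including the numeric `views` field, are illustrative. It also shows a difference worth knowing when choosing a format: `csv.DictReader` returns every value as a string, while JSON preserves numbers and lists.

```python
import csv
import json
import tempfile
from pathlib import Path

records = [
    {"title": "Article 1", "author": "Alice", "views": 120},
    {"title": "Article 2", "author": "Bob", "views": 95},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "scraped_data.csv"
    json_path = Path(tmp) / "scraped_data.json"

    # Write both formats; utf-8 keeps non-ASCII text such as curly quotes intact
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "author", "views"])
        writer.writeheader()
        writer.writerows(records)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    # Read both files back before the temporary directory is removed
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_rows = list(csv.DictReader(f))
    with open(json_path, encoding="utf-8") as f:
        json_rows = json.load(f)

print(type(csv_rows[0]["views"]).__name__)   # str
print(type(json_rows[0]["views"]).__name__)  # int
```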

---

## 8. Ethical Scraping Guidelines

1. **Check `robots.txt`**: the file at `website.com/robots.txt` lists which paths crawlers may and may not fetch.
2. **Respect rate limits**: add delays between requests.
3. **Don't overload servers**: pause with `time.sleep()` and keep the number of requests low.
4. **Check the terms of service**: some sites prohibit scraping.
5. **Use APIs when available**: they are the intended way to access the data.

```python id="py26_ex6"
import time

# Polite scraping with delays
for page in range(1, 5):
    # response = requests.get(url)
    time.sleep(2)  # Wait 2 seconds between requests
```
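The first guideline, checking `robots.txt`, can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` from a string so that it runs offline; the rules, the `my-scraper` user agent, and the `example.com` URLs are all illustrative. Against a live site you would call `set_url()` and `read()` to download the real file instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "my-scraper"  # hypothetical user agent name

# can_fetch() answers: may this user agent request this URL?
allowed = parser.can_fetch(user_agent, "https://example.com/quotes/page/1/")
blocked = parser.can_fetch(user_agent, "https://example.com/private/data.html")

# crawl_delay() returns the requested pause in seconds, or None if unspecified
delay = parser.crawl_delay(user_agent)

print(allowed)  # True
print(blocked)  # False
print(delay)    # 2
```

When a site specifies a crawl delay, that value is a sensible choice for the `time.sleep()` call between requests.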

---

## 9. Mini Project: News Headline Scraper

```python id="py26project"
import requests
from bs4 import BeautifulSoup


def scrape_quotes_as_news():
    """Scrape the quotes site as a demo (reliable for learning)."""
    url = "https://quotes.toscrape.com/"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise on 4xx/5xx responses
        soup = BeautifulSoup(response.text, "html.parser")

        print("=" * 50)
        print("  📰 HEADLINE SCRAPER (Demo)")
        print("=" * 50)

        quotes = soup.find_all("div", class_="quote")
        for i, quote_div in enumerate(quotes, 1):
            text = quote_div.find("span", class_="text").get_text()
            author = quote_div.find("small", class_="author").get_text()
            tags = [t.get_text() for t in quote_div.find_all("a", class_="tag")]
            print(f"\n  [{i}] {text[:70]}...")
            print(f"      Author: {author}")
            print(f"      Tags: {', '.join(tags[:3])}")

        print(f"\n  {'=' * 50}")
        print(f"  Total items scraped: {len(quotes)}")
    except requests.exceptions.RequestException as e:
        print(f"  ❌ Error: {e}")


scrape_quotes_as_news()
```


---

## 10. MCQs with Answers

Q1: BeautifulSoup is for:
A) Making requests  B) Parsing HTML  C) Building websites  D) Sending emails
Answer: B

Q2: `soup.find_all()` returns:
A) First match  B) List of matches  C) Boolean  D) String
Answer: B

Q3: `robots.txt` specifies:
A) Passwords  B) Scraping rules  C) Server config  D) CSS styles
Answer: B

Q4: `soup.select()` uses:
A) XPath  B) CSS selectors  C) Regex  D) JSON
Answer: B

Q5: To get the `href` attribute from the tag `<a href="url">`:
A) `tag.href`  B) `tag["href"]`  C) `tag.getattr()`  D) `tag.href()`
Answer: B

Q6: Why add delays between requests?
A) Speed up  B) Be polite to server  C) Debug  D) Save data
Answer: B

Q7: `get_text()` returns:
A) HTML  B) Plain text without tags  C) Attributes  D) Tag name
Answer: B

Q8: Parser that BeautifulSoup can use with no extra install:
A) `html.parser`  B) `json.parser`  C) `xml.parser`  D) `text.parser`
Answer: A (built into Python, no extra install needed)

Q9: Dynamic content (JavaScript) needs:
A) BeautifulSoup  B) Selenium or Playwright  C) csv module  D) regex
Answer: B

Q10: Ethical scraping includes:
A) Ignoring robots.txt  B) Overwhelming servers  C) Respecting rate limits  D) Scraping everything
Answer: C

---

## 11. Interview Questions

1. **What is web scraping?** Automatically extracting data from websites by parsing HTML.
2. **BeautifulSoup vs Scrapy?** BeautifulSoup (BS4) is a simple parsing library suited to small tasks. Scrapy is a full framework for large-scale scraping.
3. **How do you handle dynamic content?** Use Selenium, Playwright, or another headless browser for JavaScript-rendered pages.
4. **Is web scraping legal?** It depends on the site's terms of service, the jurisdiction, and the type of data. Always check `robots.txt`.
5. **How do you avoid getting blocked?** Rotate user agents, use proxies, add delays, and respect `robots.txt`.

---

## 12. Summary

- `requests` fetches web pages; `BeautifulSoup` parses HTML.
- Use `find()`, `find_all()`, and `select()` to navigate HTML.
- Save scraped data to CSV or JSON.
- Always practice ethical scraping: check `robots.txt` and add delays.
- Use Selenium for JavaScript-rendered content.

---

## 13. Next Chapter Recommendation

In Chapter 27: Python Automation Projects, you'll automate file management, emails, and more! 🚀
