David's Blog

Web Scraping with Beautiful Soup in Python

By David Li on 2024-02-10T08:30:57.000Z

Web Scraping with Beautiful Soup in Python

Web scraping is a technique for extracting data from websites. It involves making HTTP requests to web pages and parsing the HTML content to retrieve the desired information. Beautiful Soup is a popular and easy-to-use Python library for web scraping. In this article, we will look at how to use Beautiful Soup for web scraping in Python.

Prerequisites

Before we begin, ensure that you have the following installed on your computer:

  1. Python 3
  2. Beautiful Soup 4
  3. Requests library

You can install Beautiful Soup and Requests using pip:

pip install beautifulsoup4 requests

Getting Started with Beautiful Soup

First, let’s import the necessary libraries:

import requests
from bs4 import BeautifulSoup

Next, we need to make an HTTP request to the target webpage and store the content in a variable. In this example, we will scrape the Wikipedia page for Python programming language.

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)

## Check if the request was successful
if response.status_code == 200:
    page_content = response.text
else:
    print(f"Error {response.status_code}: Unable to fetch the webpage.")

Now that we have the HTML content, we can create a Beautiful Soup object and parse the HTML:

soup = BeautifulSoup(page_content, "html.parser")

Extracting Data with Beautiful Soup

Beautiful Soup provides several methods to search and navigate the HTML tree. Some of the commonly used methods are:

  1. find(): Searches for the first occurrence of a tag that matches the given criteria.
  2. find_all(): Returns a list of all tags that match the given criteria.
  3. select(): Searches for tags that match the given CSS selector.

Let’s use these methods to extract some information from the Wikipedia page.

Extracting the page title

title = soup.find("title").text
print(f"Page title: {title}")

Extracting all headings

headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
for heading in headings:
    print(heading.text.strip())

Extracting the table of contents

toc = soup.find("div", {"id": "toc"})
toc_items = toc.find_all("li")
for item in toc_items:
    print(item.text.strip())

Putting It All Together

Here’s the complete code for our example:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)

if response.status_code == 200:
    page_content = response.text
else:
    print(f"Error {response.status_code}: Unable to fetch the webpage.")

soup = BeautifulSoup(page_content, "html.parser")

title = soup.find("title").text
print(f"Page title: {title}\n")

print("Headings:")
headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
for heading in headings:
    print(heading.text.strip())

print("\nTable of contents:")
toc = soup.find("div", {"id": "toc"})
toc_items = toc.find_all("li")
for item in toc_items:
    print(item.text.strip())

Conclusion

In this article, we learned how to use Beautiful Soup for web scraping in Python. Beautiful Soup is a powerful and flexible library that makes it easy to extract data from websites. With a few lines of code, you can quickly retrieve the information you need from a webpage.

Remember that web scraping may be subject to the terms of service of the websites you are scraping, as well as legal and ethical considerations. Always respect the website’s robots.txt file, and avoid excessive requests that may cause a burden on the server.

© Copyright 2024 by FriendlyUsers Tech Blog. Built with ♥ by FriendlyUser. Last updated on 2024-12-02.