The `flipp_flyer_parser` Python script is a sophisticated web scraping tool designed to extract promotional flyer data from retail websites. Authored by FriendlyUser, the script leverages Selenium, a powerful browser-automation tool, to navigate through web pages and extract relevant data. It focuses on three major Canadian retailers: Save-On-Foods, Walmart, and Superstore.
The script relies on the following libraries:

- `undetected_chromedriver`: Used for controlling a Chrome browser. This driver is essential for navigating through the web pages and interacting with web elements.
- `dateutil.parser`: Utilized for parsing date strings.
- `re`: Employed for text pattern matching and data extraction from descriptions.
- `PIL`: The Python Imaging Library can be used for handling images, though its specific usage isn't clear from the provided script.
- `argparse`: Facilitates command-line argument parsing, allowing users to specify the store type.
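Pulled together, the script's import section probably resembles the sketch below; the aliases are assumptions, not taken from the actual source:

```python
import argparse
import re

import undetected_chromedriver as uc
from dateutil import parser as date_parser
from PIL import Image
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
```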
Its key functions and structures include:

- `make_driver()`: Creates a Chrome WebDriver instance with optional headless browsing.
- `selenium_setup_saveon()`, `selenium_setup_walmart()`, `setup_superstore()`: Initialize the WebDriver and navigate to the respective store's flyer page.
- `parse_flipp_aside(driver, cfg)`: Extracts detailed information from a specific part of the webpage (the flipp aside iframe), retrieving data such as start and end dates, product descriptions, sizes, and quantities.
- `scrap_flyer(driver, cfg)`: Orchestrates the scraping process: navigating through iframes, handling cookies, extracting HTML content, and iterating over flyer images to gather product data.
- `swap_to_iframe(driver, iframe_class)`: Aids in switching between different iframes within a webpage.
- An `Enum` of store types (SAVEON, WALMART, SUPERSTORE) for easier management.
- Error handling, used notably in the `parse_flipp_aside` and `scrap_flyer` functions, to manage exceptions like `NoSuchElementException`.
## Diving Deeper into flipp_flyer_parser
We will now delve into the more complex parts of the `flipp_flyer_parser` script, breaking down key functions and processes step by step.

The script begins by setting up the Selenium WebDriver, which is crucial for browser automation.
### make_driver Function

This function creates the WebDriver instance using `undetected_chromedriver`. Its defaults are `headless=False` (meaning the browser UI is visible during scraping) and `use_subprocess=False`.
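A minimal sketch of such a factory, assuming the defaults described above (`uc.Chrome` does accept both keyword arguments):

```python
import undetected_chromedriver as uc

def make_driver(headless: bool = False) -> uc.Chrome:
    # headless=False keeps the browser window visible while scraping,
    # which makes debugging dynamic flyer pages much easier.
    return uc.Chrome(headless=headless, use_subprocess=False)
```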
### selenium_setup_walmart, selenium_setup_saveon, and setup_superstore Functions

These functions are tailored to each retail website, navigating to the respective flyer pages. Each one creates its driver via `make_driver()`. In `setup_superstore`, additional cookie manipulation is performed to mimic a user's browser settings.
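As an illustration, the Superstore setup might look roughly like this; the URL and cookie name are placeholders, not values from the actual script:

```python
def setup_superstore(headless: bool = False):
    driver = make_driver(headless=headless)
    # Placeholder URL for the store's flyer page.
    driver.get("https://www.realcanadiansuperstore.ca/flyer")
    # Cookies can only be added for the currently loaded domain,
    # hence the get() call first; the cookie itself is illustrative.
    driver.add_cookie({"name": "preferred_language", "value": "en"})
    driver.refresh()  # reload so the site picks up the new cookie
    return driver
```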
### parse_flipp_aside Function

This function extracts detailed information from a part of the webpage, typically an iframe. It works in two steps (a code sketch follows the list):
- **Switching to the relevant iframe:** Calls `swap_to_iframe` with the class name of the iframe to be accessed.
- **Extracting information:** Finds elements by tag or class name (e.g., validity dates, descriptions). Regular expressions parse sizes, quantities, and product types out of the product description, and exception handling manages elements that might not be present.
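The iframe hop and the regex-based extraction could be sketched as follows; the class names and patterns here are assumptions for illustration:

```python
import re

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def swap_to_iframe(driver, iframe_class: str) -> None:
    # Return to the top-level document before descending into the iframe.
    driver.switch_to.default_content()
    iframe = driver.find_element(By.CLASS_NAME, iframe_class)
    driver.switch_to.frame(iframe)

def parse_flipp_aside(driver, cfg: dict) -> dict:
    swap_to_iframe(driver, cfg.get("aside_iframe_class", "flipp-aside"))
    details = {}
    try:
        description = driver.find_element(By.CLASS_NAME, "description").text
        details["description"] = description
        # e.g. "Coca-Cola, 12 x 355 mL" -> size "355 mL"
        size = re.search(r"\d+(?:\.\d+)?\s*(?:mL|L|g|kg|lb)", description)
        if size:
            details["size"] = size.group(0)
    except NoSuchElementException:
        pass  # not every item exposes a description in the aside
    return details
```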
### scrap_flyer Function

This function orchestrates the overall scraping process:
- **Initial setup:** Waits for the main element of the page to become visible, and handles exceptions by saving the page source to a file for debugging.
- **Handling cookies and HTML content:** Retrieves and saves cookies to a JSON file, and saves the HTML content of the page for further processing.
- **Navigating through flyer images:** Iterates over elements containing flyer images. For each image, it iterates over the associated buttons that likely contain product information, executing a script to ensure elements are visible before interacting with them (clicking buttons).
- **Extracting product data:** Each button's label is parsed for product data, with regular expressions extracting pricing information. `parse_flipp_aside` is then called to gather additional details from the aside section, and all extracted data is aggregated into a dictionary and appended to a list.
- **Final steps:** The data list is saved to a JSON file, and a maximum item count is enforced to prevent excessive scraping. A condensed sketch of this flow appears below.
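Condensed into code, the flow might look like this; the selectors, file names, and `MAX_ITEMS` cap are assumptions rather than the script's actual values:

```python
import json
import re

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

MAX_ITEMS = 100  # cap to prevent excessive scraping

def scrap_flyer(driver, cfg: dict) -> list:
    try:
        WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.TAG_NAME, "main"))
        )
    except TimeoutException:
        # Save the page source so the failure can be inspected offline.
        with open("debug_page.html", "w", encoding="utf-8") as f:
            f.write(driver.page_source)
        raise

    # Persist cookies and the page HTML for further processing.
    with open("cookies.json", "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f, indent=2)
    with open("flyer_page.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)

    items = []
    # Selector is a guess at the flyer-image buttons described above.
    buttons = driver.find_elements(By.CSS_SELECTOR, "sfml-flyer-image button")
    for button in buttons[:MAX_ITEMS]:
        # Bring the button into view, then click to open its aside panel.
        driver.execute_script("arguments[0].scrollIntoView(true);", button)
        button.click()
        label = button.get_attribute("aria-label") or ""
        price = re.search(r"\$\d+(?:\.\d{2})?", label)
        item = {"label": label, "price": price.group(0) if price else None}
        item.update(parse_flipp_aside(driver, cfg))
        items.append(item)

    with open("flyer_data.json", "w", encoding="utf-8") as f:
        json.dump(items, f, indent=2)
    return items
```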
The script uses an argument parser so the user can specify the store type on the command line. Based on the store type provided, the corresponding scraping function is called; this modular approach makes it easy to extend or modify the script for other stores.
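The entry point might resemble the following sketch, assuming the setup functions discussed above are in scope; the flag name and dispatch table are assumptions, though the store values mirror the Enum described earlier:

```python
import argparse
from enum import Enum

class StoreType(Enum):
    SAVEON = "saveon"
    WALMART = "walmart"
    SUPERSTORE = "superstore"

if __name__ == "__main__":
    cli = argparse.ArgumentParser(description="Scrape retail flyer data.")
    cli.add_argument("--store", choices=[s.value for s in StoreType],
                     default=StoreType.SAVEON.value)
    args = cli.parse_args()

    # Dispatch to the matching setup function, then scrape.
    setup = {
        StoreType.SAVEON.value: selenium_setup_saveon,
        StoreType.WALMART.value: selenium_setup_walmart,
        StoreType.SUPERSTORE.value: setup_superstore,
    }[args.store]
    scrap_flyer(setup(), cfg={})
```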
The `flipp_flyer_parser` script is a comprehensive example of advanced web scraping with Python and Selenium. It demonstrates handling dynamic web content, navigating complex webpage structures, and extracting structured data from unstructured HTML. Its use of regular expressions and strategic error handling makes the parsing efficient and resilient against common scraping challenges, and the script serves as an excellent template for similar tasks involving dynamic, interactive web content.