Scrape Dynamic APIs Like a Pro with Selenium


At Syntera, we’re passionate about blending creativity with technology, especially when it comes to automation and data extraction. In this blog post, I’ll show you how to scrape data from dynamic APIs like a pro, using Selenium with Firefox, a simple Django Ninja API, and a basic web page. You’ll learn how to automate interactions on a web page, capture data from an API, and save it in a structured format—all without manually parsing messy HTML.

Use Case: Fetching Mountain Passes Data from a Local API

Imagine you have a web application that provides information about Swiss mountain passes. The app features a button labeled “Load Passes”, which triggers an AJAX request to a /api/mountain-passes endpoint to fetch the data. Your task is to:

  1. Automate interaction with the web page to click the “Load Passes” button.
  2. Capture the returned data from the dynamic API request.
  3. Save the data in a structured format (JSON).

This post will guide you through setting up Django Ninja as a lightweight API for serving the mountain pass data, creating a sample web page, and then automating the data extraction process with Selenium.

Step 1: Setting Up the Django Ninja API

First, let’s create a simple API using Django Ninja, a fast, modern framework for building APIs on top of Django with Python type hints and minimal code. For this use case, we’ll set up a single endpoint that returns mountain pass information.

You can refer to the GitHub repository for the complete code setup for the Django Ninja API, including the models and endpoints. Here’s a quick overview:

  1. Create a Django project and app named mountains.
  2. Set up Django Ninja in the app and define an API endpoint /api/mountain-passes that returns a list of mountain passes.
  3. Run the Django server locally on http://127.0.0.1:8000.

Now, the Django Ninja API will serve the mountain pass data in JSON format when the /api/mountain-passes endpoint is accessed.
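As a rough sketch of what that endpoint could look like (the schema name, field names, and sample values here are my assumptions, not necessarily the repository’s exact code), the Ninja side might be as small as this:

```python
# api.py -- a minimal sketch; schema fields and sample data are assumed
from ninja import NinjaAPI, Schema

api = NinjaAPI()

class MountainPassOut(Schema):
    name: str
    elevation_meters: int
    location: str

@api.get("/mountain-passes", response=list[MountainPassOut])
def list_mountain_passes(request):
    # The real project would query a Django model here;
    # hard-coded records keep the sketch self-contained
    return [
        {"name": "Gotthard Pass", "elevation_meters": 2106, "location": "Uri/Ticino"},
        {"name": "Furka Pass", "elevation_meters": 2429, "location": "Uri/Valais"},
    ]
```

The `api` object is then wired into the project’s urls.py (for example with `path("api/", api.urls)`), so the endpoint is reachable at /api/mountain-passes.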

Step 2: Creating a Sample Web Page with a “Load Passes” Button

We need a simple web page that contains a button to trigger the AJAX request to our API. Here’s the HTML code for the sample web page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Mountain Passes</title>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
</head>
<body>
    <h1>Swiss Mountain Passes</h1>
    <button id="load-passes">Load Mountain Passes</button>
    <ul id="passes-list"></ul>

    <script>
        $('#load-passes').on('click', function() {
            $.ajax({
                url: '/api/mountain-passes',
                method: 'GET',
                success: function(data) {
                    $('#passes-list').empty();
                    data.forEach(function(pass) {
                        $('#passes-list').append('<li>' + pass.name + ' - ' + pass.elevation_meters + ' meters (' + pass.location + ')</li>');
                    });
                },
                error: function(error) {
                    console.log("Error fetching data:", error);
                }
            });
        });
    </script>
</body>
</html>

Explanation of the Sample Web Page:

  1. HTML Structure: The page contains a title, a button with the ID load-passes, and an unordered list to display the mountain pass data.
  2. AJAX Request: When the button is clicked, a jQuery AJAX request is made to the /api/mountain-passes endpoint. If the request is successful, the data is displayed in the list. In case of an error, it’s logged to the console.

To serve this web page, place the file in the static directory of your Django project and configure your Django settings to serve static files.
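One simple way to serve the page at the /mountain-passes/ URL the Selenium script uses is a TemplateView route (this wiring is an assumption; the repository may do it differently, and the template name here is hypothetical):

```python
# urls.py -- one possible route for the sample page
# (an assumption; adjust template_name to where you placed the HTML file)
from django.urls import path
from django.views.generic import TemplateView

urlpatterns = [
    path(
        "mountain-passes/",
        TemplateView.as_view(template_name="mountain_passes.html"),
    ),
]
```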

Step 3: Setting Up Selenium with Firefox

Now, let’s set up Selenium to automate the interaction with our local web page. We’ll use Firefox with GeckoDriver, and the dotenv library will help us load environment variables for configuration.

Installing Dependencies

Make sure you have Selenium, dotenv, and GeckoDriver installed. Here’s how to get started (I’m doing it on a Mac):

pip install selenium python-dotenv
brew install geckodriver
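The script in Step 4 reads the driver path from a .env file. On an Apple Silicon Mac, Homebrew typically installs geckodriver under /opt/homebrew/bin, but that path is an assumption; run `which geckodriver` to confirm yours:

```shell
# .env -- adjust to your machine; `which geckodriver` shows the real path
GECKO_DRIVER_PATH=/opt/homebrew/bin/geckodriver
```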

Step 4: Writing the Selenium Script

We’ll write a Selenium script to automate the following tasks:

  1. Navigate to the web page: http://127.0.0.1:8000/mountain-passes/
  2. Find the “Load Passes” button and click it to trigger the AJAX request.
  3. Capture the data returned from the /api/mountain-passes endpoint.
  4. Print and save the data in JSON format.

Here’s the complete script:

import os
import time
import json
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Load environment variables from the .env file
load_dotenv()

# Access the paths from the environment variables
GECKO_DRIVER_PATH = os.getenv('GECKO_DRIVER_PATH')

# Set up Firefox WebDriver
service = FirefoxService(executable_path=GECKO_DRIVER_PATH)
driver = webdriver.Firefox(service=service)

# Navigate to the web page
print("Navigating to the web page...")
driver.get("http://127.0.0.1:8000/mountain-passes/")

# Find the load button and click it
try:
    load_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, "load-passes"))
    )
    print("Load button is visible and clickable.")

    # Scroll to the button using JavaScript
    driver.execute_script("arguments[0].scrollIntoView();", load_button)
    print("Scrolled to the load button.")

    # Click the button using JavaScript
    driver.execute_script("arguments[0].click();", load_button)
    print("Load button clicked via JavaScript.")
except Exception as e:
    print(f"Error finding or clicking the button: {e}")

# Wait until the AJAX request has completed and the list is populated
# (an explicit wait is more reliable than a fixed sleep)
WebDriverWait(driver, 10).until(
    lambda d: d.find_elements(By.CSS_SELECTOR, "#passes-list li")
)

# Re-request the API from the browser context; with a W3C-compliant driver
# like geckodriver, execute_script awaits a returned Promise and hands
# back the parsed JSON
mountain_passes_data = driver.execute_script("""
    return fetch('/api/mountain-passes')
        .then(response => response.json());
""")

# Close the browser
driver.quit()

# Print the captured data as JSON
print("Mountain Passes Data (JSON Format):")
print(json.dumps(mountain_passes_data, indent=4))

# Save the data to a JSON file
with open('mountain_passes_data.json', 'w') as json_file:
    json.dump(mountain_passes_data, json_file, indent=4)

print("Mountain passes data has been saved to 'mountain_passes_data.json'.")

Step 5: Why This Approach? The Advantages of Scraping Dynamic APIs Over Traditional HTML Scraping

Scraping dynamic APIs is often more effective than parsing static HTML, especially for modern websites where content is loaded asynchronously. Here’s why:

  • Clean Data Extraction: APIs return structured data (e.g., JSON) directly, eliminating the need for parsing complex HTML.
  • Handling Dynamic Content: This approach easily captures data from web pages that load content with JavaScript.
  • Reduced Risk of Breakage: API endpoints are usually more stable than the website’s HTML structure.
  • Automating User Actions: Selenium enables automating interactions such as button clicks, which plain HTTP-based scraping can’t do.
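To make the “clean data extraction” point concrete: a single json.loads call on an API response yields ready-to-use records, with no parser and no brittle CSS selectors. The payload below is a made-up sample of what /api/mountain-passes might return:

```python
import json

# A made-up sample of what /api/mountain-passes might return
api_response = '[{"name": "Furka Pass", "elevation_meters": 2429, "location": "Uri/Valais"}]'

# One call gives structured records -- no HTML parsing required
passes = json.loads(api_response)
print(passes[0]["name"], passes[0]["elevation_meters"])
```

Extracting the same two values from rendered HTML would require locating the right `<li>` elements and splitting their text, which breaks as soon as the page layout changes.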

Conclusion

I’ve demonstrated how to scrape data from a dynamic API using Selenium, Django Ninja, and a basic sample web page. This method streamlines the data extraction process by leveraging API endpoints instead of parsing static HTML, making it ideal for modern web applications.

Explore the GitHub repository for the complete code setup, and feel free to reach out with questions or ideas for future projects!

