How to Capture Network Traffic When Scraping with Selenium & Python
I'll show you how to capture the network traffic occurring on a page using Python, Selenium (with chromedriver), and the logging features of Chrome. Capturing network traffic can be useful for any number of things when dealing with dynamic webpages, like grabbing websocket communications or data from a website's private APIs.
If you need an introduction to why "dynamic" webpages are harder to scrape, see my explanatory post here.
I recently had a webscraping project for which I did not need any information from the response page itself. Instead, I needed to know what was being transmitted over the websocket the page had opened. I knew the data was there because I had found it while inspecting the network traffic in the 'Network' tab of the developer tools: the specific information I needed wasn't being rendered in the response, but it was being communicated through the websocket.
The first step of this process was to capture the network traffic of the page, which this post will show you how to do.
The final script can be found at this gist.
Requirements
In this example, I'm using the following:
- python 3.8
- selenium 3.141
- chromedriver 81 (which requires a compatible version of Chrome installed)
The versions of Chrome and chromedriver are important. There was a change in the logging feature around version 75 to adapt for W3C compliance. If you're stuck below version 75 of Chrome/chromedriver, you'll want to use loggingPrefs instead of goog:loggingPrefs in the first code snippet below.
This project was done on a Windows machine, but it should work the same on other platforms if you adjust the chromedriver.exe file name.
How to log the network traffic occurring on a page
First, set up the driver to do "performance logging" by adjusting its desired_capabilities. Performance logging is not on by default, but can be used to get "Timeline", "Network", and "Page" events. (ref.) We're interested in the network events in this example.
Here I'm assuming the chromedriver executable is in the same folder as the script and is called chromedriver.exe.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.CHROME
# capabilities["loggingPrefs"] = {"performance": "ALL"}  # chromedriver < ~75
capabilities["goog:loggingPrefs"] = {"performance": "ALL"}  # chromedriver 75+

driver = webdriver.Chrome(
    r"chromedriver.exe",
    desired_capabilities=capabilities,
)
Next, visit a website, then take a look at the logs using the following:
driver.get("https://www.rkengler.com")
logs = driver.get_log("performance")
print(logs)
Processing the logs for network events
Most webpages will have at least a few network events occurring. Some busier pages will have a lot. Each log entry will contain a JSON message in this form:
{
"webview": <originating WebView ID>,
"message": { "method": "...", "params": { ... }}
}
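Note that each raw entry returned by driver.get_log("performance") is a dictionary whose "message" value is still a JSON string, so it has to be decoded before you can get at the method and params. Here's a quick sketch of unpacking a single entry (assuming the logs variable captured above isn't empty):
import json

# Each raw entry is a dict (with keys like "message" and "timestamp") whose
# "message" value is a JSON string of the form shown above.
first_entry = logs[0]
message = json.loads(first_entry["message"])["message"]
print(message["method"])          # e.g. "Network.requestWillBeSent"
print(message.get("params", {}))  # event-specific parameters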
This example is concerned with network events. The network event method values (inside the message) start with Network.response, Network.request, or Network.webSocket.
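For reference, the concrete method names come from the Chrome DevTools Protocol. The exact events you'll see depend on the page and your Chrome version, so treat the list below as illustrative; the tuple name is just for this sketch and isn't used later in the script:
# Some Network.* methods you're likely to see in the performance log.
# Method names come from the Chrome DevTools Protocol; the exact set depends
# on the page and the Chrome version, so treat this list as illustrative.
NETWORK_METHODS_OF_INTEREST = (
    "Network.requestWillBeSent",       # an HTTP(S) request is about to go out
    "Network.responseReceived",        # response headers have arrived
    "Network.webSocketCreated",        # a websocket connection was opened
    "Network.webSocketFrameSent",      # a frame was sent over the websocket
    "Network.webSocketFrameReceived",  # a frame was received over the websocket
)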
A generator function is a convenient way to process the logs and yield only the relevant network-related entries. Let's make one:
import json
def process_browser_logs_for_network_events(logs):
    for entry in logs:
        log = json.loads(entry["message"])["message"]
        if (
            "Network.response" in log["method"]
            or "Network.request" in log["method"]
            or "Network.webSocket" in log["method"]
        ):
            yield log
Finally, for the sake of this example, we'll write the filtered events out to a text file to see what was happening, rather than processing them with more code:
import pprint

logs = driver.get_log("performance")
events = process_browser_logs_for_network_events(logs)

with open("log_entries.txt", "wt") as out:
    for event in events:
        pprint.pprint(event, stream=out)
Take note: calling driver.get_log() retrieves and clears the buffered log entries, so calling get_log() a second time without any new traffic won't yield anything.
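Because the log is cleared on each call, one way to watch traffic over a longer window (for example, websocket frames that keep arriving) is to poll get_log() in a loop and accumulate the entries yourself. Here's a rough sketch, assuming the driver and generator from above; the 60-second window and 5-second interval are arbitrary values for illustration:
import time

# Rough sketch: poll the (self-clearing) performance log and accumulate events.
all_events = []
end_time = time.time() + 60  # arbitrary: watch for about a minute
while time.time() < end_time:
    logs = driver.get_log("performance")  # returns and clears the buffered entries
    all_events.extend(process_browser_logs_for_network_events(logs))
    time.sleep(5)  # arbitrary polling interval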
And just like that, we've recorded thousands of lines of network traffic that our page was sending and receiving!
More possibilities
Capturing the network traffic is a useful technique for grabbing information from asynchronous requests that the page is firing off. You may find private APIs that are easier to work with than parsing data out of the page, grab streaming information, monitor websocket communications, and more.
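For instance, if websocket traffic is what you're after, the payload of each frame shows up in the event's params. Here's a minimal sketch; in the DevTools Protocol the frame text lives under params["response"]["payloadData"], but double-check that against your own log output, since the structure can vary between Chrome versions:
# Minimal sketch: print the raw payload of each websocket frame that was
# sent or received. Verify the params structure against your own log output.
logs = driver.get_log("performance")
for event in process_browser_logs_for_network_events(logs):
    if event["method"] in ("Network.webSocketFrameSent", "Network.webSocketFrameReceived"):
        print(event["params"]["response"]["payloadData"])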
Best of luck with your webscraping adventures!
As a reminder, the final script can be found at this gist.