Web scraping with Selenium

My paid membership on an online video platform (a MOOC, if you know the word) ends in a few days and I haven't watched all 1,000+ videos yet. So I decided to write a Python script to download them.

The task is complex because the website checks for user credentials and generates a temporary link for each video on demand. So I need to simulate user activity on the website and then retrieve these unique links in order to download the videos.

A simple approach with wget and regexes won't cut it because we need to execute JavaScript and store cookies. Luckily, there are Python libraries such as Selenium that let us control and automate a real web browser.

In this article, I'll share the lessons I learned while crafting my scraper.

Table of contents:

  • Quick introduction to Selenium
  • Scraping an online app
  • Handling user login
  • Downloading files
  • Tips for handling failure

Quick introduction to Selenium

Selenium is a library for automating a web browser. Through drivers, it can control any major browser. The documentation for the Python bindings of Selenium can be found here.

To use it, you'll first need to download the proper driver and add the binary to your PATH. I use Google Chrome as my browser and the corresponding ChromeDriver.

Then, install selenium using pip:

pip install selenium

In your script, create an instance of the browser like this:

from selenium import webdriver

browser = webdriver.Chrome()

If you want to save state and cookies between successive runs of your script, specify a directory in which to store them:

from selenium import webdriver

DIR = 'selenium-data'

options = webdriver.ChromeOptions()
options.add_argument("user-data-dir="+DIR)
browser = webdriver.Chrome(options=options)

There are many more options available. For instance, if you don't want the browser window to show up, use headless mode:

# ...
options.add_argument('headless')
browser = webdriver.Chrome(options=options)

Then, you can access a webpage using the get method:

browser.get('https://scripting.tips/')
print(browser.current_url)

Selenium has a nice API to locate elements on a webpage. For instance, here's how you find all links:

links = browser.find_elements_by_tag_name('a')
for link in links:
    text = link.text
    url = link.get_attribute('href')
    print(f'"{text}": {url}')

You can even find elements by the text they contain:

continue_link = browser.find_element_by_link_text('Continue')
continue_link = browser.find_element_by_partial_link_text('Conti')

Scraping an online app

Thinking in terms of state machines can greatly simplify the design process.

An online app is a finite automaton with multiple states (logged in, not logged in, etc.). As a rule of thumb, the scraper needs one state for each of the app's states: when it isn't logged in, it behaves differently than when it is.
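To make this concrete, here's a minimal sketch of the idea. The detect_state and handler functions are hypothetical placeholders (nothing Selenium-specific), and the '/login' check is just an assumption about how a site signals a logged-out user:

def detect_state(browser):
    # Assumption: the site redirects logged-out users to a URL containing '/login'
    return 'not-logged-in' if '/login' in browser.current_url else 'logged-in'

def handle_not_logged_in(browser):
    print('need to log in first')   # placeholder: perform the login steps here

def handle_logged_in(browser):
    print('scraping the page')      # placeholder: extract links, download, etc.

HANDLERS = {
    'logged-in': handle_logged_in,
    'not-logged-in': handle_not_logged_in,
}

def step(browser, url):
    browser.get(url)
    HANDLERS[detect_state(browser)](browser)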

Handling user login

When a script runs for several hours, logging in once at startup won't cut it: the session might expire after an hour or so, and we need to be able to redo the login steps on demand.

In these cases, I wrap the browser instance so that every time the website doesn't serve me members-only content, I redo the login steps. Here's the gist of it:

# Instead of browser.get(url), use this:

def safe_get(url, browser):
    browser.get(url)
    if must_login(browser.current_url):
        # The session has expired: log in again and retry the request once
        login(browser)
        browser.get(url)
    if must_login(browser.current_url):
        notify(f'Unable to access {url}')

I have implemented a toy server that randomly logs you out, and a resilient scraper with the behavior described above in this repo.
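For illustration, must_login and login could look roughly like this. The login URL and the form field names are assumptions about the target site; yours will differ:

EMAIL, PASSWORD = 'me@example.com', 'secret'   # your credentials
LOGIN_URL = 'https://example.com/login'        # assumed login page URL

def must_login(current_url):
    # Assumption: the site redirects to its login page when the session has expired
    return current_url.startswith(LOGIN_URL)

def login(browser):
    browser.get(LOGIN_URL)
    browser.find_element_by_name('email').send_keys(EMAIL)
    password_field = browser.find_element_by_name('password')
    password_field.send_keys(PASSWORD)
    password_field.submit()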

Downloading files

There is no easy and safe way to download files through Selenium. You could configure your browser's driver to download files automatically, but this poses security concerns because you can't allow or disallow downloads on a per-file basis.
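For reference, this is roughly what that configuration looks like with Chrome. The preference keys are Chrome-specific and the download directory is a placeholder; keep in mind it applies to every download the browser triggers, which is exactly the drawback mentioned above:

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': '/tmp/downloads',   # placeholder download directory
    'download.prompt_for_download': False,            # never ask for confirmation
})
browser = webdriver.Chrome(options=options)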

If you don't need advanced session handling, you can just use urlretrieve from the standard library:

from urllib.request import urlretrieve

# Download `url` and save it as `local_filename`
urlretrieve(url, local_filename)

If you want to use the standard library for more advanced scenarios, you can use urlopen to set custom HTTP headers (as shown in the examples). However, I advise using the requests library instead.
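For completeness, here's a minimal sketch of the urlopen approach with custom headers; the header values are placeholders:

from urllib.request import Request, urlopen

req = Request(url, headers={
    'User-Agent': 'Mozilla/5.0',     # placeholder: pretend to be a regular browser
    'Cookie': 'sessionid=abc123',    # placeholder: cookies copied from selenium
})
with urlopen(req) as response, open(local_filename, 'wb') as f:
    f.write(response.read())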

For instance, here's how you can reuse selenium's cookies using requests:

from requests import get  # pip install requests

cookies = browser.get_cookies()
cookies_dict = {c['name']: c['value'] for c in cookies}

with get(url, stream=True, cookies=cookies_dict) as r:
    assert r.status_code == 200
    with open(local_filename, 'wb') as f:
        # Stream the response to disk chunk by chunk
        for chunk in r.iter_content(chunk_size=1 << 16):
            f.write(chunk)

(Relevant external links for this snippet: get_cookies, requests.cookies, stream=True, other approaches)

Several pieces of state need to be transferred between Selenium and requests:

  • cookies;
  • user-agent;
  • referer;
  • new cookies after the download.

I wrote a download(browser, url, filename) function that handles all this. You can find it here: My Selenium Tools.
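The full version is linked above; here is a simplified sketch of the general shape (no error handling, and the cookie write-back assumes the browser is still on a page of the same domain):

from requests import Session

def download(browser, url, filename):
    session = Session()
    # Copy selenium's cookies and identity into the requests session
    for c in browser.get_cookies():
        session.cookies.set(c['name'], c['value'])
    session.headers['User-Agent'] = browser.execute_script('return navigator.userAgent')
    session.headers['Referer'] = browser.current_url

    # Stream the file to disk
    with session.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 16):
                f.write(chunk)

    # Push any cookies set during the download back into the browser
    for name, value in session.cookies.items():
        browser.add_cookie({'name': name, 'value': value})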

Tips for handling failure

To make sure you can detect when a file was not completely downloaded, I suggest writing a small text file after each download. Use this as an opportunity to save metadata such as the date, name, rating, and description of the resource:

from os.path import isfile
import json

data_url, data_info = scrape_website()
video_file = data_info['name'] + '.mp4'
txt_file = data_info['name'] + '.txt'

if not isfile(txt_file):
    download(browser, data_url, video_file)
    # Write the marker file only once the download has completed
    with open(txt_file, 'w') as f:
        json.dump(data_info, f)

When the video file exists but the text file does not, it means a previous download was interrupted midway.
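Building on the snippet above, the check and cleanup could look like this:

from os import remove
from os.path import isfile

if isfile(video_file) and not isfile(txt_file):
    remove(video_file)   # the previous download was interrupted: start over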

See also: How to safely get a file extension from a URL.