Tech Nuggets (3)

Download all the videos from vimcasts.org

A few days ago, I found Vimcasts. It has great videos and articles about the basics of Vim. It's really great.

Each video is only about 5 minutes long, but there are 76 episodes so far, which is a lot. I wanted to watch them offline, so I wrote some scripts and downloaded them all. I hadn't used BeautifulSoup in a while, but it's still easy to use and useful.

Here is how I did it:

  1. Get the title and video URL with BeautifulSoup
  2. Create a fish script that downloads one video
  3. Run (2) with xargs -P to download in parallel

I think (2) and (3) are a bit redundant. When I want to run a simple task like this in parallel, I usually reach for xargs -P. But passing multiple commands to xargs gets hard to read, so I often write a tiny script instead. In this case, I really should have done the parallel download in Python. I wish I knew how, but this was a one-shot task and I didn't want to bother searching, learning and testing... Well, that's just an excuse. Next time something like this comes up, I'll learn it.
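
For reference, here is a minimal sketch of what a parallel download in Python could look like, using a thread pool from the standard library. It is not what I actually ran. The names download_one and download_all are made up for illustration, urllib is used instead of requests only to keep the sketch free of third-party packages, and 12 workers simply mirrors the xargs -P 12 setting.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def download_one(title, url):
    """Save one video to a file named after its title (hypothetical helper)."""
    urllib.request.urlretrieve(url, title)
    return title


def download_all(rows, worker=download_one, max_workers=12):
    """Run worker(title, url) for every row on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, title, url) for title, url in rows]
        # result() re-raises any exception that happened in the worker thread.
        return [future.result() for future in futures]
```

With the CSV from step (1), rows could be built with something like `[line.strip().split(",", 1) for line in open("title_and_url.csv")]` and passed to download_all. The worker parameter exists so the pool logic can be tried out without touching the network.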

Here is my question: what is an appropriate request frequency?

This time, I downloaded the videos in parallel on the unfounded assumption that 12 simultaneous connections would not overload the server. In Japan, about 10 years ago, a man was arrested for web scraping. He scraped a local library's website and the server became overloaded, even though the scraper he wrote was well-behaved and made only one request per second. In the end, it turned out that the overload was caused by the server's own processing, and since he had no intention of interfering with the library's business, the case was dropped. Since then, I think "one request per second" has become a sort of gold standard. But that standard is a decade old. So again, what is an appropriate request frequency? Please let me know what you think.
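
To make the "one request per second" rule concrete, here is a minimal sketch of a throttle for a sequential scraper. It is an illustration only: the Throttle class and its parameters are hypothetical names, the 1.0 second default is just the rule of thumb above and not a recommendation, and the class is not thread-safe, so it does not apply as-is to the parallel download.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the logic can be checked without real waiting
        self.sleep = sleep
        self.last = None    # time of the previous request, None before the first one

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                # Too soon after the previous request: pause for the rest of the interval.
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling throttle.wait() right before each requests.get(...) in a loop would space the requests at least one second apart.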

Below are the scripts I wrote and the results.

1. Get the title and video URL with BeautifulSoup

import requests
from bs4 import BeautifulSoup


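def get_title_and_url(n, extension):
    """Return "<n>_<link text>,<video URL>" for episode n, or None if no link matches."""
    url = f"http://media.vimcasts.org/videos/{n}/"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Take the first link whose href contains the wanted extension.
    for a in soup.select('a'):
        href = a.get('href')
        # a.get() returns None for <a> tags without an href, so skip those.
        if href and f".{extension}" in href:
            return f"{n}_{a.text},{url}{href}"
    return None


# Episodes 1-68 have an .ogv file.
for i in range(1, 68 + 1):
    print(get_title_and_url(i, 'ogv'))
# Episodes 69-76 have no .ogv, so fall back to .mp4.
for i in range(69, 76 + 1):
    print(get_title_and_url(i, 'mp4'))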

Episodes #1 to #68 are downloaded as .ogv, and episodes #69 and later as .mp4. This is not good, right? At first I assumed every episode had an .ogv file, but when I ran the script there was none from #69 onward and I got an error, so I patched it ad hoc. I should have written it as "get the .ogv if it exists, otherwise get the first URL" from the beginning.
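
A minimal sketch of that "prefer .ogv, otherwise fall back" idea could look like the following. It is not the code I ran. pick_video_href is a hypothetical helper that works on the list of href values already pulled out of the page, and the fallback extensions are an assumption: a blind "first URL" might pick up a non-video link, so the sketch falls back to the first link with a known video extension instead.

```python
def pick_video_href(hrefs, preferred=".ogv", fallbacks=(".mp4", ".m4v")):
    """Return the preferred video link if present, else the first fallback match, else None."""
    candidates = [h for h in hrefs if h]  # drop missing hrefs (None or empty)
    for href in candidates:
        if href.endswith(preferred):
            return href
    for href in candidates:
        # str.endswith accepts a tuple and matches any of its suffixes.
        if href.endswith(fallbacks):
            return href
    return None
```

Inside get_title_and_url, the hrefs could come from `[a.get('href') for a in soup.select('a')]`, which would let a single loop over all 76 episodes replace the two ranges.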

2. Create a fish script that downloads one video

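#!/usr/bin/fish
# Usage: download.fish "title,url"
# Split the single CSV argument into its two fields.
set title (echo $argv | cut -d',' -f1)
set url (echo $argv | cut -d',' -f2)
# Save the video under its title; -q as the first option makes curl skip its config file.
curl -q --output "$title" "$url"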

This assumes that the CSV output of (1) is passed in as a single string, one line at a time:

$ download.fish "title,url"

3. Run (2) with xargs -P to download in parallel

$ chmod +x download.fish
$ python get_title_and_url.py > title_and_url.csv
$ cat title_and_url.csv | xargs -I'{}' -P 12 ./download.fish {}
.
.
.

$ ls -1
vimcasts-01_show_invisibles.ogv
vimcasts-02_tabs_and_spaces.ogv
vimcasts-03_whitespace_preferences_and_filetypes.ogv
.
.
.

🎉

Postscript

As I mentioned, I use xargs -P when I want to run commands in parallel. GNU Parallel also crosses my mind, but I can't remember its syntax and I'm too lazy to read the man page. I think parallel would be the better choice when the job involves combinations of inputs.

Vimcasts reminded me that there are still a lot of things I don't know (even the basics!). Thanks, Drew Neil, for the great content.