Parsing webpages to build a collection of Qawwali performances
I’ve been itching to get better at web parsing. I’ve been listening to a lot of qawwali on YouTube recently, but I’ve been meaning to expand the list of songs I listen to rather than stay locked into the recommendations. I came across a webpage listing the “100 best Qawwali tracks”. Below is my attempt at parsing webpages to automatically extract the .mp3 links, which I then downloaded using wget.
I started off by making two directories: one to hold the webpages and one to hold the music files.
mkdir data
mkdir music
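The same setup could also live in the script itself; here’s a small sketch using only the standard library, so reruns don’t fail when the directories already exist:

import os

# Create both directories; exist_ok avoids an error if they are already there
for d in ('data', 'music'):
    os.makedirs(d, exist_ok=True)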
I was initially planning on using wget to download the actual files from URLs, but I realized there is a wget module for Python, so I installed that first. Next, I downloaded the index page, hoping that it would list links to the MP3 files.
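For completeness, the install was the usual pip step; I’m assuming tqdm (used below for the progress bar) gets installed the same way:

pip install wget tqdm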
import wget
import os
from tqdm import tqdm

frontpage = "https://www.thesufi.com/sufimusic/100-best-qawwali-music-tracks-ever.html"
datapath = 'data/frontpage.html'
if not os.path.exists(datapath):
    wget.download(frontpage, datapath)
Some quick grepping showed that this page contains links to a separate HTML page for each track, each of which in turn has a Download link for the track.
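Something along these lines is enough to eyeball the link structure (the exact command doesn’t matter; any href extractor works):

grep -o 'href="[^"]*"' data/frontpage.html | sort -u | head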
To parse each page, I started from the template example for the HTML parser on the Python documentation site, storing links from href attrs that lead to other webpages, and fine-tuned the conditional to get the links I was interested in.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # important: run HTMLParser's own setup to inherit its local variables
        self.links = []  # list to store links

    def handle_starttag(self, tag, attrs):
        for attr in attrs:
            k, v = attr
            # Somewhat arbitrary filter which gets the job done
            if k == 'href' and ('thesufi.com/sufimusic/' in v and '.html' in v) and (len(v.split('/')) > 5):
                self.links.append(v)
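As a quick sanity check, feeding the parser a hand-written anchor tag shows the filter firing; the URL here is invented, but it has the right shape (under thesufi.com/sufimusic/, ends in .html, more than five slash-separated pieces):

demo = MyHTMLParser()
demo.feed('<a href="https://www.thesufi.com/sufimusic/qawwali/some-track.html">Some Track</a>')
print(demo.links)  # ['https://www.thesufi.com/sufimusic/qawwali/some-track.html']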
Using this parser, I extracted the relevant links to webpages and dumped them in a file called data/alllinks:
parser = MyHTMLParser()
with open(datapath, 'r') as infile:
    data = infile.read()
parser.feed(data)

with open('data/alllinks', 'w') as outfile:
    for l in parser.links:
        outfile.write(l + '\n')
This is the main loop. For each link in the file:
- download the webpage if it doesn’t exist, while handling 404 errors with a try block
- read the webpage in as a string
- parse the webpage and collect all links which end with .mp3 using the ParseMP3 class.
- download the mp3 file, while handling any request errors.
First, I had to define another HTML parser to look for mp3 links:
class ParseMP3(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.mp3link = []  # collected .mp3 hrefs

    def handle_starttag(self, tag, attrs):
        for attr in attrs:
            k, v = attr
            if k == "href" and v.endswith(".mp3"):
                self.mp3link.append(v)
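The same kind of spot check works here; any href ending in .mp3 gets collected (again, the URL is made up for illustration):

demo = ParseMP3()
demo.feed('<a href="https://www.thesufi.com/some-track.mp3">Download</a>')
print(demo.mp3link)  # ['https://www.thesufi.com/some-track.mp3']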
Next, I read the list of links from data/alllinks, which I will loop over:
alllinks = []
with open('data/alllinks', 'r') as infile:
    for link in infile:
        alllinks.append(link.strip())
Finally, here’s the main loop, as described above:
for link in tqdm(alllinks[1:]):  # Hack: start from the second element, because the first link was not relevant!
    print("Current link: ", link)
    # Download the webpage if we don't already have it
    fname = 'data/' + link.split('/')[-1]
    if os.path.exists(fname):
        print("Skipping download...")
    else:
        print("Downloading: ", fname)
        try:
            wget.download(link, fname)
        except Exception:
            print(link, "could not be downloaded. Skipping...")
            continue  # no page on disk, so nothing to parse
    # Load the webpage
    with open(fname, 'r') as webp:
        page = webp.read()
    # Parse the webpage for the mp3 link
    parser = ParseMP3()
    parser.feed(page)
    if len(parser.mp3link) == 0:
        print(fname, 'contains no links, skipping')
    else:
        # Deduplicate; each page carries a single download link
        mp3link = list(set(parser.mp3link))[0]
        targetpath = 'music/' + mp3link.split('/')[-1]
        if os.path.exists(targetpath):
            print(mp3link, 'exists, skipping...')
        else:
            print("Downloading mp3: ", targetpath)
            try:
                wget.download(mp3link, targetpath)
            except Exception:
                print(mp3link, "could not be downloaded. Skipping...")
I added the try-except blocks after a 404 error caused the script to exit partway through. They make the code look a bit messy, but overall it is quite straightforward.
Of the advertised 100 tracks, my initial filter only turned up 91 links, but otherwise (discounting the handful that errored out) I now have a substantial collection of qawwali that I can listen to offline.
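As a rough check on the final haul, counting the files in music/ should land just short of 91:

ls music | wc -l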