In this quick tip, we will learn how to scan the HTML files in a project directory, collect the links they contain, and check whether any of them are broken. Broken links are bad practice, and anyone worried about their SEO score should detect them and then remove or update them, since they hurt your ranking.

In this tutorial, we’re going to use pyquery to parse the HTML files and query them for links the same way we would in jQuery.

To get pyquery, install it with pip, the Python package manager:

pip install pyquery

We will also use urllib2 (part of the Python 2 standard library) to fetch each link and inspect the response status code. For this purpose, we’re looking for the 404 HTTP status code, which means the page was not found.

Here’s the code:

# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq
import sys
import urllib2
import os
import fnmatch
import re

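# Status code of the most recent HTTP error, used in the report below.
status_code = None

def is404(url):
    """Return True if url answers with HTTP 404, None if it is unreachable."""
    global status_code
    req = urllib2.Request(url)
    try:
        urllib2.urlopen(req).close()
    except urllib2.HTTPError as e:
        status_code = e.code
        return e.code == 404
    except urllib2.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None
    else:
        return False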

try:
    ext = sys.argv[1]
except IndexError:
    ext = None

if not ext:
    print 'You must provide a file extension (e.g. html, php)!'
    sys.exit(1)

items = []

for root, dirs, files in os.walk('.'):
    for basename in files:
        if fnmatch.fnmatch(basename, '*.%s'%ext):
            items.append(os.path.join(root, basename))

for item in items:
    with open(item) as c:
        raw = c.read()
        q = pq(raw)

        for a in q('a'):
            href = a.attrib.get('href')
            if not href:
                # anchor without an href attribute
                continue

            if not re.match(r'https?://', href):
                # not an absolute http(s) URL, probably an internal link
                continue

            if is404(href):
                print '%s is a broken link! Status Code %s (located in %s)' % (href, status_code, item)
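
The script above targets Python 2, which is the only place urllib2 exists. On Python 3 the same functionality lives in urllib.request and urllib.error. Below is a minimal sketch of how the status check might look there. It is not part of scan.py: the fetch_status name and the opener and timeout parameters are assumptions made for illustration (opener lets you swap in a fake for testing), and the function returns the status code instead of storing it in a global.

```python
# Python 3 sketch: urllib2 was split into urllib.request and urllib.error.
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable.

    `opener` and `timeout` are illustrative parameters, not part of scan.py.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        return e.code
    except urllib.error.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None


def is404(url, opener=urllib.request.urlopen):
    """Return True only when the server explicitly answers 404."""
    return fetch_status(url, opener=opener) == 404
```

Returning the code directly means the caller can print it without relying on shared state.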

Save that code into a new file named scan.py (or any name you like) and run it from the directory where your project files are. The script takes the file extension to scan as its argument. This example uses html, but you can scan other extensions such as php. Here’s a simple usage example.

I have a directory containing a sample project. Here’s a quick view:

(ct) samuel@samuel-dell:~/www/python/ct/broken$ ls -R
.:
about  index.html  scan.py

./about:
index.html

Now I can just run

python scan.py html

and it will recursively find all HTML files within this directory, search them for anchors, extract their href attributes, skip anything that is not an absolute URL, and then check whether each remaining link is broken.

(ct) samuel@samuel-dell:~/www/python/ct/broken$ python scan.py html
https://www.google.com/cats is a broken link! Status Code 404 (located in ./about/index.html)
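
The extract-and-validate step can also be sketched with the standard library alone. This is not how scan.py does it: the example below swaps pyquery for Python 3's html.parser and the regular expression for urllib.parse.urlsplit, and the LinkCollector and external_links names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors with a missing or empty href
                self.hrefs.append(href)


def external_links(html):
    """Return the absolute http(s) links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs
            if urlsplit(h).scheme in ("http", "https")]
```

Relative paths, fragment links, and schemes such as mailto are filtered out, which mirrors the "internal link probably" branch in the script.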

The script tells you where each broken link is (the file location) so you can jump straight there to fix it. It may take a long time to run, depending on how many links you have and on the speed of your connection.
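
One way to cut the number of requests, assuming the same URL shows up in several files, is to check each unique URL only once and report every file that links to it. The sketch below (Python 3) is an optional variation and not part of scan.py. The group_by_url and report_broken names are hypothetical, and is_broken stands for whatever check function you use, such as is404.

```python
from collections import defaultdict


def group_by_url(found):
    """Map each URL to the sorted list of files that link to it.

    `found` is an iterable of (file_path, url) pairs.
    """
    locations = defaultdict(set)
    for path, url in found:
        locations[url].add(path)
    return {url: sorted(paths) for url, paths in locations.items()}


def report_broken(found, is_broken):
    """Call is_broken once per unique URL and return the report lines."""
    lines = []
    for url, paths in sorted(group_by_url(found).items()):
        if is_broken(url):
            lines.append("%s is a broken link! (located in %s)"
                         % (url, ", ".join(paths)))
    return lines
```

With this shape, a link that appears in a shared header or footer costs one request instead of one per file.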
