In this quick tip, we will learn about scanning a project directory HTML files for quickly fetching links and scan for broken links that we have. Broken links are usually a bad practice and something that one worried about their SEO score should detect and remove or update, as they affect your ranking in a bad way.
In this tutorial, we’re going to use pyquery to parse the files into HTML, and query for links the same way we do in JQuery.
To get pyquery, you can install it via Python package manager pip:
pip install pyquery
We will also use urllib2 to fetch a link and find out the response status code whether it is a normal response or the opposite. for this purpose, we’re looking into 404 HTTP error code, which stands for a not found web page.
Here’s the code:
# -*- coding: utf-8 -*- from pyquery import PyQuery as pq import sys import urllib2 import glob import os import fnmatch import re status_code = None def is404(url): global status_code req = urllib2.Request(url) try: resp = urllib2.urlopen(req) except urllib2.HTTPError as e: status_code = e.code return e.code == 404 except urllib2.URLError as e: return None else: return False try: ext = sys.argv except: ext = None if not ext: print 'You must provide a file extension (e.g html, php)!' sys.exit(0) items =  for root, dirs, files in os.walk('.'): for basename in files: if fnmatch.fnmatch(basename, '*.%s'%ext): items.append(os.path.join(root, basename)) for item in items: with open(item) as c: raw = c.read() q = pq(raw) for a in q('a'): try: href = a.attrib['href'] except: continue if not re.match('(https?)?:\/\/', href): # invalid link, internal link probably continue if is404(href): print '%s is a broken link! Status Code %s (located in %s)' % (href, status_code, item)
You should probably save that code into a new file named scan.py or whatever, and execute it inside the directory where your project files are. The file extension is HTML but you can also scan other file extensions such as .php or others, here’s a simple usage:
I have a directory where a sample project is. Here’s a quick view.
(ct) samuel@samuel-dell:~/www/python/ct/broken$ ls -R .: about index.html scan.py ./about: index.html
Now I can just run
python scan.py html
and it will recursively search all HTML files within this directory and search for HTML anchors, extract their HREF attributes, validate it, and then check if it is not a broken link.
(ct) samuel@samuel-dell:~/www/python/ct/broken$ python scan.py html https://www.google.com/cats is a broken link! Status Code 404 (located in ./about/index.html)
It will tell where the broken link is (file location) so you could jump into there to fix it. It may also take some long time to process and it all depends on how many links you have, and your connectivity may also play a role in this.
Cheap Cloud SSD Hosting
Get a VPS now starting at $5/m, fast and perfect for WordPress and PHP applications