January 26, 2021 05:46 pm GMT

Html Parser - How to scan HTML files for missing assets and broken links

Hello Coders,

This article presents a simple, open-source tool that I use to statically analyze HTML files for missing assets and broken links before using them in real projects. The Html Parser is essentially a Python 3 wrapper around Beautiful Soup, the popular open-source library for parsing HTML and XML files. The source code is available on Github, released under an EULA license.

Features:

Open-Source - can also be used for eLearning
Works with directories - all HTML files are scanned
Detects missing assets (JS, CSS, images) for each page
Detects broken links and suggests the right path
Acceptable execution time - 100 pages processed in under 1 minute

Thanks for reading! TL;DR;

Html Parser - source code
Sample Output - captured from a real project
EULA License - free for solo-developers, small companies, startUps, and NGOs

To use the tool, we need to specify two things:

The folder where the HTML files are saved
The assets folder - the parent directory for all JS, CSS, and image files

Once this simple setup is in place, we can run the script in the terminal:

$ python ./check-assets.py

HTML Parser - The Relevant Parts

To scan and correlate the data, the tool keeps a few structures in memory that store the relevant information for each detected HTML file and support simple operations on it.

How it works

Define a map where the key is the file name
Associate a data structure with each file, where the relevant information is stored and updated
Scan each HTML file for assets and links
Validate the information for each file by looking on the disk, and save the missing assets for each one
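The validation step above can be sketched as follows. This is only an illustration, not the code from the repository: PageInfo, find_missing, validate, and the "/static/assets/" URL prefix (borrowed from the sample output further down) are all assumptions made for the example.

```python
import os
from dataclasses import dataclass, field


@dataclass
class PageInfo:
    """Hypothetical per-file record (the tool itself uses a class named TMPL)."""
    file: str
    assets: list = field(default_factory=list)  # asset URLs found in the page
    err: list = field(default_factory=list)     # asset URLs missing on disk


def find_missing(asset_urls, assets_root, url_prefix="/static/assets/"):
    """Return the asset URLs that have no matching file under assets_root."""
    missing = []
    for url in asset_urls:
        if not url.startswith(url_prefix):
            continue  # external or unrelated URL: not checked in this sketch
        relative = url[len(url_prefix):]
        if not os.path.isfile(os.path.join(assets_root, relative)):
            missing.append(url)
    return missing


def validate(pages, assets_root):
    """Fill err for every record in the {file name: PageInfo} map.

    Returns only the pages that have at least one missing asset.
    """
    for info in pages.values():
        info.err = find_missing(info.assets, assets_root)
    return {name: info.err for name, info in pages.items() if info.err}
```

Keeping the map keyed by file name makes the final report a simple loop over the pages whose err list is not empty.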

HTML Parser - Source Code

The relevant functions and code chunks are below. If something relevant is missing, feel free to ask for it in the comments section:

Read files from a directory

from os import walk

def get_files( aPath ):
    FILES_LIST = []
    # walk() yields the top-level directory first; the break skips sub-directories
    for (root, dirs, files) in walk( aPath ):
        FILES_LIST.extend( files )
        break
    return FILES_LIST
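Note that get_files returns every file name found in the top-level directory, whatever its extension. If the folder also holds other files, a filter along these lines might help. The helper name get_html_files and the accepted extensions are assumptions for this sketch, not part of the tool.

```python
import os


def get_html_files(a_path):
    """Return the sorted names of the .html/.htm files directly inside a_path."""
    return sorted(
        name for name in os.listdir(a_path)
        if name.lower().endswith((".html", ".htm"))
        and os.path.isfile(os.path.join(a_path, name))
    )
```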

The structure/class to save the information for each file

class TMPL:

    # constructor
    def __init__(self, aFile=''):
        self.file      = aFile
        self.title     = ''
        self.css       = [] # All CSS Files
        self.js        = [] # All JS Files
        self.img       = [] # All Images
        self.links     = [] # All Links
        self.err       = [] # used to report missing assets
        self.err_links = [] # used to report broken links

    # Used to have a string representation
    def __repr__(self):
        return self.file + ' some other info'

Initialize a Beautiful Soup object for each file

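import htmlmin
from bs4 import BeautifulSoup as bs

def get_bs( aFile ):
    # file_load() is a helper from the tool (not shown here) that reads the file content
    minified = htmlmin.minify( file_load( aFile ), remove_empty_space=True)
    return bs(minified,'html.parser')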

Scan each file for Links and assets

The results are injected into associated structures for each file.

# BS object is constructed and available for queries
soup = get_bs( FULL_PATH )

# Scan for CSS files
tmpl.css = get_css( soup )

# Scan for JS files
tmpl.js = get_js( soup )
...

Links and images are scanned in the same way using simple helpers.
Once the information is saved, we can traverse the DOM using BS objects and perform mutations over elements.
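The bodies of these helpers are not shown in the article, so here is a rough sketch of what they might look like with Beautiful Soup. get_css and get_js reuse the names from the snippet above; get_img and get_links are hypothetical names, and the exact selection rules (for example, whether favicons count as CSS) are assumptions.

```python
from bs4 import BeautifulSoup


def get_css(soup):
    """Return the href of every <link rel="stylesheet"> element."""
    return [tag["href"] for tag in soup.find_all("link", rel="stylesheet", href=True)]


def get_js(soup):
    """Return the src of every external <script> (inline scripts have no src)."""
    return [tag["src"] for tag in soup.find_all("script", src=True)]


def get_img(soup):
    """Return the src of every <img> element."""
    return [tag["src"] for tag in soup.find_all("img", src=True)]


def get_links(soup):
    """Return the href of every <a> element that has one."""
    return [tag["href"] for tag in soup.find_all("a", href=True)]
```

Passing href=True or src=True to find_all skips tags that lack the attribute, so the list comprehensions never hit a missing key.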

HTML Parser - Sample output

To see real output from a production project, open the sample file saved in the public repository: check assets - output

(env) PS > python.exe .\check-assets.py
Files (2)['apps-calendar.html', 'index.html']
***** ***** *****
PROCESSING --> apps-calendar.html | files (1) remaining
PROCESSING --> index.html | files (0) remaining
PROCESSING --> apps-calendar.html
ERR - Missing Asset -> /static/assets/css/classic-horizontal/style-ERROR.css
ERR - Missing Asset -> /static/assets/images/logo-mini-ERROR.svg
PROCESSING --> index.html
ERR - Missing Asset -> /static/assets/images/favicon-ERROR.png
    |
    |- apps-calendar.html
    |    |
    |    |--- CSS: 6 file(s)
    |          | /static/assets/vendors/mdi/css/materialdesignicons.min.css
    |          | /static/assets/vendors/css/vendor.bundle.base.css
    |          | /static/assets/vendors/fullcalendar/fullcalendar.min.css
    |          | /static/assets/css/classic-horizontal/style.css
    |          | /static/assets/css/classic-horizontal/style-ERROR.css
    |          | /static/assets/images/favicon.png
    |     ...
Pages with errors: 2
    |
    |- apps-calendar.html
    |    |     | /static/assets/css/classic-horizontal/style-ERROR.css
    |    |     | /static/assets/images/logo-mini-ERROR.svg
    |
    |- index.html
    |    |     | /static/assets/images/favicon-ERROR.png

The tool can be easily extended to LIVE websites using the existing core. In case any of you find it useful, feel free to suggest features in the comments section or push a PR on Github.
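As a starting point for the LIVE-website idea, the scanned links could be resolved against the page URL and probed over HTTP. The sketch below uses only the Python standard library; absolute_urls and is_reachable are hypothetical helpers, not part of the tool, and some servers reject HEAD requests, so a real checker may need a GET fallback.

```python
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


def absolute_urls(page_url, hrefs):
    """Resolve hrefs against page_url, keeping only http(s) targets."""
    resolved = []
    for href in hrefs:
        url = urljoin(page_url, href)
        if urlparse(url).scheme in ("http", "https"):
            resolved.append(url)
    return resolved


def is_reachable(url, timeout=5):
    """Return True if a HEAD request to url answers with a status below 400."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs
        return False
```

A live scan would then report every resolved URL for which is_reachable returns False, in the same way the tool reports missing files on disk.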

Thank you! - For more resources, please access:

Beautiful Soup - the official docs
AppSeed - for more tools and starters

Btw, my (nick) name is Sm0ke and I'm pretty active also on Twitter.

Original Link: https://dev.to/sm0ke/html-parser-how-to-scan-html-files-for-missing-assets-and-broken-links-2mke
