July 1, 2020 01:10 am GMT

Comparing the same web scraper in Haskell, Python, Go

So this project started with a need - or, not really a need, but an annoyance I realized would be a good opportunity to strengthen my Haskell, even if the solution probably wasn't worth it in the end.

There's a blog I follow (Fake Nous) that uses WordPress, meaning its comment section mechanics and account system are as convoluted and nightmarish as Haskell's package management. In particular, I wanted to see if I could stop relying on kludgy WordPress notifications that only seem to work occasionally, and instead write a web scraper that'd fetch the page, find the recent comments element, and see if a new comment had been posted.

I've done the bulk of the job now: I wrote a Haskell script that outputs the "Name on Post" string of the most recent comment. I thought it'd be interesting to compare the Haskell solution to Python and Go solutions.

    {-# LANGUAGE OverloadedStrings #-}
    {-# LANGUAGE TupleSections #-}
    {-# LANGUAGE ScopedTypeVariables #-}
    {-# LANGUAGE MultiWayIf #-}
    {-# LANGUAGE ViewPatterns #-}

    import Network.HTTP.Req
    import qualified Text.HTML.DOM as DOM
    import qualified Text.XML.Cursor as Cursor
    import qualified Text.XML.Selector as Selector
    import qualified Data.XML.Types as Types
    import qualified Text.XML as XML
    import Data.Text (Text, unpack)
    import Control.Monad

    main = do
        resp <- runReq defaultHttpConfig $ req GET (https "fakenous.net") NoReqBody lbsResponse mempty
        let dom = Cursor.fromDocument $ DOM.parseLBS $ responseBody resp
            recentComments = XML.toXMLNode $ Cursor.node $ head $ Selector.query "#recentcomments" $ dom
            newest = head $ Types.nodeChildren recentComments
        putStrLn $ getCommentText newest

    getCommentText commentElem =
        let children = Types.nodeChildren commentElem
        in foldl (++) "" $ unwrap <$> children

    unwrap :: Types.Node -> String
    unwrap (Types.NodeContent (Types.ContentText s)) = unpack s
    unwrap e = unwrap $ head $ Types.nodeChildren e

My Haskell clocs in at 25 lines, although if you remove the unused language extensions, it comes down to 21 (the other four are in there just because they're "go-to" extensions for me). So 21 is a fairer count. If you don't count imports as lines of code, it's 13.

Writing this was actually not terribly difficult; of the 5 or so hours I probably put into it in the end, 90% of that time was spent struggling with package management (the worst aspect of Haskell). In the end I finally resorted to Stack even though this is a single-file script that should be able to compile with just ghc.

I'm proud of my work though, and thought it reflected fairly well on a language to do this so concisely. My enthusiasm dropped a bit when I wrote a Python solution:

    import requests
    from bs4 import BeautifulSoup
    file = requests.get("https://fakenous.net").text
    dom = BeautifulSoup(file, features='html.parser')
    recentcomments = dom.find(id = 'recentcomments')
    print(''.join(list(recentcomments.children)[0].strings))

6 lines to Haskell's 21, or 4 to 13. Damn. I'm becoming more and more convinced nothing will ever displace my love for Python.

Of course, you can attribute some of Haskell's relative size to having an inferior library, but still.
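To make the counting rule behind these numbers explicit, here's a minimal Python sketch. It assumes a simplified rule (skip blank lines and whole-line `#` comments, optionally skip imports), which only approximates what cloc actually does. The names `count_code_lines` and `PYTHON_SCRAPER` are made up for illustration and aren't part of the scraper scripts.

```python
def count_code_lines(source, skip_imports=False):
    """Count non-blank lines that are not whole-line '#' comments.

    Simplified, assumed rule for illustration; cloc itself handles
    many more cases (block comments, other languages, and so on).
    """
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment-only line
        if skip_imports and stripped.startswith(("import ", "from ")):
            continue  # optionally leave import statements out of the count
        count += 1
    return count


# The 6-line Python scraper from above, embedded as text so nothing is fetched.
PYTHON_SCRAPER = '''\
import requests
from bs4 import BeautifulSoup
file = requests.get("https://fakenous.net").text
dom = BeautifulSoup(file, features='html.parser')
recentcomments = dom.find(id = 'recentcomments')
print(''.join(list(recentcomments.children)[0].strings))
'''

print(count_code_lines(PYTHON_SCRAPER))                     # 6
print(count_code_lines(PYTHON_SCRAPER, skip_imports=True))  # 4
```

The import check only recognizes Python's `import` and `from` statements, so the Haskell and Go sources would need their own variants of that test.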

Here's a Go solution:

    package main

    import (
        "fmt"
        "net/http"
        "github.com/ericchiang/css"
        "golang.org/x/net/html"
    )

    func main() {
        var resp, err = http.Get("https://fakenous.net")
        must(err)
        defer resp.Body.Close()
        tree, err := html.Parse(resp.Body)
        must(err)
        sel, err := css.Compile("#recentcomments > *:first-child")
        must(err)
        // It will only match one element.
        for _, elem := range sel.Select(tree) {
            var name = elem.FirstChild
            var on = name.NextSibling
            fmt.Printf("%s%s%s\n", unwrap(name), unwrap(on), unwrap(on.NextSibling))
        }
    }

    func unwrap(node *html.Node) string {
        if node.Type == html.TextNode {
            return node.Data
        }
        return unwrap(node.FirstChild)
    }

    func must(err error) {
        if err != nil {
            panic(err)
        }
    }

32 lines, including imports. So at least Haskell came in shorter than Go. I'm proud of you, Has- oh never mind, that's not a very high bar to clear.

It would be reasonable to object that the Python solution is so brief because it doesn't need a main function, but in real Python applications you generally still want that. But even if I modify it:

    import requests
    from bs4 import BeautifulSoup

    def main():
        file = requests.get("https://fakenous.net").text
        dom = BeautifulSoup(file, features='html.parser')
        recentcomments = dom.find(id = 'recentcomments')
        return ''.join(list(recentcomments.children)[0].strings)

    if __name__ == '__main__': print(main())

It only clocs in at 8 lines, including imports.

An alternate version of the Go solution that doesn't hardcode the number of nodes (since the Python and Haskell ones don't):

    package main

    import (
        "fmt"
        "net/http"
        "github.com/ericchiang/css"
        "golang.org/x/net/html"
    )

    func main() {
        var resp, err = http.Get("https://fakenous.net")
        must(err)
        defer resp.Body.Close()
        tree, err := html.Parse(resp.Body)
        must(err)
        sel, err := css.Compile("#recentcomments > *:first-child")
        must(err)
        // It will only match one element.
        for _, elem := range sel.Select(tree) {
            fmt.Printf("%s\n", textOfNode(elem))
        }
    }

    func textOfNode(node *html.Node) string {
        var total string
        var elem = node.FirstChild
        for elem != nil {
            total += unwrap(elem)
            elem = elem.NextSibling
        }
        return total
    }

    func unwrap(node *html.Node) string {
        if node.Type == html.TextNode {
            return node.Data
        }
        return unwrap(node.FirstChild)
    }

    func must(err error) {
        if err != nil {
            panic(err)
        }
    }

Though it ends up being 39 lines.

Maybe Python's lead would decrease if I implemented the second half, having the scripts save the last comment they found in a file, read it on startup, and update if it's different and notify me somehow (email could be an interesting test). I doubt it, but if people like this post I'll finish them.
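As a rough idea of what that second half might involve, here's a minimal Python sketch of just the save-and-compare step. It isn't the finished script: the function name `check_for_new_comment`, the state file name, and the sample comment strings are all hypothetical, and the notification step (email or otherwise) is left out entirely.

```python
import tempfile
from pathlib import Path


def check_for_new_comment(latest, state_file):
    """Return True and update state_file if `latest` differs from the saved comment.

    `state_file` is a hypothetical path where the last seen comment is stored.
    """
    path = Path(state_file)
    # On the first run there is no saved comment yet.
    previous = path.read_text(encoding="utf-8") if path.exists() else None
    if latest == previous:
        return False
    path.write_text(latest, encoding="utf-8")
    return True


# Demo with made-up comment strings and a throwaway directory, so nothing is
# fetched from the network and no real file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "last_comment.txt"
    print(check_for_new_comment("Alice on Some Post", state))  # True (first run)
    print(check_for_new_comment("Alice on Some Post", state))  # False (unchanged)
    print(check_for_new_comment("Bob on Other Post", state))   # True (new comment)
```

In a real script, the `latest` argument would come from the scraper's output, and a `True` result would be the point at which to send the notification.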


Original Link: https://dev.to/yujiri8/comparing-the-same-web-scraper-in-haskell-python-go-387a

