June 18, 2022 04:26 pm GMT

Golang Html tokenizer

Photo by Mohammad Rahmani on Unsplash

Looking for a way to parse and extract HTML content in Go, the way we can in PHP or JavaScript by simply creating a new DOM document? In Go there are multiple ways to do it, using different packages depending on your requirements. Some of the options I found are:

  • gohtml: gohtml (imported as golang.org/x/net/html) is an HTML5 tokenizer and parser implementation. Parsing returns a tree of nodes, and tokenizing lets you extract elements by token type, tag name, attributes, and text data.

  • goquery: goquery is built on the gohtml package and the CSS selector library Cascadia, which gives it more power over content selection and extraction. Its syntax is similar to jQuery.

  • godom: godom is a library that lets you manipulate the DOM from Go, much as you would in JavaScript. It compiles Go code to JavaScript using GopherJS.

For this demonstration I will use gohtml and its tokenizer.

Tokenization is lexical analysis: the input is split into tokens. HTML tokens include start tags, end tags, attribute names, and attribute values.

Tokenizing the document is the first step in parsing it into a tree of element and text nodes, similar to the DOM.

Types of HTML Tokens Supported:

  • html.StartTagToken: a start tag such as <a>
  • html.EndTagToken: an end tag such as </a>
  • html.SelfClosingTagToken: a self-closing tag such as <img.../>
  • html.TextToken: text content within a tag
  • html.CommentToken: an HTML comment such as <!-- comment -->
  • html.DoctypeToken: a document type declaration such as <!DOCTYPE html>

Example:

package main

import (
	"fmt"
	"io"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
	for {
		// Next advances to the next token and reports its type.
		tokenType := tokenizer.Next()
		token := tokenizer.Token()
		if tokenType == html.ErrorToken {
			// io.EOF means the whole input was consumed.
			if tokenizer.Err() == io.EOF {
				return
			}
			fmt.Printf("Error: %v", tokenizer.Err())
			return
		}
		fmt.Printf("Token: %v\n", html.UnescapeString(token.String()))
	}
}

// The markup stays on a single logical line so that no whitespace-only
// text tokens appear between the tags.
const sampleHtml = `<!DOCTYPE html><html><head>` +
	`<style> body {background-color: powderblue;} h1 {color: red;} p {color: orange;}</style>` +
	`<title>Sample HTML Code</title><script src="my-script.js">abc</script></head>` +
	`<body><h1>Main title</h1><p id="demo"></p><a href="https://dev.to/">Dev Community</a>` +
	`<script>document.getElementById("demo").innerHTML = "Hello JavaScript!";</script></body></html>`

Output:

Token: <!DOCTYPE html>
Token: <html>
Token: <head>
Token: <style>
Token:  body {background-color: powderblue;} h1 {color: red;} p {color: orange;}
Token: </style>
Token: <title>
Token: Sample HTML Code
Token: </title>
Token: <script src="my-script.js">
Token: abc
Token: </script>
Token: </head>
Token: <body>
Token: <h1>
Token: Main title
Token: </h1>
Token: <p id="demo">
Token: </p>
Token: <a href="https://dev.to/">
Token: Dev Community
Token: </a>
Token: <script>
Token: document.getElementById("demo").innerHTML = "Hello JavaScript!";
Token: </script>
Token: </body>
Token: </html>

Here, I simply checked for an error token (which is also how EOF is reported) and printed every other token as is.

We can also handle tokens by type, such as html.StartTagToken or html.EndTagToken, as listed above.

We can also branch on the element name, such as html, h1, script, or style:

// This loop replaces the one in main above; the imports and sampleHtml are unchanged.
tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
for {
	tokenType := tokenizer.Next()
	token := tokenizer.Token()
	if tokenType == html.ErrorToken {
		if tokenizer.Err() == io.EOF {
			return
		}
		fmt.Printf("Error: %v", tokenizer.Err())
		return
	}
	// For tag tokens, token.Data holds the tag name.
	switch token.Data {
	case "script":
		fmt.Printf("Script Token: %v\n", html.UnescapeString(token.String()))
	case "style":
		fmt.Printf("Style Token: %v\n", html.UnescapeString(token.String()))
	default:
		// This also includes the text content of <script> and <style> tags.
		fmt.Printf("Others: %v\n", html.UnescapeString(token.String()))
	}
}


Original Link: https://dev.to/dave3130/golang-html-tokenizer-5fh7
