June 18, 2022 04:26 pm GMT

Golang Html tokenizer

If you want to parse and extract HTML content in Go, the way PHP or JavaScript let you do by creating a new DOM document, there are several packages to choose from depending on your requirements. Some of the options I found are:

gohtml: gohtml (imported as golang.org/x/net/html) is an HTML5 tokenizer and parser implementation. Its parser returns a tree of nodes, and its tokenizer lets you extract content by token type, tag name, attributes, and text data.
goquery: goquery is built on the gohtml package and the CSS selector library Cascadia, which gives it more power over content selection and extraction. Its syntax is similar to jQuery's.
godom: godom is a library that lets you manipulate the DOM from Go in a way similar to JavaScript. It compiles Go code to JavaScript using GopherJS.

For now, I will use gohtml to demonstrate tokenization.

Tokenization is lexical analysis: splitting the input into tokens. HTML tokens include start tags, end tags, attribute names, and attribute values.

Tokenizing the document is the first step in parsing it into a tree of element and text nodes, similar to the DOM.

Types of HTML Tokens Supported:

html.StartTagToken: a start tag such as <p>
html.EndTagToken: an end tag such as </p>
html.SelfClosingTagToken: a self-closing tag such as <img.../>
html.TextToken: text content within a tag
html.CommentToken: an HTML comment such as <!-- comment -->
html.DoctypeToken: a document type declaration such as <!DOCTYPE html>

Example:

package main

import (
	"fmt"
	"io"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
	for {
		tokenType := tokenizer.Next()
		token := tokenizer.Token()
		if tokenType == html.ErrorToken {
			// io.EOF means the input ended normally; anything else is a real error.
			if tokenizer.Err() == io.EOF {
				return
			}
			fmt.Printf("Error: %v", tokenizer.Err())
			return
		}
		fmt.Printf("Token: %v\n", html.UnescapeString(token.String()))
	}
}

// The pieces are concatenated without newlines, so the tokenizer
// emits no whitespace-only text tokens between tags.
const sampleHtml = `<!DOCTYPE html><html><head>` +
	`<style> body {background-color: powderblue;} h1 {color: red;} p {color: orange;}</style>` +
	`<title>Sample HTML Code</title><script src="my-script.js">abc</script></head>` +
	`<body><h1>Main title</h1><p id="demo"></p><a href="https://dev.to/">Dev Community</a>` +
	`<script>document.getElementById("demo").innerHTML = "Hello JavaScript!";</script></body></html>`

Output:

Token: <!DOCTYPE html>
Token: <html>
Token: <head>
Token: <style>
Token:  body {background-color: powderblue;} h1 {color: red;} p {color: orange;}
Token: </style>
Token: <title>
Token: Sample HTML Code
Token: </title>
Token: <script src="my-script.js">
Token: abc
Token: </script>
Token: </head>
Token: <body>
Token: <h1>
Token: Main title
Token: </h1>
Token: <p id="demo">
Token: </p>
Token: <a href="https://dev.to/">
Token: Dev Community
Token: </a>
Token: <script>
Token: document.getElementById("demo").innerHTML = "Hello JavaScript!";
Token: </script>
Token: </body>
Token: </html>

Here, I simply checked for an error token (which is also how the tokenizer signals EOF) and printed every token as is, regardless of its type.

We can also filter on the token type, such as html.StartTagToken or html.EndTagToken, as listed above, or on the element name, such as html, h1, script, or style:

tokenizer := html.NewTokenizer(strings.NewReader(sampleHtml))
for {
	tokenType := tokenizer.Next()
	token := tokenizer.Token()
	if tokenType == html.ErrorToken {
		if tokenizer.Err() == io.EOF {
			return
		}
		fmt.Printf("Error: %v", tokenizer.Err())
		return
	}
	switch token.Data {
	case "script":
		fmt.Printf("Script Token: %v\n", html.UnescapeString(token.String()))
	case "style":
		fmt.Printf("Style Token: %v\n", html.UnescapeString(token.String()))
	default: // This will also include the contents of <script> and <style> tags
		fmt.Printf("Others: %v\n", html.UnescapeString(token.String()))
	}
}

Reference

Original Link: https://dev.to/dave3130/golang-html-tokenizer-5fh7
