HTML Parsing Notes

ThinkAutomation includes a number of options to make parsing HTML content easier.

When an incoming message is an email whose body has no plaintext portion, the %Msg_Body% built-in variable is automatically set to a plaintext version of the HTML body. The %Msg_HTML% built-in variable contains the original HTML.

The Set Variable and Text Operation actions both have options to convert HTML to plaintext, XML or JSON. The conversions are identical in both actions, but the Text Operation action can also preview the result, allowing you to test the conversion.

HTML To Plain Text

Converts any HTML content into plaintext. All HTML tags are removed, leaving only the text content that would have been displayed.

For example:

<html>
    <head>
        <title>This is a test</title>
        <meta http-equiv="Content-Language" content="en-us">
        <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    </head>
    <body>
        <h1>This is the heading</h1>
        <p>This is a sample paragraph with <b>bold</b> text<br> and a line below.
        <p>Sample link to <a href="http://www.google.com/">Google</a></p>
    </body>
</html>

Becomes:

THIS IS THE HEADING

This is a sample paragraph with bold text
and a line below.

Sample link to Google [http://www.google.com/]
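
The mapping above can be approximated outside ThinkAutomation. The following is a minimal Python sketch, not ThinkAutomation's implementation. The names PlainTextSketch and html_to_text are hypothetical, and the rules it applies (uppercase headings, link targets appended in square brackets, a line break for each <br>) are assumptions inferred only from the sample output shown here.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = HEADINGS | {"p", "div"}
HIDDEN = {"head", "script", "style"}  # content a browser would not display


class PlainTextSketch(HTMLParser):
    """Collects displayed text; rules are guessed from the sample output."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_depth = 0
        self.in_heading = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN:
            self.hidden_depth += 1
        if self.hidden_depth:
            return
        if tag in BLOCKS:
            self.parts.append("\n\n")          # blank line before each block
            self.in_heading = tag in HEADINGS
        elif tag == "br":
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in HIDDEN:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in HEADINGS:
            self.in_heading = False
        elif tag == "a" and self.href:
            self.parts.append(f" [{self.href}]")  # assumed link format
            self.href = None

    def handle_data(self, data):
        if self.hidden_depth:
            return
        text = re.sub(r"\s+", " ", data)       # collapse runs of whitespace
        self.parts.append(text.upper() if self.in_heading else text)


def html_to_text(html):
    parser = PlainTextSketch()
    parser.feed(html)
    parser.close()
    lines = [" ".join(line.split()) for line in "".join(parser.parts).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


SAMPLE = """<html>
    <head>
        <title>This is a test</title>
    </head>
    <body>
        <h1>This is the heading</h1>
        <p>This is a sample paragraph with <b>bold</b> text<br> and a line below.
        <p>Sample link to <a href="http://www.google.com/">Google</a></p>
    </body>
</html>"""

print(html_to_text(SAMPLE))
```

The sketch tolerates the unclosed <p> in the sample because it reacts only to opening tags when starting a new block.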

HTML To XML

Converts any HTML into well-formed XML for easier extraction of data. All formatting tags, styles, images and scripts are removed.

For example, the above HTML would be converted to:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <html>
        <head>
            <title>
                <text>This is a test</text>
            </title>
        </head>
        <body>
            <h1>
                <text>This is the heading</text>
            </h1>
            <p>
                <text>This is a sample paragraph with bold text and a line below.</text>
            </p>
            <p>
                <text>Sample link to</text>
                <a href="http://www.google.com/">
                    <text>Google</text>
                </a>
            </p>
        </body>
    </html>
</root>
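
To illustrate why the XML form is easier to extract from, here is a small Python sketch that reads values out of XML shaped like the output above, using the standard library's ElementTree. It runs outside ThinkAutomation and is only an illustration; the variable names are hypothetical, and the XML is a shortened copy of the example.

```python
import xml.etree.ElementTree as ET

XML_OUTPUT = """<?xml version="1.0" encoding="utf-8"?>
<root>
    <html>
        <body>
            <h1>
                <text>This is the heading</text>
            </h1>
            <p>
                <text>Sample link to</text>
                <a href="http://www.google.com/">
                    <text>Google</text>
                </a>
            </p>
        </body>
    </html>
</root>"""

# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(XML_OUTPUT.encode("utf-8"))

# Text always sits in a <text> child, so a fixed path reaches it.
heading = root.findtext("html/body/h1/text")

# Collect every link as a (href, label) pair, wherever it appears.
links = [(a.get("href"), a.findtext("text")) for a in root.iter("a")]

print(heading)
print(links)
```

Because every element's text is wrapped in its own <text> child, a path such as html/body/h1/text does not have to deal with mixed content.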

HTML To JSON

Converts any HTML to JSON text. The HTML is first converted to XML, and the XML is then converted to JSON.

For example, the above HTML would be converted to:

{
  "html": {
    "head": {
      "title": {
        "text": "This is a test"
      }
    },
    "body": {
      "h1": {
        "text": "This is the heading"
      },
      "p": [
        {
          "text": "This is a sample paragraph with bold text and a line below."
        },
        {
          "text": "Sample link to",
          "a": {
            "@href": "http://www.google.com/",
            "text": "Google"
          }
        }
      ]
    }
  }
}

Character data, comments, whitespace and significant whitespace nodes are accessed via #cdata-section, #comment, #whitespace and #significant-whitespace respectively.
Multiple nodes with the same name at the same level are grouped together into an array.
Empty elements are null.
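
The XML-to-JSON step can be sketched in Python to show how these rules interact. This is an assumption-laden illustration, not ThinkAutomation's converter. The function name element_to_json is hypothetical, and the sketch covers only attributes (prefixed with @, as in the example), array grouping, and null for empty elements. It ignores comments, CDATA, whitespace nodes, and mixed content.

```python
import xml.etree.ElementTree as ET


def element_to_json(elem):
    """Convert an element's attributes and children to JSON-style values."""
    result = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_json(child)
        if child.tag not in result:
            result[child.tag] = value
        elif isinstance(result[child.tag], list):
            result[child.tag].append(value)        # third or later sibling
        else:
            result[child.tag] = [result[child.tag], value]  # second sibling
    if not result:
        # Leaf element: its text, or None (JSON null) when empty.
        return (elem.text or "").strip() or None
    return result


XML_INPUT = """<root>
    <html>
        <body>
            <h1><text>This is the heading</text></h1>
            <p><text>First paragraph</text></p>
            <p>
                <text>Sample link to</text>
                <a href="http://www.google.com/"><text>Google</text></a>
            </p>
        </body>
    </html>
</root>"""

converted = element_to_json(ET.fromstring(XML_INPUT))
print(converted)
```

Note that a single <h1> becomes an object while the two <p> elements become an array, so code reading the JSON should be prepared for either shape.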

Once converted to JSON, you can use the Extract Field or Read JSON Document actions to extract data at specific paths.
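
For comparison, the same kind of path-based lookup can be sketched in Python against the JSON shown above. This does not represent how the Extract Field or Read JSON Document actions work internally. The helper as_list is a hypothetical convenience for the case where a name maps to a single object rather than an array.

```python
import json

JSON_OUTPUT = """{
  "html": {
    "body": {
      "h1": { "text": "This is the heading" },
      "p": [
        { "text": "This is a sample paragraph with bold text and a line below." },
        {
          "text": "Sample link to",
          "a": { "@href": "http://www.google.com/", "text": "Google" }
        }
      ]
    }
  }
}"""


def as_list(value):
    """Return value as a list: [] for null, [value] for a single node."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


body = json.loads(JSON_OUTPUT)["html"]["body"]

# Path html.body.p[*].text
paragraph_texts = [p["text"] for p in as_list(body["p"])]

# Path html.body.p[1].a.@href
href = as_list(body["p"])[1]["a"]["@href"]

print(paragraph_texts)
print(href)
```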