HTML Parser Light

This component enables an application to parse HTML files and some other text formats. It is designed primarily for HTML in its variations (for example with ASP or other embedded foreign content). Other formats can be parsed as well, with some limitations and additional configuration.

Creation:

ProgID: newObjects.utilctls.HTMLParser
ClassID: {A253C277-F280-4349-B918-ED94BA6A1A28}
Free threaded version:
ProgID: newObjects.utilctls.HTMLParser.free
ClassID: {F63A8DFD-B830-4f18-8621-22D8D1AFF366}
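
A minimal sketch of creating the parser and doing a round trip, assuming the component is registered on the machine and the script runs under Windows Script Host (in ASP you would use Server.CreateObject and Response.Write instead of CreateObject and WScript.Echo). The HTML string is only sample data.

```vbscript
Dim parser, root, html

' Create the parser by its ProgID
' (use "newObjects.utilctls.HTMLParser.free" for the free threaded version)
Set parser = CreateObject("newObjects.utilctls.HTMLParser")

' Apply the predefined HTML configuration explicitly
parser.ApplySettings "HTML"

' Parse a string; the result is the root node of the document tree
Set root = parser.Parse("<HTML><BODY><P ALIGN=""LEFT"">Some text</P></BODY></HTML>")
WScript.Echo root.Info

' Construct is the reverse operation: it builds a string from the tree
html = parser.Construct(root)
WScript.Echo html
```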

Contents:

Members reference
The object tree document structure
Searching for nodes
Cloning/adding elements
Remarks
What kind of documents can be parsed?
What is it for and what is not for?
Auto-closing
HTML Encoding/Decoding

Members reference

Member	Syntax	Description
Parse	Set result = obj.Parse(string)	Parses the string as configured and returns the tree (see below) representing the document.
Construct	str = obj.Construct(tree [,bXml])	The reverse of Parse method. Constructs a string from an object tree.
Configuration members
ApplySettings	obj.ApplySettings preset_name	There are a few predefined configurations that save you from going through all the configuration properties and members. HTML - default HTML parsing. HTMLTEMPLATE - HTML, but only the elements that have an attribute named TEMPLATE are parsed; the rest of the content is treated as plain text. This works faster and is enough for applications interested in modifying only certain elements. ASP - parses only the ASP related elements and embedded code segments (as nodes); the rest is treated as plain text. Good for applications interested only in separating the code from the static content. HTMLASP - HTML plus ASP code embeddings. Good for applications interested in both the HTML structure and the code. However, the application needs to analyze the SCRIPT tags on its own (i.e. check if they have a RUNAT attribute).
AddTag	obj.AddTag tag_name [, autoclose]	Adds a tag to the list of the tags that are parsed and specifies if it is a self-closing one (default is 0 - not).
RemoveTag	obj.RemoveTag tag_name	Removes a tag from the list of the tags.
RemoveTags	obj.RemoveTags	Removes all the tags from the parse list.
AddStdHTMLTags	obj.AddStdHTMLTags	Adds all the HTML 3.2 tags to the parse list
SetSkipTag	obj.SetSkipTag tag_name, bSetRemove	Adds a tag to, or removes it from, the list of the "skip tags" depending on bSetRemove (True - add, False - remove). The "skip tags" list may contain only non-self-closing tags. The content inside these tags is not parsed, i.e. once the opening tag is found the parser searches for the closing tag and the content between them is treated as plain text. For instance the SCRIPT and the STYLE tags are by default skip tags for the HTML configuration.
AddEmbed	obj.AddEmbed start,end,name	Adds an embed definition. start and end are strings defining how the embedding starts and how it finishes. The name is the name of the embedding by which it can be identified in the tree (the Info property of a node). The embeddings are supposed to be elements that violate the general tag syntax - such as comments, ASP code and so on. The internal content in an embed is treated as plain text.
RemoveEmbed	obj.RemoveEmbed name	Removes an embed from the parse list
RemoveEmbeds	obj.RemoveEmbeds	Clears the embed parse list
codePage	obj.CodePage = x; x = obj.CodePage	The code page of the parsed content.
caseSensitive	obj.caseSensitive = boolval; x = obj.caseSensitive	Enables or disables case sensitive parsing. For instance HTML is parsed case insensitively.
knownTagsOnly	obj.knownTagsOnly = boolval; x = obj.knownTagsOnly	If True only the tags in the list (constructed with AddTag or using a preset) are parsed - the rest are treated as plain text. If False the parser will parse everything that looks like a tag, assuming it is not self-closing unless it finishes with />.
aspcCompatible	obj.aspcCompatible = boolval; x = obj.aspcCompatible	Work in ASP Compiler compatible style. This is rarely needed unless you want to use code built for the ASPC Compile Time Scripting.
commentTag	obj.commentTag = string; s = obj.commentTag	Sets/returns the comment tag start (usually !--). Rarely needed - it is more flexible to use embeds to define comments. Still, for plain HTML or XML this will work faster.
ignoreUnknownTags	obj.ignoreUnknownTags = boolval; x = obj.ignoreUnknownTags	If True any errors while parsing unknown tags (knownTagsOnly = False) are ignored and the content is treated as plain text. This may help you parse the important part of content which includes some elements with tag-like syntax that cannot be understood by the parser.
requiredAttribute	obj.requiredAttribute = attr_name; x = obj.requiredAttribute	Specifies the name of an attribute that must be present in order for the element to be included in the result tree. If empty, all the parsed elements are included. This is used for instance in the HTMLTEMPLATE preset, where only tags which have a TEMPLATE attribute are included in the tree and everything else is treated as plain text no matter if it is HTML or not. This is useful when you are not interested in the entire document structure, but in certain elements only.
omitEmptyValues	obj.omitEmptyValues = boolval; x = obj.omitEmptyValues	If set to True, Construct will not output the empty values of attributes. For example, if you have NOWRAP somewhere it will appear in the output as NOWRAP if this property is True and as NOWRAP="" if this property is False. Setting it to True is strongly recommended with HTML.
skipEmptyTexts	obj.skipEmptyTexts = boolval; x = obj.skipEmptyTexts	The default is False. If set to True the empty text/plain areas are not included in the document tree. An empty text area is an area consisting only of spaces, <CR>, <LF> and tab characters. By default these areas are included in the tree as nodes with node.Info = "text/plain" and a single unnamed element with a string value containing the actual characters met there (for example several spaces, a new line and then some tabs). When the property is set to True these elements are not present in the tree. Such elements can be considered garbage, yet they reflect the way the original document is formatted and may be important for some applications. Note that this setting affects the way the application works with the tree. For instance, an application that depends on this property being True may assume that a node contains only attributes and sub-nodes that are HTML elements. Such code may not be prepared to see text/plain elements between the HTML elements and may fail if you later set the property to False. Consider the current and potential future need for these elements before deciding to drop them.
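
When none of the presets fits, the configuration members from the table can be combined by hand. The following is only a sketch of one possible setup: the choice of tags, the embed name "asp" and passing 1 for the autoclose argument are assumptions made for illustration, and it is assumed that the script runs under Windows Script Host.

```vbscript
Dim parser
Set parser = CreateObject("newObjects.utilctls.HTMLParser")

' Start from empty tag and embed lists
parser.RemoveTags
parser.RemoveEmbeds

' Parse only the tags we are interested in; IMG is declared self-closing
parser.AddTag "TABLE"
parser.AddTag "TR"
parser.AddTag "TD"
parser.AddTag "IMG", 1

' Everything that is not in the list is treated as plain text
parser.knownTagsOnly = True
parser.caseSensitive = False

' ASP code blocks violate the tag syntax, so they are defined as an embed.
' The name "asp" is arbitrary; it is what node.Info will contain.
parser.AddEmbed "<%", "%>", "asp"

' Drop whitespace-only text nodes and write NOWRAP instead of NOWRAP=""
parser.skipEmptyTexts = True
parser.omitEmptyValues = True
```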

The object tree document structure

The Parse method uses VarDictionary objects to represent the document's tree. Each node has the following structure:

node.Info - contains the tag name
node(name_or_index) - contains an element inside this element (sub-node) or an attribute. Thus if it is an object it is a sub-node (element inside this element) and if it is not an object it is an attribute.
node.Key(name_or_index) - is the name of the attribute (if node(name_or_index) is not an object) or the ID of the sub-element (if node(name_or_index) is object). The ID is the value of the ID attribute if the element has such - if it has no ID attribute then its Key is empty. I.e. the parser treats the ID attribute in a bit specific manner, but this will not ruin anything if you are not interested in the ID.

The plain text segments - the content of a tag, the non-parsed parts and the content of the embeds - are represented as nodes in the same way as the tags, but with node.Info = "text/plain".

Thus to change the text somewhere in the document you need to find the appropriate text/plain node and change its only value (index 1 or empty name). For simplicity you can also change the Root property of the text/plain node instead of the first element - this will produce the same result when you use Construct. However, note that only the Construct method checks the Root property and, if it is non-empty, uses it instead of the first element of the text/plain node - this exists only to let you use slightly simpler code for changing texts.

The content of the parsed data is encapsulated in a node with Info = "document/root"

To illustrate this let us take this HTML file:

<HTML>
<HEAD>
<TITLE>Some title</TITLE>
</HEAD>
<BODY>
<P ALIGN="LEFT">Some text
<INPUT TYPE="TEXT" NAME="Text1" VALUE="Something">
</P>
</BODY>
</HTML>

Its tree will look as shown below. The letter in the first column is O or V and marks whether the entry is an Object or a Value. The notation reflects the expression you would use to access each node of the tree if you have called the Parse method like this:

Set root = o.Parse(content)

Where the content is a string containing the above HTML code.
O     root .Info="document/root"
O       root(1) .Info="HTML"
O         root(1)(1) .Info="text/plain"
V           root(1)(1)(1)=""
O         root(1)(2) .Info="HEAD"
O           root(1)(2)(1) .Info="text/plain"
V             root(1)(2)(1)(1) = ""
O           root(1)(2)(2) .Info="TITLE"
O             root(1)(2)(2)(1) .Info="text/plain"
V               root(1)(2)(2)(1)(1)="Some title"
O           root(1)(2)(3) .Info="text/plain"
V             root(1)(2)(3)(1)=""
O         root(1)(3) .Info ="text/plain"
V           root(1)(3)(1)=""
O         root(1)(4) .Info="BODY"
O           root(1)(4)(1) .Info="text/plain"
V             root(1)(4)(1)(1)=""
O           root(1)(4)(2) .Info="P"
V             root(1)(4)(2)(1)="LEFT" or root(1)(4)(2)("ALIGN")="LEFT"
O             root(1)(4)(2)(2) .Info="text/plain"
V               root(1)(4)(2)(2)(1)="Some text"
O             root(1)(4)(2)(3) .Info="INPUT"
V               root(1)(4)(2)(3)(1)="TEXT" or root(1)(4)(2)(3)("TYPE")="TEXT"
V               root(1)(4)(2)(3)(2)="Text1" or root(1)(4)(2)(3)("NAME")="Text1"
V               root(1)(4)(2)(3)(3)="Something" or root(1)(4)(2)(3)("VALUE")="Something"
O             root(1)(4)(2)(4) .Info="text/plain"
V               root(1)(4)(2)(4)(1)=""
O           root(1)(4)(3) .Info="text/plain"
V             root(1)(4)(3)(1)=""
O         root(1)(5) .Info="text/plain"
V           root(1)(5)(1)=""
By default only the attributes are named; the sub-elements are unnamed elements of each node. The only exception is a sub-element that has an ID attribute. In such a case it is named after the ID attribute's value in its parent's node collection. For example, if we add ID="MyTextBox" to the attributes of the INPUT in the above example HTML we will be able to access the text box element not only as:
Set theInput = root(1)(4)(2)(3)
but also as:
Set theInput = root(1)(4)(2)("MyTextBox")

Each language has a way to test whether a certain element of the collection is an object or a value. Thus when you need to enumerate only the sub-elements and not the attributes of a node you use code like this:

' Assuming the above example HTML root(1)(4) is the BODY element.
For I = 1 To root(1)(4).Count
    If IsObject(root(1)(4)(I)) Then
        ' Do whatever you want to do with each element of the BODY
    End If
Next

Of course in the real usage you will not specify the indices as numbers directly - instead you will search for an element and then dig in it or enumerate the tree and go down the branches you are interested in.

When walking the tree you can use statements like If node("NAME") = "Somename" on each node to test the value of the NAME attribute of the element (if the attribute is missing, empty is returned). Another frequently used expression is, for example, If node(N).Info = "A". Once IsObject(node(N)) has returned True you know that node(N) is a sub-node/element, and node(N).Info gives you the element name - an anchor in this sample.
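
The IsObject test and the Info and Key properties are enough to write a generic recursive walk. The sketch below prints every sub-node and value under a node; the DumpTree name is made up for this illustration, and it is assumed that the script runs under Windows Script Host with the component registered.

```vbscript
' Recursively prints the element names and the values found under a node
Sub DumpTree(node, indent)
    Dim i
    For i = 1 To node.Count
        If IsObject(node(i)) Then
            ' A sub-node: print its Info (tag name or text/plain) and go deeper
            WScript.Echo indent & node(i).Info
            DumpTree node(i), indent & "  "
        Else
            ' A value: an attribute, or the text held by a text/plain node
            WScript.Echo indent & node.Key(i) & " = " & node(i)
        End If
    Next
End Sub

Dim parser, root
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
parser.ApplySettings "HTML"
Set root = parser.Parse("<P ALIGN=""LEFT"">Some text</P>")
DumpTree root, ""
```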

Still, walking through the whole tree requires too much work - for you and for the machine. Thus such techniques are usually applied to small parts of it. For example, enumerating the rows <TR> of a <TABLE> is a good candidate for this kind of exploration, but only after you have obtained the <TABLE> in some more efficient way (it can be deep in the document structure and there could be many tables).

Searching for nodes

As stated above, the nodes in the document tree are VarDictionary objects, on which you can use any of the VarDictionary object's methods and properties. There are many possible uses of them, but the Find methods deserve special attention.

Using FindByValue:

This method enables you to find recursively all the sub-nodes of a node that have a specific value for a specific attribute. For example in the example above we may want to find the INPUT by its name. We can do this:

Set elements = root.FindByValue("NAME","Text1")

The returned result is again a VarDictionary collection that contains references to all the found elements that match the criteria. The search is done in depth, thus we can search for an element over the entire tree without the need to dig into a certain branch first. Once the element is found we can search for other, more specific elements under it if it has sub-elements too.

In this example case we will have the INPUT object in elements(1)

Note that FindByValue has optional parameters that specify from which match to start, how many elements can be found at most and how deep the search will go - pay attention to them in order to refine the search as needed.

Over the returned result you can perform a cycle and inspect each found element for something else, perform some other actions on some or all of the found elements, add new sub-elements to them, or alter and add attributes. For example, let's write a little code that adds the ID attribute we used above to the INPUT in the above example:

Set elements = root.FindByValue("NAME","Text1")
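For J = 1 To elements.Count
    elements(J)("ID") = "MyTextBox"
Next

The loop itself works no matter how many elements are returned. Note, however, that by default the Find methods return only the first match (see the note on the optional arguments below), so to append ID="MyTextBox" to every element that has a NAME attribute with value "Text1" you need to pass the optional arguments, as in the next example.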

If it is possible that there are other elements with NAME="Text1" in the HTML and we want to be sure that only INPUT elements of TYPE="TEXT" will be affected we can change the code this way:

Set elements = root.FindByValue("NAME","Text1",1,10000)
For J = 1 To elements.Count
    If elements(J).Info = "INPUT" And elements(J)("TYPE") = "TEXT" Then
        elements(J)("ID") = "MyTextBox"
    End If
Next

Using FindByInfo:

This method searches throughout the tree for elements with an Info property containing a specified value. Thus we can use it to find all the elements of a certain type. For example all the paragraphs (P elements).

Set elements = root.FindByInfo("P",1,10000)
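
FindByInfo is also a convenient way to reach a text node in order to change it, as described for the text/plain nodes earlier. The sketch below replaces the title text through the Root property; the sample HTML is made up for the illustration and it is assumed that the script runs under Windows Script Host with the component registered.

```vbscript
Dim parser, root, found, titleNode, textNode
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
parser.ApplySettings "HTML"
Set root = parser.Parse("<HTML><HEAD><TITLE>Some title</TITLE></HEAD></HTML>")

' By default only the first match is returned, which is all we need here
Set found = root.FindByInfo("TITLE")
If found.Count > 0 Then
    Set titleNode = found(1)
    ' This TITLE has no attributes, so its first element is expected
    ' to be the text/plain node that holds the title text
    If titleNode.Count > 0 Then
        If IsObject(titleNode(1)) Then
            Set textNode = titleNode(1)
            If textNode.Info = "text/plain" Then
                ' Construct uses Root instead of the node's first element
                textNode.Root = "New title"
            End If
        End If
    End If
End If

WScript.Echo parser.Construct(root)
```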

Usually a combination of both methods is the best way to find what we are looking for. For instance, if we have many tables we can name them somehow. There are a lot of options - from using an ID="tablename" attribute, through using a TITLE="name" or NAME="somename" attribute, to using a custom non-standard HTML attribute for the purpose. No matter which option we use, the search works the same way: find all the tables first, then search in the result for the tables with a certain attribute and value. For example:

Set tables = docroot.FindByInfo("TABLE",1,10000)
Set tablesinquestion = tables.FindByValue("TITLE","DataTable",1,10000)

This assumes that we have some tables marked with the attribute TITLE="DataTable" and we want to find them and do something special with each such table. For instance we may want to fill the table with some data from a database. Another attribute may contain the query or other information which tells the application which data to show in that table. See Cloning/adding elements below.

Note that all the FindXXXX methods have optional arguments after the criterion arguments. They are: the first found element to return, the maximum number of elements to find and the search depth. By default only the first found element is returned in the resulting collection (if any is found, of course). Thus when we expect to find more than one element we set a sensible maximum limit, or a very big limit to allow all the matching elements to be found. If the HTML is big and we want to optimize, the search can be done in series of several elements by changing the first element and the maximum number of elements. The depth is unlimited by default, thus it deserves attention only if optimization is possible by specifying a lower depth. For example, if we know that the tables we search for are not deeper than 5 levels in the document tree, specifying this depth saves iterations because the FindXXXX method will not look deeper than 5 levels at all.

The found elements are references to the elements in the tree, thus by changing some of their attributes or sub-elements we change the tree at the location where the element actually is.

The remaining method, FindByName, should be used more carefully. In contrast to the other two it may return both nodes and attributes in the found collection. Thus if we perform Set elements = root.FindByName("TITLE",1,10000) we will receive all the TITLE attributes in the document and all the elements that have ID="TITLE". This is most often inconvenient unless the method is used over a branch where we can be sure that the searched name is used only by element IDs and no attributes with that name exist. It is recommended to avoid this method except in scenarios where the HTML parser is used to provide some template functionality with well known and well-formed element IDs.
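
If FindByName has to be used on mixed content, the result can be separated with the same IsObject test that is used when enumerating a node. This is only a sketch with made-up sample HTML, assuming Windows Script Host and a registered component.

```vbscript
Dim parser, root, found, i
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
parser.ApplySettings "HTML"
Set root = parser.Parse("<DIV ID=""TITLE"">Heading</DIV><IMG TITLE=""A picture"" SRC=""a.gif"">")

' Ask for up to 10000 matches, starting from the first one
Set found = root.FindByName("TITLE", 1, 10000)

For i = 1 To found.Count
    If IsObject(found(i)) Then
        ' An element that was named after its ID attribute
        WScript.Echo "Element: " & found(i).Info
    Else
        ' The value of an attribute named TITLE
        WScript.Echo "Attribute value: " & found(i)
    End If
Next
```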

Cloning/adding elements

One of the nicest features of the VarDictionary object is that it can be cloned. Thus in the primary usage of this object - HTML templates - we can work with a complete pre-designed HTML page. For example, we put in each table that will be filled by the application one row with the specified colors, styles etc. Then we can find the table, get the row, remove it from the table and begin to add rows by cloning the sample row once for each row we want to add. How would this look?

Assume we have marked the table with unique ID="ReportTable".

' We load thedoccontent from a file, db or somewhere else and parse it
Set doc = Parser.Parse(thedoccontent)
' We will often need to add text inside elements of the tree.
' So it is convenient to create one text/plain node and then
' clone it each time we want to add text somewhere
Set textnode = doc.CreateNew
textnode.Info = "text/plain"
' The same node can be used in all the operations - so if we have more than
' the table we discussed we can use the same node as a template.
' Do something else ...
' For brevity we refer directly to the first found element.
' If it is not there an error (object required) will occur, which will
' be a good enough indication that the template is corrupted or wrong.
' As this will not happen after the development is finished we can skip the
' more precise error checking.
Set table = doc.FindByValue("ID","ReportTable")(1)
' We assume that the sample row has an ID="TemplateRow" attribute
Set samplerow = table("TemplateRow")
' We remove the sample row from the table - we do not want to show empty unused rows there
table.Remove("TemplateRow")
' Now it is time to generate rows from some data. For the sake of the example let's
' suppose we have an SQLite database and we query it
Set data = db.Execute("SELECT * FROM SomeTable")
For r = 1 To data.Count
    ' We create a clone of the sample row we got earlier
    Set row = samplerow.Clone
    ' Now let's make our work even simpler - suppose we marked each <TD>
    ' with the attribute FIELD="somefieldname" to indicate which field from the
    ' data set obtained from the database should be put in that cell.
    For c = 1 To data(r).Count
        Set cell = row.FindByValue("FIELD",data(r).Key(c))(1)
        Set text = textnode.Clone
        text.Root = data(r)(c)
        cell.Add "", text
    Next
    ' Finally we append the filled row to the table as an unnamed element
    table.Add "", row
Next

This code will work fine even if we have one or more heading rows in the table. We achieved that by setting a specific ID to the row that must be used as a template for all the rows we want to list in the table. This way it has a name in the table's node and can be removed directly by name, no need to search for it, assume anything that may change if the template design changes etc. Certainly there are other ways to do the same and sometimes there will be more efficient or convenient ways to deal with similar tasks.

We also helped ourselves by putting a FIELD attribute in the cells. We can do without it if we know the number of the cells and the data set, and we know exactly what to put in which cell by index or by another criterion.

When pre-defined chunks of HTML are needed often it is useful to create a function that clones certain template element, and fills it with the specific data from the function arguments.
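
Such a function could look like the sketch below. MakeFilledClone is a hypothetical helper name and the list markup is made up for the illustration; only Clone, CreateNew, Add, Remove, Root and FindByValue come from the text above. It is assumed that the script runs under Windows Script Host with the component registered.

```vbscript
' Hypothetical helper: returns a copy of templateNode with valueText
' appended to it as a text/plain sub-node
Function MakeFilledClone(templateNode, valueText)
    Dim copy, textNode
    Set copy = templateNode.Clone
    ' CreateNew gives a node configured like the one it is called on
    Set textNode = copy.CreateNew
    textNode.Info = "text/plain"
    textNode.Root = valueText
    copy.Add "", textNode
    Set MakeFilledClone = copy
End Function

Dim parser, root, list, sample, names, i
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
parser.ApplySettings "HTML"
Set root = parser.Parse("<UL ID=""List""><LI ID=""Item""></LI></UL>")

' Take the sample item out of the list and keep it as a template
Set list = root.FindByValue("ID", "List")(1)
Set sample = list("Item")
list.Remove("Item")

' Add one filled copy of the template per value
names = Array("First", "Second", "Third")
For i = 0 To UBound(names)
    list.Add "", MakeFilledClone(sample, names(i))
Next

WScript.Echo parser.Construct(root)
```

Note that every copy keeps the ID attribute of the template element; remove or change it on the copy if unique IDs matter in the output.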

As you already have guessed the Clone method clones not only the node but also all its contents. Thus clone allows us to copy part of the tree and put it somewhere else. Removing a node from the tree will not remove it from the memory if we already saved it in a variable. This way we can get something we want to copy many times from the document, remove it from its original location and begin to clone and put it wherever we need it.

CreateNew - we used it above to create a plain text node. Why not create a VarDictionary directly (using Server.CreateObject ....)? CreateNew creates an empty VarDictionary with the same features as the one on which it is invoked. Thus by calling CreateNew on any node we are sure that the new object will be configured with the same behavior as the node on which the method has been called. VarDictionary allows behavior configuration, and the best approach for us is to ensure that all the nodes in the tree have the same behavior. By using CreateNew we copy the behavior without the need to adjust it manually to match the other nodes. Furthermore, this ensures that the object will be created directly and faster in the same COM apartment as the other node (this may be important in some applications).

Remarks

It is often a good idea to set the skipEmptyTexts property of the parser to True and thus strip the document tree of all the hollow text elements that are not relevant to the document structure. For instance, one such element will appear between any two sibling nodes if the document is formatted with new lines after the tags. This is a lot of garbage, which can amount to more than half of the nodes in the tree (depending on the document formatting). However, be aware that as a result the Construct method will output HTML with no new lines between the tags, which is not very readable for a human.

What kind of documents can be parsed?

With appropriate configuration the parser can be used with most HTML and XML documents. Note that some of the properties may need some tuning if an error occurs. A good decision is to set knownTagsOnly to True in order to ignore anything the parser cannot understand.

The parser can work with documents that use the corresponding Windows code page encoding, but it is not guaranteed to work with UNICODE or UTF-8 HTML/XML documents for example. The newer Windows versions support UTF-8 code page, but you should know that the Parser itself does not perform its own handling and using it will make the application dependent on the Windows versions that have the feature.

The output can be translated to another code page, but this rarely makes sense except for templates that do not contain constant texts (translating the code page will not translate the text from one language to another - the result will be something unreadable in its place). When doing this do not forget to change or add the META tag that marks the document encoding.

What is it for and what is not for?

The parser can be used with arbitrary pages in a known language - for example to index them. Also, by calling it twice - first to find only the META encoding tag and then to parse the whole document with the appropriate code page - you can cope with several different languages and encodings, but not all. Thus search engine like usage is somewhat limited, but it is enough for a wide variety of purposes such as intranets, well known sites and so on.

Although some configuration is needed almost any correct HTML or XML can be parsed and re-generated back with some changes. Most often it is possible to parse the documents partially - by specifying only some tags or embeds that are actually of interest and treat the rest as plain text. Depending on the usage you can optimize the performance by configuring the parser to work only with the elements you actually need. Obviously for applications with extensive parsing the performance improvements can be drastic.

The object is named "Light" because it uses VarDictionary for the document tree, thus avoiding the need to supply specialized objects for the document tree. This is a plus and a minus. Positives: fewer objects to learn, no matter if the document is HTML, XML or even something else with some HTML/XML like inclusions. Negatives: the object model is nothing like the XML DOM or DHTML - only the structure of the tree is the same, but you work with universal objects which do not have members named after the HTML or XML standard.

Auto-closing

The parser uses auto-closing of tags. Thus if a closing tag is found but it is not the closing tag corresponding to the last open tag, the operation will not fail. Instead the last open tag is automatically closed, then the closing tag is compared to the open tag which contains it, and so on until a match is found. Thus:

<TABLE>
<TR>
<TD>
<P> something
</TD>
<TD>
<P> something
</TD>
</TABLE>

will appear as intended in the document tree:

TABLE
  TR
    TD
      P
    TD
      P

This is the behavior of most browsers today, so the way the parser treats incorrect HTML reflects the typical browser behavior.
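
The auto-closing rule can be modeled with a simple stack of open tags. The code below is not the parser's implementation, only a toy model of the rule as described above. It works on a ready-made list of tag names (closing tags start with "/"), and it silently ignores a closing tag that matches no open tag, which is an assumption made for this sketch. It runs under Windows Script Host and does not need the component.

```vbscript
' Returns an indented outline of the nesting produced by auto-closing
Function AutoCloseOutline(tokens)
    Dim openTags(), depth, i, j, t, tagName, outline
    ReDim openTags(UBound(tokens))
    depth = 0
    outline = ""
    For i = 0 To UBound(tokens)
        t = tokens(i)
        If Left(t, 1) = "/" Then
            tagName = Mid(t, 2)
            ' Walk back through the open tags until the matching one is found.
            ' Setting depth to its position closes it together with
            ' everything that was opened inside it.
            For j = depth - 1 To 0 Step -1
                If openTags(j) = tagName Then
                    depth = j
                    Exit For
                End If
            Next
        Else
            ' An opening tag becomes a child of the innermost open tag
            outline = outline & String(depth * 2, " ") & t & vbCrLf
            openTags(depth) = t
            depth = depth + 1
        End If
    Next
    AutoCloseOutline = outline
End Function

Dim tokens
tokens = Array("TABLE", "TR", "TD", "P", "/TD", "TD", "P", "/TD", "/TABLE")
WScript.Echo AutoCloseOutline(tokens)
```

For the token sequence of the table example, each </TD> closes the P opened inside the cell as well, so the second TD ends up as a sibling of the first one.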

HTML Encoding/Decoding

The parser does not decode or encode any texts when they go from or to the document tree. Thus when you need to HTML decode or encode certain texts you must use an HTML Encoder object to perform the Encode/Decode. One such object is usually enough for the entire application - you call it wherever you need it.