Bristle Software XML Tips

Details of Tips:

What is XML?

Original Version: 5/11/1999
Last Updated: 2/12/2013
Applies to: XML 1.0+

XML (eXtensible Markup Language) is syntactically similar to HTML (HyperText Markup Language). They both consist basically of regular text marked up with tags and attributes. However, the purposes of XML and HTML are very different. The purpose of HTML is to describe the layout and physical appearance of the embedded text, to be displayed as a Web page. The purpose of XML is to describe the structure and semantics of the embedded text, to be manipulated programmatically as data. With HTML, you use the one set of predefined tags with their predefined display-oriented meanings. With XML, you choose a set of tags that are in common use in your application area, or invent your own set.

Prior to XML, you could pass data between programs in proprietary formats, or as databases, or as CSV (comma-separated values) files. The producer and consumer of the data generally needed a common understanding of how the data was packaged so that the consumer could correctly interpret what the producer generated. To send an address, there had to be agreement about how to recognize the street, city, state, etc. as parts of the address. For example, it might be agreed that the 3rd field in a CSV file was the state, as:

	"123 Main St", "Malvern", "PA", "19355"

With XML, the data stream is self describing:

	<address>
		<street>123 Main St</street>
		<city>Malvern</city>
		<state>PA</state>
		<zip>19355</zip>
	</address>

This structured approach lends itself to reusable parsing tools, automatic data transformations, debugging tools, and other reuse opportunities. The consumer no longer has to write a parser to interpret the incoming data stream. It simply uses an existing XML parser, asking it to find the value of "state", for example.
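
As a rough illustration of the difference, the sketch below reads the same address both ways in Python, using only the standard library. The variable names are made up for this example and are not part of any particular application.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The CSV form: the consumer must already know that field 3 is the state.
csv_text = '"123 Main St", "Malvern", "PA", "19355"\n'
row = next(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
state_from_csv = row[2]

# The XML form: the consumer asks for the value by name.
xml_text = """
<address>
    <street>123 Main St</street>
    <city>Malvern</city>
    <state>PA</state>
    <zip>19355</zip>
</address>
"""
address = ET.fromstring(xml_text)
state_from_xml = address.findtext("state")

print(state_from_csv, state_from_xml)
```

If a country field were later inserted ahead of the state, the CSV lookup by position would silently return the wrong field, while the XML lookup by name would be unaffected.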

The tags and attributes can be formally defined via a DTD (Document Type Definition) or a more modern XML Schema. These allow you to specify the acceptable nesting of tags (street must be inside address), which tags and attributes are optional, default values, data types, etc. DTDs are not new to XML; they come from SGML, the parent of both XML and HTML. There is a DTD that defines the HTML syntax also. (In loose terms, HTML looks like an instance of XML with a specific set of tags and attributes. Strictly, HTML is an application of SGML, and XHTML is its reformulation in XML.) XML Schemas are a more powerful replacement for DTDs, with several advantages that will be discussed in future tips.

Don't underestimate the value of XML. It is the single most important development in the computer field since ASCII text. In the long run, it will have more impact than even HTML. HTML is a less than perfect, but very widely adopted standard for marking up documents for display. Its popularity made the World Wide Web possible. XML is a less than perfect, but very widely adopted standard for the exchange of data. As with HTML, the value of the standard is that it is good enough for now, and already wildly popular. Henceforth, we will always have a standard. The current version of XML may be replaced soon and often, with successively better standards, but there will never again be a day where there is no standard for exchange of data between different types of computer systems.

11/18/2011 Update: JSON is the new XML.

2/12/2013 Update: Companies are dropping support for XML in favor of JSON:

https://dev.twitter.com/docs/api/1.1/overview#JSON_support_only

--Fred

Differences from HTML syntax

Last Updated: 2/25/2001
Applies to: XML 1.0+

Here are some ways in which XML is syntactically different from HTML. Also noted are suggestions for how to write your HTML (even with older browsers) to be more like XML so that it is likely to comply with the emerging XHTML standard.

End tags required.

HTML allows <p> without </p>. XML always requires an end tag, though it can be included in the begin tag using the <xxx /> format.

HTML Suggestion: Use the <xxx /> format for tags that don't require an end tag. Be sure to put a space before the slash. Otherwise, old browsers may ignore the tag entirely.

Properly nested tags.

HTML allows tags to be improperly nested, as:
<center><b>...</center></b>
XML requires proper nesting. Each tag must be ended before any enclosing tag is ended, as:
<center><b>...</b></center>

HTML Suggestion: Always nest tags properly.

Case sensitive.

XML is case sensitive. The tag <address> is not the same as the tag <ADDRESS>. XML itself does not mandate a case, but the common convention (and the XHTML requirement) is to use all lowercase letters for tags and attributes.

HTML Suggestion: Always use lowercase.

Quotes required around attribute values.

HTML allows:
 <tag attribute=value>
XML always requires quotes (single or double), as:
 <tag attribute="value">
or:
 <tag attribute='value'>

HTML Suggestion: Always use quotes, even around numbers and percent values.

Attribute values required.

HTML allows:
 <tag attribute>.
XML always requires each attribute to have a value, as:
 <tag attribute="value">
or:
 <tag attribute='value'>

HTML Suggestion: Provide a dummy value when no value is required.
Example: <option selected='dummyvalue'>
instead of: <option selected>
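
The rules so far (end tags, nesting, case, quoted values, attribute values) can be checked with any conforming XML parser. Here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree. The helper name is_well_formed and the sample fragments are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each rule gets one fragment that breaks it and one that follows it.
samples = {
    "missing end tag":      "<p><br></p>",
    "empty-element tag":    "<p><br /></p>",
    "improper nesting":     "<center><b>text</center></b>",
    "proper nesting":       "<center><b>text</b></center>",
    "mismatched case":      "<address></ADDRESS>",
    "unquoted value":       "<tag attribute=value></tag>",
    "quoted value":         "<tag attribute='value'></tag>",
    "attribute, no value":  "<option selected></option>",
    "attribute with value": "<option selected='selected'></option>",
}

for label, text in samples.items():
    print(f"{label:22} {is_well_formed(text)}")
```

Running HTML fragments through a check like this is one way to see how far a page is from being acceptable to an XML parser.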

Whitespace-sensitive.

HTML parsers (browsers) ignore almost all whitespace. This includes whitespace between tags, as:
```
	<hr>	
```
as well as whitespace inside of tags, as:
```
	<hr
		width='50%'
	>
```
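XML parsers accept whitespace inside tags, but they must pass whitespace between tags through to the application as character data, so line breaks and indentation can show up as extra text nodes.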

HTML Suggestion: This is a theoretical problem that has never been a problem for me (or anyone I know) in actual practice. All XML parsers I've used seem relatively whitespace insensitive. Don't worry about it for now. Using whitespace for indentation and line breaks to format your HTML more readably is far too valuable to give up without a good reason.
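
What a parser does with whitespace between tags can be seen in a small sketch using Python's standard xml.dom.minidom. The helper name child_summary and the sample documents are invented for this illustration.

```python
from xml.dom import minidom

pretty = "<address>\n    <city>Malvern</city>\n    <state>PA</state>\n</address>"
compact = "<address><city>Malvern</city><state>PA</state></address>"

def child_summary(text):
    """List the root's children as element names or repr() of text nodes."""
    root = minidom.parseString(text).documentElement
    return [
        node.tagName if node.nodeType == node.ELEMENT_NODE else repr(node.data)
        for node in root.childNodes
    ]

print(child_summary(compact))
print(child_summary(pretty))
```

The indented document has whitespace-only text nodes between its elements. Code that looks up elements by name does not notice them, but code that walks children by position does.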

Predefined entities.

HTML has hundreds of predefined "entities", including:
```
	&nbsp;	non-breaking space
	&copy;	copyright symbol
	&amp;	ampersand (&)
	&lt;	less than (<)
	etc...
```
XML has only five predefined entities:
```
	&amp;	ampersand (&)
	&quot;	double quote (")
	&apos;	apostrophe (')
	&lt;	less than (<)
	&gt;	greater than (>)
```
These are exactly the five that you need because they are a fundamental part of the XML syntax. Since XML is the "eXtensible Markup Language", you can define as many additional entities as you want in your DTD. (XML Schema has no way to declare entities.) If you want to define some of the standard HTML entities, you can simply copy them into your own DTD from the DTD that defines HTML:
http://www.w3.org/TR/html401/sgml/entities.html

If you don't want to bother creating a DTD, there is an easier way. In both XML and HTML, you can use the following to insert any special character via its numeric code:
```
	&#nn;	where nn is a decimal number
	&#xnn;	where nn is a hexadecimal number
```
For example, in both HTML and XML, you can use &#160; instead of &nbsp; and &#169; instead of &copy;. For a complete list of predefined HTML entities and their numeric codes, see:
http://hotwired.lycos.com/webmonkey/reference/special_characters/
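
A short sketch of how this plays out in practice, assuming Python's standard xml.etree.ElementTree as the parser. The element name t is arbitrary.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no DTD.
predefined = ET.fromstring("<t>&amp; &quot; &apos; &lt; &gt;</t>").text

# Numeric references work for any character: 169 decimal is A9 hex (copyright).
numeric = ET.fromstring("<t>&#169; &#xA9;</t>").text

# An HTML-only entity such as &nbsp; is undefined in plain XML.
try:
    ET.fromstring("<t>&nbsp;</t>")
    nbsp_accepted = True
except ET.ParseError:
    nbsp_accepted = False

print(predefined)
print(numeric)
print(nbsp_accepted)
```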

Thanks to Howard Kapustein for reminding me about this difference.

--Fred

Tags vs Attributes

Last Updated: 7/5/2000
Applies to: XML 1.0+

Here are some things to keep in mind when choosing between a tag and an attribute to store a piece of data in XML:

You can usually get away with either. For example:

	<address city="Tucson" state="AZ"></address>

or:

	<address>
		<city>Tucson</city>
		<state>AZ</state>
	</address>

Attributes are the leaves of the tree. You cannot nest an attribute or a tag inside of an attribute. Only use an attribute when you are positive that the data is atomic and will never need to be further described.

Attributes don't allow multiple values. For example, you can write:
```
	<parent>
		<child>Billy</child>
		<child>Mary</child>
	</parent>
```
but not:
```
	<parent child="Billy" child="Mary"></parent>
```
and you don't want to get stuck having to parse the multiple values out of a single attribute, as:
```
	<parent child="Billy,Mary"></parent>
```
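
The three cases above can be tried against a real parser. This is a minimal sketch using Python's standard xml.etree.ElementTree. The variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Repeated child elements are fine; findall returns every match.
parent = ET.fromstring("<parent><child>Billy</child><child>Mary</child></parent>")
children = [child.text for child in parent.findall("child")]

# A repeated attribute is a well-formedness error, so the parser rejects it.
try:
    ET.fromstring('<parent child="Billy" child="Mary"></parent>')
    duplicate_ok = True
except ET.ParseError:
    duplicate_ok = False

# Packing values into one attribute parses, but the splitting is now your job.
packed = ET.fromstring('<parent child="Billy,Mary"></parent>')
unpacked = packed.get("child").split(",")

print(children, duplicate_ok, unpacked)
```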

--Fred

Escaping special chars

Last Updated: 7/5/2000
Applies to: XML 1.0+

You can use CDATA in an XML document to escape all special characters in a block of text, rather than using entities for each special character, as:

	<![CDATA[ this text escaped ]]>
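
To compare the two approaches, the sketch below wraps the same text both ways and parses it back with Python's standard library. The sample string and the code element name are made up. Note the one thing a CDATA block cannot contain is the sequence ]]> itself.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'if (a < b && b > c) print "ok"'

# Option 1: replace each special character with an entity.
with_entities = "<code>" + escape(raw) + "</code>"

# Option 2: wrap the whole block in CDATA (valid only if raw has no "]]>").
with_cdata = "<code><![CDATA[" + raw + "]]></code>"

text_a = ET.fromstring(with_entities).text
text_b = ET.fromstring(with_cdata).text

print(with_entities)
print(with_cdata)
print(text_a == raw, text_b == raw)
```

Both forms give the parser's consumer exactly the same text. CDATA only changes how the text is written in the file.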

--Fred

XML Parsers

Microsoft MSXML

Last Updated: 2/2/2001
Applies to: XML 1.0+

Microsoft includes an XML parser with IE 5.0+. You can also download it from Microsoft at:

http://msdn.microsoft.com/xml/

It is an ActiveX component, so you can use it from VB, ASP, IE, etc.

Keep in mind however that some aspects of XML (especially XSL) are evolving rapidly, so newer versions are not always compatible with older versions. For example, MSXML version 2.5 supports an older XSL syntax for sorting, using the order-by attribute of elements like for-each and apply-templates, while the newer MSXML 3.0 supports the newer XSLT syntax, using the sort element.

--Fred
1. Using MSXML from VB
  
  Last Updated: 2/2/2001
  Applies to: XML 1.0+, MSXML 2.0+, VB5+
  
  To use MSXML from a VB application, add a reference to your VB project via the Project | References... menu. The one I use in this sample is: Microsoft XML, version 2.0
  
  You can then write code like:
```
    Dim xmlDOM As msxml.DOMDocument
    Set xmlDOM = New msxml.DOMDocument

    ' Load XML data from a URL into the XML DOM (Document 
    ' Object Model).
    xmlDOM.async = False
    xmlDOM.Load("some URL that returns an XML stream")

    ' Iterate over the XML DOM tree to get the list of items,
    ' loading them into a VB Combo Box.    
    cboItems.Clear
    Dim xmlNode As msxml.IXMLDOMNode
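    ' documentElement is the root element, whether or not the
    ' stream starts with an <?xml ...?> declaration.
    For Each xmlNode In xmlDOM.documentElement.childNodes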
        cboItems.AddItem xmlNode.Text
    Next
```
  You manipulate the objects in the XML DOM like any other objects in VB. The names of the objects, methods, properties, parameters, etc., pop up automatically as you type, via VB's "Intellisense", just as they do for other VB objects. To learn more about the objects in the DOM, hit F2 to view them in the VB Object Browser, or read the docs at the Microsoft MSDN Web site, currently:
  
  http://msdn.microsoft.com/library/psdk/xmlsdk/xml_9yg5.htm
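
  For readers without MSXML at hand, roughly the same walk can be sketched in Python with the standard library. The items document below is a made-up stand-in for whatever the URL returns, and the plain list stands in for the combo box.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML stream; the element names are assumptions for this sketch.
xml_text = """<?xml version="1.0"?>
<items>
    <item>Apples</item>
    <item>Bananas</item>
    <item>Cherries</item>
</items>"""

# fromstring returns the root element directly.
root = ET.fromstring(xml_text)

# Iterating an element yields its child elements in document order.
items = [child.text for child in root]
print(items)
```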
  
  --Fred
2. Using MSXML from IE via VBScript
  
  Last Updated: 2/2/2001
  Applies to: XML 1.0+, MSXML 2.0+
  
  You can use MSXML from VBScript code running in IE5. The code looks like:
```
    Dim xmlDOM
    Set xmlDOM = CreateObject("msxml.DOMDocument")

    ' Load XML data from a URL into the XML DOM (Document 
    ' Object Model).
    xmlDOM.async = False
    xmlDOM.Load("some URL that returns an XML stream")

    ' Iterate over the XML DOM tree to get the list of items,
    ' loading them into an HTML SELECT control via DHTML.
    document.all.selItems.length = 0
    Dim xmlNode
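    ' documentElement is the root element, whether or not the
    ' stream starts with an <?xml ...?> declaration.
    For Each xmlNode In xmlDOM.documentElement.childNodes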
        Dim optNew
        Set optNew = document.createElement("OPTION")
        optNew.text = xmlNode.childNodes(0).text
        optNew.value = optNew.text
        document.all.selItems.options.add(optNew)
    Next
```
  Note that you are manipulating 2 document object models here. The XML DOM manipulations (the xmlDOM and xmlNode calls) get the data, and the DHTML DOM manipulations (the document and optNew calls) insert the data into the HTML SELECT control in the Web page. To learn more about the objects in the DOMs, use the VB Object Browser, or read the docs at the Microsoft MSDN Web site, currently:
  
  XML DOM: http://msdn.microsoft.com/library/psdk/xmlsdk/xml_9yg5.htm
  DHTML DOM: http://msdn.microsoft.com/workshop/author/dhtml/reference/dhtmlrefs.asp
  
  --Fred
3. Using MSXML from IE via JavaScript
  
  Last Updated: 2/2/2001
  Applies to: XML 1.0+, MSXML 2.0+
  
  You can use MSXML from JavaScript code running in IE5. The code looks like:
```
    var xmlDOM = new ActiveXObject("msxml.DOMDocument");

    // Load XML data from a URL into the XML DOM (Document 
    // Object Model).
    xmlDOM.async = false;
    xmlDOM.load("some URL that returns an XML stream");

    // Iterate over the XML DOM tree to get the list of items,
    // loading them into an HTML SELECT control via DHTML.
    document.all.selItems.length = 0;
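    // documentElement is the root element, whether or not the
    // stream starts with an <?xml ...?> declaration.
    var xmlNodes = xmlDOM.documentElement.childNodes;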
    for (var xmlNode = xmlNodes.nextNode();
         xmlNode;
         xmlNode = xmlNodes.nextNode())
    {
        var optNew = document.createElement("OPTION");
        optNew.text = xmlNode.childNodes[0].text;
        optNew.value = optNew.text;
        document.all.selItems.options.add(optNew);
    }
```
  Note that you are manipulating 2 document object models here. The XML DOM manipulations (the xmlDOM, xmlNodes and xmlNode calls) get the data, and the DHTML DOM manipulations (the document and optNew calls) insert the data into the HTML SELECT control in the Web page. To learn more about the objects in the DOMs, use the VB Object Browser, or read the docs at the Microsoft MSDN Web site, currently:
  
  XML DOM: http://msdn.microsoft.com/library/psdk/xmlsdk/xml_9yg5.htm
  DHTML DOM: http://msdn.microsoft.com/workshop/author/dhtml/reference/dhtmlrefs.asp
  
  --Fred

DTD Tips

DTD Quick Reference

Last Updated: 7/5/2000
Applies to: XML 1.0+
1. Attribute Type:
 
 CDATA Character data, no markup
 
 ID Unique ID value
 
 IDREF Reference to the ID of another element
 
 ENTITY, ENTITIES Name(s) of unparsed external entities
 
 NMTOKEN, NMTOKENS Only chars valid in a name (letters, digits, periods, dashes, underscores, colons)
 
 NOTATION Name of a notation
 
 (this|that) Alternation of literal values
 
 NOTATION(this|that) Alternation of notation names
2. Attribute Default:
 
 #REQUIRED
 
 #IMPLIED
 
 #FIXED value
 
 default_value
3. Element Structure Symbols:
 
 | Alternation (one or the other)
 
 , Sequence (one, then the other)
 
 ? Zero or one
 
 <no symbol> Exactly one
 
 * Zero or more
 
 + One or more
 
 () Grouping
 
 ANY Anything goes
 
 EMPTY No contents allowed
 
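 #PCDATA Parsed character data (text; entity references in it are expanded)
 
 CDATA Unparsed character data (no markup). Not a content-model keyword: CDATA is an attribute type, and a <![CDATA[ ... ]]> section may appear wherever #PCDATA is allowed.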
4. !ENTITY - Like a #define macro in C/C++.
 
 Predefined General Entity
 
 &amp; Ampersand (&)
 
 &lt; Less than (<)
 
 &gt; Greater than (>)
 
 &apos; Apostrophe (')
 
 &quot; Quote (")
 
 General Entity
 - Defined in DTD; used in document
 - Example: <!ENTITY amp "&#38;#38;">
 - Example: <!ENTITY ProductName "Super Duper Editor">
 
 Parameter Entity
 - Defined in DTD; used in DTD
 - Example: <!ENTITY % Description "This is the description.">
 
 Can be an External Entity -- Evaluates to a URL string, the contents of which are substituted like a #include of C/C++.
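
 To see a general entity expand, here is a minimal sketch in Python using the standard xml.etree.ElementTree, which expands entities declared in a document's internal DTD subset. The manual element name is invented for the example. ProductName is the entity from the list above.

```python
import xml.etree.ElementTree as ET

# The DTD is embedded in the document's DOCTYPE (the internal subset).
document = """<!DOCTYPE manual [
    <!ENTITY ProductName "Super Duper Editor">
]>
<manual>Welcome to &ProductName;.</manual>"""

# The parser substitutes the entity's value wherever it is referenced.
text = ET.fromstring(document).text
print(text)
```

 Parsers that expand entities this way should only be fed documents from sources you trust, since nested entity definitions can be made to expand to very large amounts of text.
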
--Fred

DTD Conditional Compilation

Last Updated: 7/5/2000
Applies to: XML 1.0+

You can enable and disable sections of your DTD (similar to #if or #ifdef in C/C++), as:
```
	<!ENTITY % part1 "IGNORE">
	<!ENTITY % part2 "INCLUDE">
	<![%part1;[ ... ]]>
	<![%part2;[ ... ]]>
```
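The first 2 lines define the parameter entities part1 and part2, giving each the value IGNORE or INCLUDE. The last 2 lines are each expanded to look like one of the following:
```
	<![IGNORE[ ... ]]>
	<![INCLUDE[ ... ]]>
```
so that the enclosed DTD statements are enabled or disabled. Note that conditional sections are allowed only in the external DTD subset (a separate DTD file), not in the internal subset inside the document's DOCTYPE declaration.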
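
Conditional sections need a DTD-aware parser, so the following is only a toy model of the substitution step in Python, not a DTD parser. The helper name expand_parameter_entities is invented for this sketch.

```python
import re

def expand_parameter_entities(dtd_text, entities):
    """Replace each %name; reference with its declared value (text only)."""
    return re.sub(r"%(\w+);", lambda match: entities[match.group(1)], dtd_text)

# Values taken from the two <!ENTITY % ...> declarations in the tip.
entities = {"part1": "IGNORE", "part2": "INCLUDE"}

for line in ("<![%part1;[ ... ]]>", "<![%part2;[ ... ]]>"):
    print(expand_parameter_entities(line, entities))
```

Changing the value of part1 from IGNORE to INCLUDE in one place is then enough to switch the whole section on.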

--Fred
