W4 XML Parser

(c) Carlos Viegas Damásio, July 2003

 

Description
This software package implements a non-validating XML parser written entirely in Prolog. It is almost conformant with the W3C recommendations: the only deviation from XML 1.0 is that parsing fails for non-standalone documents that use undeclared entities. It fully supports XML Namespaces, and the internal representation it produces follows the guidelines of the XML Information Set. The parser also handles XML Base and resolves relative references.

The main goal of this library is to provide a robust, portable, and fully conformant tool for developing advanced Semantic Web applications. It can be one order of magnitude slower than other existing Prolog parsers, which are not conformant with the W3C recommendations (for instance, they neither recognize Unicode nor handle XML Base attributes). Providing a low-level implementation of this tool is one of our long-term objectives. If efficiency matters, some hints for obtaining better performance are given under Current Limitations.

Internal DTDs are parsed, and a Prolog representation of them is also produced. Default attributes declared in the DTD are automatically added to the XML term representation. Internal entities and parameter entities are expanded. Attribute-value normalization and whitespace elimination are also performed.

The following encodings are currently recognized: US-ASCII, UTF-8, UTF-16, UTF-16LE, UTF-16BE, and ISO-8859-1.

The package was developed for XSB Prolog 2.5, but porting to other Prolog systems is foreseen.

 

Download and Installation

The W4 XML Parser Library distribution contains the following XSB Prolog source files:

  • The file containing the main XML parser predicates (xml.P)
  • The XML parser (xmlparser.H and xmlparser.P), automatically produced, with lookahead information, by our Lookup Definite Clause Grammar generator.
  • The Document Object Model (xmldom.H and xmldom.P) with the predicates for constructing the Prolog representation of XML documents.
  • A translator to the compact representation of XML in Prolog (xml2termns.P), with Namespace support.
  • The I/O stream support predicates (iostream.P and utf.P).
  • Several utility predicates (utilities.P) and XSB-specific builtins (builtins.P), which may need to be changed when porting to other Prolog systems.
  • Support of IRIs (iri.P and iriparse.P).

The distribution also contains some files with extension .G, which are the original source files used to generate the XML parser.

To start using the W4 XML Parser library, unpack the .zip file and compile the file xml.P within XSB Prolog.

 

Utilization

The main predicate is parse_xml_document( +Name, +DocURI, ?Encoding, -Document, -Timing ), where:

  • Name is a list of character codes, an atom (which is converted to a list of character codes), or a term of the form stream(StreamName). For the latter, the stream is opened, read, and closed by the predicate.
  • DocURI is a list of character codes holding the document base URI (see RFC 2396 for details). If you do not need XML Base support, simply pass []; the absolute file path is then used instead.
  • Encoding indicates the encoding of the document. If it is left unbound (a variable), the parser tries to determine the encoding from the document itself, using either the Byte Order Mark or the encoding specified in the XML declaration. It can take the following values:
    'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16', 'UTF-16LE', 'UTF-16BE'.
  • Document returns the Prolog representation of the document, which follows the XML Information Set (XML INFOSET) and is described below in the XML DOM Module Documentation.
  • Timing is a term of the form time(LoadTime,ParseTime) giving the reading and parsing times in milliseconds.
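
The Byte Order Mark check mentioned for the Encoding argument can be sketched as follows. This is not the parser's own code: bom_encoding/3 is a hypothetical helper name, and the sketch relies only on the standard BOM byte sequences for UTF-8 and UTF-16. It fails when the input carries no BOM, in which case the XML declaration would have to be consulted instead.

```prolog
% bom_encoding(+Bytes, -Encoding, -Rest)
% Hypothetical helper, not part of the W4 library.
% Bytes is a list of byte values (0..255). Succeeds if Bytes starts with a
% Byte Order Mark, returning the encoding it denotes and the remaining bytes.
bom_encoding([239,187,191|Rest], 'UTF-8',    Rest).   % EF BB BF
bom_encoding([254,255|Rest],     'UTF-16BE', Rest).   % FE FF
bom_encoding([255,254|Rest],     'UTF-16LE', Rest).   % FF FE
```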

The following two simpler versions of parse_xml_document are also provided:

  • parse_xml_document( +Name, ?Encoding, -Document )
  • parse_xml_document( +Name, +DocURI, ?Encoding, -Document )
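
As an illustration, calls to these predicates might look like the queries below. This is only a sketch: it assumes that xml.P has already been compiled and loaded, that the atom passed as Name is a file name, and that a file called example.xml (a hypothetical name) exists in the current directory. Leaving Encoding unbound asks the parser to determine the encoding itself.

```prolog
% Five-argument version: no explicit base URI ([]), encoding left unbound,
% timings returned in LoadTime and ParseTime.
| ?- parse_xml_document( 'example.xml', [], Encoding, Document, time(LoadTime,ParseTime) ).

% Three-argument version with the encoding stated explicitly.
| ?- parse_xml_document( 'example.xml', 'UTF-8', Document ).
```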

In some situations it may be desirable to skip the reading phase. In that case the predicate xml_document( +UnicodeList, +BaseURI, ?Encoding, -Document ) can be used directly. Note that the first argument is a Unicode list, i.e. a list of integers, terminated with -1. The other arguments are as before.

Example 1:

| ?- xml_document( [0'<,0'h,0'e,0'l,0'l,0'o,0'/,0'>,-1], [], _, T ).

T = document([element([],hello,[],[],[],[],[[] = [],xml = http://www.w3.org/XML/1998/namespace],[],[])],
              element([],hello,[],[],[],[],[[] = [],xml = http://www.w3.org/XML/1998/namespace],[],[]),
              [],[],[],1.0,UTF-8,[],[]
            );

If a document is not well-formed, all of the predicates above fail.
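
Writing the terminated list by hand quickly becomes tedious, so a small helper can build it from an atom. The sketch below is not part of the library: terminated_codes/2 and add_terminator/2 are hypothetical names, and it assumes that the character codes returned by atom_codes/2 are the intended Unicode code points, which is safe for ASCII text but may not hold for other characters in every Prolog system.

```prolog
% terminated_codes(+Atom, -Codes)
% Hypothetical helper: Codes is the list of character codes of Atom
% followed by the -1 terminator expected by xml_document/4.
terminated_codes(Atom, Codes) :-
    atom_codes(Atom, Plain),
    add_terminator(Plain, Codes).

% add_terminator(+List, -Terminated): copy List and append -1 at the end.
add_terminator([], [-1]).
add_terminator([C|Cs], [C|Ts]) :-
    add_terminator(Cs, Ts).
```

With this helper, the query of Example 1 could be written as terminated_codes('<hello/>', L), xml_document(L, [], _, T), where L is bound to [60,104,101,108,108,111,47,62,-1].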

Representation of XML documents

The Prolog representation of XML documents closely follows the XML Information Set (XML INFOSET) wherever it is defined, except for the parent and references properties. Omitting these properties avoids the creation of cyclic terms, which are difficult to handle and to use correctly in most Prolog implementations.

We illustrate the features of our parser with a simple example, and refer the reader to an auxiliary document for the full description of the Prolog representation adopted. Consider the following XML document:

<?xml version="1.0"?>
<!-- A comment -->
<?log    this  file  ?>

<tag1 a='abc &lt;' xmlns="http://xpto.org" n1:b='1234' xmlns:n1="http://abc.com">
A very simple text
	<n1:tag2 xml:space="preserve">
		<!-- whitespace between markup should appear -->
		<tag3 xml:space="default">
		</tag3>
		<tag3/>
	</n1:tag2>
	<tag3 xml:lang="en" attrib1='This attribute has    spaces    and 
					     a line feed'>
		<tag4 xmlns="">
			<tag5>This tag shouldn't have a namespace</tag5>
		</tag4>
		<tag4>
			<!-- Whitespace shouldn't appear -->
		</tag4>
	</tag3>
</tag1>

The representation produced is the following (very complex...) term:

document(
  [ comment([32,65,32,99,111,109,109,101,110,116,32]),
    pi(log,[116,104,105,115,32,32,102,105,108,101,32,32],file:/example.xml),
    element(http://xpto.org,tag1,[],
      [ pcdata([10,65,32,118,101,114,121,32,115,105,109,112,108,101,32,116,101,120,116,10,9]),
        element(http://abc.com,tag2,n1,
          [ whitespace([10,9,9]),
            comment([32,119,104,105,116,101,115,112,97,99,101,32,98,101,116,119,101,101,110,32,
                     109,97,114,107,117,112,32,115,104,111,117,108,100,32,97,112,112,101,97,114,32]),
            whitespace([10,9,9]),
            element(http://xpto.org,tag3,[],
              [],
              [ename(http://www.w3.org/XML/1998/namespace,space) = attribute(http://www.w3.org/XML/1998/namespace,space,xml,[100,101,102,97,117,108,116],no,[])],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,[]
            ),
            whitespace([10,9,9]),
            element(http://xpto.org,tag3,[],
              [],
              [],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,
              []
            ),
            whitespace([10,9])
          ],
          [ename(http://www.w3.org/XML/1998/namespace,space) = attribute(http://www.w3.org/XML/1998/namespace,space,xml,[112,114,101,115,101,114,118,101],no,[])],
          [],
          [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
          file:/example.xml,
          []
        ),
        element(http://xpto.org,tag3,[],
          [ element([],tag4,[],
              [ element([],tag5,[],
                  [ pcdata([84,104,105,115,32,116,97,103,32,115,104,111,117,108,100,110,39,116,32,104,97,118,101,32,97,32,110,97,109,101,115,112,97,99,101])
                  ],
                  [],
                  [],
                  [[] = [],n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
                  file:/example.xml,
                  [101,110]
                )
              ],
              [],
              [ename(http://www.w3.org/2000/xmlns/,[]) = attribute(http://www.w3.org/2000/xmlns/,xmlns,[],[],no,[])],
              [[] = [],n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,
              [101,110]
            ),
            element(http://xpto.org,tag4,[],
              [ comment([32,87,104,105,116,101,115,112,97,99,101,32,115,104,111,117,108,100,110,39,116,32,97,112,112,101,97,114,32])
              ],
              [],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,
              [101,110]
            )
          ],
          [ ename([],attrib1) = attribute([],attrib1,[],[84,104,105,115,32,97,116,116,114,105,98,117,116,101,32,104,97,115,32,32,32,32,115,112,97,99,101,115,
                                                         32,32,32,32,97,110,100,32,32,32,32,32,32,32,32,32,32,32,32,97,32,108,105,110,101,32,102,101,101,100],no,[]),
            ename(http://www.w3.org/XML/1998/namespace,lang) = attribute(http://www.w3.org/XML/1998/namespace,lang,xml,[101,110],no,[]) 
          ],
          [],
          [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
          file:/example.xml,
          [101,110]
        )
      ],
      [ ename([],a) = attribute([],a,[],[97,98,99,32,60],no,[]),
        ename(http://abc.com,b) = attribute(http://abc.com,b,n1,[49,50,51,52],no,[])],
      [ ename(http://www.w3.org/2000/xmlns/,[]) = attribute(http://www.w3.org/2000/xmlns/,xmlns,[],[104,116,116,112,58,47,47,120,112,116,111,46,111,114,103],no,[]),
        ename(http://www.w3.org/2000/xmlns/,n1) = attribute(http://www.w3.org/2000/xmlns/,n1,xmlns,[104,116,116,112,58,47,47,97,98,99,46,99,111,109],no,[])],
      [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
      file:/example.xml,
      []
    )
  ], 
  ... The representation of the document element again ...,
  [],
  [],
  file:/example.xml,
  1.0,
  UTF-8,
  yes,
  []
)
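
Terms of this shape can be inspected with ordinary unification. The sketch below is not part of the library, and all predicate names in it are hypothetical. The argument order of element/9 and attribute/6 is inferred from the example term above rather than taken from the full description, so the auxiliary document remains the authoritative source. Note also that the URIs are printed unquoted above but must be written as quoted atoms in source code, e.g. 'http://abc.com'.

```prolog
% Assumed argument order, read off the example term:
% element(NamespaceURI, LocalName, Prefix, Children, Attributes,
%         NamespaceAttributes, InScopeNamespaces, BaseURI, Language)

% element_names(+Element, -Names): local names of Element and of all its
% descendant elements, in document order.
element_names(element(_, Name, _, Children, _, _, _, _, _), [Name|Names]) :-
    children_names(Children, Names).

% children_names(+Nodes, -Names): pcdata, whitespace, comment and pi
% nodes contribute no names.
children_names([], []).
children_names([Node|Nodes], Names) :-
    (   Node = element(_, _, _, _, _, _, _, _, _)
    ->  element_names(Node, Here)
    ;   Here = []
    ),
    children_names(Nodes, Rest),
    join_lists(Here, Rest, Names).

% attribute_value(+Element, +NamespaceURI, +LocalName, -Codes): value of
% the attribute with the given expanded name, as a list of character codes.
% Fails if the element has no such attribute.
attribute_value(element(_, _, _, _, Attributes, _, _, _, _), NS, Local, Codes) :-
    lookup(Attributes, ename(NS, Local), attribute(_, _, _, Codes, _, _)).

% lookup(+Pairs, +Key, -Value): Value of the first Key = Value pair.
lookup([Key0 = Value0|Pairs], Key, Value) :-
    (   Key0 == Key
    ->  Value = Value0
    ;   lookup(Pairs, Key, Value)
    ).

% join_lists(+Xs, +Ys, -Zs): Zs is Xs followed by Ys.
join_lists([], Ys, Ys).
join_lists([X|Xs], Ys, [X|Zs]) :-
    join_lists(Xs, Ys, Zs).
```

For the term T of Example 1, the query T = document(_, Root, _, _, _, _, _, _, _), element_names(Root, Names) binds Names to [hello].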

 

Current Limitations
  • Reading documents is rather inefficient because several encodings must be supported. If you have a more efficient way of obtaining terminated Unicode lists, use it and call xml_document/4 directly.
  • External parsed entities are not expanded.

Future developments
  • Complete the documentation
  • Improve the generation of the tree representation

 

Disclaimer

THIS IS AN EXPERIMENTAL TOOL. I DO NOT GIVE ANY GUARANTEE.

 
 
Last update: July 30th, 2003