W4 XML Parser

(c) Carlos Viegas Damásio, July 2003

Description

This software package implements a non-validating XML parser written entirely in Prolog, almost conformant with the W3C recommendations (we deviate from XML 1.0 by failing on non-standalone documents that make use of undeclared entities). It fully supports XML Namespaces, and the internal representation it produces follows the guidelines of the XML Information Set. The parser handles XML Base and resolves relative references.

The main goal of this library is to provide a robust, portable, and fully conformant tool for the development of advanced Semantic Web applications. It can be one order of magnitude slower than other existing Prolog parsers, which are not conformant to the W3C recommendations (for instance, they neither recognize Unicode nor handle XML Base attributes). Providing a low-level implementation of this tool is among our long-term objectives. If efficiency is desired, we give some hints in order to obtain better behaviour.

Internal DTDs are parsed and a Prolog representation of them is also produced. Default attributes declared in the DTD are automatically placed in the XML term representation. Furthermore, internal entities and parameter entities are expanded. Attribute values are normalized and whitespace is eliminated where appropriate. The package was developed for XSB Prolog 2.5, but porting to other Prolog systems is foreseen.
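
The normalization of attribute values mentioned above can be illustrated with a small sketch. It is not the parser's own code: it only shows the step of XML 1.0 attribute-value normalization that replaces each tab, line feed, or carriage return in a CDATA attribute value by a space, and it ignores reference expansion and the extra trimming applied to non-CDATA attributes. The predicate names normalize_cdata_codes/2 and xml_whitespace/1 are hypothetical.

```prolog
% normalize_cdata_codes(+Codes, -Normalized)
% Hypothetical helper, not part of the W4 distribution.
% Replaces every tab (9), line feed (10) and carriage return (13)
% in a list of character codes by a space (32).
normalize_cdata_codes([], []).
normalize_cdata_codes([Code|Codes], [Norm|Norms]) :-
    (   xml_whitespace(Code)
    ->  Norm = 32
    ;   Norm = Code
    ),
    normalize_cdata_codes(Codes, Norms).

% Whitespace characters other than the space itself.
xml_whitespace(9).
xml_whitespace(10).
xml_whitespace(13).
```

For instance, the query normalize_cdata_codes([97,10,9,98,32,99], N) binds N to [97,32,32,98,32,99].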

Download and Installation

The W4 XML Parser Library distribution contains the following XSB Prolog source files:

You can also find some files with extension .G, which are the original source files used to generate the XML Parser. To start using the W4 XML Parser library, unpack the .zip file and compile the file xml.P within XSB Prolog.

Utilization

The main predicate is parse_xml_document( +Name, +DocURI, ?Encoding, -Document, -Timing ), where:

The following two simpler versions of parse_xml_document are provided:

In some situations it may be desirable to skip the reading phase. In that case the predicate xml_document( +UnicodeList, +BaseURI, ?Encoding, -Document ) can be used directly. Note that the first argument is a Unicode list (i.e. a list of integers) terminated with -1. The other arguments are as before.

Example 1:

    | ?- xml_document( [0'<,0'h,0'e,0'l,0'l,0'o,0'/,0'>,-1], [], T ).

If the document is not well-formed, all of the previous predicates fail.
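
Writing the -1-terminated list by hand quickly becomes tedious. A minimal sketch of a helper that builds it from an atom is given below. The predicates atom_to_unicode_list/2 and add_end_marker/2 are hypothetical and not part of the library, and the sketch assumes plain ASCII input, for which the codes returned by atom_codes/2 coincide with Unicode code points.

```prolog
% atom_to_unicode_list(+Atom, -UnicodeList)
% Hypothetical helper, not part of the W4 distribution.
% Converts an atom into its list of character codes followed by
% the -1 end marker expected as the first argument of xml_document.
atom_to_unicode_list(Atom, UnicodeList) :-
    atom_codes(Atom, Codes),
    add_end_marker(Codes, UnicodeList).

% add_end_marker(+Codes, -Terminated)
% Copies Codes and appends the -1 end marker.
add_end_marker([], [-1]).
add_end_marker([Code|Codes], [Code|Terminated]) :-
    add_end_marker(Codes, Terminated).
```

The query atom_to_unicode_list('<hello/>', L) binds L to [60,104,101,108,108,111,47,62,-1], which is the same list as the one written with 0' notation in Example 1.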
Representation of XML documents

The Prolog representation of XML documents closely follows the XML Information Set (XML Infoset) wherever it is defined, with the exception of the parent and references properties. Omitting these avoids the creation of cyclic terms, which are difficult to handle and to use correctly in most Prolog implementations. We illustrate the features of our parser with a simple example, and refer the reader to an auxiliary document containing the full description of the Prolog representation adopted.

Consider the following example XML document:

<?xml version="1.0"?>
<!-- A comment -->
<?log this file ?>
<tag1 a='abc &lt;' xmlns="http://xpto.org" n1:b='1234' xmlns:n1="http://abc.com">
A very simple text
	<n1:tag2 xml:space="preserve">
		<!-- whitespace between markup should appear -->
		<tag3 xml:space="default"> </tag3>
		<tag3/>
	</n1:tag2>
	<tag3 xml:lang="en" attrib1='This attribute has spaces and a line feed'>
		<tag4 xmlns="">
			<tag5>This tag shouldn't have a namespace</tag5>
		</tag4>
		<tag4>
			<!-- Whitespace shouldn't appear -->
		</tag4>
	</tag3>
</tag1>

The representation produced is the following (very complex...) term:

document(
  [ comment([32,65,32,99,111,109,109,101,110,116,32]),
    pi(log,[116,104,105,115,32,32,102,105,108,101,32,32],file:/example.xml),
    element(http://xpto.org,tag1,[],
      [ pcdata([10,65,32,118,101,114,121,32,115,105,109,112,108,101,32,116,101,120,116,10,9]),
        element(http://abc.com,tag2,n1,
          [ whitespace([10,9,9]),
            comment([32,119,104,105,116,101,115,112,97,99,101,32,98,101,116,119,101,101,110,32,
                     109,97,114,107,117,112,32,115,104,111,117,108,100,32,97,112,112,101,97,114,32]),
            whitespace([10,9,9]),
            element(http://xpto.org,tag3,[],
              [],
              [ename(http://www.w3.org/XML/1998/namespace,space) = attribute(http://www.w3.org/XML/1998/namespace,space,xml,[100,101,102,97,117,108,116],no,[])],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,[] ),
            whitespace([10,9,9]),
            element(http://xpto.org,tag3,[],
              [],
              [],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml, [] ),
            whitespace([10,9])
          ],
          [ename(http://www.w3.org/XML/1998/namespace,space) = attribute(http://www.w3.org/XML/1998/namespace,space,xml,[112,114,101,115,101,114,118,101],no,[])],
          [],
          [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
          file:/example.xml, [] ),
        element(http://xpto.org,tag3,[],
          [ element([],tag4,[],
              [ element([],tag5,[],
                  [ pcdata([84,104,105,115,32,116,97,103,32,115,104,111,117,108,100,110,39,116,32,104,97,118,101,32,97,32,110,97,109,101,115,112,97,99,101]) ],
                  [],
                  [],
                  [[] = [],n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
                  file:/example.xml, [101,110] )
              ],
              [],
              [ename(http://www.w3.org/2000/xmlns/,[]) = attribute(http://www.w3.org/2000/xmlns/,xmlns,[],[],no,[])],
              [[] = [],n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml, [101,110] ),
            element(http://xpto.org,tag4,[],
              [ comment([32,87,104,105,116,101,115,112,97,99,101,32,115,104,111,117,108,100,110,39,116,32,97,112,112,101,97,114,32]) ],
              [],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml, [101,110] )
          ],
          [ ename([],attrib1) = attribute([],attrib1,[],[84,104,105,115,32,97,116,116,114,105,98,117,116,101,32,104,97,115,32,32,32,32,115,112,97,99,101,115,
                                                         32,32,32,32,97,110,100,32,32,32,32,32,32,32,32,32,32,32,32,97,32,108,105,110,101,32,102,101,101,100],no,[]),
            ename(http://www.w3.org/XML/1998/namespace,lang) = attribute(http://www.w3.org/XML/1998/namespace,lang,xml,[101,110],no,[]) ],
          [],
          [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
          file:/example.xml, [101,110] )
      ],
      [ ename([],a) = attribute([],a,[],[97,98,99,32,60],no,[]),
        ename(http://abc.com,b) = attribute(http://abc.com,b,n1,[49,50,51,52],no,[])],
      [ ename(http://www.w3.org/2000/xmlns/,[]) = attribute(http://www.w3.org/2000/xmlns/,xmlns,[],[104,116,116,112,58,47,47,120,112,116,111,46,111,114,103],no,[]),
        ename(http://www.w3.org/2000/xmlns/,n1) = attribute(http://www.w3.org/2000/xmlns/,n1,xmlns,[104,116,116,112,58,47,47,97,98,99,46,99,111,109],no,[])],
      [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
      file:/example.xml, [] )
  ],
  ... The representation of the document element again ...,
  [],
  [],
  file:/example.xml, 1.0, UTF-8, yes, [] )
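
Terms of this shape can be processed with ordinary list recursion. The sketch below collects the character data found beneath an element. It assumes the nine-argument layout read off the example term (namespace URI, local name, prefix, children, attributes, namespace attributes, in-scope namespaces, base URI, language), which may not match the full description in every detail. The predicates element_text/2, children_text/2, child_text/2 and join_codes/3 are hypothetical and not part of the library.

```prolog
% element_text(+Element, -Codes)
% Hypothetical helper, not part of the W4 distribution.
% Codes is the concatenation of all pcdata and whitespace content
% below Element, in document order.
element_text(element(_,_,_,Children,_,_,_,_,_), Codes) :-
    children_text(Children, Codes).

% children_text(+Children, -Codes): text of a list of child nodes.
children_text([], []).
children_text([Child|Rest], Codes) :-
    child_text(Child, ChildCodes),
    children_text(Rest, RestCodes),
    join_codes(ChildCodes, RestCodes, Codes).

% child_text(+Node, -Codes): text contributed by a single node.
% Only the node kinds visible in the example are covered; any other
% kind of node makes the predicate fail.
child_text(pcdata(Codes), Codes).
child_text(whitespace(Codes), Codes).
child_text(comment(_), []).
child_text(pi(_,_,_), []).
child_text(element(_,_,_,Children,_,_,_,_,_), Codes) :-
    children_text(Children, Codes).

% join_codes(+Xs, +Ys, -Zs): Zs is Xs followed by Ys.
join_codes([], Ys, Ys).
join_codes([X|Xs], Ys, [X|Zs]) :-
    join_codes(Xs, Ys, Zs).
```

For example, calling element_text/2 on element([],tag5,[],[pcdata([84,104,105,115])],[],[],[],'file:/example.xml',[101,110]) yields the codes [84,104,105,115], which atom_codes/2 turns into the atom 'This'.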

Current Limitations

Future developments

Disclaimer

THIS IS AN EXPERIMENTAL TOOL. I DO NOT GIVE ANY GUARANTEE.

Last update: July 30th, 2003