<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>S-XML</title>
<link rel="stylesheet" type="text/css" href="style.css"/>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
</head>

<body>
<div class="header">
<h1>S-XML</h1>
</div>

<p>
S-XML is a simple XML parser implemented in Common Lisp.
It was originally written by <a href="http://homepage.mac.com/svc">Sven Van Caekenberghe</a>
and is now maintained by <a href="http://homepage.mac.com/svc">Sven Van Caekenberghe</a>,
<a href="http://constantly.at">Rudi Schlatte</a> and <a href="http://www.cs.indiana.edu/~bmastenb">Brian Mastenbrook</a>.
S-XML is used by <a href="http://common-lisp.net/project/s-xml-rpc">S-XML-RPC</a> and
<a href="http://common-lisp.net/project/cl-prevalence">CL-PREVALENCE</a>.
</p>

<p>
This XML parser implementation has the following features:
</p>
<ul>
<li>It works (handling many common XML usages).</li>
<li>It is very small (the core is about 400 lines of code, including comments and whitespace).</li>
<li>It has a core API that is simple, efficient and purely functional, much like that of <a href="http://pobox.com/~oleg/ftp/Scheme/xml.html">SSAX</a> (see also <a href="http://ssax.sourceforge.net">http://ssax.sourceforge.net</a>).</li>
<li>It supports different DOM models: an <a href="http://pobox.com/~oleg/ftp/Scheme/SXML.html">SXML</a>-based one, an <a href="http://opensource.franz.com/xmlutils/xmlutils-dist/pxml.htm">LXML</a>-based one and a classic xml-element struct-based one.</li>
<li>It is reasonably time and space efficient (internally avoiding garbage generation as much as possible).</li>
</ul>
<p>
This XML parser implementation has the following limitations:
</p>
<ul>
<li>It does not support CDATA.</li>
<li>It only supports simple character sets.</li>
<li>It does not support namespaces.</li>
<li>It does not support any special tags (like processing instructions).</li>
<li>It is not validating; it even skips DTDs altogether.</li>
</ul>

<h3>Download</h3>
<p>
You can download the LLGPL source code and documentation as <a href="s-xml.tgz">s-xml.tgz</a>
(signature: <a href="s-xml.tgz.asc">s-xml.tgz.asc</a>, for which the public key can be found
in the <a href="http://common-lisp.net/keyring.asc">common-lisp.net keyring</a>).
Build and/or install it with ASDF.
</p>
<p>
You can view the <a href="http://common-lisp.net/cgi-bin/viewcvs.cgi/?cvsroot=s-xml">CVS Repository</a> or
get anonymous CVS access as follows:
</p>
<pre>$ cvs -d:pserver:anonymous@common-lisp.net:/project/s-xml/cvsroot login
(Logging in to anonymous@common-lisp.net)
CVS password: anonymous
$ cvs -d:pserver:anonymous@common-lisp.net:/project/s-xml/cvsroot co s-xml</pre>

<h3>API</h3>
<p>
The plain API exported by the package S-XML (automatically generated by LispDoc)
is available in <a href="S-XML.html">S-XML.html</a>.
</p>

<h3>XML Parser</h3>
<p>
Using a DOM parser is easier, but usually less efficient: see the next sections. To use the event-based API of the parser, call the function start-parse-xml on a stream, together with a parser state that specifies three hook functions:
</p>
<ul>
<li><b>new-element-hook</b> <tt>(name attributes seed) =&gt; seed</tt><br/>
Called when the parser enters a new element.
The name of the element (tag) and the attributes (an unordered list of dotted pairs, with attribute names as keywords
and attribute values as strings) of the element are passed in,
as well as the seed from the previous element (either the last encountered sibling or the parent).
The hook must return a seed value to be passed to the first child element
or directly to finish-element-hook (when there are no children).</li>
<li><b>finish-element-hook</b> <tt>(name attributes parent-seed seed) =&gt; seed</tt><br/>
Called when the parser leaves an element.
The name of the element (tag) and the attributes (an unordered list of dotted pairs, with attribute names as keywords
and attribute values as strings) of the element are passed in,
as well as the parent-seed (the seed that was passed to the corresponding new-element-hook when this element started)
and the seed returned by the last child (or by new-element-hook itself when there are no children).
The hook must return the final seed value for this element
to be passed to the next sibling or to the parent (when there are no more siblings).</li>
<li><b>text-hook</b> <tt>(string seed) =&gt; seed</tt><br/>
Called when the parser finds text as contents.
The string of the text encountered is passed in, as well as the seed from the previous element
(either the last encountered sibling or the parent).
The hook must return the seed value for this text
to be passed to the next sibling or to the parent (when there are no more siblings).</li>
</ul>
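<p>
The way the seed is threaded through the three hooks can be modelled without the parser itself. The following sketch is not S-XML code: walk-node is a hypothetical stand-in that walks a nested list of the form (name attributes child...) and calls the hooks in the order described above, and the depth-... hooks are illustrative names that compute the maximum nesting depth using a seed of the form (depth . max-depth).
</p>
<pre>;; Hypothetical stand-in for the parser, NOT S-XML's implementation:
;; a node is either a string or a list (name attributes child...).
(defun walk-node (node seed new-element-hook finish-element-hook text-hook)
  (if (stringp node)
      (funcall text-hook node seed)
      (let* ((name (first node))
             (attributes (second node))
             ;; the seed returned here goes to the first child
             (child-seed (funcall new-element-hook name attributes seed)))
        (dolist (child (cddr node))
          ;; each child receives the seed returned by its previous sibling
          (setf child-seed
                (walk-node child child-seed
                           new-element-hook finish-element-hook text-hook)))
        ;; finish receives the seed from before the element (parent-seed)
        ;; and the seed returned by the last child
        (funcall finish-element-hook name attributes seed child-seed))))

;; Hooks computing the maximum nesting depth; the seed is (depth . max-depth).
(defun depth-new-element-hook (name attributes seed)
  (declare (ignore name attributes))
  (let ((depth (1+ (car seed))))
    (cons depth (max depth (cdr seed)))))

(defun depth-finish-element-hook (name attributes parent-seed seed)
  (declare (ignore name attributes))
  ;; back at the parent's level, keeping the maximum seen so far
  (cons (car parent-seed) (cdr seed)))

(defun depth-text-hook (string seed)
  (declare (ignore string))
  seed)

(defun max-depth (tree)
  (cdr (walk-node tree (cons 0 0)
                  #'depth-new-element-hook
                  #'depth-finish-element-hook
                  #'depth-text-hook)))

;; returns 3 for FOO containing P with text and H1 containing H2
(max-depth '(:foo ((:x . "10")) (:p () "Text") (:h1 () (:h2 ()))))</pre>
<p>
With S-XML loaded, the same three hook functions could presumably be handed to an xml-parser-state, with (cons 0 0) as the initial seed, instead of to walk-node.
</p>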
<p>
As an example, consider the following tracer that shows how the different hooks are called:
</p>
<pre>(defun trace-xml-new-element-hook (name attributes seed)
  (let ((new-seed (cons (1+ (car seed)) (1+ (cdr seed)))))
    (trace-xml-log (car seed)
                   "(new-element :name ~s :attributes ~:[()~;~:*~s~] :seed ~s) =&gt; ~s"
                   name attributes seed new-seed)
    new-seed))

(defun trace-xml-finish-element-hook (name attributes parent-seed seed)
  (let ((new-seed (cons (1- (car seed)) (1+ (cdr seed)))))
    (trace-xml-log (car parent-seed)
                   "(finish-element :name ~s :attributes ~:[()~;~:*~s~] :parent-seed ~s :seed ~s) =&gt; ~s"
                   name attributes parent-seed seed new-seed)
    new-seed))

(defun trace-xml-text-hook (string seed)
  (let ((new-seed (cons (car seed) (1+ (cdr seed)))))
    (trace-xml-log (car seed)
                   "(text :string ~s :seed ~s) =&gt; ~s"
                   string seed new-seed)
    new-seed))

(defun trace-xml (in)
  "Parse and trace a toplevel XML element from stream in"
  (start-parse-xml in
                   (make-instance 'xml-parser-state
                                  :seed (cons 0 0)
                                  ;; seed car is xml element nesting level
                                  ;; seed cdr is ever increasing from element to element
                                  :new-element-hook #'trace-xml-new-element-hook
                                  :finish-element-hook #'trace-xml-finish-element-hook
                                  :text-hook #'trace-xml-text-hook)))</pre>
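<p>
The tracer calls a helper, trace-xml-log, that is not shown on this page. A minimal sketch of such a helper follows. It assumes that the helper takes the nesting level, a format control string and the format arguments, and that it indents by two spaces per level; both are assumptions made for illustration, not necessarily what S-XML's own helper does.
</p>
<pre>(defun trace-xml-log (level control-string &amp;rest args)
  "Print one trace line to standard output, indented by nesting level."
  ;; Assumption: two spaces of indentation per nesting level.
  (dotimes (i (* 2 level))
    (write-char #\Space))
  (apply #'format t control-string args)
  (terpri))</pre>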
<p>
This is the output of the tracer on two small XML documents. The seed is a CONS that keeps track of the nesting level in its CAR and of its flow through the hooks, as an ever-increasing number, in its CDR:
</p>
<pre>S-XML 31 &gt; (with-input-from-string (in "&lt;FOO X='10' Y='20'&gt;&lt;P&gt;Text&lt;/P&gt;&lt;BAR/&gt;&lt;H1&gt;&lt;H2&gt;&lt;/H2&gt;&lt;/H1&gt;&lt;/FOO&gt;") (trace-xml in))
(new-element :name :FOO :attributes ((:Y . "20") (:X . "10")) :seed (0 . 0)) =&gt; (1 . 1)
(new-element :name :P :attributes () :seed (1 . 1)) =&gt; (2 . 2)
(text :string "Text" :seed (2 . 2)) =&gt; (2 . 3)
(finish-element :name :P :attributes () :parent-seed (1 . 1) :seed (2 . 3)) =&gt; (1 . 4)
(new-element :name :BAR :attributes () :seed (1 . 4)) =&gt; (2 . 5)
(finish-element :name :BAR :attributes () :parent-seed (1 . 4) :seed (2 . 5)) =&gt; (1 . 6)
(new-element :name :H1 :attributes () :seed (1 . 6)) =&gt; (2 . 7)
(new-element :name :H2 :attributes () :seed (2 . 7)) =&gt; (3 . 8)
(finish-element :name :H2 :attributes () :parent-seed (2 . 7) :seed (3 . 8)) =&gt; (2 . 9)
(finish-element :name :H1 :attributes () :parent-seed (1 . 6) :seed (2 . 9)) =&gt; (1 . 10)
(finish-element :name :FOO :attributes ((:Y . "20") (:X . "10")) :parent-seed (0 . 0) :seed (1 . 10)) =&gt; (0 . 11)
(0 . 11)

S-XML 32 &gt; (with-input-from-string (in "&lt;FOO&gt;&lt;UL&gt;&lt;LI&gt;1&lt;/LI&gt;&lt;LI&gt;2&lt;/LI&gt;&lt;LI&gt;3&lt;/LI&gt;&lt;/UL&gt;&lt;/FOO&gt;") (trace-xml in))
(new-element :name :FOO :attributes () :seed (0 . 0)) =&gt; (1 . 1)
(new-element :name :UL :attributes () :seed (1 . 1)) =&gt; (2 . 2)
(new-element :name :LI :attributes () :seed (2 . 2)) =&gt; (3 . 3)
(text :string "1" :seed (3 . 3)) =&gt; (3 . 4)
(finish-element :name :LI :attributes () :parent-seed (2 . 2) :seed (3 . 4)) =&gt; (2 . 5)
(new-element :name :LI :attributes () :seed (2 . 5)) =&gt; (3 . 6)
(text :string "2" :seed (3 . 6)) =&gt; (3 . 7)
(finish-element :name :LI :attributes () :parent-seed (2 . 5) :seed (3 . 7)) =&gt; (2 . 8)
(new-element :name :LI :attributes () :seed (2 . 8)) =&gt; (3 . 9)
(text :string "3" :seed (3 . 9)) =&gt; (3 . 10)
(finish-element :name :LI :attributes () :parent-seed (2 . 8) :seed (3 . 10)) =&gt; (2 . 11)
(finish-element :name :UL :attributes () :parent-seed (1 . 1) :seed (2 . 11)) =&gt; (1 . 12)
(finish-element :name :FOO :attributes () :parent-seed (0 . 0) :seed (1 . 12)) =&gt; (0 . 13)
(0 . 13)</pre>
<p>
The following example counts elements, attributes and characters:
</p>
<pre>(defclass count-xml-seed ()
  ((elements :initform 0)
   (attributes :initform 0)
   (characters :initform 0)))

(defun count-xml-new-element-hook (name attributes seed)
  (declare (ignore name))
  (incf (slot-value seed 'elements))
  (incf (slot-value seed 'attributes) (length attributes))
  seed)

(defun count-xml-text-hook (string seed)
  (incf (slot-value seed 'characters) (length string))
  seed)

(defun count-xml (in)
  "Parse a toplevel XML element from stream in, counting elements, attributes and characters"
  (start-parse-xml in
                   (make-instance 'xml-parser-state
                                  :seed (make-instance 'count-xml-seed)
                                  :new-element-hook #'count-xml-new-element-hook
                                  :text-hook #'count-xml-text-hook)))

(defun count-xml-file (pathname)
  "Parse XML from the file at pathname, counting elements, attributes and characters"
  (with-open-file (in pathname)
    (let ((result (count-xml in)))
      (with-slots (elements attributes characters) result
        (format t
                "~a contains ~d XML elements, ~d attributes and ~d characters.~%"
                pathname elements attributes characters)))))</pre>
<p>
This example removes XML markup:
</p>
<pre>(defun remove-xml-markup (in)
  (let* ((state (make-instance 'xml-parser-state
                               :text-hook #'(lambda (string seed) (cons string seed))))
         (result (start-parse-xml in state)))
    (apply #'concatenate 'string (nreverse result))))</pre>
<p>
The next example is from the xml-element struct DOM implementation, where the SSAX-style parser hook functions build the actual DOM:
</p>
<pre>(defun standard-new-element-hook (name attributes seed)
  (declare (ignore name attributes seed))
  '())

(defun standard-finish-element-hook (name attributes parent-seed seed)
  (let ((xml-element (make-xml-element :name name
                                       :attributes attributes
                                       :children (nreverse seed))))
    (cons xml-element parent-seed)))

(defun standard-text-hook (string seed)
  (cons string seed))

(defmethod parse-xml-dom (stream (output-type (eql :xml-struct)))
  (car (start-parse-xml stream
                        (make-instance 'xml-parser-state
                                       :new-element-hook #'standard-new-element-hook
                                       :finish-element-hook #'standard-finish-element-hook
                                       :text-hook #'standard-text-hook))))
</pre>
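<p>
Once such a DOM has been built, it can be walked with ordinary recursion. The sketch below is only an illustration: the defstruct is a stand-in with the three slots used on this page (with S-XML loaded, its own xml-element struct would be used instead of redefining it), and collect-text and *struct-doc* are hypothetical names.
</p>
<pre>;; Stand-in definition with the slots used on this page; an assumption,
;; to keep the sketch self-contained without loading S-XML.
(defstruct xml-element name attributes children)

(defun collect-text (node)
  "Concatenate all text in NODE and its descendants, in document order."
  (if (stringp node)
      node
      (apply #'concatenate 'string
             (mapcar #'collect-text (xml-element-children node)))))

(defparameter *struct-doc*
  (make-xml-element
   :name :|foo|
   :attributes '((:|id| . "top"))
   :children (list (make-xml-element :name :|bar| :children '("text")))))

;; returns "text"
(collect-text *struct-doc*)</pre>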
<p>
The parser state can be used to specify the initial seed value (nil by default) and the set of known entities (by default, the five standard entities lt, gt, amp, quot and apos, plus nbsp).
</p>
<h3>DOM</h3>
<p>
Using a DOM parser is easier, but usually less efficient. Currently three different DOMs are supported:
</p>
<ul>
<li>The DOM type <tt>:sxml</tt> is an <a href="http://pobox.com/~oleg/ftp/Scheme/SXML.html">SXML</a>-based one</li>
<li>The DOM type <tt>:lxml</tt> is an <a href="http://opensource.franz.com/xmlutils/xmlutils-dist/pxml.htm">LXML</a>-based one</li>
<li>The DOM type <tt>:xml-struct</tt> is a classic xml-element struct-based one</li>
</ul>
<p>
There is a generic API that is identical for each type of DOM, with an extra parameter <tt>input-type</tt> or <tt>output-type</tt> used to specify the type of DOM. The default DOM type is <tt>:lxml</tt>. Here are some examples:
</p>
<pre>? (in-package :s-xml)
#&lt;Package "S-XML"&gt;

? (setf xml-string "&lt;foo id='top'&gt;&lt;bar&gt;text&lt;/bar&gt;&lt;/foo&gt;")
"&lt;foo id='top'&gt;&lt;bar&gt;text&lt;/bar&gt;&lt;/foo&gt;"

? (parse-xml-string xml-string)
((:|foo| :|id| "top") (:|bar| "text"))

? (parse-xml-string xml-string :output-type :sxml)
(:|foo| (:@ (:|id| "top")) (:|bar| "text"))

? (parse-xml-string xml-string :output-type :xml-struct)
#S(XML-ELEMENT :NAME :|foo| :ATTRIBUTES ((:|id| . "top"))
:CHILDREN (#S(XML-ELEMENT :NAME :|bar|
:ATTRIBUTES NIL
:CHILDREN ("text"))))

? (print-xml * :pretty t :input-type :xml-struct)
&lt;foo id="top"&gt;
&lt;bar&gt;text&lt;/bar&gt;
&lt;/foo&gt;
NIL

? (print-xml '(p "Interesting stuff at " ((a href "http://slashdot.org") "SlashDot")))
&lt;P&gt;Interesting stuff at &lt;A HREF="http://slashdot.org"&gt;SlashDot&lt;/A&gt;&lt;/P&gt;
NIL</pre>
<p>
Tag and attribute names are converted to keywords. Because XML is case-sensitive, these keywords preserve the original case of the names, which is why Common Lisp prints them using the special <tt>|...|</tt> literal symbol syntax.
</p>
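<p>
An LXML result is plain list structure, so it can be taken apart with standard list functions. The following sketch works on the LXML value printed above; element-tag, element-attributes, element-children and *lxml-doc* are hypothetical names chosen for this illustration, and the bare-symbol case for empty elements is an assumption about the LXML format rather than something shown on this page. The last two forms illustrate the case-preserving keywords, assuming the standard readtable.
</p>
<pre>(defun element-tag (element)
  "Return the tag keyword of an LXML element."
  (cond ((symbolp element) element)                  ; bare tag (assumed form)
        ((symbolp (first element)) (first element))  ; (tag child...)
        (t (first (first element)))))                ; ((tag attr value...) child...)

(defun element-attributes (element)
  "Return the attributes of an LXML element as a property list."
  (if (and (consp element) (consp (first element)))
      (rest (first element))
      '()))

(defun element-children (element)
  "Return the list of children (strings and elements) of an LXML element."
  (if (consp element)
      (rest element)
      '()))

(defparameter *lxml-doc* '((:|foo| :|id| "top") (:|bar| "text")))

(element-tag *lxml-doc*)                        ; returns :|foo|
(getf (element-attributes *lxml-doc*) :|id|)    ; returns "top"
(element-children *lxml-doc*)                   ; returns ((:|bar| "text"))

;; :foo is read as :FOO by the standard reader, so it differs from :|foo|
(eq :|foo| :foo)                                ; returns NIL
(symbol-name :|foo|)                            ; returns "foo"</pre>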

<h3>Release History and ChangeLog</h3>

<pre>
2005-02-03 Sven Van Caekenberghe &lt;svc@mac.com&gt;

* release 5 (cvs tag RELEASE_5)
* added :start and :end keywords to print-string-xml
* fixed a bug: in a tag containing whitespace, like &lt;foo&gt; &lt;/foo&gt; the parser collapsed
  and ignored all whitespace and considered the tag to be empty!
  this is now fixed and a unit test has been added
* cleaned up xml character escaping a bit: single quotes and all normal whitespace
  (newline, return and tab) are preserved; a unit test for this has been added
* IE doesn't understand the &amp;apos; XML entity, so I've commented that out for now.
  Also, using actual newlines for newlines is probably better than using #xA,
  which won't get any end of line conversion by the server or user agent.

June 2004 Sven Van Caekenberghe &lt;svc@mac.com&gt;

* release 4
* project moved to common-lisp.net, renamed to s-xml
* added examples counter, tracer and remove-markup, improved documentation

13 Jan 2004 Sven Van Caekenberghe &lt;svc@mac.com&gt;

* release 3
* added ASDF systems
* optimized print-string-xml

10 Jun 2003 Sven Van Caekenberghe &lt;svc@mac.com&gt;

* release 2
* added echo-xml function: we are no longer taking the car when
  the last seed is returned from start-parse-xml

25 May 2003 Sven Van Caekenberghe &lt;svc@mac.com&gt;

* release 1
* first public release of working code
* tested on OpenMCL
* rewritten to be event-based, to improve efficiency and
  to optionally use different DOM representations
* more documentation

end of 2002 Sven Van Caekenberghe &lt;svc@mac.com&gt;

* release 0
* as part of an XML-RPC implementation
</pre>

<h3>Todo</h3>

<ul>
<li>Someone should find some time to add CDATA support (both in the parser and as a print function) - this shouldn't be too hard, and it would be really useful!</li>
</ul>

<h3>Mailing Lists</h3>

<ul>
<li><a href="http://common-lisp.net/mailman/listinfo/s-xml-cvs">S-XML-CVS mailing list info</a></li>
<li><a href="http://common-lisp.net/mailman/listinfo/s-xml-devel">S-XML-DEVEL mailing list info</a></li>
<li><a href="http://common-lisp.net/mailman/listinfo/s-xml-announce">S-XML-ANNOUNCE mailing list info</a></li>
</ul>

<p>CVS version $Id: index.html,v 1.10 2005/02/03 08:36:05 scaekenberghe Exp $</p>

<div class="footer">
<p>Back to <a href="http://common-lisp.net/">Common-lisp.net</a>.</p>
</div>

<div class="check">
<a href="http://validator.w3.org/check/referer">Valid XHTML 1.0 Strict</a>
<a href="http://jigsaw.w3.org/css-validator/check/referer">Valid CSS</a>
</div>
</body>
</html>