XPB4J User Guide

Version 0.8

Author: Pankaj Kumar
e-mail: pankaj_kumar@acm.org
Date: October 3 2001

Introduction

XML Processing Benchmark for Java (XPB4J) is a Java-based performance measurement and comparison program for XML processing software. In this guide, XML processing means any operation such as parsing, transformation, validation, encryption/decryption or custom access/manipulation, or any combination of these, applied to one or more XML files and/or byte streams.

Specific examples of such processing include:

validation of an XML file against a specified XML schema file;
creation/verification of digital signature as per XML Digital Signature standard;
validation of XML content as per a given set of business rules;
merging two or more XML files as per a specified set of rules using XSLT stylesheet or otherwise;
creating memory objects from XML content and vice-versa as per specified data binding rules;

XPB4J doesn't define any benchmark standard; it simply defines a framework to execute Processing Activities and measure their performance characteristics. It also includes code to do specific processing. If the same operation can be performed with different Processing Methods ( say, using different parsing APIs such as SAX, DOM, JDOM or Pull Parser API ), then the performance characteristics of these can be measured and compared. One could also use different parsers and/or transformers and compare the results for the same processing method.

I wrote XPB4J primarily to

learn about different XML processing APIs;
enable myself and my fellow programmers to experiment with different ways of doing the same processing and understand the trade-offs and hence help us make better design and deployment choices;
track evolution of XML processing software with respect to their performance characteristics.

I have exercised XPB4J for a specific processing activity which I call XStat Processing. This processing essentially gathers certain statistical information from the input XML document. The different processing methods used are:

SAX -- linear scan using a JAXP compliant SAX parser;
DOM -- building W3C DOM structure using a JAXP compliant DOM parser;
JDOM -- building JDOM structure using JDOM software;
PULL -- linear scan using a Pull Parser;
XSLT -- using Java extension functions in an XSL stylesheet and using an XSLT transformer; and
COCOON -- using a Cocoon transformer in Cocoon Processing Framework.

You can find my observations and conclusions under section XStat Measurements. You could also run XPB4J on your machine with your favourite parser/transformer and your typical input and observe the results.

If your interest is in finding out performance and memory usage of your own custom processing, you can write your own classes using XPB4J Framework to invoke your processing and collect the relevant data.

The rest of the guide is organized into the following sections:

Installing XPB4J
Running XPB4J
XStat Processing
XStat Measurements
Building XPB4J
XPB4J Framework
Known Limitations
Future Directions

Note: The directory paths and execution script names in this document use the MS WINDOWS convention. Their UNIX equivalents can be derived simply by replacing \ with / in path names and .bat with .sh in script names.

Note: This version of XPB4J contains script files for MS WINDOWS platform only.

Installing XPB4J

Download XPB4J distribution file xpb4j-0.8.zip from http://www.pankaj-k.net/xpb4j and unzip it in a suitable directory. This should create directory xpb4j-0.8, also referred to as the base directory, and place all the binaries, scripts, documents and sources at appropriate places.

The distribution includes following third party jar files in subdirectory xpb4j-0.8\lib:

crimson.jar -- Crimson1.1.2beta2. A JAXP compliant parser.
jdom.jar -- JDOM beta7.
PullParser2_0_2.jar -- Pull Parser 2.0.2.

Presence of these jar files allows "out of box" execution of XPB4J for XStat Processing ( except for processing methods XSLT and COCOON ). To try out XPB4J, go to directory xpb4j-0.8, ensure that environment variable JAVA_HOME is set to the JDK installation directory and issue the command:

      >run

This should report the performance and memory usage measurements on your machine. To use other parsers, Cocoon2 and/or an XSLT transformer or newer versions of supplied parsers, download them from their respective sites and place the corresponding .jar files in xpb4j-0.8\lib directory:

Get JDOM from JDOM site.
Get xerces parser from Apache Xerces site.
Get xalan processor from Apache xalan site.
Get pull parser from Pull Parser site.
Get Cocoon2 from Cocoon2 site.

You can also try XPB4J with other JAXP compliant parsers.

Running XPB4J

As illustrated in the previous section, running XPB4J with default arguments is very simple.

You can change the execution arguments by editing an XML file.

The execution arguments are specified in the args.xml file ( located in the base directory ). You can specify the following in this file:

Attribute loopCount of element Params -- Specifies the number of times each processing is executed. The default value is 10.
Element Targets -- A list of targets ( string values ) to be passed to the processing code. The default list contains only one target: Data\rxgen.xml.
Element ProcessingActivities -- A list of ProcessingActivity elements. ProcessingActivity names are specified in configuration file conf.xml ( located in the base directory ).
Element ProcessingMethods -- A list of ProcessingMethod elements. ProcessingMethod names and corresponding Java classes are specified in configuration file conf.xml ( located in the base directory ).
Attribute enabled of element ProcessingMethod -- A true value indicates that the corresponding class be loaded and invoked.
Attribute gc of element Flags -- a true value forces garbage collection before initiating the processing but outside the measurement window.
Attribute gcMeasured of element Flags -- a true value forces garbage collection within the measurement window.
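
The difference between the gc and gcMeasured flags is easiest to see in code. The following is only a minimal sketch of a measurement loop, not XPB4J's actual implementation: the class name MeasureSketch, the Sample type and the measure method are hypothetical names made up for this illustration. Note also that System.gc() is only a request to the JVM, so "forces" should be read loosely.

```java
public class MeasureSketch {

    /** One measurement: elapsed wall-clock time and change in used heap. */
    public static final class Sample {
        public final long elapsedNanos;
        public final long usedHeapDelta;

        Sample(long elapsedNanos, long usedHeapDelta) {
            this.elapsedNanos = elapsedNanos;
            this.usedHeapDelta = usedHeapDelta;
        }
    }

    /** Heap currently in use, as reported by the JVM. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the processing loopCount times (hypothetical helper).
     * gc:         request a collection BEFORE the measurement window opens.
     * gcMeasured: request a collection INSIDE the measurement window,
     *             so its cost is included in the elapsed time.
     */
    public static Sample[] measure(Runnable processing, int loopCount,
                                   boolean gc, boolean gcMeasured) {
        Sample[] samples = new Sample[loopCount];
        for (int i = 0; i < loopCount; i++) {
            if (gc) {
                System.gc();              // outside the window
            }
            long heapBefore = usedHeap();
            long start = System.nanoTime();   // window opens

            processing.run();
            if (gcMeasured) {
                System.gc();              // inside the window
            }

            long elapsed = System.nanoTime() - start;   // window closes
            samples[i] = new Sample(elapsed, usedHeap() - heapBefore);
        }
        return samples;
    }

    public static void main(String[] args) {
        // A stand-in workload; a real run would invoke an XML processing method.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        Sample[] samples = measure(work, 10, true, false);
        for (int i = 0; i < samples.length; i++) {
            System.out.println("run " + i + ": " + samples[i].elapsedNanos
                    + " ns, heap delta " + samples[i].usedHeapDelta + " bytes");
        }
    }
}
```

With gc set, each run starts from a cleaner heap and the collection cost stays out of the timing; with gcMeasured set, the timing also includes the cost of reclaiming whatever the processing allocated.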

Here is a sample args.xml and conf.xml.

Note the relationship between args.xml and conf.xml file. File conf.xml contains all the available processing activities and methods. File args.xml contains information required for a specific execution.

You can add more processing activities and processing methods by simply writing classes as per XPB4J Framework and adding appropriate entries into the conf.xml file. Before execution, however, you should ensure that all the classes are accessible as per the current CLASSPATH. A successful execution of XPB4J writes the measurements in file pdata.xml and processing results in file results.xml, both located in the base directory.

Running XPB4J with Cocoon

To enable and run XPB4J for Cocoon processing, you must have Cocoon installed and you must build XPB4J with Cocoon.

Set environment variable C2_LIB to the directory containing cocoon*.jar and all other required jar files, and then run XPB4J's execution script run.bat from the base directory.

Generating Input Files

XPB4J includes a simple Java program, RandXMLGen.java, to generate arbitrarily sized random XML documents. This program generates XML elements and attributes by picking them randomly from a given set. The size of the generated file is determined by a numeric argument to the program specifying the number of children of the topmost element, RXGenTopElement. Look at the source file org\xperf\xpb\RandXMLGen.java ( included in the distribution ) to understand how the input file is generated. To run this program with argument value 100, go to the base directory and issue the command:

      >rxgen 100

This generates the file Data\rxgen.xml. Note that the Java program writes the output on standard output but the rxgen.bat script redirects it to Data\rxgen.xml.
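
To give a feel for this kind of generator, here is a small sketch written for this guide. It is not the RandXMLGen.java source: the class name RandXmlSketch, the element and attribute vocabularies, and the fixed random seed are all assumptions made for illustration. Only the top element name RXGenTopElement and the "number of children" argument come from the description above.

```java
import java.util.Random;

public class RandXmlSketch {

    // Hypothetical vocabularies; the real program has its own sets.
    private static final String[] ELEMENTS = {"alpha", "beta", "gamma"};
    private static final String[] ATTRIBUTES = {"id", "kind", "rank"};

    /**
     * Builds a document whose top element has childCount children,
     * each with a randomly picked name and one randomly picked attribute.
     */
    public static String generate(int childCount, long seed) {
        Random rnd = new Random(seed);
        StringBuilder out = new StringBuilder();
        out.append("<?xml version=\"1.0\"?>\n");
        out.append("<RXGenTopElement>\n");
        for (int i = 0; i < childCount; i++) {
            String name = ELEMENTS[rnd.nextInt(ELEMENTS.length)];
            String attr = ATTRIBUTES[rnd.nextInt(ATTRIBUTES.length)];
            out.append("  <").append(name)
               .append(' ').append(attr).append("=\"")
               .append(rnd.nextInt(1000)).append("\">")
               .append("text").append(i)
               .append("</").append(name).append(">\n");
        }
        out.append("</RXGenTopElement>\n");
        return out.toString();
    }

    public static void main(String[] args) {
        // Like the original program, write to standard output and let
        // the caller redirect it to a file.
        int childCount = args.length > 0 ? Integer.parseInt(args[0]) : 100;
        System.out.print(generate(childCount, 42L));
    }
}
```

A fixed seed is used here so that repeated runs produce the same input file, which keeps measurements comparable; whether the real generator does this is not stated in this guide.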

XStat Processing

XStat processing consists of scanning one or more XML files and collecting the following statistical information for each element:

the number of times the element occurred
the number of times it had a particular element as a parent
the number of times it had a particular element as a child
the number of times it had a particular attribute
the amount of character data that had at least some non-whitespace characters
whether the element was always empty
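
As an illustration, the statistics listed above can be gathered in a single SAX pass. The sketch below was written for this guide and is not the code shipped under org.xperf.xpb.xstat.sax; the class name XStatSaxSketch, the ElementStats holder and its field names are hypothetical. It also assumes that "empty" means "no child elements and no non-whitespace character data", which this guide does not define precisely.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XStatSaxSketch extends DefaultHandler {

    /** Per-element-name statistics (hypothetical holder type). */
    public static final class ElementStats {
        public int count;
        public final Map<String, Integer> parents = new TreeMap<>();
        public final Map<String, Integer> children = new TreeMap<>();
        public final Map<String, Integer> attributes = new TreeMap<>();
        public int charCount;
        public boolean alwaysEmpty = true;
    }

    public final Map<String, ElementStats> stats = new TreeMap<>();
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        ElementStats s = stats.computeIfAbsent(qName, k -> new ElementStats());
        s.count++;
        String parent = open.peek();
        if (parent != null) {
            s.parents.merge(parent, 1, Integer::sum);
            ElementStats p = stats.get(parent);
            p.children.merge(qName, 1, Integer::sum);
            p.alwaysEmpty = false;           // parent has a child element
        }
        for (int i = 0; i < atts.getLength(); i++) {
            s.attributes.merge(atts.getQName(i), 1, Integer::sum);
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver text in several chunks; each chunk is
        // judged on its own here, which is a simplification.
        if (!new String(ch, start, length).trim().isEmpty()) {
            ElementStats s = stats.get(open.peek());
            s.charCount += length;
            s.alwaysEmpty = false;
        }
    }

    /** Parses the given XML text; checked parser errors are wrapped. */
    public static Map<String, ElementStats> collect(String xml) {
        try {
            XStatSaxSketch handler = new XStatSaxSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.stats;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, ElementStats> m =
                collect("<a><b x='1'>hi</b><b/><c> </c></a>");
        for (Map.Entry<String, ElementStats> e : m.entrySet()) {
            ElementStats s = e.getValue();
            System.out.println(e.getKey() + ": count=" + s.count
                    + " parents=" + s.parents + " children=" + s.children
                    + " attributes=" + s.attributes + " chars=" + s.charCount
                    + " alwaysEmpty=" + s.alwaysEmpty);
        }
    }
}
```

Only a stack of open element names is kept between events, so memory use depends on document depth and on the number of distinct element names rather than on document size.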

Acknowledgement: I have borrowed the idea behind this processing from the article Using The Perl XML::Parser Module.

XStat Processing Methods

XPB4J includes code to perform XStat processing using following Processing Methods:

SAX -- The XML input is accessed using a SAX API and relevant information is stored in a suitable datastructure. The SAX parser is accessed using JAXP API. Refer to sources under package org.xperf.xpb.xstat.sax for details.
DOM -- The XML input is converted into a W3C DOM object and is traversed to gather the relevant statistics. The DOM parser is accessed using JAXP API. Refer to sources under package org.xperf.xpb.xstat.dom for details.
PULL -- The XML input is accessed using the Pull Parser API available at Pull Parser site and relevant information is stored in a suitable datastructure, as with SAX processing method. Refer to sources under package org.xperf.xpb.xstat.pull for details.
JDOM -- The XML input is converted into JDOM document and is traversed to gather the relevant statistics. This is very similar to DOM processing method. Refer to sources under package org.xperf.xpb.xstat.jdom for details.
XSLT -- A stylesheet with Java extension functions is applied to the XML input using an XSL transformer. The stylesheet invokes the appropriate Java function on encountering element nodes, attributes and text nodes. The transformer is obtained and invoked using JAXP API. Refer to sources under package org.xperf.xpb.xstat.xslt for details.
COCOON -- A Cocoon file generator is used to generate SAX events corresponding to input file and a Cocoon transformer gathers the relevant statistics. This transformer eats the SAX events and doesn't pass them to the serializer. Cocoon is invoked in commandline mode and not as a servlet. Refer to sources under package org.xperf.xpb.xstat.cocoon for details.
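
The tree-based methods ( DOM, JDOM ) differ from the scanning methods ( SAX, PULL ) in that the whole document is first built in memory and then traversed. The following is a rough sketch of the DOM style, reduced to counting element occurrences; it is not the code under org.xperf.xpb.xstat.dom, and the class name XStatDomSketch and method countElements are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XStatDomSketch {

    /** Builds a W3C DOM tree through JAXP, then walks it to count elements. */
    public static Map<String, Integer> countElements(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, Integer> counts = new TreeMap<>();
            walk(doc.getDocumentElement(), counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    /** Depth-first traversal over the already built tree. */
    private static void walk(Node node, Map<String, Integer> counts) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            counts.merge(node.getNodeName(), 1, Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, counts);
        }
    }

    public static void main(String[] args) {
        // prints {a=1, b=2, c=1}
        System.out.println(countElements("<a><b/><b><c/></b></a>"));
    }
}
```

Because the full tree exists before the traversal starts, memory use for this style grows with the size of the input document, which is one of the trade-offs XPB4J is meant to expose.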

XStat Measurements

Here is a set of Measurements and Conclusions on XStat Processing.

Building XPB4J

To build XPB4J, carry out following steps:

If you do not have jakarta-ant, install it now. You can get it from Apache Jakarta site.
If you do not have xalan*.jar, install xalan. You can get it from Apache XML site.
Set environment variable ANT_HOME to point to Ant's installation directory.
Copy xalan*.jar to the xpb4j-0.8\lib directory.

Now run XPB4J's build script in the base directory:

      >.\build

Known Limitations

Build and execution scripts for Linux/UNIX are not present. It should be fairly simple to write these scripts. If you do, please share them with me and I will include them.
XSL stylesheet for XSLT processing method for XStat uses XALAN processor specific extensions and may not be portable to other processors. It should be simple to write stylesheets for other processors. If you do, please share with me and I will include those.
It is currently not possible to compare measurements for two different parsers using the same API in one execution run. You can get around this by running XPB4J multiple times, changing CLASSPATH for each run so that the appropriate parser is used.
It is not possible to mix Cocoon processing with processing using other methods due to CLASSPATH conflicts.
Javadocs for XPB4J Framework do not exist. However, the framework itself is quite simple and if you really want to use it for your own processing code, you should be able to do so by looking at the code and using the XStat code as a sample.
The Cocoon processing code for XStat doesn't work with Cocoon2 beta2.
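
When switching parsers between runs by changing CLASSPATH, it helps to confirm which implementation JAXP actually picked up. The small check below is a sketch written for this guide and is not part of XPB4J; the class name WhichJaxpSketch is hypothetical. It relies only on the standard JAXP factory lookup, which honours system properties such as javax.xml.parsers.SAXParserFactory before falling back to what it finds on the class path. Whether run.bat passes such properties through to the JVM is not covered here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

public class WhichJaxpSketch {

    /** Class name of the SAX parser factory JAXP resolves right now. */
    public static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Class name of the DOM builder factory JAXP resolves right now. */
    public static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Class name of the XSLT transformer factory JAXP resolves right now. */
    public static String transformerFactoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX parser factory:  " + saxFactoryName());
        System.out.println("DOM builder factory: " + domFactoryName());
        System.out.println("Transformer factory: " + transformerFactoryName());
    }
}
```

Running this with the same CLASSPATH as a measurement run shows which parser and transformer the numbers in that run belong to.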

Future Directions

I plan to evolve XPB4J in the following areas:

More processing operations: XML schema based validation, data binding, XML encryption/decryption, XML Digital Signature etc.
More options to invoke Cocoon based processing.
Generator of random XML files as per a specified schema and other desired characteristics.
Collating performance data from multiple execution runs for presentation.
GUI to specify input runs and other parameters.
Ability to use multiple parsers with same interface ( say xerces and crimson ) in the same execution cycle and compare the results.
Stylesheets to transform the result and performance files for better presentation.
Concurrent execution of operations to measure concurrency support on multi-cpu machines.
Support parsers/transformers written in languages other than Java.
