VTD-XML

Virtual Token Descriptor for eXtensible Markup Language (VTD-XML) refers to a collection of cross-platform XML processing technologies centered on a non-extractive, "document-centric" parsing technique called Virtual Token Descriptor (VTD). Depending on the perspective, VTD-XML can be viewed as one of the following:
* A "document-centric" XML parser
* A native XML indexer or a file format that uses binary data to enhance the text XML
* An incremental XML content modifier
* An XML slicer/splitter/assembler
* An XML editor/eraser
* A way to port XML processing on chip
* A non-blocking, stateless XPath evaluator

VTD-XML is developed by XimpleWare and dual-licensed under the GPL and a proprietary license. It was originally written in Java, but is now available in C, C++ and C#.


Basic concept


Non-extractive, document-centric parsing

Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated "extractive" parsing. In contrast, "non-extractive" tokenization mandates that one keeps the source text intact, and uses offsets and lengths to describe those tokens.
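
The difference between the two approaches can be illustrated with a minimal Java sketch. It is not an XML tokenizer: it splits a byte array on spaces, and the class and method names are hypothetical. The extractive variant copies every token into a new string object, while the non-extractive variant leaves the source untouched and records only an offset and a length per token.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: space-delimited tokens, not an XML tokenizer.
final class TokenizationSketch {

    // Non-extractive: the source stays intact; each token is an {offset, length} pair.
    static List<int[]> nonExtractive(byte[] source) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length) {
            while (i < source.length && source[i] == ' ') i++;  // skip separators
            int start = i;
            while (i < source.length && source[i] != ' ') i++;  // scan one token
            if (i > start) tokens.add(new int[] {start, i - start});
        }
        return tokens;
    }

    // Extractive: every token is copied out of the source into a new String object.
    static List<String> extractive(byte[] source) {
        List<String> tokens = new ArrayList<>();
        for (int[] t : nonExtractive(source)) {
            tokens.add(new String(source, t[0], t[1], StandardCharsets.US_ASCII));
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] source = "alpha beta gamma".getBytes(StandardCharsets.US_ASCII);
        for (int[] t : nonExtractive(source)) {
            System.out.println("offset=" + t[0] + " length=" + t[1]);
        }
    }
}
```

In the non-extractive case a token's text is materialized only when the application asks for it, by reading `length` bytes starting at `offset` in the original buffer.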


Virtual token descriptor

Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64 bits in length, they can be stored efficiently and managed as an array.
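
As a rough illustration of how four fields can share one 64-bit integer, the following Java sketch packs and unpacks a record with bit shifts and masks. The field widths (30-bit offset, 20-bit length, 8-bit depth, 4-bit type) and all names are assumptions made for this example; the description above only states that the four fields are encoded in 64 bits.

```java
// Illustrative sketch only. The field widths are assumptions for this example;
// the text above states only that all four fields share one 64-bit integer.
final class VtdRecordSketch {
    static final int OFFSET_BITS = 30;  // assumed width of the starting offset
    static final int LENGTH_BITS = 20;  // assumed width of the token length
    static final int DEPTH_BITS = 8;    // assumed width of the nesting depth
    static final int TYPE_BITS = 4;     // assumed width of the token type

    static final int LENGTH_SHIFT = 32; // bits 30-31 are left unused in this sketch
    static final int DEPTH_SHIFT = LENGTH_SHIFT + LENGTH_BITS;
    static final int TYPE_SHIFT = DEPTH_SHIFT + DEPTH_BITS;

    // Packs the four fields into a single long after range-checking each one.
    static long pack(int type, int depth, int length, int offset) {
        return ((long) check(type, TYPE_BITS) << TYPE_SHIFT)
             | ((long) check(depth, DEPTH_BITS) << DEPTH_SHIFT)
             | ((long) check(length, LENGTH_BITS) << LENGTH_SHIFT)
             | check(offset, OFFSET_BITS);
    }

    static int type(long record)   { return (int) (record >>> TYPE_SHIFT) & mask(TYPE_BITS); }
    static int depth(long record)  { return (int) (record >>> DEPTH_SHIFT) & mask(DEPTH_BITS); }
    static int length(long record) { return (int) (record >>> LENGTH_SHIFT) & mask(LENGTH_BITS); }
    static int offset(long record) { return (int) record & mask(OFFSET_BITS); }

    private static int mask(int bits) { return (1 << bits) - 1; }

    private static int check(int value, int bits) {
        if (value < 0 || value > mask(bits)) {
            throw new IllegalArgumentException("value does not fit in " + bits + " bits");
        }
        return value;
    }
}
```

Because every record is a single `long`, the tokens of a whole document fit in one `long[]`, with no per-token object allocation.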


Location cache

Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.
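
The entry layout described above can be sketched in Java as follows. The class name, the helper methods and the `NO_CHILD` sentinel are assumptions introduced for illustration; only the split into an upper and a lower 32-bit half comes from the description.

```java
// Illustrative sketch only. The NO_CHILD sentinel and the helper names are
// assumptions for this example, not part of the description above.
final class LocationCacheSketch {
    static final int NO_CHILD = -1; // assumed marker for an element without children

    // Upper 32 bits: index of the element's VTD record.
    // Lower 32 bits: index of its first child in the next level's table.
    static long entry(int vtdIndex, int firstChild) {
        return ((long) vtdIndex << 32) | (firstChild & 0xFFFFFFFFL);
    }

    static int vtdIndex(long lcEntry)   { return (int) (lcEntry >>> 32); }
    static int firstChild(long lcEntry) { return (int) lcEntry; }

    // Follows the first-child pointer from one level's table into the next and
    // returns the child's VTD record index, or NO_CHILD if there is none.
    static int firstChildVtdIndex(long[] level, long[] nextLevel, int i) {
        int child = firstChild(level[i]);
        return child == NO_CHILD ? NO_CHILD : vtdIndex(nextLevel[child]);
    }
}
```

With one `long[]` per nesting level, moving from an element to its first child is two array reads rather than a pointer chase through node objects.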


Benefits


Overview

Virtually all the core benefits of VTD-XML are inherent to non-extractive, document-centric parsing, which provides these characteristics:
* The source XML text is kept intact in memory without decoding.
* The internal representation of VTD-XML is inherently persistent.
* It obviates object-oriented modeling of the hierarchical representation, as it relies entirely on primitive data types (e.g., 64-bit integers) to represent the XML hierarchy, thus reducing object creation cost to nearly zero.

Combining those characteristics permits thinking of XML purely as syntax (bits, bytes, offsets, lengths, fragments, namespace-compensated fragments, and document composition) instead of the serialization/deserialization of objects. This is a powerful way to think about XML/SOA applications.


Conformance

VTD-XML conforms strictly to XML 1.0 (except the DTD part) and Namespaces in XML 1.0. It essentially conforms to the XPath 1.0 specification (with some subtle differences in the underlying data model), with extensions for XPath 2.0 built-in functions.


Simplicity


As parser

When used in parsing mode, VTD-XML is a general-purpose, high-performance XML parser which compares favorably with others:
* VTD-XML typically outperforms SAX (with a NULL content handler) while still providing full random access and built-in XPath support.
* VTD-XML typically consumes 1.3 to 1.5 times the XML document's size in memory, which is about 1/5 the memory usage of DOM.
* Applications written in VTD-XML are usually much shorter and cleaner than their DOM or SAX versions.


As indexer

Because of the inherent persistence of VTD-XML, developers can write the internal representation of a parsed XML document to disk and later reload it to avoid repetitive parsing. To this end, XimpleWare has introduced VTD+XML as a binary packaging format combining VTD, LC and the XML text. It can typically be viewed in one of the following two ways:
* A native XML index that completely eliminates the parsing cost and also retains all benefits of XML. It is a file format that is human readable and backward compatible with XML.
* A binary XML format that uses binary data to enhance the processing of the XML text.
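
The idea of persisting an index next to the text it describes can be sketched as follows. This is not the VTD+XML layout, which is not specified here: the byte layout (record count, records, text length, text) and all names are assumptions chosen to keep the example short. The point is that an array of 64-bit integers and an untouched byte buffer can be written and reloaded without any parsing step.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only: a made-up container layout, not the VTD+XML format.
final class IndexPersistenceSketch {

    // Holder for what comes back from a packed buffer.
    static final class Loaded {
        final long[] records;
        final byte[] xml;
        Loaded(long[] records, byte[] xml) { this.records = records; this.xml = xml; }
    }

    // Writes the record count, the records, the text length and the raw XML bytes.
    static byte[] pack(long[] records, byte[] xml) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(records.length);
            for (long r : records) out.writeLong(r);
            out.writeInt(xml.length);
            out.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the same layout back; no XML parsing takes place.
    static Loaded unpack(byte[] packed) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            long[] records = new long[in.readInt()];
            for (int i = 0; i < records.length; i++) records[i] = in.readLong();
            byte[] xml = new byte[in.readInt()];
            in.readFully(xml);
            return new Loaded(records, xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reloading costs one pass of sequential reads, which is why an index of this kind can replace repeated parsing of the same document.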


XML content modifier

Because VTD-XML keeps the XML text intact without decoding, when an application intends to modify the content of XML it only needs to modify the portions most relevant to the changes. This is in stark contrast with DOM, SAX, or StAX parsing, which incur the cost of parsing and re-serialization no matter how small the changes are. Since VTDs refer to document elements by their offsets, changes to the length of elements occurring earlier in a document require adjustments to VTDs referring to all later elements. However, those adjustments are integer additions, albeit to many integers in multiple tables, so they are quick.
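
The offset adjustment described above can be sketched in a few lines of Java. For simplicity the sketch keeps token offsets in a plain `int[]` rather than in packed records, and the class and method names are hypothetical. Inserting bytes at some position shifts every later token by the same amount, which is one integer addition per affected entry.

```java
// Illustrative sketch only: plain int offsets stand in for packed records.
final class OffsetShiftSketch {

    // Returns a new document with `insert` spliced in at byte position `at`.
    static byte[] splice(byte[] doc, int at, byte[] insert) {
        byte[] out = new byte[doc.length + insert.length];
        System.arraycopy(doc, 0, out, 0, at);
        System.arraycopy(insert, 0, out, at, insert.length);
        System.arraycopy(doc, at, out, at + insert.length, doc.length - at);
        return out;
    }

    // Adjusts offsets in place: tokens starting at or after `at` move by `delta`.
    static void shiftOffsets(int[] offsets, int at, int delta) {
        for (int i = 0; i < offsets.length; i++) {
            if (offsets[i] >= at) offsets[i] += delta;
        }
    }
}
```

For example, splicing the four bytes of `<c/>` into `<a><b/></a>` at position 3 leaves the offset of `a` at 1 and moves the offset of `b` from 4 to 8.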


XML slicer/splitter/assembler

An application based on VTD-XML can also use offsets and lengths to address tokens, or element fragments. This allows XML documents to be manipulated like arrays of bytes.
* As a slicer, VTD-XML can "slice" off a token or an element fragment from an XML document, then insert it back into another location in the same document, or into a different document.
* As a splitter, VTD-XML can split sub-elements in an XML document and dump each into a separate XML document.
* As an assembler, VTD-XML can "cut" chunks out of multiple XML documents and assemble them into a new XML document.
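
Treating fragments as byte ranges can be sketched as below. The names are hypothetical, the root element name passed to `assemble` is a placeholder, and the sketch ignores namespace declarations that a fragment may inherit from its ancestors, which a real implementation would have to compensate for.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: fragments are raw byte ranges; inherited namespace
// declarations are not handled.
final class FragmentSketch {

    // Slice: copy one fragment, addressed by offset and length, out of the document.
    static byte[] slice(byte[] doc, int offset, int length) {
        return Arrays.copyOfRange(doc, offset, offset + length);
    }

    // Assemble: wrap fragments, possibly from different documents, in a new root element.
    static byte[] assemble(String root, byte[]... fragments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] open = ("<" + root + ">").getBytes(StandardCharsets.UTF_8);
        byte[] close = ("</" + root + ">").getBytes(StandardCharsets.UTF_8);
        out.write(open, 0, open.length);
        for (byte[] f : fragments) out.write(f, 0, f.length);
        out.write(close, 0, close.length);
        return out.toByteArray();
    }
}
```

Splitting follows the same pattern: each sub-element's byte range is sliced out and written to its own output, without re-serializing any content.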


XML editor/eraser

Used as an editor/eraser, VTD-XML can directly edit/erase the underlying byte content of the XML text, provided that the token length is wider than the intended new content. An immediate benefit of this approach is that the application can immediately reuse the original VTD and LC. In contrast, when using VTD-XML to incrementally update an XML document, an application needs to reparse the updated document before the application can process it. An editor can be made smart enough to track the location of each token, permitting new, longer tokens to replace existing, shorter tokens by merely addressing the new token in separate memory outside that used to store the original document. Likewise, when reordering the document, element text does not need to be copied; only the LCs need to be updated. When a complete, contiguous XML document is needed, such as when saving it, the disparate parts can be reassembled into a new, contiguous document.
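
A minimal Java sketch of the in-place case follows. It overwrites a token's bytes directly and fills any leftover room with spaces; the padding choice and all names are assumptions for illustration, and padding is only harmless where surrounding whitespace is not significant. Since no byte moves, every other token keeps its offset and length.

```java
import java.util.Arrays;

// Illustrative sketch only: space padding is an assumption of this example.
final class InPlaceEditSketch {

    // Overwrites the token at [offset, offset + length) if the replacement fits.
    // Returns false and leaves the document untouched when it does not fit.
    static boolean overwrite(byte[] doc, int offset, int length, byte[] replacement) {
        if (replacement.length > length) return false;
        System.arraycopy(replacement, 0, doc, offset, replacement.length);
        Arrays.fill(doc, offset + replacement.length, offset + length, (byte) ' ');
        return true;
    }

    // Erases a token by blanking its bytes; the document length is unchanged.
    static void erase(byte[] doc, int offset, int length) {
        Arrays.fill(doc, offset, offset + length, (byte) ' ');
    }
}
```

When `overwrite` returns false, the new content is longer than the token and one of the other strategies described above is needed.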


Other benefits

VTD-XML also pioneers the non-blocking, stateless XPath evaluation approach.


Weaknesses

VTD-XML also exhibits a few noticeable shortcomings:
* As an XML parser, it does not support external entities declared in the DTD.
* As a file format, it increases the document size by about 30% to 50%.
* As an API, it is not compatible with DOM, SAX or StAX.
* It is difficult to support certain validation techniques, employed by DTD and XML Schema (e.g., default attributes and elements), that require modifications to the XML instances being parsed.


Areas of applications


General-purpose replacement for DOM or SAX

Because of VTD-XML's performance and memory advantages, it covers a larger portion of XML use cases than either DOM or SAX.
* Compared to DOM, VTD-XML processes XML documents 3 to 5 times larger in the same amount of physical memory, at about 3 to 10 times the performance.
* Compared to SAX, VTD-XML provides random access and XPath support, and outperforms SAX by at least a factor of 2.


XPath over huge XML documents

The extended edition of VTD-XML, combined with a 64-bit JVM, makes XPath-based processing possible over huge XML documents (up to 256 GB in size).


For SOA/WS/XML security

The combination of VTD-XML's high performance and incremental-update capability makes it essential for achieving the desired level of quality of service in SOA/WS/XML security applications.


For SOA/WS/XML intermediary

VTD-XML is well suited for SOA intermediary applications such as XML routers/switches/gateways, Enterprise Service Buses, and services aggregation points. All those applications perform the basic "store and forward" operations for which retaining the original XML is critical for minimizing latency. VTD-XML's incremental update capability also contributes significantly to the forwarding performance. VTD-XML's random-access capability lends itself well to XPath-based XML routing/switching/filtering common in AJAX and SOA deployment.


Intelligent SOA/WS/XML Load-balancing and Offloading

When an XML document travels through several middle-tier SOA components, the first message stop, after finishing the inspection of the XML document, can choose to send the VTD+XML file format to the downstream components to avoid repetitive parsing, thus improving throughput. By the same token, an intelligent SOA load balancer can choose to generate VTD+XML for incoming/outgoing SOAP messages to offload XML parsing from the application servers that receive those messages.


XML persistence data store

When viewed from the perspective of native XML persistence, VTD-XML can be used as a human-readable, easy to use, general-purpose XML index. XML documents stored this way can be loaded into memory to be queried, updated, or edited without the overhead of parsing/re-serialization.


Schemaless XML data binding

VTD-XML's combination of high performance, low memory usage, and efficient XPath evaluation makes possible a new XML data binding approach based entirely on XPath. The main benefits of this approach are that it no longer requires an XML schema, avoids needless object creation, and takes advantage of XML's inherent loose encoding. Note that such data binding needs to be implemented by the application: VTD-XML itself only offers accessors. In this regard VTD-XML is not a data binding solution itself (unlike JiBX, JAXB, XMLBeans), although it offers extraction functionality for data binding packages, much like other XML parsers (DOM, SAX, StAX).
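
The general shape of XPath-only binding can be sketched with the XPath engine that ships with the JDK (`javax.xml.xpath` over a DOM tree) rather than with VTD-XML itself, so the sketch does not share VTD-XML's memory or object-creation characteristics. The `Order` class, its fields and the XPath expressions are hypothetical. No schema is involved: the application names the values it wants and ignores everything else in the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Illustrative sketch only: uses the JDK's DOM-based XPath, not VTD-XML.
final class XPathBindingSketch {

    // Hypothetical business object populated from the document.
    static final class Order {
        final String customer;
        final double total;
        Order(String customer, double total) { this.customer = customer; this.total = total; }
    }

    // Binds only the values named by the XPath expressions; unknown elements are ignored.
    static Order bind(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String customer = xpath.evaluate("/order/customer", doc);
            Double total = (Double) xpath.evaluate("sum(/order/item/@price)", doc,
                    XPathConstants.NUMBER);
            return new Order(customer, total);
        } catch (Exception e) {
            throw new IllegalStateException("binding failed", e);
        }
    }
}
```

Because the binding depends only on the paths it queries, elements added to the document later do not break it, which is the loose coupling the approach relies on.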


Essential classes

As of version 2.11, the Java and C# versions of VTD-XML consist of the following classes:
* VTDGen (VTD Generator) is the class that encapsulates the main parsing, index loading and index writing functions.
* VTDNav (VTD Navigator) is the class that (1) encapsulates XML, VTD, and hierarchical info, (2) contains various navigation methods, (3) performs various comparisons between VTD records and strings, and (4) converts VTD records to primitive data types.
* AutoPilot is a class containing functions that perform node-level iteration and XPath.
* XMLModifier is a class that offers incremental update capability, such as delete, insert and update.

The extended VTD-XML consists of the following classes:
* VTDGenHuge (Extended VTD Generator) encapsulates the main parsing.
* XMLBuffer performs in-memory loading of XML documents.
* XMLMemMappedBuffer performs memory mapped loading of XML documents.
* VTDNavHuge (Extended VTD Navigator) (1) encapsulates XML, Extended VTD, and hierarchical info, (2) contains various navigation methods, (3) performs various comparisons between VTD records and strings, and (4) converts VTD records to primitive data types.
* AutoPilotHuge performs node-level iteration and XPath.


Code sample

/* In this Java program, we demonstrate how to use XMLModifier to incrementally
 * update a simple XML purchase order. We also use VTDGen's parseFile to
 * simplify programming.
 */
import com.ximpleware.*;

public class Update

