XML retrieval

XML retrieval, or XML information retrieval, is the content-based retrieval of documents structured with XML (eXtensible Markup Language). As such, it is used for computing the relevance of XML documents.


Queries

Most XML retrieval approaches compute relevance with techniques from the information retrieval (IR) area, e.g. by computing the similarity between a query consisting of keywords (query terms) and the document. In XML retrieval, however, the query can also contain structural hints. So-called "content and structure" (CAS) queries enable users to specify what structure the requested content can or must have.
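
A CAS query can be illustrated with a minimal Python sketch. It is not taken from any particular system: the sample document, the function names `keyword_score` and `cas_query`, and the toy overlap score are all assumptions made for illustration. The structural hint is treated here as a strict "must" constraint, expressed as a path that Python's standard `xml.etree.ElementTree` module understands.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
DOC = """
<article>
  <title>XML retrieval</title>
  <section>
    <title>Ranking</title>
    <p>Ranking combines content relevance and structural similarity.</p>
  </section>
  <section>
    <title>Databases</title>
    <p>XML databases store semi-structured data.</p>
  </section>
</article>
"""

def keyword_score(text, terms):
    """Fraction of query terms that occur in the text (a toy similarity)."""
    words = set(text.lower().replace(".", " ").split())
    return sum(t in words for t in terms) / len(terms)

def cas_query(root, path, terms):
    """Return (score, element) pairs for elements that match the structural
    hint `path`, ranked by keyword overlap with `terms`."""
    hits = []
    for elem in root.findall(path):
        text = " ".join(elem.itertext())
        score = keyword_score(text, terms)
        if score > 0:
            hits.append((score, elem))
    return sorted(hits, key=lambda h: h[0], reverse=True)

root = ET.fromstring(DOC)
# Content part: the keywords. Structure part: "the answer must be a section".
results = cas_query(root, ".//section", ["ranking", "relevance"])
top_titles = [elem.findtext("title") for _, elem in results]
```

A system that treats the hint as "can" rather than "must" would presumably also score elements outside the path and merely prefer matching ones; that variant is not shown here.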


Exploiting XML structure

Taking advantage of the self-describing structure of XML documents can improve the search for XML documents significantly. This includes the use of CAS queries, the assignment of different weights to different XML elements, and the focused retrieval of subdocuments.
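
Element weighting can be sketched as follows. The weight table, the tag names, and the function `weighted_score` are hypothetical and chosen only for illustration; actual systems may derive such weights very differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-tag weights: a match in a title counts more than
# a match in ordinary body text. Unlisted tags default to 1.0.
ELEMENT_WEIGHTS = {"title": 3.0, "abstract": 2.0, "p": 1.0}

def weighted_score(doc_xml, terms):
    """Sum query-term occurrences, each multiplied by the weight of the
    element in which it occurs."""
    root = ET.fromstring(doc_xml)
    score = 0.0
    for elem in root.iter():
        weight = ELEMENT_WEIGHTS.get(elem.tag, 1.0)
        # Only the text directly inside the element (before its first
        # child) is counted, so nested text is not double-counted.
        # Tail text after child elements is ignored in this sketch.
        words = (elem.text or "").lower().split()
        score += weight * sum(words.count(t) for t in terms)
    return score

doc_a = "<doc><title>xml retrieval</title><p>an overview</p></doc>"
doc_b = "<doc><title>an overview</title><p>xml retrieval</p></doc>"
score_a = weighted_score(doc_a, ["xml", "retrieval"])
score_b = weighted_score(doc_b, ["xml", "retrieval"])
```

Both documents contain the same words, but `doc_a` scores higher because its matches fall in the more heavily weighted `title` element.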


Ranking

Ranking in XML retrieval can incorporate both content relevance and structural similarity, which is the resemblance between the structure given in the query and the structure of the document. Also, the retrieval units resulting from an XML query are not always entire documents; they can be XML elements nested at any depth, i.e. dynamic documents. The aim is to find the smallest retrieval unit that is highly relevant. Relevance can be defined according to the notion of specificity, which is the extent to which a retrieval unit focuses on the topic of the request.
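
The idea of preferring a small, focused element over a whole document can be sketched as below. The specificity measure used here (the share of an element's words that are query terms) is a deliberately crude assumption, and the names `specificity` and `rank_elements` are hypothetical.

```python
import xml.etree.ElementTree as ET

def specificity(elem, terms):
    """Share of the element's words (including nested text) that are
    query terms. A toy stand-in for a real specificity measure."""
    words = " ".join(elem.itertext()).lower().split()
    if not words:
        return 0.0
    return sum(w in terms for w in words) / len(words)

def rank_elements(root, terms):
    """Score every element in the tree, at any nesting depth, and
    return (score, element) pairs with the most specific first."""
    scored = [(specificity(e, terms), e) for e in root.iter()]
    return sorted(scored, key=lambda s: s[0], reverse=True)

# Hypothetical document: only one paragraph is about the query topic.
DOC = (
    "<article>"
    "<sec><p>XML ranking of XML elements</p>"
    "<p>Printing presses in old Europe</p></sec>"
    "<sec><p>A short history of paper</p></sec>"
    "</article>"
)

root = ET.fromstring(DOC)
ranking = rank_elements(root, {"xml", "ranking", "elements"})
best_score, best_elem = ranking[0]
```

The single on-topic paragraph outranks both its enclosing section and the whole article, because the surrounding off-topic text dilutes their scores. A measure this simple favours very short elements, so it should be read as an illustration of the principle only.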


Existing XML search engines

An overview of two potential approaches is available. The INitiative for the Evaluation of XML-Retrieval (INEX) was founded in 2002 and provides a platform for evaluating such algorithms. Three different areas influence XML retrieval:


Traditional XML query languages

Query languages such as the W3C standard XQuery support complex queries, but only look for exact matches. Therefore, they need to be extended to allow for vague search with relevance computing. Most XML-centered approaches also presuppose fairly exact knowledge of the documents' schemas.


Databases

Classic database systems have added the ability to store semi-structured data, which has led to the development of XML databases. These systems are often very formal, concentrate more on searching than on ranking, and are used by experienced users who are able to formulate complex queries.


Information retrieval

Classic information retrieval models such as the vector space model provide relevance ranking, but do not take document structure into account; only flat queries are supported. They also apply a static document concept, so retrieval units are usually entire documents. These models can be extended to consider structural information and dynamic document retrieval. Examples of approaches extending the vector space model are available: they use document subtrees (index terms plus structure) as dimensions of the vector space.
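
A simplified version of such an extension can be sketched in Python. As an assumption made for brevity, each dimension here is a pair of a tag path and a word rather than a full subtree, and the function names `structural_terms` and `cosine` are hypothetical. The query is itself written as a small XML fragment.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def structural_terms(xml_text):
    """Count (tag path, word) pairs. Each distinct pair is treated as
    one dimension of the vector space."""
    counts = Counter()

    def walk(elem, path):
        path = path + "/" + elem.tag
        for word in (elem.text or "").lower().split():
            counts[(path, word)] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# The query asks for the word "xml" inside a book title.
query = structural_terms("<book><title>xml</title></book>")
doc_title = structural_terms(
    "<book><title>xml</title><author>smith</author></book>")
doc_author = structural_terms(
    "<book><title>cooking</title><author>xml</author></book>")

sim_title = cosine(query, doc_title)
sim_author = cosine(query, doc_author)
```

A flat vector space model would see the word "xml" in both documents. With structural dimensions, only the document with "xml" in its title shares a dimension with the query, so the other document receives a similarity of zero.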


Data-centric XML datasets

For data-centric XML datasets, a dedicated keyword search method for XML databases, XDMA, has been designed and developed; it is based on dual indexing and mutual summation.


See also

* Document retrieval
* Information retrieval applications

