HOME

TheInfoList



OR:

Semi-structured data is a form of
structured data A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be ...
that does not obey the tabular structure of data models associated with
relational database A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured for ...
s or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data are increasingly occurring since the advent of the
Internet The Internet (or internet) is the Global network, global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a internetworking, network of networks ...
where
full-text In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts r ...
documents A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ', which denotes a "teaching" or "lesson": ...
and
databases In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and ana ...
are not the only forms of data anymore, and different applications need a medium for exchanging information. In
object-oriented databases An object database or object-oriented database is a database management system in which information is represented in the form of objects as used in object-oriented programming. Object databases are different from relational databases which are ...
, one often finds semi-structured data.


Types


XML

XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
, other markup languages,
email Electronic mail (usually shortened to email; alternatively hyphenated e-mail) is a method of transmitting and receiving Digital media, digital messages using electronics, electronic devices over a computer network. It was conceived in the ...
, and EDI are all forms of semi-structured data. OEM (Object Exchange Model)Stanford Universities Lore DBMS
/ref> was created prior to XML as a means of self-describing a data structure. XML has been popularized by web services that are developed utilizing
SOAP Soap is a salt (chemistry), salt of a fatty acid (sometimes other carboxylic acids) used for cleaning and lubricating products as well as other applications. In a domestic setting, soaps, specifically "toilet soaps", are surfactants usually u ...
principles. Some types of data described here as "semi-structured", especially XML, suffer from the impression that they are incapable of structural rigor at the same functional level as Relational Tables and Rows. Indeed, the view of XML as inherently semi-structured (previously, it was referred to as "unstructured") has handicapped its use for a widening range of data-centric applications. Even documents, normally thought of as the epitome of semi-structure, can be designed with virtually the same rigor as
database schema The database schema is the structure of a database described in a formal language supported typically by a relational database management system (RDBMS). The term "wikt:schema, schema" refers to the organization of data as a blueprint of how the ...
, enforced by the
XML schema An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constrai ...
and processed by both commercial and custom software programs without reducing their usability by human readers. In view of this fact, XML might be referred to as having "flexible structure" capable of human-centric flow and hierarchy as well as highly rigorous element structure and data typing. The concept of XML as "human-readable", however, can only be taken so far. Some implementations/dialects of XML, such as the XML representation of the contents of a Microsoft Word document, as implemented in Office 2007 and later versions, utilize dozens or even hundreds of different kinds of tags that reflect a particular problem domain - in Word's case, formatting at the character and paragraph and document level, definitions of styles, inclusion of citations, etc. - which are nested within each other in complex ways. Understanding even a portion of such an XML document by reading it, let alone catching errors in its structure, is impossible without a very deep prior understanding of the specific XML implementation, along with assistance by software that understands the XML schema that has been employed. Such text is not "human-understandable" any more than a book written in Swahili (which uses the Latin alphabet) would be to an American or Western European who does not know a word of that language: the tags are symbols that are meaningless to a person unfamiliar with the domain.


JSON

JSON JSON (JavaScript Object Notation, pronounced or ) is an open standard file format and electronic data interchange, data interchange format that uses Human-readable medium and data, human-readable text to store and transmit data objects consi ...
or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects. JSON has been popularized by web services developed utilizing
REST REST (Representational State Transfer) is a software architectural style that was created to describe the design and guide the development of the architecture for the World Wide Web. REST defines a set of constraints for how the architecture of ...
principles. Databases such as
MongoDB MongoDB is a source-available, cross-platform, document-oriented database program. Classified as a NoSQL database product, MongoDB uses JSON-like documents with optional database schema, schemas. Released in February 2009 by 10gen (now MongoDB ...
and
Couchbase Couchbase Server, originally known as Membase, is a source-available, distributed ( shared-nothing architecture) multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serv ...
store data natively in JSON format, leveraging the pros of semi-structured data architecture.


Pros and cons


Advantages

* Programmers persisting objects from their application to a database do not need to worry about object-relational impedance mismatch, but can often serialize objects via a light-weight library. * Support for nested or hierarchical data often simplifies data models representing complex relationships between entities. * Support for lists of objects simplifies data models by avoiding messy translations of lists into a relational data model.


Disadvantages

* The traditional relational data model has a popular and ready-made query language,
SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...
. * Prone to "garbage in, garbage out"; by removing restraints from the data model, there is less forethought that is necessary to operate a data application.


Semi-structured model

The semi-structured model is a
database model A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relatio ...
where there is no separation between the
data Data ( , ) are a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted for ...
and the
schema Schema may refer to: Science and technology * SCHEMA (bioinformatics), an algorithm used in protein engineering * Schema (genetic algorithms), a set of programs or bit strings that have some genotypic similarity * Schema.org, a web markup vocab ...
, and the amount of structure used depends on the purpose. The advantages of this model are the following: * It can represent the information of some data sources that cannot be constrained by schema. * It provides a flexible format for data exchange between different types of databases. * It can be helpful to view structured data as semi-structured (for browsing purposes). * The schema can easily be changed. * The data transfer format may be portable. The primary trade-off being made in using a semi-structured
database model A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relatio ...
is that queries cannot be made as efficiently as in a more constrained structure, such as in the
relational model The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data are represented in terms of t ...
. Typically the records in a semi-structured database are stored with unique IDs that are referenced with pointers to their location on disk. This makes navigational or path-based queries quite efficient, but for doing searches over many records (as is typical in
SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...
), it is not as efficient because it has to seek around the disk following pointers. The Object Exchange Model (OEM) is one standard to express semi-structured data, another way is
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
.


See also

*
NoSQL NoSQL (originally meaning "Not only SQL" or "non-relational") refers to a type of database design that stores and retrieves data differently from the traditional table-based structure of relational databases. Unlike relational databases, which ...
*
Unstructured data Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically plain text, text-heavy, but may contain data such ...
*
Structured data A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be ...


References


External links


UPenn Database Group
{snd semi-structured data and XML
Semi-Structured data analytics: Relational or Hadoop platform?
by IBM Data modeling