Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.
In semi-structured data, the entities belonging to the same class may have different
attributes even though they are grouped together, and the attributes' order is not important.
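As a minimal sketch of this property, the Python snippet below represents entities of the same class as dictionaries. The `people` records and their field names are hypothetical, chosen only for illustration.

```python
# Two hypothetical "person" records grouped in one collection.
# They belong to the same class but carry different attributes.
people = [
    {"name": "Ada", "email": "ada@example.org"},
    {"name": "Grace", "phone": "555-0100", "languages": ["COBOL", "FORTRAN"]},
]

# Attribute order carries no meaning: these two records compare equal.
a = {"name": "Ada", "email": "ada@example.org"}
b = {"email": "ada@example.org", "name": "Ada"}
same_record = a == b

# The set of attributes seen across the collection is discovered from the
# data itself rather than declared up front in a table definition.
all_attributes = sorted({key for person in people for key in person})
```

Each record names its own fields, which is why such data is called self-describing: no external table definition is needed to interpret it.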
Semi-structured data has become increasingly common since the advent of the Internet, where full-text documents and databases are no longer the only forms of data and different applications need a medium for exchanging information. Semi-structured data is also often found in object-oriented databases.
Types
XML
XML, other markup languages, email, and EDI are all forms of semi-structured data.
OEM (Object Exchange Model), as used in Stanford University's Lore DBMS, was created prior to XML as a means of self-describing a data structure. XML has been popularized by web services that are developed utilizing SOAP principles.
Some types of data described here as "semi-structured", especially XML, suffer from the impression that they are incapable of structural rigor at the same functional level as relational tables and rows. Indeed, the view of XML as inherently semi-structured (previously, it was referred to as "unstructured") has handicapped its use for a widening range of data-centric applications. Even documents, normally thought of as the epitome of semi-structure, can be designed with virtually the same rigor as a database schema, enforced by an XML schema, and processed by both commercial and custom software programs without reducing their usability by human readers.
In view of this fact, XML might be referred to as having "flexible structure" capable of human-centric flow and hierarchy as well as highly rigorous element structure and data typing.
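The following sketch illustrates how a single XML document can combine free-flowing text with rigorously typed fields. It uses Python's standard `xml.etree.ElementTree` module; the `report` document and its element names are hypothetical, and the type conversion is done by hand here, whereas a schema-aware validator would normally enforce the types.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: free-flowing prose mixed with typed data fields.
DOC = """
<report>
  <summary>Sales rose in <region>EMEA</region> during the quarter.</summary>
  <figures>
    <revenue currency="EUR">1250000.50</revenue>
    <units>4200</units>
  </figures>
</report>
"""

root = ET.fromstring(DOC)

# Human-centric part: recover the running text, ignoring inline markup.
summary_text = "".join(root.find("summary").itertext())

# Data-centric part: convert element content to typed values by hand.
revenue = float(root.find("figures/revenue").text)
units = int(root.find("figures/units").text)
currency = root.find("figures/revenue").get("currency")
```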
The concept of XML as "human-readable", however, can only be taken so far. Some implementations or dialects of XML, such as the XML representation of the contents of a Microsoft Word document, as implemented in Office 2007 and later versions, utilize dozens or even hundreds of different kinds of tags that reflect a particular problem domain (in Word's case, formatting at the character, paragraph, and document level, definitions of styles, inclusion of citations, etc.) and that are nested within each other in complex ways. Understanding even a portion of such an XML document by reading it, let alone catching errors in its structure, is impossible without a very deep prior understanding of the specific XML implementation, along with assistance from software that understands the XML schema that has been employed. Such text is not "human-understandable" any more than a book written in Swahili (which uses the Latin alphabet) would be to an American or Western European who does not know a word of that language: the tags are symbols that are meaningless to a person unfamiliar with the domain.
JSON
JSON, or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and a web application, as an alternative to XML. JSON has been popularized by web services developed utilizing REST principles.
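As a brief illustration of attribute–value pairs carried as text, the sketch below round-trips a small object through Python's standard `json` module. The `payload` contents are hypothetical.

```python
import json

# Hypothetical payload a server might send to a web application.
payload = {
    "id": 17,
    "title": "Semi-structured data",
    "tags": ["xml", "json"],
    "author": {"name": "Ada", "verified": True},
}

# Serialize to human-readable text for transmission...
text = json.dumps(payload, indent=2)

# ...and parse it back into native objects on the receiving side.
received = json.loads(text)
```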
Newer databases such as MongoDB and Couchbase store data natively as JSON or JSON-like documents, leveraging the advantages of a semi-structured data architecture.
Pros and cons
Advantages
* Programmers persisting objects from their application to a database do not need to worry about object-relational impedance mismatch, but can often serialize objects via a light-weight library.
* Support for nested or hierarchical data often simplifies data models representing complex relationships between entities.
* Support for lists of objects simplifies data models by avoiding messy translations of lists into a relational data model.
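The advantages listed above can be sketched with Python's standard `dataclasses` and `json` modules. The `Order` and `LineItem` classes are hypothetical; the point is only that a nested object containing a list serializes to one document without being split across tables.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class LineItem:
    sku: str
    quantity: int

@dataclass
class Order:
    order_id: int
    customer: str
    items: List[LineItem] = field(default_factory=list)

order = Order(1, "Ada", [LineItem("A-100", 2), LineItem("B-200", 1)])

# asdict() recurses into nested dataclasses and lists, so the whole object
# graph becomes one document; no join table is needed for the line items.
document = json.dumps(asdict(order))

# Reading it back restores the nesting; rebuilding typed objects is manual here.
data = json.loads(document)
restored = Order(
    order_id=data["order_id"],
    customer=data["customer"],
    items=[LineItem(**item) for item in data["items"]],
)
```

In a relational model the same order would typically be stored as a row in one table plus several rows in a second table linked by a foreign key.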
Disadvantages
* The traditional relational data model has a popular and ready-made query language, SQL; semi-structured formats have no equally ubiquitous counterpart.
* Prone to "garbage in, garbage out": removing constraints from the data model means that less forethought is required to operate a data application, so malformed data is more easily accepted.
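The "garbage in, garbage out" risk can be illustrated with a small sketch. Both documents below are valid JSON, so nothing in the format itself rejects the malformed one; the `REQUIRED` field list and the `problems` helper are hypothetical, written here only to show the kind of check an application would have to supply itself.

```python
import json

# Hypothetical required fields and their expected types.
REQUIRED = {"name": str, "age": int}

def problems(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    found = []
    for key, expected in REQUIRED.items():
        if key not in record:
            found.append(f"missing {key}")
        elif not isinstance(record[key], expected):
            found.append(f"{key} is not {expected.__name__}")
    return found

# Both documents parse successfully; only the explicit check tells them apart.
good = json.loads('{"name": "Ada", "age": 36}')
bad = json.loads('{"name": "Grace", "age": "unknown"}')
```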
See also
* Semi-structured model
* NoSQL
* Unstructured data
* Structured data
External links
* UPenn Database Group: semi-structured data and XML
* Semi-Structured data analytics: Relational or Hadoop platform? by IBM