Unstructured Information

Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult for traditional programs to interpret, compared with data stored in fielded form in databases or annotated (semantically tagged) in documents.

In 1998, Merrill Lynch said "unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%." It is unclear what the source of this number is, but it is nonetheless accepted by some. Other sources have reported similar or higher percentages of unstructured data. IDC and Dell EMC projected that data would grow to 40 zettabytes by 2020, a 50-fold growth from the beginning of 2010. More recently, IDC and Seagate predicted that the global datasphere will grow to 163 zettabytes by 2025 and that the majority of that will be unstructured. Computerworld magazine states that unstructured information might account for more than 70–80% of all data in organizations.
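The contrast between text-heavy information and fielded data can be illustrated with a minimal sketch. It assumes a hypothetical free-text note (the wording and values are invented) and pulls ISO-style dates and plain numbers out of it with regular expressions. The names `NOTE` and `extract_fields` are illustrative only, and real text would need far more robust handling of formats and ambiguity.

```python
import re
from datetime import date

# Hypothetical free-text note; wording and values are invented for illustration.
NOTE = ("Met the supplier on 2021-03-15. They quoted 1,250 units "
        "at 4.75 each, delivery by 2021-04-01.")

# Assumption: dates appear only in YYYY-MM-DD form.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# Assumption: numbers use ',' as a thousands separator and '.' as a decimal point.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_fields(text):
    """Return the dates and numbers found in free text as a small record."""
    dates = [date(int(y), int(m), int(d)) for y, m, d in ISO_DATE.findall(text)]
    # Blank out the dates first so their digits are not counted as numbers.
    remainder = ISO_DATE.sub(" ", text)
    numbers = [float(n.replace(",", "")) for n in NUMBER.findall(remainder)]
    return {"dates": dates, "numbers": numbers}


if __name__ == "__main__":
    print(extract_fields(NOTE))
```

The record says nothing about what each value means (which number is a price, which date is a deadline). That missing context is the kind of ambiguity fielded data avoids.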


Background

The earliest research into business intelligence focused on unstructured textual data rather than numerical data. As early as 1958, computer science researchers like H.P. Luhn were particularly concerned with the extraction and classification of unstructured text. However, only since the turn of the century has the technology caught up with the research interest. In 2004, the SAS Institute developed the SAS Text Miner, which uses singular value decomposition (SVD) to reduce a high-dimensional textual space to fewer dimensions for significantly more efficient machine analysis. The mathematical and technological advances sparked by machine textual analysis prompted a number of businesses to research applications, leading to the development of fields like sentiment analysis, voice of the customer mining, and call center optimization. The emergence of big data in the late 2000s led to a heightened interest in the applications of unstructured data analytics in contemporary fields such as predictive analytics and root cause analysis.


Issues with terminology

The term is imprecise for several reasons:

1. Structure, while not formally defined, can still be implied.
2. Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
3. Unstructured information might have some structure (semi-structured) or even be highly structured, but in ways that are unanticipated or unannounced.


Dealing with unstructured data

Techniques such as data mining, natural language processing (NLP), and text analytics provide different methods to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. The Unstructured Information Management Architecture (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.

Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structure that exists in all forms of human communication. Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.

Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, web page, or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. files or documents) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data". For example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page.
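The point about HTML mark-up can be sketched with Python's standard `html.parser` module. The example assumes a hypothetical page fragment (`PAGE`) and a helper named `split_markup`, both invented here. It separates the tag names from the visible text to show that the tags describe layout while the content arrives as plain, unlabeled text.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collects the tag names seen and the visible text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical page fragment, invented for illustration.
PAGE = ("<html><body><h1>Invoice</h1>"
        "<p>Total due: <b>120 EUR</b> by 1 May.</p></body></html>")


def split_markup(html):
    """Return (list of tag names, visible text) for an HTML string."""
    parser = TextAndTags()
    parser.feed(html)
    parser.close()
    return parser.tags, " ".join(parser.chunks)


if __name__ == "__main__":
    tags, text = split_markup(PAGE)
    print(tags)
    print(text)
```

The tags (`h1`, `p`, `b`) say how to display the content, but nothing marks "120 EUR" as an amount or "1 May" as a due date. Recovering that meaning still requires text analysis.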
XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.

Since unstructured data commonly occurs in electronic documents, the use of a content or document management system which can categorize entire documents is often preferred over data transfer and manipulation from within the documents. Document management thus provides the means to convey structure onto document collections.
Search engines have become popular tools for indexing and searching through such data, especially text.
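As a rough illustration of how text can be indexed for search, the following sketch builds a small inverted index (a mapping from each term to the documents containing it) and answers queries that require all terms to match. The document IDs, texts, and function names are hypothetical, and production search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical document collection, invented for illustration.
DOCS = {
    "memo-1": "Quarterly sales report for the northern region",
    "mail-7": "Please review the sales forecast before Friday",
    "note-3": "Server maintenance is scheduled for Friday",
}

if __name__ == "__main__":
    idx = build_index(DOCS)
    print(search(idx, "sales"))
    print(search(idx, "sales friday"))
```

The index is itself a structured artifact derived from unstructured text. It supports lookup without assigning any meaning to the terms.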


Approaches in natural language processing

Specific computational workflows have been developed to impose structure upon the unstructured data contained within text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, far more than manual approaches to annotation may permit. Several of these approaches are based upon the concept of online analytical processing, or OLAP, and may be supported by data models such as text cubes. Once document metadata is available through a data model, generating summaries of subsets of documents (i.e., cells within a text cube) may be performed with phrase-based approaches.
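The idea of a text cube can be sketched in a few lines, under strong simplifying assumptions. Each hypothetical document below carries two metadata dimensions (`year`, `topic`) and a list of phrases assumed to have been extracted already. Cells are summarized by raw phrase frequency, a stand-in chosen for brevity that is not the scoring used by any particular published workflow.

```python
from collections import Counter, defaultdict

# Hypothetical documents: two metadata dimensions plus pre-extracted phrases.
DOCS = [
    {"year": 2019, "topic": "cardiology", "phrases": ["heart failure", "beta blocker"]},
    {"year": 2019, "topic": "cardiology", "phrases": ["heart failure", "troponin"]},
    {"year": 2020, "topic": "cardiology", "phrases": ["troponin", "stent"]},
    {"year": 2020, "topic": "oncology", "phrases": ["biopsy", "tumor marker"]},
]


def build_text_cube(docs, dimensions):
    """Group documents into cells keyed by their values on the given
    dimensions, and count phrase occurrences within each cell."""
    cube = defaultdict(Counter)
    for doc in docs:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].update(doc["phrases"])
    return cube


def summarize(cube, cell, k=2):
    """Return up to k most frequent phrases in a cell (raw counts only)."""
    return [phrase for phrase, _ in cube.get(cell, Counter()).most_common(k)]


if __name__ == "__main__":
    cube = build_text_cube(DOCS, ["year", "topic"])
    print(summarize(cube, (2019, "cardiology"), k=1))
    # Rolling up: dropping the "year" dimension merges cells across years.
    by_topic = build_text_cube(DOCS, ["topic"])
    print(by_topic[("cardiology",)])
```

Choosing fewer dimensions corresponds to an OLAP-style roll-up: the same documents fall into coarser cells, and each cell summary is computed over a larger subset.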


Approaches in medicine and biomedical research

Biomedical research generates one major source of unstructured data, as researchers often publish their findings in scholarly journals. Though the language in these documents is challenging to derive structural elements from (e.g., due to the complicated technical vocabulary contained within and the domain knowledge required to fully contextualize observations), the results of these activities may yield links between technical and medical studies and clues regarding new disease therapies. Recent efforts to enforce structure upon biomedical documents include self-organizing map approaches for identifying topics among documents, general-purpose unsupervised algorithms, and an application of the CaseOLAP workflow to determine associations between protein names and cardiovascular disease topics in the literature. CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.


The use of "unstructured" in data privacy regulations

In Sweden (EU), before 2018, some data privacy regulations did not apply if the data in question was confirmed as "unstructured". The term "unstructured data" has rarely been used in the EU since GDPR came into force in 2018. GDPR neither mentions nor defines "unstructured data". It does use the word "structured" as follows (without defining it):

* Parts of GDPR Recital 15: "The protection of natural persons should apply to the processing of personal data ... if ... contained in a filing system."
* GDPR Article 4: "'filing system' means any structured set of personal data which are accessible according to specific criteria ..."

GDPR case law on what defines a "filing system": "the specific criterion and the specific form in which the set of personal data collected by each of the members who engage in preaching is actually structured is irrelevant, so long as that set of data makes it possible for the data relating to a specific person who has been contacted to be easily retrieved, which is however for the referring court to ascertain in the light of all the circumstances of the case in the main proceedings." (CJEU, Tietosuojavaltuutettu v. Jehovan todistajat, paragraph 61)

If personal data is easily retrieved, then it is a filing system, and it is then in scope for GDPR regardless of being "structured" or "unstructured". Most electronic systems today, subject to access and applied software, can allow for easy retrieval of data.


See also

* Clustering
* Pattern recognition
* List of text mining software
* Semi-structured data
* Structured data


Notes

1. Today's Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn't An Option, Noel Yuhanna, Principal Analyst, Forrester Research, Nov 2010

