Data Engineering
   HOME

TheInfoList



OR:

Data engineering refers to the building of systems to enable the collection and usage of
data In the pursuit of knowledge, data (; ) is a collection of discrete Value_(semiotics), values that convey information, describing quantity, qualitative property, quality, fact, statistics, other basic units of meaning, or simply sequences of sy ...
. This data is usually used to enable subsequent
analysis Analysis ( : analyses) is the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it. The technique has been applied in the study of mathematics and logic since before Aristotle (3 ...
and data science; which often involves
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. Making the data usable usually involves substantial
compute Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
and storage, as well as data processing and
cleaning Cleaning is the process of removing unwanted substances, such as dirt, infectious agents, and other impurities, from an object or environment. Cleaning is often performed for aesthetic, hygienic, functional, environmental, or safety purposes ...
.


History

Around the 1970s/1980s the term information engineering methodology (IEM) was created to describe database design and the use of
software Software is a set of computer programs and associated software documentation, documentation and data (computing), data. This is in contrast to Computer hardware, hardware, from which the system is built and which actually performs the work. ...
for data analysis and processing. These techniques were intended to be used by
database administrator Database administrators (DBAs) use specialized software to store and organize data. The role may include capacity planning, installation, configuration, database design, migration, performance monitoring, security, troubleshooting, as well as ba ...
s (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian
Clive Finkelstein Clive Finkelstein (born ca. 1939 died 9/12/2021) is an Australian computer scientist, known as the "Father" of information engineering methodology. Life and work In 1961 Finkelstein received his Bachelor of Science from the University of New ...
, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin. Over the next few years, Finkelstein continued work in a more business driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role by revamping IEM as well as helping to design the IEM software product (user-data), which helped automate IEM. In the early 2000s, the data and data tooling was generally held by the
information technology Information technology (IT) is the use of computers to create, process, store, retrieve, and exchange all kinds of Data (computing), data . and information. IT forms part of information and communications technology (ICT). An information te ...
(IT) teams in most companies. Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business. In the early 2010s, with the rise of the
internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...
, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like
Facebook Facebook is an online social media and social networking service owned by American company Meta Platforms. Founded in 2004 by Mark Zuckerberg with fellow Harvard College students and roommates Eduardo Saverin, Andrew McCollum, Dustin Mosk ...
and
Airbnb Airbnb, Inc. ( ), based in San Francisco, California, operates an online marketplace focused on short-term homestays and experiences. The company acts as a broker and charges a commission from each booking. The company was founded in 2008 b ...
started using the phrase data engineer. Due to the new scale of the data, major firms like
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
, Facebook,
Amazon Amazon most often refers to: * Amazons, a tribe of female warriors in Greek mythology * Amazon rainforest, a rainforest covering most of the Amazon basin * Amazon River, in South America * Amazon (company), an American multinational technolog ...
,
Apple An apple is an edible fruit produced by an apple tree (''Malus domestica''). Apple trees are cultivated worldwide and are the most widely grown species in the genus ''Malus''. The tree originated in Central Asia, where its wild ancestor, ' ...
,
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washin ...
, and
Netflix Netflix, Inc. is an American subscription video on-demand over-the-top streaming service and production company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it offers a fi ...
started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of
software engineering Software engineering is a systematic engineering approach to software development. A software engineer is a person who applies the principles of software engineering to design, develop, maintain, test, and evaluate computer software. The term '' ...
focused on data, and in particular infrastructure,
warehousing A warehouse is a building for storing goods. Warehouses are used by manufacturers, importers, exporters, wholesalers, transport businesses, customs, etc. They are usually large plain buildings in industrial parks on the outskirts of cities, tow ...
,
data protection Information privacy is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, contextual information norms, and the legal and political issues surrounding them. It is also known as data pr ...
,
cybersecurity Computer security, cybersecurity (cyber security), or information technology security (IT security) is the protection of computer systems and networks from attack by malicious actors that may result in unauthorized information disclosure, t ...
,
mining Mining is the extraction of valuable minerals or other geological materials from the Earth, usually from an ore body, lode, vein, seam, reef, or placer deposit. The exploitation of these deposits for raw material is based on the economic ...
, modelling,
processing Processing is a free graphical library and integrated development environment (IDE) built for the electronic arts, new media art, and visual design communities with the purpose of teaching non-programmers the fundamentals of computer programming ...
, and metadata management. This change in approach was particularly focused on
cloud computing Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mu ...
. Data started to be handled and used by many parts of the business, such as
sales Sales are activities related to selling or the number of goods sold in a given targeted time period. The delivery of a service for a cost is also considered a sale. The seller, or the provider of the goods or services, completes a sale in ...
and
marketing Marketing is the process of exploring, creating, and delivering value to meet the needs of a target market in terms of goods and services; potentially including selection of a target audience; selection of certain attributes or themes to emph ...
, and not just IT.


Tools


Compute

High performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is
dataflow programming In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture. Dataflow programming languages share s ...
, in which the computation is represented as a
directed graph In mathematics, and more specifically in graph theory, a directed graph (or digraph) is a graph that is made up of a set of vertices connected by directed edges, often called arcs. Definition In formal terms, a directed graph is an ordered pa ...
(dataflow graph); nodes are the operations, and edges represent the flow of data. Popular implementations include Apache Spark, and the deep learning specific
TensorFlow TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. "It is machine learnin ...
. More recent implementations such as Differential/ Timely Dataflow have used incremental computing for much more efficient data processing.


Storage

Data are stored in a variety of ways, one of the key deciding factors is in how the data will be used.


Databases

If the data are structured and some form of
online transaction processing In online transaction processing (OLTP), information systems typically facilitate and manage transaction-oriented applications. This is contrasted with online analytical processing. The term "transaction" can have two different meanings, both of w ...
is required, then
databases In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...
are generally used. Originally mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular since they horizontally scaled more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch. More recently,
NewSQL NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system. Man ...
databases — which attempt to allow horizontal scaling while retaining ACID guarantees — have become popular.


Data Warehouses

If the data are structured and
online analytical processing Online analytical processing, or OLAP (), is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, repor ...
is required (but not online transaction processing), then
data warehouse In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integra ...
s are a main choice. They enable data analysis, mining, and
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech r ...
on a much larger scale than databases can allow, and indeed data often flow from databases into data warehouses.
Business analyst A business analyst (BA) is a person who processes, interprets and documents business processes, products, services and software through analysis of data. The role of a business analyst is to ensure business efficiency increases through their kno ...
s, data engineers, and data scientists can access data warehouses using tools such as SQL or
business intelligence Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical p ...
software.


Files

If the data are less structured, then often they are just stored as files. There are several options: * File systems represent data hierarchially in nested folders. *
Block storage In computing (specifically data transmission and data storage), a block, sometimes called a physical record, is a sequence of bytes or bits, usually containing some whole number of records, having a maximum length; a ''block size''. Data thu ...
splits data into regularly sized chunks; this often matches up with (virtual)
hard drives A hard disk drive (HDD), hard disk, hard drive, or fixed disk is an electro-mechanical data storage device that stores and retrieves digital data using magnetic storage with one or more rigid rapidly rotating platters coated with magne ...
or
solid state drives A solid-state drive (SSD) is a solid-state storage device that uses integrated circuit assemblies to store data persistently, typically using flash memory, and functioning as secondary storage in the hierarchy of computer storage. It is ...
. *
Object storage Object storage (also known as object-based storage) is a computer data storage that manages data as objects, as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data a ...
manages data using metadata; often each file is assigned a key such as a
UUID A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, un ...
.


Management

The number of different data processes and storage locations can quickly become overwhelming. This motivates the usage of a
workflow management system A workflow management system (WfMS or WFMS) provides an infrastructure for the set-up, performance and monitoring of a defined sequence of tasks, arranged as a workflow application. International standards There are several international standards ...
(e.g.
Airflow Airflow, or air flow, is the movement of air. The primary cause of airflow is the existence of air. Air behaves in a fluid manner, meaning particles naturally flow from areas of higher pressure to those where the pressure is lower. Atmospheric ...
) to allow the data tasks to be specified, created, and monitored. The tasks are often specified as a directed acyclic graph (DAG).


Lifecycle


Business planning

Business objectives that executives set for what's to come are characterized in key business plans, with their more noteworthy definition in tactical business plans and implementation in operational business plans. Most businesses today recognize the fundamental need to grow a business plan that follows this strategy. It is often difficult to implement these plans because of the lack of transparency at the tactical and operational degrees of organizations. This kind of planning requires feedback to allow for early correction of problems that are due to miscommunication and misinterpretation of their business plan.


Systems Design

The design of data systems involves several components such as architecting data platforms, and designing data stores.


Data modelling

This is the process of producing a
data model A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be c ...
, an
abstract model A conceptual model is a representation of a system. It consists of concepts used to help people know, understand, or simulate a subject the model represents. In contrast, physical models are physical object such as a toy model that may be assemb ...
to describe the data and relationships between different parts of the data.


Roles


Data Engineer

A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into
insights Insight is the understanding of a specific cause and effect within a particular context. The term insight can have several related meanings: *a piece of information *the act or result of understanding the inner nature of things or of seeing intui ...
. They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
, Scala, and
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO( ...
. They will be more familiar with databases, architecture, cloud computing, and Agile software development.


Data Scientist

Data scientists are more focused on the analysis of the data, they will be more familiar with mathematics,
algorithms In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
, statistics, and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
.


See also

* Big data *
Information technology Information technology (IT) is the use of computers to create, process, store, retrieve, and exchange all kinds of Data (computing), data . and information. IT forms part of information and communications technology (ICT). An information te ...
*
Software engineering Software engineering is a systematic engineering approach to software development. A software engineer is a person who applies the principles of software engineering to design, develop, maintain, test, and evaluate computer software. The term '' ...
*
Computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...


References


Further reading

* John Hares (1992). "Information engineering for the Advanced Practitioner", Wiley. * Clive Finkelstein (1989). ''An Introduction to Information engineering : From Strategic Planning to Information Systems''. Sydney: Addison-Wesley. * Clive Finkelstein (1992). "Information Engineering: Strategic Systems Development". Sydney: Addison-Wesley. * Ian Macdonald (1986). "Information engineering". in: ''Information Systems Design Methodologies''. T.W. Olle et al. (ed.). North-Holland. * Ian Macdonald (1988). "Automating the Information engineering methodology with the Information engineering Facility". In: ''Computerized Assistance during the Information Systems Life Cycle''.
T.W. Olle T. William (Bill) Olle (born 1933 and died March 2019) was a British computer scientist and consultant and President of T. William Olle Associates, England. Biography Bill Olle was educated at Boston Grammar School (1943-1950). He received an M. ...
et al. (ed.). North-Holland. * James Martin and
Clive Finkelstein Clive Finkelstein (born ca. 1939 died 9/12/2021) is an Australian computer scientist, known as the "Father" of information engineering methodology. Life and work In 1961 Finkelstein received his Bachelor of Science from the University of New ...
. (1981). ''Information engineering''. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK. * James Martin (1989). ''Information engineering''. (3 volumes), Prentice-Hall Inc. * Clive Finkelstein (2006) "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". First Edition, Artech House, Norwood MA in hardcover. * Clive Finkelstein (2011) "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". Second Edition in PDF at www.ies.aust.com and as an ibook on the Apple iPad and ebook on the Amazon Kindle. * Reis, Joe; Housley, Matt (2022) "Fundamentals of Data Engineering". O'Reilly Media, Inc. ISBN 9781098108304


External links


The Complex Method IEM



Enterprise Engineering and Rapid Delivery of Enterprise Architecture
{{Authority control Software development process Information systems