Data Engineers

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing and cleaning.


History

Around the 1970s/1980s the term information engineering methodology (IEM) was created to describe database design and the use of software for data analysis and processing. These techniques were intended to be used by database administrators (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin. Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data-processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role by revamping IEM as well as helping to design the IEM software product (user-data), which helped automate IEM.

In the early 2000s, data and data tooling were generally held by the information technology (IT) teams in most companies. Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.

In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer. Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management. This change in approach was particularly focused on cloud computing. Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.


Tools


Compute

High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data. Popular implementations include Apache Spark and the deep-learning-specific TensorFlow. More recent implementations such as Differential/Timely Dataflow have used incremental computing for much more efficient data processing.
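
As a rough illustration of the dataflow idea, the Python sketch below represents a computation as a directed graph in which nodes are operations and edges carry data from one operation to the next. The `run_dataflow` helper and the node names (`raw`, `clean`, `total`, `count`, `mean`) are hypothetical and are not taken from Spark or TensorFlow; only the standard-library `graphlib` module (Python 3.9+) is assumed.

```python
from graphlib import TopologicalSorter


def run_dataflow(operations, edges, inputs):
    """Evaluate a small dataflow graph.

    operations: node name -> function applied to the values of its upstream nodes
    edges:      node name -> list of upstream node names (in argument order)
    inputs:     node name -> value supplied for source nodes
    """
    # TopologicalSorter takes a mapping of node -> predecessors and yields
    # every node only after all of its predecessors.
    graph = {node: edges.get(node, []) for node in operations}
    values = dict(inputs)
    for node in TopologicalSorter(graph).static_order():
        if node in values:
            continue  # source node: its value was supplied directly
        upstream = [values[name] for name in edges.get(node, [])]
        values[node] = operations[node](*upstream)
    return values


# Hypothetical graph: raw -> clean -> (total, count) -> mean
operations = {
    "clean": lambda rows: [r for r in rows if r is not None],
    "total": lambda rows: sum(rows),
    "count": lambda rows: len(rows),
    "mean": lambda total, count: total / count,
}
edges = {
    "clean": ["raw"],
    "total": ["clean"],
    "count": ["clean"],
    "mean": ["total", "count"],
}

result = run_dataflow(operations, edges, {"raw": [4, None, 8, 6]})
print(result["mean"])  # 6.0
```

Because the graph makes every dependency explicit, a real engine can run independent nodes (here `total` and `count`) in parallel or on different machines; this sketch simply evaluates them one after another.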


Storage

Data are stored in a variety of ways; one of the key deciding factors is how the data will be used.


Databases

If the data are structured and some form of online transaction processing is required, then databases are generally used. Originally, mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular, since they scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch. More recently, NewSQL databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.
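
To make the ACID guarantee more concrete, the following sketch uses Python's built-in `sqlite3` module to show atomicity: a transfer that would violate a constraint is rolled back as a whole, so the partial update never becomes visible. The `accounts` table, the account names, and the `transfer` helper are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back together with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)
rejected = transfer(conn, "alice", "bob", 500)
after_rejected = balances(conn)
print(ok, after_ok)
print(rejected, after_rejected)
```

The second transfer fails after the credit to `bob` has already been written inside the transaction, yet the balances read afterwards are unchanged, which is the behaviour that systems giving up ACID guarantees no longer promise.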


Data Warehouses

If the data are structured and online analytical processing is required (but not online transaction processing), then data warehouses are a main choice. They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow, and indeed data often flow from databases into data warehouses. Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software.
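
The kind of SQL access described above is typically an aggregate query over many rows rather than a lookup of a single record. The sketch below shows such a query; it uses an in-memory SQLite database purely as a stand-in for a warehouse, and the `sales` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2023-01", 120.0),
        ("north", "2023-02", 80.0),
        ("south", "2023-01", 200.0),
        ("south", "2023-02", 50.0),
    ],
)

# Analytical query: summarise all rows along one dimension (region).
rows = conn.execute(
    "SELECT region, SUM(amount) AS total, AVG(amount) AS average "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()

for region, total, average in rows:
    print(region, total, average)
```

A real warehouse would run the same style of query over far larger tables, usually with storage laid out to make such scans and aggregations efficient.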


Files

If the data are less structured, then often they are just stored as files. There are several options:
* File systems represent data hierarchically in nested folders.
* Block storage splits data into regularly sized chunks; this often matches up with (virtual) hard drives or solid state drives.
* Object storage manages data using metadata; often each file is assigned a key such as a UUID.
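
The object storage model in the last item can be sketched in a few lines of Python. This is a toy in-memory version, not the API of any real object store; the `ObjectStore` class and the metadata fields (`content_type`, `owner`) are assumptions for illustration.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store: flat key space, metadata kept with each object."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        # Each object gets a generated UUID as its key instead of a folder path.
        key = str(uuid.uuid4())
        self._objects[key] = (data, dict(metadata))
        return key

    def get(self, key: str):
        """Return the (data, metadata) pair stored under `key`."""
        return self._objects[key]

    def find(self, **criteria):
        """Return the keys of all objects whose metadata matches every criterion."""
        return [
            key
            for key, (_, metadata) in self._objects.items()
            if all(metadata.get(name) == value for name, value in criteria.items())
        ]


store = ObjectStore()
report_key = store.put(b"id,total\n1,42\n", content_type="text/csv", owner="analytics")
image_key = store.put(b"\x89PNG", content_type="image/png", owner="marketing")

data, metadata = store.get(report_key)
print(metadata["content_type"])
print(store.find(owner="analytics") == [report_key])
```

Unlike a file system, there is no hierarchy to navigate: objects are located either directly by key or by querying their metadata.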


Management

The number of different data processes and storage locations can quickly become overwhelming. This motivates the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored. The tasks are often specified as a directed acyclic graph (DAG).
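
The sketch below illustrates, under simplifying assumptions, what specifying and monitoring tasks as a DAG involves: tasks run in dependency order, each receives a status, and tasks downstream of a failure are skipped. It does not use Airflow's API; `run_workflow` and the task names are hypothetical, and every name listed as a dependency is assumed to be defined in `tasks`.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run tasks in dependency order and return a status for each one.

    tasks:        task name -> zero-argument callable
    dependencies: task name -> set of task names that must succeed first
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    status = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        if any(status.get(dep) != "success" for dep in graph[name]):
            status[name] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def failing_validation():
    raise RuntimeError("source unavailable")


log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "validate": failing_validation,
    "load": lambda: log.append("load"),
    "report": lambda: log.append("report"),
}
dependencies = {
    "validate": {"extract"},
    "load": {"validate"},
    "report": {"extract"},
}

status = run_workflow(tasks, dependencies)
print(status)
```

Here `load` is skipped because `validate` failed, while `report` still runs because it depends only on `extract`. A production workflow manager adds scheduling, retries, and persistent state on top of this basic idea.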


Lifecycle


Business planning

Business objectives that executives set for the future are described in strategic business plans, with more detailed definition in tactical business plans and implementation in operational business plans. Most businesses today recognize the fundamental need to develop a business plan that follows this structure. It is often difficult to implement these plans because of the lack of transparency at the tactical and operational levels of organizations. This kind of planning requires feedback to allow for early correction of problems that are due to miscommunication and misinterpretation of the business plan.


Systems Design

The design of data systems involves several components, such as architecting data platforms and designing data stores.


Data modelling

This is the process of producing a data model, an abstract model to describe the data and relationships between different parts of the data.
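
One way to picture a data model is as a set of typed entities plus the relationships between them. The following sketch expresses a small model with Python dataclasses; the `Customer`, `Order`, and `OrderLine` entities and their fields are invented for illustration and do not come from any particular methodology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderLine:
    product: str
    quantity: int
    unit_price: float


@dataclass
class Order:
    order_id: int
    # Relationship: one order contains many order lines.
    lines: List[OrderLine] = field(default_factory=list)

    def total(self) -> float:
        return sum(line.quantity * line.unit_price for line in self.lines)


@dataclass
class Customer:
    customer_id: int
    name: str
    # Relationship: one customer places many orders.
    orders: List[Order] = field(default_factory=list)


order = Order(order_id=1, lines=[OrderLine("widget", 2, 3.5), OrderLine("gadget", 1, 10.0)])
customer = Customer(customer_id=7, name="Example Ltd", orders=[order])
print(customer.orders[0].total())  # 17.0
```

The same entities and one-to-many relationships could equally be written down as tables with foreign keys; the model is the abstract description, independent of how it is stored.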


Roles


Data Engineer

A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights. They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust. They will be more familiar with databases, architecture, cloud computing, and Agile software development.
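
A very small ETL (extract, transform, load) pipeline might look like the sketch below. It is an illustrative assumption rather than a production design: the CSV input, the cleaning rules (dropping rows without a signup date, removing duplicates, normalising country codes), and the `users` table are all made up, and only the Python standard library is used.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: a missing date, inconsistent casing, and a duplicate row.
RAW = (
    "user_id,signup_date,country\n"
    "1,2023-01-05,de\n"
    "2,,FR\n"
    "3,2023-02-11, us\n"
    "3,2023-02-11, us\n"
)


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, remove duplicates, normalise country codes."""
    seen = set()
    records = []
    for row in rows:
        if not row["signup_date"]:
            continue  # incomplete record
        user_id = int(row["user_id"])
        if user_id in seen:
            continue  # duplicate record
        seen.add(user_id)
        records.append((user_id, row["signup_date"], row["country"].strip().upper()))
    return records


def load(conn, records):
    """Load: write the cleaned records into the target table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
loaded = conn.execute("SELECT * FROM users ORDER BY user_id").fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one testable on its own, which is one reason the production concerns listed above (formats, resilience, scaling) are usually addressed stage by stage.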


Data Scientist

Data scientists are more focused on the analysis of the data; they will be more familiar with mathematics, algorithms, statistics, and machine learning.


See also

* Big data
* Information technology
* Software engineering
* Computer science


Further reading

* John Hares (1992). "Information engineering for the Advanced Practitioner". Wiley.
* Clive Finkelstein (1989). "An Introduction to Information engineering: From Strategic Planning to Information Systems". Sydney: Addison-Wesley.
* Clive Finkelstein (1992). "Information Engineering: Strategic Systems Development". Sydney: Addison-Wesley.
* Ian Macdonald (1986). "Information engineering". In: "Information Systems Design Methodologies". T.W. Olle et al. (ed.). North-Holland.
* Ian Macdonald (1988). "Automating the Information engineering methodology with the Information engineering Facility". In: "Computerized Assistance during the Information Systems Life Cycle". T.W. Olle et al. (ed.). North-Holland.
* James Martin and Clive Finkelstein (1981). "Information engineering". Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
* James Martin (1989). "Information engineering" (3 volumes). Prentice-Hall Inc.
* Clive Finkelstein (2006). "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". First Edition, Artech House, Norwood MA, in hardcover.
* Clive Finkelstein (2011). "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". Second Edition in PDF at www.ies.aust.com and as an ibook on the Apple iPad and ebook on the Amazon Kindle.
* Reis, Joe; Housley, Matt (2022). "Fundamentals of Data Engineering". O'Reilly Media, Inc. ISBN 9781098108304


External links


The Complex Method IEM



Enterprise Engineering and Rapid Delivery of Enterprise Architecture