software engineering
Software engineering is a branch of both computer science and engineering focused on designing, developing, testing, and maintaining Application software, software applications. It involves applying engineering design process, engineering principl ...
, more specifically in
distributed computing
Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers.
The components of a distributed system commu ...
, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components. To improve observability, software engineers use a wide range of
logging
Logging is the process of cutting, processing, and moving trees to a location for transport. It may include skidder, skidding, on-site processing, and loading of trees or trunk (botany), logs onto logging truck, truckstracing techniques to gather telemetry information, and tools to analyze and use it. Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage.
One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.
Etymology, terminology and definition
The term is borrowed from control theory, where the "
observability
Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
In control theory, the observability and controllability of a linear system are mathematical duals.
The concept of observa ...
" of a system measures how well its state can be determined from its outputs. Similarly, software observability measures how well a system's state can be understood from the obtained telemetry (metrics, logs, traces, profiling).
The definition of observability varies by vendor:
The term is frequently referred to as its
numeronym
A numeronym is a word, usually an abbreviation, composed partially or wholly of numerals. The term can be used to describe several different number-based constructs, but it most commonly refers to a contraction in which all letters between the fir ...
o11y (where 11 stands for the number of letters between the first letter and the last letter of the word). This is similar to other computer science abbreviations such as i18n and l10n and k8s.
Observability vs. monitoring
Observability and monitoring are sometimes used interchangeably. As tooling, commercial offerings and practices evolved in complexity, "monitoring" was re-branded as observability in order to differentiate new tools from the old.
The terms are commonly contrasted in that systems are ''monitored'' using predefined sets of ''telemetry'', and monitored systems may be ''observable''.
Majors et al. suggest that engineering teams that only have monitoring tools end up relying on expert foreknowledge (seniority), whereas teams that have observability tools rely on exploratory analysis (curiosity).
Telemetry types
Observability relies on three main types of telemetry data: metrics, logs and traces. Those are often referred to as "pillars of observability".
Metrics
A metric is a
point in time
In physics and the philosophy of science, instant refers to an infinitesimal interval in time, whose passage is instantaneous. In ordinary speech, an instant has been defined as "a point or very short space of time," a notion deriving from its etym ...
measurement (
scalar
Scalar may refer to:
*Scalar (mathematics), an element of a field, which is used to define a vector space, usually the field of real numbers
*Scalar (physics), a physical quantity that can be described by a single element of a number field such a ...
) that represents some system state. Examples of common metrics include:
* number of HTTP requests per second;
* total number of query failures;
* database size in bytes;
* time in seconds since last
garbage collection
Waste collection is a part of the process of waste management. It is the transfer of solid waste from the point of use and disposal to the point of treatment or landfill. Waste collection also includes the curbside collection of recyclable ...
.
Monitoring tools are typically configured to emit alerts when certain metric values exceed set thresholds. Thresholds are set based on knowledge about normal operating conditions and experience.
Metrics are typically tagged to facilitate grouping and searchability.
Application developers choose what kind of metrics to instrument their software with, before it is released. As a result, when a previously unknown issue is encountered, it is impossible to add new metrics without shipping new code. Furthermore, their
cardinality
The thumb is the first digit of the hand, next to the index finger. When a person is standing in the medical anatomical position (where the palm is facing to the front), the thumb is the outermost digit. The Medical Latin English noun for thum ...
can quickly make the storage size of telemetry data prohibitively expensive. Since metrics are cardinality-limited, they are often used to represent aggregate values (for example: average page load time, or 5-second average of the request rate). Without external context, it is impossible to correlate between events (such as user requests) and distinct metric values.
Logs
Logs, or log lines, are generally free-form, unstructured text blobs that are intended to be human readable. Modern logging is structured to enable machine parsability. As with metrics, an application developer must instrument the application upfront and ship new code if different logging information is required.
Logs typically include a timestamp and severity level. An event (such as a user request) may be fragmented across multiple log lines and interweave with logs from concurrent events.
Traces
Distributed traces
A cloud native application is typically made up of distributed services which together fulfill a single request. A distributed trace is an interrelated series of discrete events (also called spans) that track the progression of a single user request. A trace shows the causal and temporal relationships between the services that interoperate to fulfill a request.
Instrumenting an application with traces means sending span information to a tracing backend. The tracing backend correlates the received spans to generate presentable traces. To be able to follow a request as it traverses multiple services, spans are labeled with
unique identifier
A unique identifier (UID) is an identifier that is guaranteed to be unique among all identifiers used for those objects and for a specific purpose. The concept was formalized early in the development of computer science and information systems. ...
s that enable constructing a parent-child relationship between spans. Span information is typically shared in the HTTP headers of outbound requests.
Continuous profiling
Continuous profiling is another telemetry type used to precisely determine how an application consumes resources.
Instrumentation
To be able to observe an application, telemetry about the application's behavior needs to be collected or exported. Instrumentation means generating telemetry alongside the normal operation of the application. Telemetry is then collected by an independent backend for later analysis.
Instrumentation can be automatic, or custom. Automatic instrumentation offers blanket coverage and immediate value; custom instrumentation brings higher value but requires more intimate involvement with the instrumented application.
Instrumentation can be native - done in-code (modifying the code of the instrumented application) - or out-of-code (e.g. sidecar,
eBPF
eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter (BPF, with the "e" originally meaning "extended") filtering mechanism in Linux and is al ...
).
Verifying new features in production by shipping them together with custom instrumentation is a practice called "observability-driven development".
"Pillars of observability"
Metrics, logs and traces are most commonly listed as the pillars of observability. Majors et al. suggest that the pillars of observability are high cardinality, high-dimensionality, and explorability, arguing that
runbook
In a computer system or network, a runbook is a compilation of routine procedures and operations that the system administrator or operator carries out. System administrators in IT departments and NOCs use runbooks as a reference.
Runbooks ca ...
s and dashboards have little value because "modern systems rarely fail in precisely the same way twice."
Self monitoring
Self monitoring is a practice where observability stacks monitor each other, in order to reduce the risk of inconspicuous outages. Self monitoring may be put in place in addition to high availability and redundancy to further avoid correlated failures.
See also
*
Application performance management
In the fields of information technology and systems management, application performance management (APM) is the monitoring and management of the performance and availability of software applications. APM strives to detect and diagnose complex appli ...
(APM)
*
OpenTelemetry
The Cloud Native Computing Foundation (CNCF) is a subsidiary of the Linux Foundation founded in 2015 to support cloud-native computing.
History
It was announced alongside Kubernetes 1.0, an open source container cluster manager, which was contri ...
(OTel)
*
Real user monitoring
Real user monitoring (RUM) is a passive monitoring technology that records all user interaction with a website or client interacting with a server or cloud-based application. Monitoring actual user interaction with a website or an application is i ...
(RUM)
*
Synthetic monitoring
In software design, web design, and electronic product design, synthetic monitoring (also known as ''active monitoring or proactive monitoring'') is a monitoring technique that is done by using a simulation or scripted recordings of transactions ...
*
DevOps
DevOps is the integration and automation of the software development and information technology operations. DevOps encompasses necessary tasks of software development and can lead to shortening development time and improving the development life ...
Sociotechnical system
Sociotechnical systems (STS) in organizational development is an approach to complex organizational work design that recognizes the interaction between people and technology in workplaces. The term also refers to coherent systems of human relati ...