Change Data Capture

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. CDC is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources. CDC occurs often in data-warehouse environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.


Methodology

System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system layers from application logic down to physical storage. In a simplified CDC context, one computer system has data believed to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that would not change the design pattern logically. Multiple CDC solutions can exist in a single system.


Timestamps on rows

Tables whose changes must be captured may have a column that represents the time of last change. Names such as LAST_UPDATE, LAST_MODIFIED, etc. are common. Any row in any table that has a timestamp in that column that is more recent than the last time data was captured is considered to have changed. Timestamps on rows are also frequently used for optimistic locking so this column is often available.
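
As a minimal sketch of the timestamp approach, the following Python example uses the standard `sqlite3` module with an in-memory database. The `accounts` table, its `last_update` column, and the `capture_changes` helper are illustrative assumptions rather than part of any particular product. Timestamps are stored as ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        (1, 100.0, "2024-01-01T09:00:00"),
        (2, 250.0, "2024-01-02T15:30:00"),
        (3, 75.0, "2024-01-03T08:45:00"),
    ],
)


def capture_changes(conn, last_capture):
    """Return rows whose last_update is later than the previous capture time."""
    cur = conn.execute(
        "SELECT id, balance, last_update FROM accounts "
        "WHERE last_update > ? ORDER BY last_update",
        (last_capture,),
    )
    return cur.fetchall()


changed = capture_changes(conn, "2024-01-02T00:00:00")
# The next capture would start from the newest timestamp seen in this batch.
next_capture = max(row[2] for row in changed)
```

A capture process of this kind has to remember `next_capture` between runs. Note that a plain timestamp column cannot reveal deleted rows, since there is no row left to carry the timestamp.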


Version numbers on rows

Database designers give tables whose changes must be captured a column that contains a version number. Names such as VERSION_NUMBER are common. One technique is to mark each changed row with the current version number, which is maintained for the table, or possibly a group of tables, in a supporting construct such as a reference table. When a change capture occurs, all data with the latest version number is considered to have changed. Once the change capture is complete, the reference table is updated with a new version number.

Do not confuse this technique with row-level versioning used for optimistic locking. For optimistic locking, each row has an independent version number, typically a sequential counter. This allows a process to atomically update a row and increment its counter only if another process has not incremented the counter. But CDC cannot use row-level versions to find all changes unless it knows the original "starting" version of every row, which is impractical to maintain.
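
The table-level version technique can be sketched as follows, again with `sqlite3`. The `cdc_version` reference table and the helper functions are hypothetical names chosen for illustration. The sketch is single-threaded and ignores the locking a real system would need between reading the changed rows and advancing the version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cdc_version (table_name TEXT PRIMARY KEY, current_version INTEGER);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, version_number INTEGER);
    INSERT INTO cdc_version VALUES ('accounts', 1);
    """
)


def current_version(conn):
    """Read the current version for the accounts table from the reference table."""
    return conn.execute(
        "SELECT current_version FROM cdc_version WHERE table_name = 'accounts'"
    ).fetchone()[0]


def write_account(conn, account_id, balance):
    """Insert or update a row, stamping it with the current table version."""
    conn.execute(
        "INSERT OR REPLACE INTO accounts (id, balance, version_number) VALUES (?, ?, ?)",
        (account_id, balance, current_version(conn)),
    )


def capture_changes(conn):
    """Return rows stamped with the current version, then advance the version."""
    version = current_version(conn)
    rows = conn.execute(
        "SELECT id, balance FROM accounts WHERE version_number = ? ORDER BY id",
        (version,),
    ).fetchall()
    conn.execute(
        "UPDATE cdc_version SET current_version = ? WHERE table_name = 'accounts'",
        (version + 1,),
    )
    return rows


write_account(conn, 1, 100.0)
write_account(conn, 2, 250.0)
first_batch = capture_changes(conn)   # both rows carry version 1
write_account(conn, 2, 300.0)         # row 2 is re-stamped with version 2
second_batch = capture_changes(conn)  # only row 2 carries version 2
```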


Status indicators on rows

This technique can either replace or complement timestamps and versioning. It can serve as an alternative if, for example, a status column is set up on a table row to indicate that the row has changed (e.g., a boolean column that, when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or a later date, still should not be updated on the target (for example, because the data requires human validation).


Time/Version/Status on rows

This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system. The combination of time, version, and status provides a particularly powerful mechanism, and programmers should use them as a trio where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for production."
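
The quoted rule translates naturally into a single query. The sketch below assumes a hypothetical `products` table with `version`, `last_update` and `status` columns, and assumes that the status value `READY` means "ready for production"; none of these names come from a specific system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, version TEXT, "
    "last_update TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "2.1", "2005-06-15T10:00:00", "READY"),
        (2, "2.1", "2005-06-20T10:00:00", "PENDING"),  # wrong status
        (3, "2.0", "2005-06-10T10:00:00", "READY"),    # wrong version
        (4, "2.1", "2005-07-02T10:00:00", "READY"),    # outside the time window
    ],
)

# Version, time window and status are all applied in one predicate.
rows = conn.execute(
    "SELECT id FROM products "
    "WHERE version = ? AND last_update >= ? AND last_update < ? AND status = ?",
    ("2.1", "2005-06-01T00:00:00", "2005-07-01T00:00:00", "READY"),
).fetchall()
captured_ids = [r[0] for r in rows]
```

Only row 1 satisfies all three conditions, which illustrates why the three columns are not redundant: each one excludes a row that the other two would have let through.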


Triggers on tables

In this approach, triggers log events that happen to a transactional table into a separate queue table that can later be "played back". For example, given an Accounts table, each transaction against the table fires triggers that store a history of the event, or even the deltas, in the queue table. The queue table might have a schema with the following fields: Id, TableName, RowId, TimeStamp, Operation. The data inserted for the Accounts example might be: 1, Accounts, 76, 11/02/2008 12:15am, Update. More complicated designs might log the actual data that changed. The queue table can then be played back to replicate the data from the source system to a target. The approach may also include a publish/subscribe pattern to communicate the changed data to multiple targets. An example of this technique is the pattern known as the log trigger.
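
The queue-table design described above can be sketched with SQLite triggers driven from Python. The table and trigger names (`change_queue`, `accounts_ins`, and so on) are assumptions for illustration; the queue columns mirror the Id, TableName, RowId, TimeStamp and Operation fields from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE change_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT,
        row_id INTEGER,
        time_stamp TEXT DEFAULT CURRENT_TIMESTAMP,
        operation TEXT
    );
    -- One trigger per operation; each appends an event to the queue table.
    CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
        INSERT INTO change_queue (table_name, row_id, operation)
        VALUES ('Accounts', NEW.id, 'Insert');
    END;
    CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
        INSERT INTO change_queue (table_name, row_id, operation)
        VALUES ('Accounts', NEW.id, 'Update');
    END;
    CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
        INSERT INTO change_queue (table_name, row_id, operation)
        VALUES ('Accounts', OLD.id, 'Delete');
    END;
    """
)

conn.execute("INSERT INTO accounts VALUES (76, 100.0)")
conn.execute("UPDATE accounts SET balance = 150.0 WHERE id = 76")
conn.execute("DELETE FROM accounts WHERE id = 76")

# Reading the queue in id order replays the events in the order they occurred.
queue = conn.execute(
    "SELECT id, table_name, row_id, operation FROM change_queue ORDER BY id"
).fetchall()
```

Unlike a timestamp column, the queue table records deletes, because the trigger fires before the information about the removed row is lost.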


Event programming

Coding a change into an application at appropriate points is another method that can give intelligent discernment that data changed. Although this method involves programming vs. more easily implemented "dumb" triggers, it may provide more accurate and desirable CDC, such as only after a COMMIT, or only after certain columns changed to certain values - just what the target system is looking for.
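
A minimal sketch of this idea follows, assuming a hypothetical `orders` table and a target that only cares about orders whose status becomes `SHIPPED`. The `events` list stands in for whatever delivery mechanism (message queue, API call) a real application would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, note TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'NEW', '')")
conn.commit()

events = []  # stand-in for a message queue or downstream call


def update_order(conn, order_id, status, note):
    """Update an order and emit a change event only for the case the target wants."""
    old_status = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()[0]
    conn.execute(
        "UPDATE orders SET status = ?, note = ? WHERE id = ?",
        (status, note, order_id),
    )
    conn.commit()
    # Emit only after the commit succeeded, and only when status became SHIPPED.
    if old_status != status and status == "SHIPPED":
        events.append({"order_id": order_id, "status": status})


update_order(conn, 1, "NEW", "customer called")      # note-only change: no event
update_order(conn, 1, "SHIPPED", "left warehouse")   # status change: one event
```

The cost of this selectivity is that every code path that modifies the table must go through such a function; writes made directly against the database bypass the capture.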


Log scanners

Most database management systems manage a transaction log that records changes made to the database contents and to metadata. By scanning and interpreting the contents of the database transaction log, one can capture the changes made to the database in a non-intrusive manner.

Using transaction logs for change data capture offers a challenge in that the structure, contents and use of a transaction log are specific to a database management system. Unlike data access, no standard exists for transaction logs. Most database management systems do not document the internal format of their transaction logs, although some provide programmatic interfaces to their transaction logs (for example: Oracle, DB2, SQL/MP, SQL/MX and SQL Server 2008).

Other challenges in using transaction logs for change data capture include:
* Coordinating the reading of the transaction logs and the archiving of log files (database management software typically archives log files off-line on a regular basis).
* Translation between physical storage formats that are recorded in the transaction logs and the logical formats typically expected by database users (e.g., some transaction logs save only minimal buffer differences that are not directly useful for change consumers).
* Dealing with changes to the format of the transaction logs between versions of the database management system.
* Eliminating uncommitted changes that the database wrote to the transaction log and later rolled back.
* Dealing with changes to the metadata of tables in the database.

CDC solutions based on transaction log files have distinct advantages that include:
* minimal impact on the database (even more so if one uses log shipping)
* low latency in acquiring changes
* transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream includes changes made to all tables participating in the captured transaction.
* no need to change the database schema
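
Real transaction log formats are vendor-specific, so the following is only a sketch over a made-up, already-decoded record format: tuples of transaction id, record type and payload. It illustrates two of the points above, namely discarding changes that were rolled back and emitting the remaining transactions in commit order.

```python
from collections import defaultdict

# Hypothetical, simplified log records: (transaction id, record type, payload).
log = [
    (1, "BEGIN", None),
    (2, "BEGIN", None),
    (1, "CHANGE", ("accounts", 76, "Update")),
    (2, "CHANGE", ("accounts", 12, "Insert")),
    (2, "ROLLBACK", None),
    (3, "BEGIN", None),
    (3, "CHANGE", ("orders", 5, "Delete")),
    (3, "COMMIT", None),
    (1, "CHANGE", ("orders", 9, "Insert")),
    (1, "COMMIT", None),
]


def committed_changes(records):
    """Yield (txn_id, changes) in commit order, dropping rolled-back work."""
    pending = defaultdict(list)  # changes buffered per open transaction
    for txn_id, kind, payload in records:
        if kind == "CHANGE":
            pending[txn_id].append(payload)
        elif kind == "COMMIT":
            yield txn_id, pending.pop(txn_id, [])
        elif kind == "ROLLBACK":
            pending.pop(txn_id, None)


stream = list(committed_changes(log))
```

Transaction 3 appears before transaction 1 in `stream` because it committed first, even though transaction 1 began earlier. Transaction 1's changes to both `accounts` and `orders` stay together, and transaction 2 never appears.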


Confounding factors

As often occurs in complex domains, the final solution to a CDC problem may have to balance many competing concerns.


Unsuitable source systems

Change data capture both increases in complexity and reduces in value if the source system saves metadata changes when the data itself is not modified. For example, some data models track the user who last looked at but did not change the data in the same structure as the data. This results in noise in the change data capture.


Tracking the capture

Actually tracking the changes depends on the data source. If the data is being persisted in a modern database, then change data capture is a simple matter of permissions. Two techniques are in common use:
* Tracking changes using database triggers
* Reading the transaction log as, or shortly after, it is written.

If the data is not in a modern database, CDC becomes a programming challenge.


Push versus pull

* Push: the source process creates a snapshot of changes within its own process and delivers rows downstream. The downstream process uses the snapshot, creates its own subset and delivers it to the next process.
* Pull: the target that is immediately downstream from the source prepares a request for data from the source. The downstream target delivers the snapshot to the next target, as in the push model.
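
The difference between the two delivery styles can be sketched with a small in-process example. The `Source` class and its methods are hypothetical, and for brevity the sketch pushes changes one at a time rather than as a snapshot batch.

```python
class Source:
    """Hypothetical change source that supports both delivery styles."""

    def __init__(self):
        self.changes = []      # ordered list of recorded changes
        self.subscribers = []  # callbacks used for push delivery

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def record(self, change):
        self.changes.append(change)
        # Push: the source itself delivers each change downstream.
        for callback in self.subscribers:
            callback(change)

    def changes_since(self, offset):
        # Pull: the target asks for everything after its last known position.
        return self.changes[offset:]


source = Source()
pushed = []
source.subscribe(pushed.append)

source.record("row 1 updated")
source.record("row 2 inserted")

pull_offset = 1  # the pulling target has already seen one change
pulled = source.changes_since(pull_offset)
```

In the push style the source has to know its subscribers, while in the pull style each target has to track its own position (`pull_offset` here).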


Alternatives

Sometimes the slowly changing dimension is used as a method.


See also

* Referential integrity
* Slowly changing dimension
* Oracle Data Integrator