Statistical Database

A statistical database is a database used for statistical analysis purposes. It is an OLAP (online analytical processing) system rather than an OLTP (online transaction processing) system. Modern decision-support and classical statistical databases are often closer to the relational model than to the multidimensional model commonly used in OLAP systems today.

Statistical databases typically contain parameter data and the measured data for these parameters. For example, parameter data consists of the different values for varying conditions in an experiment (e.g., temperature, time). The measured data (or variables) are the measurements taken in the experiment under these varying conditions.

Many statistical databases are sparse, with many null or zero values. It is not uncommon for a statistical database to be 40% to 50% sparse. There are two options for dealing with the sparseness: (1) leave the null values in place and use compression techniques to squeeze them out, or (2) remove the entries that contain only null values.

Statistical databases often incorporate support for advanced statistical analysis techniques, such as correlations, which go beyond SQL. They also pose unique security concerns, which were the focus of much research, particularly in the late 1970s and early to mid-1980s.
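
The two options for handling sparseness described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table, its column names, and the helper functions are all hypothetical, and a general-purpose compressor (zlib) stands in for whatever compression technique a real statistical database would use.

```python
import json
import zlib

# Hypothetical experiment table: parameter data (temperature, time) plus
# measured variables (pressure, yield). None marks a missing measurement.
ROWS = [
    {"temperature": 20, "time": 0, "pressure": 1.01, "yield": None},
    {"temperature": 20, "time": 10, "pressure": None, "yield": None},
    {"temperature": 30, "time": 0, "pressure": 1.05, "yield": 0.42},
    {"temperature": 30, "time": 10, "pressure": None, "yield": None},
]
MEASURES = ("pressure", "yield")


def sparsity(rows):
    """Return the fraction of measured cells that are null."""
    cells = [row[m] for row in rows for m in MEASURES]
    return sum(cell is None for cell in cells) / len(cells)


def compress_with_nulls(rows):
    """Option 1: keep the nulls and let a compressor squeeze them out."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def decompress(blob):
    """Inverse of compress_with_nulls; the nulls come back as None."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def drop_all_null(rows):
    """Option 2: remove entries whose measured values are all null."""
    return [row for row in rows if any(row[m] is not None for m in MEASURES)]


if __name__ == "__main__":
    print("sparsity:", sparsity(ROWS))
    print("rows kept by option 2:", len(drop_all_null(ROWS)))
    print("compressed size in bytes:", len(compress_with_nulls(ROWS)))
```

Option 1 preserves the full grid of parameter combinations, so a missing measurement stays distinguishable from a combination that was never recorded. Option 2 is smaller and simpler but loses that distinction.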


Privacy in statistical databases

In a statistical database, it is often desired to allow query access only to aggregate data, not individual records. Securing such a database is a difficult problem, since intelligent users can combine aggregate queries to derive information about a single individual. Some common approaches are:
* allowing only aggregate queries (SUM, COUNT, AVG, STDEV, etc.)
* returning only the partition a sensitive value such as income belongs to (e.g. 35k-40k), rather than the exact value
* returning imprecise counts (e.g. indicating that 130-150 records matched the query, rather than 141)
* disallowing overly selective WHERE clauses
* auditing all users' queries, so that users who use the system inappropriately can be investigated
* using intelligent agents to detect inappropriate system use automatically

For many years, research in this area was stalled, and it was thought in 1980 that, to quote:

"The conclusion is that statistical databases are almost always subject to compromise. Severe restrictions on allowable query set sizes will render the database useless as a source of statistical information but will not secure the confidential records."

In 2006, however, Cynthia Dwork defined the field of differential privacy, building on work that started appearing in 2003. While showing that some semantic security goals, related to the work of Tore Dalenius, were impossible, this work identified new techniques for limiting the increased privacy risk resulting from the inclusion of private data in a statistical database. This makes it possible in many cases to provide very accurate statistics from the database while still ensuring high levels of privacy.
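
The weakness of query restrictions, and the noise-based alternative, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the records are invented, the minimum query-set size of 3 is arbitrary, and the noisy count uses the Laplace mechanism, one standard technique from the differential privacy literature that the text above does not name specifically.

```python
import random

# Hypothetical records (name, age, income); all values are invented.
RECORDS = [
    ("alice", 34, 52_000),
    ("bob", 41, 61_000),
    ("carol", 29, 48_000),
    ("dave", 41, 75_000),
    ("erin", 52, 58_000),
    ("frank", 38, 66_000),
]
MIN_QUERY_SET_SIZE = 3  # assumed restriction on overly selective queries


def restricted_sum(predicate):
    """Return SUM(income), but only if enough records match the predicate."""
    matching = [income for name, age, income in RECORDS if predicate(name, age)]
    if len(matching) < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small")
    return sum(matching)


def difference_attack(target):
    """Derive one person's income from two permitted aggregate queries."""
    everyone = restricted_sum(lambda name, age: True)
    everyone_else = restricted_sum(lambda name, age: name != target)
    return everyone - everyone_else


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1/epsilon.
    """
    true_count = sum(1 for name, age, _ in RECORDS if predicate(name, age))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Both queries pass the size check, yet their difference is one record.
    print("income recovered for dave:", difference_attack("dave"))
    rng = random.Random()
    print("noisy count of age >= 40:",
          noisy_count(lambda name, age: age >= 40, epsilon=1.0, rng=rng))
```

The size restriction blocks a direct query about one person but not the pair of queries whose difference isolates that person, which is the kind of compromise the 1980 quotation describes. With noisy answers, each individual response is slightly inaccurate, while averages over many records remain close to the true values. A smaller `epsilon` means more noise and stronger privacy.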


Further reading


Statistical and Scientific Database Management (SSDBM): an important series of conferences in this field.

Some key papers in this field:
* Dorothy E. Denning, "Secure statistical databases with random sample queries," ACM Transactions on Database Systems (TODS), Volume 5, Issue 3 (September 1980), pp. 291–315.
* Wiebren de Jonge, "Compromising statistical databases responding to queries about means," ACM Transactions on Database Systems, Volume 8, Issue 1 (March 1983), pp. 60–80.
* Dorothy E. Denning, Jan Schlörer, "A fast procedure for finding a tracker in a statistical database," ACM Transactions on Database Systems, Volume 5, Issue 1 (March 1980), pp. 88–102.
* A. Shoshani, "Statistical Databases: Characteristics, Problems, and some Solutions," in Proceedings of the 8th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1982, pp. 208–222.

