Stata Technical Bulletin

Stata (occasionally stylized as STATA) is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including biomedicine, epidemiology, sociology, and science. Stata was initially developed by Computing Resource Center in California, and the first version was released in 1985. In 1993, the company moved to College Station, Texas, and was renamed Stata Corporation, now known as StataCorp. A major release in 2003 included a new graphics system and dialog boxes for all commands. Since then, a new version has been released once every two years. The current version is Stata 17, released in April 2021.


Technical overview and terminology


User interface

From its creation, Stata has always employed an integrated command-line interface. Starting with version 8.0, Stata has included a graphical user interface based on the Qt framework, which uses menus and dialog boxes to give access to many built-in commands. The dataset can be viewed or edited in spreadsheet format. From version 11 on, other commands can be executed while the data browser or editor is open.


Data structure and storage

Until the release of version 16, Stata could only open a single dataset at any one time. Stata allows flexibility in assigning data types to data. Its compress command automatically reassigns data to data types that take up less memory without loss of information. Stata offers integer storage types that occupy only one or two bytes rather than four, and single precision (4 bytes) rather than double precision (8 bytes) is the default for floating-point numbers. Stata's data is always tabular. Stata refers to the columns of tabular data as variables.
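The effect of compress can be seen with a short do-file. This is a minimal sketch, assuming the auto dataset that ships with Stata; the variable name cylinders is hypothetical and introduced only for illustration.

```stata
sysuse auto, clear                // load the bundled example dataset

generate double cylinders = 4     // hypothetical variable, stored as an 8-byte double
describe cylinders                // storage type is reported as double

compress cylinders                // the value 4 fits in one byte, so nothing is lost
describe cylinders                // storage type is now reported as byte
```

Because every stored value is a small integer, compress can demote the variable from double to byte. A variable holding non-integer values would be left in a floating-point type.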


Data format compatibility

Stata can import data in a variety of formats. These include ASCII data formats (such as CSV or databank formats) and spreadsheet formats (including various Excel formats). Stata's proprietary file formats have changed over time, although not every Stata release includes a new dataset format. Every version of Stata can read all older dataset formats, and can write both the current format and, using the saveold command, the most recent previous dataset format. Thus, the current Stata release can always open datasets that were created with older versions, but older versions cannot read newer-format datasets. Stata can read and write SAS XPORT format datasets natively, using the fdause and fdasave commands. Some other econometric applications, including gretl, can directly import Stata file formats.
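A brief do-file sketch of saving and importing data follows. It assumes a reasonably recent Stata release (import delimited and export delimited are not available in very old versions) and write access to the working directory; the file names are hypothetical.

```stata
sysuse auto, clear                                  // load the bundled example dataset

save "auto_current.dta", replace                    // write the current dataset format
saveold "auto_legacy.dta", replace                  // write an older dataset format

export delimited using "auto_demo.csv", replace     // write a CSV copy of the data
import delimited using "auto_demo.csv", clear       // read the CSV back into memory
describe                                            // inspect the imported variables
```

The file written by saveold is intended for colleagues running an earlier release, while the CSV round trip shows the plain-text import path.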


History


Origins

The development of Stata began in 1984, initially by William (Bill) Gould and later by Sean Becketti. The software was originally intended to compete with statistical programs for personal computers such as SYSTAT and MicroTSP. Stata was written, then as now, in the C programming language, initially for PCs running the DOS operating system. The first version was released in 1985 with 44 commands.


Development

There were 17 major releases of Stata between 1985 and 2021, with additional code and documentation updates between major releases. In its early years, extra sets of Stata programs were sometimes sold as "kits" or distributed as Support Disks. With the release of Stata 6 in 1999, updates began to be delivered to users via the web. The initial release of Stata was for the DOS operating system. Since then, versions of Stata have been released for systems running Unix variants such as Linux distributions, Windows, and macOS. All Stata files are platform-independent. Hundreds of commands have been added to Stata in its 37-year history. Certain developments have proved to be particularly important and continue to shape the user experience today, including extensibility, platform independence, and the active user community.


Extensibility

The program command was implemented in Stata 1.2, giving users the ability to add their own commands. Ado-files followed in Stata 2.1, allowing a user-written program to be automatically loaded into memory. Many user-written ado-files are submitted to the Statistical Software Components Archive hosted by Boston College. StataCorp added an ssc command so that community-contributed programs can be installed directly from within Stata. More recent editions of Stata allow users to call Python and R scripts using commands, and allow Python environments such as Jupyter Notebook to run Stata commands.
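A user-written command can be sketched in a few lines. The program name double_it and its argument are hypothetical, chosen only to show the shape of a program definition; the example is meant to be run from a do-file.

```stata
capture program drop double_it    // allow the do-file to be rerun without an error

program define double_it, rclass
    args x                            // first argument, stored in the local macro x
    return scalar result = 2 * `x'    // make the result available as r(result)
end

double_it 21
display r(result)                 // show the value returned by the new command
```

Saving the same definition in a file named double_it.ado on Stata's ado-path would let Stata load the command automatically on first use. Community-contributed packages from the archive are installed with ssc install followed by the package name.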


User community

A number of important developments were initiated by Stata's active user community. The ''Stata Technical Bulletin'', which often contained user-created commands, was introduced in 1991 and issued six times a year. It was relaunched in 2001 as the peer-reviewed ''Stata Journal'', a quarterly publication containing descriptions of community-contributed commands and tips for the effective use of Stata. In 1994, a listserv began as a hub for users to collaboratively solve coding and technical issues; in 2014, it was converted into a web forum. In 1995, StataCorp began organizing user and developer conferences that meet annually. User group meetings are held annually in the United States (the Stata Conference), the UK, Germany, and Italy, and less frequently in several other countries. Only the Stata Conference in the United States is hosted by StataCorp; elsewhere, local Stata distributors host user group meetings in their own countries.


Software products

There are four builds of Stata: Stata/MP, Stata/SE, Stata/BE, and Numerics by Stata. Whereas Stata/MP allows for built-in parallel processing of certain commands, Stata/SE and Stata/BE are limited to a single core. On four CPU cores, Stata/MP runs certain commands about 2.4 times faster than the SE or BE versions, roughly 60% of the theoretical maximum efficiency. Numerics by Stata allows for web integration of Stata commands. The SE and BE versions differ in the amount of memory that datasets may use. Stata/MP can store 10 to 20 billion observations and up to 120,000 variables, whereas Stata/SE and Stata/BE store up to 2.14 billion observations and handle 32,767 variables and 2,048 variables respectively. The maximum number of independent variables in a model is 65,532 in Stata/MP, 10,998 in Stata/SE, and 798 in Stata/BE.

The pricing and licensing of Stata depend on its intended use: business, government/nonprofit, education, or student. Single-user licenses are either renewable annually or perpetual. Other license types include a single license for use by concurrent users, a site license, a volume single-user license for bulk pricing, and a student lab license.
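The 60% efficiency figure quoted for Stata/MP follows from dividing the observed speedup by the number of cores. As a quick check, writing S for the speedup and p for the core count (symbols introduced here for illustration only):

```latex
E = \frac{S}{p} = \frac{2.4}{4} = 0.60 = 60\%
```

Perfect scaling on four cores would give S = 4 and E = 100%.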


Example code

The following set of commands revolves around simple data management.

    sysuse auto                    // Open the included auto dataset
    browse                         // Browse the dataset (opens the Data Editor window)
    describe                       // Describe the dataset and associated variables
    summarize                      // Summary information about numerical variables
    codebook make foreign          // Summary information about the make (string) and foreign (numeric) variables
    browse if missing(rep78)       // Browse only observations with missing data for variable rep78
    list make if missing(rep78)    // List makes of the cars with missing data for variable rep78

The next set of commands moves on to descriptive statistics.

    summarize price, detail             // Detailed summary statistics for variable price
    tabulate foreign                    // One-way frequency table for variable foreign
    tabulate rep78 foreign, row         // Two-way frequency table for variables rep78 and foreign
    summarize mpg if foreign == 1       // Summary information about mpg if the car is foreign (the "==" sign tests for equality)
    by foreign, sort: summarize mpg     // As above, but using the "by" prefix
    tabulate foreign, summarize(mpg)    // As above, but using the tabulate command

A simple hypothesis test:

    ttest mpg, by(foreign)    // T-test for difference in means for domestic vs. foreign cars

Graphing data:

    twoway (scatter mpg weight)                        // Scatter plot showing relationship between mpg and weight
    twoway (scatter mpg weight), by(foreign, total)    // Three graphs for domestic, foreign, and all cars

Linear regression:

    generate wtsq = weight^2                        // Create a new variable for weight squared
    regress mpg weight wtsq foreign, vce(robust)    // Linear regression of mpg on weight, wtsq, and foreign
    predict mpghat                                  // Create a new variable containing the predicted values of mpg
    twoway (scatter mpg weight) (line mpghat weight, sort), by(foreign)    // Graph data and fitted line


See also

* List of statistical packages
* Comparison of statistical packages
* Data analysis


External links

* Stata Journal
* Stata Press
* Stata Technical Bulletin

