The shapefile format is a geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products. The shapefile format can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes. Each item usually has attributes that describe it, such as ''name'' or ''temperature''.
Overview
The shapefile format is a digital vector storage format for storing geographic location and associated attribute information. This format lacks the capacity to store topological information. The shapefile format was introduced with ArcView GIS version 2 in the early 1990s. It is now possible to read and write geographical datasets using the shapefile format with a wide variety of software.
The shapefile format stores the geometry as primitive geometric shapes like points, lines, and polygons. These shapes, together with data attributes that are linked to each shape, create the representation of the geographic data. The term "shapefile" is quite common, but the format consists of a collection of files with a common filename prefix, stored in the same directory. The three ''mandatory'' files have filename extensions .shp, .shx, and .dbf. The actual ''shapefile'' relates specifically to the .shp file, but this file alone is incomplete for distribution, as the other supporting files are required. In line with the ''ESRI Shapefile Technical Description'', legacy GIS software may expect that the filename prefix be limited to eight characters to conform to the DOS 8.3 filename convention, though modern software applications accept files with longer names.
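Because a dataset is only usable when all three mandatory files share a prefix and a directory, a quick completeness check can be useful before distributing one. The following is a minimal sketch; the function name `missing_components` is hypothetical, and the check assumes lower-case extensions, which may not hold on case-sensitive file systems.

```python
from pathlib import Path

# The three mandatory components named in the text above.
MANDATORY_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_components(shp_path):
    """Return the mandatory extensions with no file next to shp_path.

    Only the last suffix of shp_path is swapped, so every candidate
    shares the same directory and filename prefix.
    """
    base = Path(shp_path)
    return [ext for ext in MANDATORY_EXTENSIONS
            if not base.with_suffix(ext).exists()]
```

An empty list means the three mandatory files are present; it says nothing about whether their contents are valid.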
Mandatory files
; .shp
: Shape format; the feature geometry itself.
; .shx
: Shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly.
; .dbf
: Attribute format; columnar attributes for each shape, in dBase IV format.
Other files
* .prj: projection description, using a well-known text representation of coordinate reference systems
* .sbn and .sbx: a spatial index of the features
* .fbn and .fbx: a spatial index of the features that are read-only
* .ain and .aih: an attribute index of the active fields in a table
* .ixs: a geocoding index for read-write datasets
* .mxs: a geocoding index for read-write datasets (ODB format)
* .atx: an attribute index for the .dbf file in the form of ''shapefile''.''columnname''.atx (ArcGIS 8 and later)
* .shp.xml: geospatial metadata in XML format, such as ISO 19115 or other XML schema
* .cpg: used to specify the code page (only for .dbf) for identifying the character encoding to be used
* .qix: an alternative quadtree spatial index used by MapServer and GDAL/OGR software
In each of the .shp, .shx, and .dbf files, the shapes in each file correspond to each other in sequence (i.e., the first record in the .shp file corresponds to the first record in the .shx and .dbf files, etc.). The .shp and .shx files have various fields with different endianness, so an implementer of the file formats must be very careful to respect the endianness of each field and treat it properly.
File formats
Shapefile shape format (.shp)
The main file (.shp) contains the geometry data. Geometry of a given feature is stored as a set of vector coordinates. The binary file consists of a single fixed-length header followed by one or more variable-length records. Each of the variable-length records includes a record-header component and a record-contents component. A detailed description of the file format is given in the ''ESRI Shapefile Technical Description''.
This format should not be confused with the AutoCAD shape font source format, which shares the .shp extension.
The 2D axis ordering of coordinate data assumes a Cartesian coordinate system, using the order (X Y) or (Easting Northing). This axis order is consistent for geographic coordinate systems, where the order is similarly (longitude latitude). Geometries may also support 3- or 4-dimensional Z and M coordinates, for elevation and measure, respectively. A Z-dimension stores the elevation of each coordinate in 3D space, which can be used for analysis or for visualisation of geometries using 3D computer graphics. The user-defined M dimension can be used for one of many functions, such as storing linear referencing measures or relative time of a feature in 4D space.
The main file header is fixed at 100 bytes in length and contains 17 fields: nine 4-byte (32-bit signed integer or int32) integer fields followed by eight 8-byte (double) signed floating point fields:
Shapefile headers
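The header layout can be illustrated with a short decoder. This is a sketch rather than a reference implementation: the field positions and byte orders follow the ''ESRI Shapefile Technical Description'' as commonly summarised (file code 9994 and file length big-endian; version, shape type and bounding box little-endian), and the function name and dictionary keys are hypothetical.

```python
import struct


def parse_shp_header(data):
    """Decode the 100-byte main file header from a bytes object."""
    if len(data) < 100:
        raise ValueError("a shapefile header is 100 bytes long")
    # Bytes 0..3: file code, big-endian. Bytes 4..23: five unused int32.
    (file_code,) = struct.unpack(">i", data[0:4])
    if file_code != 9994:
        raise ValueError("unexpected file code; not a shapefile header")
    # Bytes 24..27: file length in 16-bit words, big-endian.
    (length_words,) = struct.unpack(">i", data[24:28])
    # Bytes 28..35: version and shape type, little-endian.
    version, shape_type = struct.unpack("<2i", data[28:36])
    # Bytes 36..99: eight little-endian doubles (X/Y box, Z range, M range).
    xmin, ymin, xmax, ymax, zmin, zmax, mmin, mmax = struct.unpack(
        "<8d", data[36:100])
    return {
        "file_length_bytes": 2 * length_words,
        "version": version,
        "shape_type": shape_type,
        "bbox": (xmin, ymin, xmax, ymax),
        "z_range": (zmin, zmax),
        "m_range": (mmin, mmax),
    }
```

Mixing `">"` and `"<"` format prefixes in one header is exactly the endianness pitfall that implementers need to watch for.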
Shapefile record headers
The file then contains any number of variable-length records. Each record is prefixed with a record header of 8 bytes:
Shapefile records
Following the record header is the actual record:
The variable-length record contents depend on the shape type, which must be either the shape type given in the file header or Null. The following are the possible shape types:
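As a rough illustration of how records follow the header, the sketch below walks the records of a .shp file held in memory. It assumes the layout given in the ''ESRI Shapefile Technical Description'': an 8-byte record header holding a big-endian record number and content length (in 16-bit words), followed by content that starts with a little-endian shape type. The shape type codes are listed as commonly documented; `SHAPE_TYPES` and `iter_records` are hypothetical names.

```python
import struct

# Shape type codes as listed in the technical description.
SHAPE_TYPES = {
    0: "Null", 1: "Point", 3: "Polyline", 5: "Polygon", 8: "MultiPoint",
    11: "PointZ", 13: "PolylineZ", 15: "PolygonZ", 18: "MultiPointZ",
    21: "PointM", 23: "PolylineM", 25: "PolygonM", 28: "MultiPointM",
    31: "MultiPatch",
}


def iter_records(data):
    """Yield (record_number, shape_type_name, content) from .shp bytes."""
    pos = 100  # records start right after the fixed-length file header
    while pos + 8 <= len(data):
        # Record header: number and content length, both big-endian.
        number, length_words = struct.unpack(">2i", data[pos:pos + 8])
        start = pos + 8
        end = start + 2 * length_words  # length is counted in 16-bit words
        content = data[start:end]
        if length_words < 2 or len(content) < 2 * length_words:
            raise ValueError("truncated or corrupt record")
        # Record content begins with a little-endian shape type.
        (shape_type,) = struct.unpack("<i", content[:4])
        yield number, SHAPE_TYPES.get(shape_type, "Unknown"), content
        pos = end
```

For a 2D Point record, the remaining 16 bytes of `content` can be decoded with `struct.unpack("<2d", content[4:20])`.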
Shapefile shape index format (.shx)
The index contains a positional index of the feature geometry. It has the same 100-byte header as the .shp file, followed by any number of 8-byte fixed-length records, each consisting of two fields: a record offset and a record length.
Using this index, it is possible to seek backwards in the shapefile by first seeking backwards in the shape index (which is possible because it uses fixed-length records), then reading the record offset, and using that offset to seek to the correct position in the .shp file. It is also possible to seek forwards an arbitrary number of records using the same method.
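The seek procedure described above can be sketched as follows. The sketch assumes that each index entry stores a big-endian offset and content length, both measured in 16-bit words, as in the ''ESRI Shapefile Technical Description''; the function name and the 0-based record argument are choices made for this example.

```python
import struct


def read_record_via_index(shp, shx, n):
    """Return the raw content of record n (0-based).

    shp and shx are binary file objects opened for reading.
    """
    # Index entries are fixed-length, so entry n is at a known position.
    shx.seek(100 + 8 * n)
    entry = shx.read(8)
    if len(entry) < 8:
        raise IndexError("record index out of range")
    offset_words, length_words = struct.unpack(">2i", entry)
    # The offset points at the record header; skip its 8 bytes.
    shp.seek(2 * offset_words + 8)
    return shp.read(2 * length_words)
```

Because the lookup touches only one index entry, the cost of reaching a record does not depend on how many variable-length records precede it.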
It is possible to generate the complete index file given a lone .shp file. However, since a shapefile is supposed to always contain an index, doing so counts as repairing a corrupt file.
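A minimal sketch of such a repair is shown below. It assumes the index reuses the main file header with only the file length field changed, and that offsets and lengths are big-endian counts of 16-bit words; `build_shx` is a hypothetical name, and real repair tools may perform further validation.

```python
import struct


def build_shx(shp_bytes):
    """Rebuild the bytes of an .shx index from the bytes of a .shp file."""
    header = bytearray(shp_bytes[:100])
    if len(header) < 100:
        raise ValueError("missing 100-byte header")
    entries = []
    pos = 100
    while pos + 8 <= len(shp_bytes):
        # Only the content length is needed from each record header.
        _, length_words = struct.unpack(">2i", shp_bytes[pos:pos + 8])
        if length_words < 0:
            raise ValueError("corrupt record length")
        # Index entry: record offset and content length, in 16-bit words.
        entries.append(struct.pack(">2i", pos // 2, length_words))
        pos += 8 + 2 * length_words
    # Bytes 24..27 hold the length of the index file itself, in words.
    header[24:28] = struct.pack(">i", (100 + 8 * len(entries)) // 2)
    return bytes(header) + b"".join(entries)
```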
Shapefile attribute format (.dbf)
This file stores the attributes for each shape; it uses the dBase IV format. The format is public knowledge, and has been implemented in many dBase clones known as xBase. The open-source shapefile C library, for example, calls its format "xBase" even though it is plain dBase IV.
The names and values of attributes are not standardized, and will be different depending on the source of the shapefile.
Shapefile spatial index format (.sbn)
This is a binary spatial index file, which is used only by Esri software. The format is not documented by Esri; however, it has been reverse-engineered and documented by the open source community, though it is not currently implemented by other vendors. The 100-byte header is similar to the one in .shp. The .sbn file is not strictly necessary, since the .shp file contains all of the information necessary to successfully parse the spatial data.
Limitations
The shapefile format has a number of limitations.
Topology and the shapefile format
The shapefile format does not have the ability to store topological relationships between shapes. The ESRI ArcInfo coverages and many geodatabases do have the ability to store feature topology.
Data storage
The size of both .shp and .dbf component files cannot exceed 2 GB (or 2^31 bytes), which allows around 70 million point features at best. The maximum number of features for other geometry types varies depending on the number of vertices used.
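The order of magnitude of the point limit can be checked with simple arithmetic. The sketch below considers the .shp file alone and assumes 2D Point records of 28 bytes each (8-byte record header, 4-byte shape type, two 8-byte doubles); the .dbf file can reach the same size limit sooner, depending on how wide its attribute records are.

```python
SIZE_LIMIT = 2 ** 31          # 2 GB limit, in bytes
FILE_HEADER = 100             # fixed-length main file header
POINT_RECORD = 8 + 4 + 2 * 8  # record header + shape type + X and Y


def max_point_records(limit=SIZE_LIMIT):
    """Upper bound on 2D Point records that fit in one .shp file."""
    return (limit - FILE_HEADER) // POINT_RECORD


print(max_point_records())  # 76695841
```

This bound of roughly 77 million is a best case for geometry only, which is consistent with the "around 70 million" figure once attribute storage is taken into account.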
The attribute database format for the .dbf component file is based on an older dBase standard. This database format inherently has a number of limitations:
*While the current dBase standard and GDAL/OGR (the main open source software library for reading and writing shapefile format datasets) support null values, ESRI software represents these values as zeros. This is a serious issue for analyzing quantitative data, as it may skew representation and statistics
*Poor support for Unicode field names or field storage
*Maximum length of field names is 10 characters
*Maximum number of fields is 255
*Supported field types are: floating point (13 character storage), integer (4 or 9 character storage), date (no time storage; 8 character storage), and text (maximum 254 character storage)
*Floating point numbers may contain rounding errors since they are stored as text
Mixing shape types
Because the shape type precedes each geometry record, a shapefile is technically capable of storing a mixture of different shape types. However, the specification states, "All the non-Null shapes in a shapefile are required to be of the same shape type." Therefore, this ability to mix shape types must be limited to interspersing null shapes with the single shape type declared in the file's header. A shapefile must not contain both polyline and polygon data; for example, the descriptions for a well (point), a river (polyline), and a lake (polygon) would be stored in three separate datasets.
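The rule quoted above is easy to check once the shape type of each record is known. The following sketch assumes the numeric shape type codes have already been read from the file header and from each record, and that 0 is the code for a Null shape; the function name is hypothetical.

```python
NULL_SHAPE = 0  # shape type code assumed for Null shapes


def find_mixed_types(header_type, record_types):
    """Return the positions of records that break the single-type rule.

    A record is acceptable if it is Null or matches the header's type.
    """
    return [i for i, t in enumerate(record_types)
            if t not in (NULL_SHAPE, header_type)]
```

For a polygon file (type 5 in the technical description), `find_mixed_types(5, [5, 0, 5, 3])` flags only the last record, which carries a polyline code.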
See also
* Geographic information system
* Open Geospatial Consortium
* Open Source Geospatial Foundation (OSGeo)
* List of geographic information systems software
* Comparison of geographic information systems software
External links
* Shapefile file extensions – Esri Webhelp docs for ArcGIS 10.0 (2010)
* shapelib.maptools.org – Free C library for reading/writing shapefiles
* Python Shapefile Library – Open Source (MIT License) Python library for reading/writing shapefiles
* Java Shapefile and Dbase Libraries – Open Source (Apache License) Java libraries for reading/writing shapefiles and the associated dBase files (libraries are part of th... but could be used independently)