A GIS file format or geospatial file format is a standard for encoding
geographical information
into a
computer file
. It is a specialized type of
file format
for use in
geographic information system
s (GIS),
remote sensing
image processing
tools, and other geospatial applications. Since the 1970s, dozens of formats have been created based on various
data models for various purposes. They have been created by government mapping agencies (such as the
USGS
or
National Geospatial-Intelligence Agency
), GIS software vendors, standards bodies such as the
Open Geospatial Consortium
, informal user communities, and even individual developers.
History
The first GIS installations of the 1960s, such as the
Canada Geographic Information System, were based on bespoke software and stored data in bespoke file structures designed for the needs of the particular project. As more of these appeared, they could be compared to find best practices and common structures.
When general-purpose GIS software was developed in the 1970s and early 1980s, including programs from academic labs such as the
Harvard Laboratory for Computer Graphics and Spatial Analysis, government agencies (e.g., the
Map Overlay and Statistical System (MOSS) developed by the U.S.
Fish & Wildlife Service and
Bureau of Land Management), and new GIS software companies such as
Esri and
Intergraph, each program was built around its own proprietary (and often secret) file format.
Since each GIS installation was effectively isolated from all others, interchange between them was not a major consideration.
By the early 1990s, the proliferation of GIS worldwide and an increasing need for sharing data, soon accelerated by the emergence of the
World Wide Web
and
spatial data infrastructures, led to the need for interoperable data and standard formats. An early attempt at standardization was the U.S.
Spatial Data Transfer Standard, released in 1994 and designed to encode the wide variety of federal government data.
Although this particular format failed to garner widespread support, it led to other standardization efforts, especially the
Open Geospatial Consortium
(OGC), which has developed or adopted several vendor-neutral standards, some of which have been adopted by the
International Organization for Standardization (ISO).
Another development in the 1990s was the public release of proprietary file formats by GIS software vendors, enabling them to be used by other software. The most notable example of this was the publication of the Esri
Shapefile format,
which by the late 1990s had become the most popular ''de facto'' standard for data sharing by the entire geospatial industry.
When proprietary formats were not shared (for example, the Esri ARC/INFO coverage), software developers frequently reverse-engineered them to enable import and export in other software, further facilitating data exchange. One result of this was the emergence of
free and open-source software
libraries, such as the
Geospatial Data Abstraction Library (GDAL), which have greatly facilitated the integration of spatial data in any format into a variety of software.
During the 2000s, the need for specialized spatial files was reduced somewhat by the emergence of
spatial databases, which incorporated spatial data into general-purpose relational databases. However, new file formats have continued to appear, especially with the proliferation of web mapping; formats such as the
Keyhole Markup Language (KML) and
GeoJSON can be more easily integrated into web development languages than traditional GIS files.
Format characteristics
Over a hundred distinct formats have been created for the storage of spatial data, of which 20-30 are currently in common usage for different purposes. These can be distinguished in a number of ways:
* ''Open'' formats are developed collectively by a community and are available for anyone to implement and contribute improvements, while ''Proprietary'' formats have been developed by a software company for use only in their own software and are generally maintained as a trade secret (although they are often reverse-engineered by others). A third category between these would include formats that are owned exclusively by one company or organization, but are published and available for implementation by anyone, such as the Esri
Shapefile.
* Some file formats are ''
text file
s'' that can be read by humans (such as those based on
XML
or
JSON
), especially those intended for data exchange, while others are ''
binary files'', most commonly those designed for native use in GIS software.
* ''Inherently spatial'' formats were designed specifically for storing geographic data, while others are ''spatial extensions'' to formats designed for a more general use (e.g.,
GeoTIFF,
spatial databases).
* Many data formats incorporate some form of ''
data compression
'', especially raster files. Lossless compression methods are generally preferable to
lossy methods, because the original data values need to be retrievable.
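The point about lossless compression can be illustrated with a minimal sketch using only the Python standard library. The grid values and the 16-bit encoding are assumptions chosen for illustration, not part of any particular GIS format; DEFLATE (via `zlib`) stands in for whichever lossless method a real format uses.

```python
import struct
import zlib

# Hypothetical 4x4 elevation grid (metres), stored row by row.
cells = [
    120, 120, 121, 121,
    120, 120, 121, 122,
    119, 120, 120, 122,
    119, 119, 120, 121,
]


def pack_cells(values):
    """Encode integer cell values as little-endian signed 16-bit integers."""
    return struct.pack(f"<{len(values)}h", *values)


def unpack_cells(raw):
    """Decode bytes written by pack_cells back into a list of integers."""
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}h", raw))


raw = pack_cells(cells)
compressed = zlib.compress(raw)  # DEFLATE is a lossless method
restored = unpack_cells(zlib.decompress(compressed))

# Every original cell value is recovered exactly.
print(restored == cells)
```

A lossy method would return values that only approximate the originals, which may be acceptable for background imagery but not for measured quantities such as elevation.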
Raster formats
Like any digital image, raster GIS data is based on a regular tessellation of space into a rectangular grid of rows and columns of ''cells'' (also known as
pixel
s), with each cell having a measured value stored. The major difference from a photograph is that the grid is
registered to geographic space rather than a field of view. The
resolution of the raster data set is its cell width in ground units.
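As a rough sketch of what registration to geographic space means, the functions below convert between a cell's row and column and its ground coordinates. They assume a north-up grid with square cells, described by the ground coordinates of its upper-left corner (`x_min`, `y_max`) and its resolution (`cell_size`); these names are illustrative and not taken from any specific format.

```python
def cell_center(row, col, x_min, y_max, cell_size):
    """Return the ground (x, y) of the centre of cell (row, col).

    Row 0 is the top row, so y decreases as the row index increases.
    """
    x = x_min + (col + 0.5) * cell_size
    y = y_max - (row + 0.5) * cell_size
    return x, y


def cell_at(x, y, x_min, y_max, cell_size):
    """Return the (row, col) of the cell containing ground point (x, y)."""
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return row, col


# Hypothetical 30 m grid whose upper-left corner is at (500000, 4200000).
print(cell_center(2, 3, 500000.0, 4200000.0, 30.0))
print(cell_at(500105.0, 4199925.0, 500000.0, 4200000.0, 30.0))
```

Rotated or sheared grids need a full six-parameter affine transformation instead of this three-parameter form.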
Because a grid is a sample of a continuous space, raster data is most commonly used to represent
geographic fields, in which a property varies continuously or discretely over space. Common examples include
remote sensing
imagery,
terrain/elevation,
population density
,
weather and climate,
soil properties, and many others. Raster data can be images with each pixel (or cell) containing a color value. The value recorded for each cell may be of any
level of measurement, including a discrete qualitative value, such as land use type, or a continuous quantitative value, such as temperature, or a
null
value if no data is available. While a raster cell stores a single value, it can be extended by using raster bands to represent RGB (red, green, blue) colors, colormaps (a mapping between a thematic code and RGB value), or an extended attribute table with one row for each unique cell value. It can also be used to represent discrete
Geographic feature
s, but usually only in exigent circumstances.
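The relationship between cell values, null values, a colormap, and an attribute table can be sketched as follows. The land use codes, colors, and labels are hypothetical, and `None` is used here as the null marker; real formats typically reserve a numeric "no data" value instead.

```python
NODATA = None  # marker for cells with no data (an assumption for this sketch)

# Hypothetical qualitative raster: each cell holds a land use code.
land_use = [
    [1, 1, 2],
    [1, 3, 2],
    [NODATA, 3, 3],
]

# Colormap: thematic code -> RGB value.
colormap = {1: (34, 139, 34), 2: (255, 215, 0), 3: (128, 128, 128)}

# Attribute table: one row per unique cell value.
attribute_table = {1: "forest", 2: "cropland", 3: "urban"}


def render(grid, colors, nodata_color=(0, 0, 0)):
    """Replace each thematic code with its RGB value for display."""
    return [
        [nodata_color if v is NODATA else colors[v] for v in row]
        for row in grid
    ]


def count_cells(grid):
    """Count how many cells carry each value, ignoring null cells."""
    counts = {}
    for row in grid:
        for v in row:
            if v is not NODATA:
                counts[v] = counts.get(v, 0) + 1
    return counts


for code, n in sorted(count_cells(land_use).items()):
    print(attribute_table[code], n)
```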
Raster data is stored in various formats; from a standard file-based structure of TIFF, JPEG, etc. to
binary large object (BLOB) data stored directly in a
relational database management system
(RDBMS) similar to other vector-based feature classes. Database storage, when properly indexed, typically allows for quicker retrieval of the raster data but can require storage of millions of significantly sized records.
Raster format examples
*ADRG –
National Geospatial-Intelligence Agency
(NGA)'s ARC Digitized Raster Graphics
*
Binary file – An unformatted file consisting of raster data written in one of several
data type
s, where multiple bands are stored in BSQ (band sequential), BIP (band interleaved by pixel) or BIL (band interleaved by line) order. Georeferencing and other metadata are stored in one or more
sidecar files.
*
Digital raster graphic (DRG) – digital scan of a paper
USGS
topographic map
*ECRG –
National Geospatial-Intelligence Agency
(NGA)'s Enhanced Compressed ARC Raster Graphics (better resolution than CADRG and no color loss)
*
ECW – Enhanced Compressed Wavelet (from ERDAS). A compressed wavelet format, often lossy.
*
Esri grid – proprietary
binary raster format used by
Esri since the mid-1980s
*
GeoTIFF –
TIFF variant enriched with GIS relevant metadata, especially
georeferencing. An open format that has become one of the most common formats for data sharing.
*IMG –
ERDAS IMAGINE image file format
*
JPEG2000 – Open-standard raster format. A compressed format that allows both lossy and lossless compression.
*
MrSID – Multi-Resolution Seamless Image Database (by Lizardtech). A compressed wavelet format, allows both lossy and lossless compression.
*
netCDF-CF – netCDF file format with
CF metadata conventions for earth science data. Binary storage in open format with optional compression. Allows for direct web access to subsets/aggregations of maps through the
OPeNDAP protocol.
*RPF – Raster Product Format, military file format specified in
MIL-STD-2411
**CADRG – Compressed ADRG, developed by
NGA, nominal compression of 55:1 over ADRG (type of Raster Product Format)
**
CIB – Controlled Image Base, developed by
NGA (type of Raster Product Format)
*
USGS DEM – The
USGS
' Digital Elevation Model
**
GTOPO30 – Large complete Earth elevation model at 30 arc seconds, delivered in the USGS DEM format
*
DTED –
National Geospatial-Intelligence Agency
(NGA)'s Digital Terrain Elevation Data, the military standard for elevation data
*
World file –
sidecar text file that georeferences a raster image file (e.g. JPEG, BMP)
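The BSQ, BIL, and BIP orderings mentioned in the binary file entry above differ only in how a (band, row, column) position maps to a byte offset. The sketch below shows that arithmetic under the assumption of a headerless file with a fixed sample size; the function names are illustrative.

```python
def offset_bsq(band, row, col, n_bands, n_rows, n_cols, sample_size=1):
    """Band sequential: all rows of band 0, then all rows of band 1, ..."""
    return ((band * n_rows + row) * n_cols + col) * sample_size


def offset_bil(band, row, col, n_bands, n_rows, n_cols, sample_size=1):
    """Band interleaved by line: row 0 of every band, then row 1, ..."""
    return ((row * n_bands + band) * n_cols + col) * sample_size


def offset_bip(band, row, col, n_bands, n_rows, n_cols, sample_size=1):
    """Band interleaved by pixel: all bands of one cell stored together."""
    return ((row * n_cols + col) * n_bands + band) * sample_size


# Hypothetical 3-band image of 2 rows x 4 columns, one byte per sample:
# where does band 1 of the top-left cell start in each layout?
shape = dict(n_bands=3, n_rows=2, n_cols=4)
print(offset_bsq(1, 0, 0, **shape))
print(offset_bil(1, 0, 0, **shape))
print(offset_bip(1, 0, 0, **shape))
```

The layout and sample type cannot be inferred from the data itself, which is why such files depend on their sidecar metadata.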
Vector formats
A ''vector'' dataset (sometimes called a ''feature'' dataset) stores information about discrete objects, using an encoding of the
vector logical data model to represent the location or ''geometry'' of each object, and an encoding of its other properties that is usually based on
relational database
technology. Typically, a single dataset collects information about a set of closely related or similar objects, such as all of the roads in a city.
The vector data model uses
coordinate geometry to represent each shape as one of several
geometric primitive
s, most commonly ''
points'' (a single coordinate of zero
dimension
), ''
lines'' (a one-dimensional ordered list of coordinates connected by straight lines), and ''
polygon
s'' (a self-closing boundary line enclosing a two-dimensional region). Many data structures have been developed to encode these primitives as digital data, but most modern vector file formats are based on the
Open Geospatial Consortium
(OGC)
Simple Features specification, often directly incorporating its
Well-known text (WKT) or Well-known binary (WKB) encodings.
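To make the three primitives and their encodings concrete, the following sketch writes a point, a line, and a polygon as WKT strings, and a point as WKB. It is a minimal illustration, not a full implementation of the Simple Features specification: it handles only 2D coordinates, a single polygon ring, and little-endian WKB.

```python
import struct


def point_wkt(x, y):
    """A point: a single coordinate pair."""
    return f"POINT ({x} {y})"


def linestring_wkt(coords):
    """A line: an ordered list of coordinates."""
    body = ", ".join(f"{x} {y}" for x, y in coords)
    return f"LINESTRING ({body})"


def polygon_wkt(ring):
    """A polygon: a boundary ring whose last vertex repeats the first."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring if the caller did not
    body = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({body}))"


def point_wkb(x, y):
    """WKB point: byte order flag (1 = little-endian), geometry type
    (1 = Point) as a 32-bit unsigned integer, then x and y as doubles."""
    return struct.pack("<BIdd", 1, 1, x, y)


print(point_wkt(30, 10))
print(linestring_wkt([(30, 10), (10, 30), (40, 40)]))
print(polygon_wkt([(0, 0), (4, 0), (4, 3)]))
print(point_wkb(30.0, 10.0).hex())
```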
In addition to the geometry of each object, a vector dataset must also be able to store its ''attributes''. For example, a database that describes lakes may contain each lake's depth, water quality, and pollution level. Since the 1970s, almost all vector file formats have adopted the
relational database
model, either in principle or directly incorporating
RDBMS software. Thus, the entire dataset is stored in a ''table'', with each ''row'' representing a single object that contains ''columns'' for each attribute.
Two strategies have been used to integrate the geometry and attributes into a single vector file format structure:
* A ''
georelational format'' stores them as two separate files, with the geometry and attributes of each object being linked by file ordering or a
primary key. This was most common from the 1970s through the early 1990s, because GIS software developers had to invent their own geometry data structures, but incorporated existing relational database file formats for the attributes. For example, the
Esri Shapefile format includes the .dbf file from the DOS
dBase
software.
* The ''Object-based model'' stores them in a single structure, loosely or directly based on the objects in
object-oriented programming
languages. This is the basis of most modern file formats, including
spatial databases that include a geometry column along with the other attributes in a single relational table. Other formats, such as
GeoJSON, use different structures for geometry and attributes, but combine them for each object in the same file.
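The two strategies can be contrasted in a short sketch. It starts from a georelational-style layout, with geometries and attributes held separately and linked by a key, and joins them into GeoJSON features in which each object carries both. The lake names, depths, and coordinates are hypothetical, and `join_to_features` is an illustrative helper rather than part of any library.

```python
import json

# Georelational style: geometry and attributes live in separate structures,
# linked by a shared key (hypothetical data).
geometries = {1: [-120.0, 39.0], 2: [-122.1, 42.9]}
attributes = [
    {"id": 1, "name": "Lake A", "depth_m": 501},
    {"id": 2, "name": "Lake B", "depth_m": 594},
]


def join_to_features(geometries, attributes, key="id"):
    """Combine both structures into a GeoJSON FeatureCollection, where each
    object holds its geometry and its properties together."""
    features = []
    for row in attributes:
        props = {k: v for k, v in row.items() if k != key}
        features.append({
            "type": "Feature",
            "id": row[key],
            "geometry": {"type": "Point", "coordinates": geometries[row[key]]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


lakes = join_to_features(geometries, attributes)
text = json.dumps(lakes, indent=2)
print(text)
```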
Geospatial topology is often an important part of vector data, representing the inherent spatial relationships (especially adjacency) between objects. Topology has been managed in vector file formats in four ways. In a ''topological data structure'', most notably Harvard's POLYVRT and its successor the
ARC/INFO coverage, topological connections between points, lines, and polygons are an inherent part of the encoding of those features.
Conversely, non-topological or ''spaghetti data'' (such as the Esri
Shapefile and most
spatial databases) includes no topology information, with each geometry being completely independent of all others. A ''topology dataset'' (often used in
network analysis) augments spaghetti data with a separate file encoding the topological connections.
A ''topology rulebase'' is a list of desired topology rules used to enforce spatial integrity in spaghetti data, such as "county polygons must not overlap" and "state polygons must share boundaries with county polygons."
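Because spaghetti data stores no topology, relationships such as adjacency have to be derived from the coordinates. The sketch below finds adjacent polygons by looking for shared boundary edges. It assumes that neighboring polygons use exactly identical vertices along their common boundary, which real data often violates; the polygon names and coordinates are hypothetical.

```python
def edges(ring):
    """Return the undirected edges of a ring given as a list of vertices
    (without repeating the first vertex at the end)."""
    n = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}


def shared_edges(ring_a, ring_b):
    """Edges that appear in both rings, regardless of direction."""
    return edges(ring_a) & edges(ring_b)


def adjacency(polygons):
    """Return the set of name pairs whose polygons share at least one edge."""
    names = sorted(polygons)
    pairs = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_edges(polygons[a], polygons[b]):
                pairs.add((a, b))
    return pairs


# Three hypothetical unit squares: A and B touch along x = 1, C is apart.
counties = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(3, 0), (4, 0), (4, 1), (3, 1)],
}
print(adjacency(counties))
```

A topological data structure avoids this search by storing each shared edge once, together with the polygons on either side of it.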
Vector datasets usually represent discrete
geographical features, such as buildings, trees, and counties. However, they may also be used to represent
geographical fields by storing locations where the spatially continuous field has been sampled. Sample points (e.g.,
weather stations and
sensor networks),
Contour line
s and
triangulated irregular networks (TIN) are used to represent elevation or other values that change continuously over space. TINs record values at point locations, which are connected by lines to form an irregular mesh of triangles. The faces of the triangles represent the terrain surface.
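One common way to read a value from a triangle face is linear interpolation with barycentric weights, sketched below for a single triangle. This is an illustration of the idea rather than the method of any particular TIN format; the vertex coordinates and elevations are hypothetical.

```python
def interpolate_z(p, a, b, c):
    """Estimate the value at point p = (x, y) on the planar face of the
    triangle with vertices a, b, c, each given as (x, y, z).

    Returns None if p lies outside the triangle.
    """
    x, y = p
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    # Barycentric weights of p with respect to a, b and c.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1 - w1 - w2
    if min(w1, w2, w3) < 0:
        return None
    return w1 * z1 + w2 * z2 + w3 * z3


# Hypothetical triangle with elevations 100, 200 and 300 at its corners.
a, b, c = (0, 0, 100), (10, 0, 200), (0, 10, 300)
print(interpolate_z((5, 0), a, b, c))
print(interpolate_z((20, 20), a, b, c))
```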
Example vector file formats
Formats commonly in current usage:
*
Shapefile – a popular vector data GIS format, developed by
Esri
*
Geography Markup Language (GML) – XML-based open standard (by
OpenGIS) for GIS data exchange
*
GeoJSON – a lightweight format based on
JSON
, used by many open source GIS packages
*
GeoMedia –
Intergraph's
Microsoft Access
based format for spatial vector storage
*
Keyhole Markup Language (KML) – XML-based open standard (by
OpenGIS) for GIS data exchange
*
MapInfo TAB format –
MapInfo's vector data format using TAB, DAT, ID and MAP files
*
Measure Map Pro format –
XML
data format to store GIS data
*
National Transfer Format (NTF) – mostly used by the UK Ordnance Survey
*
Spatialite – a spatial extension to
SQLite, providing vector geodatabase functionality. It is similar to
PostGIS
,
Oracle Spatial, and SQL Server with spatial extensions
*
Simple Features –
Open Geospatial Consortium
specification for vector data
**
Well-known text (WKT) – A text markup language for representing feature geometry, developed by
Open Geospatial Consortium
**
Well-known binary (WKB) – Binary version of well-known text, used in many
spatial databases
*
SOSI – a spatial data format used for all public exchange of spatial data in Norway
*
AutoCAD DXF – data transfer format for
AutoCAD data (by
Autodesk
)
*
Geographic Data Files
(GDF) – An interchange file format for geographic data
Historical formats seldom used today:
*
ArcInfo Coverage – topological data structure used in Arc/INFO from 1981 through 2000
*
Esri TIN – proprietary
binary format for
triangulated irregular network data used by
Esri
*
Digital line graph (DLG) – a USGS format for vector data
*
TIGER
– Topologically Integrated Geographic Encoding and Referencing
*
Vector Product Format (VPF) –
National Geospatial-Intelligence Agency
(NGA)'s format of vectored data for large geographic databases
*
Spatial Data File –
Autodesk
's high-performance geodatabase format, native to
MapGuide
* ISFC –
Intergraph's
MicroStation based CAD solution attaching vector elements to a relational
Microsoft Access
database
*
Dual Independent Map Encoding (DIME) – A historic GIS file format, developed in the 1960s
Advantages and disadvantages
There are some important advantages and disadvantages to using a raster or vector data model to represent reality:
* Raster datasets record a value for every cell in the area covered, which may require more storage space than a vector format that stores data only where needed.
* Raster data is computationally less expensive to render than vector graphics.
* Combining values and writing custom formulas for combining values from different layers are much easier using raster data.
* There are transparency and aliasing problems when overlaying multiple stacked pieces of raster images.
* Vector data allows for visually smooth and easy implementation of overlay operations, especially in terms of graphics and shape-driven information like maps, routes and custom fonts, which are more difficult with raster data.
* Vector data can be displayed as
vector graphics
used on traditional maps, whereas raster data will appear as an
image
that may have a blocky appearance for object boundaries (depending on the resolution of the raster file).
* Vector data can be easier to register, scale, and re-project, which can simplify combining vector layers from different sources.
* Vector data is more compatible with relational database environments, where they can be part of a relational table as a normal column and processed using a multitude of operators.
* Vector file sizes are usually smaller than raster data, which can be tens, hundreds or more times larger than vector data (depending on resolution).
* Vector data is simpler to update and maintain, whereas a raster image will have to be completely reproduced (for example, when a new road is added).
* Vector data allows much more analysis capability, especially for "networks" such as roads, power, rail, telecommunications, etc. (Examples: Best route, largest port, airfields connected to two-lane highways). Raster data will not have all the characteristics of the features it displays.
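The claim that custom formulas are easy to apply across raster layers follows from the shared grid: when two layers are aligned, combining them is a cell-by-cell calculation. The sketch below assumes two aligned grids of equal size, with `None` marking null cells; the layers and the formula are hypothetical.

```python
def combine(layer_a, layer_b, formula):
    """Apply formula(a, b) to each pair of corresponding cells.

    Both layers must have the same number of rows and columns. A cell that
    is null (None) in either input is null in the output.
    """
    if len(layer_a) != len(layer_b) or any(
        len(ra) != len(rb) for ra, rb in zip(layer_a, layer_b)
    ):
        raise ValueError("layers must have the same dimensions")
    return [
        [None if a is None or b is None else formula(a, b)
         for a, b in zip(ra, rb)]
        for ra, rb in zip(layer_a, layer_b)
    ]


# Hypothetical layers: slope class and rainfall class for the same area.
slope = [[1, 2], [3, None]]
rainfall = [[2, 2], [1, 3]]

# Hypothetical risk score: slope counts once, rainfall twice.
risk = combine(slope, rainfall, lambda s, r: s + 2 * r)
print(risk)
```

The equivalent operation on vector layers requires a geometric overlay to work out where features intersect before any values can be combined.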
Integrated file formats
Modern
object–relational databases can now store a variety of complex data using the
binary large object datatype, including both raster grids and vector geometries. This enables some
spatial database systems to store data of both models in the same database.
*
Esri File
Geodatabase – A proprietary format for storing "feature" (vector) and raster data locally
*
Esri Enterprise
Geodatabase – A proprietary model for storing a geodatabase structure in a variety of commercial and open-source
relational database management system
s
*
GeoPackage (GPKG) – A standards-based, open format based on the SQLite database format for both vector and raster data, adopted by the
Open Geospatial Consortium
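The idea of keeping both data models in one database through binary large objects can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the principle: the table layout is hypothetical and is not the GeoPackage or geodatabase schema, both of which define their own metadata tables and binary encodings.

```python
import sqlite3
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian well-known binary."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(blob):
    """Decode a point written by point_to_wkb."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", blob)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are handled here")
    return x, y


conn = sqlite3.connect(":memory:")
# Vector table: ordinary attribute columns plus a geometry BLOB column.
conn.execute("CREATE TABLE lakes (id INTEGER PRIMARY KEY, name TEXT, geom BLOB)")
# Raster table: grid dimensions plus the cell values as a BLOB.
conn.execute(
    "CREATE TABLE tiles (id INTEGER PRIMARY KEY, n_rows INTEGER, "
    "n_cols INTEGER, cells BLOB)"
)

conn.execute(
    "INSERT INTO lakes (name, geom) VALUES (?, ?)",
    ("Lake A", point_to_wkb(-120.0, 39.0)),
)
grid = [7, 7, 8, 9, 9, 9]  # hypothetical 2x3 grid of one-byte cell values
conn.execute(
    "INSERT INTO tiles (n_rows, n_cols, cells) VALUES (?, ?, ?)",
    (2, 3, bytes(grid)),
)

name, geom = conn.execute("SELECT name, geom FROM lakes").fetchone()
n_rows, n_cols, cells = conn.execute(
    "SELECT n_rows, n_cols, cells FROM tiles"
).fetchone()

lake_xy = wkb_to_point(geom)
tile = [list(cells[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
print(name, lake_xy)
print(tile)
```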
See also
*
Datum (geodesy)
*
GDAL/OGR, a library for reading and writing many formats
*
Feature Manipulation Engine (FME), a commercial program for converting data between a large number of formats