Shapefile

Figure: A vector map, with points, polylines and polygons
Filename extension: .shp, .shx, .dbf
Internet media type: x-gis/x-shapefile
Developed by: Esri
Type of format: GIS
Standard: Shapefile Technical Description

The shapefile format is a geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products. [1] The shapefile format can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes. Each item usually has attributes that describe it, such as name or temperature.

Overview

The shapefile format is a digital vector storage format for storing geographic location and associated attribute information. This format lacks the capacity to store topological information. The shapefile format was introduced with ArcView GIS version 2 in the early 1990s. It is now possible to read and write geographical datasets using the shapefile format with a wide variety of software.

The shapefile format stores the geometry as primitive geometric shapes like points, lines, and polygons. These shapes, together with data attributes that are linked to each shape, create the representation of the geographic data. Although the term "shapefile" is common, the format consists of a collection of files with a common filename prefix, stored in the same directory. The three mandatory files have the filename extensions .shp, .shx, and .dbf. Strictly, "shapefile" refers to the .shp file, but that file alone is incomplete for distribution because the other supporting files are required. Legacy GIS software may expect the filename prefix to be limited to eight characters to conform to the DOS 8.3 filename convention, though modern software applications accept files with longer names.

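Mandatory files
.shp: shape format; the feature geometry itself
.shx: shape index format; a positional index of the feature geometry
.dbf: attribute format; the attributes for each shape, in dBase IV format

Other files
.sbn: a binary spatial index, used only by Esri software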

In each of the .shp, .shx, and .dbf files, the shapes in each file correspond to each other in sequence (i.e., the first record in the .shp file corresponds to the first record in the .shx and .dbf files, etc.). The .shp and .shx files have various fields with different endianness, so an implementer of the file formats must be very careful to respect the endianness of each field and treat it properly.

Shapefile shape format (.shp)

The main file (.shp) contains the geometry data. Geometry of a given feature is stored as a set of vector coordinates. [1] :5 The binary file consists of a single fixed-length header followed by one or more variable-length records. Each of the variable-length records includes a record-header component and a record-contents component. A detailed description of the file format is given in the ESRI Shapefile Technical Description. [1] This format should not be confused with the AutoCAD shape font source format, which shares the .shp extension.

The 2D axis ordering of coordinate data assumes a Cartesian coordinate system, using the order (X Y) or (Easting Northing). This axis order is consistent for Geographic coordinate systems, where the order is similarly (longitude latitude). Geometries may also support 3- or 4-dimensional Z and M coordinates, for elevation and measure, respectively. A Z-dimension stores the elevation of each coordinate in 3D space, which can be used for analysis or for visualisation of geometries using 3D computer graphics. The user-defined M dimension can be used for one of many functions, such as storing linear referencing measures or relative time of a feature in 4D space.

The main file header is fixed at 100 bytes in length and contains 17 fields: nine 4-byte integer fields (32-bit signed integer, or int32) followed by eight 8-byte signed floating-point fields (double):

Shapefile headers

Header of a .shp file format
Bytes   Type    Endianness  Usage
0–3     int32   big         File code (always hex value 0x0000270a)
4–23    int32   big         Unused; five int32
24–27   int32   big         File length (in 16-bit words, including the header)
28–31   int32   little      Version
32–35   int32   little      Shape type (see reference below)
36–67   double  little      Minimum bounding rectangle (MBR) of all shapes contained within the dataset; four doubles in the following order: min X, min Y, max X, max Y
68–83   double  little      Range of Z; two doubles in the following order: min Z, max Z
84–99   double  little      Range of M; two doubles in the following order: min M, max M
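The header layout above can be read with Python's standard struct module. The following is a minimal sketch, not a reference implementation; the names ShpHeader and parse_shp_header are hypothetical and chosen only for illustration.

```python
import struct
from typing import NamedTuple, Tuple


class ShpHeader(NamedTuple):
    """Hypothetical container for the decoded 100-byte header."""
    file_code: int
    file_length_bytes: int
    version: int
    shape_type: int
    bbox: Tuple[float, ...]     # min X, min Y, max X, max Y
    z_range: Tuple[float, ...]  # min Z, max Z
    m_range: Tuple[float, ...]  # min M, max M


def parse_shp_header(buf: bytes) -> ShpHeader:
    if len(buf) < 100:
        raise ValueError("a shapefile header is 100 bytes long")
    # Bytes 0-27: seven big-endian int32 values
    # (file code, five unused fields, file length in 16-bit words).
    file_code, *_unused, length_words = struct.unpack(">7i", buf[:28])
    if file_code != 0x0000270A:
        raise ValueError("unexpected file code")
    # Bytes 28-99: two little-endian int32 values, then eight doubles.
    version, shape_type, *d = struct.unpack("<2i8d", buf[28:100])
    return ShpHeader(file_code, length_words * 2, version, shape_type,
                     tuple(d[0:4]), tuple(d[4:6]), tuple(d[6:8]))
```

Note that the two byte orders are handled by two separate unpack calls, one with the ">" prefix and one with "<".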

Shapefile record headers

The file then contains any number of variable-length records. Each record is prefixed with a record header of 8 bytes:

Bytes   Type    Endianness  Usage
0–3     int32   big         Record number (1-based)
4–7     int32   big         Record length (in 16-bit words)

Shapefile records

Following the record header is the actual record:

Bytes   Type    Endianness  Usage
0–3     int32   little      Shape type (see reference below)
4–      -       -           Shape content
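As an illustration of how the record header and record fit together, the sketch below walks the records of a .shp file held in memory. It assumes that the record length counts only the record contents and not the 8-byte record header, as in the Esri technical description; iter_records is a hypothetical helper name.

```python
import struct
from typing import Iterator, Tuple

FILE_HEADER_SIZE = 100
RECORD_HEADER_SIZE = 8


def iter_records(shp: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (record number, shape type, shape content) for each record."""
    pos = FILE_HEADER_SIZE
    while pos + RECORD_HEADER_SIZE <= len(shp):
        # Record header: two big-endian int32 values.
        rec_num, length_words = struct.unpack(">2i", shp[pos:pos + 8])
        start = pos + RECORD_HEADER_SIZE
        end = start + length_words * 2  # length is given in 16-bit words
        if length_words < 2 or end > len(shp):
            raise ValueError(f"record {rec_num} is truncated or malformed")
        # Record contents start with a little-endian int32 shape type.
        (shape_type,) = struct.unpack("<i", shp[start:start + 4])
        yield rec_num, shape_type, shp[start + 4:end]
        pos = end
```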

The variable-length record contents depend on the shape type, which must be either the shape type given in the file header or Null. The following are the possible shape types:

Value  Shape type    Fields
0      Null shape    None
1      Point         X, Y
3      Polyline      MBR, Number of parts, Number of points, Parts, Points
5      Polygon       MBR, Number of parts, Number of points, Parts, Points
8      MultiPoint    MBR, Number of points, Points
11     PointZ        X, Y, Z; Optional: M
13     PolylineZ     Mandatory: MBR, Number of parts, Number of points, Parts, Points, Z range, Z array; Optional: M range, M array
15     PolygonZ      Mandatory: MBR, Number of parts, Number of points, Parts, Points, Z range, Z array; Optional: M range, M array
18     MultiPointZ   Mandatory: MBR, Number of points, Points, Z range, Z array; Optional: M range, M array
21     PointM        X, Y, M
23     PolylineM     Mandatory: MBR, Number of parts, Number of points, Parts, Points; Optional: M range, M array
25     PolygonM      Mandatory: MBR, Number of parts, Number of points, Parts, Points; Optional: M range, M array
28     MultiPointM   Mandatory: MBR, Number of points, Points; Optional: M range, M array
31     MultiPatch    Mandatory: MBR, Number of parts, Number of points, Parts, Part types, Points, Z range, Z array; Optional: M range, M array
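To make the field lists more concrete, the following sketch decodes the content of a Point record and of a Polyline or Polygon record (the bytes that follow the shape type). The field widths used here (four doubles for the MBR, int32 counts and part indices, two doubles per point, all little-endian) are taken from the Esri technical description rather than from the table itself, and the function names are hypothetical.

```python
import struct
from typing import List, Tuple

Point = Tuple[float, float]


def decode_point(content: bytes) -> Point:
    """Point: X and Y as two little-endian doubles."""
    x, y = struct.unpack("<2d", content[:16])
    return (x, y)


def decode_poly(content: bytes) -> Tuple[Tuple[float, ...], List[List[Point]]]:
    """Polyline or Polygon: MBR, part/point counts, part indices, points."""
    mbr = struct.unpack("<4d", content[0:32])
    num_parts, num_points = struct.unpack("<2i", content[32:40])
    parts_end = 40 + 4 * num_parts
    # Each entry in Parts is the index of the first point of that part.
    parts = struct.unpack(f"<{num_parts}i", content[40:parts_end])
    flat = struct.unpack(f"<{2 * num_points}d",
                         content[parts_end:parts_end + 16 * num_points])
    points = list(zip(flat[0::2], flat[1::2]))
    # Slice the flat point list into one list per part.
    bounds = list(parts) + [num_points]
    return mbr, [points[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
```

The Z and M variants append further arrays after the points, which this sketch does not read.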

Shapefile shape index format (.shx)

The index file contains a positional index of the feature geometry. It starts with the same 100-byte header as the .shp file, followed by any number of 8-byte fixed-length records, each consisting of the following two fields:

Bytes   Type    Endianness  Usage
0–3     int32   big         Record offset (in 16-bit words)
4–7     int32   big         Record length (in 16-bit words)

Using this index, it is possible to seek backwards in the shapefile by, first, seeking backwards in the shape index (which is possible because it uses fixed-length records), then reading the record offset, and using that offset to seek to the correct position in the .shp file. It is also possible to seek forwards an arbitrary number of records using the same method.
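The seek procedure described above can be sketched as follows, with both files already loaded into memory. It assumes that the record offset points at the record header in the .shp file, measured from the start of that file, and that the record length covers only the record contents; read_record_via_index is a hypothetical name.

```python
import struct

FILE_HEADER_SIZE = 100
INDEX_ENTRY_SIZE = 8
RECORD_HEADER_SIZE = 8


def read_record_via_index(shx: bytes, shp: bytes, i: int) -> bytes:
    """Return the contents of the i-th record (0-based) using the .shx index."""
    entry = FILE_HEADER_SIZE + INDEX_ENTRY_SIZE * i
    if i < 0 or entry + INDEX_ENTRY_SIZE > len(shx):
        raise IndexError("record index out of range")
    # Fixed-length index entry: offset and length as big-endian int32,
    # both expressed in 16-bit words.
    offset_words, length_words = struct.unpack(">2i", shx[entry:entry + 8])
    start = offset_words * 2 + RECORD_HEADER_SIZE  # skip the record header
    return shp[start:start + length_words * 2]
```

Because every index entry has the same size, the position of entry i is computed directly, so stepping backwards or forwards costs the same.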

It is possible to generate the complete index file given a lone .shp file. However, since a shapefile is supposed to always contain an index, doing so counts as repairing a corrupt file. [2]
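A rough sketch of such a repair is shown below: it scans the record headers of the .shp file and emits one index entry per record. It assumes the index header is a copy of the .shp header with the file length field replaced by the length of the index file; build_shx is a hypothetical name, and real tools may perform additional validation.

```python
import struct

FILE_HEADER_SIZE = 100
RECORD_HEADER_SIZE = 8


def build_shx(shp: bytes) -> bytes:
    """Rebuild the contents of a .shx index from the bytes of a .shp file."""
    entries = []
    pos = FILE_HEADER_SIZE
    while pos + RECORD_HEADER_SIZE <= len(shp):
        _rec_num, length_words = struct.unpack(">2i", shp[pos:pos + 8])
        if length_words < 0:
            raise ValueError("negative record length")
        # Offsets and lengths are stored in 16-bit words, big-endian.
        entries.append(struct.pack(">2i", pos // 2, length_words))
        pos += RECORD_HEADER_SIZE + length_words * 2
    header = bytearray(shp[:FILE_HEADER_SIZE])
    # Bytes 24-27 hold the file length; here it is the index file's length.
    index_words = (FILE_HEADER_SIZE + 8 * len(entries)) // 2
    struct.pack_into(">i", header, 24, index_words)
    return bytes(header) + b"".join(entries)
```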

Shapefile attribute format (.dbf)

This file stores the attributes for each shape; it uses the dBase IV format. The format is public knowledge, and has been implemented in many dBase clones known as xBase. The open-source shapefile C library, for example, calls its format "xBase" even though it's plain dBase IV. [3]

The names and values of attributes are not standardized, and will be different depending on the source of the shapefile.

Shapefile spatial index format (.sbn)

This is a binary spatial index file, which is used only by Esri software. The format is not documented by Esri. However, it has been reverse-engineered and documented by the open-source community. The 100-byte header is similar to the one in .shp. [4] It is not currently implemented by other vendors. The .sbn file is not strictly necessary, since the .shp file contains all of the information necessary to successfully parse the spatial data.

Limitations

Topology and the shapefile format

The shapefile format does not have the ability to store topological information. The ESRI ArcInfo coverages and personal/file/enterprise geodatabases do have the ability to store feature topology.

Spatial representation

The edges of a polyline or polygon are composed of points. The spacing of the points implicitly determines the scale at which the feature is useful visually. Exceeding that scale results in jagged representation. Additional points would be required to achieve smooth shapes at greater scales. For features better represented by smooth curves, the polygon representation requires much more data storage than, for example, splines, which can capture smoothly varying shapes efficiently. None of the shapefile format types supports splines.

Data storage

The size of both .shp and .dbf component files cannot exceed 2 GB (2^31 bytes), which allows around 70 million point features at best. [5] The maximum number of features for other geometry types varies depending on the number of vertices used.
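The point-feature ceiling can be estimated from the .shp layout alone. The short calculation below is a back-of-the-envelope sketch: it assumes 2D Point records (an 8-byte record header, a 4-byte shape type, and two 8-byte doubles) and ignores any limit imposed by the .dbf file.

```python
MAX_FILE_BYTES = 2 ** 31
FILE_HEADER_BYTES = 100
RECORD_HEADER_BYTES = 8
POINT_CONTENT_BYTES = 4 + 2 * 8  # shape type + X + Y

bytes_per_point = RECORD_HEADER_BYTES + POINT_CONTENT_BYTES
max_points = (MAX_FILE_BYTES - FILE_HEADER_BYTES) // bytes_per_point
print(f"{max_points:,}")  # prints 76,695,841
```

The result is of the same order as the figure quoted above; wide attribute rows in the .dbf file can lower the practical limit.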

The attribute database format for the .dbf component file is based on an older dBase standard. This database format inherently has a number of limitations. [5]

Mixing shape types

Because the shape type precedes each geometry record, a shapefile is technically capable of storing a mixture of different shape types. However, the specification states, "All the non-Null shapes in a shapefile are required to be of the same shape type." Therefore, this ability to mix shape types must be limited to interspersing null shapes with the single shape type declared in the file's header. A shapefile must not contain both polyline and polygon data. For example, the descriptions for a well (point), a river (polyline), and a lake (polygon) would be stored in three separate datasets.
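A reader could check this rule with a scan like the one sketched below, which reports every record whose shape type is neither null nor the type declared in the file header. The function name find_mixed_records is hypothetical, and the sketch assumes that record lengths count only the record contents.

```python
import struct
from typing import List

NULL_SHAPE = 0


def find_mixed_records(shp: bytes) -> List[int]:
    """Return the record numbers that violate the single-shape-type rule."""
    # The declared shape type sits at bytes 32-35 of the file header.
    (declared,) = struct.unpack("<i", shp[32:36])
    offenders = []
    pos = 100
    while pos + 12 <= len(shp):
        rec_num, length_words = struct.unpack(">2i", shp[pos:pos + 8])
        if length_words < 0:
            raise ValueError("negative record length")
        (shape_type,) = struct.unpack("<i", shp[pos + 8:pos + 12])
        if shape_type not in (NULL_SHAPE, declared):
            offenders.append(rec_num)
        pos += 8 + length_words * 2
    return offenders
```

An empty result means the file only mixes null shapes with the declared type, which the specification permits.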

References

  1. ESRI (July 1998). "ESRI Shapefile Technical Description" (PDF). Retrieved 4 July 2007.
  2. Rollason, Ed. "qgis - Creating missing .shx file?". Geographic Information Systems Stack Exchange.
  3. "Shapefile C Library V1.2".
  4. "SBN Format" (PDF). 4 October 2011. Archived from the original (PDF) on 13 August 2016. Retrieved 21 June 2023.
  5. "ArcGIS Desktop 9.3 Help – Geoprocessing considerations for shapefile output". Esri. 24 April 2009.