| Developer(s) | LocationTech, Azavea |
|---|---|
| Initial release | 12 May 2012 |
| Stable release | 3.6.0 / April 30, 2021 [1] |
| Repository | |
| Written in | Scala |
| Operating system | Linux |
| Type | Big Data, Map algebra |
| License | Apache License 2.0 |
| Website | geotrellis.io |
GeoTrellis is an open-source geographic data processing library designed to work with large geospatial raster data sets. It is written in Scala and released under the Apache License 2.0.
GeoTrellis' core competency is raster data processing: enabling distributed processing of large geospatial raster data sets using the techniques of map algebra. In addition to raster operations, GeoTrellis includes some support for vector and point cloud data.
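As an illustration of local map algebra on GeoTrellis tiles, the sketch below combines two small in-memory rasters cell by cell. It assumes GeoTrellis 3.x's core raster API (`IntArrayTile`, `Tile.combine`, `Tile.map`); the tile contents and names are illustrative.

```scala
import geotrellis.raster._

object LocalMapAlgebraSketch {
  def main(args: Array[String]): Unit = {
    // Two 3x3 single-band rasters held in memory as GeoTrellis tiles.
    val elevation = IntArrayTile(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 3, 3)
    val offset    = IntArrayTile(Array.fill(9)(100), 3, 3)

    // Local (cell-by-cell) binary operation: add corresponding cells.
    val sum: Tile = elevation.combine(offset)(_ + _)

    // Local unary operation: rescale every cell.
    val doubled: Tile = elevation.map(_ * 2)

    println(sum.toArray().mkString(", "))     // 101, 102, ..., 109
    println(doubled.toArray().mkString(", ")) // 2, 4, ..., 18
  }
}
```

In a distributed setting the same local operations are applied tile by tile across a cluster, which is what makes map algebra a natural fit for data-parallel processing.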
GeoTrellis leverages Apache Spark for distributed processing. Distributed processing relies on indexing large datasets based on a multi-dimensional space-filling curve (SFC). SFCs translate a multi-dimensional index into a one-dimensional index while maintaining geospatial locality, which allows large datasets to be read and written efficiently in parallel across multiple computers.
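The Z-order (Morton) curve is one such SFC: interleaving the bits of a tile's column and row coordinates produces a single index in which spatially adjacent tiles tend to receive nearby values. The sketch below illustrates the bit-interleaving technique itself; it is not GeoTrellis's own indexing API.

```scala
// A minimal sketch of Z-order (Morton) encoding, the kind of space-filling
// curve used to map 2-D tile coordinates to a 1-D index while preserving
// spatial locality.
object ZOrderSketch {
  // Interleave the low 16 bits of x and y into a single Morton code.
  def encode(x: Int, y: Int): Long = {
    var code = 0L
    for (i <- 0 until 16) {
      code |= ((x >> i) & 1L) << (2 * i)      // x bit goes to even position
      code |= ((y >> i) & 1L) << (2 * i + 1)  // y bit goes to odd position
    }
    code
  }

  def main(args: Array[String]): Unit = {
    // Neighboring tiles get neighboring indices, so a range scan over the
    // 1-D index reads a spatially compact block of tiles.
    println(encode(2, 3)) // 14
    println(encode(3, 3)) // 15
  }
}
```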
Python bindings have been developed for GeoTrellis as a sub-project called GeoPySpark that enables Python developers to access and use the GeoTrellis library.
GeoTrellis started as a research project at Azavea, a geospatial software company based in Philadelphia. A precursor software component, DecisionTree, was developed beginning in 2006 with support from a Small Business Innovation Research grant from the U.S. Department of Agriculture. In 2009, with financial support from the William Penn Foundation and Stroud Water Research Center, Azavea embarked on early development of GeoTrellis.
GeoTrellis was released as an open source project in 2011 [2] with the goal of supporting fast processing of geospatial raster data at scale.
GeoTrellis initially supported distributed computation through Akka, a Scala framework for building concurrent and distributed applications. The need to support additional use cases and features, such as caching and sharding datasets across a storage cluster, led to a search for a new distribution framework. GeoTrellis moved to Apache Spark as its distribution engine in 2014 [3] in order to leverage the management, scheduling, and other features of the Spark framework. One key use case that drove this phase of development was the need to efficiently process large spatiotemporal datasets, such as those used in earth science applications like climate change research. [4] The move to Apache Spark enabled efficient support for the large climate change forecast datasets published by the Intergovernmental Panel on Climate Change (IPCC).
GeoTrellis was submitted to the Eclipse Foundation's LocationTech [5] working group in 2013 and graduated from incubation with a 1.0 release in December 2016. [6]
GeoTrellis has been used in a number of geospatial domains including: satellite and aerial image processing, forest growth simulation, agricultural yield predictions, planning, digital humanities, government infrastructure investment, and machine learning to support crime risk forecasting. It is currently integrated into other open source software projects including: Raster Foundry, [7] Raster Frames, [8] and GeoPySpark. [9]
A GIS file format is a standard for encoding geographical information into a computer file, as a specialized type of file format for use in geographic information systems (GIS) and other geospatial applications. Since the 1970s, dozens of formats have been created based on various data models for various purposes. They have been created by government mapping agencies, GIS software vendors, standards bodies such as the Open Geospatial Consortium, informal user communities, and even individual developers.
Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
GeoTIFF is a public domain metadata standard which allows georeferencing information to be embedded within a TIFF file. The potential additional information includes map projection, coordinate systems, ellipsoids, datums, and everything else necessary to establish the exact spatial reference for the file. The GeoTIFF format is fully compliant with TIFF 6.0, so software incapable of reading and interpreting the specialized metadata will still be able to open a GeoTIFF format file.
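GeoTIFF is one of the formats GeoTrellis reads directly. The sketch below assumes GeoTrellis 3.x's `GeoTiffReader` and a hypothetical file path; the CRS and extent it prints come from the georeferencing metadata embedded in the TIFF, so no sidecar files are needed.

```scala
import geotrellis.raster.io.geotiff.reader.GeoTiffReader

object ReadGeoTiffSketch {
  def main(args: Array[String]): Unit = {
    // Read a single-band GeoTIFF from disk ("/tmp/elevation.tif" is a
    // hypothetical path for illustration).
    val geoTiff = GeoTiffReader.readSingleband("/tmp/elevation.tif")

    println(geoTiff.crs)    // coordinate reference system from the GeoTIFF keys
    println(geoTiff.extent) // georeferenced bounding box
    println(geoTiff.tile.cols -> geoTiff.tile.rows) // raster dimensions
  }
}
```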
A GIS software program is a computer program to support the use of a geographic information system, providing the ability to create, store, manage, query, analyze, and visualize geographic data, that is, data representing phenomena for which location is important. The GIS software industry encompasses a broad range of commercial and open-source products that provide some or all of these capabilities within various information technology architectures.
The Open Source Geospatial Foundation (OSGeo) is a non-profit non-governmental organization whose mission is to support and promote the collaborative development of open geospatial technologies and data. The foundation was formed in February 2006 to provide financial, organizational, and legal support to the broader free and open-source geospatial community. It also serves as an independent legal entity to which community members can contribute code, funding, and other resources.
The Geospatial Data Abstraction Library (GDAL) is a computer software library for reading and writing raster and vector geospatial data formats, and is released under the permissive X/MIT style free software license by the Open Source Geospatial Foundation. As a library, it presents a single abstract data model to the calling application for all supported formats. It may also be built with a variety of useful command line interface utilities for data translation and processing. Projections and transformations are supported by the PROJ library.
ParaView is an open-source multiple-platform application for interactive, scientific visualization. It has a client–server architecture to facilitate remote visualization of datasets, and generates level of detail (LOD) models to maintain interactive frame rates for large datasets. It is an application built on top of the Visualization Toolkit (VTK) libraries. ParaView is an application designed for data parallelism on shared-memory or distributed-memory multicomputers and clusters. It can also be run as a single-computer application.
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR). They are also the chief source of netCDF software, standards development, updates, etc. The format is an open standard. NetCDF Classic and 64-bit Offset Format are an international standard of the Open Geospatial Consortium.
PyCharm is an integrated development environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, integration with version control systems, and supports web development with Django. PyCharm is developed by the Czech company JetBrains.
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly through contributions from developers at MapR, Drill is inspired by Google's Dremel system. Tom Shiran is the founder of the Apache Drill project, which was designated an Apache Software Foundation top-level project in December 2016.
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, as well as a play on the phrase "Python data analysis". Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
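To make Spark's cluster-programming model concrete, here is a minimal self-contained Spark job in Scala, the language GeoTrellis itself uses. The application name and local master URL are illustrative choices; on a real cluster the master URL would point at the cluster manager.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode context using all available cores.
    val conf = new SparkConf().setAppName("sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Implicit data parallelism: the map and reduce run across partitions,
    // and Spark recomputes lost partitions for fault tolerance.
    val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)
    println(squares.reduce(_ + _))

    sc.stop()
  }
}
```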
Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.
Apache SystemDS is an open-source machine learning (ML) system for the end-to-end data science lifecycle.
Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
Dask is an open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.