OpenRefine

Developer(s)	Freebase, then Google, now open source community
Initial release	November 10, 2010
Stable release	3.8.7 / 21 November 2024
Repository	github.com/OpenRefine/OpenRefine
Written in	Java
Platform	Microsoft Windows, Linux, macOS
Available in	English, Italian, Chinese, Japanese, French, German
Type	Data management; Data visualization
License	BSD License
Website	openrefine.org

Last updated November 30, 2024

OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling.^[3] It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, but it behaves more like a database.

Unlike in spreadsheets, most operations in OpenRefine are applied to all visible rows at once: for example, transforming every cell in one column,^[4] or creating a new column based on existing data. Actions performed on a dataset are stored in the project and can be 'replayed' on other datasets. Formulas are not stored in cells; they are used to transform the data, and each transformation is applied only once.^[5] Formula expressions can be written in General Refine Expression Language (GREL),^[6] in Jython (an implementation of Python that runs on the Java platform), and in Clojure.^[7]
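The column-wide, replayable model described above can be illustrated with a minimal Python sketch. This is not OpenRefine's implementation: the names transform_column, add_column and replay are hypothetical, and rows are modelled as plain dictionaries. The lambdas stand in for expressions; in OpenRefine's Jython mode, a comparable expression body would look roughly like `return value.strip().title()`.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Operation = Callable[[List[Row]], List[Row]]


def transform_column(column: str, expression: Callable[[str], str]) -> Operation:
    """Build an operation that rewrites every cell in one column (hypothetical helper)."""
    def apply(rows: List[Row]) -> List[Row]:
        return [{**row, column: expression(row[column])} for row in rows]
    return apply


def add_column(new_column: str, source: str, expression: Callable[[str], str]) -> Operation:
    """Build an operation that derives a new column from an existing one (hypothetical helper)."""
    def apply(rows: List[Row]) -> List[Row]:
        return [{**row, new_column: expression(row[source])} for row in rows]
    return apply


def replay(history: List[Operation], rows: List[Row]) -> List[Row]:
    """Apply a stored list of operations, in order, to a dataset."""
    for operation in history:
        rows = operation(rows)
    return rows


# The history holds the operations themselves, not formulas inside cells,
# so the same history can be replayed on a different dataset.
history = [
    transform_column("city", lambda value: value.strip().title()),
    add_column("city_length", "city", lambda value: str(len(value))),
]

first = replay(history, [{"city": "  new york "}, {"city": "PARIS"}])
second = replay(history, [{"city": "berlin  "}])
```

Because each operation runs once and writes plain values back into the rows, the result contains no live formulas, which mirrors the difference from spreadsheets noted above.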

The program operates as a local web app: it starts a web server and opens the default browser to 127.0.0.1:3333.
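As a rough illustration of the local web app arrangement, the following Python sketch checks whether anything is accepting connections on the default address. The helper names is_listening and local_url are hypothetical, and the check only confirms that some server is listening on the port, not that it is OpenRefine.

```python
import socket


def local_url(host: str = "127.0.0.1", port: int = 3333) -> str:
    """Return the address a browser would open for a locally running instance."""
    return f"http://{host}:{port}/"


def is_listening(host: str = "127.0.0.1", port: int = 3333, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False
```

If is_listening() returns True, passing local_url() to the standard library's webbrowser.open would open the interface in the default browser.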

Uses

Cleaning messy data: for example, a text file containing semi-structured data can be edited using transformations, facets and clustering until the data is cleanly structured.^[8]
Transforming data: converting values to other formats, normalizing and denormalizing.
Parsing data from web sites: OpenRefine has a URL fetch feature and includes the jsoup HTML parser and DOM engine.^[9]
Adding data to a dataset by fetching it from web services (for example, services that return JSON).^[10] This can be used, for instance, to geocode addresses to geographic coordinates.^[11]
Aligning to Wikidata (formerly Freebase^[12]): this involves reconciliation, that is, mapping string values in cells to entities in Wikidata.^[13]
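The clustering mentioned under cleaning messy data can be sketched as a key-collision method: each value is reduced to a normalized key, and values sharing a key are grouped as likely variants of the same thing. The Python below is a simplified sketch loosely modelled on a "fingerprint" style key, not OpenRefine's actual code; the function names and sample values are illustrative assumptions.

```python
import string
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (simplified, illustrative)."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, remove duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values: Iterable[str]) -> List[List[str]]:
    """Group values whose fingerprints collide; singletons are omitted."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(members) > 1]


names = ["Tom Cruise", "Cruise, Tom", "tom  cruise ", "Zoë Saldaña", "Zoe Saldana", "Bob Dylan"]
clusters = cluster(names)
```

Here the three spellings of "Tom Cruise" collapse to the key "cruise tom" and form one cluster, the two accented and unaccented spellings form another, and "Bob Dylan" is left alone. A person would normally review each cluster before merging its values.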

Supported formats

Import is supported from the following formats:^[14]

TSV, CSV
Text file with custom separators or columns split by fixed width
XML
RDF triples (RDF/XML and Notation3 serialization formats)
JSON
Google Spreadsheets^[15]

If input data is in a non-standard text format, it can be imported as whole lines, without splitting into columns, and the columns can then be extracted later with OpenRefine's tools. Archived and compressed files are supported (.zip, .tar.gz, .tgz, .tar.bz2, .gz, or .bz2), and OpenRefine can download input files from a URL. To use web pages as input, it is possible to import a list of URLs and then invoke a URL fetch function.
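The "import whole lines first, extract columns later" approach can be sketched in Python for the fixed-width case. This is only an illustration of the idea: the split_fixed_width helper, the field layout and the sample records are all hypothetical, and in OpenRefine itself the extraction would be done with its column-splitting tools.

```python
from typing import Dict, List, Tuple


def split_fixed_width(line: str, fields: List[Tuple[str, int]]) -> Dict[str, str]:
    """Cut one raw line into named columns using (name, width) pairs."""
    row: Dict[str, str] = {}
    start = 0
    for name, width in fields:
        # Slice out this field and trim the padding around the value.
        row[name] = line[start:start + width].strip()
        start += width
    return row


# Step 1: the data is held as whole lines, with no column structure yet.
raw_lines = [
    "1001" + "Alice".ljust(10) + "Berlin".ljust(10),
    "1002" + "Bob".ljust(10) + "Paris".ljust(10),
]

# Step 2: columns are extracted afterwards from an assumed layout.
layout = [("id", 4), ("name", 10), ("city", 10)]
rows = [split_fixed_width(line, layout) for line in raw_lines]
```

Keeping the raw lines intact until the layout is understood means a wrong guess about column widths can be corrected without re-importing the file.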

Export is supported in the following formats:^[16]

TSV
CSV
Microsoft Excel
HTML table
Google Spreadsheets
Templating exporter: it is possible to define a custom template for outputting data, for example as a MediaWiki table.

Whole OpenRefine projects in native format can be exported as a .tar.gz archive.
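To give a sense of what the templating exporter's MediaWiki example amounts to, the following Python sketch renders rows as MediaWiki table markup. It does not use OpenRefine's template syntax; the to_mediawiki_table function and the sample rows are hypothetical, and cell values are assumed not to contain the "|" character, which would need escaping.

```python
from typing import Dict, List


def to_mediawiki_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Render rows as a MediaWiki table with one header row."""
    lines = ['{| class="wikitable"']
    # Header cells are introduced by "!" and separated by "!!".
    lines.append("! " + " !! ".join(columns))
    for row in rows:
        # "|-" starts a new table row; data cells are separated by "||".
        lines.append("|-")
        lines.append("| " + " || ".join(row.get(col, "") for col in columns))
    lines.append("|}")
    return "\n".join(lines)


sample = [
    {"name": "Alice", "city": "Berlin"},
    {"name": "Bob", "city": "Paris"},
]
markup = to_mediawiki_table(sample, ["name", "city"])
```

A template-based exporter follows the same pattern: a fixed prefix, one rendered fragment per row, a separator between rows, and a fixed suffix.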

Development

OpenRefine started life as Freebase Gridworks, developed by Metaweb, and has been available as open source since January 2010.^[17] On 16 July 2010, Google acquired Metaweb,^[18] the creators of Freebase, and on 10 November 2010 renamed Freebase Gridworks to Google Refine, releasing version 2.0.^[19] On 2 October 2012, original author David Huynh announced that Google would soon stop its active support of Google Refine.^[20]^[21]^[22] Since then, the codebase has been in transition to an open source project named OpenRefine.^[23]

References

[1] "Release 3.8.7". 21 November 2024. Retrieved 24 November 2024.
[2] "OpenRefine/OpenRefine - GitHub". GitHub. Retrieved 25 June 2017.
[3] "openrefine.github.com". openrefine.org.
[4] "Editing by transforming: Cell Editing wiki page from Refine documentation". Retrieved 18 April 2012.
[5] "Comparison with spreadsheet software: Cell Editing wiki page in Refine documentation". Retrieved 18 April 2012.
[6] "General Refine expression language". OpenRefine/OpenRefine Wiki, GitHub. Github.com. 3 April 2013. Retrieved 16 August 2013.
[7] "Expressions: Refine documentation". Retrieved 18 April 2012.
[8] "Screencast: Google Refine 2.0 - Introduction (1 of 3) - editing government data". YouTube. 19 July 2011. Retrieved 18 April 2012.
[9] "Stripping HTML: Refine documentation wiki page". Retrieved 18 April 2012.
[10] "FetchingURLsFromWebServices wiki page: Refine documentation". Retrieved 18 April 2012.
[11] "Screencast: Google Refine 2.0 - Data Augmentation (3 of 3) - using Openstreetmap Nominatim for geocoding and Freebase for augmentation". YouTube. 19 July 2011. Retrieved 18 April 2012.
[12] "Schema Alignment: Refine documentation wiki page". Retrieved 18 April 2012.
[13] "OpenRefine documentation: Reconciliation". GitHub. Retrieved 12 March 2017.
[14] "Importers: Refine documentation wiki page". Retrieved 18 April 2012.
[15] "Changelog for 2.5". Retrieved 18 April 2012.
[16] "Exporting: Refine documentation wiki page". Retrieved 18 April 2012.
[17] "Google Code Archive - Long-term storage for Google Code Project Hosting". code.google.com.
[18] "Google Official Blog: Deeper understanding with Metaweb". Retrieved 18 April 2012.
[19] "Google Opensource blog: Announcing Google Refine 2.0, a power tool for data wranglers". Retrieved 18 April 2012.
[20] "Google Groups". groups.google.com.
[21] "From Freebase Gridworks to Google Refine and now OpenRefine".
[22] "OpenRefine". Archived 25 September 2016 at the Wayback Machine. OpenRefine. Retrieved 16 August 2013.
[23] "google-refine - Google Refine, a power tool for working with messy data (formerly Freebase Gridworks) - Google Project Hosting". Code.google.com. Retrieved 16 August 2013.

