Flat-file database

Example of a flat file model

A flat-file database is a database stored in a file called a flat file. The file is simple: records follow a uniform format, and there are no structures for indexing or for recognizing relationships between records. A flat file can be a plain text file (e.g. CSV, TXT, or TSV) or a binary file. Relationships can be inferred from the data in the database, but the database format itself does not make those relationships explicit.

The term has generally implied a small database, but very large databases can also be flat.

Overview

Plain text files usually contain one record per line.[2]

Examples of flat files include /etc/passwd and /etc/group on Unix-like operating systems. Another example of a flat file is a name-and-address list with the fields Name, Address, and Phone Number.
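As a rough illustration of the one-record-per-line layout, the sketch below splits lines written in the colon-separated, seven-field format of /etc/passwd. The sample accounts and the names `SAMPLE`, `FIELDS`, and `parse_passwd` are made up for this example; it is a minimal reader, not how any particular system library parses the file.

```python
# Sample text in the colon-separated layout of /etc/passwd (made-up accounts).
SAMPLE = """\
root:x:0:0:root:/root:/bin/sh
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
"""

# The seven fields of a passwd record, in order.
FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")


def parse_passwd(text):
    """Return one dict per non-empty line, keyed by field name."""
    records = []
    for line in text.splitlines():
        if not line:
            continue
        records.append(dict(zip(FIELDS, line.split(":"))))
    return records


records = parse_passwd(SAMPLE)
print([record["name"] for record in records])
```

Nothing in the file links one record to another; any connection (for example, two accounts sharing a group ID) has to be worked out by the program reading it.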

A list of names, addresses, and phone numbers written by hand on a sheet of paper is a flat-file database. This can also be done with any typewriter or word processor. A spreadsheet or text editor program may be used to implement a flat-file database, which may then be printed or used online for improved search capabilities.

Flat files are typically either delimiter-separated, such as comma-separated values (CSV), or fixed-width, where each column has a fixed width.

Delimiter-separated values

In delimiter-separated values files, fields are separated by a character or string called the delimiter. Common variants are CSV (delimiter is ,), tab-separated values (TSV) (delimiter is the tab character), space-separated values, and vertical-bar-separated values (delimiter is |).
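The variants differ only in the delimiter, which a short sketch can make concrete. It uses Python's standard `csv` module; the helper name `read_dsv` and the sample rows are assumptions made for this illustration.

```python
import csv
import io


def read_dsv(text, delimiter):
    """Split each line of `text` into fields on `delimiter`."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same two records encoded with three different delimiters.
csv_rows = read_dsv("id,name,team\n1,Amy,Blues\n", ",")
tsv_rows = read_dsv("id\tname\tteam\n1\tAmy\tBlues\n", "\t")
bar_rows = read_dsv("id|name|team\n1|Amy|Blues\n", "|")

print(csv_rows == tsv_rows == bar_rows)
```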

If the delimiter is allowed inside a field, there needs to be a way to distinguish delimiter characters or strings that are meant literally from those that separate fields. For example, consider the sentence "If I have to, I'll do it myself.". To encode it in CSV, there needs to be a way to prevent the comma from splitting the field. Several strategies to prevent delimiter collision exist.
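Quoting is one common strategy, and the sketch below shows it with Python's standard `csv` module, which by default wraps a field in double quotes when it contains the delimiter. The variable names are chosen for this example only.

```python
import csv
import io

sentence = "If I have to, I'll do it myself."

# Naive splitting breaks the sentence at its own comma.
naive = f"1,{sentence}".split(",")

# csv.writer quotes fields that contain the delimiter (its default behaviour).
buffer = io.StringIO()
csv.writer(buffer).writerow([1, sentence])
encoded = buffer.getvalue()

# csv.reader recognises the quotes and restores the original field.
decoded = next(csv.reader(io.StringIO(encoded)))

print(len(naive), decoded)
```

Other strategies, such as escaping the delimiter with a backslash or picking a delimiter that cannot occur in the data, trade off differently; this example does not cover them.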

Fixed-width formats

With fixed-width formats, each column has a fixed length, and fields are padded with spaces as needed. The fixed lengths can be predefined and known ahead of time (i.e. stated in the format's specification), or parsed from a header.

With predefined lengths, fields are limited to a maximum length. The need for longer fields may appear sometime after the format is defined. Possible workarounds include abbreviating phrases, replacing values with links (e.g. a URI pointing to the value), and splitting a file into multiple files.
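A minimal sketch of a fixed-width format with predefined lengths follows. The layout (`id` in 4 characters, `name` in 10, `team` in 6) and the function names are hypothetical, chosen only to illustrate padding, slicing at fixed offsets, and the maximum-length limit.

```python
# Hypothetical layout: (field name, width in characters).
LAYOUT = [("id", 4), ("name", 10), ("team", 6)]


def format_record(record):
    """Pad each value with spaces to its column width; reject values that do not fit."""
    parts = []
    for name, width in LAYOUT:
        value = str(record[name])
        if len(value) > width:
            raise ValueError(f"{name!r} exceeds {width} characters")
        parts.append(value.ljust(width))
    return "".join(parts)


def parse_record(line):
    """Slice the line at fixed offsets and strip the trailing padding."""
    record, offset = {}, 0
    for name, width in LAYOUT:
        record[name] = line[offset:offset + width].rstrip()
        offset += width
    return record


line = format_record({"id": 1, "name": "Amy", "team": "Blues"})
print(repr(line), parse_record(line))
```

A name longer than ten characters is rejected here rather than silently truncated; a real format would have to pick one of the workarounds described above.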

With delimiter-separated formats, determining the field boundaries requires finding the delimiters, which incurs some computational overhead. This is not needed for fixed-width formats. However, fixed-width formats can lead to unnecessarily large file sizes if fields tend to be shorter than the lengths reserved for them.
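The trade-off can be seen in a small sketch. The reserved width of 20 characters and the helper names are assumptions made for this example, and character counts stand in for file size.

```python
names = ["Amy", "Bob", "Chuck"]
WIDTH = 20  # hypothetical width reserved for every field

fixed = "".join(name.ljust(WIDTH) for name in names)
delimited = ",".join(names)


def field_fixed(data, index):
    """Jump straight to the field by arithmetic on its offset."""
    start = index * WIDTH
    return data[start:start + WIDTH].rstrip()


def field_delimited(data, index):
    """Scan the data for delimiters to find the field boundaries."""
    return data.split(",")[index]


print(len(fixed), len(delimited))
print(field_fixed(fixed, 2), field_delimited(delimited, 2))
```

Here the fixed-width encoding takes 60 characters against 13 for the delimited one, while reaching the third field needs only an offset calculation instead of a scan.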

Declarative notation

Delimiters can be used alongside a notation stating the length of each field. For example, 5apple|9pineapple specifies the length (5 and 9) of each field. This is called declarative notation. It has low overhead and trivially avoids delimiter collisions, but it is brittle when edited manually and is rarely used.
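The notation can be sketched as a pair of functions. The names `encode` and `decode` are hypothetical, and the sketch assumes field values never begin with a digit, since the example notation puts nothing between the length and the value.

```python
DIGITS = "0123456789"


def encode(fields, delimiter="|"):
    """Prefix each field with its length, then join with the delimiter."""
    return delimiter.join(f"{len(field)}{field}" for field in fields)


def decode(text, delimiter="|"):
    """Read a length, take exactly that many characters, then expect a delimiter."""
    fields, pos = [], 0
    while pos < len(text):
        start = pos
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1
        if pos == start:
            raise ValueError(f"expected a length at position {start}")
        length = int(text[start:pos])
        field = text[pos:pos + length]
        if len(field) != length:
            raise ValueError("field is shorter than its declared length")
        fields.append(field)
        pos += length
        if pos < len(text):
            if text[pos] != delimiter:
                raise ValueError(f"expected {delimiter!r} at position {pos}")
            pos += 1
    return fields


print(decode("5apple|9pineapple"))
print(decode(encode(["a|b", "c"])))
```

Because the decoder counts characters instead of searching for the delimiter, a field such as `a|b` survives unchanged. The brittleness also shows: hand-editing `5apple` to `5apples` without updating the length makes `decode` raise an error.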

History

Herman Hollerith's work for the US Census Bureau, first exercised in the 1890 United States Census, involved data tabulated via hole punches in paper cards.[3] It is sometimes considered the first computerized flat-file database, as it included no cards indexing other cards or otherwise relating the individual cards to one another, save by their group membership.[citation needed]

In the 1980s, configurable flat-file database computer applications were popular on the IBM PC and the Macintosh. These programs were designed to make it easy for individuals to design and use their own databases, and were almost on par with word processors and spreadsheets in popularity.[citation needed] Examples of flat-file database software include early versions of FileMaker, the shareware PC-File, and the popular dBase.

Flat-file databases are ubiquitous because they are easy to write and edit, and suit myriad purposes in an uncomplicated way.

Modern implementations

Linear stores of NoSQL data, JSON data, primitive spreadsheets (perhaps comma-separated or tab-delimited), and text files can all be seen as flat-file databases because they lack integrated indexes, built-in references between data elements, and complex data types. Programs to manage collections of books or appointments and address books may use single-purpose flat-file databases, storing and retrieving information from flat files unadorned with indexes or pointing systems.

While a user can write a table of contents into a text file, the text file format itself does not include a concept of a table of contents. While a user may write "friends with Kathy" in the "Notes" section for John's contact information, this is interpreted by the user rather than by a built-in feature of the database. When a database system begins to recognize and codify relationships between records, it begins to drift away from being "flat," and once it has a detailed system for describing types and hierarchical relationships, it is too structured to be considered "flat."

Example database

The following example illustrates typical elements of a flat-file database. The data arrangement consists of a series of columns and rows organized into a tabular format. This specific example uses only one table.

The columns are: a numeric unique ID (first column), used to uniquely identify records; name (second column), a person's name; and team (third column), the name of an athletic team supported by the person.

Here is an example textual representation of the described data:

id    name    team
1     Amy     Blues
2     Bob     Reds
3     Chuck   Blues
4     Richard Blues
5     Ethel   Reds
6     Fred    Blues
7     Gilly   Blues
8     Hank    Reds
9     Hank    Blues

This type of data representation is quite standard for a flat-file database, although there are some additional considerations that are not readily apparent from the text.
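As an illustration of what the text leaves implicit, the sketch below loads the example table and runs two simple queries. It assumes that columns are separated by runs of whitespace, that no value contains a space, and that the id column should be read as an integer; none of this is stated by the file itself, and the names `TABLE` and `load` are made up for the example.

```python
TABLE = """\
id    name    team
1     Amy     Blues
2     Bob     Reds
3     Chuck   Blues
4     Richard Blues
5     Ethel   Reds
6     Fred    Blues
7     Gilly   Blues
8     Hank    Reds
9     Hank    Blues
"""


def load(text):
    """Turn the header line and each data line into a list of dicts."""
    lines = text.splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:] if line.strip()]


rows = load(TABLE)
for row in rows:
    # The reader, not the file, decides that id is numeric.
    row["id"] = int(row["id"])

# With no index, every query is a scan over all rows.
blues = [row["name"] for row in rows if row["team"] == "Blues"]
hanks = [row["id"] for row in rows if row["name"] == "Hank"]

print(blues)
print(hanks)
```

The two rows named Hank show why the id column exists: the name alone does not identify a record.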

References

  1. Data Integration Glossary. U.S. Department of Transportation, August 2001. Archived March 20, 2009, at the Wayback Machine.
  2. Fowler, Glenn (1994). "cql: Flat-file database query language". WTEC'94: Proceedings of the USENIX Winter 1994 Technical Conference.
  3. Blodgett, John H.; Schultz, Claire K. (1969). "Herman Hollerith: data processing pioneer". American Documentation. 20 (3): 221–226. doi:10.1002/asi.4630200307. ISSN 1936-6108.