Some of the different types of data.

Data are individual facts, statistics, or items of information, often numeric, that are collected through observation. [1] In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, [1] while a datum (singular of data) is a single value of a single variable. [2]


Although the terms "data" and "information" are often used interchangeably, they have distinct meanings. In some popular publications, data are said to be transformed into information when they are viewed in context or in post-analysis. [3] In academic treatments of the subject, however, data are simply units of information. Data are used in scientific research, business management (e.g., sales data, revenue, profits, stock price), finance, governance (e.g., crime rates, unemployment rates, literacy rates), and in virtually every other form of human organizational activity (e.g., censuses of the number of homeless people by non-profit organizations).

Data are measured, collected, reported, and analyzed, and used to create data visualizations such as graphs, tables or images. Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing. Raw data ("unprocessed data") is a collection of numbers or characters before it has been "cleaned" and corrected by researchers. Raw data needs to be corrected to remove outliers or obvious instrument or data entry errors (e.g., a thermometer reading from an outdoor Arctic location recording a tropical temperature). Data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next stage. Field data is raw data that is collected in an uncontrolled "in situ" environment. Experimental data is data that is generated within the context of a scientific investigation by observation and recording.
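The cleaning step described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the sample readings, the plausible range for an outdoor Arctic station (-70 °C to 25 °C), and the function name `clean_readings` are all hypothetical and are not taken from any particular dataset or library.

```python
def clean_readings(raw, low=-70.0, high=25.0):
    """Split raw temperature readings (in °C) into plausible and rejected values.

    The default bounds are illustrative assumptions for an outdoor Arctic site.
    """
    kept, rejected = [], []
    for value in raw:
        if low <= value <= high:
            kept.append(value)
        else:
            rejected.append(value)
    return kept, rejected


# Hypothetical raw data: 34.0 is a tropical temperature at an Arctic location.
raw_data = [-31.5, -29.8, 34.0, -30.2, -28.9]
processed, rejected = clean_readings(raw_data)
print(processed)
print(rejected)
```

The rejected values are returned rather than silently discarded, so that a later processing stage can still inspect them; the "processed data" of this stage could then serve as the "raw data" of the next.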

Data has been described as the new oil of the digital economy. [4] [5]

Etymology and terminology

The first English use of the word "data" is from the 1640s. The word "data" was first used to mean "transmissible and storable computer information" in 1946. The expression "data processing" was first used in 1954. [6]

The Latin word data is the plural of datum, "(thing) given," the neuter past participle of dare, "to give". [6] In English, the word data may be used as a plural noun in this sense, with some writers (usually those working in the natural sciences, life sciences, and social sciences) using datum in the singular and data for the plural, especially in the 20th century and in many cases also in the 21st; for example, APA style as of the 7th edition still requires "data" to be plural. [7] However, in everyday language and in much of the usage of software development and computer science, "data" is most commonly used in the singular as a mass noun (like "sand" or "rain"). The term big data takes the singular.


Adrien Auzout's "A TABLE of the Apertures of Object-Glasses" from a 1665 article in Philosophical Transactions

Data, information, knowledge and wisdom are closely related concepts, but each has its own role in relation to the other, and each term has its own meaning. According to a common view, data are collected and analyzed; data only becomes information suitable for making decisions once it has been analyzed in some fashion. [8] One can say that the extent to which a set of data is informative to someone depends on the extent to which it is unexpected by that person. The amount of information contained in a data stream may be characterized by its Shannon entropy.
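As a rough illustration of the last point, the Shannon entropy of a stream can be estimated from the relative frequencies of its symbols. The sketch below assumes the symbols are independent and uses the observed frequencies as probabilities; the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(stream):
    """Return the empirical Shannon entropy of a sequence, in bits per symbol.

    Assumes independent symbols and uses observed frequencies as probabilities.
    """
    total = len(stream)
    if total == 0:
        return 0.0
    counts = Counter(stream)
    # Each symbol with probability p = n / total contributes p * log2(1 / p).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("aaaa"))  # fully predictable stream
print(shannon_entropy("abab"))  # two equally likely symbols
print(shannon_entropy("abcd"))  # four equally likely symbols
```

A stream that always repeats the same symbol is entirely expected and has zero entropy, while a stream whose symbols are equally likely, and therefore least predictable, has the highest entropy for its alphabet size.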

Knowledge is the understanding based on extensive experience dealing with information on a subject. For example, the height of Mount Everest is generally considered data. The height can be measured precisely with an altimeter and entered into a database. This data may be included in a book along with other data on Mount Everest to describe the mountain in a manner useful for those who wish to make a decision about the best method to climb it. An understanding based on experience climbing mountains that could advise persons on the way to reach Mount Everest's peak may be seen as "knowledge". The practical climbing of Mount Everest's peak based on this knowledge may be seen as "wisdom". In other words, wisdom refers to the practical application of a person's knowledge in those circumstances where good may result. Thus wisdom complements and completes the series "data", "information" and "knowledge" of increasingly abstract concepts.

Data are often assumed to be the least abstract concept, information the next least, and knowledge the most abstract. [9] In this view, data becomes information by interpretation; e.g., the height of Mount Everest is generally considered "data", a book on Mount Everest geological characteristics may be considered "information", and a climber's guidebook containing practical information on the best way to reach Mount Everest's peak may be considered "knowledge". "Information" bears a diversity of meanings that ranges from everyday usage to technical use. This view, however, has also been argued to reverse the way in which data emerges from information, and information from knowledge. [10] Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. Beynon-Davies uses the concept of a sign to differentiate between data and information; data are a series of symbols, while information occurs when the symbols are used to refer to something. [11] [12]

Before the development of computing devices and machines, people had to collect data manually and impose patterns on it. Since then, these devices have also been able to collect data. By the 2010s, computers were widely used in many fields to collect, sort, and process data, in disciplines ranging from marketing and the analysis of citizens' use of social services to scientific research. Patterns in data are seen as information that can be used to enhance knowledge. These patterns may be interpreted as "truth" (though "truth" can be a subjective concept), and may be authorized as aesthetic and ethical criteria in some disciplines or cultures. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and the observation is broken. [13]

Mechanical computing devices are classified according to the means by which they represent data. An analog computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer represents a piece of data as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". More familiar representations, such as numbers or letters, are then constructed from the binary alphabet. Some special forms of data are distinguished. A computer program is a collection of data, which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish metadata, that is, a description of other data. A similar yet earlier term for metadata is "ancillary data." The prototypical example of metadata is the library catalog, which is a description of the contents of books.
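The construction of familiar representations from a binary alphabet can be sketched as follows. This is a minimal example, assuming UTF-8 as the character encoding (one of several possible choices); the helper names `to_bits` and `from_bits` are hypothetical.

```python
def to_bits(text):
    """Encode text as a string of '0' and '1' characters, 8 bits per UTF-8 byte."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))


def from_bits(bits):
    """Decode a '0'/'1' string whose length is a multiple of 8 back into text."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


bits = to_bits("Hi")
print(bits)             # 0100100001101001
print(from_bits(bits))  # Hi
```

The same sequence of bits could equally be read as a number or as part of a program; which interpretation applies is itself a piece of metadata about the data.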

Data documents

Whenever data needs to be registered, it exists in the form of a data document. Kinds of data documents include data repositories, data studies, data sets, software, and data papers.

Some of these data documents (data repositories, data studies, data sets and software) are indexed in Data Citation Indexes, while data papers are indexed in traditional bibliographic databases, e.g., Science Citation Index. For further discussion, see [14].

Data collection

Data can be gathered from a primary source (the researcher is the first person to obtain the data) or from a secondary source (the researcher obtains data that has already been collected by others, such as data disseminated in a scientific journal). Data analysis methodologies vary and include data triangulation and data percolation. [15] The latter offers an articulated method of collecting, classifying, and analyzing data using five possible angles of analysis, of which at least three are applied: qualitative methods, quantitative methods, literature reviews (including scholarly articles), interviews with experts, and computer simulation. The aim is to maximize the research's objectivity and to permit as complete an understanding as possible of the phenomena under investigation. The data are thereafter "percolated" using a series of pre-determined steps so as to extract the most relevant information.

In other fields

Although data are also increasingly used in other fields, it has been suggested that their highly interpretive nature might be at odds with the ethos of data as "given". Peter Checkland introduced the term capta (from the Latin capere, "to take") to distinguish between the immense number of possible data and the subset of them to which attention is oriented. [16] Johanna Drucker has argued that since the humanities affirm knowledge production as "situated, partial, and constitutive," using data may introduce counterproductive assumptions, for example that phenomena are discrete or observer-independent. [17] The term capta, which emphasizes the act of observation as constitutive, is offered as an alternative to data for visual representations in the humanities.



This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

  1. OECD Glossary of Statistical Terms. OECD. 2008. p. 119. ISBN 978-92-64-025561.
  2. "Statistical Language - What are Data?". Australian Bureau of Statistics. 2013-07-13. Archived from the original on 2019-04-19. Retrieved 2020-03-09.
  3. "Data vs Information - Difference and Comparison | Diffen". Retrieved 2018-12-11.
  4. Yonego, Joris Toonders (July 23, 2014). "Data Is the New Oil of the Digital Economy" via
  5. "Data is the new oil". July 16, 2018. Archived from the original on 2018-07-16.
  6. "data | Origin and meaning of data by Online Etymology Dictionary".
  7. American Psychological Association (2020). "6.11". Publication Manual of the American Psychological Association: the official guide to APA style. American Psychological Association. ISBN 9781433832161.
  8. "Joint Publication 2-0, Joint Intelligence" (PDF). Joint Chiefs of Staff, Joint Doctrine Publications. Department of Defense. 23 October 2013. pp. I-1. Retrieved July 17, 2018.
  9. Akash Mitra (2011). "Classifying data for successful modeling".
  10. Tuomi, Ilkka (2000). "Data is more than knowledge". Journal of Management Information Systems. 6 (3): 103–117. doi:10.1080/07421222.1999.11518258.
  11. P. Beynon-Davies (2002). Information Systems: An introduction to informatics in organisations. Basingstoke, UK: Palgrave Macmillan. ISBN 0-333-96390-3.
  12. P. Beynon-Davies (2009). Business information systems. Basingstoke, UK: Palgrave. ISBN 978-0-230-20368-6.
  13. Sharon Daniel. The Database: An Aesthetics of Dignity.
  14. Schöpfel et al. (2020). "Data Documents". ISKO Encyclopedia of Knowledge Organization.
  15. Mesly, Olivier (2015). Creating Models in Psychological Research. United States: Springer Psychology. 126 pages. ISBN 978-3-319-15752-8.
  16. P. Checkland and S. Holwell (1998). Information, Systems, and Information Systems: Making Sense of the Field. Chichester, West Sussex: John Wiley & Sons. pp. 86–89. ISBN 0-471-95820-4.
  17. Johanna Drucker (2011). "Humanities Approaches to Graphical Display".