Raw data

[Image: Origin histograma raw data.png. Caption: The two columns to the right of the left-most column in this computerized table are raw data.]

Raw data, also known as primary data, are data (e.g., numbers, instrument readings, figures, etc.) collected from a source. In the context of examinations, raw data might be described as a raw score, the term used for an unadjusted test score.

If a scientist sets up a computerized thermometer that records the temperature of a chemical mixture in a test tube every minute, the list of temperature readings for every minute, as printed out on a spreadsheet or viewed on a computer screen, is "raw data". Raw data have not been subjected to processing, to "cleaning" by researchers to remove outliers, obvious instrument reading errors, or data entry errors, or to any analysis (e.g., determining central tendency measures such as the mean or median). Nor have raw data been subject to any other manipulation by a software program or by a human researcher, analyst, or technician. They are also referred to as primary data.

Raw data is a relative term (see data): even once raw data have been "cleaned" and processed by one team of researchers, another team may consider these processed data to be "raw data" for another stage of research. Raw data can be input into a computer program or used in manual procedures such as analyzing statistics from a survey. The term "raw data" can also refer to the binary data on electronic storage devices, such as hard disk drives (also referred to as "low-level data").
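To make the distinction concrete, the following minimal Python sketch treats a list of per-minute temperature readings as raw data and then "cleans" and summarizes it. The sentinel value, the plausible temperature range, and the variable names are illustrative assumptions, not part of any standard procedure.

```python
"""Minimal sketch: raw readings vs. cleaned, analyzed data."""
from statistics import mean, median

# Raw data: per-minute temperature readings exactly as recorded (deg C).
raw_readings = [21.3, 21.4, 21.6, -999.0, 21.8, 95.2, 22.0, 21.9]

# "Cleaning": drop an obvious instrument-error sentinel (-999.0) and
# readings outside an assumed plausible range for this mixture (0-50 deg C).
cleaned = [t for t in raw_readings if t != -999.0 and 0.0 <= t <= 50.0]

# Analysis: central tendency of the cleaned series.
print("raw count:", len(raw_readings), "cleaned count:", len(cleaned))
print("mean:", round(mean(cleaned), 2), "median:", median(cleaned))
```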

Generating data

Data can be generated in two ways. The first kind is "captured data",[1] which is produced through purposeful investigation or analysis. The second is "exhaust data",[1] which is usually gathered by machines or terminals as a secondary function: for example, cash registers, smartphones, and speedometers serve a main purpose but may collect data as a side effect. Exhaust data is usually too voluminous or of too little value to process and becomes "transient", i.e., it is thrown away.[1]

Examples

In computing, raw data may have the following attributes: it may contain human, machine, or instrument errors; it may not be validated; it might be in different (colloquial) formats; it might be uncoded or unformatted; or some entries might be "suspect" (e.g., outliers) and require confirmation or citation. For example, a data input sheet might contain dates as raw data in many forms: "31st January 1999", "31/01/1999", "31/1/99", "31 Jan", or "today". Once captured, this raw data may be processed and stored in a normalized format, perhaps a Julian date, to make it easier for computers and humans to interpret during later processing. Raw data (sometimes colloquially called "sources" data or "eggy" data, the latter a reference to the data being "uncooked", that is, "unprocessed", like a raw egg) are the data input to processing. A distinction is made between data and information, to the effect that information is the end product of data processing. Raw data that have undergone processing are sometimes referred to, colloquially, as "cooked" data. Although raw data has the potential to be transformed into "information", it must first be extracted, organized, analyzed, and formatted for presentation before it becomes usable information.
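As a rough illustration of such normalization, the sketch below converts several of the raw date forms mentioned above into ISO 8601 dates. The accepted input patterns, the normalize function, and the choice of ISO rather than Julian dates are assumptions made for the example.

```python
"""Sketch of normalizing raw date entries into one format (ISO 8601)."""
from datetime import date, datetime
import re

PATTERNS = ["%d %B %Y", "%d/%m/%Y", "%d/%m/%y"]

def normalize(raw: str) -> str | None:
    entry = raw.strip()
    if entry.lower() == "today":
        return date.today().isoformat()
    # Strip ordinal suffixes such as "31st" -> "31".
    entry = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", entry, flags=re.IGNORECASE)
    for pattern in PATTERNS:
        try:
            return datetime.strptime(entry, pattern).date().isoformat()
        except ValueError:
            continue
    return None  # e.g. "31 Jan" has no year and stays "suspect"

for raw in ["31st January 1999", "31/01/1999", "31/1/99", "31 Jan", "today"]:
    print(raw, "->", normalize(raw))
```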

For example, a point-of-sale terminal (POS terminal, a computerized cash register) in a busy supermarket collects huge volumes of raw data each day about customers' purchases. However, this list of grocery items and their prices and the time and date of purchase does not yield much information until it is processed. Once processed and analyzed by a software program, or even by a researcher using a pen, paper, and a calculator, this raw data may indicate the particular items that each customer buys, when they buy them, and at what price; as well, an analyst or manager could calculate the average total sales per customer or the average expenditure per day of the week by hour. This processed and analyzed data provides information that the manager could then use to determine, for example, how many cashiers to hire and at what times. Such information could then become data for further processing, for example as part of a predictive marketing campaign. As a result of processing, raw data sometimes ends up being put in a database, which makes the raw data accessible for further processing and analysis in any number of different ways.
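A minimal sketch of this kind of processing might look as follows. The record layout (customer id, timestamp, amount) and the field names are hypothetical; a real point-of-sale system would record far more detail.

```python
"""Sketch: summarizing raw point-of-sale records into usable information."""
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Raw data: one row per purchase, exactly as logged by the terminal.
raw_sales = [
    ("cust-1", "2024-03-04 09:15", 12.50),
    ("cust-2", "2024-03-04 09:40", 31.10),
    ("cust-1", "2024-03-05 18:05", 8.75),
    ("cust-3", "2024-03-05 18:20", 22.00),
]

# Average total spend per customer.
per_customer = defaultdict(float)
for customer, _, amount in raw_sales:
    per_customer[customer] += amount
print("average total per customer:", round(mean(per_customer.values()), 2))

# Takings by (day of week, hour), e.g. to plan cashier staffing.
by_slot = defaultdict(float)
for _, timestamp, amount in raw_sales:
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    by_slot[(ts.strftime("%A"), ts.hour)] += amount
for slot, total in sorted(by_slot.items()):
    print(slot, round(total, 2))
```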

Tim Berners-Lee (inventor of the World Wide Web) argues that sharing raw data is important for society. Inspired by a post by Rufus Pollock of the Open Knowledge Foundation, his call to action is "Raw Data Now", meaning that everyone should demand that governments and businesses share the data they collect as raw data. He points out that "data drives a huge amount of what happens in our lives… because somebody takes the data and does something with it." To Berners-Lee, it is essentially from this sharing of raw data that advances in science will emerge. Advocates of open data argue that once citizens and civil society organizations have access to data from businesses and governments, they can do their own analysis of the data, which can empower people and civil society. For example, a government may claim that its policies are reducing the unemployment rate, but a poverty advocacy group may be able to have its staff econometricians do their own analysis of the raw data, which may lead this group to draw different conclusions about the data set.

Related Research Articles

Semantic Web: Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Computerized batch processing is a method of running software programs called jobs in batches automatically. While users are required to submit the jobs, no other interaction by the user is required to process the batch. Batches may automatically be run at scheduled times as well as being run contingent on the availability of computer resources.

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

In connection-oriented communication, a data stream is the transmission of a sequence of digitally encoded signals to convey information. Typically, the transmitted symbols are grouped into a series of packets.

Outlier: Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, may indicate novel data, or may be the result of experimental error; the latter are sometimes excluded from the data set. An outlier can point to an interesting finding, but it can also cause serious problems in statistical analyses.
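One common, though by no means universal, convention for flagging outliers is the 1.5 × IQR rule; the short sketch below applies it to a toy data set. The data values and the threshold are illustrative assumptions.

```python
"""Sketch: flagging outliers with the 1.5 * IQR rule."""
from statistics import quantiles

data = [10.1, 10.4, 9.9, 10.2, 10.3, 10.0, 18.7, 10.2]

q1, _, q3 = quantiles(data, n=4)          # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print("bounds:", round(low, 2), round(high, 2), "outliers:", outliers)
```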

Extract, transform, load: Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process in which data is extracted, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules, either as single jobs or aggregated into a batch of jobs.
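A toy end-to-end example of the three phases, using only the Python standard library, might look like this. The CSV layout, the table schema, and the pence-to-pounds conversion are illustrative assumptions, not features of any particular ETL product.

```python
"""Minimal extract-transform-load (ETL) sketch."""
import csv, io, sqlite3

# Extract: read raw rows from a CSV source (an in-memory string here).
raw_csv = "name,price_pence\nmilk,120\nbread,95\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: normalise names and convert pence to pounds.
transformed = [(r["name"].strip().title(), int(r["price_pence"]) / 100)
               for r in rows]

# Load: write into an output container (an in-memory SQLite database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price_gbp REAL)")
db.executemany("INSERT INTO products VALUES (?, ?)", transformed)
print(db.execute("SELECT * FROM products").fetchall())
```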

Design for Six Sigma (DFSS) is a collection of best-practices for the development of new products and processes. It is sometimes deployed as an engineering design process or business process management method. DFSS originated at General Electric to build on the success they had with traditional Six Sigma; but instead of process improvement, DFSS was made to target new product development. It is used in many industries, like finance, marketing, basic engineering, process industries, waste management, and electronics. It is based on the use of statistical tools like linear regression and enables empirical research similar to that performed in other fields, such as social science. While the tools and order used in Six Sigma require a process to be in place and functioning, DFSS has the objective of determining the needs of customers and the business, and driving those needs into the product solution so created. It is used for product or process design in contrast with process improvement. Measurement is the most important part of most Six Sigma or DFSS tools, but whereas in Six Sigma measurements are made from an existing process, DFSS focuses on gaining a deep insight into customer needs and using these to inform every design decision and trade-off.

Performance indicator: Measurement that evaluates the success of an organization

A performance indicator or key performance indicator (KPI) is a type of performance measurement. KPIs evaluate the success of an organization or of a particular activity in which it engages. KPIs provide a focus for strategic and operational improvement, create an analytical basis for decision making and help focus attention on what matters most.

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.
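A trivial, rule-based stand-in for a statistical classifier is sketched below for the spam example. The keyword list and threshold are arbitrary assumptions chosen only to illustrate mapping an observation to a category.

```python
"""Sketch: assigning an email to the "spam" or "non-spam" class."""
import re

SPAM_WORDS = {"winner", "free", "prize", "urgent"}

def classify(email_text: str) -> str:
    # The observed characteristics here are simply the words in the email.
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return "spam" if len(words & SPAM_WORDS) >= 2 else "non-spam"

print(classify("You are a WINNER, claim your FREE prize now"))  # spam
print(classify("Minutes from yesterday's project meeting"))     # non-spam
```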

In computing, data validation is the process of ensuring that data have undergone data cleansing and have data quality, that is, that they are both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check the correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by including explicit validation logic in the application program.
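A minimal sketch of such validation rules expressed in application code might look like the following. The field names, the reference list of country codes, and the plausible age range are hypothetical.

```python
"""Sketch: simple validation rules applied to an input record."""
import re

def validate(record: dict) -> list[str]:
    errors = []
    # Correctness: the email field must be well formed.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")):
        errors.append("email is not well formed")
    # Meaningfulness: the age must fall in a plausible range.
    if not (0 <= record.get("age", -1) <= 120):
        errors.append("age outside plausible range")
    # Reference check: the country code must come from an allowed list.
    if record.get("country") not in {"DE", "FR", "GB", "US"}:
        errors.append("country code not in reference list")
    return errors

print(validate({"email": "a@example.com", "age": 34, "country": "GB"}))  # []
print(validate({"email": "not-an-email", "age": 430, "country": "XX"}))
```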

In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.
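For instance, Python's standard-library cProfile module reports per-function call counts and timings; the profiled functions below are throwaway examples used only to produce some measurable work.

```python
"""Sketch: profiling a small program with cProfile."""
import cProfile

def build_squares(n: int) -> list[int]:
    return [i * i for i in range(n)]

def work() -> None:
    for _ in range(100):
        build_squares(10_000)

# Prints call counts and cumulative times per function.
cProfile.run("work()")
```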

Data analysis: The process of analyzing data to discover useful information and support decision-making

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure that the data are of good quality and useful. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.
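The following small sketch shows one wrangling step: mapping inconsistently named raw records onto a single layout and casting text to numbers. The alias table and target field names are assumptions for illustration.

```python
"""Sketch: wrangling inconsistent raw records into one uniform layout."""
ALIASES = {"temp": "temperature_c", "temperature": "temperature_c",
           "ts": "timestamp", "time": "timestamp"}

raw_records = [
    {"ts": "2024-03-04T09:00", "temp": "21.4"},
    {"time": "2024-03-04T09:01", "temperature": "21.6"},
]

wrangled = []
for record in raw_records:
    # Rename fields to the target schema, then cast text to numbers.
    row = {ALIASES.get(k, k): v for k, v in record.items()}
    row["temperature_c"] = float(row["temperature_c"])
    wrangled.append(row)

print(wrangled)
```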

The marketing research process is a six-step process involving the definition of the problem being studied, the determination of the approach to take, the formulation of the research design, the field work required, data preparation and analysis, and the generation and presentation of reports.

Data: Units of information

In common usage, data is a collection of discrete or continuous values that convey information, describing quantity, quality, facts, statistics, or other basic units of meaning, or simply sequences of symbols that may be further interpreted formally. A datum is an individual value in a collection of data. Data is usually organized into structures such as tables that provide additional context and meaning, and which may themselves be used as data in larger structures. Data may be used as variables in a computational process. Data may represent abstract ideas or concrete measurements. Data is commonly used in scientific research, economics, and virtually every other form of human organizational activity. Examples of data sets include price indices, unemployment rates, literacy rates, and census data. In this context, data represents the raw facts and figures from which useful information can be extracted.

Manufacturing execution systems (MES) are computerized systems used in manufacturing to track and document the transformation of raw materials into finished goods. MES provides information that helps manufacturing decision-makers understand how current conditions on the plant floor can be optimized to improve production output. MES works as a real-time monitoring system to enable the control of multiple elements of the production process.

Market intelligence (MI) is the gathering and analysis of information relevant to a company's market: trends, competitor monitoring, and customer monitoring. It is a subtype of competitive intelligence (CI), which is data and information gathered by companies to provide continuous insight into market trends such as competitors' and customers' values and preferences.

Rufus Pollock: British economist, activist and social entrepreneur

Rufus Pollock is a British economist, activist and social entrepreneur.

The fields of marketing and artificial intelligence converge in systems which assist in areas such as market forecasting, and automation of processes and decision making, along with increased efficiency of tasks which would usually be performed by humans. The science behind these systems can be explained through neural networks and expert systems, computer programs that process input and provide valuable output for marketers.

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.
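A minimal sketch, assuming the "human-readable output" is a small textual report, is shown below; the report layout and the regular expression are illustrative.

```python
"""Sketch: scraping structured values out of human-readable output."""
import re

report = """Daily summary
Item: milk    Sold: 120
Item: bread   Sold: 95
"""

# Scrape (item, quantity) pairs out of the formatted report.
pattern = re.compile(r"Item:\s+(\w+)\s+Sold:\s+(\d+)")
scraped = [(name, int(qty)) for name, qty in pattern.findall(report)]
print(scraped)   # [('milk', 120), ('bread', 95)]
```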

References

  1. Kitchin, Rob (2014). The Data Revolution. United States: Sage. p. 6.
