Data proliferation

Data proliferation refers to the prodigious amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate and the usability problems that result from attempting to store and manage that data. While originally pertaining to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.

While digital storage has become cheaper, the associated costs, from raw power to maintenance and from metadata to search engines, have not kept up with the proliferation of data. Although the power required to maintain a unit of data has fallen, the cost of facilities which house the digital storage has tended to rise. [1]

Data proliferation has been documented as a problem for the U.S. military since August 1971, in particular regarding the excessive documentation submitted during the acquisition of major weapon systems. [3] Efforts to mitigate data proliferation and the problems associated with it are ongoing. [4]

Problems caused

Data proliferation affects all areas of commerce as a result of the availability of relatively inexpensive data storage devices. Cheap storage has made it very easy to dump data into secondary storage immediately after its window of usability has passed. This masks problems that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organizations. [2] Data proliferation is problematic for several reasons:

Proposed solutions

See also

Related Research Articles

Computer data storage, often called storage or memory, is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.

A document management system (DMS) is a system used to track, manage and store documents and reduce paper. Most are capable of keeping a record of the various versions created and modified by different users. The term has some overlap with the concepts of content management systems. It is often viewed as a component of enterprise content management (ECM) systems and related to digital asset management, document imaging, workflow systems and records management systems.

Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies include reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. BI technologies can handle large amounts of structured and sometimes unstructured data to help identify, develop and otherwise create new strategic business opportunities. They aim to allow for the easy interpretation of these big data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability.

Content management (CM) is a set of processes and technologies that supports the collection, managing, and publishing of information in any form or medium. When stored and accessed via computers, this information may be more specifically referred to as digital content, or simply as content.

A management information system (MIS) is an information system used for decision-making, and for the coordination, control, analysis, and visualization of information in an organization; especially in a company.

In industry, product lifecycle management (PLM) is the process of managing the entire lifecycle of a product from inception, through engineering design and manufacture, to service and disposal of manufactured products. PLM integrates people, data, processes and business systems and provides a product information backbone for companies and their extended enterprise.

Data management comprises all disciplines related to managing data as a valuable resource.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and possibly enforcing processes for its creation, approval and distribution. Systems that implement ECM generally provide a secure repository that indexes managed items, be they analog or digital. They also include one or more methods for importing content to bring new items under management and several presentation methods to make items available for use.

Hierarchical storage management (HSM) is a data storage technique that automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as solid state drive arrays, are more expensive than slower devices, such as hard disk drives, optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.
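The tiering decision described above can be sketched as a simple idle-time policy. This is a minimal illustration, not how any particular HSM product works: the `StoredFile` record, the `rebalance` function, the 30-day threshold and the sample file names are all hypothetical, and real systems use richer usage statistics than a single last-access day.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    name: str
    size_bytes: int
    last_access_day: int  # day number of the most recent read (hypothetical clock)
    tier: str = "fast"    # "fast" or "slow"


def rebalance(files, today, idle_days=30):
    """Demote files idle for at least idle_days; promote recently used ones.

    Returns a list of (name, old_tier, new_tier) tuples for every move made.
    """
    moves = []
    for f in files:
        idle = today - f.last_access_day
        target = "slow" if idle >= idle_days else "fast"
        if target != f.tier:
            moves.append((f.name, f.tier, target))
            f.tier = target
    return moves


# Illustrative data only.
files = [
    StoredFile("q3_report.pdf", 2_000_000, last_access_day=98),
    StoredFile("2019_logs.tar", 900_000_000, last_access_day=12),
    StoredFile("old_scan.tiff", 50_000_000, last_access_day=95, tier="slow"),
]
moves = rebalance(files, today=100)
for name, old, new in moves:
    print(f"{name}: {old} -> {new}")
```

In this sketch the archive that has sat idle for 88 days is demoted, while the recently read scan is copied back to the fast tier, which mirrors the cache-like behaviour described in the paragraph.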

Information lifecycle management (ILM) refers to strategies for administering storage systems on computing devices.

The following outline is provided as an overview of and topical guide to library science:

A data steward is a role within an organization responsible for using the organization's data governance processes to ensure the fitness of data elements, both content and metadata. Data stewards have a specialist role that incorporates processes, policies, guidelines and responsibilities for administering all of an organization's data in compliance with policy and/or regulatory obligations. A data steward may share some responsibilities with a data custodian.

An information repository is an easy way to deploy a secondary tier of data storage that can comprise multiple, networked data storage technologies running on diverse operating systems, where data that no longer needs to be in primary storage is protected, classified according to captured metadata, processed, de-duplicated, and then purged, automatically, based on data service level objectives and requirements. In information repositories, data storage resources are virtualized as composite storage sets and operate as a federated environment.
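Two of the steps named above, de-duplication and purging against a retention objective, can be sketched as follows. This is an illustrative outline under stated assumptions: the record layout, the `deduplicate` and `purge_expired` helpers, the class names and the retention periods are all hypothetical, and content hashing with SHA-256 is one common way to detect duplicates rather than the method of any specific repository product.

```python
import hashlib


def deduplicate(records):
    """Keep the first record seen for each distinct content hash."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["content"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def purge_expired(records, today, retention_days):
    """Drop records whose age exceeds the retention period of their class."""
    return [
        r for r in records
        if today - r["created_day"] <= retention_days[r["class"]]
    ]


# Illustrative data only; "class" stands in for metadata-based classification.
records = [
    {"name": "invoice_a.pdf", "class": "finance", "created_day": 10, "content": b"invoice 1001"},
    {"name": "invoice_a_copy.pdf", "class": "finance", "created_day": 40, "content": b"invoice 1001"},
    {"name": "tmp_export.csv", "class": "scratch", "created_day": 20, "content": b"a,b,c"},
]
retention_days = {"finance": 2555, "scratch": 30}  # hypothetical service-level objectives

kept = purge_expired(deduplicate(records), today=100, retention_days=retention_days)
print([r["name"] for r in kept])
```

Here the byte-identical copy is dropped by the hash check and the scratch export is purged because it is older than its 30-day objective, leaving only the original invoice.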

Documentum is an enterprise content management platform, now owned by OpenText, as well as the name of the software company that originally developed the technology. EMC acquired Documentum for $1.7 billion in December 2003. The Documentum platform was part of EMC's Enterprise Content Division (ECD) business unit, one of EMC's four operating divisions.

HP Information Management Software is software from the HP Software Division, used to organize, protect, retrieve, acquire, manage and maintain information. The HP Software Division also offers information analytics software. The amount of data that companies have to deal with has grown tremendously over the past decade, making the management of this information more difficult. The University of California at Berkeley claims the amount of information produced globally increases by 30 percent annually. An April 2010 Information Management article cited a survey in which nearly 90 percent of businesses blamed poor performance on data growth. The survey concluded that for many businesses, applications and databases are growing by 50 percent or more annually, making it difficult to manage the rapid expansion of information. Because of this information explosion, IT companies have created technology solutions to help businesses manage this ever-expanding data.
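To put the quoted growth rates in perspective, the short calculation below converts an annual growth rate into a doubling time, assuming the rate stays constant and compounds yearly. The function names and the 100-unit starting size are hypothetical; only the 30 percent and 50 percent figures come from the text above.

```python
import math


def doubling_time_years(annual_growth):
    """Years for a quantity to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_size(initial, annual_growth, years):
    """Size after the given number of years of compound growth."""
    return initial * (1 + annual_growth) ** years


print(round(doubling_time_years(0.30), 2))  # 2.64 years at 30 percent per year
print(round(doubling_time_years(0.50), 2))  # 1.71 years at 50 percent per year
print(projected_size(100, 0.50, 5))         # 759.375 units after five years
```

Under these assumptions, a store growing at 50 percent per year is more than seven times its original size after five years, which illustrates why storage and management costs can outpace falling unit prices.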

The web content lifecycle is the multi-disciplinary and often complex process that web content undergoes as it is managed through various publishing stages.

Converged infrastructure is a way of structuring an information technology (IT) system which groups multiple components into a single optimized computing package. Components of a converged infrastructure may include servers, data storage devices, networking equipment and software for IT infrastructure management, automation and orchestration.

The Active Archive Alliance is a trade association that promotes a method of tiered storage which gives the user access to data across a virtual file system that migrates data between multiple storage systems and media types, including solid-state drive/flash, hard disk drives, magnetic tape, optical disc, and cloud. The result of an active archive implementation is that data can be stored on the most appropriate media type for the given retention and restoration requirements of that data. This allows less time-sensitive or infrequently accessed data to be stored on less expensive media, and eliminates the need for an administrator to manually migrate data between storage systems. Additionally, since storage systems such as tape libraries have very low power consumption, the operational expense of storing data in an active archive is greatly reduced.

A machine-readable document is a document whose content can be readily processed by computers. Such documents are distinguished from machine-readable data by virtue of having sufficient structure to provide the necessary context to support the business processes for which they are created.

References