Data proliferation

Last updated January 27, 2023

Data proliferation refers to the prodigious amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate and the usability problems that result from attempting to store and manage that data. While originally pertaining to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.

While digital storage has become cheaper, the associated costs, from raw power to maintenance and from metadata to search engines, have not kept up with the proliferation of data. Although the power required to maintain a unit of data has fallen, the cost of facilities which house the digital storage has tended to rise.^[1]

At the simplest level, company e-mail systems spawn large amounts of data. Business e-mail – some of it important to the enterprise, some much less so – is estimated to be growing at a rate of 25-30% annually. And whether it’s relevant or not, the load on the system is being magnified by practices such as multiple addressing and the attaching of large text, audio and even video files.
— IBM Global Technology Services^[2]

Data proliferation has been documented as a problem for the U.S. military since August 1971, in particular regarding the excessive documentation submitted during the acquisition of major weapon systems.^[3] Efforts to mitigate data proliferation and the problems associated with it are ongoing.^[4]

Problems caused

The problem of data proliferation is affecting all areas of commerce as a result of the availability of relatively inexpensive data storage devices. This has made it very easy to dump data into secondary storage immediately after its window of usability has passed. This masks problem that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organizations.^[2] Data proliferation is problematic for several reasons:

Difficulty when trying to find and retrieve information. At Xerox, on average it takes employees more than one hour per week to find hard-copy documents, costing $2,152 a year to manage and store them. For businesses with more than 10 employees, this increases to almost two hours per week at $5,760 per year.^[5] In large networks of primary and secondary data storage, problems finding electronic data are analogous to problems finding hard copy data.
Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found promptly. In April 2005, the Ameritrade Holding Corporation told 200,000 current and past customers that a tape containing confidential information had been lost or destroyed in transit. In May of the same year, Time Warner Incorporated reported that 40 tapes containing personal data on 600,000 current and former employees had been lost en route to a storage facility. In March 2005, a Florida judge hearing a $2.7 billion lawsuit against Morgan Stanley issued an "adverse inference order" against the company for "willful and gross abuse of its discovery obligations." The judge cited Morgan Stanley for repeatedly finding misplaced tapes of e-mail messages long after the company had claimed that it had turned over all such tapes to the court.^[6]
Increased manpower requirements to manage increasingly chaotic data storage resources.
Slower networks and application performance due to excess traffic as users search and search again for the material they need.^[2]
High cost in terms of the energy resources required to operate storage hardware. A 100 terabyte system will cost up to $35,040 a year to run—not counting cooling costs.^[7]

Proposed solutions

Applications that better utilize modern technology
Reductions in duplicate data (especially as caused by data movement)
Improvement of metadata structures
Improvement of file and storage transfer structures
User education and discipline^[3]
The implementation of Information Lifecycle Management solutions to eliminate low-value information as early as possible before putting the rest into actively managed long-term storage in which it can be quickly and cheaply accessed.^[2]

Related Research Articles

Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.

<span class="mw-page-title-main">Disk storage</span> General category of storage mechanisms

Disk storage is a general category of storage mechanisms where data is recorded by various electronic, magnetic, optical, or mechanical changes to a surface layer of one or more rotating disks. A disk drive is a device implementing such a storage mechanism. Notable types are the hard disk drive (HDD) containing a non-removable disk, the floppy disk drive (FDD) and its removable floppy disk, and various optical disc drives (ODD) and associated optical disc media.

A document management system (DMS) is usually a computerized system used to store, share, track and manage files or documents. Some systems include history tracking where a log of the various versions created and modified by different users is recorded. The term has some overlap with the concepts of content management systems. It is often viewed as a component of enterprise content management (ECM) systems and related to digital asset management, document imaging, workflow systems and records management systems.

A management information system (MIS) is an information system used for decision-making, and for the coordination, control, analysis, and visualization of information in an organization. The study of the management information systems involves people, processes and technology in an organizational context.

Quantum Corporation is a data storage, management, and protection company that provides technology to store, manage, archive, and protect video and unstructured data throughout the data lifecycle. Their products are used by enterprises, media and entertainment companies, government agencies, big data companies, and life science organizations. Quantum is headquartered in San Jose, California and has offices around the world, supporting customers globally in addition to working with a network of distributors, VARs, DMRs, OEMs and other suppliers.

In industry, Product Lifecycle Management (PLM) is the process of managing the entire lifecycle of a product from its inception through the engineering, design and manufacture, as well as the service and disposal of manufactured products. PLM integrates people, data, processes and business systems and provides a product information backbone for companies and their extended enterprises.

Disaster recovery is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. It employs policies, tools, and procedures. Disaster recovery focuses on the information technology (IT) or technology systems supporting critical business functions as opposed to business continuity. This involves keeping all essential aspects of a business functioning despite significant disruptive events; it can therefore be considered a subset of business continuity. Disaster recovery assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.

Data management comprises all disciplines related to handling data as a valuable resource.

In computing, a file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data are easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file". The structure and logic rules used to manage the groups of data and their names is called a "file system."

Business software is any software or set of computer programs used by business users to perform various business functions. These business applications are used to increase productivity, measure productivity, and perform other business functions accurately.

In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods and technologies, and it combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time. The Association for Library Collections and Technical Services Preservation and Reformatting Section of the American Library Association, defined digital preservation as combination of "policies, strategies and actions that ensure access to digital content over time." According to the Harrod's Librarian Glossary, digital preservation is the method of keeping digital material alive so that they remain usable as technological advances render original hardware and software specification obsolete.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one methods for importing content to bring manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

Hierarchical storage management (HSM), also known as Tiered storage, is a data storage and Data management technique that automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as solid state drive arrays, are more expensive than slower devices, such as hard disk drives, optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.

Information lifecycle management (ILM) refers to strategies for administering storage systems on computing devices.

A data steward is an oversight or data governance role within an organization, and is responsible for ensuring the quality and fitness for purpose of the organization's data assets, including the metadata for those data assets. A data steward may share some responsibilities with a data custodian, such as the awareness, accessibility, release, appropriate use, security and management of data. A data steward would also participate in the development and implementation of data assets. A data steward may seek to improve the quality and fitness for purpose of other data assets their organization depends upon but is not responsible for.

Magnetic-tape data storage is a system for storing digital information on magnetic tape using digital recording.

A paperless office is a work environment in which the use of paper is eliminated or greatly reduced. This is done by converting documents and other papers into digital form, a process known as digitization. Proponents claim that "going paperless" can save money, boost productivity, save space, make documentation and information sharing easier, keep personal information more secure, and help the environment. The concept can be extended to communications outside the office as well.

In information technology, an information repository or simply a repository is "a central place in which an aggregation of data is kept and maintained in an organized way, usually in computer storage." It "may be just the aggregation of data itself into some accessible place of storage or it may also imply some ability to selectively extract data."

<span class="mw-page-title-main">Active Archive Alliance</span> Trade association

The Active Archive Alliance is a trade association that promotes a method of tiered storage. This method provides users access to data across a virtual file system that migrates data between multiple storage systems and media types including solid-state drive/flash, hard disk drives, magnetic tape, optical disk, and cloud. The result of an active archive implementation is that data can be stored on the most appropriate media type for the given retention and restoration requirements of that data. This allows less time sensitive or infrequently accessed data to be stored on less expensive media and eliminates the need for an administrator to manually migrate data between storage systems. Additionally, since storage systems such as tape libraries have low power consumption, the operational expense of storing data in an active archive is significantly reduced.

Information technology (IT) is the use of computers to create, process, store, retrieve and exchange all kinds of data and information. IT forms part of information and communications technology (ICT). An information technology system is generally an information system, a communications system, or, more specifically speaking, a computer system — including all hardware, software, and peripheral equipment — operated by a limited group of IT users.

References

↑ "Downsizing the digital attic". Deloitte Technology Predictions. Archived from the original on July 22, 2011.
1 2 3 4 "The Toxic Terabyte", IBM Global Technology Services, July 2006
1 2 "Evolution of the Data Proliferation Problem within Major Air Force Acquisition Programs". Archived from the original on 2007-10-09. Retrieved 2007-10-09.
↑ Data Proliferation: Stop That
↑ “Dealing with data proliferation”; Vawn Himmelsbach. it business.ca: Canadian Technology News, September 19, 2006
↑ “Data: Lost, Stolen or Strayed”, Computer World, Security
↑ "Power and storage: the hidden cost of ownership”, Computer Technology Review, October 2003

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "Downsizing the digital attic". Deloitte Technology Predictions. Archived from the original on July 22, 2011.

[IBM-2] 1 2 3 4 "The Toxic Terabyte", IBM Global Technology Services, July 2006

[DODPP-3] 1 2 "Evolution of the Data Proliferation Problem within Major Air Force Acquisition Programs". Archived from the original on 2007-10-09. Retrieved 2007-10-09.

[4] Data Proliferation: Stop That

[5] “Dealing with data proliferation”; Vawn Himmelsbach. it business.ca: Canadian Technology News, September 19, 2006

[6] “Data: Lost, Stolen or Strayed”, Computer World, Security

[7] "Power and storage: the hidden cost of ownership”, Computer Technology Review, October 2003

[1]

[2]

[3]

[4]

[5]

[6]

[7]

Data proliferation

Contents

Problems caused

Proposed solutions

See also

Related Research Articles

References