Dark data

Dark data is data that is acquired through various computer network operations but not used in any manner to derive insights or for decision making. [1] [2] The ability of an organisation to collect data can exceed the rate at which it can analyse that data. In some cases the organisation may not even be aware that the data is being collected. [3] IBM estimates that roughly 90 percent of the data generated by sensors and analog-to-digital conversions is never used. [4]

In an industrial context, dark data can include information gathered by sensors and telematics. [5]

Organisations retain dark data for a multitude of reasons, and it is estimated that most companies analyse only 1% of their data. [6] Often it is stored for regulatory compliance [7] and record keeping. [1] Some organisations believe that dark data could be useful to them in the future, once they have acquired better analytic and business intelligence technology to process it. [3] Because storage is inexpensive, retaining data is easy. However, storing and securing the data usually entails greater expense (and even risk) than any potential return. [1]

In academic discourse, the term dark data was essentially coined by Bryan P. Heidorn, who uses it to describe research data, especially from the long tail of science (the many small research projects), that is not or is no longer available for research because, without adequate data management, it disappears in a drawer. [8] Further reasons for data going dark include missing metadata annotation and the absence of data management plans and data curators. [9]

Analysis

The term "dark data" very often refers to data that is not amenable to computer processing. For example, a company might have a great deal of data that exists only as scanned page-images. Even the bare text in such documents is not available without something like Optical character recognition, which can vary greatly in accuracy. Even with OCR, the significance of each part of the data is unavailable. An obvious examples is whether a capitalized word is a name or not, and if so, whether it represents a person, place, organization, or even a work of art. Bibliographic and other references, data within tables (that may be labeled quite adequately for humans, but not for processing), and countless assertions represented with the full complexity and ambiguity of human language.

Much unused data is valuable and would be used if it could be, but it is locked in formats that are difficult to process, categorise, identify, and analyse. Often a business does not use its dark data because of the resources that analysing it would require. In other words, the data is "dark" not because it is unused, but because, given its poor representation, it cannot feasibly or affordably be used.

There are many data representations that can make data much more accessible for automation. However, a great deal of information lacks any such identification of information items or relationships; and much more loses it during "downhill" conversion such as saving to page-oriented representations, printing, scanning, or faxing. The journey back "uphill" can be costly.
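
A toy Python sketch (all names and values invented) makes the contrast concrete: a structured representation round-trips losslessly, while a "downhill" flattening of the same record requires costly parsing to recover.

    import json

    # A structured record keeps the role of every field explicit.
    record = {"author": {"given": "Ada", "family": "Lovelace"},
              "year": 1843, "title": "Notes on the Analytical Engine"}

    structured = json.dumps(record)    # machine-readable representation
    restored = json.loads(structured)  # trivial journey back "uphill"
    assert restored == record

    # "Downhill" conversion to display-oriented text discards the structure:
    flattened = f"{record['author']['family']}, {record['author']['given']} ({record['year']})"
    # Recovering author, year, and title from `flattened` now requires a
    # parser (or a human): the costly journey back "uphill".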

According to Computer Weekly, 60% of organisations believe that their own business intelligence reporting capability is "inadequate" and 65% say that they have "somewhat disorganised content management approaches". [10]

Relevance

Useful data may become dark data after it becomes irrelevant because it was not processed quickly enough. This is called "perishable insights" in "live flowing data". For example, if a business knows a customer's geolocation, it can make an offer based on that location; if the data is not processed immediately, however, it may no longer be relevant. According to IBM, about 60 percent of data loses its value immediately. [4]
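
The following Python sketch shows one way such perishability could be enforced; the 60-second freshness window is an invented illustration, not a figure from the cited sources.

    from datetime import datetime, timedelta, timezone

    # Treat location events as perishable: past MAX_AGE they are discarded.
    MAX_AGE = timedelta(seconds=60)  # hypothetical freshness window

    def offer_for(event):
        """Return a location-based offer only while the event is fresh."""
        age = datetime.now(timezone.utc) - event["timestamp"]
        if age > MAX_AGE:
            return None  # the insight has perished; the customer has moved on
        return f"Coupon for the store near {event['location']}"

    event = {"timestamp": datetime.now(timezone.utc), "location": "Main St"}
    print(offer_for(event))  # a fresh event still produces an offer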

Storage

According to The New York Times, 90% of energy used by data centres is wasted. [11] If the data were not stored, this energy cost could be saved. There are also costs associated with the underutilisation of information, and thus missed opportunities. According to Datamation, "the storage environments of EMEA organizations consist of 54 percent dark data, 32 percent redundant, obsolete and trivial data and 14 percent business-critical data. By 2020, this can add up to $891 billion in storage and management costs that can otherwise be avoided." [12]
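
A short worked example in Python applies the Datamation breakdown to a hypothetical storage budget (the $1,000,000 figure is invented for illustration):

    # Apply the reported EMEA storage breakdown to a hypothetical budget.
    budget = 1_000_000  # invented annual storage spend, in dollars
    shares = {"dark data": 0.54,
              "redundant, obsolete and trivial data": 0.32,
              "business-critical data": 0.14}

    for category, share in shares.items():
        print(f"{category}: ${budget * share:,.0f}")
    # dark data: $540,000
    # redundant, obsolete and trivial data: $320,000
    # business-critical data: $140,000
    # Only 14 cents of every storage dollar supports business-critical data.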

The continued storage of dark data can put an organisation at risk, especially if the data is sensitive. A breach could have serious financial and legal repercussions and could seriously hurt the organisation's reputation. For example, a breach of customers' private records could expose sensitive information and lead to identity theft; a breach of the company's own sensitive information, for example relating to research and development, is another risk. These risks can be mitigated by assessing and auditing whether the data is useful to the organisation, employing strong encryption and security, [13] and, if the data is to be discarded, disposing of it in a way that renders it unretrievable. [14]
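
As a sketch of the encryption mitigation named above, the Python example below uses the Fernet recipe from the widely available cryptography package; this is one illustrative choice, not a control mandated by the sources.

    from cryptography.fernet import Fernet

    # Encrypt retained dark data so a breach yields only ciphertext.
    key = Fernet.generate_key()
    f = Fernet(key)
    token = f.encrypt(b"customer record with sensitive details")

    # Only the key holder can read the data back ...
    assert f.decrypt(token) == b"customer record with sensitive details"
    # ... and securely destroying the key when the data is discarded is
    # one way to render the remaining encrypted copies unretrievable.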

Future

It is generally considered that the value of dark data will rise as more advanced systems for analysing data are built. It has been noted that "data and analytics will be the foundation of the modern industrial revolution". [5] This includes data that is currently considered "dark" because there are not enough resources to process it. All of the data being collected today could in the future be used to maximise productivity and help organisations meet consumer demand. Technology advancements are making it affordable to leverage this dark data. Furthermore, many organisations do not yet realise its value: healthcare and education organisations, for example, deal with large amounts of data that could create a significant "potential to service students and patients in the manner in which the consumer and financial services pursue their target population". [15]

References

  1. 1 2 3 "Dark Data". Gartner.
  2. Tittel, Ed (24 September 2014). "The Dangers of Dark Data and How to Minimize Your Exposure". CIO. Archived from the original on 15 January 2019. Retrieved 15 September 2015.
  3. 1 2 Brantley, Bill (2015-06-17). "The API Briefing: the Challenge of Government's Dark Data". Digitalgov.gov.
  4. 1 2 Johnson, Heather (2015-10-30). "Digging up dark data: What puts IBM at the forefront of insight economy". SiliconANGLE. Retrieved 2015-11-03.
  5. 1 2 Dennies, Paul (February 19, 2015). "TeradataVoice: Factories Of The Future: The Value Of Dark Data". Forbes. Archived from the original on 2015-02-22.
  6. Shahzad, M. Ahmad (January 3, 2017). "The big data challenge of transformation for the manufacturing industry". IBM Big Data & Analytics Hub.
  7. "Are you using your dark data effectively". Archived from the original on 2017-01-16. Retrieved 2017-01-12.
  8. Heidorn, P. Bryan. "Shedding light on the dark data in the long tail of science." Library trends 57.2 (2008): 280-299.
  9. Schembera, B., Durán, J.M. Dark Data as the New Challenge for Big Data Science and the Introduction of the Scientific Data Officer. Philos. Technol. 33, 93–115 (2020). https://doi.org/10.1007/s13347-019-00346-x
  10. Miles, Doug (27 December 2013). "Dark data could halt big data's path to success". ComputerWeekly. Retrieved 2015-11-03.
  11. Glanz, James (2012-09-22). "Data Centers Waste Vast Amounts of Energy, Belying Industry Image". The New York Times. Retrieved 2015-11-02.
  12. Hernandez, Pedro (October 30, 2015). "Enterprises are Hoarding 'Dark' Data: Veritas". Datamation. Retrieved 2015-11-04.
  13. "DarkShield Uses Machine Learning to Find and Mask PII". IRI. Retrieved 2019-01-14.
  14. Tittel, Ed (2014-09-24). "The Dangers of Dark Data and How to Minimize Your Exposure". CIO. Archived from the original on 2019-01-15. Retrieved 2015-11-02.
  15. Prag, Crystal (2014-09-30). "Leveraging Dark Data: Q&A with Melissa McCormack". The Machine Learning Times. Retrieved 2015-11-04.