Unstructured data

Last updated
Unsorted records captured from Nazi Germany at the U.S. National Archives Military Records Center in Alexandria, Virginia, 1956 Photograph of Departmental Records Branch Military Records Center in Alexandria, Virginia - NARA - 23855327.jpg
Unsorted records captured from Nazi Germany at the U.S. National Archives Military Records Center in Alexandria, Virginia, 1956

Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

Contents

In 1998, Merrill Lynch said "unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%." [1] It's unclear what the source of this number is, but nonetheless it is accepted by some. [2] Other sources have reported similar or higher percentages of unstructured data. [3] [4] [5]

As of 2012, IDC and Dell EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. [6] More recently, IDC and Seagate predict that the global datasphere will grow to 163 zettabytes by 2025 [7] and majority of that will be unstructured. The Computer World magazine states that unstructured information might account for more than 70–80% of all data in organizations.

Background

The earliest research into business intelligence focused in on unstructured textual data, rather than numerical data. [8] As early as 1958, computer science researchers like H.P. Luhn were particularly concerned with the extraction and classification of unstructured text. [8] However, only since the turn of the century has the technology caught up with the research interest. In 2004, the SAS Institute developed the SAS Text Miner, which uses Singular Value Decomposition (SVD) to reduce a hyper-dimensional textual space into smaller dimensions for significantly more efficient machine-analysis. [9] The mathematical and technological advances sparked by machine textual analysis prompted a number of businesses to research applications, leading to the development of fields like sentiment analysis, voice of the customer mining, and call center optimization. [10] The emergence of Big Data in the late 2000s led to a heightened interest in the applications of unstructured data analytics in contemporary fields such as predictive analytics and root cause analysis. [11]

Issues with terminology

The term is imprecise for several reasons:

  1. Structure, while not formally defined, can still be implied.
  2. Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
  3. Unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated or unannounced.

Dealing with unstructured data

Techniques such as data mining, natural language processing (NLP), and text analytics provide different methods to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. The Unstructured Information Management Architecture (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.

Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structure that exist in all forms of human communication. [12] Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities and relevancy-based techniques then used to facilitate search and discovery. Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, Web page, or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, ...) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data". [13] For example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.

Since unstructured data commonly occurs in electronic documents, the use of a content or document management system which can categorize entire documents is often preferred over data transfer and manipulation from within the documents. Document management thus provides the means to convey structure onto document collections.

Search engines have become popular tools for indexing and searching through such data, especially text.

Approaches in natural language processing

Specific computational workflows have been developed to impose structure upon the unstructured data contained within text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, or far more than manual approaches to annotation may permit. Several of these approaches are based upon the concept of online analytical processing, or OLAP, and may be supported by data models such as text cubes. [14] Once document metadata is available through a data model, generating summaries of subsets of documents (i.e., cells within a text cube) may be performed with phrase-based approaches. [15]

Approaches in medicine and biomedical research

Biomedical research generates one major source of unstructured data as researchers often publish their findings in scholarly journals. Though the language in these documents is challenging to derive structural elements from (e.g., due to the complicated technical vocabulary contained within and the domain knowledge required to fully contextualize observations), the results of these activities may yield links between technical and medical studies [16] and clues regarding new disease therapies. [17] Recent efforts to enforce structure upon biomedical documents include self-organizing map approaches for identifying topics among documents, [18] general-purpose unsupervised algorithms, [19] and an application of the CaseOLAP workflow [15] to determine associations between protein names and cardiovascular disease topics in the literature. [20] CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications. [20]

The use of "unstructured" in data privacy regulations

In Sweden (EU), pre 2018, some data privacy regulations did not apply if the data in question was confirmed as "unstructured". [21] This terminology, unstructured data, is rarely used in the EU after GDPR came into force in 2018. GDPR does neither mention nor define "unstructured data". It does use the word "structured" as follows (without defining it);

GDPR Case-law on what defines a "filing system"; "the specific criterion and the specific form in which the set of personal data collected by each of the members who engage in preaching is actually structured is irrelevant, so long as that set of data makes it possible for the data relating to a specific person who has been contacted to be easily retrieved, which is however for the referring court to ascertain in the light of all the circumstances of the case in the main proceedings.” (CJEU, Todistajat v. Tietosuojavaltuutettu, Jehovan, Paragraph 61).

If personal data is easily retrieved - then it is a filing system and - then it is in scope for GDPR regardless of being "structured" or "unstructured". Most electronic systems today,[ as of? ] subject to access and applied software, can allow for easy retrieval of data.

See also

Notes

  1. ^ Today's Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn't An Option, Noel Yuhanna, Principal Analyst, Forrester Research, Nov 2010

Related Research Articles

<span class="mw-page-title-main">Data warehouse</span> Centralized storage of knowledge

In computing, a data warehouse, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating reports. This is beneficial for companies as it enables them to interrogate and draw insights from their data and make decisions.

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Business intelligence (BI) consists of strategies, methodologies, and technologies used by enterprises for data analysis and management of business information. Common functions of BI technologies include reporting, online analytical processing, analytics, dashboard development, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics, and prescriptive analytics.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.

Essbase is a multidimensional database management system (MDBMS) that provides a platform upon which to build analytic applications. Essbase began as a product from Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007. Until late 2005 IBM also marketed an OEM version of Essbase as DB2 OLAP Server.

Crystal Analysis is an On Line Analytical Processing (OLAP) application for analysing business data originally developed by Seagate Software.

Microsoft SQL Server Analysis Services (SSAS) is an online analytical processing (OLAP) and data mining tool in Microsoft SQL Server. SSAS is used as a tool by organizations to analyze and make sense of information possibly spread out across multiple databases, or in disparate tables or files. Microsoft has included a number of services in SQL Server related to business intelligence and data warehousing. These services include Integration Services, Reporting Services and Analysis Services. Analysis Services includes a group of OLAP and data mining capabilities and comes in two flavors multidimensional and tabular, where the difference between the two is how the data is presented. In a tabular model, the information is arranged in two-dimensional tables which can thus be more readable for a human. A multidimensional model can contain information with many degrees of freedom, and must be unfolded to increase readability by a human.

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.

Business intelligence software is a type of application software designed to retrieve, analyze, transform and report data for business intelligence (BI). The applications generally read data that has been previously stored, often - though not necessarily - in a data warehouse or data mart.

Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data. While Text analytics is a growing and mature field that has great value because of the huge amounts of data being produced, processing of noisy text is gaining in importance because a lot of common applications produce noisy text data. Noisy unstructured text data is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. Also, text produced by processing spontaneous speech using automatic speech recognition and printed or handwritten text using optical character recognition contains processing noise. Text produced under such circumstances is typically highly noisy containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuations, missing letter case information, pause filling words such as “um” and “uh” and other texting and speech disfluencies. Such text can be seen in large amounts in contact centers, chat rooms, optical character recognition (OCR) of text documents, short message service (SMS) text, etc. Documents with historical language can also be considered noisy with respect to today's knowledge about the language. Such text contains important historical, religious, ancient medical knowledge that is useful. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.

The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information.

<span class="mw-page-title-main">Pentaho</span> Business intelligence software

Pentaho is the brand name for several Data Management software products that make up the Pentaho+ Data Platform. These include Pentaho Data Integration, Pentaho Business AnalyticsPentaho Data Catalog, and Pentaho Data Optimiser. The Pentaho+ Platform helps organisations to become “data-fit” prior to operationalising AI.

Patent visualisation is an application of information visualisation. The number of patents has been increasing, encouraging companies to consider intellectual property as a part of their strategy. Patent visualisation, like patent mapping, is used to quickly view a patent portfolio.

The following is provided as an overview of and topical guide to databases:

The following outline is provided as an overview of and topical guide to natural-language processing:

<span class="mw-page-title-main">KH Coder</span> Qualitative data analysis software

KH Coder is an open source software for computer assisted qualitative data analysis, particularly quantitative content analysis and text mining. It can be also used for computational linguistics. It supports processing and etymological information of text in several languages, such as Japanese, English, French, German, Italian, Portuguese and Spanish. Specifically, it can contribute factual examination co-event system hub structure, computerized arranging guide, multidimensional scaling and comparative calculations. Word frequency statistics, part-of-speech analysis, grouping, correlation analysis, and visualization are among the features offered by KH Coder.

References

  1. Shilakes, Christopher C.; Tylman, Julie (16 Nov 1998). "Enterprise Information Portals" (PDF). Merrill Lynch. Archived from the original (PDF) on 24 July 2011.
  2. Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule". Breakthrough Analysis - Bridgepoints. Clarabridge.
  3. Gandomi, Amir; Haider, Murtaza (April 2015). "Beyond the hype: Big data concepts, methods, and analytics". International Journal of Information Management. 35 (2): 137–144. doi: 10.1016/j.ijinfomgt.2014.10.007 . ISSN   0268-4012.
  4. "The biggest data challenges that you might not even know you have - Watson". Watson. 2016-05-25. Retrieved 2018-10-02.
  5. "Structured vs. Unstructured Data". www.datamation.com. Retrieved 2018-10-02.
  6. "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected". www.emc.com. EMC Corporation. December 2012.
  7. "Trends | Seagate US". Seagate.com. Retrieved 2018-10-01.
  8. 1 2 Grimes, Seth. "A Brief History of Text Analytics". B Eye Network. Retrieved June 24, 2016.
  9. Albright, Russ. "Taming Text with the SVD" (PDF). SAS. Archived from the original (PDF) on 2016-09-30. Retrieved June 24, 2016.
  10. Desai, Manish (2009-08-09). "Applications of Text Analytics". My Business Analytics @ Blogspot. Retrieved June 24, 2016.
  11. Chakraborty, Goutam. "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining" (PDF). SAS. Retrieved June 24, 2016.
  12. "Structure, Models and Meaning: Is "unstructured" data merely unmodeled?". InformationWeek. March 1, 2005.
  13. Malone, Robert (April 5, 2007). "Structuring Unstructured Data". Forbes.
  14. Lin, Cindy Xide; Ding, Bolin; Han, Jiawei; Zhu, Feida; Zhao, Bo (December 2008). "Text Cube: Computing IR Measures for Multidimensional Text Database Analysis". 2008 Eighth IEEE International Conference on Data Mining. IEEE. pp. 905–910. CiteSeerX   10.1.1.215.3177 . doi:10.1109/icdm.2008.135. ISBN   9780769535029. S2CID   1522480.
  15. 1 2 Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han, Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes" (PDF).
  16. Collier, Nigel; Nazarenko, Adeline; Baud, Robert; Ruch, Patrick (June 2006). "Recent advances in natural language processing for biomedical applications". International Journal of Medical Informatics. 75 (6): 413–417. doi:10.1016/j.ijmedinf.2005.06.008. ISSN   1386-5056. PMID   16139564. S2CID   31449783.
  17. Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.; Greene, Casey S. (January 2016). "Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery". Briefings in Bioinformatics. 17 (1): 33–42. doi:10.1093/bib/bbv087. ISSN   1477-4054. PMC   4719073 . PMID   26420781.
  18. Skupin, André; Biberstine, Joseph R.; Börner, Katy (2013). "Visualizing the topical structure of the medical sciences: a self-organizing map approach". PLOS ONE. 8 (3): e58779. Bibcode:2013PLoSO...858779S. doi: 10.1371/journal.pone.0058779 . ISSN   1932-6203. PMC   3595294 . PMID   23554924.
  19. Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna (2015-04-01). "Unsupervised discovery of information structure in biomedical documents". Bioinformatics. 31 (7): 1084–1092. doi: 10.1093/bioinformatics/btu758 . ISSN   1367-4811. PMID   25411329.
  20. 1 2 Liem, David A.; Murali, Sanjana; Sigdel, Dibakar; Shi, Yu; Wang, Xuan; Shen, Jiaming; Choi, Howard; Caufield, John H.; Wang, Wei; Ping, Peipei; Han, Jiawei (Oct 1, 2018). "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease". American Journal of Physiology. Heart and Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018. ISSN   1522-1539. PMC   6230912 . PMID   29775406.
  21. "Swedish data privacy regulations discontinue separation of "unstructured" and "structured"".