Metadata discovery

Last updated October 05, 2025

Metadata discovery (also metadata harvesting) is the process of using automated tools to discover the semantics of the data elements in data sets. This process usually ends with a set of mappings between the data source elements and a centralized metadata registry. Metadata discovery is also known as metadata scanning.^[1]

Data source formats for metadata discovery

Data sets may come in a variety of forms, including:

Relational databases
NoSQL databases
Spreadsheets
XML files
Web services
Software source code in languages such as Fortran, Jovial, COBOL, Assembler, RPG, PL/1, EasyTrieve, Java, C# or C++, and many other programming languages
Unstructured text documents such as Microsoft Word or PDF files
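
As a rough illustration of the first step for two of these formats, the sketch below collects candidate element names from a relational database and from an XML file using only the Python standard library. SQLite stands in for "a relational database", and the function names sqlite_column_names and xml_element_names are hypothetical; a real discovery tool would need a reader for each source format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def sqlite_column_names(conn):
    """Return 'table.column' names for every table in a SQLite database."""
    names = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            names.append(f"{table}.{row[1]}")
    return names


def xml_element_names(xml_text):
    """Return the sorted set of element (tag) names used in an XML document."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()})
```

The names returned by such readers are the raw input that the matching algorithms described in the following section would try to link to registered data elements.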

A taxonomy of metadata matching algorithms

There are three distinct categories of automated metadata discovery:

Lexical matching

Exact match - data element linkages are made based on the exact name of a column in a database, the name of an XML element or a label on a screen. For example, if a database column is named "PersonBirthDate" and a data element in a metadata registry is also named "PersonBirthDate", automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
Synonym match - the discovery tool is given not just a single name but a set of synonyms for each data element.
Pattern match - the tool is given a set of lexical patterns that it can match. For example, the tool may search for "*gender*" or "*sex*".
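
The three lexical strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the REGISTRY contents, the synonym lists and the function name lexical_match are all hypothetical, and the choice to compare synonyms and patterns case-insensitively is an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: data element name -> synonyms and wildcard patterns.
REGISTRY = {
    "PersonBirthDate": {"synonyms": {"DateOfBirth", "DOB"}, "patterns": ["*birth*"]},
    "PersonGenderCode": {"synonyms": {"Gender"}, "patterns": ["*gender*", "*sex*"]},
}


def lexical_match(column_name):
    """Return (element, match_kind) for the first strategy that fits, else None."""
    # 1. Exact match: the source name equals the registered name.
    if column_name in REGISTRY:
        return column_name, "exact"
    lowered = column_name.lower()
    # 2. Synonym match: the source name is a known synonym (case-insensitive).
    for element, rules in REGISTRY.items():
        if lowered in {synonym.lower() for synonym in rules["synonyms"]}:
            return element, "synonym"
    # 3. Pattern match: the source name fits a wildcard pattern.
    for element, rules in REGISTRY.items():
        if any(fnmatchcase(lowered, pattern) for pattern in rules["patterns"]):
            return element, "pattern"
    return None
```

Trying the strategies from strictest to loosest keeps the most reliable linkage when several apply. Wildcard patterns can over-match (for instance, "*sex*" also fits a column named "EssexCounty"), so pattern hits are usually best treated as candidates for review.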

Semantic matching

Semantic matching attempts to use semantics to associate target data with registered data elements.

Semantic similarity - This algorithm relies on a database that records how conceptually near words are to one another. For example, the WordNet system can rank how close words are conceptually to each other, so the terms "Person", "Individual" and "Human" may be rated as highly similar concepts.
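
To make the idea of conceptual nearness concrete, the sketch below scores two terms by the length of the path between them in a tiny hand-written "is-a" hierarchy. The hierarchy is a hypothetical stand-in, not WordNet data, and the path-based score 1 / (1 + path length) is only one of several measures a real tool might use.

```python
# Toy "is-a" hierarchy: term -> parent concept (hypothetical, not WordNet data).
PARENT = {
    "person": "organism",
    "individual": "organism",
    "human": "organism",
    "organism": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """Score in (0, 1]: 1 / (1 + edges on the path through the nearest shared ancestor)."""
    chain_a, chain_b = ancestors(a.lower()), ancestors(b.lower())
    depth_in_b = {node: depth for depth, node in enumerate(chain_b)}
    for depth_a, node in enumerate(chain_a):
        if node in depth_in_b:
            return 1.0 / (1 + depth_a + depth_in_b[node])
    return 0.0  # no shared ancestor in the hierarchy


def closest_element(term, registered):
    """Pick the registered element name conceptually nearest to term."""
    return max(registered, key=lambda element: path_similarity(term, element))
```

With this toy data, "Individual" is closer to "Human" than to "Car" because the two share the nearby ancestor "organism", while "Car" is only related through the root.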

Statistical matching

Statistical matching uses statistics about the data source's data itself to derive similarities with registered data elements.

Distinct value analysis - By analyzing all the distinct values in a column, its similarity to a registered data element can be estimated. For example, if a column has only two distinct values, 'male' and 'female', it could be mapped to 'PersonGenderCode'.
Data distribution analysis - By analyzing the distribution of values within a single column and comparing it with the distributions of known data elements, a semantic linkage could be inferred.
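
Both statistical approaches can be illustrated with a short sketch. The choices here are assumptions made for the example rather than anything prescribed by the sources: distinct values are compared with Jaccard overlap, distributions are compared with total variation distance, and REFERENCE_VALUES, the 0.5 threshold and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical reference profiles: registered element -> expected distinct values.
REFERENCE_VALUES = {
    "PersonGenderCode": {"male", "female"},
    "OrderStatusCode": {"open", "shipped", "cancelled"},
}


def distinct_value_score(column, reference):
    """Jaccard overlap between a column's distinct values and a reference set."""
    observed = {str(value).strip().lower() for value in column if value is not None}
    union = observed | reference
    return len(observed & reference) / len(union) if union else 0.0


def best_distinct_match(column, threshold=0.5):
    """Return the best-scoring registered element, or None if below threshold."""
    scores = {
        element: distinct_value_score(column, reference)
        for element, reference in REFERENCE_VALUES.items()
    }
    element = max(scores, key=scores.get)
    return element if scores[element] >= threshold else None


def distribution_distance(column, reference_freqs):
    """Total variation distance (0 = identical, 1 = disjoint) between the
    column's relative frequencies and a reference distribution."""
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        return 1.0
    keys = set(counts) | set(reference_freqs)
    return 0.5 * sum(
        abs(counts[key] / total - reference_freqs.get(key, 0.0)) for key in keys
    )
```

A column would then be linked to the registered element with the highest overlap or the smallest distance, ideally subject to a threshold so that weak resemblances are not reported as mappings.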

Research

INDUS project at Iowa State University.
Mercury - A Distributed Metadata Management and Data Discovery System developed at the Oak Ridge National Laboratory DAAC.^[2]
National Digital Library of India.

References

[1] Cofield, Melanie. "LibGuides: Metadata Basics: Harvesting". guides.lib.utexas.edu. Retrieved 2025-06-04.
[2] Devarakonda, R.; Palanisamy, G.; Wilson, B.; Green, J. (2010). "Mercury: reusable metadata management, data discovery and access system". Earth Science Informatics. 3 (1). Springer Berlin / Heidelberg: 87–94. Bibcode:2010ESIn....3...87D. doi:10.1007/s12145-010-0050-7. S2CID 27597035.

Massive Data Analysis Systems by San Diego Supercomputer Center June 1997
IBM Whitepaper on Enterprise Metadata Discovery
White Paper on Metadata Management - by Esquire Innovations
Graham, Rebecca A. (1 September 2001). "Metadata harvesting". Library Hi Tech. 19 (3). Emerald Publishing. ISSN 0737-8831.
Simeoni, Fabio; Yakici, Murat; Neely, Steve; Crestani, Fabio (3 December 2007). "Metadata harvesting for content-based distributed information retrieval". Journal of the American Society for Information Science and Technology. 59 (1). Wiley Periodicals: 12–24. doi:10.1002/asi.20694. eISSN 1532-2890.
Nag, Ruben; Guhathakurta, Rahul (31 December 2024). "Metadata Harvesting: Applications and Influence in Digital Publishing". Open Access Cases. 1 (4). eISSN 3067-0349.

