Metadata repository

Last updated

A metadata repository is a database created to store metadata. Metadata is information about the structures that contain the actual data. Metadata is often said to be "data about data", but this is misleading. Data profiles are an example of actual "data about data". Metadata adds one layer of abstraction to this definition– it is data about the structures that contain data. Metadata may describe the structure of any data, of any subject, stored in any format.

Contents

A well-designed metadata repository typically contains data far beyond simple definitions of the various data structures. Typical repositories store dozens to hundreds of separate pieces of information about each data structure.

Comparing the metadata of a couple data items - one digital and one physical - clarify what metadata is:

First, digital: For data stored in a database one may have a table called "Patient" with many columns, each containing data which describes a different attribute of each patient. One of these columns may be named "Patient_Last_Name". What is some of the metadata about the column that contains the actual surnames of patients in the database? We have already used two items: the name of the column that contains the data (Patient_Last_Name) and the name of the table that contains the column (Patient). Other metadata might include the maximum length of last name that may be entered, whether or not last name is required (can we have a patient without Patient_Last_Name?), and whether the database converts any surnames entered in lower case to upper case. Metadata of a security nature may show the restrictions which limit who may view these names.

Second, physical: For data stored in a brick and mortar library, one have many volumes and may have various media, including books. Metadata about books would include ISBN, Binding_Type, Page_Count, Author, etc. Within Binding_Type, metadata would include possible bindings, material, etc.

This contextual information of business data include meaning and content, policies that govern, technical attributes, specifications that transform, and programs that manipulate. [1] :171

Definition

The metadata repository is responsible for physically storing and cataloging metadata. Data in a metadata repository should be generic, integrated, current, and historical:

Generic
Meta model should store the metadata by generic terms instead of storing it by an applications-specific defined way, so that if your data base standard changes from one product to another the physical meta model of the metadata repository would not need to change.
Integration
of the metadata repository allows all business areas' metadata to be in an integrated fashion: Covering all domains and subject areas of the organization.
current and historical
The metadata repository should have accessible current and historical metadata. [2] Metadata repositories used to be referred to as a data dictionary. [1] :239

With the transition of needs for the metadata usage for business intelligence has increased so is the scope of the metadata repository increased. Earlier data dictionaries are the closest place to interact technology with business. Data dictionaries are the universe of metadata repository in the initial stages but as the scope increased Business glossary and their tags to variety of status flags emerged in the business side while consumption of the technology metadata, their lineage and linkages made the repository, the source for valuable reports to bring business and technology together and helped data management decisions easier as well as assess the cost of the changes.

Metadata repository explores the enterprise wide data governance, data quality and master data management (includes master data and reference data) and integrates this wealth of information with integrated metadata across the organization to provide decision support system for data structures, even though it only reflects the structures consumed from various systems.

Repository vs. registry

Repository has additional functionalities compared with registry. Metadata repository not only stores metadata like Metadata registry but also adds relationships with related metadata types. Metadata when related in a flow from its point of entry into organization up to the deliverables is considered as the lineage of that data point. Metadata when related across other related metadata types is called linkages. By providing the relationships to all the metadata points across the organization and maintaining its integrity with an architecture to handle the changes, metadata repository provides the basic material for understanding the complete data flow and their definitions and their impact. Also the important feature is to maintain the version control though this statement for contrasting is open for discussion. These definitions are still evolving, so the accuracy of the definitions needs refinement.

The purpose of registry is to define the metadata element and maintained across the organization. And data models and other data management teams refer to the registry for any changes to follow. While Metadata repository sources metadata from various metadata systems in the organizations and reflects what is in the upstream. Repository never acts as an upstream while registry is used as an upstream for metadata changes.

Reason for use

Metadata repository enables all the structure of the organizations data containers to one integrated place. This opens plethora of resourceful information for making calculated business decisions. This tool uses one generic form of data model to integrate all the models thus brings all the applications and programs of the organization into one format. And on top of it applying the business definitions and business processes brings the business and technology closer that will help organizations make reliable roadmaps with definite goals. With one stop information, business will have more control on the changes, and can do impact analysis of the tool. Usually business spends much time and money to make decisions based on discovery and research on impacts to make changes or to add new data structures or remove structures in data management of the organization. With a structured and well maintained repository, moving the product from ideation to delivery takes the least amount of time (considering other variables are constant). To sum it up:

  1. Integration of the metadata across the organization
  2. Build relationship between various metadata types
  3. Build relationship between various disparate systems
  4. Define business golden copy of definitions
  5. Version control of the changes at structure level
  6. Interaction with Reference data
  7. Link view to master data
  8. Automatic synchronization with various authorized metadata source systems
  9. More control to business decisions
  10. Validate the structures by overlapping the models
  11. Discovering discrepancies, gaps, lineage, metrics at data structure level

Each database management system (DBMS) and database tools have their own language for the metadata components within. Database applications already have their own repositories or registries that are expected to provide all of the necessary functionality to access the data stored within. Vendors do not want other companies to be capable of easily migrating data away from their products and into competitors products, so they are proprietary with the way they handle metadata. CASE tools, DBMS dictionaries, ETL tools, data cleansing tools, OLAP tools, and data mining tools all handle and store metadata differently. Only a metadata repository can be designed to store the metadata components from all of these tools. [3]

Design

Metadata repositories should store metadata in four classifications: ownership, descriptive characteristics, rules and policies, and physical characteristics. Ownership, showing the data owner and the application owner. The descriptive characteristics, define the names, types and lengths, and definitions describing business data or business processes. Rules and policies, will define security, data cleanliness, timelines for data, and relationships. Physical characteristics define the origin or source, and physical location. [1] :176 Like building a logical data model for creating a database, a logical meta model can help identify the metadata requirements for business data. [1] :185 The metadata repository will be centralized, decentralized, or distributed. A centralized design means that there is one database for the metadata repository that stores metadata for all applications business wide. A centralized metadata repository has the same advantages and disadvantages of a centralized database. Easier to manage because all the data is in one database, but the disadvantage is that bottlenecks may occur.

A decentralized metadata repository stores metadata in multiple databases, either separated by location and or departments of the business. This makes management of the repository more involved than a centralized metadata repository, but the advantage is that the metadata can be broken down into individual departments.

A distributed metadata repository uses a decentralized method, but unlike a decentralized metadata repository the metadata remains in its original application. An XML gateway is created [1] :246 that acts as a directory for accessing the metadata within each different application. The advantages and disadvantages for a distributed metadata repository mirror that of a distributed database.

Design of information model should include various layers of metadata types to be overlapped to create an integrated view of the data. Various metadata types should be stitched with related metadata elements in a top down model linking to business glossary.

Layers of Metadata:

  1. Business Glossary: contains recursive relationship to Business terms.
  2. Business tags: Contains various affiliation to that term or terms.
  3. Data Dictionary: contains information from data model tools for the definition of metadata elements and their technical definitions provided by data or enterprise architecture.
  4. Conceptual data models:
  5. Logical data models
  6. Physical data models
  7. Databases
  8. validation rules and data quality rules
  9. ETL, business rules and their relationship to attributes and entities
  10. Reports
  11. Source to target mapping artifacts (relationships)
  12. Reporting requirements (relationships)
  13. business processes and their relationship to technology
  14. people hierarchy and their relationship
  15. owner relationship

Entity-Relationship/Object-Oriented

Metadata repositories can be designed as either an Entity-relationship model, or an Object-oriented design.

See also

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations, including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues, including supporting concurrent access and fault tolerance.

<span class="mw-page-title-main">Data model</span> Model that organizes elements of data and how they relate to one another and to real-world entities.

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

<span class="mw-page-title-main">Computer-aided software engineering</span>

Computer-aided software engineering (CASE) is the domain of software tools used to design and implement applications. CASE tools are similar to and were partly inspired by Computer-Aided Design (CAD) tools used for designing hardware products. CASE tools were used for developing high-quality, defect-free, and maintainable software. CASE software is often associated with methods for the development of information systems together with automated tools that can be used in the software development process.

A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". Oracle defines it as a collection of tables with metadata. The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

<span class="mw-page-title-main">Data modeling</span> Creating a model of the data in a system

Data modeling in software engineering is the process of creating a data model for an information system by applying certain formal techniques.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval, and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one methods for importing content to bring manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

A metadata registry is a central location in an organization where metadata definitions are stored and maintained in a controlled method.

The ISO/IEC 11179 Metadata Registry (MDR) standard is an international ISO/IEC standard for representing metadata for an organization in a metadata registry. It documents the standardization and registration of metadata to make data understandable and shareable.

The semantic spectrum is a series of increasingly precise or rather semantically expressive definitions for data elements in knowledge representations, especially for machine use.

<span class="mw-page-title-main">Uniface (programming language)</span> Low-code development platform

Uniface is a low-code development and deployment platform for enterprise applications that can run in a large range of runtime environments, including mobile, mainframe, web, Service-oriented architecture (SOA), Windows, Java EE, and .NET. Uniface is used to create mission-critical applications.

A configuration management database (CMDB) is an ITIL term for a database used by an organization to store information about hardware and software assets. It is useful to break down configuration items into logical layers. This database acts as a data warehouse for the organization and also stores information regarding the relationships among its assets. The CMDB provides a means of understanding the organization's critical assets and their relationships, such as information systems, upstream sources or dependencies of assets, and the downstream targets of assets.

IMS VDEX, which stands for IMS Vocabulary Definition Exchange, in data management, is a mark-up language – or grammar – for controlled vocabularies developed by IMS Global as an open specification, with the Final Specification being approved in February 2004.

Entity–attribute–value model (EAV) is a data model to encode, in a space-efficient manner, entities where the number of attributes that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. Such entities correspond to the mathematical notion of a sparse matrix.

A data steward is an oversight or data governance role within an organization, and is responsible for ensuring the quality and fitness for purpose of the organization's data assets, including the metadata for those data assets. A data steward may share some responsibilities with a data custodian, such as the awareness, accessibility, release, appropriate use, security and management of data. A data steward would also participate in the development and implementation of data assets. A data steward may seek to improve the quality and fitness for purpose of other data assets their organization depends upon but is not responsible for.

LANSA is an integrated development environment (IDE) for building desktop, web and mobile software applications that can be deployed to Cloud, Windows, Linux and IBM i server platforms. The main feature of the LANSA environment is the 'RDML / RDMLX' language–which is classified as a 4GL. RDML closely follows the syntax of IBM CL, or Control Language. CL is the "scripting language" equivalent of the OS/400 operating system. In recent years RDML has been extended to become RDMLX. This new version of the language has extra features, commands, types, and functions that are used in component development. RDML, on Microsoft Windows, integrates with ActiveX.

Microsoft SQL Server Master Data Services (MDS) is a Master Data Management (MDM) product from Microsoft that ships as a part of the Microsoft SQL Server relational database management system. Master data management (MDM) allows an organization to discover and define non-transactional lists of data, and compile maintainable, reliable master lists. Master Data Services first shipped with Microsoft SQL Server 2008 R2. Microsoft SQL Server 2016 introduced enhancements to Master Data Services, such as improved performance and security, and the ability to clear transaction logs, create custom indexes, share entity data between different models, and support for many-to-many relationships.

A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.

<span class="mw-page-title-main">Metadata</span> Data about data

Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including:

IBM WebSphere Service Registry and Repository (WSRR) is a service registry for use in a Service-oriented architecture.

The following is provided as an overview of and topical guide to databases:

References

  1. 1 2 3 4 5 Moss, L. T.; Atre, S. (2003). Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Addison-Wesley Professional. ISBN   0-201-78420-3.
  2. Marco, D.; Jennings, M. (2004). Universal Metadata Models . Wiley. pp.  36–43. ISBN   0-471-08177-9.
  3. Marco, D. (2000). Building and Managing the Metadata Repository: A Full Lifecycle Guide. Wiley. ISBN   978-0471355236.