Data classification (data management)

Data classification is the process of organizing data into categories based on attributes like file type, content, or metadata. The data is then assigned class labels that describe a set of attributes for the corresponding data sets. The goal is to provide meaningful class attributes to information that was formerly less structured.

Data classification can be viewed as the assignment of labels that define the type of data, particularly with respect to its confidentiality and integrity requirements.[1] Data classification is typically a manual process; however, there are tools that can help gather information about the data.[2] Data sensitivity levels are often taken into account when defining the classification scheme.[2]
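As a concrete illustration, a class label with a sensitivity level can be modelled as a simple data structure. The following sketch is hypothetical; the level names and fields are assumptions for illustration, not drawn from the sources cited here, and real schemes vary by organization:

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative sensitivity levels; real classification schemes differ per organization.
class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

# A class label bundles the attributes that describe a data set.
@dataclass
class ClassLabel:
    name: str                 # e.g. "Customer records" (hypothetical category)
    sensitivity: Sensitivity  # confidentiality/integrity requirement
    description: str = ""

label = ClassLabel(
    name="Customer records",
    sensitivity=Sensitivity.CONFIDENTIAL,
    description="Contains personal data; access on a need-to-know basis.",
)
print(label.name, label.sensitivity.name)
```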

Approaches

A corporate data classification policy sets out how employees are required to treat the different types of data they handle. Classification may also be automated: software algorithms analyze the content for keywords or phrases and assign a label accordingly. Automated classification might be used, for example, for reports generated by ERP systems or for data that contains identifiable personal information. In other cases, employees are responsible for deciding which label is appropriate.[3]
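A minimal sketch of keyword-based automated classification follows. The label names and keyword lists are illustrative assumptions, not taken from any particular product or from the sources cited above:

```python
# Minimal sketch of keyword-based automated classification.
# Labels and keyword lists are hypothetical examples.
KEYWORD_RULES = {
    "Confidential": ["salary", "social security number", "passport number"],
    "Internal": ["project plan", "meeting minutes"],
}
DEFAULT_LABEL = "Public"

def classify(text: str) -> str:
    """Return the first label whose keywords appear in the text, else the default."""
    lowered = text.lower()
    for label, keywords in KEYWORD_RULES.items():  # checked in order, most restrictive first
        if any(keyword in lowered for keyword in keywords):
            return label
    return DEFAULT_LABEL

print(classify("Attached are the salary figures for Q3."))  # -> Confidential
print(classify("Weekly newsletter for all readers."))       # -> Public
```

In practice, rule-based approaches like this are typically combined with pattern matching (for example, regular expressions for identification numbers) or machine-learning classifiers.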

See also

References

  1. Bar-Sinai, Michael; Sweeney, Latanya; Crosas, Merce (May 2016). "DataTags, Data Handling Policy Spaces and the Tags Language". 2016 IEEE Security and Privacy Workshops (SPW). IEEE. pp. 1–8. doi:10.1109/spw.2016.11. ISBN 978-1-5090-3690-5.
  2. "What is Data Classification? | Best Practices & Data Types | Imperva". Learning Center. Retrieved 2024-02-03.
  3. "What Is Data Classification And What Can It Do For My Business?". www.boldonjames.com. Retrieved 2019-03-05.