Data classification (data management)

Data classification is the process of organizing data into categories based on its attributes, such as file type, contents, or other metadata. The data is then assigned class labels that describe a set of attributes that hold true for the corresponding data sets. The goal is to provide meaningful class attributes to raw, unstructured information.
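
As a rough illustration of attribute-based labeling, the following minimal Python sketch assigns a class label to each file in a directory from two simple attributes, file type and size. The category names and extension mapping are assumptions made for the example; real classification tools also inspect contents and richer metadata.

```python
# Illustrative sketch only: the category names and extension mapping are
# assumptions, not a standard scheme; real tools also inspect file contents.
from pathlib import Path

EXTENSION_CLASSES = {
    ".csv": "tabular",
    ".xlsx": "tabular",
    ".txt": "text",
    ".docx": "text",
    ".jpg": "image",
    ".mp4": "video",
}

def classify_file(path: Path) -> dict:
    """Derive a class label from basic file attributes (type and size)."""
    return {
        "name": path.name,
        "class": EXTENSION_CLASSES.get(path.suffix.lower(), "unclassified"),
        "size_bytes": path.stat().st_size,
    }

if __name__ == "__main__":
    for entry in Path(".").iterdir():
        if entry.is_file():
            print(classify_file(entry))
```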


Big data analytics has demonstrated the importance of data classification in organizations today. [1] In the field of data management, data classification as a part of the Information Lifecycle Management (ILM) process can be defined as a tool for the categorization of data that helps organizations effectively answer the following questions:

Typically, data classification is viewed as a multitude of labels that are used to define the type of data, especially with regard to confidentiality and integrity issues. [3] When implemented, it provides a bridge between IT professionals and process or application owners: IT staff are informed about the value of the data, and management (usually application owners) better understands which parts of the data center need investment to keep operations running effectively. This can be particularly important in risk management, legal discovery, and compliance with government regulations. Data classification is typically a manual process; however, there are many tools from different vendors that can help gather information about the data. [4]

Data classification also needs to take data sensitivity levels into account. [4]
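
Sensitivity levels are often expressed as an ordered set of labels that downstream handling rules can key off. The sketch below uses hypothetical level names (Public, Internal, Confidential, Restricted) and made-up handling rules purely to illustrate the idea.

```python
# Minimal sketch of ordered sensitivity labels; the level names and the
# handling rules are illustrative assumptions, not a mandated standard.
from enum import IntEnum

class Sensitivity(IntEnum):
    PUBLIC = 0        # freely shareable
    INTERNAL = 1      # for employees and contractors only
    CONFIDENTIAL = 2  # limited to a named group
    RESTRICTED = 3    # highest protection, e.g. regulated personal data

def handling_rule(level: Sensitivity) -> str:
    """Map a sensitivity level to a coarse handling rule."""
    if level >= Sensitivity.CONFIDENTIAL:
        return "encrypt at rest and in transit; grant access on a need-to-know basis"
    if level == Sensitivity.INTERNAL:
        return "keep within the organization"
    return "no special handling required"

print(handling_rule(Sensitivity.RESTRICTED))
```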

How to start the process of data classification

Note that this classification structure is written from a data management perspective and therefore focuses on text and text-convertible binary data sources. Images, videos, and audio files are highly structured formats built for industry-standard APIs and do not readily fit within the classification scheme outlined below.

To start the data classification process, the various data applications and the data itself need to be evaluated and divided into their respective categories. For example, the process may look like:

Different types of data classification are used. Note that this designation is entirely orthogonal to the application-centric designation outlined above. Regardless of the structure inherited from the application, data may be of a certain type, such as:

1. Geographical

2. Chronological

3. Qualitative

4. Quantitative

The data should also be evaluated across three dimensions, as illustrated in the sketch after this list:

  1. Identifiability: how easily can this data be used to identify an individual?
  2. Sensitivity: how much damage could be done if this data reached the wrong hands?
  3. Scarcity: how readily available is this data? [6]
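
One way to make these dimensions actionable is to record a simple score for each and rank data sets for review. The sketch below uses a hypothetical 0 to 3 scale, a naive aggregate score, and made-up example data sets; none of these are part of any cited framework.

```python
# Illustrative sketch: profile a data set on the three dimensions above.
# The 0-3 scale, the aggregate score, and the example data sets are
# assumptions made for demonstration only.
from dataclasses import dataclass

@dataclass
class DataSetProfile:
    name: str
    identifiability: int  # 0 = anonymous, 3 = directly identifies an individual
    sensitivity: int      # 0 = harmless, 3 = severe damage if exposed
    scarcity: int         # 0 = publicly available, 3 = unique to the organization

    def review_priority(self) -> int:
        """Naive aggregate used only to rank data sets for closer review."""
        return self.identifiability + self.sensitivity + self.scarcity

profiles = [
    DataSetProfile("public product catalogue", 0, 0, 0),
    DataSetProfile("customer billing records", 3, 3, 2),
]
for profile in sorted(profiles, key=lambda p: p.review_priority(), reverse=True):
    print(profile.name, profile.review_priority())
```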

Basic criteria for semi-structured or poly-structured data classification

Note that any of these criteria may also apply to tabular or relational data as "basic criteria". These criteria are application-specific, rather than inherent aspects of the form in which the data is presented.

Basic criteria for relational or tabular data classification

These criteria are usually driven by application requirements, such as:

Note that any of these criteria may also apply to semi-structured or poly-structured data as "basic criteria". These criteria are application-specific, rather than inherent aspects of the form in which the data is presented.

Benefits of data classification

Effective implementation of appropriate data classification can significantly improve the ILM process and save data center storage resources. If implemented systemically, it can generate improvements in data center performance and utilization. Data classification can also reduce costs and administration overhead. “Good enough” data classification can produce these results:

Business data classification approaches

There are three different approaches to data classification within a business environment: paper-based classification, automated classification, and user-driven (or user-applied) classification. [7] Each of these techniques has its own benefits and pitfalls.

Paper-Based Classification Policy

A corporate data classification policy will set out how employees are required to treat the different types of data they handle, aligned with the organization's overall data security policy and strategy. A well-written policy enables users to make fast and intuitive decisions about the value of a piece of information and about the appropriate handling rules: for example, who can access the data and whether a rights management template should be invoked. The challenge, without any supporting technology, is ensuring that everyone is aware of the policy and implements it correctly.

Automated Classification Policy

This technique bypasses user involvement, ensuring that a classification policy is applied consistently across all touchpoints without the need for major communication and education programs.

Classifications are applied by software that analyzes content and assigns labels based on the keywords or phrases it contains. This approach comes into its own where certain types of data are created with no user involvement (for example, reports generated by ERP systems), or where the data includes specific personal information that is easily identified, such as credit card details.
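
As a sketch of this approach, the snippet below flags content with rule-based pattern matching. The two rules (a crude card-number pattern and a small keyword list) are assumptions chosen only to show the mechanism; production tools use far richer dictionaries and validate matches (for example with Luhn checks) to reduce the false positives discussed below.

```python
# Sketch of keyword/pattern-based automated classification. The rule names and
# patterns are illustrative assumptions; real products validate matches more
# carefully to avoid false positives and negatives.
import re

RULES = [
    ("payment-card-data", re.compile(r"\b(?:\d[ -]?){13,16}\b")),  # crude card-number pattern
    ("confidential-keyword", re.compile(r"\b(confidential|internal only)\b", re.IGNORECASE)),
]

def auto_classify(text: str) -> list[str]:
    """Return every label whose pattern matches the given content."""
    return [label for label, pattern in RULES if pattern.search(text)]

print(auto_classify("Card number 4111 1111 1111 1111 - internal only"))
# ['payment-card-data', 'confidential-keyword']
```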

However, automated solutions do not understand context and are therefore susceptible to inaccuracies, giving false positive results that can frustrate users and impede business processes, as well as false negative errors that expose organizations to sensitive data loss.

User-Driven Classification Policy

The data classification process can be completely automated, but it is most effective when the user is placed in the driving seat.

The user-driven classification technique makes employees themselves responsible for deciding which label is appropriate, and attaching it using a software tool at the point of creating, editing, sending, or saving. The advantage of involving the user in the process is that their insight into the context, business value and sensitivity of a piece of data enables them to make informed and accurate decisions about which label to apply. User-driven classification is an additional security layer often used to complement automated classification.
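
A minimal sketch of what the labeling step might look like from the tooling side, assuming a hypothetical sidecar-file format and label set; the key point is that the label the user chooses is captured at the moment the document is created, edited, or saved.

```python
# Sketch of user-driven labeling at save time. Storing the label in a sidecar
# file and the label names themselves are assumptions for illustration;
# commercial tools typically embed the label in the document's own metadata.
import json
from pathlib import Path

LABELS = ["Public", "Internal", "Confidential", "Restricted"]

def save_with_label(document_path: Path, chosen_label: str) -> None:
    """Persist the label the author chose alongside the document."""
    if chosen_label not in LABELS:
        raise ValueError(f"Unknown label: {chosen_label}")
    sidecar = Path(str(document_path) + ".label.json")
    sidecar.write_text(json.dumps({"classification": chosen_label}))

# At the point of saving, the authoring tool would prompt the user to choose:
save_with_label(Path("q3-forecast.xlsx"), "Confidential")
```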

Involving users in classification also leads to other organizational benefits including increased security awareness, an improved culture and the ability to monitor user behavior, which aids reporting and provides the ability to demonstrate compliance. Furthermore, managers can use this behavioral data to identify a possible insider threat, and address any concerns by providing additional guidance to users as appropriate, for example through additional training or by tightening up policy.

See also

Related Research Articles


In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

In telecommunication, provisioning involves the process of preparing and equipping a network to allow it to provide new services to its users. In National Security/Emergency Preparedness telecommunications services, "provisioning" equates to "initiation" and includes altering the state of an existing priority service or capability.

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science; which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.


Data modeling in software engineering is the process of creating a data model for an information system by applying certain formal techniques. It may be applied as part of broader Model-driven engineering (MDE) concept.

Identity management (IdM), also known as identity and access management, is a framework of policies and technologies to ensure that the right users have the appropriate access to technology resources. IdM systems fall under the overarching umbrellas of IT security and data management. Identity and access management systems not only identify, authenticate, and control access for individuals who will be utilizing IT resources but also the hardware and applications employees need to access.

Records management, also known as records and information management, is an organizational function devoted to the management of information in an organization throughout its life cycle, from the time of creation or receipt to its eventual disposition. This includes identifying, classifying, storing, securing, retrieving, tracking and destroying or permanently preserving records. The ISO 15489-1: 2001 standard defines records management as "[the] field of management responsible for the efficient and systematic control of the creation, receipt, maintenance, use and disposition of records, including the processes for capturing and maintaining evidence of and information about business activities and transactions in the form of records".

Database security concerns the use of a broad range of information security controls to protect databases against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural or administrative, and physical.

Database administration is the function of managing and maintaining database management systems (DBMS) software. Mainstream DBMS software such as Oracle, IBM Db2 and Microsoft SQL Server need ongoing management. As such, corporations that use DBMS software often hire specialized information technology personnel called database administrators or DBAs.

Information technology risk, IT risk, IT-related risk, or cyber risk is any risk relating to information technology. While information has long been appreciated as a valuable and important asset, the rise of the knowledge economy and the Digital Revolution has led to organizations becoming increasingly dependent on information, information processing and especially IT. Various events or incidents that compromise IT in some way can therefore cause adverse impacts on the organization's business processes or mission, ranging from inconsequential to catastrophic in scale.

Documentum is an enterprise content management platform developed by OpenText. EMC acquired Documentum for US$1.7 billion in December 2003. The Documentum platform was part of EMC's Enterprise Content Division (ECD) business unit, one of EMC's four operating divisions.

Identity correlation is, in information systems, a process that reconciles and validates the proper ownership of disparate user account login IDs that reside on systems and applications throughout an organization and can permanently link ownership of those user account login IDs to particular individuals by assigning a unique identifier to all validated account login IDs.

Business process management (BPM) is the discipline in which people use various methods to discover, model, analyze, measure, improve, optimize, and automate business processes. Any combination of methods used to manage a company's business processes is BPM. Processes can be structured and repeatable or unstructured and variable. Though not required, enabling technologies are often used with BPM.

NoSQL is an approach to database design that focuses on providing a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure. Since this non-relational database design does not require a schema, it offers rapid scalability to manage large and typically unstructured data sets. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.

Cloud computing security or, more simply, cloud security, refers to a broad set of policies, technologies, applications, and controls utilized to protect virtualized IP, data, applications, services, and the associated infrastructure of cloud computing. It is a sub-domain of computer security, network security, and, more broadly, information security.


Digital mailroom is the automation of incoming mail processes. Using document scanning and document capture technologies, companies can digitise incoming mail and automate the classification and distribution of mail within the organization. Both paper and electronic mail (email) can be managed through the same process allowing companies to standardize their internal mail distribution procedures and adhere to company compliance policies.

Security information and event management (SIEM) is a field within the field of computer security, where software products and services combine security information management (SIM) and security event management (SEM). SIEM is the core component of any typical Security Operations Center (SOC), which is the centralized response team addressing security issues within an organization.

Privacy engineering is an emerging field of engineering which aims to provide methodologies, tools, and techniques to ensure systems provide acceptable levels of privacy. Its focus lies in organizing and assessing methods to identify and tackle privacy concerns within the engineering of information systems.

System and Organization Controls as defined by the American Institute of Certified Public Accountants (AICPA), is the name of a suite of reports produced during an audit. It is intended for use by service organizations to issue validated reports of internal controls over those information systems to the users of those services. The reports focus on controls grouped into five categories called Trust Service Criteria. The Trust Services Criteria were established by The AICPA through its Assurance Services Executive Committee (ASEC) in 2017. These control criteria are to be used by the practitioner/examiner in attestation or consulting engagements to evaluate and report on controls of information systems offered as a service. The engagements can be done on an entity wide, subsidiary, division, operating unit, product line or functional area basis. The Trust Services Criteria were modeled in conformity to The Committee of Sponsoring Organizations of the Treadway Commission (COSO) Internal Control - Integrated Framework. In addition, the Trust Services Criteria can be mapped to NIST SP 800 - 53 criteria and to EU General Data Protection Regulation (GDPR) Articles. The AICPA auditing standard Statement on Standards for Attestation Engagements no. 18, section 320, "Reporting on an Examination of Controls at a Service Organization Relevant to User Entities' Internal Control Over Financial Reporting", defines two levels of reporting, type 1 and type 2. Additional AICPA guidance materials specify three types of reporting: SOC 1, SOC 2, and SOC 3.

Data sanitization involves the secure and permanent erasure of sensitive data from datasets and media to guarantee that no residual data can be recovered even through extensive forensic analysis. Data sanitization has a wide range of applications but is mainly used for clearing out end-of-life electronic devices or for the sharing and use of large datasets that contain sensitive information. The main strategies for erasing personal data from devices are physical destruction, cryptographic erasure, and data erasure. While the term data sanitization may lead some to believe that it only includes data on electronic media, the term also broadly covers physical media, such as paper copies. These data types are termed soft for electronic files and hard for physical media paper copies. Data sanitization methods are also applied for the cleaning of sensitive data, such as through heuristic-based methods, machine-learning based methods, and k-source anonymity.

References

  1. Grover, Purva; Kar, Arpan Kumar (2017-06-13). "Big Data Analytics: A Review on Theoretical Contributions and Tools Used in Literature". Global Journal of Flexible Systems Management. 18 (3): 203–229. doi:10.1007/s40171-017-0159-3. ISSN 0972-2696.
  2. Knight, Michelle (2021-08-26). "What Are Data Regulations?". DATAVERSITY. Retrieved 2022-10-26.
  3. Bar-Sinai, Michael; Sweeney, Latanya; Crosas, Merce (May 2016). "DataTags, Data Handling Policy Spaces and the Tags Language". 2016 IEEE Security and Privacy Workshops (SPW). IEEE. pp. 1–8. doi:10.1109/spw.2016.11. ISBN 978-1-5090-3690-5.
  4. "What is Data Classification? | Best Practices & Data Types | Imperva". Learning Center. Retrieved 2024-02-03.
  5. "Get the scoop on data classification and GDPR before you're too late - LightsOnData". LightsOnData. 2018-05-23. Retrieved 2018-05-23.
  6. Khatibloo, Fatemeh (May 2017). "How Dirty Is Your Data? Strategic Plan: The Customer Trust And Privacy Playbook". The Customer Trust and Privacy Playbook for 2018.
  7. "What Is Data Classification And What Can It Do For My Business? | Boldon James". www.boldonjames.com. Retrieved 2019-03-05.