Data classification (data management)

Data classification is the process of organizing data into categories based on its attributes, such as file type, contents, or other metadata. The data is then assigned class labels that describe a set of attributes that hold true for the corresponding data sets. The goal is to provide meaningful class attributes to raw, unstructured information.
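
As a rough illustration of attribute-based labeling, the following minimal Python sketch assigns a class label to each file in a directory from two simple attributes, file type and size. The category names and extension mapping are assumptions made for the example; real classification tools also inspect contents and richer metadata.

```python
# Illustrative sketch only: the category names and extension mapping are
# assumptions, not a standard scheme; real tools also inspect file contents.
from pathlib import Path

EXTENSION_CLASSES = {
    ".csv": "tabular",
    ".xlsx": "tabular",
    ".txt": "text",
    ".docx": "text",
    ".jpg": "image",
    ".mp4": "video",
}

def classify_file(path: Path) -> dict:
    """Derive a class label from basic file attributes (type and size)."""
    return {
        "name": path.name,
        "class": EXTENSION_CLASSES.get(path.suffix.lower(), "unclassified"),
        "size_bytes": path.stat().st_size,
    }

if __name__ == "__main__":
    for entry in Path(".").iterdir():
        if entry.is_file():
            print(classify_file(entry))
```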


Big data analytics has demonstrated the importance of data classification in organizations today. [1] In the field of data management, data classification as a part of the Information Lifecycle Management (ILM) process can be defined as a tool for the categorization of data that helps organizations effectively answer the following questions:

Typically, data classification is viewed as a multitude of labels that are used to define the type of data, especially with regard to confidentiality and integrity issues. [3] When implemented, it provides a bridge between IT professionals and process or application owners: IT staff are informed about the value of the data, and management (usually application owners) better understands which parts of the data center need investment to keep operations running effectively. This can be particularly important in risk management, legal discovery, and compliance with government regulations. Data classification is typically a manual process; however, there are many tools from different vendors that can help gather information about the data. [4]

Data classification also needs to take data sensitivity levels into account. [4]
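
Sensitivity levels are often expressed as an ordered set of labels that downstream handling rules can key off. The sketch below uses hypothetical level names (Public, Internal, Confidential, Restricted) and made-up handling rules purely to illustrate the idea.

```python
# Minimal sketch of ordered sensitivity labels; the level names and the
# handling rules are illustrative assumptions, not a mandated standard.
from enum import IntEnum

class Sensitivity(IntEnum):
    PUBLIC = 0        # freely shareable
    INTERNAL = 1      # for employees and contractors only
    CONFIDENTIAL = 2  # limited to a named group
    RESTRICTED = 3    # highest protection, e.g. regulated personal data

def handling_rule(level: Sensitivity) -> str:
    """Map a sensitivity level to a coarse handling rule."""
    if level >= Sensitivity.CONFIDENTIAL:
        return "encrypt at rest and in transit; grant access on a need-to-know basis"
    if level == Sensitivity.INTERNAL:
        return "keep within the organization"
    return "no special handling required"

print(handling_rule(Sensitivity.RESTRICTED))
```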

How to start the process of data classification

Note that this classification structure is written from a data management perspective and therefore focuses on text and text-convertible binary data sources. Images, videos, and audio files are highly structured formats built for industry-standard APIs and do not readily fit within the classification scheme outlined below.

To start the data classification process, the various data applications and the data itself need to be evaluated and divided into their respective categories. For example, the process may look like:

Different types of data classification are used. Note that this designation is entirely orthogonal to the application-centric designation outlined above. Regardless of the structure inherited from the application, data may be of a certain type, such as:

1. Geographical

2. Chronological

3. Qualitative

4. Quantitative

The data should also be evaluated across three dimensions, as illustrated in the sketch after this list:

  1. Identifiability: how easily can this data be used to identify an individual?
  2. Sensitivity: how much damage could be done if this data reached the wrong hands?
  3. Scarcity: how readily available is this data? [6]
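
One way to make these dimensions actionable is to record a simple score for each and rank data sets for review. The sketch below uses a hypothetical 0 to 3 scale, a naive aggregate score, and made-up example data sets; none of these are part of any cited framework.

```python
# Illustrative sketch: profile a data set on the three dimensions above.
# The 0-3 scale, the aggregate score, and the example data sets are
# assumptions made for demonstration only.
from dataclasses import dataclass

@dataclass
class DataSetProfile:
    name: str
    identifiability: int  # 0 = anonymous, 3 = directly identifies an individual
    sensitivity: int      # 0 = harmless, 3 = severe damage if exposed
    scarcity: int         # 0 = publicly available, 3 = unique to the organization

    def review_priority(self) -> int:
        """Naive aggregate used only to rank data sets for closer review."""
        return self.identifiability + self.sensitivity + self.scarcity

profiles = [
    DataSetProfile("public product catalogue", 0, 0, 0),
    DataSetProfile("customer billing records", 3, 3, 2),
]
for profile in sorted(profiles, key=lambda p: p.review_priority(), reverse=True):
    print(profile.name, profile.review_priority())
```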

Basic criteria for semi-structured or poly-structured data classification

Note that any of these criteria may also apply to tabular or relational data as "basic criteria". These criteria are application-specific, rather than inherent aspects of the form in which the data is presented.

Basic criteria for relational or tabular data classification

These criteria are usually driven by application requirements, such as:

Note that any of these criteria may also apply to semi-structured or poly-structured data as "basic criteria". These criteria are application-specific, rather than inherent aspects of the form in which the data is presented.

Benefits of data classification

Effective implementation of appropriate data classification can significantly improve the ILM process and save data center storage resources. If implemented systemically, it can generate improvements in data center performance and utilization. Data classification can also reduce costs and administration overhead. “Good enough” data classification can produce these results:

Business data classification approaches

There are three different approaches to data classification within a business environment: paper-based classification, automated classification, and user-driven (or user-applied) classification. [7] Each of these techniques has its own benefits and pitfalls.

Paper-Based Classification Policy

A corporate data classification policy will set out how employees are required to treat the different types of data they handle, aligned with the organization's overall data security policy and strategy. A well-written policy enables users to make fast and intuitive decisions about the value of a piece of information and about the appropriate handling rules: for example, who can access the data and whether a rights management template should be invoked. The challenge, without any supporting technology, is ensuring that everyone is aware of the policy and implements it correctly.

Automated Classification Policy

This technique bypasses user involvement, ensuring that a classification policy is applied consistently across all touchpoints without the need for major communication and education programs.

Classifications are applied by software that analyzes content and assigns labels based on the keywords or phrases it contains. This approach comes into its own where certain types of data are created with no user involvement (for example, reports generated by ERP systems), or where the data includes specific personal information that is easily identified, such as credit card details.
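
As a sketch of this approach, the snippet below flags content with rule-based pattern matching. The two rules (a crude card-number pattern and a small keyword list) are assumptions chosen only to show the mechanism; production tools use far richer dictionaries and validate matches (for example with Luhn checks) to reduce the false positives discussed below.

```python
# Sketch of keyword/pattern-based automated classification. The rule names and
# patterns are illustrative assumptions; real products validate matches more
# carefully to avoid false positives and negatives.
import re

RULES = [
    ("payment-card-data", re.compile(r"\b(?:\d[ -]?){13,16}\b")),  # crude card-number pattern
    ("confidential-keyword", re.compile(r"\b(confidential|internal only)\b", re.IGNORECASE)),
]

def auto_classify(text: str) -> list[str]:
    """Return every label whose pattern matches the given content."""
    return [label for label, pattern in RULES if pattern.search(text)]

print(auto_classify("Card number 4111 1111 1111 1111 - internal only"))
# ['payment-card-data', 'confidential-keyword']
```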

However, automated solutions do not understand context and are therefore susceptible to inaccuracies, giving false positive results that can frustrate users and impede business processes, as well as false negative errors that expose organizations to sensitive data loss.

User-Driven Classification Policy

The data classification process can be completely automated, but it is most effective when the user is placed in the driving seat.

The user-driven classification technique makes employees themselves responsible for deciding which label is appropriate, and attaching it using a software tool at the point of creating, editing, sending, or saving. The advantage of involving the user in the process is that their insight into the context, business value and sensitivity of a piece of data enables them to make informed and accurate decisions about which label to apply. User-driven classification is an additional security layer often used to complement automated classification.
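
A minimal sketch of what the labeling step might look like from the tooling side, assuming a hypothetical sidecar-file format and label set; the key point is that the label the user chooses is captured at the moment the document is created, edited, or saved.

```python
# Sketch of user-driven labeling at save time. Storing the label in a sidecar
# file and the label names themselves are assumptions for illustration;
# commercial tools typically embed the label in the document's own metadata.
import json
from pathlib import Path

LABELS = ["Public", "Internal", "Confidential", "Restricted"]

def save_with_label(document_path: Path, chosen_label: str) -> None:
    """Persist the label the author chose alongside the document."""
    if chosen_label not in LABELS:
        raise ValueError(f"Unknown label: {chosen_label}")
    sidecar = Path(str(document_path) + ".label.json")
    sidecar.write_text(json.dumps({"classification": chosen_label}))

# At the point of saving, the authoring tool would prompt the user to choose:
save_with_label(Path("q3-forecast.xlsx"), "Confidential")
```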

Involving users in classification also leads to other organizational benefits including increased security awareness, an improved culture and the ability to monitor user behavior, which aids reporting and provides the ability to demonstrate compliance. Furthermore, managers can use this behavioral data to identify a possible insider threat, and address any concerns by providing additional guidance to users as appropriate, for example through additional training or by tightening up policy.

See also

Related Research Articles


In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

In telecommunication, provisioning involves the process of preparing and equipping a network to allow it to provide new services to its users. In National Security/Emergency Preparedness telecommunications services, "provisioning" equates to "initiation" and includes altering the state of an existing priority service or capability.

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science; which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.


Data modeling in software engineering is the process of creating a data model for an information system by applying certain formal techniques. It may be applied as part of broader Model-driven engineering (MDE) concept.

Identity management (IdM), also known as identity and access management, is a framework of policies and technologies to ensure that the right users have the appropriate access to technology resources. IdM systems fall under the overarching umbrellas of IT security and data management. Identity and access management systems not only identify, authenticate, and control access for individuals who will be utilizing IT resources but also the hardware and applications employees need to access.

Records management, also known as records and information management, is an organizational function devoted to the management of information in an organization throughout its life cycle, from the time of creation or receipt to its eventual disposition. This includes identifying, classifying, storing, securing, retrieving, tracking and destroying or permanently preserving records. The ISO 15489-1: 2001 standard defines records management as "[the] field of management responsible for the efficient and systematic control of the creation, receipt, maintenance, use and disposition of records, including the processes for capturing and maintaining evidence of and information about business activities and transactions in the form of records".

Database security concerns the use of a broad range of information security controls to protect databases against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural or administrative, and physical.

Database administration is the function of managing and maintaining database management systems (DBMS) software. Mainstream DBMS software such as Oracle, IBM Db2 and Microsoft SQL Server need ongoing management. As such, corporations that use DBMS software often hire specialized information technology personnel called database administrators or DBAs.

Information technology risk, IT risk, IT-related risk, or cyber risk is any risk relating to information technology. While information has long been appreciated as a valuable and important asset, the rise of the knowledge economy and the Digital Revolution has led to organizations becoming increasingly dependent on information, information processing and especially IT. Various events or incidents that compromise IT in some way can therefore cause adverse impacts on the organization's business processes or mission, ranging from inconsequential to catastrophic in scale.

Documentum is an enterprise content management platform developed by OpenText. EMC acquired Documentum for US$1.7 billion in December 2003. The Documentum platform was part of EMC's Enterprise Content Division (ECD) business unit, one of EMC's four operating divisions.

Identity correlation is, in information systems, a process that reconciles and validates the proper ownership of disparate user account login IDs that reside on systems and applications throughout an organization and can permanently link ownership of those user account login IDs to particular individuals by assigning a unique identifier to all validated account login IDs.

Business process management (BPM) is the discipline in which people use various methods to discover, model, analyze, measure, improve, optimize, and automate business processes. Any combination of methods used to manage a company's business processes is BPM. Processes can be structured and repeatable or unstructured and variable. Though not required, enabling technologies are often used with BPM.

NoSQL is an approach to database design that focuses on providing a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure. Since this non-relational database design does not require a schema, it offers rapid scalability to manage large and typically unstructured data sets. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.

Cloud computing security or, more simply, cloud security, refers to a broad set of policies, technologies, applications, and controls utilized to protect virtualized IP, data, applications, services, and the associated infrastructure of cloud computing. It is a sub-domain of computer security, network security, and, more broadly, information security.


Digital mailroom is the automation of incoming mail processes. Using document scanning and document capture technologies, companies can digitise incoming mail and automate the classification and distribution of mail within the organization. Both paper and electronic mail (email) can be managed through the same process allowing companies to standardize their internal mail distribution procedures and adhere to company compliance policies.

Security information and event management (SIEM) is a field within the field of computer security, where software products and services combine security information management (SIM) and security event management (SEM). SIEM is the core component of any typical Security Operations Center (SOC), which is the centralized response team addressing security issues within an organization.

Privacy engineering is an emerging field of engineering which aims to provide methodologies, tools, and techniques to ensure systems provide acceptable levels of privacy. Its focus lies in organizing and assessing methods to identify and tackle privacy concerns within the engineering of information systems.

System and Organization Controls as defined by the American Institute of Certified Public Accountants (AICPA), is the name of a suite of reports produced during an audit. It is intended for use by service organizations to issue validated reports of internal controls over those information systems to the users of those services. The reports focus on controls grouped into five categories called Trust Service Criteria. The Trust Services Criteria were established by The AICPA through its Assurance Services Executive Committee (ASEC) in 2017. These control criteria are to be used by the practitioner/examiner in attestation or consulting engagements to evaluate and report on controls of information systems offered as a service. The engagements can be done on an entity wide, subsidiary, division, operating unit, product line or functional area basis. The Trust Services Criteria were modeled in conformity to The Committee of Sponsoring Organizations of the Treadway Commission (COSO) Internal Control - Integrated Framework. In addition, the Trust Services Criteria can be mapped to NIST SP 800 - 53 criteria and to EU General Data Protection Regulation (GDPR) Articles. The AICPA auditing standard Statement on Standards for Attestation Engagements no. 18, section 320, "Reporting on an Examination of Controls at a Service Organization Relevant to User Entities' Internal Control Over Financial Reporting", defines two levels of reporting, type 1 and type 2. Additional AICPA guidance materials specify three types of reporting: SOC 1, SOC 2, and SOC 3.

Data sanitization involves the secure and permanent erasure of sensitive data from datasets and media to guarantee that no residual data can be recovered even through extensive forensic analysis. Data sanitization has a wide range of applications but is mainly used for clearing out end-of-life electronic devices or for the sharing and use of large datasets that contain sensitive information. The main strategies for erasing personal data from devices are physical destruction, cryptographic erasure, and data erasure. While the term data sanitization may lead some to believe that it only includes data on electronic media, the term also broadly covers physical media, such as paper copies. These data types are termed soft for electronic files and hard for physical media paper copies. Data sanitization methods are also applied for the cleaning of sensitive data, such as through heuristic-based methods, machine-learning based methods, and k-source anonymity.

References

  1. Grover, Purva; Kar, Arpan Kumar (2017-06-13). "Big Data Analytics: A Review on Theoretical Contributions and Tools Used in Literature". Global Journal of Flexible Systems Management. 18 (3): 203–229. doi:10.1007/s40171-017-0159-3. ISSN 0972-2696.
  2. Knight, Michelle (2021-08-26). "What Are Data Regulations?". DATAVERSITY. Retrieved 2022-10-26.
  3. Bar-Sinai, Michael; Sweeney, Latanya; Crosas, Merce (May 2016). "DataTags, Data Handling Policy Spaces and the Tags Language". 2016 IEEE Security and Privacy Workshops (SPW). IEEE. pp. 1–8. doi:10.1109/spw.2016.11. ISBN 978-1-5090-3690-5.
  4. "What is Data Classification? | Best Practices & Data Types | Imperva". Learning Center. Retrieved 2024-02-03.
  5. "Get the scoop on data classification and GDPR before you're too late - LightsOnData". LightsOnData. 2018-05-23. Retrieved 2018-05-23.
  6. Khatibloo, Fatemeh (May 2017). "How Dirty Is Your Data? Strategic Plan: The Customer Trust And Privacy Playbook". The Customer Trust and Privacy Playbook for 2018.
  7. "What Is Data Classification And What Can It Do For My Business? | Boldon James". www.boldonjames.com. Retrieved 2019-03-05.