Profiling (information science)

In information science, profiling refers to the process of construction and application of user profiles generated by computerized data analysis.

Profiling is the use of algorithms or other mathematical techniques that allow the discovery of patterns or correlations in large quantities of data aggregated in databases. When these patterns or correlations are used to identify or represent people, they can be called profiles. Apart from the discussion of profiling technologies or population profiling, the notion of profiling in this sense is not just about the construction of profiles; it also concerns the application of group profiles to individuals, e.g., in credit scoring, price discrimination, or the identification of security risks (Hildebrandt & Gutwirth 2008)(Elmer 2004).

Profiling is being used in fraud prevention, ambient intelligence, and consumer analytics. Statistical methods of profiling include Knowledge Discovery in Databases (KDD).

The profiling process

The technical process of profiling can be separated into several steps:

Data collection, preparation and mining all belong to the phase in which the profile is under construction. However, profiling also refers to the application of profiles, meaning the use of profiles to identify or categorize groups or individual persons. As can be seen in step six (application), the process is circular: there is a feedback loop between the construction and the application of profiles. The interpretation of profiles can lead to iterative, possibly real-time, fine-tuning of specific previous steps in the profiling process. The application of profiles to people whose data were not used to construct the profile is based on data matching, which provides new data that allow for further adjustments. The process of profiling is therefore both dynamic and adaptive. A good illustration of this dynamic and adaptive nature is the Cross-Industry Standard Process for Data Mining (CRISP-DM).

Types of profiling practices

To clarify the nature of profiling technologies, some crucial distinctions have to be made between different types of profiling practices, apart from the distinction between the construction and the application of profiles. The main distinctions are those between top-down and bottom-up profiling (or supervised and unsupervised learning), and between individual and group profiles.

Supervised and unsupervised learning

Profiles can be classified according to the way they have been generated (Fayyad, Piatetsky-Shapiro & Smyth 1996)(Zarsky 2002-3). On the one hand, profiles can be generated by testing a hypothesized correlation. This is called top-down profiling or supervised learning. It is similar to the methodology of traditional scientific research in that it starts with a hypothesis and consists of testing its validity. The result of this type of profiling is the verification or refutation of the hypothesis. One could also speak of deductive profiling. On the other hand, profiles can be generated by exploring a database, using the data mining process to detect patterns that were not previously hypothesized. In a way, this is a matter of generating hypotheses: finding correlations one did not expect or even think of. Once the patterns have been mined, they enter the feedback loop described above and are tested against new data. This is called bottom-up profiling or unsupervised learning.

Two things are important with regard to this distinction. First, unsupervised learning algorithms seem to allow the construction of a new type of knowledge, based neither on hypotheses developed by a researcher nor on causal or motivational relations, but exclusively on stochastic correlations. Second, unsupervised learning algorithms thus seem to allow for an inductive type of knowledge construction that does not require theoretical justification or causal explanation (Custers 2004).

Some authors claim that if the application of profiles based on computerized stochastic pattern recognition 'works', i.e. allows for reliable predictions of future behaviours, the theoretical or causal explanation of these patterns no longer matters (Anderson 2008). However, the idea that 'blind' algorithms provide reliable information does not imply that the information is neutral. In the process of collecting and aggregating data into a database (the first three steps of the process of profile construction), translations are made from real-life events to machine-readable data. These data are then prepared and cleansed to allow for initial computability. Potential bias will have to be located at these points, as well as in the choice of the algorithms that are developed. It is not possible to mine a database for all possible linear and non-linear correlations, meaning that the mathematical techniques developed to search for patterns will determine which patterns can be found. In the case of machine profiling, potential bias is not informed by common-sense prejudice or what psychologists call stereotyping, but by the computer techniques employed in the initial steps of the process. These techniques are mostly invisible to those to whom profiles are applied (because their data match the relevant group profiles).

Individual and group profiles

Profiles must also be classified according to the kind of subject they refer to. This subject can either be an individual or a group of people. When a profile is constructed with the data of a single person, this is called individual profiling (Jaquet-Chiffelle 2008). This kind of profiling is used to discover the particular characteristics of a certain individual, to enable unique identification or the provision of personalized services. However, personalized servicing is most often also based on group profiling, which allows categorisation of a person as a certain type of person, based on the fact that her profile matches a profile that has been constructed on the basis of massive amounts of data about massive numbers of other people. A group profile can refer to the result of data mining in data sets that refer to an existing community that considers itself as such, like a religious group, a tennis club, a university, a political party, etc. In that case it can describe previously unknown patterns of behaviour or other characteristics of such a group (community). A group profile can also refer to a category of people who do not form a community, but are found to share previously unknown patterns of behaviour or other characteristics (Custers 2004). In that case the group profile describes specific behaviours or other characteristics of a category of people, for instance women with blue eyes and red hair, or adults with relatively short arms and legs. These categories may be found to correlate with health risks, earning capacity, mortality rates, credit risks, etc.

If an individual profile is applied to the individual that it was mined from, then that is direct individual profiling. If a group profile is applied to an individual whose data match the profile, then that is indirect individual profiling, because the profile was generated using data of other people. Similarly, if a group profile is applied to the group that it was mined from, then that is direct group profiling (Jaquet-Chiffelle 2008). However, insofar as the application of a group profile to a group implies the application of the group profile to individual members of the group, it makes sense to speak of indirect group profiling, especially if the group profile is non-distributive.
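Indirect individual profiling by data matching can be sketched as follows. The `GroupProfile` class, its field names, and the example postal code and risk label are all hypothetical, chosen only to make the mechanism concrete; real systems match on far richer and fuzzier criteria.

```python
from dataclasses import dataclass

@dataclass
class GroupProfile:
    """A group profile mined from other people's data (hypothetical structure)."""
    name: str
    conditions: dict  # attribute -> value a person must have to match
    inferred: dict    # attributes attributed to anyone who matches

def matches(person, profile):
    """Data matching: the person's data satisfy every condition of the profile."""
    return all(person.get(k) == v for k, v in profile.conditions.items())

def apply_profiles(person, profiles):
    """Indirect individual profiling: attribute group properties to one person."""
    attributed = {}
    for profile in profiles:
        if matches(person, profile):
            attributed.update(profile.inferred)
    return attributed

# Hypothetical profile, assumed to have been mined from data about other people.
profiles = [
    GroupProfile(
        name="postal_code_1234",
        conditions={"postal_code": "1234"},
        inferred={"credit_risk": "elevated"},
    ),
]

# A new person whose data were not used to construct the profile.
newcomer = {"postal_code": "1234", "age": 41}
outsider = {"postal_code": "9999", "age": 41}

newcomer_result = apply_profiles(newcomer, profiles)
outsider_result = apply_profiles(outsider, profiles)
```

The newcomer receives the `credit_risk` attribute purely because of the match, not because of anything observed about the newcomer's own credit behaviour.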

Distributive and non-distributive profiling

Group profiles can also be divided in terms of their distributive character (Vedder 1999). A group profile is distributive when its properties apply equally to all the members of its group: all bachelors are unmarried, or all persons with a specific gene have an 80% chance of contracting a specific disease. A profile is non-distributive when it does not necessarily apply to all the members of the group: the group of persons with a specific postal code has an average earning capacity of XX, or the category of persons with blue eyes has an average chance of 37% of contracting a specific disease. Note that in this case the chance of an individual having a particular earning capacity or contracting the specific disease will depend on other factors, e.g. sex, age, background of parents, previous health, education. It should be obvious that, apart from tautological profiles like that of bachelors, most group profiles generated by means of computer techniques are non-distributive. This has far-reaching implications for the accuracy of indirect individual profiling based on data matching with non-distributive group profiles. Quite apart from the fact that the application of accurate profiles may be unfair or cause undue stigmatisation, most group profiles will not be accurate when applied to an individual member.
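A small numerical sketch shows why a non-distributive profile can be accurate for the group and still misdescribe its members. The earnings figures and the 10% tolerance are invented for illustration.

```python
from statistics import mean

# Hypothetical yearly earnings (in thousands) of everyone in one postal code.
earnings = [18, 22, 25, 31, 40, 44, 95, 125]

# The non-distributive group profile: a property of the group as a whole.
group_average = mean(earnings)

def share_close_to_profile(values, profile_value, tolerance=0.10):
    """Fraction of members whose own value lies within `tolerance`
    (relative) of the value stated in the group profile."""
    close = [v for v in values
             if abs(v - profile_value) <= tolerance * profile_value]
    return len(close) / len(values)

share = share_close_to_profile(earnings, group_average)
```

With these numbers the group average is 50, yet no member earns between 45 and 55, so attributing the average to any individual would be wrong in every case.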

Applications

In the financial sector, institutions use profiling technologies for fraud prevention and credit scoring. Banks want to minimize the risks in giving credit to their customers. On the basis of extensive group profiling, customers are assigned a scoring value that indicates their creditworthiness. Financial institutions like banks and insurance companies also use group profiling to detect fraud or money laundering. Databases with transactions are searched with algorithms to find behaviours that deviate from the standard, indicating potentially suspicious transactions. [1]
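Searching transactions for behaviour that "deviates from the standard" can be sketched with a simple robust outlier rule. This is an assumption-laden illustration, not a description of how any institution actually works: the amounts are invented, and the median-absolute-deviation rule with a cut-off factor `k` of 5 is one of many possible definitions of deviation.

```python
from statistics import median

def flag_deviations(amounts, k=5.0):
    """Return (index, amount) pairs whose distance from the median exceeds
    k times the median absolute deviation (MAD) of all amounts."""
    centre = median(amounts)
    mad = median(abs(a - centre) for a in amounts)
    if mad == 0:
        # All typical values are identical; the rule cannot scale distances.
        return []
    return [(i, a) for i, a in enumerate(amounts)
            if abs(a - centre) > k * mad]

# Hypothetical transaction amounts for one account.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
suspicious = flag_deviations(amounts)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme transaction inflates the standard deviation enough to hide itself in a small sample. A flagged transaction is only potentially suspicious and would still need further review.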

In the context of employment, profiles can be of use for tracking employees by monitoring their online behaviour, for the detection of fraud by them, and for the deployment of human resources by pooling and ranking their skills (Leopold & Meints 2008). [2]

Profiling can also be used to support people at work, and also for learning, by intervening in the design of adaptive hypermedia systems that personalize the interaction. For instance, this can be useful for supporting the management of attention (Nabeth 2008).

In forensic science, the possibility exists of linking different databases of cases and suspects and mining these for common patterns. This could be used for solving existing cases or for the purpose of establishing risk profiles of potential suspects (Geradts & Sommer 2008)(Harcourt 2006).

Consumer profiling

Consumer profiling is a form of customer analytics, where customer data is used to make decisions on product promotion, the pricing of products, and personalized advertising. [3] When the aim is to find the most profitable customer segment, consumer analytics draws on demographic data, data on consumer behaviour, data on the products purchased, payment method, and surveys to establish consumer profiles. To establish predictive models on the basis of existing databases, the Knowledge Discovery in Databases (KDD) statistical method is used. KDD groups similar customer data to predict future consumer behaviour. Other methods of predicting consumer behaviour are correlation and pattern recognition. Consumer profiles describe customers based on a set of attributes, [4] and typically consumers are grouped according to income, living standard, age and location. Consumer profiles may also include behavioural attributes that assess a customer's motivation in the buyer decision process. Well-known examples of consumer profiles are Experian's Mosaic geodemographic classification of households, CACI's Acorn, and Acxiom's Personicx. [5]
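Grouping consumers by attributes such as income, age and location can be sketched as a simple segmentation. The band boundaries, field names and customer records below are hypothetical, and this rule-based grouping is a stand-in for illustration only, not the method used by commercial classifications such as Mosaic or Acorn.

```python
from collections import defaultdict

def age_band(age):
    """Hypothetical age bands."""
    if age < 35:
        return "under_35"
    return "35_to_54" if age < 55 else "55_plus"

def income_band(income):
    """Hypothetical yearly income bands."""
    if income < 30000:
        return "low"
    return "mid" if income < 70000 else "high"

def segment(customers):
    """Group customer ids by (income band, age band, location)."""
    segments = defaultdict(list)
    for c in customers:
        key = (income_band(c["income"]), age_band(c["age"]), c["location"])
        segments[key].append(c["id"])
    return dict(segments)

# Hypothetical customer records.
customers = [
    {"id": 1, "income": 25000, "age": 28, "location": "north"},
    {"id": 2, "income": 27000, "age": 31, "location": "north"},
    {"id": 3, "income": 82000, "age": 47, "location": "south"},
    {"id": 4, "income": 45000, "age": 60, "location": "south"},
]

segments = segment(customers)
```

Each key of `segments` then acts as a consumer profile: customers 1 and 2 fall into the same segment and would be treated alike, whatever their individual differences.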

Ambient intelligence

In a built environment with ambient intelligence, everyday objects have built-in sensors and embedded systems that allow them to recognise and respond to the presence and needs of individuals. Ambient intelligence relies on automated profiling and human–computer interaction designs. [6] Sensors monitor an individual's actions and behaviours, thereby generating, collecting, analysing, processing and storing personal data. Early examples of consumer electronics with ambient intelligence include mobile apps, augmented reality and location-based services. [7]

Risks and issues

Profiling technologies have raised a host of ethical, legal and other issues, including privacy, equality, due process, security and liability. Numerous authors have warned against the affordances of a new technological infrastructure that could emerge on the basis of semi-autonomic profiling technologies (Lessig 2006)(Solove 2004)(Schwartz 2000).

Privacy is one of the principal issues raised. Profiling technologies make possible a far-reaching monitoring of an individual's behaviour and preferences. Profiles may reveal personal or private information about individuals that they might not even be aware of themselves (Hildebrandt & Gutwirth 2008).

Profiling technologies are by their very nature discriminatory tools. They allow unparalleled kinds of social sorting and segmentation, which could have unfair effects. The people who are profiled may have to pay higher prices, [8] they could miss out on important offers or opportunities, and they may run increased risks because catering to their needs is less profitable (Lyon 2003). In most cases they will not be aware of this, since profiling practices are mostly invisible and the profiles themselves are often protected by intellectual property or trade secret. This poses a threat to the equality and solidarity of citizens. On a larger scale, it might cause the segmentation of society. [9]

One of the problems underlying potential violations of privacy and non-discrimination is that the process of profiling is more often than not invisible to those who are being profiled. This makes it hard, if not impossible, to contest the application of a particular group profile. It also disturbs principles of due process: if a person has no access to the information on the basis of which they are denied benefits or attributed certain risks, they cannot contest the way they are being treated (Steinbock 2005).

Profiles can be used against people when they end up in the hands of people who are not entitled to access or use the information. An important issue related to these breaches of security is identity theft.

When the application of profiles causes harm, it has to be determined who is liable. Is the software programmer, the profiling service provider, or the profiled user to be held accountable? This issue of liability is especially complex when the application of profiles and the decisions based on them have also become automated, as in autonomic computing or in ambient intelligence, where decisions are made automatically on the basis of profiling.


References

Notes and other references

  1. Canhoto, A.I. (2007). "Profiling behaviour: the social construction of categories in the detection of financial crime" (PDF). Dissertation, London School of Economics. lse.ac.uk.
  2. Electronic Privacy Information Center. "EPIC - Workplace Privacy". epic.org.
  3. Reyes, Matthew (2020). Consumer Behavior and Marketing. IntechOpen. p. 10. ISBN 9781789238556.
  4. Reyes, Matthew (2020). Consumer Behavior and Marketing. IntechOpen. p. 11. ISBN 9781789238556.
  5. Reyes, Matthew (2020). Consumer Behavior and Marketing. IntechOpen. p. 12. ISBN 9781789238556.
  6. De Hert, Paul; Leenes, Ronald; Gutwirth, Serge; Poullet, Yves (2011). Computers, Privacy and Data Protection: an Element of Choice. Springer Netherlands. p. 80. ISBN 9789400706415.
  7. De Hert, Paul; Leenes, Ronald; Gutwirth, Serge; Poullet, Yves (2011). Computers, Privacy and Data Protection: an Element of Choice. Springer Netherlands. p. 80. ISBN 9789400706415.
  8. Odlyzko, A. (2003). "Privacy, economics, and price discrimination on the Internet" (PDF). ICEC2003: Fifth International Conference on Electronic Commerce, N. Sadeh, ed., ACM, pp. 355–366.
  9. Gandy, O. (2002). "Data Mining and Surveillance in the post 9/11 environment" (PDF). Presentation at IAMCR, Barcelona. asc.upenn.edu.