Link analysis

In network theory, link analysis is a data-analysis technique used to evaluate relationships between nodes. Relationships may be identified among various types of nodes, including organizations, people and transactions. Link analysis has been used for investigation of criminal activity (fraud, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, medical research, and art.

Knowledge discovery

Knowledge discovery is an iterative and interactive process used to identify, analyze and visualize patterns in data. [1] Network analysis, link analysis and social network analysis are all methods of knowledge discovery, each a corresponding subset of the prior method. Most knowledge discovery methods follow these steps (at the highest level): [2]

  1. Data processing
  2. Transformation
  3. Analysis
  4. Visualization

Data gathering and processing require access to data and have several inherent issues, including information overload and data errors. Once data is collected, it needs to be transformed into a format that can be used effectively by both human and computer analyzers. Visualizations, including network charts, may then be produced from the data either manually or by computer. Several algorithms exist to help with analysis of the data, such as Dijkstra's algorithm, breadth-first search, and depth-first search.
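As a rough illustration of how one of the algorithms named above might be applied to link data, the following sketch uses breadth-first search to find the shortest chain of links between two nodes. The function name `shortest_link_path`, the node labels, and the sample links are all hypothetical and are not drawn from any cited source.

```python
from collections import deque

def shortest_link_path(links, start, goal):
    """Return the shortest chain of nodes from start to goal, or None."""
    # Build an undirected adjacency map from (node, node) link pairs.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search, remembering each node's predecessor.
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical links between people (P*) and an account (A1).
links = [("P1", "A1"), ("A1", "P2"), ("P2", "P3"), ("P1", "P4"), ("P4", "P3")]
path = shortest_link_path(links, "P1", "P3")
print(path)
```

Because breadth-first search explores nodes in order of distance from the start, the first path it finds to the goal uses the fewest links. Dijkstra's algorithm would be the analogous choice if links carried weights.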

Link analysis focuses on analysis of relationships among nodes through visualization methods (network charts, association matrix). Here is an example of the relationships that may be mapped for crime investigations: [3]

Relationship/Network: Data Sources

  1. Trust: Prior contacts in family, neighborhood, school, military, club or organization. Public and court records. Data may only be available in suspect's native country.
  2. Task: Logs and records of phone calls, electronic mail, chat rooms, instant messages, Web site visits. Travel records. Human intelligence: observation of meetings and attendance at common events.
  3. Money & Resources: Bank account and money transfer records. Pattern and location of credit card use. Prior court records. Human intelligence: observation of visits to alternate banking resources such as Hawala.
  4. Strategy & Goals: Web sites. Videos and encrypted disks delivered by courier. Travel records. Human intelligence: observation of meetings and attendance at common events.

Link analysis is used for three primary purposes: [4]

  1. Find matches in data for known patterns of interest;
  2. Find anomalies where known patterns are violated;
  3. Discover new patterns of interest (social network analysis, data mining).

History

Klerks categorized link analysis tools into three generations. [5] The first generation was introduced in 1975 as the Anacapa Chart of Harper and Harris. [6] This method requires that a domain expert review data files, identify associations by constructing an association matrix, create a link chart for visualization, and finally analyze the network chart to identify patterns of interest. It requires extensive domain knowledge and is extremely time-consuming when reviewing vast amounts of data.

Figure: Association matrix

In addition to the association matrix, the activities matrix can be used to produce actionable information that has practical value for law enforcement. The activities matrix, as the term implies, centers on the actions and activities of people with respect to locations, whereas the association matrix focuses on the relationships between people, organizations, and/or properties. The distinction between these two types of matrices, while minor, is nonetheless significant in terms of the output of the analysis. [7] [8] [9] [10]
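The difference between the two matrices can be sketched in a few lines of code. The example below is a minimal illustration under assumed data: the people, locations, and observations are invented, and the helper names (`association_matrix`, `activities_matrix`, `link_chart_edges`) are hypothetical rather than part of any tool mentioned in this article.

```python
people = ["Ann", "Bob", "Cy"]
locations = ["Warehouse", "Cafe"]

# Hypothetical observations; every name here is made up for illustration.
associations = [("Ann", "Bob"), ("Bob", "Cy")]
activities = [("Ann", "Warehouse"), ("Bob", "Warehouse"), ("Cy", "Cafe")]

def association_matrix(people, pairs):
    """Symmetric person-by-person matrix: 1 where an association is recorded."""
    index = {p: i for i, p in enumerate(people)}
    m = [[0] * len(people) for _ in people]
    for a, b in pairs:
        m[index[a]][index[b]] = 1
        m[index[b]][index[a]] = 1
    return m

def activities_matrix(people, locations, sightings):
    """Person-by-location matrix: 1 where a person was recorded at a location."""
    row = {p: i for i, p in enumerate(people)}
    col = {loc: j for j, loc in enumerate(locations)}
    m = [[0] * len(locations) for _ in people]
    for person, loc in sightings:
        m[row[person]][col[loc]] = 1
    return m

def link_chart_edges(people, matrix):
    """List each association once, reading the upper triangle of the matrix."""
    return [(people[i], people[j])
            for i in range(len(people))
            for j in range(i + 1, len(people))
            if matrix[i][j]]

assoc = association_matrix(people, associations)
acts = activities_matrix(people, locations, activities)
edges = link_chart_edges(people, assoc)
print(assoc, acts, edges)
```

The association matrix is square and symmetric, so its upper triangle is enough to draw the link chart. The activities matrix is rectangular, with one row per person and one column per location.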

Second-generation tools consist of automatic graphics-based analysis tools such as IBM i2 Analyst's Notebook, Netmap, ClueMaker and Watson. These tools can automate the construction and updating of the link chart once an association matrix has been manually created. However, analysis of the resulting charts and graphs still requires an expert with extensive domain knowledge.

Third-generation link-analysis tools, such as DataWalk, allow the automatic visualization of linkages between elements in a data set, which can then serve as the canvas for further exploration or manual updates.

Applications

Information overload

With the vast amounts of data and information that are stored electronically, users are confronted with multiple unrelated sources of information available for analysis. Data analysis techniques are required to make effective and efficient use of the data. Palshikar classifies data analysis techniques into two categories: statistical techniques (statistical models, time-series analysis, clustering and classification, matching algorithms to detect anomalies) and artificial intelligence (AI) techniques (data mining, expert systems, pattern recognition, machine learning techniques, neural networks). [14]

Bolton & Hand define statistical data analysis as either supervised or unsupervised methods. [15] Supervised learning methods require that rules be defined within the system to establish what is expected or unexpected behavior. Unsupervised learning methods review data in comparison to the norm and detect statistical outliers. Supervised learning methods are limited in the scenarios they can handle, because training rules must be established from previous patterns. Unsupervised learning methods can detect a broader range of issues, but they may produce a higher false-positive ratio if the behavioral norm is not well established or understood.
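The contrast can be made concrete with a small sketch. Here a fixed-limit rule stands in for the rule-defined case described above, and a simple z-score test stands in for outlier detection against the norm. Both are deliberately simplified stand-ins rather than methods taken from Bolton & Hand; the function names, the threshold of 2.0 standard deviations, and the transaction amounts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def rule_based_flags(amounts, limit):
    """Flag values that break a predefined rule (here: a fixed limit)."""
    return [x for x in amounts if x > limit]

def outlier_flags(amounts, z_threshold=2.0):
    """Flag values far from the observed norm, measured in standard deviations."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        return []  # all values identical, so nothing deviates from the norm
    return [x for x in amounts if abs(x - mu) / sigma > z_threshold]

# Hypothetical transaction amounts with one unusual value.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]

by_rule = rule_based_flags(amounts, limit=1000)  # rule written for larger sums
by_norm = outlier_flags(amounts)                 # compares against the data itself
print(by_rule, by_norm)
```

In this sample the predefined rule misses the unusual value because the rule was written with a different scenario in mind, while the outlier test flags it. The outlier test, in turn, depends on the norm being well estimated, which is where false positives can arise.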

Data itself has inherent issues including integrity (or lack of) and continuous changes. Data may contain "errors of omission and commission because of faulty collection or handling, and when entities are actively attempting to deceive and/or conceal their actions". [4] Sparrow [16] highlights incompleteness (inevitability of missing data or links), fuzzy boundaries (subjectivity in deciding what to include) and dynamic changes (recognition that data is ever-changing) as the three primary problems with data analysis. [3]

Once data is transformed into a usable format, open texture and cross referencing issues may arise. Open texture was defined by Waismann as the unavoidable uncertainty in meaning when empirical terms are used in different contexts. [17] Uncertainty in meaning of terms presents problems when attempting to search and cross reference data from multiple sources. [18]

The primary method for resolving data analysis issues is reliance on domain knowledge from an expert. This is a very time-consuming and costly method of conducting link analysis and has inherent problems of its own. McGrath et al. conclude that the layout and presentation of a network diagram have a significant impact on the user's "perceptions of the existence of groups in networks". [19] Even using domain experts may result in differing conclusions as analysis may be subjective.

Prosecution vs. crime prevention

Link analysis techniques have primarily been used for prosecution, as it is far easier to review historical data for patterns than it is to attempt to predict future actions.

Krebs demonstrated the use of an association matrix and link chart of the terrorist network associated with the 19 hijackers responsible for the September 11th attacks by mapping details that became publicly available following the attacks. [3] Even with the advantages of hindsight and publicly available information on people, places and transactions, it is clear that there is missing data.

Alternatively, Picarelli argued that link analysis techniques could have been used to identify and potentially prevent illicit activities within the Aum Shinrikyo network. [20] "We must be careful of 'guilt by association'. Being linked to a terrorist does not prove guilt – but it does invite investigation." [3] Balancing the legal concepts of probable cause, right to privacy and freedom of association becomes challenging when reviewing potentially sensitive data with the objective of preventing crime or illegal activity that has not yet occurred.

Proposed solutions

There are four categories of proposed link analysis solutions: [21]

  1. Heuristic-based
  2. Template-based
  3. Similarity-based
  4. Statistical

Heuristic-based tools utilize decision rules that are distilled from expert knowledge using structured data. Template-based tools employ Natural Language Processing (NLP) to extract details from unstructured data that are matched to pre-defined templates. Similarity-based approaches use weighted scoring to compare attributes and identify potential links. Statistical approaches identify potential links based on lexical statistics.
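To illustrate the similarity-based category, the sketch below scores two records by the weighted share of attributes on which they agree. It is only one possible reading of "weighted scoring": the attribute names, weights, records, and the `similarity_score` helper are all hypothetical.

```python
def similarity_score(rec_a, rec_b, weights):
    """Weighted share of attributes on which two records agree (0.0 to 1.0)."""
    total = sum(weights.values())
    matched = sum(
        w for field, w in weights.items()
        if rec_a.get(field) is not None and rec_a.get(field) == rec_b.get(field)
    )
    return matched / total

# Hypothetical weights: a shared phone number counts more than a shared surname.
weights = {"phone": 3, "address": 2, "surname": 1}

a = {"phone": "555-0100", "address": "1 Elm St", "surname": "Doe"}
b = {"phone": "555-0100", "address": "9 Oak Ave", "surname": "Doe"}

score = similarity_score(a, b, weights)  # phone and surname match: (3 + 1) / 6
potential_link = score >= 0.5            # assumed cut-off for proposing a link
print(round(score, 3), potential_link)
```

In practice the weights and the cut-off would need to be tuned to the data, and exact equality would likely be replaced by a fuzzier comparison for fields such as names and addresses.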

CrimeNet Explorer

J.J. Xu and H. Chen propose a framework for automated network analysis and visualization called CrimeNet Explorer. [22] This framework includes the following elements:

References

  1. The Tor Project, Inc. "Tor Project: Overview". Archived from the original on 2015-06-06. Retrieved 2023-02-04.
  2. Ahonen, H., Features of Knowledge Discovery Systems. Archived 2012-12-08 at the Wayback Machine.
  3. Krebs, V. E. 2001, Mapping networks of terrorist cells, Connections 24, 43–52. Archived 2011-07-20 at the Wayback Machine.
  4. Klerks, P. (2001). "The network paradigm applied to criminal organizations: Theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the Netherlands". Connections. 24: 53–65. CiteSeerX 10.1.1.129.4720.
  5. Harper and Harris, The Analysis of Criminal Intelligence, Human Factors and Ergonomics Society Annual Meeting Proceedings, 19(2), 1975, pp. 232-238.
  6. Pike, John. "FMI 3-07.22 Appendix F Intelligence Analysis Tools and Indicators". Archived from the original on 2014-03-08. Retrieved 2014-03-08.
  7. Social Network Analysis and Other Analytical Tools Archived 2014-03-08 at the Wayback Machine
  8. MSFC, Rebecca Whitaker (10 July 2009). "Aeronautics Educator Guide - Activity Matrices". Archived from the original on 17 January 2008.
  9. Personality/Activity Matrix Archived 2014-03-08 at the Wayback Machine
  10. "Homicide Investigation Tracking System (HITS)". Archived from the original on 2010-10-21. Retrieved 2010-10-31.
  11. "New Jersey State Police - Investigations Section". Archived from the original on 2009-03-25. Retrieved 2010-10-31.
  12. "Violent Crime Linkage System (ViCLAS)". Archived from the original on 2010-12-02. Retrieved 2010-10-31.
  13. Palshikar, G. K., The Hidden Truth, Intelligent Enterprise, May 2002. Archived 2008-05-15 at the Wayback Machine.
  14. Bolton, R. J. & Hand, D. J., Statistical Fraud Detection: A Review, Statistical Science, 2002, 17(3), pp. 235-255.
  15. Sparrow, M. K. 1991. Network Vulnerabilities and Strategic Intelligence in Law Enforcement, International Journal of Intelligence and CounterIntelligence Vol. 5 #3.
  16. Friedrich Waismann, Verifiability (1945), p.2.
  17. Lyons, D., Open Texture and the Possibility of Legal Interpretation (2000).
  18. McGrath, C., Blythe, J., Krackhardt, D., Seeing Groups in Graph Layouts. Archived 2013-10-03 at the Wayback Machine.
  19. Picarelli, J. T., Transnational Threat Indications and Warning: The Utility of Network Analysis, Military and Intelligence Analysis Group. Archived 2016-03-11 at the Wayback Machine.
  20. Schroeder et al., Automated Criminal Link Analysis Based on Domain Knowledge, Journal of the American Society for Information Science and Technology, 58:6 (842), 2007.
  21. Xu, J.J. & Chen, H., CrimeNet Explorer: A Framework for Criminal Network Knowledge Discovery, ACM Transactions on Information Systems, 23(2), April 2005, pp. 201-226.