Co-occurrence network

A co-occurrence network created with KH Coder

A co-occurrence network, sometimes referred to as a semantic network, [1] is a method of analyzing text that includes a graphic visualization of potential relationships between people, organizations, concepts, biological organisms such as bacteria, [2] or other entities represented within written material. The generation and visualization of co-occurrence networks have become practical with the advent of electronically stored text amenable to text mining.

By way of definition, co-occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a text corpus can be set according to desired criteria. For example, a more stringent criterion for co-occurrence may require a pair of terms to appear in the same sentence. Co-occurrence networks have been found to be particularly useful for analyzing large text collections and big data: identifying the main themes and topics (such as in a large number of social media posts), revealing biases in the text (such as biases in news coverage), or even mapping an entire research field. [3]
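The A, B, C example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_edges`, the whitespace-and-punctuation tokenizer, and the case-sensitive matching are choices made here, not part of any particular tool.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(units, vocabulary):
    """Count, for each unordered pair of terms, the text units containing both."""
    edges = Counter()
    for unit in units:
        # Assumption: a term is a single case-sensitive word token.
        present = sorted(vocabulary & set(re.findall(r"\w+", unit)))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

vocabulary = {"A", "B", "C"}
articles = [
    "Term A is discussed together with term B.",
    "Term B is discussed together with term C.",
]
edges = cooccurrence_edges(articles, vocabulary)
print(edges)  # A-B and B-C are linked; A-C is not

# A stricter criterion: the unit of text is the sentence, not the article.
article = "A is mentioned here. B is mentioned later."
by_article = cooccurrence_edges([article], vocabulary)
by_sentence = cooccurrence_edges(re.split(r"(?<=[.!?])\s+", article), vocabulary)
```

In the last example, A and B co-occur at the article level but not at the sentence level, so the stricter rule yields no link.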

Methods and development

The process of constructing co-occurrence networks includes identifying keywords in the text, calculating the frequencies of co-occurrences, and analyzing the networks to find central words and clusters of themes in the network. [4]
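The three steps can be sketched as follows. Everything specific here is an assumption made for illustration: keywords are taken to be the most frequent non-stopword tokens, the stopword list is a tiny hypothetical one, the unit of co-occurrence is the sentence, and centrality is measured by summing the co-occurrence counts of a word's links.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "of", "and", "in", "on", "is", "to"}  # illustrative only

def tokenize(sentence):
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOPWORDS]

def top_keywords(sentences, n):
    """Step 1: pick the n most frequent non-stopword tokens as keywords."""
    counts = Counter(t for s in sentences for t in tokenize(s))
    return {word for word, _ in counts.most_common(n)}

def count_cooccurrences(sentences, keywords):
    """Step 2: count sentences in which each keyword pair appears together."""
    edges = Counter()
    for s in sentences:
        present = sorted(keywords & set(tokenize(s)))
        edges.update(combinations(present, 2))
    return edges

def strength(edges):
    """Step 3: sum the weights of each word's links as a simple centrality."""
    totals = Counter()
    for (a, b), weight in edges.items():
        totals[a] += weight
        totals[b] += weight
    return totals

sentences = [
    "The election dominated the news.",
    "Election coverage in the news focused on the economy.",
    "The economy and the election shaped voter turnout.",
]
keywords = top_keywords(sentences, 3)
edges = count_cooccurrences(sentences, keywords)
central = strength(edges).most_common(1)[0][0]
print(keywords, edges, central)
```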

Word co-occurrence network (range 3 words) for the following sentence: "The dawn is the appearance of light - usually golden, pink or purple - before sunrise"

Co-occurrence network of a bacterial community in a stream
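A window-based word network for the sentence in the caption can be sketched as below. The caption does not state how "range 3 words" is defined; this sketch assumes it means that two words are linked when they fall within the same run of three consecutive tokens, and that punctuation is dropped and all words (including "the") are kept.

```python
import re
from collections import Counter

def window_cooccurrences(text, window=3):
    """Link each token to the tokens that follow it inside a sliding window."""
    tokens = re.findall(r"[a-z]+", text.lower())
    edges = Counter()
    for i, left in enumerate(tokens):
        # tokens[i+1 : i+window] are the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return edges

sentence = ("The dawn is the appearance of light - usually golden, "
            "pink or purple - before sunrise")
edges = window_cooccurrences(sentence, window=3)
for pair, count in sorted(edges.items()):
    print(pair, count)
```

Under these assumptions "dawn" and "the" are linked twice, because "the" appears both just before and two positions after "dawn", while distant words such as "dawn" and "sunrise" are not linked at all.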

Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any text corpus). Co-occurring pairs of terms can be called “neighbors” and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected.
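One way to make "neighbors" and "neighborhoods" concrete is an adjacency map. In this sketch a neighborhood is read as a term together with its direct neighbors, which is only one possible interpretation of the description above; the term names are placeholders.

```python
from collections import defaultdict

def build_adjacency(pairs):
    """Map each term to the set of terms it co-occurs with (its neighbors)."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def neighborhood(adjacency, term):
    """A term together with its direct neighbors."""
    return {term} | adjacency.get(term, set())

pairs = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "E"), ("X", "Y")]
adjacency = build_adjacency(pairs)

around_a = neighborhood(adjacency, "A")
around_d = neighborhood(adjacency, "D")
around_x = neighborhood(adjacency, "X")

print(around_a & around_d)  # terms through which the two neighborhoods connect
print(around_a & around_x)  # empty: these neighborhoods remain unconnected
```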

Individual terms are, within the context of text mining, symbolically represented as text strings. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols. Occurrence of an individual term is established by searching for each known symbolic representation of the term. The process can be augmented through natural language processing (NLP) algorithms that interrogate segments of text for possible alternatives such as word order, spacing and hyphenation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a noun based on a preceding string of text known to be an article).
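The idea of one primary symbol with several alternative symbols can be sketched with regular expressions. The dictionary below is a hypothetical example, and the matching rule (an optional space or hyphen between word parts, case-insensitive) is an assumption that covers spacing and hyphenation variants only; it does not handle word-order variation.

```python
import re

# Hypothetical dictionary: primary symbol -> known alternative symbols.
DICTIONARY = {
    "post-traumatic stress disorder": ["post-traumatic stress disorder", "PTSD"],
    "multiple sclerosis": ["multiple sclerosis"],
}

def compile_term_patterns(dictionary):
    """Build one pattern per term that tolerates spacing/hyphenation variants."""
    patterns = {}
    for primary, variants in dictionary.items():
        alternatives = []
        for variant in variants:
            parts = re.split(r"[\s-]+", variant)
            # Allow a single optional space or hyphen between word parts.
            alternatives.append(r"[\s-]?".join(re.escape(p) for p in parts))
        patterns[primary] = re.compile(
            r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE
        )
    return patterns

def terms_in(text, patterns):
    """Return the primary symbols of all terms found in the text."""
    return {primary for primary, pattern in patterns.items() if pattern.search(text)}

patterns = compile_term_patterns(DICTIONARY)
print(terms_in("Genes implicated in posttraumatic stress disorder", patterns))
print(terms_in("PTSD and multiple sclerosis were both studied", patterns))
```

Each occurrence is reported under the primary symbol, so downstream co-occurrence counting treats all variants as the same term.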

Graphic representation of co-occurrence networks allows them to be visualized and inferences to be drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplification of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or on more subtle criteria such as the “probability” of co-occurrence or the presence of an intervening descriptive term.
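A simple pruning rule based on the absolute number of co-occurrences can be sketched as follows. The rule used here is an assumption chosen for illustration: an edge is kept only if it meets a minimum count and is among the strongest `max_neighbors` links of both of its endpoints, which guarantees that no term keeps more than `max_neighbors` neighbors. The edge counts are made-up sample data.

```python
from collections import defaultdict

def prune_edges(edge_counts, max_neighbors=2, min_count=1):
    """Keep edges that are among the top links of both endpoints."""
    ranked = defaultdict(list)
    for (a, b), count in edge_counts.items():
        if count >= min_count:
            ranked[a].append((count, b))
            ranked[b].append((count, a))
    # For each term, the names of its strongest neighbors (ties broken by name).
    top = {
        term: {
            other
            for _, other in sorted(links, key=lambda item: (-item[0], item[1]))[:max_neighbors]
        }
        for term, links in ranked.items()
    }
    return {
        (a, b): count
        for (a, b), count in edge_counts.items()
        if count >= min_count and b in top[a] and a in top[b]
    }

edge_counts = {
    ("a", "b"): 5, ("a", "c"): 3, ("a", "d"): 1,
    ("b", "c"): 2, ("c", "d"): 1,
}
kept = prune_edges(edge_counts, max_neighbors=2)
print(kept)
```

Other criteria, such as a probability-style weight in place of the raw count, could be substituted by changing how the edge weights are computed before pruning.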

Quantitative aspects of the underlying structure of a co-occurrence network might also be informative, such as the overall number of connections between entities, the clustering of entities representing sub-domains, and the detection of synonyms. [6]
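A few such structural quantities can be computed directly from an adjacency map. The measures chosen here (degree, density, and the local clustering coefficient) are standard graph statistics picked for illustration, and the small network is made-up sample data.

```python
from itertools import combinations

def degrees(adjacency):
    """Number of connections per term."""
    return {term: len(neighbors) for term, neighbors in adjacency.items()}

def density(adjacency):
    """Fraction of all possible term pairs that are actually connected."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edge_count = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return 2 * edge_count / (n * (n - 1))

def local_clustering(adjacency, term):
    """Fraction of a term's neighbor pairs that are themselves connected."""
    neighbors = adjacency[term]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(neighbors), 2) if b in adjacency[a])
    return 2 * links / (k * (k - 1))

adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(degrees(adjacency))
print(density(adjacency))
print(local_clustering(adjacency, "a"), local_clustering(adjacency, "b"))
```

High local clustering around a group of terms is one possible signal of a sub-domain, although dedicated community-detection methods are usually used for that purpose.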

Applications and use

Some working applications of the co-occurrence approach are publicly available on the internet. PubGene is an example of an application that addresses the interests of the biomedical community by presenting networks based on the co-occurrence of genetics-related terms as they appear in MEDLINE records. [7] [8] PubGene's CoreMine Medical has been used in studies relating genes and proteins to potentially effective drugs and drug candidates in multiple sclerosis, [9] fibrosis, [10] and hepatitis. [11] CoreMine Medical was also used in a study of genes implicated in post-traumatic stress disorder. [12]

The website NameBase is an example of how human relationships can be inferred by examining networks constructed from the co-occurrence of personal names in newspapers and other texts (as in Ozgur et al. [13]).

Networks of information are also used to facilitate efforts to organize and focus publicly available information for law enforcement and intelligence purposes (so-called "open source intelligence" or OSINT). Related techniques include co-citation networks as well as the analysis of hyperlink and content structure on the internet (such as in the analysis of web sites connected to terrorism [14]).

References

  1. Segev, Elad (2021). Semantic Network Analysis in Social Sciences. London: Routledge. ISBN 9780367636524.
  2. Freilich, Shiri; Kreimer, Anat; Meilijson, Isaac; Gophna, Uri; Sharan, Roded; Ruppin, Eytan (2010-02-27). "The large-scale organization of the bacterial network of ecological co-occurrence interactions". Nucleic Acids Research. 38 (12): 3857–3868. doi:10.1093/nar/gkq118. ISSN 1362-4962. PMC 2896517. PMID 20194113.
  3. Segev, Elad (2021). Semantic Network Analysis in Social Sciences. London: Routledge. ISBN 9780367636524.
  4. Segev, Elad (2020). "Textual network analysis: Detecting prevailing themes and biases in international news and social media". Sociology Compass. 14 (4). doi:10.1111/soc4.12779. S2CID 212890998.
  5. Liu, Yang; Qu, Xiaodong; Elser, James J.; Peng, Wenqi; Zhang, Min; Ren, Ze; Zhang, Haiping; Zhang, Yuhang; Yang, Hua (2019). "Impact of Nutrient and Stoichiometry Gradients on Microbial Assemblages in Erhai Lake and Its Input Streams". Water. 11 (8): 1711. doi:10.3390/w11081711.
  6. Cohen, AM; Hersh, WR; Dubay, C; Spackman, K (2005). "Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts". BMC Bioinformatics. 6 (1): 103. doi:10.1186/1471-2105-6-103. ISSN 1471-2105. PMC 1090552. PMID 15847682.
  7. Jenssen, Tor-Kristian; Lægreid, Astrid; Komorowski, Jan; Hovig, Eivind (2001-05-01). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics. 28 (1): 21–28. doi:10.1038/ng0501-21. ISSN 1061-4036. PMID 11326270. S2CID 8889284.
  8. Grivell, L. (2002-03-01). "Mining the bibliome: searching for a needle in a haystack?: New computing tools are needed to effectively scan the growing amount of scientific literature for useful information". EMBO Reports. 3 (3): 200–203. doi:10.1093/embo-reports/kvf059. ISSN 1469-221X. PMC 1084023. PMID 11882534.
  9. Dadashkhan, Sadaf; Seyed Amir, Mirmotalebisohi; Poursheykhi, Hossein; Sameni, Marzieh; Ghani, Sepideh; Abbasi, Maryam; Kalantari, Sima; Zali, Hakimeh (2023). "Deciphering crucial genes in multiple sclerosis pathogenesis and drug repurposing: A systems biology approach". J Proteomics. 280 (104890). doi:10.1016/j.jprot.2023.104890. PMID 36966969.
  10. Wilson, Ava C; Chiles, Joe; Ashish, Shah; Chanda, Diptiman; Kumar, Preeti L; Mobley, James A; Neptune, Enid R; Thannickal, Victor J; McDonald, Merry-Lynn N (2022). "Integrated bioinformatics analysis identifies established and novel TGFβ1-regulated genes modulated by anti-fibrotic drugs". Sci Rep. 12 (1): 3080. doi:10.1038/s41598-022-07151-1. PMC 8866468. PMID 35197532.
  11. Li, Shenghao; Hao, Liyuan; Hu, Xiaoyu; Li, Luya (2023). "A systematic study on the treatment of hepatitis B-related hepatocellular carcinoma with drugs based on bioinformatics and key target reverse network pharmacology and experimental verification". Infect Agent Cancer. 18 (1): 41. doi:10.1186/s13027-023-00520-z. PMC 10315056. PMID 37393234.
  12. Bian, Yao-Yao; Yang, Li-Li; Zhang, Bin; Li, Wen; Li, Zheng-Jun; Li, Wen-Lin; Zeng, Li (2020). "Identification of key genes involved in post-traumatic stress disorder: Evidence from bioinformatics analysis". World J Psychiatry. 10 (12): 286–298. doi:10.5498/wjp.v10.i12.286. PMC 7754529. PMID 33392005.
  13. Ozgur A, Cetin B, Bingol H: “Co-occurrence Network of Reuters News” (15 Dec 2007) https://arxiv.org/abs/0712.2491
  14. Yilu Zhou; Reid, E.; Jialun Qin; Hsinchun Chen; Guanpi Lai (2018-05-22). "US domestic extremist groups on the Web: link and content analysis". IEEE Intelligent Systems. 20 (5): 44–51. doi:10.1109/MIS.2005.96. S2CID 15687907.