High-performance Integrated Virtual Environment

Last updated August 08, 2024

HIVE Logo

The High-performance Integrated Virtual Environment (HIVE) is a distributed computing environment used for healthcare-IT and biological research, including analysis of Next Generation Sequencing (NGS) data, preclinical, clinical and post market data, adverse events, metagenomic data, etc.^[1] Currently it is supported and continuously developed by US Food and Drug Administration (government domain), George Washington University (academic domain), and by DNA-HIVE, WHISE-Global and Embleema (commercial domain). HIVE currently operates fully functionally within the US FDA supporting wide variety (+60) of regulatory research and regulatory review projects as well as for supporting MDEpiNet medical device postmarket registries. Academic deployments of HIVE are used for research activities and publications in NGS analytics, cancer research, microbiome research and in educational programs for students at GWU. Commercial enterprises use HIVE for oncology, microbiology, vaccine manufacturing, gene editing, healthcare-IT, harmonization of real-world data, in preclinical research and clinical studies.

Infrastructure

HIVE is a massively parallel distributed computing environment where the distributed storage library and the distributed computational powerhouse are linked seamlessly.^[2] The system is both robust and flexible due to maintaining both storage and the metadata database on the same network.^[3] The distributed storage layer of software is the key component for file and archive management and is the backbone for the deposition pipeline. The data deposition back-end allows automatic uploads and downloads of external datasets into HIVE data repositories. The metadata database can be used to maintain specific information about extremely large files ingested into the system (big data) as well as metadata related to computations run on the system. This metadata then allows details of a computational pipeline to be brought up easily in the future in order to validate or replicate experiments. Since the metadata is associated with the computation, it stores the parameters of any computation in the system eliminating manual record keeping.^{[ citation needed ]}

Differentiating HIVE from other object oriented databases is that HIVE implements a set of unified APIs to search, view, and manipulate data of all types. The system also facilitates a highly secure hierarchical access control and permission system, allowing determination of data access privileges in a finely granular manner without creating a multiplicity of rules in the security subsystem. The security model, designed for sensitive data, provides comprehensive control and auditing functionality in compliance with HIVE's designation as a FISMA Moderate system.^[4]

HIVE technological capabilities

Data-retrieval: the HIVE is capable of retrieving data from a variety of sources such as local, cloud-based or network storage, sequencing instruments, and from http, ftp and sftp repositories. Additionally, HIVE implements the sophisticated handshake protocols with existing large scale data platforms such as NIH/NCBI to download large amounts of reference genomic or sequence read data on behalf of users in an easy and accurate manner.
Data-warehousing: HIVE honeycomb data model was specifically created for adopting complex hierarchy of scientific datatypes, providing a platform for standardization and provenance of data within the framework of object-oriented data models. By using an integrated data-engine, honeycomb, HIVE contributes to the veracity of biomedical computations and helps ensure reproducibility, and harmonization of bio-computational processes.
Security: HIVE-honeycomb employs a hierarchical security control system, allowing the determination of access privileges in an acutely granular manner without overwhelming the security subsystem with a multiplicity of rules. It provides on the fly encryption/decryption of PII and is compliant with the highest security protocols as requested for systems authorized to operate in regulatory FISMA moderate environments.

HIVE-visualizations HIVE-visualizations.png — HIVE-visualizations

Integration: HIVE provides unified Application Program Interface (API) to search, edit, view, secure, share and manipulate data and computations of all types. As an Integrator platform HIVE provides developers means to develop (C/C++, Python, Perl, JavaScript, R) and integrate existing almost any open source or commercial tools using generic adaptation framework to integrate command line tools. Additionally session controlled web-API provides means to drive HIVE to perform data quality control and complex computations on behalf of remote users. Currently there are tens of big data analytics tools in production HIVE and dozens more being developed; these include but are not limited to DNA-, RNA-, Transposon-, Chip-, Immune-sequencing), de novo assembly, population genomics metagenomic sequencing, differential profiling, statistical, classification and clusterization utilities to study bacteria, viruses, human germline and somatic profiles, quasispecies, infections, pathogens.
Computations: Unlike many virtual computing environments, HIVE virtualizes services, not processes: it provides computations as a service by introducing agnostic abstraction layer between hardware, software and the computational tasks requested by users. The novel paradigm of relocating computations closer to the data, instead of moving data to computing cores has proven to be the key for optimal flow of tasks and data through network infrastructure.
Visualization: HIVE provides number of scientific visualization components using technologies as HTML5, SVG, D3JS within its Data Driven Document context. The native data and metadata and computational results provided in JSON, CSV-based communication protocols, which are used to generate interactive, user driven, customizable tools allow bioinformaticians to manipulate terabytes of extra-large data using only an Internet browser.

HIVE open source

FDA launched HIVE Open Source as a platform to support end to end needs for NGS analytics. https://github.com/FDA/fda-hive

HIVE biocompute harmonization platform is at the core of High-throughput Sequencing Computational Standards for Regulatory Sciences (HTS-CSRS) project. Its mission is to provide the scientific community with a framework to harmonize biocomputing, promote interoperability, and verify bioinformatics protocols (https://hive.biochemistry.gwu.edu/htscsrs). For more information, see the project description on the FDA Extramural Research page (https://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm

HIVE architecture

HIVE-hardware HIVE-hardware.png — HIVE-hardware

Hardware architecture: At the core of HIVE there exists a solid backbone hardware made of few redundant critical components and scalable compute and storage units. The diagram at the right demonstrates the connectivity and components assignations for such HIVE cluster. Core components providing the vital functions for HIVE cloud include
- web servers facing outside through the high-end secure firewall to support web-portal functionality;
- cloud servers are the core functional units orchestrating distributed storage and computations workflows through complex queuing and prioritization schemas;
- high availability drone hardware serves as a computational unit for scientific visualization and user interface support functionalities;
- ultra-fast inter-process communication storage units organize distributed computations data interchange staging arena.
- switches and firewall hardware organize the secure high performance network environment for HIVE cloud.
- permanent storage units each are designed to store hundreds of terabytes of NGS data and reference genomes as well as storage for computational results and personal user files.

Sub-clusters of scalable high performance high density compute cores are there to serve as a powerhouse for extra-large distributed parallelized computations of NGS algorithmics. System is extremely scalable and has deployment instances ranging from a single HIVE in a box appliance to massive enterprise level systems of thousands of compute units.

Software architecture: HIVE software infrastructure consists of layers incrementally providing more functionality.

HIVE software layers HIVE-software.png — HIVE software layers

- The Kernel backbone layer provides integration with heterogeneous hardware and operating system platforms.
- HIVE cloud backbone supports distributed storage, security and computing environment.
- Science backbone represents set of low level scientific libraries to perform variety of scientific computations, mathematical apparatus for chemical, biological, statistical and other purely scientific concepts
- CGI and Java-script layers provide web-portal and web-application compatibility layers.
- Low level libraries provide Application Programming Interface (API) for developing tools and utilities.
- Integrated apps provide major NGS tool arsenal
- Web-apps and HIVE –portal provide web-portal functionality

Public Presentations

Dr. Vahan Simonyan and Dr. Raja Mazumder presented at the NIH Frontiers in Data Science^[5] about HIVE acting as a bridge between research and regulatory analytics.^[6]^[7] Simonyan also presented on the topic at the 2014 Bio-IT World Expo.^[8]
HIVE was additionally discussed in FedScoop.^[9]
Inside the HIVE, the FDA's Multi-Omics Compute Architecture, BioIT World.^[10]

Related Research Articles

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application's state, so that applications can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.

Utility computing, or computer utility, is a service provisioning model in which a service provider makes computing resources and infrastructure management available to the customer as needed, and charges them for specific usage rather than a flat rate. Like other types of on-demand computing, the utility model seeks to maximize the efficient use of resources and/or minimize associated costs. Utility is the packaging of system resources, such as computation, storage and services, as a metered service. This model has the advantage of a low or no initial cost to acquire computer resources; instead, resources are essentially rented.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Desktop virtualization is a software technology that separates the desktop environment and associated application software from the physical client device that is used to access it.

Ceph is a free and open-source software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation. Ceph provides completely distributed operation without a single point of failure and scalability to the exabyte level, and is freely available. Since version 12 (Luminous), Ceph does not rely on any other conventional filesystem and directly manages HDDs and SSDs with its own storage backend BlueStore and can expose a POSIX filesystem.

In computing, virtualization or virtualisation in British English is the act of creating a virtual version of something at the same abstraction level, including virtual computer hardware platforms, storage devices, and computer network resources.

Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each of which is a data center. Cloud computing relies on sharing of resources to achieve coherence and typically uses a pay-as-you-go model, which can help in reducing capital expenses but may also lead to unexpected operating expenses for users.

Eucalyptus is a paid and open-source computer software for building Amazon Web Services (AWS)-compatible private and hybrid cloud computing environments, originally developed by the company Eucalyptus Systems. Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems. Eucalyptus enables pooling compute, storage, and network resources that can be dynamically scaled up or down as application workloads change. Mårten Mickos was the CEO of Eucalyptus. In September 2014, Eucalyptus was acquired by Hewlett-Packard and then maintained by DXC Technology. After DXC stopped developing the product in late 2017, AppScale Systems forked the code and started supporting Eucalyptus customers.

Luminex Software, Inc. is a developer and provider of mainframe connectivity, storage and data protection solutions, including virtual tape and data integration products.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.

openQRM is a free and open-source cloud-computing management platform for managing heterogeneous data centre infrastructures.

Software-defined storage (SDS) is a marketing term for computer data storage software for policy-based provisioning and management of data storage independent of the underlying hardware. Software-defined storage typically includes a form of storage virtualization to separate the storage hardware from the software that manages it. The software enabling a software-defined storage environment may also provide policy management for features such as data deduplication, replication, thin provisioning, snapshots and backup.

The Neuroimaging Tools and Resources Collaboratory is a neuroimaging informatics knowledge environment for MR, PET/SPECT, CT, EEG/MEG, optical imaging, clinical neuroinformatics, imaging genomics, and computational neuroscience tools and resources.

Cloud management is the management of cloud computing products and services.

Computation offloading is the transfer of resource intensive computational tasks to a separate processor, such as a hardware accelerator, or an external platform, such as a cluster, grid, or a cloud. Offloading to a coprocessor can be used to accelerate applications including: image rendering and mathematical calculations. Offloading computing to an external platform over a network can provide computing power and overcome hardware limitations of a device, such as limited computational power, storage, and energy.

PrecisionFDA is a secure, collaborative, high-performance computing platform that has established a growing community of experts around the analysis of biological datasets in order to advance precision medicine, inform regulatory science, and enable improvements in health outcomes. This cloud-based platform is developed and served by the United States Food and Drug Administration (FDA). PrecisionFDA connects experts, citizen scientists, and scholars from around the world and provides them with a library of computational tools, workflow features, and reference data. The platform allows researchers to upload and compare data against reference genomes, and execute bioinformatic pipelines. The variant call file (VCF) comparator tool also enables users to compare their genetic test results to reference genomes. The platform's code is open source and available on GitHub. The platform also features a crowdsourcing model to sponsor community challenges in order to stimulate the development of innovative analytics that inform precision medicine and regulatory science. Community members from around the world come together to participate in scientific challenges, solving problems that demonstrate the effectiveness of their tools, testing the capabilities of the platform, sharing their results, and engaging the community in discussions. Globally, precisionFDA has more than 5,000 users.

The BioCompute Object (BCO) project is a community-driven initiative to build a framework for standardizing and sharing computations and analyses generated from High-throughput sequencing. The project has since been standardized as IEEE 2791-2020, and the project files are maintained in an open source repository. The July 22nd, 2020 edition of the Federal Register announced that the FDA now supports the use of BioCompute in regulatory submissions, and the inclusion of the standard in the Data Standards Catalog for the submission of HTS data in NDAs, ANDAs, BLAs, and INDs to CBER, CDER, and CFSAN.

Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis, advised by Professor Scott Shenker & Professor Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.

References

↑ Simonyan, Vahan; Mazumder, Raja (2014). "High-Performance Integrated Virtual Environment (HIVE) Tools and Applications for Big Data Analysis". Genes. 5 (4): 957–81. doi: 10.3390/genes5040957 . PMC 4276921 . PMID 25271953.
↑ https://hive.biochemistry.gwu.edu/help/HIVEWhitePaper_12_16_2014.pdf%5B%5D
↑ https://hive.biochemistry.gwu.edu/help/HIVEInfrastructuresUK.pdf%5B%5D
↑ Wilson, C. A.; Simonyan, V. (2014). "FDA's Activities Supporting Regulatory Application of 'Next Gen' Sequencing Technologies". PDA Journal of Pharmaceutical Science and Technology. 68 (6): 626–30. doi:10.5731/pdajpst.2014.01024. PMID 25475637. S2CID 37583755.
↑ "NIH Login User Name and Password or PIV Card Authentication". Archived from the original on 2016-01-01. Retrieved 2016-02-01.
↑ "NIH VideoCast - High-Performance Integrated Virtual Environment (HIVE): A regulatory NGS data analysis platform". 29 January 2016.
↑ "NIH Login User Name and Password or PIV Card Authentication". Archived from the original on 2016-01-01. Retrieved 2016-02-01.
↑ Staff (2014). "2014-BIT-Brochure" (PDF). 2014 Bio-IT World Expo. Cambridge Healthtech Institute. p. 6 (col 2). Retrieved 15 June 2016. (title) High-Performance Integrated Virtual Environment (HIVE) Infrastructure for Big-Data Analysis: Applications to Next-Gen Sequencing Informatics
↑ http://fedscoop.com/fdas-examines-nextgen-sequencing-too%5B%5Dl
↑ "Bio-IT World".

External links

The public version of HIVE is at https://hive.biochemistry.gwu.edu/dna.cgi?cmd=about

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Simonyan, Vahan; Mazumder, Raja (2014). "High-Performance Integrated Virtual Environment (HIVE) Tools and Applications for Big Data Analysis". Genes. 5 (4): 957–81. doi: 10.3390/genes5040957 . PMC 4276921 . PMID 25271953.

[2] ttps://hive.biochemistry.gwu.edu/help/HIVEWhitePaper_12_16_2014.pdf%5B%5D

[3] ttps://hive.biochemistry.gwu.edu/help/HIVEInfrastructuresUK.pdf%5B%5D

[4] Wilson, C. A.; Simonyan, V. (2014). "FDA's Activities Supporting Regulatory Application of 'Next Gen' Sequencing Technologies". PDA Journal of Pharmaceutical Science and Technology. 68 (6): 626–30. doi:10.5731/pdajpst.2014.01024. PMID 25475637. S2CID 37583755.

[5] "NIH Login User Name and Password or PIV Card Authentication". Archived from the original on 2016-01-01. Retrieved 2016-02-01.

[6] "NIH VideoCast - High-Performance Integrated Virtual Environment (HIVE): A regulatory NGS data analysis platform". 29 January 2016.

[7] "NIH Login User Name and Password or PIV Card Authentication". Archived from the original on 2016-01-01. Retrieved 2016-02-01.

[8] Staff (2014). "2014-BIT-Brochure" (PDF). 2014 Bio-IT World Expo. Cambridge Healthtech Institute. p. 6 (col 2). Retrieved 15 June 2016. (title) High-Performance Integrated Virtual Environment (HIVE) Infrastructure for Big-Data Analysis: Applications to Next-Gen Sequencing Informatics

[9] ttp://fedscoop.com/fdas-examines-nextgen-sequencing-too%5B%5Dl

[10] "Bio-IT World".

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]