Journal of Big Data

Related Research Articles

IBM Db2: relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form. In the early days, it was sometimes wrongly styled as DB/2 in a false derivation from the operating system OS/2.

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.
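
As a concrete, if simplified, illustration of such a pipeline, here is a minimal extract-transform-load step in Python; the file names and field names (raw_events.csv, user_id, amount) are hypothetical.

```python
import csv
import json

def extract(path):
    """Read raw records from a CSV source (hypothetical file name and schema)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Clean and reshape the raw records so they are usable downstream."""
    cleaned = []
    for record in records:
        if not record.get("user_id"):                    # drop rows missing a key field
            continue
        cleaned.append({
            "user_id": record["user_id"],
            "amount": float(record.get("amount") or 0),  # normalize types
        })
    return cleaned

def load(records, path):
    """Write the processed records as JSON lines for analysts and models to consume."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "clean_events.jsonl")
```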

In computing, the Global File System 2 (GFS2) is a shared-disk file system for Linux computer clusters. GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage, in contrast to distributed file systems which distribute data throughout the cluster. GFS2 can also be used as a local file system on a single computer.

File system: computer filing system

In computing, a file system or filesystem governs file organization and access. A local file system is a capability of an operating system that services the applications running on the same computer. A distributed file system is a protocol that provides file access between networked computers.

Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. It was replaced by Colossus in 2010.

Novell Storage Services (NSS) is a file system used by the Novell NetWare network operating system. Support for NSS was added to SUSE Linux in 2004 via the low-level NCPFS network protocol. It has some unique features that make it especially useful for setting up shared volumes on a file server in a local area network.

Replication in computing refers to maintaining multiple copies of data, processes, or resources to ensure consistency across redundant components. This fundamental technique spans databases, file systems, and distributed systems, serving to improve availability, fault-tolerance, accessibility, and performance. Through replication, systems can continue operating when components fail (failover), serve requests from geographically distributed locations, and balance load across multiple machines. The challenge lies in maintaining consistency between replicas while managing the fundamental tradeoffs between data consistency, system availability, and network partition tolerance – constraints known as the CAP theorem.
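
One common way to manage that tradeoff is quorum replication. The sketch below is a minimal, in-memory Python model, not tied to any particular system: a write succeeds once at least w replicas acknowledge it, a read consults r replicas and returns the newest version, and choosing w + r greater than the number of replicas N forces the read and write sets to overlap.

```python
import time

class Replica:
    """A toy in-memory replica that stores (value, timestamp) per key."""
    def __init__(self):
        self.store = {}

    def write(self, key, value, ts):
        self.store[key] = (value, ts)

    def read(self, key):
        return self.store.get(key)

def quorum_write(replicas, key, value, w):
    """Send the write to every replica; succeed if at least w acknowledge.
    (In a real system each write is a network call that can fail or time out.)"""
    ts = time.time()
    acks = 0
    for replica in replicas:
        replica.write(key, value, ts)
        acks += 1
    return acks >= w

def quorum_read(replicas, key, r):
    """Read from r replicas and return the value with the newest timestamp."""
    versions = [rep.read(key) for rep in replicas[:r]]
    versions = [v for v in versions if v is not None]
    return max(versions, key=lambda v: v[1])[0] if versions else None

# With N = 3 replicas, choosing w + r > N makes read and write sets overlap.
replicas = [Replica() for _ in range(3)]
assert quorum_write(replicas, "user:1", "alice", w=2)
print(quorum_read(replicas, "user:1", r=2))  # -> alice
```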

Apache Hadoop is a collection of open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
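
The MapReduce model is easiest to see in the canonical word-count example. The sketch below simulates the map, shuffle, and reduce phases locally in Python; on a real Hadoop cluster the framework runs the user-supplied map and reduce functions across many nodes and handles the shuffle, scheduling, and failure recovery itself.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for a key into a single result."""
    return key, sum(values)

splits = ["big data needs big clusters", "move compute to the data"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # e.g. {'big': 2, 'data': 2, ...}
```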

Dell EMC Isilon: network-attached storage

Dell EMC Isilon is a scale-out network-attached storage platform offered by Dell EMC for high-volume storage, backup, and archiving of unstructured data. It provides a cluster-based storage array built on industry-standard hardware and is scalable to 50 petabytes in a single filesystem using its FreeBSD-derived OneFS file system.

Data: units of information

Data are a collection of discrete or continuous values that convey information, describing quantity, quality, facts, statistics, or other basic units of meaning, or simply sequences of symbols that may be further interpreted formally. A datum is an individual value in a collection of data. Data are usually organized into structures such as tables that provide additional context and meaning, and may themselves be used as data in larger structures. Data may be used as variables in a computational process and may represent abstract ideas or concrete measurements. Data are commonly used in scientific research, economics, and virtually every other form of human organizational activity. Examples of data sets include price indices, unemployment rates, literacy rates, and census data. In this context, data represent the raw facts and figures from which useful information can be extracted.

Vertica: software company

Vertica is an analytic database management software company. Vertica was founded in 2005 by the database researcher Michael Stonebraker, with Andrew Palmer as the founding CEO. Ralph Breslauer and Christopher P. Lynch later served as CEOs.

Big data: extremely large or complex datasets

Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing software. Data with many entries (rows) offer greater statistical power, while data with higher complexity may lead to a higher false discovery rate.
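
The false-discovery point can be illustrated with a small simulation (a sketch, assuming every tested feature is pure noise so its p-value is uniform on [0, 1]): at a fixed significance level, the expected number of spurious findings grows with the number of features tested.

```python
import random

random.seed(42)

def false_discoveries(n_features, alpha=0.05):
    """Assume every feature is pure noise, so its p-value is uniform on [0, 1];
    each one still clears the significance threshold with probability alpha."""
    p_values = [random.random() for _ in range(n_features)]
    return sum(p < alpha for p in p_values)

for n in (10, 1_000, 100_000):
    print(f"{n} noise features -> {false_discoveries(n)} spurious 'discoveries'")
# Roughly alpha * n false positives: wider data sets need multiple-testing control.
```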

Evercookie: JavaScript application programming interface

Evercookie is an open-source JavaScript application programming interface (API) that identifies and reproduces intentionally deleted cookies in a client's browser storage; a cookie restored this way is known as a zombie cookie. It was created by Samy Kamkar in 2010 to demonstrate the tracking made possible by websites that use respawning: sites that adopt this mechanism can identify users even if they attempt to delete the previously stored cookies.
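
The respawning mechanism can be modeled abstractly. The sketch below is a conceptual Python model, not the real Evercookie API (which runs as JavaScript across browser storage mechanisms such as cookies, localStorage, and caches); the storage-vector names are illustrative.

```python
class RespawningId:
    """Conceptual model of cookie respawning (not the real Evercookie API):
    the same identifier is copied into several storage vectors, and any
    surviving copy is used to repopulate the ones the user deleted."""

    VECTORS = ["http_cookie", "local_storage", "indexed_db", "cache_etag"]

    def __init__(self):
        self.stores = {v: None for v in self.VECTORS}

    def set_id(self, value):
        for v in self.VECTORS:      # write the identifier everywhere
            self.stores[v] = value

    def clear(self, vectors):
        for v in vectors:           # the user deletes some, but rarely all, stores
            self.stores[v] = None

    def respawn(self):
        surviving = next((x for x in self.stores.values() if x), None)
        if surviving:
            self.set_id(surviving)  # rewrite every vector from the survivor
        return surviving

tracker = RespawningId()
tracker.set_id("user-1234")
tracker.clear(["http_cookie", "local_storage"])
print(tracker.respawn())            # "user-1234": the identifier comes back
```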

Open Compute Project: organization that shares designs of data center products

The Open Compute Project (OCP) is an organization that facilitates the sharing of data center product designs and industry best practices among companies. Founded in 2011, OCP has significantly influenced the design and operation of large-scale computing facilities worldwide.

In computing, a distributed file system (DFS) or network file system is any file system that allows access from multiple hosts to files shared via a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources.

HP Cloud: set of cloud computing services

HP Cloud was a set of cloud computing services available from Hewlett-Packard. It combined the previous HP Converged Cloud business unit with HP Cloud Services, an OpenStack-based public cloud. From around 2011 to 2016, it was marketed to enterprise organizations as a way to combine public cloud services with internal IT resources to create hybrid clouds, a mix of private and public cloud environments.

Apache Drill: open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly through contributions from developers at MapR, Drill is inspired by Google's Dremel system. Tom Shiran is the founder of the Apache Drill project, which was designated an Apache Software Foundation top-level project in December 2016.

The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare the relative performance of NoSQL database management systems.
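
YCSB itself is a Java suite driven by workload property files that set the record count, operation count, read/update mix, and request distribution. The Python sketch below only mimics that structure against a toy in-memory store to show what such a workload measures; the class and parameter names are illustrative, not YCSB's API.

```python
import random
import time

class ToyKeyValueStore:
    """Stand-in for the database under test; not a real NoSQL client."""
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)
    def update(self, key, value):
        self.data[key] = value

def run_workload(store, n_records=1_000, n_ops=10_000, read_ratio=0.95):
    """Roughly a read-heavy mix (like YCSB's Workload B): 95% reads, 5% updates."""
    for i in range(n_records):                      # load phase
        store.update(f"user{i}", {"field0": "x" * 100})
    latencies = []
    for _ in range(n_ops):                          # run phase
        key = f"user{random.randrange(n_records)}"  # YCSB usually picks keys with a Zipfian skew
        start = time.perf_counter()
        if random.random() < read_ratio:
            store.read(key)
        else:
            store.update(key, {"field0": "y" * 100})
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

print("average operation latency (s):", run_workload(ToyKeyValueStore()))
```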

Apple File System (APFS) is a proprietary file system developed and deployed by Apple Inc. for macOS Sierra (10.12.4) and later, iOS 10.3, tvOS 10.2, watchOS 3.2, and all versions of iPadOS. It aims to fix core problems of HFS+, APFS's predecessor on these operating systems. APFS is optimized for solid-state drive storage and supports encryption, snapshots, and increased data integrity, among other capabilities.

Alluxio is an open-source virtual distributed file system (VDFS). Initially developed as the research project Tachyon, Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. thesis, advised by Professor Scott Shenker and Professor Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.
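
The idea of a common interface over many storage systems can be sketched as follows. This is not Alluxio's actual API; the class names and path are hypothetical, but the structure shows how a layer between computation and storage can cache data and hide which backend holds it.

```python
from abc import ABC, abstractmethod

class UnderStore(ABC):
    """Common interface the abstraction layer exposes over storage systems."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalStore(UnderStore):
    """One possible backend; object stores or distributed file systems would implement the same interface."""
    def __init__(self):
        self.blobs = {}
    def read(self, path):
        return self.blobs[path]
    def write(self, path, data):
        self.blobs[path] = data

class CachingLayer(UnderStore):
    """Sits between applications and a slower under-store, serving repeated
    reads from memory while passing writes through to the backend."""
    def __init__(self, backend: UnderStore):
        self.backend = backend
        self.cache = {}
    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.backend.read(path)
        return self.cache[path]
    def write(self, path, data):
        self.cache[path] = data
        self.backend.write(path, data)

fs = CachingLayer(LocalStore())
fs.write("/datasets/events.bin", b"...")
print(fs.read("/datasets/events.bin"))  # served from cache on repeated reads
```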
