Data-centric computing

Data-centric computing is an emerging concept that has relevance in information architecture and data center design. It describes an information system in which data is stored independently of the applications, so that applications can be upgraded without costly and complicated data migration. This is a radical shift in information systems that will be needed to address organizational needs for storing, retrieving, moving and processing exponentially growing data sets. [1]

Background

Traditional information system architectures are based on an application-centric mindset: applications were installed, kept relatively static, updated infrequently, and used a fixed set of compute, storage, and networking elements to cope with a relatively small set of structured data. [2]

This approach functioned well for decades, but over the past decade data growth, particularly unstructured data growth, has put new pressure on organizations, information architectures and data center infrastructure. 90% of new data is unstructured and, according to a 2018 report, 59% of organizations manage over 10 billion files and objects [3] spread over large numbers of servers and storage nodes. Organizations are struggling to cope with exponential data growth while seeking better approaches to extracting insights from that data using services such as big data analytics and machine learning. However, existing architectures are not built to address service requirements at petabyte scale and beyond without significant performance limits. [4]

Traditional architectures fail to fully store, retrieve, move and utilize that data due to limitations of hardware infrastructure as well as application-centric systems design, development, and management. [5]

Data-centric workloads

There are two problems data-centric computing aims to address.

  1. Organizations need to utilize all available data, but traditional applications are not sufficiently agile or flexible. New shifts toward constant service innovation, supported by emerging approaches to service delivery (including microservices and containers), open new possibilities that step away from traditional application-centric mindsets.
  2. Existing limits of data center hardware also restrict the complete movement, management and utilization of unstructured data sets. Conventional CPUs impede performance because they do not include specialized capabilities needed for storage, networking, and analysis. [6] Slow storage, including hard drives and SAS/SATA solid-state drives accessed over the network, can reduce performance and limit data accessibility. [7] New hardware capabilities are needed (see the back-of-envelope sketch after this list).
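
The network constraint in the second point can be made concrete with a rough back-of-envelope estimate. The sketch below is illustrative only; the data volume and link speeds are assumptions chosen for the example, not figures taken from the cited sources.

```python
# Illustrative estimate of how long it takes to relocate a large data set
# over a network link. All numbers are assumptions chosen for the example.

def transfer_days(data_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move data_bytes over a link of link_bits_per_sec."""
    seconds = (data_bytes * 8) / link_bits_per_sec
    return seconds / 86_400  # seconds per day

if __name__ == "__main__":
    one_petabyte = 1e15                  # bytes (decimal petabyte)
    for gbit in (10, 100):               # common data center link speeds
        days = transfer_days(one_petabyte, gbit * 1e9)
        print(f"1 PB over a {gbit} Gb/s link: about {days:.1f} days")
```

Even at full line rate with no protocol overhead, a petabyte-scale data set takes roughly a day to more than a week to relocate, which is why data-centric designs favor moving computation to the data rather than the reverse.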

Data-centric computing

Data-centric computing is an approach that merges innovative hardware and software to treat data, not applications, as the permanent source of value. [8] It aims to rethink both hardware and software to extract as much value as possible from existing and new data sources, and it increases agility by prioritizing data transfer and data computation over static application performance and resilience.

Data-centric hardware and software

To meet the goals of data-centric computing, data center hardware infrastructure will evolve to address massive scale, rapid growth, the need for very high performance data movement, and extensive calculation requirements.

On the software side, data-centric computing accelerates the disappearance of traditional static applications. [12] Applications become short-lived, constantly added, updated, or removed as algorithms come and go. Software is redesigned to analyze all available data instead of subsets. Microservices visit the data, perform calculations and report their results at speeds beyond conventional approaches.
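
As a rough illustration of the "microservices visit the data" idea, the following sketch shows a short-lived job that runs next to a shared data store, computes a small aggregate in place, and writes back only the result before exiting. The directory layout and JSON-lines format are hypothetical stand-ins for whatever shared storage a real deployment would use.

```python
# Minimal sketch of a short-lived, data-local microservice: it reads records
# from a shared store, computes an aggregate, writes the small result, and exits.
# The paths and record format are assumptions made for this example.
import json
from pathlib import Path

DATA_DIR = Path("/shared/store/events")                    # hypothetical data location
RESULT_PATH = Path("/shared/store/results/daily_total.json")

def run_once() -> None:
    total = 0.0
    count = 0
    for part in sorted(DATA_DIR.glob("*.jsonl")):          # visit data where it lives
        with part.open() as f:
            for line in f:
                record = json.loads(line)
                total += record.get("value", 0)
                count += 1
    RESULT_PATH.parent.mkdir(parents=True, exist_ok=True)
    # Only the small aggregate leaves the data's location, not the raw records.
    RESULT_PATH.write_text(json.dumps({"count": count, "sum": total}))

if __name__ == "__main__":
    run_once()
```

Because the job is stateless and short-lived, it can be replaced by a newer version at any time without migrating the underlying data, which is the behavior this section describes.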

Related Research Articles

Thin client: Non-powerful computer optimized for remote server access

In computer networking, a thin client, sometimes called slim client or lean client, is a simple (low-performance) computer that has been optimized for establishing a remote connection with a server-based computing environment. They are sometimes known as network computers, or in their simplest form as zero clients. The server does most of the work, which can include launching software programs, performing calculations, and storing data. This contrasts with a rich client or a conventional personal computer; the former is also intended for working in a client–server model but has significant local processing power, while the latter aims to perform its function mostly locally.

Quantum Corporation is a data storage, management, and protection company that provides technology to store, manage, archive, and protect video and unstructured data throughout the data life cycle. Its products are used by enterprises, media and entertainment companies, government agencies, big data companies, and life science organizations. Quantum is headquartered in San Jose, California and has offices around the world, supporting customers globally in addition to working with a network of distributors, VARs, DMRs, OEMs and other suppliers.

NetApp, Inc. is an intelligent data infrastructure company that provides unified data storage, integrated data services, and cloud operations (CloudOps) solutions to enterprise customers. The company is based in San Jose, California, and ranked in the Fortune 500 from 2012 to 2021. Founded in 1992 with an initial public offering in 1995, NetApp offers cloud data services for management of applications and data both online and physically.

The Pittsburgh Supercomputing Center (PSC) is a high performance computing and networking center founded in 1986 and one of the original five NSF Supercomputing Centers. PSC is a joint effort of Carnegie Mellon University and the University of Pittsburgh in Pittsburgh, Pennsylvania, United States.

The Texas Advanced Computing Center (TACC) at the University of Texas at Austin, United States, is an advanced computing research center that provides comprehensive advanced computing resources and support services to researchers in Texas and across the U.S. The mission of TACC is to enable discoveries that advance science and society through the application of advanced computing technologies. Specializing in high performance computing, scientific visualization, data analysis & storage systems, software, research & development and portal interfaces, TACC deploys and operates advanced computational infrastructure to enable the research activities of faculty, staff, and students of UT Austin. TACC also provides consulting, technical documentation, and training to support researchers who use these resources. TACC staff members conduct research and development in applications and algorithms, computing systems design/architecture, and programming tools and environments.

In computing, virtualization (or virtualisation in British English) is the act of creating a virtual version of something at the same abstraction level, including virtual computer hardware platforms, storage devices, and computer network resources.

Cloud computing: Form of shared internet-based computing

Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each of which is a data center. Cloud computing relies on sharing of resources to achieve coherence and typically uses a pay-as-you-go model, which can help in reducing capital expenses but may also lead to unexpected operating expenses for users.

Dynamic Infrastructure is an information technology concept related to the design of data centers, whereby the underlying hardware and software can respond dynamically and more efficiently to changing levels of demand. In other words, data center assets such as storage and processing power can be provisioned to meet surges in users' needs. The concept has also been referred to as Infrastructure 2.0 and Next Generation Data Center.

Converged infrastructure: Way of structuring an IT system

Converged infrastructure is a way of structuring an information technology (IT) system which groups multiple components into a single optimized computing package. Components of a converged infrastructure may include servers, data storage devices, networking equipment and software for IT infrastructure management, automation and orchestration.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data, typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications deemed data-intensive require large volumes of data and devote most of their processing time to I/O and the manipulation of data.
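
A minimal sketch of the data-parallel pattern described above, assuming a toy word-count workload: the input is split into chunks, each chunk is processed independently, and the partial results are merged.

```python
# Sketch of data-parallel processing: independent workers each handle one chunk
# of the input, and their partial results are combined at the end.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk: list[str]) -> Counter:
    """Process one chunk independently of all the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def parallel_word_count(lines: list[str], workers: int = 4) -> Counter:
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    total = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)        # merge the partial results
    return total

if __name__ == "__main__":
    sample = ["data centric computing", "data parallel processing"] * 1000
    print(parallel_word_count(sample).most_common(3))
```

In a genuinely data-intensive setting the chunks would be terabyte-scale partitions on distributed storage rather than an in-memory list, but the split-process-merge structure is the same.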

Converged storage

Converged storage is a storage architecture that combines storage and computing resources into a single entity. This can result in the development of platforms for server centric, storage centric or hybrid workloads where applications and data come together to improve application performance and delivery. The combination of storage and compute differs from the traditional IT model, in which computation and storage take place in separate or siloed computer equipment. The traditional model requires discrete provisioning changes, such as upgrades and planned migrations, in the face of server load changes, which are increasingly dynamic with virtualization, whereas converged storage increases the supply of resources in parallel with new VM demands.

Software-defined storage (SDS) is a marketing term for computer data storage software for policy-based provisioning and management of data storage independent of the underlying hardware. Software-defined storage typically includes a form of storage virtualization to separate the storage hardware from the software that manages it. The software enabling a software-defined storage environment may also provide policy management for features such as data deduplication, replication, thin provisioning, snapshots and backup.

HP Cloud: Set of cloud computing services

HP Cloud was a set of cloud computing services available from Hewlett-Packard. It was the combination of the previous HP Converged Cloud business unit and HP Cloud Services, an OpenStack-based public cloud. It was marketed to enterprise organizations to combine public cloud services with internal IT resources to create hybrid clouds, or a mix of private and public cloud environments, from around 2011 to 2016.

Data defined storage is a marketing term for managing, protecting, and realizing value from data by combining application, information and storage tiers.

DOME is a Dutch government-funded project between IBM and ASTRON in the form of a public-private partnership focusing on the Square Kilometre Array (SKA), the world's largest planned radio telescope. SKA will be built in Australia and South Africa. The DOME project objective is technology roadmap development that applies both to SKA and IBM. The 5-year project started in 2012 and was co-funded by the Dutch government, IBM Research in Zürich, Switzerland, and ASTRON in the Netherlands. The project officially ended on 30 September 2017.

In software engineering, a microservice architecture is a variant of the service-oriented architecture structural style. It is an architectural pattern that arranges an application as a collection of loosely coupled, fine-grained services, communicating through lightweight protocols. One of its goals is that teams can develop and deploy their services independently of others. This is achieved by the reduction of several dependencies in the code base, allowing developers to evolve their services with limited restrictions from users, and for additional complexity to be hidden from users. As a consequence, organizations are able to develop software with fast growth and size, as well as use off-the-shelf services more easily. Communication requirements are reduced. These benefits come at a cost to maintaining the decoupling. Interfaces need to be designed carefully and treated as a public API. One technique that is used is having multiple interfaces on the same service, or multiple versions of the same service, so as to not disrupt existing users of the code.
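
The "multiple versions of the same service" technique mentioned above can be sketched as follows. The handlers, routes, and payload shapes are invented for the illustration; a real system would usually perform this routing at an API gateway rather than in application code.

```python
# Sketch of serving two versions of the same interface side by side, so that
# existing v1 clients keep working while new clients adopt v2.

def get_user_v1(user_id: int) -> dict:
    # Original contract: a single flat "name" field.
    return {"id": user_id, "name": "Ada Lovelace"}

def get_user_v2(user_id: int) -> dict:
    # Newer contract: a structured name, offered without breaking v1 callers.
    return {"id": user_id, "name": {"first": "Ada", "last": "Lovelace"}}

# Both versions stay registered until the last v1 clients have migrated.
ROUTES = {
    ("GET", "/v1/users"): get_user_v1,
    ("GET", "/v2/users"): get_user_v2,
}

def dispatch(method: str, path: str, user_id: int) -> dict:
    return ROUTES[(method, path)](user_id)

if __name__ == "__main__":
    print(dispatch("GET", "/v1/users", 7))   # old clients are unaffected
    print(dispatch("GET", "/v2/users", 7))   # new clients see the new shape
```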

In computing, energy proportionality is a measure of the relationship between power consumed in a computer system, and the rate at which useful work is done. If the overall power consumption is proportional to the computer's utilization, then the machine is said to be energy proportional. Equivalently stated, for an idealized energy proportional computer, the overall energy per operation is constant for all possible workloads and operating conditions.
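
The definition above can be stated as a short worked example. The notation (utilization u, power P(u), throughput R(u)) is introduced here for the illustration and is not taken from the cited text.

```latex
% Energy per operation as a function of utilization u in (0, 1],
% assuming throughput scales with utilization: R(u) = u * R_max.
\[
  E_{\mathrm{op}}(u) = \frac{P(u)}{R(u)}
\]
% Ideal, energy-proportional machine:
\[
  P(u) = u\,P_{\max}
  \;\Rightarrow\;
  E_{\mathrm{op}}(u) = \frac{u\,P_{\max}}{u\,R_{\max}} = \frac{P_{\max}}{R_{\max}}
  \quad \text{(constant for every workload level)}
\]
% Machine with non-zero idle power:
\[
  P(u) = P_{\mathrm{idle}} + u\,(P_{\max} - P_{\mathrm{idle}})
  \;\Rightarrow\;
  E_{\mathrm{op}}(u) = \frac{P_{\mathrm{idle}}}{u\,R_{\max}} + \frac{P_{\max} - P_{\mathrm{idle}}}{R_{\max}}
\]
```

The second expression grows as utilization falls, so a machine with high idle power spends more energy per operation when lightly loaded, which is exactly what energy proportionality rules out.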

Dell Technologies PowerFlex: Software-defined storage product

Dell Technologies PowerFlex is a commercial software-defined storage product from Dell Technologies that creates a server-based storage area network (SAN) from local server storage using x86 servers. It converts this direct-attached storage into shared block storage that runs over an IP-based network.

In-situ processing also known as in-storage processing (ISP) is a computer science term that refers to processing data where it resides. In-situ means "situated in the original, natural, or existing place or position." An in-situ process processes data where it is stored, such as in solid-state drives (SSDs) or memory devices like NVDIMM, rather than sending the data to a computer's central processing unit (CPU).

Disaggregated storage is a type of data storage within computer data centers. It allows compute resources within a computer server to be separated from storage resources without modifying any physical connections.

References

  1. "The Data-Centric Revolution". TDAN.com. September 2015. Retrieved 2019-12-07.
  2. Bhageshpur, Kiran (2016-10-06). "The Emergence Of Data-Centric Computing". The Next Platform. Retrieved 2019-12-07.
  3. Bhageshpur, Kiran. "2018 State of Unstructured Data Management" (PDF). Igneous. Archived from the original (PDF) on July 18, 2020. Retrieved December 7, 2019.
  4. "Requirements for Unstructured Data at Petabyte Scale". StorageSwiss.com - The Home of Storage Switzerland. 2019-10-14. Retrieved 2019-12-07.
  5. Davidson, George S.; Boyack, Kevin W.; Zacharski, Ron A.; Helmreich, Stephen C.; Cowie, Jim R. (April 2006). "Data-Centric Computing with the Netezza Architecture" (PDF). sandia.gov. Retrieved December 7, 2019.
  6. "Why We Need Open, Data-Centric Computing Architectures". Retrieved 2019-12-07.
  7. "The Network is the New Storage Bottleneck". Datanami. 2016-11-10. Retrieved 2019-12-07.
  8. "Data-Centric Manifesto". datacentricmanifesto.org. Retrieved 2019-12-07.
  9. Simonite, Tom. "The foundation of the computing industry's innovation is faltering. What can replace it?". MIT Technology Review. Retrieved 2019-12-07.
  10. "DPU: Data Processing Unit Programmable Processor". Fungible. Archived from the original on 2020-08-05. Retrieved 2019-12-07.
  11. Kieran, Mike (2019-03-21). "When You're Implementing NVMe Over Fabrics, the Fabric Really Matters". NetApp Blog. Retrieved 2019-12-07.
  12. "Microservices Momentum Accelerates". DevOps.com. 2018-05-10. Retrieved 2019-12-07.