In-memory processing

The term is used for two different things:

  1. In computer science, in-memory processing, also known as processing-in-memory (PIM), is a computer architecture in which data operations are performed directly in the memory holding the data, rather than requiring the data to be transferred to CPU registers first. [1] This may improve power usage and performance by reducing the movement of data between the processor and main memory.
  2. In software engineering, in-memory processing is a software architecture where a database is kept entirely in random-access memory (RAM) or flash memory so that usual accesses, in particular read or query operations, do not require access to disk storage. [2] This may allow faster data operations such as "joins", and faster reporting and decision-making in business. [3]

Extremely large datasets may be divided between co-operating systems as in-memory data grids.

Hardware (PIM)

PIM can be implemented in several ways, for example by integrating processing logic directly into memory chips or by using memory devices themselves, such as memristors, to perform computation. [4]

Application of in-memory technology in everyday life

In-memory processing techniques are frequently used by modern smartphones and tablets to improve application performance. This can result in faster app loading times and a smoother user experience.

Software

Disk-based data access

Data structures

With disk-based technology, data is loaded onto the computer's hard disk in the form of multiple tables and multidimensional structures against which queries are run. Disk-based technologies are often relational database management systems (RDBMS), typically based on the structured query language (SQL), such as SQL Server, MySQL, Oracle and many others. RDBMS are designed for the requirements of transactional processing; in a database that supports insertions and updates as well as aggregations, the joins typical of BI solutions tend to be very slow. Another drawback is that SQL is designed to fetch whole rows of data efficiently, while BI queries usually fetch only parts of rows and involve heavy calculations.
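
As a rough illustration of the join-and-aggregate pattern described above, the following Python sketch runs a BI-style query against a disk-based database using the standard sqlite3 module; the file name, schema and data are hypothetical:

```python
import sqlite3

# Hypothetical schema: a transactional fact table ("sales") joined to a
# dimension table ("products"), then aggregated -- the JOIN/GROUP BY pattern
# typical of BI workloads running against an RDBMS.
conn = sqlite3.connect("sales.db")  # disk-based database file
cur = conn.cursor()
cur.executescript("""
    DROP TABLE IF EXISTS products;
    DROP TABLE IF EXISTS sales;
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL);
""")
cur.executemany("INSERT INTO products VALUES (?, ?)", [(1, "books"), (2, "games")])
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (1, 12.5), (2, 30.0)])
conn.commit()

# A BI-style query: join the fact table to the dimension table and aggregate.
for category, revenue in cur.execute("""
        SELECT p.category, SUM(s.amount)
        FROM sales AS s JOIN products AS p ON p.id = s.product_id
        GROUP BY p.category"""):
    print(category, revenue)

conn.close()
```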

To improve query performance, multidimensional databases or OLAP cubes - also called multidimensional online analytical processing (MOLAP) - may be constructed. Designing a cube may be an elaborate and lengthy process, and changing the cube's structure to adapt to dynamically changing business needs may be cumbersome. Cubes are pre-populated with data to answer specific queries and although they increase performance, they are still not optimal for answering all ad-hoc queries. [9]
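
To make the idea of pre-aggregation concrete, here is a minimal Python sketch of a cube-like summary keyed by two hypothetical dimensions (region and month). Real MOLAP engines are far more sophisticated, but the principle of answering known queries by lookup rather than by scanning is the same:

```python
from collections import defaultdict

# Hypothetical fact rows: (region, month, revenue).
facts = [("EU", "2024-01", 100.0), ("EU", "2024-02", 150.0),
         ("US", "2024-01", 200.0), ("US", "2024-02", 250.0)]

# The "cube" here is simply every pre-computed roll-up of the two dimensions,
# so that anticipated queries become dictionary lookups instead of scans.
cube = defaultdict(float)
for region, month, revenue in facts:
    cube[(region, month)] += revenue   # finest grain
    cube[(region, "*")] += revenue     # roll-up over month
    cube[("*", month)] += revenue      # roll-up over region
    cube[("*", "*")] += revenue        # grand total

print(cube[("EU", "*")])        # 250.0 -- total EU revenue, answered by lookup
print(cube[("*", "2024-01")])   # 300.0 -- total January revenue
```

A query the cube was not built for (say, revenue per day) still requires going back to the detailed data, which is why cubes are not optimal for all ad-hoc queries.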

Information technology (IT) staff may spend substantial development time on optimizing databases, constructing indexes and aggregates, designing cubes and star schemas, data modeling, and query analysis. [10]

Processing speed

Reading data from the hard disk is much slower (possibly hundreds of times) than reading the same data from RAM. Especially when analyzing large volumes of data, performance is severely degraded. Though SQL is a very powerful tool, arbitrarily complex queries on a disk-based implementation take a relatively long time to execute and often degrade the performance of transactional processing. In order to obtain results within an acceptable response time, many data warehouses have been designed to pre-calculate summaries and answer only specific queries. Optimized aggregation algorithms are needed to increase performance.
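
The gap can be observed, very roughly, with a small Python benchmark that runs the same SQLite workload against a disk file and against an in-memory database. The measured ratio depends heavily on hardware, data size and the operating system's file cache, so the numbers are only indicative:

```python
import sqlite3
import time

def build_and_query(target):
    """Load rows into the target database, then time one aggregate query."""
    conn = sqlite3.connect(target)
    conn.execute("DROP TABLE IF EXISTS t")
    conn.execute("CREATE TABLE t (k INTEGER, v REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i % 100, float(i)) for i in range(200_000)))
    conn.commit()
    start = time.perf_counter()
    conn.execute("SELECT k, SUM(v) FROM t GROUP BY k").fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print("disk file :", build_and_query("bench.db"))   # hypothetical file name
print("in memory :", build_and_query(":memory:"))
```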

In-memory data access

With both in-memory databases and data grids, all information is initially loaded into RAM or flash memory instead of residing on hard disks. With a data grid, processing can occur up to three orders of magnitude faster than with relational databases, whose advanced functionality, such as ACID guarantees, trades performance for the additional capability. The arrival of column-centric databases, which store similar information together, allows data to be stored more efficiently and with greater compression ratios. This allows huge amounts of data to be stored in the same physical space, reducing the amount of memory needed to perform a query and increasing processing speed. Many users and software vendors have integrated flash memory into their systems to allow systems to scale to larger data sets more economically.
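
A minimal sketch of why column-centric storage compresses well: grouping similar values together makes even simple schemes such as run-length encoding effective. The records and column names below are hypothetical:

```python
# The same hypothetical records laid out row-wise and column-wise.
rows = [("EU", "books", 10.0), ("EU", "books", 12.5), ("EU", "games", 30.0)]

# A column-centric layout stores each attribute contiguously,
# grouping similar values together.
columns = {
    "region":   [r[0] for r in rows],   # ["EU", "EU", "EU"]
    "category": [r[1] for r in rows],   # ["books", "books", "games"]
    "amount":   [r[2] for r in rows],
}

def run_length_encode(values):
    """Toy compression scheme that rewards runs of repeated adjacent values."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

print(run_length_encode(columns["region"]))    # [('EU', 3)] -- 3 values, 1 entry
print(run_length_encode(columns["category"]))  # [('books', 2), ('games', 1)]
```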

Users query the data loaded into the system's memory, thereby avoiding slower database access and performance bottlenecks. This differs from caching, a very widely used method of speeding up query performance, in that a cache holds only a pre-defined, organized subset of the data. With in-memory tools, the data available for analysis can be as large as a data mart or small data warehouse held entirely in memory. This data can be accessed quickly by multiple concurrent users or applications at a detailed level, and offers the potential for enhanced analytics and for scaling and increasing the speed of an application. Theoretically, the improvement in data access speed is 10,000 to 1,000,000 times compared to disk.[citation needed] It also minimizes the need for performance tuning by IT staff and provides faster service for end users.
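
The distinction can be sketched in a few lines of Python: a cache only accelerates queries that have been asked before, whereas a dataset held entirely in memory can answer arbitrary new queries directly. The schema and data below are hypothetical:

```python
import sqlite3
from functools import lru_cache

# Full dataset held in memory: any ad-hoc query can run against it directly.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE sales (region TEXT, amount REAL)")
mem.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])

# A cache, by contrast, only holds results for queries seen before.
@lru_cache(maxsize=128)
def cached_query(sql):
    return tuple(mem.execute(sql).fetchall())

ad_hoc = "SELECT region, SUM(amount) FROM sales GROUP BY region"
print(cached_query(ad_hoc))   # first call: computed against the in-memory data
print(cached_query(ad_hoc))   # second call: served from the cache
```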

Advantages of in-memory processing technology

Certain developments in computer technology and business needs have tended to increase the relative advantages of in-memory technology. [11]

  • Following Moore's law, the number of transistors on an integrated circuit doubles roughly every two years. This is reflected in changes to the price, performance, packaging and capabilities of components. Random-access memory prices and CPU computing power in particular have improved over the decades. CPU processing, memory and disk storage are all subject to some variation of this law. In addition, hardware innovations such as multi-core architectures, NAND flash memory, parallel servers, and increased memory processing capability have contributed to the technical and economic feasibility of in-memory approaches.
  • In turn, software innovations such as column-centric databases, compression techniques and the handling of aggregate tables have enabled efficient in-memory products. [12]
  • The advent of 64-bit operating systems, which can address far more RAM (100 GB or more) than the 2 or 4 GB accessible on 32-bit systems. By providing terabytes (1 TB = 1,024 GB) of addressable space for storage and analysis, 64-bit operating systems make in-memory processing scalable. The use of flash memory enables systems to scale to many terabytes more economically.
  • Increasing volumes of data have meant that traditional data warehouses may be less able to process the data in a timely and accurate way. The extract, transform, load (ETL) process that periodically updates disk-based data warehouses with operational data may result in lags and stale data. In-memory processing may enable faster access to terabytes of data for better real time reporting.
  • In-memory processing may be available at a lower cost than disk-based processing, and can be more easily deployed and maintained. According to a Gartner survey, [13] deploying traditional BI tools can take as long as 17 months.
  • Decreases in power consumption and increases in throughput due to a lower access latency, and greater memory bandwidth and hardware parallelism. [14]

Application in business

A range of in-memory products provide the ability to connect to existing data sources and to access visually rich interactive dashboards. This allows business analysts and end users to create custom reports and queries without much training or expertise. Easy navigation and the ability to modify queries on the fly benefit many users. Since these dashboards can be populated with fresh data, users have access to real-time data and can create reports within minutes. In-memory processing may be of particular benefit in call centers and warehouse management.

With in-memory processing, the source database is queried only once, instead of being accessed every time a query is run, thereby eliminating repetitive processing and reducing the burden on database servers. By scheduling the in-memory database to be populated overnight, the database servers can be used for operational purposes during peak hours.
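
A minimal sketch of this pattern using Python's sqlite3 backup API; the file name "operational.db" is hypothetical, and a production system would use a proper scheduler and a dedicated in-memory store rather than a single script:

```python
import sqlite3

# Scheduled load (e.g., run once overnight): copy the operational database
# into an in-memory copy with a single bulk read of the source.
source = sqlite3.connect("operational.db")   # hypothetical source database
in_memory = sqlite3.connect(":memory:")
source.backup(in_memory)
source.close()

# During peak hours, reporting queries hit only the in-memory copy,
# leaving the operational database server free for transactional work.
for (table_name,) in in_memory.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(table_name)
```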

Adoption of in-memory technology

With a large number of users, a large amount of RAM is needed for an in-memory configuration, which in turn affects the hardware costs. The investment is more likely to be suitable in situations where speed of query response is a high priority, and where there is significant growth in data volume and increase in demand for reporting facilities; it may still not be cost-effective where information is not subject to rapid change. Security is another consideration, as in-memory tools expose huge amounts of data to end users. Makers advise ensuring that only authorized users are given access to the data.

Related Research Articles

<span class="mw-page-title-main">Computer data storage</span> Storage of digital data readable by computers

Computer data storage or digital data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules either as single jobs or aggregated into a batch of jobs.

Essbase is a multidimensional database management system (MDBMS) that provides a platform upon which to build analytic applications. Essbase began as a product from Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007. Until late 2005 IBM also marketed an OEM version of Essbase as DB2 OLAP Server.

An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism. In-memory databases are faster than disk-optimized databases because disk access is slower than memory access and the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.

The Access Database Engine is a database engine on which several Microsoft products have been built. The first version of Jet was developed in 1992, consisting of three modules which could be used to manipulate a database.

Database tuning describes a group of activities used to optimize and homogenize the performance of a database. It usually overlaps with query tuning, but refers to design of the database files, selection of the database management system (DBMS) application, and configuration of the database's environment.

SAP IQ is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.

In computing, the term data warehouse appliance (DWA) was coined by Foster Hinshaw for a computer architecture for data warehouses (DW) specifically marketed for big data analysis and discovery that is simple to use and has a high performance for the workload. A DWA includes an integrated set of servers, storage, operating systems, and databases.

Microsoft SQL Server is a proprietary relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

<span class="mw-page-title-main">Netezza</span> Provider of Integrated Data Warehouse Hardware and Software

IBM Netezza is a subsidiary of American technology company IBM that designs and markets high-performance data warehouse appliances and advanced analytics applications for the most demanding analytic uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning.

BigQuery is a managed, serverless data warehouse product by Google, offering scalable analysis over large quantities of data. It is a Platform as a Service (PaaS) that supports querying using a dialect of SQL. It also has built-in machine learning capabilities. BigQuery was announced in May 2010 and made generally available in November 2011.

eXtremeDB is a high-performance, low-latency, ACID-compliant embedded database management system using an in-memory database system (IMDS) architecture and designed to be linked into C/C++ based programs. It works on Windows, Linux, and other real-time and embedded operating systems.

<span class="mw-page-title-main">Actian Vector</span>

Actian Vector is an SQL relational database management system designed for high performance in analytical database applications. It published record breaking results on the Transaction Processing Performance Council's TPC-H benchmark for database sizes of 100 GB, 300 GB, 1 TB and 3 TB on non-clustered hardware.

<span class="mw-page-title-main">SingleStore</span> Database management system

SingleStore is a proprietary, cloud-native database designed for data-intensive applications. A distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support, it is known for speed in data ingest, transaction processing, and query processing.

Transbase is a relational database management system, developed and maintained by Transaction Software GmbH, Munich. The development of Transbase was started in the 1980s by Rudolf Bayer under the name "Merkur" at the department of Computer Science of the Technical University of Munich (TUM).

<span class="mw-page-title-main">SQream DB</span>

SQream is a relational database management system (RDBMS) that uses graphics processing units (GPUs) from Nvidia. SQream is designed for big data analytics using the Structured Query Language (SQL).

<span class="mw-page-title-main">ClickHouse</span> Open-source database management system

ClickHouse is an open-source column-oriented DBMS for online analytical processing (OLAP) that allows users to generate analytical reports using SQL queries in real-time. ClickHouse Inc. is headquartered in the San Francisco Bay Area with the subsidiary, ClickHouse B.V., based in Amsterdam, Netherlands.

References

  1. Ghose, S. (November 2019). "Processing-in-memory: A workload-driven perspective" (PDF). IBM Journal of Research and Development. 63 (6): 3:1–19. doi:10.1147/JRD.2019.2934048. S2CID 202025511.
  2. Zhang, Hao; Gang Chen; Beng Chin Ooi; Kian-Lee Tan; Meihui Zhang (July 2015). "In-Memory Big Data Management and Processing: A Survey". IEEE Transactions on Knowledge and Data Engineering. 27 (7): 1920–1948. doi:10.1109/TKDE.2015.2427795.
  3. Plattner, Hasso; Zeier, Alexander (2012). In-Memory Data Management: Technology and Applications. Springer Science & Business Media. ISBN 9783642295744.
  4. "Processing-in-Memory Course: Lecture 1: Exploring the PIM Paradigm for Future Systems - Spring 2022". YouTube. 10 March 2022.
  5. Park, Kate (2023-07-27). "Samsung extends cut in memory chip production, will focus on high-end AI chips instead". TechCrunch. Retrieved 2023-12-05.
  6. Tan, Kian-Lee; Cai, Qingchao; Ooi, Beng Chin; Wong, Weng-Fai; Yao, Chang; Zhang, Hao (2015-08-12). "In-memory Databases: Challenges and Opportunities From Software and Hardware Perspectives". ACM SIGMOD Record. 44 (2): 35–40. doi:10.1145/2814710.2814717. ISSN 0163-5808. S2CID 14238437.
  7. Fatemieh, Seyed Erfan; Reshadinezhad, Mohammad Reza; Taherinejad, Nima (2022). "Approximate In-Memory Computing using Memristive IMPLY Logic and its Application to Image Processing". 2022 IEEE International Symposium on Circuits and Systems (ISCAS). pp. 3115–3119. doi:10.1109/ISCAS48785.2022.9937475. ISBN 978-1-6654-8485-5. S2CID 253462291. Retrieved 2023-12-05.
  8. "What is processing in memory (PIM) and how does it work?". Business Analytics. Retrieved 2023-12-05.
  9. Gill, John (2007). "Shifting the BI Paradigm with In-Memory Database Technologies". Business Intelligence Journal. 12 (2): 58–62. Archived from the original on 2015-09-24.
  10. Earls, A. (2011). Tips on evaluating, deploying and managing in-memory analytics tools (PDF). Tableau. Archived from the original (PDF) on 2012-04-25.
  11. "In-memory Analytics". Yellowfin. p. 6.
  12. Kote, Sparjan. "In-memory computing in Business Intelligence". Archived from the original on April 24, 2011.
  13. "Survey Analysis: Why BI and Analytics Adoption Remains Low and How to Expand Its Reach". Gartner. Retrieved 2023-12-05.
  14. Upchurch, E.; Sterling, T.; Brockman, J. (2004). "Analysis and Modeling of Advanced PIM Architecture Design Tradeoffs". Proceedings of the ACM/IEEE SC2004 Conference. Pittsburgh, PA, USA: IEEE. p. 12. doi:10.1109/SC.2004.11. ISBN 978-0-7695-2153-4. S2CID 9089044.