Cluster Exploratory

Cluster Exploratory (CluE) was a U.S. National Science Foundation-funded program proposed in 2008 to use Google-IBM cluster technology to analyze massive amounts of data in search of patterns, as part of the Academic Cluster Computing Initiative (ACCI). "The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce".[1][2] Google and IBM announced the first pilot phase of the ACCI in October 2007.[3] The program ended in 2011, according to Google.[4] NSF's call for proposals has been "archived".[2]

National Science Foundation

The National Science Foundation (NSF) is a United States government agency that supports fundamental research and education in all the non-medical fields of science and engineering. Its medical counterpart is the National Institutes of Health. With an annual budget of about US$7.0 billion, the NSF funds approximately 24% of all federally supported basic research conducted by the United States' colleges and universities. In some fields, such as mathematics, computer science, economics, and the social sciences, the NSF is the major source of federal backing.

Google

Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, search engine, cloud computing, software, and hardware. It is considered one of the Big Four technology companies, alongside Amazon, Apple and Facebook.

IBM

International Business Machines Corporation (IBM) is an American multinational information technology company headquartered in Armonk, New York, with operations in over 170 countries. The company began in 1911, founded in Endicott, New York, as the Computing-Tabulating-Recording Company (CTR) and was renamed "International Business Machines" in 1924.

Related Research Articles

Mainframe computer

Mainframe computers or mainframes are computers used primarily by large organizations for critical applications such as bulk data processing (for example, census, industry and consumer statistics, and enterprise resource planning) and transaction processing. They are larger and have more processing power than some other classes of computers: minicomputers, servers, workstations, and personal computers.

National Center for Supercomputing Applications

The National Center for Supercomputing Applications (NCSA) is a state-federal partnership to develop and deploy national-scale cyberinfrastructure that advances research, science and engineering based in the United States of America. NCSA operates as a unit of the University of Illinois at Urbana–Champaign, and provides high-performance computing resources to researchers across the country. Support for NCSA comes from the National Science Foundation, the state of Illinois, the University of Illinois, business and industry partners, and other federal agencies.

Cornell University Center for Advanced Computing

The Cornell University Center for Advanced Computing (CAC), housed at Frank H. T. Rhodes Hall on the campus of Cornell University, is one of five original centers in the National Science Foundation's Supercomputer Centers Program. It was formerly called the Cornell Theory Center.

Commodity computing involves the use of large numbers of already-available computing components for parallel computing, to obtain the greatest amount of useful computation at low cost. It is computing done on commodity computers rather than on high-cost superminicomputers or boutique machines. Commodity computers are computer systems, manufactured by multiple vendors, that incorporate components based on open standards. Such systems are said to be based on commodity components, since the standardization process promotes lower costs and less differentiation among vendors' products. Standardization and decreased differentiation lower the switching or exit cost from any given vendor, increasing purchasers' leverage and preventing lock-in. A governing principle of commodity computing is that it is preferable to have many low-performance, low-cost machines working in parallel rather than a few high-performance, high-cost ones. At some point the number of discrete systems in a cluster grows so large relative to the mean time between failures (MTBF) of any single hardware platform that component failures become routine, no matter how reliable the hardware, so fault tolerance must be built into the controlling software. Purchases should be optimized for cost per unit of performance, not just for absolute performance per CPU at any cost.
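
The failure arithmetic behind that principle can be made concrete. The following is a minimal back-of-the-envelope sketch, assuming a hypothetical 1,600-node cluster (chosen to echo the 1,600 processors quoted above) and an assumed per-node MTBF of 50,000 hours under an exponential failure model; neither figure comes from the article.

```python
# Back-of-the-envelope sketch: why large commodity clusters need
# software-level fault tolerance. The node count and per-node MTBF
# below are illustrative assumptions, not figures from the article.

def expected_failures_per_day(num_nodes: int, node_mtbf_hours: float) -> float:
    """Expected hardware failures per day, assuming independent nodes
    whose failures occur at a constant rate of 1 / MTBF per hour."""
    failures_per_hour = num_nodes / node_mtbf_hours
    return failures_per_hour * 24


def cluster_mtbf_hours(num_nodes: int, node_mtbf_hours: float) -> float:
    """MTBF of the whole cluster: any single node failing counts."""
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    nodes = 1600                 # hypothetical CluE-sized cluster
    node_mtbf = 50_000.0         # assumed per-node MTBF (~5.7 years)

    print(f"Cluster MTBF: {cluster_mtbf_hours(nodes, node_mtbf):.1f} hours")
    print(f"Expected failures/day: {expected_failures_per_day(nodes, node_mtbf):.2f}")
    # With these assumptions the cluster loses a node roughly every
    # 31 hours, so the controlling software must tolerate node failures.
```

Under these assumptions the cluster as a whole fails far more often than any individual machine, which is why the software, not the hardware, has to absorb failures.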

David Bader (computer scientist)

David A. Bader is a Professor, Chair of the School of Computational Science and Engineering, and Executive Director of High-Performance Computing in the Georgia Tech College of Computing. In addition, Bader was selected as the director of the first Sony Toshiba IBM Center of Competence for the Cell Processor at the Georgia Institute of Technology. He is an IEEE Fellow, an AAAS Fellow, a National Science Foundation CAREER Award recipient, and an IEEE Computer Society Distinguished Speaker. Bader is a leading expert in data sciences. His main areas of research are at the intersection of high-performance computing and real-world applications, including cybersecurity, massive-scale analytics, and computational genomics.

TeraGrid

TeraGrid was an e-Science grid computing infrastructure combining resources at eleven partner sites. The project started in 2001 and operated from 2004 through 2011.

Meinolf Sellmann, born in Holzminden, Germany, is a computer scientist best known for algorithmic research, with a special focus on self-improving algorithms, automatic algorithm configuration, and algorithm portfolios based on artificial intelligence, combinatorial optimization, and hybridizations thereof.

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware—still the common use—it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
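
The MapReduce model that Hadoop implements is usually introduced with the word-count job: the map phase emits a (word, 1) pair for every word, the framework sorts and groups the pairs by key, and the reduce phase sums the counts. The sketch below is an illustrative Python version in the style accepted by Hadoop Streaming (which lets any executable act as mapper or reducer); the script name and the local test pipeline are assumptions, not material from the CluE program.

```python
#!/usr/bin/env python3
# Minimal word-count in the MapReduce style used by Hadoop Streaming:
# the mapper emits "word<TAB>1" lines, the framework sorts them by key,
# and the reducer sums the counts for each word. Illustrative sketch only.
import sys
from itertools import groupby


def mapper(lines):
    """Map phase: one (word, 1) pair per word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"


def reducer(lines):
    """Reduce phase: input is sorted by word, so counts can be grouped."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if phase == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Locally, the shuffle step can be simulated with a sort, for example: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce. On a cluster, the same mapper and reducer would be submitted through the Hadoop Streaming jar, with HDFS providing the distributed storage described above.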

The Pittsburgh Supercomputing Center (PSC) is a high performance computing and networking center founded in 1986. PSC is a joint effort of Carnegie Mellon University and the University of Pittsburgh together with Westinghouse Electric Company in Pittsburgh, Pennsylvania, United States. The center's Scientific Directors are Dr. Ralph Roskies of the University of Pittsburgh and Dr. Michael Levine of Carnegie Mellon University.

The Texas Advanced Computing Center (TACC) at the University of Texas at Austin, United States, is an advanced computing research center that provides comprehensive advanced computing resources and support services to researchers in Texas and across the USA. The mission of TACC is to enable discoveries that advance science and society through the application of advanced computing technologies. Specializing in high performance computing, scientific visualization, data analysis & storage systems, software, research & development and portal interfaces, TACC deploys and operates advanced computational infrastructure to enable computational research activities of faculty, staff, and students of UT Austin. TACC also provides consulting, technical documentation, and training to support researchers who use these resources. TACC staff members conduct research and development in applications and algorithms, computing systems design/architecture, and programming tools and environments.

Blue Waters

Blue Waters is a petascale supercomputer at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. On August 8, 2007, the National Science Board approved a resolution which authorized the National Science Foundation to fund "the acquisition and deployment of the world's most powerful leadership-class supercomputer." The NSF awarded $208 million for the Blue Waters project.

High Performance Storage System (HPSS) is a flexible, scalable, policy-based Hierarchical Storage Management product developed by the HPSS Collaboration. It provides scalable hierarchical storage management (HSM), archive, and file system services using cluster, LAN and SAN technologies to aggregate the capacity and performance of many computers, disks, disk systems, tape drives and tape libraries.

Cloud computing

Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. The term is generally used to describe data centers available to many users over the Internet. Large clouds, predominant today, often have functions distributed over multiple locations from central servers; a server located relatively close to the user may be designated an edge server.

PERCS is IBM's answer to DARPA's High Productivity Computing Systems (HPCS) initiative. The program resulted in commercial development and deployment of the Power 775, a supercomputer design with extremely high performance ratios in fabric and memory bandwidth, as well as very high performance density and power efficiency.

The National Institute for Computational Sciences (NICS) is funded by the National Science Foundation and managed by the University of Tennessee. NICS was home to Kraken, at the time the most powerful computer in the world managed by academia. The NICS petascale scientific computing environment is housed at Oak Ridge National Laboratory (ORNL), home to the world's most powerful computing complex. The mission of NICS, a member of XSEDE, is to enable the scientific discoveries of researchers nationwide by providing leading-edge computational resources, together with support for their effective use, and leveraging extensive partnership opportunities.

The IBM/Google Cloud Computing University Initiative was a 2009 project that used the resources developed in the 2007 IBM/Google cloud computing partnership. The initiative was intended to provide universities around the world with access to cloud computing.

Big data

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. Other concepts later attributed to big data are veracity and value.

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications that require large volumes of data and devote most of their processing time to I/O and the manipulation of data are deemed data-intensive.
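
As an illustration of the data-parallel approach, the sketch below partitions a data set across worker processes and lets each worker do I/O-heavy work on its own partition independently. The partition file names and the per-partition work are assumptions made for the example, not part of any system described here.

```python
# Illustrative data-parallel sketch: the data set is split into
# partitions, each processed independently by a worker, so capacity
# scales with data volume. File names below are hypothetical.
from multiprocessing import Pool


def bytes_and_lines(path: str) -> tuple[int, int]:
    """Per-partition work: count bytes and lines in one partition file.
    In a real data-intensive job this is where the I/O-heavy processing
    of each partition would happen."""
    total_bytes = 0
    total_lines = 0
    with open(path, "rb") as f:
        for line in f:
            total_bytes += len(line)
            total_lines += 1
    return total_bytes, total_lines


if __name__ == "__main__":
    # Hypothetical partitions of a large data set, one file per partition;
    # these files are assumed to exist for the sake of the example.
    partitions = [f"part-{i:05d}.txt" for i in range(4)]

    with Pool(processes=4) as pool:
        results = pool.map(bytes_and_lines, partitions)

    print("total bytes:", sum(b for b, _ in results))
    print("total lines:", sum(n for _, n in results))
```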

NCAR-Wyoming Supercomputing Center

The NCAR-Wyoming Supercomputing Center (NWSC) is a high-performance computing (HPC) and data archival facility located in Cheyenne, Wyoming, that provides advanced computing services to researchers in the Earth system sciences.

Many universities, vendors, institutes and government organizations are investing in research around the topic of cloud computing.

References

  1. "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" by Chris Anderson
  2. 1 2 Program Solicitation NSF 08-560
  3. "Supporting cluster computing in the research community". Google Blog. Google. 25 February 2008. Retrieved 15 October 2017.
  4. Derrick Harris. No more access to Google’s Hadoop cloud for researchers. Gigaom. 22 Dec 2011