Collective Knowledge (software)

Collective Knowledge (CK)
Developer(s)	Grigori Fursin and the cTuning foundation
Initial release	2015;9 years ago
Stable release	2.6.3 (discontinued for the new Collective Mind framework ) / November 30, 2022
Written in	Python
Operating system	Linux, Mac OS X, Microsoft Windows, Android
Type	Knowledge management, FAIR data, MLOps, Data management, Artifact Evaluation, Package management system, Scientific workflow system, DevOps, Continuous integration, Reproducibility
License	Apache License for version 2.0 and BSD License 3-clause for version 1.0
Website	github.com/ctuning/ck , cknow.io

Last updated December 02, 2024

The Collective Knowledge (CK) project is an open-source framework and repository to enable collaborative, reproducible and sustainable research and development of complex computational systems.^[2] CK is a small, portable, customizable and decentralized infrastructure helping researchers and practitioners:

share their code, data and models as reusable Python components and automation actions^[3] with unified JSON API, JSON meta information, and a UID based on FAIR principles ^[2]
assemble portable workflows from shared components (such as multi-objective autotuning and Design space exploration ^[4])
automate, crowdsource and reproduce benchmarking of complex computational systems^[5]
unify predictive analytics (scikit-learn, R, DNN)
enable reproducible and interactive papers^[6]

Notable usages

ARM uses CK to accelerate computer engineering^[7]
Several ACM-sponsored conferences use CK to automate the Artifact Evaluation process^[8]^[9]
Imperial College (London) uses CK to automate and crowdsource compiler bug detection^[10]
Researchers from the University of Cambridge used CK to help the community reproduce results of their publication in the International Symposium on Code Generation and Optimization (CGO'17) during Artifact Evaluation^[11]
General Motors (USA) uses CK to crowd-benchmark convolutional neural network optimizations ^[12]^[13]
The Raspberry Pi Foundation and the cTuning foundation released a CK workflow with a reproducible "live" paper to enable collaborative research into multi-objective autotuning and machine learning techniques^[4]
IBM uses CK to reproduce quantum results from nature^[14]
CK is used to automate MLPerf benchmark^[15]^[16]

Portable package manager for portable workflows

CK has an integrated cross-platform package manager with Python scripts, JSON API and JSON meta-description to automatically rebuild software environment on a user machine required to run a given research workflow.^[17]

Reproducibility of experiments

CK enables reproducibility of experimental results via community involvement similar to Wikipedia and physics. Whenever a new workflow with all components is shared via GitHub, anyone can try it on a different machine, with different environment and using slightly different choices (compilers, libraries, data sets). Whenever an unexpected or wrong behavior is encountered, the community explains it, fixes components and shares them back as described in.^[4]

Related Research Articles

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval, and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one methods for importing content to manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

Knowledge Discovery Metamodel (KDM) is a publicly available specification from the Object Management Group (OMG). KDM is a common intermediate representation for existing software systems and their operating environments, that defines common metadata required for deep semantic integration of Application Lifecycle Management tools. KDM was designed as the OMG's foundation for software modernization, IT portfolio management and software assurance. KDM uses OMG's Meta-Object Facility to define an XMI interchange format between tools that work with existing software as well as an abstract interface (API) for the next-generation assurance and modernization tools. KDM standardizes existing approaches to knowledge discovery in software engineering artifacts, also known as software mining.

Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows. Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement. Workflows in general, and scientific workflows in particular, are directed graphs where the nodes represent discrete computational components, and the edges represent paths along which data and results can flow between components. In Kepler, the nodes are called 'Actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows within the GUI and independently from a command-line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.

The Interactive Compilation Interface (ICI) is a plugin system with a high-level compiler-independent and low-level compiler-dependent API to transform production compilers into interactive research toolsets. It was developed by Grigori Fursin during the MILEPOST project. The ICI framework acts as a "middleware" interface between the compiler and the user-definable plugins. It opens up and reuses the production-quality compiler infrastructure to enable program analysis and instrumentation, fine-grain program optimizations, simple prototyping of new development and research ideas while avoiding building new compilation tools from scratch. For example, it is used in MILEPOST GCC to automate compiler and architecture design and program optimizations based on statistical analysis and machine learning, and predict profitable optimization to improve program execution time, code size and compilation time.

MILEPOST GCC is a free, community-driven, open-source, adaptive, self-tuning compiler that combines stable production-quality GCC, Interactive Compilation Interface and machine learning plugins to adapt to any given architecture and program automatically and predict profitable optimizations to improve program execution time, code size and compilation time. It is currently used and supported by academia and industry and is intended to open up research opportunities to automate compiler and architecture design and optimization.

The Collective Tuning Initiative is a community-driven initiative started by Grigori Fursin to develop free and open-source research tools with a unified API for collaborative characterization, optimization and co-design of computer systems. They enable sharing of benchmarks, data sets and optimization cases from the community in the Collective Optimization Database through unified web services to predict better optimizations or architecture designs. Using common research-and-development tools should help to improve the quality and reproducibility of computer systems' research and development and accelerate innovation in this area. This approach helped establish Reproducibility Initiatives and Artifact Evaluation at several ACM-sponsored conferences to encourage sharing of artifacts and validation of experimental results from accepted papers.

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks of Analytics" concept. A graphical user interface and use of JDBC allows assembly of nodes blending different data sources, including preprocessing, for modeling, data analysis and visualization without, or with minimal, programming.

Google Closure Tools is a set of tools built with the goal of helping developers optimize rich web applications with JavaScript. It was developed by Google for use in their web applications such as Gmail, Google Docs and Google Maps. As of 2023, the project had over 230K LOCs not counting the embedded Mozilla Rhino compiler.

<span class="mw-page-title-main">Tivoli Service Automation Manager</span>

Tivoli Service Automation Manager is the Cloud management package from IBM in the Tivoli Software brand. Unofficial abbreviations are TSAM and TivSAM.

ModelCenter, developed by Phoenix Integration, is a software package that aids in the design and optimization of systems. It enables users to conduct trade studies, as well as optimize designs. It interacts with other common modeling tools, including Systems Tool Kit, PTC Integrity Modeler, IBM Rhapsody, No Magic, Matlab, Nastran, Microsoft Excel, and Wolfram SystemModeler. ModelCenter also has tools to enable collaboration among design team members.

Imixs Workflow is an open-source project, providing technologies for building Business Process Management solutions. The project focus on human based workflows used to execute and control workflows in organisations and enterprises. In difference to task-oriented workflow engines, which focus on automated program flow control (tasks), Imixs Workflow is a representative of an event-based workflow engine. Here, the engine controls the status of a process instance within a defined state-diagram. By entering an event, the state of a process instance can be abandoned or changed. In human-centric workflow engines, events usually occur by an interaction of the actor with the system, for example by approving or rejecting a business transaction. They can also be triggered by scheduled events. An example of this is an escalation of an unfinished task.

The cTuning Foundation is a global non-profit organization developing a common methodology and open-source tools to support sustainable, collaborative and reproducible research in Computer science and organize and automate artifact evaluation and reproducibility inititiaves at machine learning and systems conferences and journals.

Grigori Fursin is a British computer scientist, president of the non-profit CTuning foundation, founding member of MLCommons, and co-chair of the MLCommons Task Force on Automation and Reproducibility. His research group created open-source machine learning based self-optimizing compiler, MILEPOST GCC, considered to be the first in the world. At the end of the MILEPOST project he established cTuning foundation to crowdsource program optimisation and machine learning across diverse devices provided by volunteers. His foundation also developed Collective Knowledge Framework and Collective Mind to support open research. Since 2015 Fursin leads Artifact Evaluation at several ACM and IEEE computer systems conferences. He is also a founding member of the ACM taskforce on Data, Software, and Reproducibility in Publication.

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANN), a widely used model in the field of machine learning. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy and performance estimation strategy used:

<span class="mw-page-title-main">ModelOps</span>

ModelOps, as defined by Gartner, "is focused primarily on the governance and lifecycle management of a wide range of operationalized artificial intelligence (AI) and decision models, including machine learning, knowledge graphs, rules, optimization, linguistic and agent-based models" in Multi-Agent Systems. "ModelOps lies at the heart of any enterprise AI strategy". It orchestrates the model lifecycles of all models in production across the entire enterprise, from putting a model into production, then evaluating and updating the resulting application according to a set of governance rules, including both technical and business key performance indicators (KPI's). It grants business domain experts the capability to evaluate AI models in production, independent of data scientists.

The Open Knowledgebase of Interatomic Models (OpenKIM). is a cyberinfrastructure funded by the United States National Science Foundation (NSF) focused on improving the reliability and reproducibility of molecular and multi-scale simulations in computational materials science. It includes a repository of interatomic potentials that are exhaustively tested with user-developed integrity tests, tools to help select among existing potentials and develop new ones, extensive metadata on potentials and their developers, and standard integration methods for using interatomic potentials in major simulation codes. OpenKIM is a member of DataCite and provides unique DOIs (Digital object identifier) for all archived content on the site (fitted models, validation tests, etc.) in order to properly document and provide recognition to content contributors. OpenKIM is also an eXtreme Science and Engineering Discovery Environment (XSEDE) Science Gateway, and all content on openkim.org is available under open source licenses in support of the open science initiative.

DVC is a free and open-source, platform-agnostic version system for data, machine learning models, and experiments. It is designed to make ML models shareable, experiments reproducible, and to track versions of models, data, and pipelines. DVC works on top of Git repositories and cloud storage.

References

↑ CK package at PYPI
1 2 Fursin, Grigori (29 March 2021). Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs. Philosophical Transactions of the Royal Society. arXiv: 2011.01149 . doi:10.1098/rsta.2020.0211.
↑ Reusable CK components and actions to automate common research tasks
1 2 3 Live paper with reproducible experiments to enable collaborative research into multi-objective autotuning and machine learning techniques
↑ Online repository with reproduced results
↑ Index of reproduced papers
↑ Ed Plowman; Grigori Fursin, ARM TechCon'16 presentation "Know Your Workloads: Design more efficient systems!"
↑ Artifact Evaluation for systems and machine learning conferences
↑ ACM TechTalk about reproducing 150 research papers and testing them in the real world
↑ EU TETRACOM project to combine CK and CLSmith (PDF), archived from the original (PDF) on 2017-03-05, retrieved 2016-09-15
↑ Artifact Evaluation Reproduction for "Software Prefetching for Indirect Memory Accesses", CGO 2017, using CK, 16 October 2022
↑ GitHub development website for CK-powered Caffe, 11 October 2022
↑ Open-source Android application to let the community participate in collaborative benchmarking and optimization of various DNN libraries and models
↑ Reproducing quantum results from nature – how hard could it be?
↑ MLPerf crowd-benchmarking
↑ MLPerf inference benchmark automation guide, 17 October 2022
↑ List of shared CK packages

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] CK package at PYPI

[pt2020-2] 1 2 Fursin, Grigori (29 March 2021). Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs. Philosophical Transactions of the Royal Society. arXiv: 2011.01149 . doi:10.1098/rsta.2020.0211.

[3] Reusable CK components and actions to automate common research tasks

[rpi2018-4] 1 2 3 Live paper with reproducible experiments to enable collaborative research into multi-objective autotuning and machine learning techniques

[5] Online repository with reproduced results

[6] Index of reproduced papers

[7] Ed Plowman; Grigori Fursin, ARM TechCon'16 presentation "Know Your Workloads: Design more efficient systems!"

[8] Artifact Evaluation for systems and machine learning conferences

[9] ACM TechTalk about reproducing 150 research papers and testing them in the real world

[10] EU TETRACOM project to combine CK and CLSmith (PDF), archived from the original (PDF) on 2017-03-05, retrieved 2016-09-15

[11] Artifact Evaluation Reproduction for "Software Prefetching for Indirect Memory Accesses", CGO 2017, using CK, 16 October 2022

[12] GitHub development website for CK-powered Caffe, 11 October 2022

[13] Open-source Android application to let the community participate in collaborative benchmarking and optimization of various DNN libraries and models

[14] Reproducing quantum results from nature – how hard could it be?

[15] MLPerf crowd-benchmarking

[16] MLPerf inference benchmark automation guide, 17 October 2022

[17] List of shared CK packages

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]