Collective Knowledge (software)

Collective Knowledge (CK)
Developer(s)	Grigori Fursin and the cTuning foundation
Initial release	2015;9 years ago
Stable release	2.6.3 (discontinued for the new Collective Mind framework ) / November 30, 2022
Written in	Python
Operating system	Linux, Mac OS X, Microsoft Windows, Android
Type	Knowledge management, FAIR data, MLOps, Data management, Artifact Evaluation, Package management system, Scientific workflow system, DevOps, Continuous integration, Reproducibility
License	Apache License for version 2.0 and BSD License 3-clause for version 1.0
Website	github.com/ctuning/ck , cknow.io

Last updated March 16, 2024

The Collective Knowledge (CK) project is an open-source framework and repository to enable collaborative, reproducible and sustainable research and development of complex computational systems.^[2] CK is a small, portable, customizable and decentralized infrastructure helping researchers and practitioners:

share their code, data and models as reusable Python components and automation actions^[3] with unified JSON API, JSON meta information, and a UID based on FAIR principles ^[2]
assemble portable workflows from shared components (such as multi-objective autotuning and Design space exploration ^[4])
automate, crowdsource and reproduce benchmarking of complex computational systems^[5]
unify predictive analytics (scikit-learn, R, DNN)
enable reproducible and interactive papers^[6]

Notable usages

ARM uses CK to accelerate computer engineering^[7]
Several ACM-sponsored conferences use CK to automate the Artifact Evaluation process^[8]^[9]
Imperial College (London) uses CK to automate and crowdsource compiler bug detection^[10]
Researchers from the University of Cambridge used CK to help the community reproduce results of their publication in the International Symposium on Code Generation and Optimization (CGO'17) during Artifact Evaluation^[11]
General Motors (USA) uses CK to crowd-benchmark convolutional neural network optimizations ^[12]^[13]
The Raspberry Pi Foundation and the cTuning foundation released a CK workflow with a reproducible "live" paper to enable collaborative research into multi-objective autotuning and machine learning techniques^[4]
IBM uses CK to reproduce quantum results from nature^[14]
CK is used to automate MLPerf benchmark^[15]^[16]

Portable package manager for portable workflows

CK has an integrated cross-platform package manager with Python scripts, JSON API and JSON meta-description to automatically rebuild software environment on a user machine required to run a given research workflow.^[17]

Reproducibility of experiments

CK enables reproducibility of experimental results via community involvement similar to Wikipedia and physics. Whenever a new workflow with all components is shared via GitHub, anyone can try it on a different machine, with different environment and using slightly different choices (compilers, libraries, data sets). Whenever an unexpected or wrong behavior is encountered, the community explains it, fixes components and shares them back as described in.^[4]

Related Research Articles

A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". Oracle defines it as a collection of tables with metadata. The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

In software testing, test automation is the use of software separate from the software being tested to control the execution of tests and the comparison of actual outcomes with predicted outcomes. Test automation can automate some repetitive but necessary tasks in a formalized testing process already in place, or perform additional testing that would be difficult to do manually. Test automation is critical for continuous delivery and continuous testing.

Adobe LiveCycle Enterprise Suite (ES4) is a service-oriented architecture Java EE server software product from Adobe Systems used to build applications that automate a broad range of business processes for enterprises and government agencies. LiveCycle ES4 is an enterprise document and form platform that helps you capture and process information, deliver personalized communications, and protect and track sensitive information. It is used for purposes such as account opening, services, and benefits enrollment, correspondence management, requests for proposal processes, and other manual-based workflows. LiveCycle ES4 incorporates new features with a particular focus on mobile devices. LiveCycle applications also function in both online and offline environments. These capabilities are enabled through the use of Adobe Reader, HTML/PhoneGap, and Flash Player clients to reach desktop computers and mobile devices.

Knowledge Discovery Metamodel (KDM) is a publicly available specification from the Object Management Group (OMG). KDM is a common intermediate representation for existing software systems and their operating environments, that defines common metadata required for deep semantic integration of Application Lifecycle Management tools. KDM was designed as the OMG's foundation for software modernization, IT portfolio management and software assurance. KDM uses OMG's Meta-Object Facility to define an XMI interchange format between tools that work with existing software as well as an abstract interface (API) for the next-generation assurance and modernization tools. KDM standardizes existing approaches to knowledge discovery in software engineering artifacts, also known as software mining.

Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows. Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement. Workflows in general, and scientific workflows in particular, are directed graphs where the nodes represent discrete computational components, and the edges represent paths along which data and results can flow between components. In Kepler, the nodes are called 'Actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows within the GUI and independently from a command-line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.

The Interactive Compilation Interface (ICI) is a plugin system with a high-level compiler-independent and low-level compiler-dependent API to transform production compilers into interactive research toolsets. It was developed by Grigori Fursin during the MILEPOST project. The ICI framework acts as a "middleware" interface between the compiler and the user-definable plugins. It opens up and reuses the production-quality compiler infrastructure to enable program analysis and instrumentation, fine-grain program optimizations, simple prototyping of new development and research ideas while avoiding building new compilation tools from scratch. For example, it is used in MILEPOST GCC to automate compiler and architecture design and program optimizations based on statistical analysis and machine learning, and predict profitable optimization to improve program execution time, code size and compilation time.

MILEPOST GCC is a free, community-driven, open-source, adaptive, self-tuning compiler that combines stable production-quality GCC, Interactive Compilation Interface and machine learning plugins to adapt to any given architecture and program automatically and predict profitable optimizations to improve program execution time, code size and compilation time. It is currently used and supported by academia and industry and is intended to open up research opportunities to automate compiler and architecture design and optimization.

The Collective Tuning Initiative is a community-driven initiative started by Grigori Fursin to develop free and open-source research tools with a unified API for collaborative characterization, optimization and co-design of computer systems. They enable sharing of benchmarks, data sets and optimization cases from the community in the Collective Optimization Database through unified web services to predict better optimizations or architecture designs. Using common research-and-development tools should help to improve the quality and reproducibility of computer systems' research and development and accelerate innovation in this area. This approach helped establish Reproducibility Initiatives and Artifact Evaluation at several ACM-sponsored conferences to encourage sharing of artifacts and validation of experimental results from accepted papers.

Spring Roo is an open-source software tool that uses convention-over-configuration principles to provide rapid application development of Java-based enterprise software. The resulting applications use common Java technologies such as Spring Framework, Java Persistence API, Thymeleaf, Apache Maven and AspectJ. Spring Roo is a member of the Spring portfolio of projects.

Imixs Workflow is an Open-Source-Project, providing technologies for building Business Process Management solutions. The project focus on human based workflows used to execute and control workflows in organisations and enterprises. In difference to task-oriented workflow engines, which focus on automated program flow control (tasks), Imixs Workflow is a representative of an event-based workflow engine. Here, the engine controls the status of a process instance within a defined state-diagram. By entering an event, the state of a process instance can be abandoned or changed. In human-centric workflow engines, events usually occur by an interaction of the actor with the system, for example by approving or rejecting a business transaction. They can also be triggered by scheduled events. An example of this is an escalation of an unfinished task.

An AI accelerator, deep learning processor, or neural processing unit (NPU) is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and machine vision. Typical applications include algorithms for robotics, Internet of Things, and other data-intensive or sensor-driven tasks. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. As of 2024, a typical AI integrated circuit chip contains tens of billions of MOSFET transistors.

The cTuning Foundation is a global non-profit organization developing a common methodology and open-source tools to support sustainable, collaborative and reproducible research in Computer science and organize and automate artifact evaluation and reproducibility inititiaves at machine learning and systems conferences and journals.

Grigori Fursin is a British computer scientist, president of the non-profit CTuning foundation, founding member of MLCommons, co-chair of the MLCommons Task Force on Automation and Reproducibility and founder of cKnowledge. His research group created open-source machine learning based self-optimizing compiler, MILEPOST GCC, considered to be the first in the world. At the end of the MILEPOST project he established cTuning foundation to crowdsource program optimisation and machine learning across diverse devices provided by volunteers. His foundation also developed Collective Knowledge Framework to support open research. Since 2015 Fursin leads Artifact Evaluation at several ACM and IEEE computer systems conferences. He is also a founding member of the ACM taskforce on Data, Software, and Reproducibility in Publication.

The BioCompute Object (BCO) project is a community-driven initiative to build a framework for standardizing and sharing computations and analyses generated from High-throughput sequencing. The project has since been standardized as IEEE 2791-2020, and the project files are maintained in an open source repository. The July 22nd, 2020 edition of the Federal Register announced that the FDA now supports the use of BioCompute in regulatory submissions, and the inclusion of the standard in the Data Standards Catalog for the submission of HTS data in NDAs, ANDAs, BLAs, and INDs to CBER, CDER, and CFSAN.

Originally started as a collaborative contract between the George Washington University and the Food and Drug Administration, the project has grown to include over 20 universities, biotechnology companies, public-private partnerships and pharmaceutical companies including Seven Bridges and Harvard Medical School. The BCO aims to ease the exchange of HTS workflows between various organizations, such as the FDA, pharmaceutical companies, contract research organizations, bioinformatic platform providers, and academic researchers. Due to the sensitive nature of regulatory filings, few direct references to material can be published. However, the project is currently funded to train FDA Reviewers and administrators to read and interpret BCOs, and currently has 4 publications either submitted or nearly submitted.

<span class="mw-page-title-main">ModelOps</span>

ModelOps, as defined by Gartner, "is focused primarily on the governance and lifecycle management of a wide range of operationalized artificial intelligence (AI) and decision models, including machine learning, knowledge graphs, rules, optimization, linguistic and agent-based models". "ModelOps lies at the heart of any enterprise AI strategy". It orchestrates the model lifecycles of all models in production across the entire enterprise, from putting a model into production, then evaluating and updating the resulting application according to a set of governance rules, including both technical and business KPI's. It grants business domain experts the capability to evaluate AI models in production, independent of data scientists.

The Open Knowledgebase of Interatomic Models (OpenKIM). is a cyberinfrastructure funded by the United States National Science Foundation (NSF) focused on improving the reliability and reproducibility of molecular and multi-scale simulations in computational materials science. It includes a repository of interatomic potentials that are exhaustively tested with user-developed integrity tests, tools to help select among existing potentials and develop new ones, extensive metadata on potentials and their developers, and standard integration methods for using interatomic potentials in major simulation codes. OpenKIM is a member of DataCite and provides unique DOIs (Digital object identifier) for all archived content on the site (fitted models, validation tests, etc.) in order to properly document and provide recognition to content contributors. OpenKIM is also an eXtreme Science and Engineering Discovery Environment (XSEDE) Science Gateway, and all content on openkim.org is available under open source licenses in support of the open science initiative.

DVC is a free and open-source, platform-agnostic version system for data, machine learning models, and experiments. It is designed to make ML models shareable, experiments reproducible, and to track versions of models, data, and pipelines. DVC works on top of Git repositories and cloud storage.

Medical open network for AI (MONAI) is an open-source, community-supported framework for Deep learning (DL) in healthcare imaging. MONAI provides a collection of domain-optimized implementations of various DL algorithms and utilities specifically designed for medical imaging tasks. MONAI is used in research and industry, aiding the development of various medical imaging applications, including image segmentation, image classification, image registration, and image generation.

Collective Mind (CM) is collection of portable, extensible and ready-to-use automation recipes with a human-friendly interface to make it easier to compose, benchmark and optimize complex AI, ML and other applications and systems across diverse and continuously changing models, data sets, software and hardware.

References

↑ CK package at PYPI
1 2 Fursin, Grigori (29 March 2021). Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs. Philosophical Transactions of the Royal Society. arXiv: 2011.01149 . doi:10.1098/rsta.2020.0211.
↑ Reusable CK components and actions to automate common research tasks
1 2 3 Live paper with reproducible experiments to enable collaborative research into multi-objective autotuning and machine learning techniques
↑ Online repository with reproduced results
↑ Index of reproduced papers
↑ Ed Plowman; Grigori Fursin, ARM TechCon'16 presentation "Know Your Workloads: Design more efficient systems!"
↑ Artifact Evaluation for systems and machine learning conferences
↑ ACM TechTalk about reproducing 150 research papers and testing them in the real world
↑ EU TETRACOM project to combine CK and CLSmith (PDF), archived from the original (PDF) on 2017-03-05, retrieved 2016-09-15
↑ Artifact Evaluation Reproduction for "Software Prefetching for Indirect Memory Accesses", CGO 2017, using CK, 16 October 2022
↑ GitHub development website for CK-powered Caffe, 11 October 2022
↑ Open-source Android application to let the community participate in collaborative benchmarking and optimization of various DNN libraries and models
↑ Reproducing quantum results from nature – how hard could it be?
↑ MLPerf crowd-benchmarking
↑ MLPerf inference benchmark automation guide, 17 October 2022
↑ List of shared CK packages

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] CK package at PYPI

[pt2020-2] 1 2 Fursin, Grigori (29 March 2021). Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs. Philosophical Transactions of the Royal Society. arXiv: 2011.01149 . doi:10.1098/rsta.2020.0211.

[3] Reusable CK components and actions to automate common research tasks

[rpi2018-4] 1 2 3 Live paper with reproducible experiments to enable collaborative research into multi-objective autotuning and machine learning techniques

[5] Online repository with reproduced results

[6] Index of reproduced papers

[7] Ed Plowman; Grigori Fursin, ARM TechCon'16 presentation "Know Your Workloads: Design more efficient systems!"

[8] Artifact Evaluation for systems and machine learning conferences

[9] ACM TechTalk about reproducing 150 research papers and testing them in the real world

[10] EU TETRACOM project to combine CK and CLSmith (PDF), archived from the original (PDF) on 2017-03-05, retrieved 2016-09-15

[11] Artifact Evaluation Reproduction for "Software Prefetching for Indirect Memory Accesses", CGO 2017, using CK, 16 October 2022

[12] GitHub development website for CK-powered Caffe, 11 October 2022

[13] Open-source Android application to let the community participate in collaborative benchmarking and optimization of various DNN libraries and models

[14] Reproducing quantum results from nature – how hard could it be?

[15] MLPerf crowd-benchmarking

[16] MLPerf inference benchmark automation guide, 17 October 2022

[17] List of shared CK packages

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]