Llama.cpp

llama.cpp
Original author(s): Georgi Gerganov
Developer(s): Georgi Gerganov and community
Initial release: March 10, 2023 [1]
Repository: github.com/ggerganov/llama.cpp
Written in: C++
License: MIT License [2]

llama.cpp is an open-source software library, written in C++, that performs inference on various large language models such as Llama. [3] It is co-developed alongside ggml, a general-purpose tensor library. [4]

History

Georgi Gerganov began developing llama.cpp to implement Llama inference in pure C++ with no dependencies. The advantage of this approach was that it could run on more hardware than inference libraries that depend on hardware-specific closed-source libraries such as CUDA. [3] As of June 2024, the project has 59 thousand stars on GitHub. [5] Before llama.cpp, Gerganov worked on a similar library called whisper.cpp, [6] which implemented Whisper, a speech-to-text model by OpenAI. llama.cpp gained traction among users who did not have specialized hardware, as it could run on a CPU alone, including on Android devices. [7]

Architecture

llama.cpp initially ran only on CPUs but can now run on GPUs as well, using multiple different back-ends, including Vulkan and SYCL. These back-ends make up the ggml tensor library, which is used by the front-end, model-specific llama.cpp code. [8] llama.cpp has its own model format, called GGUF (previously referred to as the GGML format). [9] llama.cpp supports ahead-of-time model quantization, as opposed to on-the-fly quantization. [10]
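A minimal sketch of what ahead-of-time quantization looks like through llama.cpp's C API follows. The function and enum names mirror the project's llama.h header as of mid-2024, but the API evolves quickly, so this should be read as illustrative rather than definitive; the file names are placeholders.

    // Quantize a full-precision GGUF file ahead of time, so that later
    // inference runs load the smaller quantized file directly.
    #include "llama.h"
    #include <cstdio>

    int main() {
        llama_model_quantize_params params = llama_model_quantize_default_params();
        params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M; // target 4-bit "K-quant" type

        if (llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params) != 0) {
            std::fprintf(stderr, "quantization failed\n");
            return 1;
        }
        return 0;
    }

The same step is also exposed through a bundled command-line quantize tool.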

Supported models

GGUF file format

GGUF
Filename extension: .gguf
Magic number: 0x47 0x47 0x55 0x46 ("GGUF")
Developed by: Georgi Gerganov and community
Initial release: August 22, 2023 [11]
Type of format: Machine learning

The GGUF file format is a binary format that stores both tensors and metadata. [12] GGUF files are typically created by converting models developed in the format of another machine-learning library, such as PyTorch. The format is designed to make model files quick and easy to load in llama.cpp and other ggml projects. [13]
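The header layout documented in ggml/docs/gguf.md begins with a four-byte magic, followed by a version number, a tensor count, and a metadata key-value count, all little-endian. A minimal sketch of reading that header (error handling on the reads omitted for brevity):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main(int argc, char ** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
        std::FILE * f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }

        char     magic[4];   // must be 'G','G','U','F' (0x47 0x47 0x55 0x46)
        uint32_t version;    // format version (3 as of 2024)
        uint64_t n_tensors;  // number of tensor records in the file
        uint64_t n_kv;       // number of metadata key-value pairs
        std::fread(magic, 1, 4, f);
        std::fread(&version, 4, 1, f);
        std::fread(&n_tensors, 8, 1, f);
        std::fread(&n_kv, 8, 1, f);
        std::fclose(f);

        if (std::memcmp(magic, "GGUF", 4) != 0) {
            std::fprintf(stderr, "not a GGUF file\n");
            return 1;
        }
        std::printf("version=%u tensors=%llu metadata keys=%llu\n",
                    version, (unsigned long long) n_tensors, (unsigned long long) n_kv);
        return 0;
    }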

GGUF was created to replace the project's previous file formats, which did not include architecture metadata and therefore made it difficult to extend the software without breaking backwards compatibility. [13]

The format focuses on supporting different quantization types, which can reduce memory usage and increase speed, at the expense of lower model precision. [14] For example, a model with 7 billion parameters stored as 16-bit floats occupies about 14 GB, while a 4-bit quantization of the same weights needs roughly 4 GB.

Supported data types

GGUF supports the common floating-point formats float32 and float16, as well as bfloat16.

GGUF also supports various block-quantized integer types, ranging from 2-bit to 8-bit formats such as Q4_0, Q5_K, and Q8_0.
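To illustrate how a block-quantized type is laid out, the following is a simplified reconstruction of ggml's Q4_0 format, in which 32 weights share a single scale and each weight is stored as a 4-bit value with an implicit offset of 8. The real ggml struct stores the scale in half precision; a plain float is used here so the sketch stays self-contained.

    #include <cstdint>
    #include <cstdio>

    constexpr int QK4_0 = 32; // weights per block

    struct block_q4_0 {
        float   d;             // per-block scale (half precision in ggml itself)
        uint8_t qs[QK4_0 / 2]; // 32 x 4-bit quantized values, two per byte
    };

    // Dequantize one block into 32 floats: w = d * (q - 8).
    void dequantize_block(const block_q4_0 & b, float out[QK4_0]) {
        for (int i = 0; i < QK4_0 / 2; ++i) {
            out[i]             = b.d * ((b.qs[i] & 0x0F) - 8); // low nibble
            out[i + QK4_0 / 2] = b.d * ((b.qs[i] >> 4) - 8);   // high nibble
        }
    }

    int main() {
        block_q4_0 b = { 0.1f, {} };
        for (auto & q : b.qs) q = 0x9F; // low nibble 15 -> +0.7, high nibble 9 -> +0.1
        float w[QK4_0];
        dequantize_block(b, w);
        std::printf("w[0]=%.2f w[16]=%.2f\n", w[0], w[16]);
        return 0;
    }

More elaborate formats such as Q4_K refine this idea with super-blocks and multiple scales, trading a little extra storage for better precision.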

Related Research Articles

<span class="mw-page-title-main">UEFI</span> Operating system and firmware specification

Unified Extensible Firmware Interface is a specification that defines the architecture of the platform firmware used for booting the computer hardware and its interface for interaction with the operating system. Examples of firmware that implement the specification are AMI Aptio, Phoenix SecureCore, TianoCore EDK II, InsydeH2O. UEFI replaces the BIOS which was present in the boot ROM of all personal computers that are IBM PC compatible, although it can provide backwards compatibility with the BIOS using CSM booting. Intel developed the original Extensible Firmware Interface (EFI) specification. Some of the EFI's practices and data formats mirror those of Microsoft Windows. In 2005, UEFI deprecated EFI 1.10.

<span class="mw-page-title-main">GUID Partition Table</span> Computer disk partitioning standard

The GUID Partition Table (GPT) is a standard for the layout of partition tables of a physical computer storage device, such as a hard disk drive or solid-state drive, using universally unique identifiers (UUIDs), which are also known as globally unique identifiers (GUIDs). Forming a part of the Unified Extensible Firmware Interface (UEFI) standard, it is nevertheless also used for some BIOSs, because of the limitations of master boot record (MBR) partition tables, which use 32 bits for logical block addressing (LBA) of traditional 512-byte disk sectors.

<span class="mw-page-title-main">VideoCore</span> Low-power mobile multimedia processor

VideoCore is a series of low-power mobile multimedia processors originally developed by Alphamosaic Ltd and now owned by Broadcom. Alphamosaic marketed its first version as a two-dimensional DSP architecture that makes it flexible and efficient enough to decode a number of multimedia codecs in software while maintaining low power usage. The semiconductor intellectual property core has been found so far only on Broadcom SoCs.

In computing, half precision is a binary floating-point computer number format that occupies 16 bits in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

<span class="mw-page-title-main">Raspberry Pi</span> Series of low-cost single-board computers

Raspberry Pi is a series of small single-board computers (SBCs) developed in the United Kingdom by the Raspberry Pi Foundation in association with Broadcom. The Raspberry Pi project originally leaned toward the promotion of teaching basic computer science in schools. The original model became more popular than anticipated, selling outside its target market for diverse uses such as robotics, home and industrial automation, and by computer and electronic hobbyists, because of its low cost, modularity, open design, and its adoption of the HDMI and USB standards.

<span class="mw-page-title-main">Julia (programming language)</span> Dynamic programming language

Julia is a high-level, general-purpose dynamic programming language, most commonly used for numerical analysis and computational science. Distinctive aspects of Julia's design include a type system with parametric polymorphism and the use of multiple dispatch as a core programming paradigm, efficient garbage collection, and a just-in-time (JIT) compiler.

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

<span class="mw-page-title-main">TensorFlow</span> Machine learning software library

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.

<span class="mw-page-title-main">Tensor Processing Unit</span> AI accelerator ASIC by Google

Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google's own TensorFlow software. Google began using TPUs internally in 2015, and in 2018 made them available for third-party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.

The Open Neural Network Exchange (ONNX) [] is an open-source artificial intelligence ecosystem of technology companies and research organizations that establish open standards for representing machine learning algorithms and software tools to promote innovation and collaboration in the AI sector. ONNX is available on GitHub.

<span class="mw-page-title-main">ROCm</span> Parallel computing platform: GPGPU libraries and application programming interface

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP/Message Passing Interface (MPI), and OpenCL.

raylib Game programming library

Raylib is a cross-platform open-source software development library. The library was made to create graphical applications and games.

GitHub Copilot is a code completion tool developed by GitHub and OpenAI that assists users of Visual Studio Code, Visual Studio, Neovim, and JetBrains integrated development environments (IDEs) by autocompleting code. Currently available by subscription to individual developers and to businesses, the generative artificial intelligence software was first announced by GitHub on 29 June 2021, and works best for users coding in Python, JavaScript, TypeScript, Ruby, and Go. In March 2023 GitHub announced plans for "Copilot X", which will incorporate a chatbot based on GPT-4, as well as support for voice commands, into Copilot.

Hugging Face, Inc. is a French-American company incorporated under the Delaware General Corporation Law and based in New York City that develops computation tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.

<span class="mw-page-title-main">ChatGPT</span> Chatbot and virtual assistant developed by OpenAI

ChatGPT is a chatbot and virtual assistant developed by OpenAI and launched on November 30, 2022. Based on large language models (LLMs), it enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language. Successive user prompts and replies are considered at each conversation stage as context.

A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.

Llama is a family of autoregressive large language models (LLMs) released by Meta AI starting in February 2023. The latest version is Llama 3, released in April 2024.

Vicuna LLM is an omnibus Large Language Model used in AI research. Its methodology is to enable the public at large to contrast and compare the accuracy of LLMs "in the wild" and to vote on their output; a question-and-answer chat format is used. At the beginning of each round two LLM chatbots from a diverse pool of nine are presented randomly and anonymously, their identities only being revealed upon voting on their answers. The user has the option of either replaying ("regenerating") a round, or beginning an entirely fresh one with new LLMs. Based on Llama 2, it is an open source project, and it itself has become the subject of academic research in the burgeoning field. A non-commercial, public demo of the Vicuna-13b model is available to access using LMSYS.

Mistral AI is a French company specializing in artificial intelligence (AI) products. Founded in April 2023 by former employees of Meta Platforms and Google DeepMind, the company has quickly risen to prominence in the AI sector.

References

  1. "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
  2. "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
  3. Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
  4. Gerganov, Georgi (17 May 2024). "ggerganov/ggml".
  5. "ggerganov/llama.cpp". GitHub .
  6. "ggerganov/whisper.cpp". GitHub .
  7. Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
  8. "GGML - AI at the edge". ggml.ai. Retrieved 16 April 2024.
  9. Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
  10. Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
  11. "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
  12. "GGUF". huggingface.co. Retrieved 9 May 2024.
  13. 1 2 "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
  14. Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.