Dryad (programming)

Dryad
Original author(s) Microsoft Research
Developer(s) Microsoft
Stable release
v0.2.1 / October 7, 2014
Repository github.com/MicrosoftResearch/Dryad
Written in C#, C++, C
Operating system Microsoft Windows
Platform .NET Framework
Type Software framework
License Apache License 2.0
Website www.microsoft.com/en-us/research/project/dryad/

Dryad was a research project at Microsoft Research for a general-purpose runtime for the execution of data-parallel applications. Research prototypes of the Dryad and DryadLINQ data-parallel processing frameworks are available in source form on GitHub. [1]

Overview

Microsoft made several preview releases of this technology available as add-ons to Windows HPC Server 2008 R2.

An application written for Dryad is modeled as a directed acyclic graph (DAG). The DAG defines the dataflow of the application, and the vertices of the graph define the operations that are to be performed on the data. These "computational vertices" are written using sequential constructs, devoid of any concurrency or mutual-exclusion semantics. The Dryad runtime parallelizes the dataflow graph by distributing the computational vertices across various execution engines (which can be multiple processor cores on the same computer or different physical computers connected by a network, as in a cluster). Scheduling of the computational vertices on the available hardware is handled by the Dryad runtime, without any explicit intervention by the developer of the application or the administrator of the network. The flow of data from one computational vertex to another is implemented using communication "channels" between the vertices, which are physically realized as TCP/IP streams, shared memory, or temporary files. A stream is used at runtime to transport a finite number of structured items.
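
As a rough sketch of this model (in C#, and deliberately not the Dryad API), the program below models vertices as plain sequential functions and channels as in-memory blocking collections; in Dryad itself the channels would be TCP/IP streams, shared memory, or temporary files managed by the runtime, and the vertices would run on separate cores or machines.

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class DataflowSketch
    {
        static void Main()
        {
            // "Channels" between vertices; Dryad realizes these as TCP/IP
            // streams, shared memory, or temporary files.
            var channelAB = new BlockingCollection<string>();
            var channelBC = new BlockingCollection<int>();

            // Vertex A: a source emitting a finite stream of items.
            var vertexA = Task.Run(() =>
            {
                foreach (var line in new[] { "one", "two", "three" })
                    channelAB.Add(line);
                channelAB.CompleteAdding();
            });

            // Vertex B: purely sequential per-item computation, with no
            // concurrency or mutual-exclusion logic of its own.
            var vertexB = Task.Run(() =>
            {
                foreach (var line in channelAB.GetConsumingEnumerable())
                    channelBC.Add(line.Length);
                channelBC.CompleteAdding();
            });

            // Vertex C: a sink consuming the final stream.
            var vertexC = Task.Run(() =>
            {
                foreach (var n in channelBC.GetConsumingEnumerable())
                    Console.WriteLine(n);
            });

            Task.WaitAll(vertexA, vertexB, vertexC);
        }
    }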

Dryad defines a domain-specific language, implemented via a C++ library, that is used to create and model a Dryad execution graph. Computational vertices are written using standard C++ constructs. To make them accessible to the Dryad runtime, they must be encapsulated in a class that inherits from the GraphNode base class. The graph is defined by adding edges, using a composition operator (defined by Dryad) that connects two graphs (or two nodes of a graph) with an edge. Managed code wrappers for the Dryad API can also be written.
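
Since managed wrappers can be written, the following C# sketch suggests the flavor of graph construction. The GraphNode base class is taken from the description above, but Graph and Connect are hypothetical stand-ins: in the real C++ library, the composition operator plays the role that Connect plays here.

    using System.Collections.Generic;

    // Hypothetical managed-wrapper sketch, not the actual Dryad API.
    abstract class GraphNode
    {
        // A computational vertex encapsulating ordinary sequential code.
        public abstract void Run();
    }

    class ReadVertex : GraphNode
    {
        public override void Run() { /* read an input partition */ }
    }

    class SortVertex : GraphNode
    {
        public override void Run() { /* sort the items sequentially */ }
    }

    class Graph
    {
        private readonly List<(GraphNode, GraphNode)> edges =
            new List<(GraphNode, GraphNode)>();

        // Stand-in for Dryad's composition operator: connecting two nodes
        // (or subgraphs) adds an edge to the DAG.
        public Graph Connect(GraphNode from, GraphNode to)
        {
            edges.Add((from, to));
            return this;
        }
    }

A two-vertex pipeline would then be assembled with something like new Graph().Connect(new ReadVertex(), new SortVertex()), leaving the scheduling of each vertex to the runtime.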

Several high-level language compilers use Dryad as a runtime; examples include Scope (Structured Computations Optimized for Parallel Execution) and DryadLINQ. [2]
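
The flavor of DryadLINQ can be suggested with an ordinary LINQ query. The word count below is plain, runnable LINQ-to-Objects; DryadLINQ's contribution was to compile a query of this shape into a Dryad execution graph and run it on a cluster. The comments indicate how the stages might map onto vertices and edges, as an illustration rather than DryadLINQ's actual plan.

    using System;
    using System.Linq;

    class WordCount
    {
        static void Main()
        {
            var lines = new[] { "the quick fox", "the lazy dog" };

            var counts = lines
                .SelectMany(line => line.Split(' '))  // tokenizing vertices
                .GroupBy(word => word)                // a partitioning edge
                .Select(g => new { Word = g.Key, Count = g.Count() });

            foreach (var c in counts)
                Console.WriteLine($"{c.Word}: {c.Count}");
        }
    }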

In October 2011, Microsoft discontinued active development on Dryad, shifting focus to the Apache Hadoop framework. [3] [4] [5]

Related Research Articles

In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture. Dataflow programming languages share some features of functional languages, and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing. Some authors use the term datastream instead of dataflow to avoid confusion with dataflow computing or dataflow architecture, based on an indeterministic machine paradigm. Dataflow programming was pioneered by Jack Dennis and his graduate students at MIT in the 1960s.

Dataflow architecture is a dataflow-based computer architecture that directly contrasts with the traditional von Neumann architecture or control flow architecture. Dataflow architectures have, in concept, no program counter: the executability and execution of instructions are determined solely by the availability of the instructions' input arguments, so the order of instruction execution may be hard to predict.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Concurrent computing is a form of computing in which several computations are executed concurrently (during overlapping time periods) instead of sequentially (with one completing before the next starts).

Binary Modular Dataflow Machine

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

In computer science, stream processing is a programming paradigm which views data streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Parallel Extensions

Parallel Extensions was the development name for a managed concurrency library developed in a collaboration between Microsoft Research and the CLR team at Microsoft. The library was released in version 4.0 of the .NET Framework. It is composed of two parts: Parallel LINQ (PLINQ) and the Task Parallel Library (TPL). It also includes a set of coordination data structures (CDS), used to synchronize and coordinate the execution of concurrent tasks.
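
Both halves can be shown with the standard .NET APIs: PLINQ parallelizes a declarative query via AsParallel(), and the TPL runs a data-parallel loop via Parallel.For.

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class ParallelExtensionsDemo
    {
        static void Main()
        {
            // PLINQ: AsParallel() partitions the query across available cores.
            var sumOfEvenSquares = Enumerable.Range(1, 1000)
                .AsParallel()
                .Where(n => n % 2 == 0)
                .Select(n => n * n)
                .Sum();
            Console.WriteLine(sumOfEvenSquares);

            // TPL: a data-parallel loop over an index range.
            var roots = new double[1000];
            Parallel.For(0, roots.Length, i => { roots[i] = Math.Sqrt(i); });
            Console.WriteLine(roots[999]);
        }
    }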

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.
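
For example, a C# query expression is translated by the compiler into calls to standard query operators such as Where, OrderBy, and Select over any IEnumerable source:

    using System;
    using System.Linq;

    class LinqDemo
    {
        static void Main()
        {
            var words = new[] { "delta", "alpha", "charlie", "bravo" };

            // Query syntax; the compiler rewrites this into
            // words.Where(...).OrderBy(...).Select(...).
            var shortWords = from w in words
                             where w.Length <= 5
                             orderby w
                             select w.ToUpper();

            foreach (var w in shortWords)
                Console.WriteLine(w);
        }
    }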

In computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing.

Apache Pig: Open-source data analytics software

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications that require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

Data-centric programming languages are a category of programming languages whose primary function is the management and manipulation of data. A data-centric programming language includes built-in processing primitives for accessing data stored in sets, tables, lists, and other data structures and databases, and for the specific manipulation and transformation of data required by a programming application. Data-centric programming languages are typically declarative and often dataflow-oriented, and define the processing result desired; the specific processing steps required to perform the processing are left to the language compiler. The SQL relational database language is an example of a declarative, data-centric language. Declarative, data-centric programming languages are ideal for data-intensive computing applications.

Apache Storm: Open-source distributed stream processing

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and his team at BackType, the project was open-sourced after being acquired by Twitter. It uses custom-created "spouts" and "bolts" to define information sources and manipulations, allowing batch, distributed processing of streaming data. The initial release was on 17 September 2011.

Apache Spark: Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Deeplearning4j: Open-source deep learning library

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

Apache Flink: Framework and distributed processing engine

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Azure Data Lake is a scalable data storage and analytics service. The service is hosted in Azure, Microsoft's public cloud.

Owl Scientific Computing: Numerical programming library for the OCaml programming language

Owl Scientific Computing is a software system for scientific and engineering computing developed in the Department of Computer Science and Technology, University of Cambridge. The System Research Group (SRG) in the department recognises Owl as one of the representative systems developed in SRG in the 2010s. The source code is licensed under the MIT License and can be accessed from the GitHub repository.

References

  1. "MicrosoftResearch/Dryad: A research prototype of the Dryad and DryadLINQ data-parallel processing frameworks running on Hadoop YARN". GitHub.
  2. "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language" (PDF). Microsoft Research. Retrieved 2009-01-21.
  3. Pattee, Don. "Announcing the Windows Azure HPC Scheduler and HPC Pack 2008 R2 Service Pack 3 releases!". Microsoft. Retrieved 2013-05-31.
  4. Foley, Mary Jo. "Microsoft drops Dryad; puts its big-data bets on Hadoop". ZDNet. Retrieved 2013-05-31.
  5. Henschen, Doug. "Microsoft Ditches Dryad, Focuses On Hadoop". InformationWeek. Retrieved 2013-05-31.
