Stream Processors, Inc.

Last updated • 6 min readFrom Wikipedia, The Free Encyclopedia
Stream Processors, Inc.
Company type Private
Industry Semiconductors-Specialized
Founded2004
Headquarters Sunnyvale, California, United States
Key people
Bill Dally, Co-Founder and ex-Chairman
Products Digital Signal Processor
Number of employees
Approximately 100 (2007)
Website www.streamprocessors.com

Stream Processors, Inc. (SPI), was a Silicon Valley–based fabless semiconductor company specializing in the design and manufacture of high-performance digital signal processors for applications including video surveillance, multi-function printers and video conferencing. The company ceased operations in 2009.

Contents

Company history

Foundational work in stream processing was initiated in 1995 by a research team led by MIT professor Bill Dally. In 1996, he moved to Stanford University where he continued this work, receiving a multimillion-dollar grant from DARPA with additional resources from Intel and Texas Instruments to fund the development of a project called "Imagine" - the first stream processor chip and accompanying compiler tools.

The Imagine Project

The goal of the Imagine project was to develop a C-programmable signal and image processor intended to provide both the performance density and efficiency of a special-purpose processor (such as a hard-wired ASIC). The project successfully demonstrated the advantages of stream processing. Details on the Imagine project and its results are posted on the Stanford Imagine project page. The work also showed that a number of applications ranging from wireless baseband processing, 3D graphics, encryption, IP forwarding to video processing could take advantage of the efficiency of stream processing. This research inspired other designs such as GPUs from ATI Technologies as well as the Cell microprocessor from Sony, Toshiba, and IBM.

The main deliverables from the Imagine program included:

SPI established

Dally, together with other team members, obtained a license from Stanford to commercialize the resulting technology. Stream Processors, Incorporated (SPI) was incorporated in California in 2004. Professor Dally remained at Stanford and the company hired industry veteran Chip Stearns to become the President and CEO in December of that year. [1] Through June, 2006 SPI has been able to raise a total of $26M from a trio of notable venture capital firms – Austin Ventures, Norwest Venture Partners and the Woodside Fund.

The company launched its first two products concurrently with the International Solid State Circuits Conference (ISSCC) in February, 2006 [2] and has introduced two others since. [3] [4]

SPI has headquarters located in Sunnyvale, California, as well as a software development group (SPI Software Technologies Pvt. Ltd) located in Bangalore, India.

In January 2009 Co-Founder Prof. Bill Dally accepted a position as Chief Scientist with NVIDIA Corporation. [5] At the same time he resigned as chairman. [6] In an interview Dally reflected on his experiences with startups: [6] " I have done several chip startups myself. It’s getting hard. The ante is very high. If you do a chip startup, you need patient investors with very deep pockets. It’s many tens of millions of dollars to get to a first product and $50 million to get to profits. That’s very difficult to do because investors want an exit some multiple over that investment. I am hoping we return to the days of frequent IPOs and get beyond the fire-sale acquisitions. That’s not what you can see right now. If it’s a programmable chip, the cost is even more."

In the summer of 2009 CEO Stearns left the company and was replaced by Mike Fister, an executive with senior level experience at Cadence Design Systems and Intel.

In September 2009 the company ceased operations. [7]

Technology

Similar to graphics and scientific computing, media and signal processing are characterized by available data-parallelism, locality and a high computation to global memory access ratio. Stream processing exploits these characteristics using data-parallel processing fed by a distributed memory hierarchy managed by the compiler. The main challenge for next generation massively parallel processors is data bandwidth, not computational resources. Unlike most conventional processors, the technology does not rely on a hardware cache – instead data movement is explicitly managed by the compiler and hardware.

The execution model is based on accelerating performance-critical functions (kernels) that process and produce data records (streams). Kernels and streams are scheduled at compile-time and moved to on-chip memory at runtime via a scoreboard. The compiler analyses data live times of streams to optimize allocation and minimize external memory bandwidth needs. Streams and kernels loads can overlap with execution to improve latency tolerance and the explicit data movement provides predictable performance. There are no CPU cache misses and the design presents a single-core model to the programmer – data-parallelism is within the kernels.

Architecture

The architecture includes a host CPU (System MIPS) for system-level tasks and a DSP Coprocessor Subsystem where the DSP MIPS runs the main threads that make kernel function calls to the Data Parallel Unit (DPU). For users that use libraries, and don't intend to develop DSP code, the architecture is a MIPS-based system-on-a-chip with an API to a “black box” coprocessor. The DPU Dispatcher receives kernel function calls to manage runtime kernel and stream loads. One kernel at a time is executed across the lanes, operating on local stream data stored in the Lane Register File of each lane. Each lane has a set of VLIW ALUs and distributed operand register files (ORF) allow for a large working data set and processing bandwidth exceeding 1 TeraByte/s. The Stream Load/Store Unit provides gather/scatter with a wide variety of access patterns. The InterLane Switch is a compiler-scheduled, full crossbar for high-speed access between lanes.

Tools

SPI's RapiDev Tools Suite leverages the predictability of stream processing to provide a fast path to optimized results using C programming. Starting with C reference code, the Fast Functional Debugger (FFD) library plugs into standard tools, such as Microsoft Visual Studio and GNU, and simulates the DPU to support re-structuring code to kernels and streams. Because kernels are statically scheduled and data movement is explicit, DPU cycle-accuracy can be obtained even at this functional high level. This is one source of the predictability of the architecture. For targeting code to the device, the Stream Processor Compiler (SPC) generates the VLIW executable and pre-processed C code that is compiled/linked via standard GCC for MIPS. SPC allocates streams in the Lane Register Files and provides dependency information for the kernel function calls. Software pipelining and loop unrolling are supported. Branch penalties are avoided by predicated selects and larger conditionals use conditional streams. Running under Eclipse, the Target Code Simulator provides comprehensive Host or Device binary code simulation with breakpoint and single-stepping capabilities with bandwidth and load statistics. A kernel view shows the VLIW pipeline for kernel optimizations, and a stream view shows kernel execution and stream loads to review global data movement for system profiling.

Products

SPI currently markets its Storm-1 family, that includes four fully software programmable DSPs of varying performance levels.

ProductGMACS*Applications
SP16HP-G220224
SP16-G160160
  • Telepresence
  • Surveillance DVRs
SP8-G8080
  • Printers, Scanners and MFPs
  • Surveillance DVRs
SP8LP-G3032
  • Professional camcorder
  • IP Camera

Note: GMACS stands for Giga (billions of) Multiply-Accumulate operations per Second, a common measure of DSP performance.

Support hardware and software

Related Research Articles

MIPS is a family of reduced instruction set computer (RISC) instruction set architectures (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States.

<span class="mw-page-title-main">Reduced instruction set computer</span> Processor executing one instruction in minimal clock cycles

In electronics and computer science, a reduced instruction set computer (RISC) is a computer architecture designed to simplify the individual instructions given to the computer to accomplish tasks. Compared to the instructions given to a complex instruction set computer (CISC), a RISC computer might require more instructions in order to accomplish a task because the individual instructions are written in simpler code. The goal is to offset the need to process more instructions by increasing the speed of each instruction, in particular by implementing an instruction pipeline, which may be simpler to achieve given simpler instructions.

In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation of that ISA.

Very long instruction word (VLIW) refers to instruction set architectures that are designed to exploit instruction-level parallelism (ILP). A VLIW processor allows programs to explicitly specify instructions to execute in parallel, whereas conventional central processing units (CPUs) mostly allow programs to specify instructions to execute in sequence only. VLIW is intended to allow higher performance without the complexity inherent in some other designs.

<span class="mw-page-title-main">Single instruction, multiple data</span> Type of parallel processing

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

<span class="mw-page-title-main">AVR microcontrollers</span> Family of microcontrollers

AVR is a family of microcontrollers developed since 1996 by Atmel, acquired by Microchip Technology in 2016. These are modified Harvard architecture 8-bit RISC single-chip microcontrollers. AVR was one of the first microcontroller families to use on-chip flash memory for program storage, as opposed to one-time programmable ROM, EPROM, or EEPROM used by other microcontrollers at the time.

The Intel i860 is a RISC microprocessor design introduced by Intel in 1989. It is one of Intel's first attempts at an entirely new, high-end instruction set architecture since the failed Intel iAPX 432 from the beginning of the 1980s. It was the world's first million-transistor chip. It was released with considerable fanfare, slightly obscuring the earlier Intel i960, which was successful in some niches of embedded systems. The i860 never achieved commercial success and the project was terminated in the mid-1990s.

<span class="mw-page-title-main">Digital signal processor</span> Specialized microprocessor optimized for digital signal processing

A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on metal–oxide–semiconductor (MOS) integrated circuit chips. They are widely used in audio signal processing, telecommunications, digital image processing, radar, sonar and speech recognition systems, and in common consumer electronic devices such as mobile phones, disk drives and high-definition television (HDTV) products.

Transmeta Corporation was an American fabless semiconductor company based in Santa Clara, California. It developed low power x86 compatible microprocessors based on a VLIW core and a software layer called Code Morphing Software.

<span class="mw-page-title-main">Coprocessor</span> Type of computer processor

A coprocessor is a computer processor used to supplement the functions of the primary processor. Operations performed by the coprocessor may be floating-point arithmetic, graphics, signal processing, string processing, cryptography or I/O interfacing with peripheral devices. By offloading processor-intensive tasks from the main processor, coprocessors can accelerate system performance. Coprocessors allow a line of computers to be customized, so that customers who do not need the extra performance do not need to pay for it.

Nucleus RTOS is a real-time operating system (RTOS) produced by the Embedded Software Division of Mentor Graphics, a Siemens Business, supporting 32- and 64-bit embedded system platforms. The operating system (OS) is designed for real-time embedded systems for medical, industrial, consumer, aerospace, and Internet of things (IoT) uses. Nucleus was released first in 1993. The latest version is 3.x, and includes features such as power management, process model, 64-bit support, safety certification, and support for heterogeneous computing multi-core system on a chip (SOCs) processors.

<span class="mw-page-title-main">Blackfin</span> Family of 16-/32-bit microprocessors

Blackfin is a family of 16-/32-bit microprocessors developed, manufactured and marketed by Analog Devices. The processors have built-in, fixed-point digital signal processor (DSP) functionality performed by 16-bit multiply–accumulates (MACs), accompanied on-chip by a microcontroller. It was designed for a unified low-power processor architecture that can run operating systems while simultaneously handling complex numeric tasks such as real-time H.264 video encoding.

<span class="mw-page-title-main">Benchmark (computing)</span> Standardized performance evaluation

In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it.

<span class="mw-page-title-main">TMS320</span> Series of Digital Signal Processor chips

TMS320 is a blanket name for a series of digital signal processors (DSPs) from Texas Instruments. It was introduced on April 8, 1983, through the TMS32010 processor, which was then the fastest DSP on the market.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

<span class="mw-page-title-main">Multi-core processor</span> Microprocessor with more than one processing unit

A multi-core processor (MCP) is a microprocessor on a single integrated circuit (IC) with two or more separate central processing units (CPUs), called cores to emphasize their multiplicity. Each core reads and executes program instructions, specifically ordinary CPU instructions. However, the MCP can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single IC die, known as a chip multiprocessor (CMP), or onto multiple dies in a single chip package. As of 2024, the microprocessors used in almost all new personal computers are multi-core.

<span class="mw-page-title-main">Parallax Propeller</span> Multi-core microcontroller

The Parallax P8X32A Propeller is a multi-core processor parallel computer architecture microcontroller chip with eight 32-bit reduced instruction set computer (RISC) central processing unit (CPU) cores. Introduced in 2006, it is designed and sold by Parallax, Inc.

<span class="mw-page-title-main">TILE64</span>

TILE64 is a VLIW ISA multicore processor manufactured by Tilera. It consists of a mesh network of 64 "tiles", where each tile houses a general purpose processor, cache, and a non-blocking router, which the tile uses to communicate with the other tiles on the processor.

<span class="mw-page-title-main">TriMedia (mediaprocessor)</span>

TriMedia is a family of very long instruction word media processors from NXP Semiconductors. TriMedia is a Harvard architecture CPU that features many DSP and SIMD operations to efficiently process audio and video data streams. For TriMedia processor optimal performance can be achieved by only programming in C/C++ as opposed to most other VLIW/DSP processors which require assembly language programming to achieve optimal performance. High-level programmability of TriMedia relies on the large uniform register file and the orthogonal instruction set, in which RISC-like operations can be scheduled independently of each other in the VLIW issue slots. Furthermore, TriMedia processors boast advanced caches supporting unaligned accesses without performance penalty, hardware and software data/instruction prefetch, allocate-on-write-miss, as well as collapsed load operations combining a traditional load with a 2-taps filter function. TriMedia development has been supported by various research studies on hardware cache coherency, multithreading and diverse accelerators to build scalable shared memory multiprocessor systems.

In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler. They may reduce cache pollution, reduce bandwidth requirement, bypass latencies, by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.

References

  1. Press release streamprocessors.com December 13, 2004
  2. EETimes.com - Startup touts stream processing architecture for DSPs
  3. "Data-parallel DSP aimed at cost-sensitive video surveillance apps | Video Imaging DesignLine" . Retrieved 18 December 2023.
  4. EETimes.com - Stream Processors claims fastest DSP
  5. "Home".
  6. 1 2 "Stanford's Bill Dally leaps from academia to the computer graphics wars". 22 May 2009.
  7. "Report: Chip startup Stream Processors to shut down". Silicon Valley Business Journal. November 2, 2009 via bizjournals.com.

37°22′59.48″N122°04′42.08″W / 37.3831889°N 122.0783556°W / 37.3831889; -122.0783556