
Best performance per watt and per dollar
Kalray’s data processing units deliver exceptional performance and energy efficiency for AI/ML workloads. Available in the U.S. through our partnership with Kalray.
Power-efficient dataflow acceleration — Kalray DPUs
When GPUs are overkill (or over budget) for streaming, signal, and pre/post-processing, Kalray’s MPPA™ Data Processing Units deliver GPU-class throughput at a fraction of the watts—with deterministic latency. With the Brane SDK, you develop and orchestrate DPUs alongside CPUs/GPUs from a single environment. As Kalray’s official U.S. partner, Brane provides local availability, integration, and support.

Perf / Watt Advantage
Many-core architecture and efficient on-chip fabric deliver high throughput at low power—shift streaming ops off the GPU and cut watts.
Deterministic Pipelines
Predictable latency and QoS for real-time agents, vision, and edge deployments—ideal for pre/post-processing and data movement.
Develop Once, Orchestrate Anywhere
Brane SDK unifies CPU/GPU/DPU workflows—build, debug, and schedule in one toolchain with flexible programming models.
Kalray’s MPPA™ Data Processing Units are programmable, many-core processors built for dataflow work—streaming I/O, pre/post-processing, filtering, vector ops, and data movement. They deliver high, predictable throughput at lower power, keeping GPUs focused on math.

Kalray MPPA™ — what it brings
- Many-core parallelism: on-chip network + clusters keep multiple pipelines running concurrently.
- Deterministic latency: predictable QoS for real-time and edge agents.
- Perf per watt: offload tokenization, packing, codecs, vector search, compression, and data movement.
- Unified dev: C/C++ on Linux; orchestrate CPU/GPU/DPU with the Brane SDK (see the host-side sketch below).
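As a rough sketch of what a host-side offload looks like in plain C/C++, the snippet below pushes one streaming chunk through a placeholder kernel using the portable OpenCL host API. It illustrates the general pattern only: this is not Brane SDK or Kalray-specific code, and the device selection, kernel body, and buffer sizes are assumptions to adapt to your actual toolchain and board.

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Placeholder kernel: scales a float stream in place, standing in for a real
// pre/post-processing stage such as normalization or filtering.
static const char* kSrc = R"CLC(
__kernel void scale(__global float* data, float gain) {
    size_t i = get_global_id(0);
    data[i] *= gain;
}
)CLC";

int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    // An accelerator card typically enumerates as CL_DEVICE_TYPE_ACCELERATOR;
    // fall back to any device so the sketch also runs on a plain CPU/GPU box.
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr) != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kern = clCreateKernel(prog, "scale", &err);

    std::vector<float> chunk(1 << 16, 1.0f);        // one streaming chunk
    size_t bytes = chunk.size() * sizeof(float);
    float gain = 0.5f;

    // Host -> device, launch, device -> host; error checks omitted for brevity.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, chunk.data(), 0, nullptr, nullptr);
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kern, 1, sizeof(float), &gain);
    size_t global = chunk.size();
    clEnqueueNDRangeKernel(queue, kern, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, chunk.data(), 0, nullptr, nullptr);
    clFinish(queue);

    printf("first element after offload: %.2f\n", chunk[0]);   // expect 0.50

    clReleaseMemObject(buf);
    clReleaseKernel(kern);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}

The build, debug, and scheduling side of this is what the Brane SDK unifies across CPU, GPU, and DPU targets, so a stage like the one above can be invoked alongside the rest of the pipeline from one environment.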
Featured Kalray Accelerators
Kalray MPPA™ acceleration cards deliver high, predictable throughput for dataflow pipelines at a fraction of the power of a GPU — ideal for AI pre/post-processing, streaming I/O, and vector workloads.
Kalray K300 — MPPA®3-80
Low Power
Storage-centric acceleration card optimized for NVMe fan-out and high-throughput I/O operations. Perfect for data-intensive applications requiring massive parallel storage access with minimal power consumption.
Key Specifications
- Processor: Kalray MPPA®3-80 V1.2 @ 1 GHz
- Storage fan-out: up to 24 × 30 TB NVMe SSD
- Interfaces: PCIe Gen4 x16; 2 × QSFP28 100 GbE
- Throughput: up to 25 TFLOPS (FP16)
- Power: 36 W (typ.), 42 W (max)
Kalray TC4 — Quad MPPA®3-80
Multi-DPU
Quad-processor powerhouse delivering exceptional parallel processing performance for AI pipelines. Four MPPA®3-80 processors provide deterministic latency and massive throughput for demanding edge AI workloads.
Key Specifications
- Processors: 4 × Kalray MPPA®3-80 V1.2 @ 1 GHz
- Interfaces: PCIe Gen4 x16; 2 × QSFP28 100 GbE
- Throughput: up to 100 TFLOPS (FP16)
- Power: 60 W (typ.), 250 W (max)
- Manufactured in France (Asteelflash)
Need a different form factor or a custom integration? We also support additional Kalray configurations.
Kalray DPU — FAQ
The essentials on Kalray TC4 DPUs and how they slot into data-heavy AI pipelines.
What is a Kalray DPU?
A Data Processing Unit that offloads streaming, parsing, and data-movement tasks from CPUs/GPUs. Kalray’s MPPA® many-core design runs thousands of lightweight threads with predictable latency and strong perf per watt for I/O-bound pipelines.
Where does a DPU help the most?
Data ingest (video, telemetry), pre/post-processing (tokenization, compression, filtering), vector/RAG streaming, and network/IO handling. Offloading frees CPU/GPU cycles for model work.
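To make the offload candidates concrete, here is a minimal CPU reference for one such stage: quantizing a float telemetry chunk to int8 before it reaches the model. The function name and sizes are illustrative only; the point is the shape of the work: simple, per-element, bandwidth-bound loops that map well onto a many-core DPU.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative pre-processing stage: scale and clamp float telemetry into the
// int8 range. Per-element, branch-light loops like this are typical DPU
// offload candidates; names and sizes here are placeholders, not an API.
std::vector<int8_t> quantize_chunk(const std::vector<float>& in, float scale) {
    std::vector<int8_t> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        float v = in[i] * scale;
        out[i] = static_cast<int8_t>(std::clamp(v, -128.0f, 127.0f));
    }
    return out;
}

int main() {
    std::vector<float> chunk(4096, 0.37f);   // one ingest chunk of telemetry
    std::vector<int8_t> q = quantize_chunk(chunk, 100.0f);
    printf("quantized[0] = %d\n", static_cast<int>(q[0]));   // expect 37
    return 0;
}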
Does a DPU replace my GPU?
No—it’s complementary. GPUs handle dense math for training/inference; the DPU accelerates the surrounding data pipeline to raise overall throughput per watt.
What are the core traits of the “Coolidge” processor?
Many-core compute clusters linked by an AXI fabric and RDMA-capable NoC for fast on-chip transfers, with memory/protection units for isolation and deterministic execution—useful at the edge.
Which OS and tools are supported?
Linux is the target OS. Kalray provides the low-level SDK/drivers; the Brane SDK offers containerized runners and orchestration so DPU stages can be invoked alongside CPU/GPU code from one environment.
How do we get started?
Share your pipeline stages (sources → transforms → sinks) and constraints (latency, power). We’ll map candidate stages to the DPU and set up a quick proof-of-value. Talk to an engineer.
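If it helps to see what that stage inventory looks like, the sketch below is one hypothetical way to write it down: each stage with its role, sustained data rate, and latency budget. The struct and field names are not part of any Kalray or Brane API; any structured list works just as well.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical pipeline inventory: sources -> transforms -> sinks, each with
// the throughput it must sustain and the latency budget it must meet.
struct Stage {
    std::string name;
    std::string kind;            // "source", "transform", or "sink"
    double mb_per_sec;           // sustained data rate
    double max_latency_ms;       // per-stage latency budget
};

int main() {
    std::vector<Stage> pipeline = {
        {"camera ingest",      "source",    800.0,  5.0},
        {"resize + normalize", "transform", 800.0,  3.0},   // DPU candidate
        {"inference",          "transform", 200.0, 15.0},   // stays on the GPU
        {"result encode",      "sink",       50.0,  2.0},   // DPU candidate
    };
    for (const Stage& s : pipeline)
        printf("%-18s %-10s %7.1f MB/s  %5.1f ms\n",
               s.name.c_str(), s.kind.c_str(), s.mb_per_sec, s.max_latency_ms);
    return 0;
}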
Need help with Kalray Coolidge (TC4)?

Whether you’re getting started, porting operators or pipelines, or selecting the right board and sizing PCIe bandwidth and power, we can help you move fast with a Linux-first setup and AccelOne SDK orchestration.
- Bring-up on Linux and SDK environment
- Porting data transforms, codecs, pre/post stages
- Board selection, bandwidth & power sizing
- Integration and proof-of-value flow
