GPNPU has multi-core cluster options for 100+ TOPS

The third generation of general-purpose neural processing units (GPNPUs) from Quadric delivers up to 108 TOPS per core and introduces pre-integrated clusters of two, four or eight cores. This latest iteration offers higher performance and optimisations for generative AI. The company has also introduced a safety-enhanced version for automotive applications.

The company described the Chimera QC series as NPUs with a DSP core designed to simplify programming and provide flexibility for high-compute applications. The IP combines the machine learning (ML) characteristics of a neural processing accelerator with the full C++ programmability of a modern DSP. The QC is the third-generation implementation of the Chimera architecture and includes single-core and multi-core cluster versions, each of which is also available in a safety-enhanced version.

There are three configurable single-core processor options: the Chimera QC Nano processor (up to 7 TOPS of ML performance), the Chimera QC Perform processor (up to 28 TOPS) and the Chimera QC Ultra processor (up to 108 TOPS).

The QC-M family of multi-core GPNPUs consists of pre-integrated clusters of two, four or eight of the QC Nano, QC Perform or QC Ultra building block cores. It can therefore scale from running small workloads in parallel (Nano cores) to high-compute applications (eight QC Ultra cores). At the top end, this provides Level 4 central ADAS applications with 864 TOPS, said the company, sufficient for processing multiple large-format camera input streams in parallel. QC-M clusters include inter-core communications circuitry as well as streaming weight-sharing functions for broadcasting common ML model weights to two or more cores in a cluster.
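The 864 TOPS figure matches eight QC Ultra cores at the quoted 108 TOPS each. A minimal sketch of that arithmetic follows, assuming ideal linear scaling across cores; the function name `cluster_peak_tops` is illustrative and not part of any Quadric tool.

```python
# Per-core peak figures quoted in the article, in TOPS.
CORE_TOPS = {"Nano": 7, "Perform": 28, "Ultra": 108}

# Cluster sizes offered in the QC-M family.
CLUSTER_SIZES = (2, 4, 8)


def cluster_peak_tops(core: str, cores: int) -> int:
    """Ideal peak TOPS of a cluster (assumption: perfectly linear scaling)."""
    if core not in CORE_TOPS:
        raise ValueError(f"unknown core type: {core}")
    if cores not in CLUSTER_SIZES:
        raise ValueError(f"unsupported cluster size: {cores}")
    return CORE_TOPS[core] * cores


if __name__ == "__main__":
    for core in CORE_TOPS:
        row = {n: cluster_peak_tops(core, n) for n in CLUSTER_SIZES}
        print(core, row)
```

Real workloads will see less than the ideal figure, since inter-core communication and memory traffic are not modelled here.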

The Chimera architecture offers configuration options that pair high-performance multiply-accumulate (MAC) units with C++ programmable 32-bit fixed-point ALUs in each processing element (PE). Arrays of 64 to 1024 PEs build the Nano, Perform and Ultra cores.

Each configured GPNPU core can have a ratio of eight, 16 or 32 INT8 MACs for each PE. Designers targeting systems with large, weight-bound workloads such as large language models (LLMs) will choose the eight-MAC configuration with wide AXI (advanced extensible interface) connections. There is an option to use 4-bit weights trained in advanced training tools, which reduces data bandwidth requirements compared to standard 8-bit integer weights. Coupled with wide AXI interconnect interfaces (up to 1024 bits/cycle), the QC series cores are intended for designs seeking to implement low-power, high-performance LLMs in volume consumer devices.
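To illustrate why 4-bit weights matter for weight-bound workloads, the rough sketch below estimates how long it takes to stream a model's weights once over an AXI interface. The 7-billion-parameter model and the 1 GHz interface clock are hypothetical values chosen for illustration, not figures from Quadric; only the 1024 bits/cycle width comes from the text above.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


def stream_time_s(num_params: int, bits_per_weight: int,
                  axi_bits_per_cycle: int, clock_hz: float) -> float:
    """Time to move every weight across the interface once (ideal, no stalls)."""
    bytes_per_second = axi_bits_per_cycle / 8 * clock_hz
    return weight_bytes(num_params, bits_per_weight) / bytes_per_second


if __name__ == "__main__":
    PARAMS = 7_000_000_000   # hypothetical model size
    AXI_BITS = 1024          # widest option mentioned in the article
    CLOCK_HZ = 1.0e9         # hypothetical interface clock

    for bits in (8, 4):
        t = stream_time_s(PARAMS, bits, AXI_BITS, CLOCK_HZ)
        print(f"{bits}-bit weights: {weight_bytes(PARAMS, bits) / 1e9:.1f} GB, "
              f"{t * 1e3:.1f} ms per full pass")
```

Under these assumptions, halving the weight width halves both the storage and the time per pass, which is the bandwidth saving the 4-bit option targets.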

More MAC-intensive workloads, e.g., high-resolution image processing, can use the higher-ratio option of 32 MACs per PE. There is also a configurable option for each processor to add a 16-bit floating-point MAC unit, which operates at half the throughput rate of the INT8 MACs.
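The relationship between PE count, MAC ratio and peak throughput can be sketched with the conventional estimate of two operations (one multiply, one add) per MAC per cycle. This is a generic back-of-envelope formula, not Quadric's published method, and the clock frequency is an assumption supplied by the caller, so the result will not necessarily reproduce the quoted TOPS figures.

```python
OPS_PER_MAC = 2  # conventional counting: one multiply plus one accumulate


def int8_macs_per_cycle(pes: int, macs_per_pe: int) -> int:
    """Total INT8 MACs issued per cycle across the PE array."""
    if macs_per_pe not in (8, 16, 32):
        raise ValueError("MAC ratio must be 8, 16 or 32 per PE")
    return pes * macs_per_pe


def fp16_macs_per_cycle(pes: int, macs_per_pe: int) -> float:
    """FP16 MAC rate, taken as half the INT8 rate as stated in the article."""
    return int8_macs_per_cycle(pes, macs_per_pe) / 2


def peak_int8_tops(pes: int, macs_per_pe: int, clock_hz: float) -> float:
    """Rough peak INT8 TOPS for an assumed clock frequency."""
    return int8_macs_per_cycle(pes, macs_per_pe) * OPS_PER_MAC * clock_hz / 1e12


if __name__ == "__main__":
    # 1024 PEs at the 32-MAC ratio; 1.7 GHz is the article's 3nm upper bound.
    print(int8_macs_per_cycle(1024, 32))
    print(fp16_macs_per_cycle(1024, 32))
    print(round(peak_int8_tops(1024, 32, 1.7e9), 1))
```

The same functions show the trade-off described above: dropping from 32 to eight MACs per PE cuts peak compute by a factor of four, which is acceptable when the workload is limited by weight bandwidth rather than by MAC throughput.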

The instruction set simulator enables design teams to simulate target workloads across MAC ratios, AXI widths, tightly coupled Level 2 RAM sizes and other user-selectable hardware options. Quadric said that, compared to the previous-generation Chimera processor, the configuration options of the Chimera QC cores can deliver up to 2.7X higher TOPS/mm² compute density.

Both the QC processor series and the multi-core QC-M processor family are offered in safety-enhanced (SE) versions that add hardware enhancements for greater error resiliency. Each SE version core is coupled with FMEA (failure mode and effects analysis) reports and collaborative DIA (development interface agreement) report generation, backed by the Chimera software development kit toolchain, which is undergoing ISO 26262 certification.

Quadric co-founder and CEO Veerbhan Kheterpal said the compute density achieved is a significant breakthrough for the automotive market, enabling ADAS compute chiplets for under $10. “A component supplier … building a 3nm chiplet could deliver over 400TOPS of fully C++ programmable ML + DSP compute for software defined vehicle platforms for a die cost of well under $10,” he said. The current alternative is to repurpose $10,000 data centre GPGPUs or performance-limited mobile phone chipsets for the automotive market, he added.

Chimera cores can be targeted to any silicon foundry and any process technology. The family of QC Series GPNPUs can achieve up to 1.7 GHz operation in 3nm processes using conventional standard cell flows and commonly available single-ported SRAM.