Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Presented in Vortex Workshop and Tutorials, MICRO 2025

Paper Slides

Abstract

Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU’s Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core’s numerical accuracy. Ten-Four achieves 4-cycle operation latency at 262.325 MHz Fmax, delivering 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA, demonstrating ~3.1x performance improvement over an equivalent Berkeley HardFloat-based implementation at less than 60% the area cost.

@misc{rout2026tenfouropensourcefuseddot,
      title={Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores}, 
      author={Nikhil Rout and Blaise Tine},
      year={2026},
      eprint={2512.00053},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2512.00053}, 
}