Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

In Preprint, Under review.

Abstract

Efficient mixed-precision MMA operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source Tensor Core implementations rely on discrete arithmetic unit designs, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a configurable mixed-precision fused dot product unit integrating both floating-point and integer arithmetic pipelines within a unified architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU’s Tensor Core Unit extension. It supports low-precision multiplication in TF32/FP16/BF16/FP8/BF8/INT8/INT4 with higher-precision FP32/INT32 accumulation, native Microscaling (MX) support, and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core numerical accuracy. Ten-Four achieves 4-cycle latency at 300 MHz Fmax on the Xilinx U55C FPGA, delivering 130.368 GFLOPS peak throughput per Tensor Core and 2.7x-7.9x speedup over equivalent Berkeley HardFloat and FPnew based implementations at less than 60% the area cost. ASIC synthesis in 7nm FinFET achieves 2.771 TFLOPS/W peak efficiency at 1.58 GHz Fmax.

@misc{rout2026tenfouropensourcefuseddot,
      title={Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores}, 
      author={Nikhil Rout and Blaise Tine},
      year={2026},
      eprint={2512.00053},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2512.00053}, 
}