Posts by Collection

portfolio

publications

Optimizing RGB to Grayscale, Gaussian Blur and Sobel-Filter operations on FPGAs for reduced dynamic power consumption

Published in 3rd IEEE conference on AIIoT, 2024

Abstract: The conversion of pixels from their RGB to Grayscale formats is a crucial first step in numerous Image Pre-Processing, Computer Vision, and as highlighted here, edge detection modules. This paper presents an implementation of the Shift-Add Multiplication algorithm for efficient constant multiplications of the NTSC formula weights for RGB to Grayscale conversion on FPGAs. The proposed module is designed to be reconfigurable to both fixed-point and floating-point formats, providing flexibility in precision and resource utilization based on application requirements. Additionally, a Python script was developed to automate the generation of Verilog code for fractional constant multiplications, as proposed in this study. Pipelined modules for Gaussian Blur and the Sobel-Filter were also designed to enable the development of a complete real-time edge detection system on FPGAs. The findings reveal that Shift-Add algorithm based multiplier’s significantly reduce dynamic power consumption as compared to the use of the built-in DSP blocks on FPGA boards while performing constant multiplications for RGB to Grayscale conversion.

Paper URL

A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation

Published in Vortex Workshop, MICRO '58, 2025

Abstract: There has been increasing interest in developing and accelerating mixed-precision Matrix-Multiply-Accumulate operations in GPGPUs for Deep Learning workloads. However, existing open-source RTL implementations of inner dot product units rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a singular fused architecture, implemented as part of the open-source RISC-V based Vortex GPGPU’s Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/UINT4 formats and higher-precision accumulation in FP32/INT32, with an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate 4-cycle operation latency at 362.2 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled pipeline throughput of 5.795 GFlops in a 4-thread per warp configuration.

Paper URL

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.