Portfolio item number 1
Published:
Short description of portfolio item number 1
Published:
Short description of portfolio item number 2
Published in 3rd IEEE conference on AIIoT, 2024
Abstract: The conversion of pixels from RGB to grayscale is a crucial first step in numerous image pre-processing and computer vision pipelines and, as highlighted here, in edge detection modules. This paper presents an implementation of the Shift-Add Multiplication algorithm for efficient constant multiplications of the NTSC formula weights for RGB to grayscale conversion on FPGAs. The proposed module is designed to be reconfigurable to both fixed-point and floating-point formats, providing flexibility in precision and resource utilization based on application requirements. Additionally, a Python script was developed to automate the generation of Verilog code for fractional constant multiplications, as proposed in this study. Pipelined modules for Gaussian blur and the Sobel filter were also designed to enable the development of a complete real-time edge detection system on FPGAs. The findings reveal that Shift-Add based multipliers significantly reduce dynamic power consumption compared with the built-in DSP blocks on FPGA boards when performing constant multiplications for RGB to grayscale conversion.
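As a rough illustration of the shift-add idea described in the abstract, the sketch below multiplies by a constant using only shifts and additions, then applies it to the commonly cited NTSC weights (0.299, 0.587, 0.114). The 8-bit fractional width, the rounding of the weights, and all function names are assumptions made for this example; this is not the paper's generator script or its Verilog output.

```python
# Minimal software model of shift-add constant multiplication.
# Assumption: weights are quantized to 8 fractional bits (not taken from the paper).

FRAC_BITS = 8
NTSC_WEIGHTS = (0.299, 0.587, 0.114)

# Quantize each weight to an integer numerator over 2**FRAC_BITS.
Q_WEIGHTS = tuple(round(w * (1 << FRAC_BITS)) for w in NTSC_WEIGHTS)


def shift_add_terms(constant):
    """Return the bit positions that are set in a non-negative integer constant."""
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def shift_add_multiply(x, constant):
    """Multiply x by constant using one left shift per set bit, then summing."""
    return sum(x << shift for shift in shift_add_terms(constant))


def rgb_to_gray(r, g, b):
    """Weighted sum of 8-bit channels; the final right shift drops the fraction."""
    acc = sum(shift_add_multiply(c, w) for c, w in zip((r, g, b), Q_WEIGHTS))
    return acc >> FRAC_BITS


if __name__ == "__main__":
    print(Q_WEIGHTS)                   # quantized weights
    print(shift_add_terms(Q_WEIGHTS[0]))  # shifts used for the red weight
    print(rgb_to_gray(255, 255, 255))
```

With this quantization the three integer weights sum to exactly 256, so a pure white pixel maps back to 255. In hardware each set bit would correspond to a wired shift feeding an adder, which is why the number of set bits in the constant, rather than its magnitude, tends to drive the cost.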
Published in Vortex Workshop, MICRO '58, 2025
Abstract: There has been increasing interest in developing and accelerating mixed-precision Matrix-Multiply-Accumulate operations in GPGPUs for Deep Learning workloads. However, existing open-source RTL implementations of inner dot product units rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V based Vortex GPGPU’s Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/UINT4 formats and higher-precision accumulation in FP32/INT32, with an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate 4-cycle operation latency at a 362.2 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled-pipeline throughput of 5.795 GFLOPS in a 4-thread-per-warp configuration.
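To make the "low-precision multiply, higher-precision accumulate" pattern concrete, here is a small software model of an INT8 dot product accumulated in INT32, followed by a sanity check that relates the two reported figures (clock frequency and throughput). The function names are hypothetical, and the wraparound behavior on overflow is an assumption of this sketch; the abstract does not say whether the hardware wraps or saturates.

```python
# Software model of INT8 x INT8 products accumulated into a signed 32-bit value.
# Assumption: overflow wraps in two's complement (the source does not specify this).

def to_int32(value):
    """Wrap a Python integer into the signed 32-bit two's complement range."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def int8_dot_int32(a, b, acc=0):
    """Accumulate the element-wise products of two INT8 vectors into acc."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    for x, y in zip(a, b):
        if not (-128 <= x <= 127 and -128 <= y <= 127):
            raise ValueError("inputs must fit in signed 8 bits")
        acc = to_int32(acc + x * y)
    return acc


# Figures quoted in the abstract.
CLOCK_HZ = 362.2e6
THROUGHPUT_FLOPS = 5.795e9

# Dividing throughput by clock rate gives the implied operations per cycle.
ops_per_cycle = THROUGHPUT_FLOPS / CLOCK_HZ

if __name__ == "__main__":
    print(int8_dot_int32([1, 2, 3, 4], [5, 6, 7, 8]))
    print(ops_per_cycle)
```

The ratio comes out at roughly 16 operations per cycle. How those operations are distributed across the four threads and the lanes of the dot product unit is not stated in the abstract, so the sketch does not attempt to model it.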
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Published:
This is a description of your conference proceedings talk, note the different field in type. You can put anything in this field.
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.