With the rapid advancement of AI and Machine Learning applications, there is an ever-increasing demand for faster and more complex computation. This demand has spurred the development of new hardware accelerators tailored to the needs of ML practitioners.
Matrix multiplication is one of the most frequently used operations in Machine Learning, particularly in Neural Networks. Systolic arrays offer a way to accelerate matrix-matrix multiplication and are used in Google TPU accelerators. The following sections describe the RTL implementation of a 3x3 systolic array.
The datapath of each PE contains three registers and hardware capable of performing a multiply-accumulate (MAC) operation. The constant weight matrix is received on the Win bus, and the partial result produced by the PE above arrives on the Sin bus. WReg stores the weight, while DReg holds the incoming data value used in the MAC operation. By composing these simple PEs, matrix-matrix multiplications can be computed rapidly.
Figure 1.1 - The datapath structure of the Processing Element.
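The RTL itself is not reproduced here, but a minimal behavioral sketch of the PE in Python may help make the dataflow concrete. The `load_weight` select and the single-step MAC are assumptions made for illustration; the names follow the WReg/DReg/Win/Sin signals above.

```python
class PE:
    """Behavioral sketch of one weight-stationary Processing Element.

    WReg holds the stationary weight loaded from Win; DReg latches the data
    value arriving from the left; the partial sum from the PE above (Sin) is
    combined with WReg * data and passed downward.  The 'load_weight' select
    and the single-step MAC are assumptions for this sketch -- the actual PE
    in Figure 1.1 is a multicycle unit.
    """

    def __init__(self):
        self.w_reg = 0   # WReg: stationary weight
        self.d_reg = 0   # DReg: data value moving horizontally
        self.s_reg = 0   # partial-sum register moving vertically

    def step(self, win, din, sin, load_weight):
        """One update step; returns (Dout, Sout) for the right/lower PEs."""
        if load_weight:
            self.w_reg = win                      # weight-loading phase
        else:
            self.s_reg = sin + self.w_reg * din   # MAC: Sout = Sin + W * D
            self.d_reg = din                      # forward the data value
        return self.d_reg, self.s_reg


pe = PE()
pe.step(win=5, din=0, sin=0, load_weight=True)            # load weight 5 into WReg
print(pe.step(win=0, din=3, sin=10, load_weight=False))   # (3, 25): D forwarded, 10 + 5*3
```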
For the PE to operate correctly, a control unit is needed to manage the computation and weight-loading phases:
Figure 1.2 - The controller diagram of the Processing Element.
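As a rough, hypothetical stand-in for that controller, the two phases it has to sequence (weight loading, then MAC computation) can be sketched as a small state machine. The state names and the `start`/`last_input` status signals are assumptions, not the signals of the actual FSM in Figure 1.2.

```python
from enum import Enum, auto

class PEState(Enum):
    IDLE = auto()
    LOAD_WEIGHT = auto()   # assert load_weight so Win is captured in WReg
    COMPUTE = auto()       # deassert load_weight and let the MAC run

def pe_next_state(state, start, last_input):
    """Return (next_state, load_weight) for one transition.

    A sketch of the control flow only: 'start' and 'last_input' are
    hypothetical status signals, not the controller's actual inputs.
    """
    if state is PEState.IDLE:
        return (PEState.LOAD_WEIGHT, True) if start else (PEState.IDLE, False)
    if state is PEState.LOAD_WEIGHT:
        return PEState.COMPUTE, False
    # COMPUTE: keep accumulating until the last data value has passed through
    return (PEState.IDLE, False) if last_input else (PEState.COMPUTE, False)
```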
By cascading multiple PEs, a systolic array is formed. Note the connections: the D values move horizontally, while the S values (the MAC partial results) move vertically once every two cycles.
Figure 2.1 - The datapath structure of the Systolic Array.
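To make the timing of this dataflow concrete, the following Python sketch models a 3x3 weight-stationary array at cycle level and checks it against a direct matrix product. The one-cycle-per-hop timing and the one-cycle input skew per row are simplifications (the RTL PEs are multicycle, advancing S every two cycles), and the edge-injection scheme is an assumption rather than the repository's exact interface.

```python
import numpy as np

K = 3  # the array is 3x3; PE(k, n) keeps W[k][n] stationary in its WReg

def systolic_matmul(A, W):
    """Cycle-level sketch of a 3x3 weight-stationary systolic array.

    Computes C = A @ W.  Rows of A enter from the left with a one-cycle
    skew per array row; partial sums flow downward, and C[m][n] leaves
    the bottom of column n at cycle m + n + K.
    """
    M, N = A.shape[0], W.shape[1]
    w = [[W[k][n] for n in range(N)] for k in range(K)]   # stationary weights
    d_reg = [[0] * N for _ in range(K)]                   # D registers (move right)
    s_reg = [[0] * N for _ in range(K)]                   # S registers (move down)
    C = [[0] * N for _ in range(M)]

    for t in range(M + N + K):
        # Collect results latched at the bottom row on the previous cycle.
        for n in range(N):
            m = t - n - K
            if 0 <= m < M:
                C[m][n] = s_reg[K - 1][n]

        # Compute the next register values for every PE.
        new_d = [[0] * N for _ in range(K)]
        new_s = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k                    # left edge: A[m][k] arrives at cycle m + k
                    d_in = A[m][k] if 0 <= m < M else 0
                else:
                    d_in = d_reg[k][n - 1]       # from the PE on the left
                s_in = 0 if k == 0 else s_reg[k - 1][n]   # from the PE above
                new_d[k][n] = d_in
                new_s[k][n] = s_in + w[k][n] * d_in       # MAC
        d_reg, s_reg = new_d, new_s
    return np.array(C)

A = np.arange(1, 10).reshape(3, 3)
W = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
assert (systolic_matmul(A, W) == A @ W).all()
```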
The systolic array also requires a control unit to sequence its operation, since the PEs are multicycle units and the overall multiplication is carried out across multiple nodes. The controller is also responsible for weight loading:
Figure 2.2 - The controller structure of the Systolic Array.
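The controller itself is shown in Figure 2.2; as an illustration only, the phase sequencing it has to enforce might look like the sketch below. All state names and phase lengths are assumptions based on the dataflow described above, with `mac_cycles=2` reflecting the two-cycle MAC rhythm mentioned earlier.

```python
from enum import Enum, auto

class ArrayState(Enum):
    IDLE = auto()
    LOAD_WEIGHTS = auto()   # shift one row of weights into the array per step
    COMPUTE = auto()        # stream the skewed data while the PEs run their MACs
    DRAIN = auto()          # wait for the last partial sums to exit the bottom row

def array_controller(n=3, mac_cycles=2):
    """Yield (state, cycle) pairs for one matrix multiplication.

    A sketch of the sequencing only, not the FSM in Figure 2.2: the phase
    lengths (n weight rows, 2n - 1 skewed input waves, n drain waves) are
    assumptions about how long each phase of an n x n multiplication takes.
    """
    for c in range(n):
        yield ArrayState.LOAD_WEIGHTS, c
    for c in range((2 * n - 1) * mac_cycles):
        yield ArrayState.COMPUTE, c
    for c in range(n * mac_cycles):
        yield ArrayState.DRAIN, c
    yield ArrayState.IDLE, 0
```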
- Understanding Matrix Multiplication on a Weight-Stationary Systolic Architecture: https://www.telesens.co/2018/07/30/systolic-architectures/