Matrix multiplication is the computational backbone of AI accelerators, and as neural networks grow in complexity, so does the need for larger and faster hardware modules. In this article, we’ll show you how to scale a pipelined Matrix Multiply Unit (MMU) from 2×2 to 4×4 matrices, discuss the design challenges, and provide a practical Verilog implementation with a testbench for verification.
Why Scale to 4×4 Matrices?
- Realistic Workloads: Most AI models use matrices much larger than 2×2. A 4×4 MMU is a practical step toward real-world applications.
- Performance: Larger MMUs can process more data per cycle, increasing throughput.
- Design Reusability: The techniques used here can be extended to even larger matrices or parameterized for flexibility.
Design Principles for a 4×4 Pipelined MMU
Scaling up means more multiplications and additions. For a 4×4 matrix multiplication:
- Inputs: Two 4×4 matrices (A and B)
- Outputs: One 4×4 result matrix (C)
- Computation: For each element, C[i][j] = A[i][0]·B[0][j] + A[i][1]·B[1][j] + A[i][2]·B[2][j] + A[i][3]·B[3][j], i.e. the dot product of row i of A with column j of B.
Pipelining Approach:
- Stage 1: Compute all 4 partial products for each of the 16 output elements (64 multiplications in total).
- Stage 2: Sum the partial products for each output element (16 additions).
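Before writing any RTL, it helps to have a software golden model of these two stages to verify the hardware against. The sketch below (Python, as a reference model rather than synthesizable code) mirrors the pipeline exactly: Stage 1 forms the 64 partial products, Stage 2 sums four of them per output element.

```python
# Golden reference model mirroring the two pipeline stages of the MMU.
def mmu_4x4_ref(a, b):
    # Stage 1: 64 partial products, products[i][j][k] = A[i][k] * B[k][j]
    products = [[[a[i][k] * b[k][j] for k in range(4)]
                 for j in range(4)] for i in range(4)]
    # Stage 2: sum the 4 partial products for each of the 16 outputs
    return [[sum(products[i][j]) for j in range(4)] for i in range(4)]

# Sanity checks matching the test cases used later in this article:
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
ones = [[1] * 4 for _ in range(4)]

assert mmu_4x4_ref(identity, identity) == identity      # I * I = I
assert mmu_4x4_ref(ones, ones) == [[4] * 4 for _ in range(4)]  # each dot product = 4
```

The same model can be used inside a simulation flow (e.g. via cocotb) to score the DUT's outputs cycle by cycle.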
Verilog Implementation: 4×4 Pipelined MMU
Here’s a scalable, pipelined Verilog module for 4×4 matrix multiplication:
module mmu_4x4_pipelined (
input clk,
input rst,
input [7:0] a [0:3][0:3], // Matrix A
input [7:0] b [0:3][0:3], // Matrix B
output [17:0] c [0:3][0:3] // Matrix C (result)
);
// Stage 1: Multiplication
reg [15:0] products [0:3][0:3][0:3]; // products[i][j][k] = A[i][k] * B[k][j]
integer i, j, k;
always @(posedge clk or posedge rst) begin
if (rst) begin
for (i=0; i<4; i=i+1)
for (j=0; j<4; j=j+1)
for (k=0; k<4; k=k+1)
products[i][j][k] <= 0;
end else begin
for (i=0; i<4; i=i+1)
for (j=0; j<4; j=j+1)
for (k=0; k<4; k=k+1)
products[i][j][k] <= a[i][k] * b[k][j];
end
end
// Stage 2: Addition
reg [17:0] c_reg [0:3][0:3];
always @(posedge clk or posedge rst) begin
if (rst) begin
for (i=0; i<4; i=i+1)
for (j=0; j<4; j=j+1)
c_reg[i][j] <= 0;
end else begin
for (i=0; i<4; i=i+1)
for (j=0; j<4; j=j+1)
c_reg[i][j] <= products[i][j][0] + products[i][j][1] +
products[i][j][2] + products[i][j][3];
end
end
// Output assignment
generate
genvar gi, gj;
for (gi=0; gi<4; gi=gi+1) begin: out_i
for (gj=0; gj<4; gj=gj+1) begin: out_j
assign c[gi][gj] = c_reg[gi][gj];
end
end
endgenerate
endmodule
Note:
- This code uses SystemVerilog-style unpacked array ports for clarity. If you’re using plain Verilog, you’ll need to flatten the arrays into wide vectors or use individual signals; we suggest sticking with SystemVerilog arrays for ease of coding.
- The design is fully pipelined: new inputs can be loaded every clock cycle. The first output is available after two clock cycles, and subsequent outputs follow every clock cycle thereafter.
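The 18-bit output width in the module above is not arbitrary: an 8×8-bit unsigned multiply needs 16 bits, and accumulating four such products adds two more. A quick worst-case check (Python, illustrative):

```python
# Worst case for unsigned 8-bit operands: every input element is 255.
max_product = 255 * 255      # 65025 -> fits in 16 bits
max_sum = 4 * max_product    # 260100 -> fits in 18 bits (2**18 = 262144)

assert max_product < 2 ** 16
assert max_sum < 2 ** 18
assert max_sum >= 2 ** 17    # 17 bits would overflow, so 18 is the minimum
```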
Testbench for 4×4 Pipelined MMU
Here’s a simple testbench to verify the MMU:
module tb_mmu_4x4_pipelined;
reg clk, rst;
reg [7:0] a [0:3][0:3];
reg [7:0] b [0:3][0:3];
wire [17:0] c [0:3][0:3];
mmu_4x4_pipelined uut (
.clk(clk), .rst(rst), .a(a), .b(b), .c(c)
);
// Clock generation
initial clk = 0;
always #5 clk = ~clk;
integer i, j;
initial begin
rst = 1;
for (i=0; i<4; i=i+1) begin
for (j=0; j<4; j=j+1) begin
a[i][j] = 0;
b[i][j] = 0;
end
end
#12;
rst = 0;
// Test Case: Identity matrix multiplication
for (i=0; i<4; i=i+1) begin
for (j=0; j<4; j=j+1) begin
a[i][j] = (i == j) ? 1 : 0;
b[i][j] = (i == j) ? 1 : 0;
end
end
#10; // 1 clock cycle delay
// Test Case: All ones
for (i=0; i<4; i=i+1) begin
for (j=0; j<4; j=j+1) begin
a[i][j] = 1;
b[i][j] = 1;
end
end
#10; // 1 clock cycle delay
// First output after the initial delay of 2 clock cycles
$display("Identity Matrix Output:");
for (i=0; i<4; i=i+1) begin
$write("[ ");
for (j=0; j<4; j=j+1)
$write("%d ", c[i][j]);
$write("]\n");
end
#10; // 1 clock cycle delay
// Consecutive output after 1 clock cycle
$display("All Ones Output:");
for (i=0; i<4; i=i+1) begin
$write("[ ");
for (j=0; j<4; j=j+1)
$write("%d ", c[i][j]);
$write("]\n");
end
$finish;
end
endmodule
Design Tips for Larger Matrices
- Parameterization: Use Verilog parameters or SystemVerilog parameter/localparam to make the matrix size configurable.
- Resource Sharing: For very large matrices, consider sharing multipliers and adders to save area.
- Deeper Pipelining: For even higher throughput, break the addition stage into multiple pipeline stages.
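When parameterizing, the output width must scale with the matrix size: summing N products of two W-bit unsigned operands needs 2W + ⌈log₂N⌉ bits in the usual case. A small helper (Python, illustrative) computes the exact requirement from the worst case, which is handy for choosing port widths before writing the RTL:

```python
def mmu_out_width(n, w):
    """Bits needed to hold the sum of n products of two unsigned w-bit operands."""
    max_sum = n * (2 ** w - 1) ** 2   # worst case: all operands at full scale
    return max_sum.bit_length()

# The 4x4, 8-bit design in this article needs exactly 18-bit outputs.
assert mmu_out_width(4, 8) == 18
```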
Scaling a pipelined MMU to 4×4 matrices is a practical step toward real-world AI hardware. With pipelining, you achieve high throughput and efficient resource utilization. The techniques shown here can be extended to larger matrices, parameterized designs, and integrated into full AI accelerators.
Ready to go further? Try parameterizing the MMU for N×N matrices.
Have questions or want to share your own scalable MMU design? Leave a comment below!