Design and Verification of Pipelined 2×2 Matrix Multiply Unit in Verilog

Posted on December 2, 2025 by vlsifacts

Pipelining is a classic technique for improving throughput by overlapping operations, and it is especially effective in hardware such as an MMU (Matrix Multiply Unit). This article shows how to add pipelining to the 2×2 MMU Verilog design, explains the changes, and provides an updated testbench. You can refer to the non-pipelined design here.

Pipelining is a hardware design technique where multiple operations are overlapped in time, much like an assembly line. Instead of waiting for one operation to finish before starting the next, pipelining allows new data to enter the system at every clock cycle, with each stage of the pipeline working on a different part of the computation.

Design Approach

We’ll break the computation into two pipeline stages:

  • Stage 1: Perform the four multiplications for each output element.
  • Stage 2: Perform the additions to produce the final results.

This allows the MMU to accept new inputs every clock cycle, with results available after a two-cycle latency.

module mmu_2x2_pipelined (
    input         clk,
    input         rst,
    input  [7:0]  a00, a01, a10, a11,
    input  [7:0]  b00, b01, b10, b11,
    output [16:0] c00, c01, c10, c11
);

    // Stage 1: Multiplication
    reg [15:0] m00_0, m01_0, m10_0, m11_0;
    reg [15:0] m00_1, m01_1, m10_1, m11_1;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            m00_0 <= 0; m01_0 <= 0; m10_0 <= 0; m11_0 <= 0;
            m00_1 <= 0; m01_1 <= 0; m10_1 <= 0; m11_1 <= 0;
        end else begin
            m00_0 <= a00 * b00;
            m01_0 <= a00 * b01;
            m10_0 <= a10 * b00;
            m11_0 <= a10 * b01;

            m00_1 <= a01 * b10;
            m01_1 <= a01 * b11;
            m10_1 <= a11 * b10;
            m11_1 <= a11 * b11;
        end
    end

    // Stage 2: Addition (register outputs for pipelining)
    reg [16:0] c00_r, c01_r, c10_r, c11_r;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            c00_r <= 0; c01_r <= 0; c10_r <= 0; c11_r <= 0;
        end else begin
            c00_r <= m00_0 + m00_1;
            c01_r <= m01_0 + m01_1;
            c10_r <= m10_0 + m10_1;
            c11_r <= m11_0 + m11_1;
        end
    end

    assign c00 = c00_r;
    assign c01 = c01_r;
    assign c10 = c10_r;
    assign c11 = c11_r;

endmodule

Stage 1: Multiplication

  • All eight multiplications required for the 2×2 matrix product are performed in parallel.
  • The results of these multiplications are stored in registers (m00_0, m01_0, etc.) on the rising edge of the clock.
  • Purpose: This stage captures the results of all multiplications and holds them for the next stage.

Stage 2: Addition

  • The second always block takes the multiplication results from Stage 1 and performs the additions required for matrix multiplication.
  • The results are stored in another set of registers (c00_r, c01_r, etc.) on the next clock edge.
  • Purpose: This stage completes the matrix multiplication by summing the products.
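
For reference, the standard 2×2 matrix-multiplication equations are listed below, annotated with the Stage 1 product registers (from the module above) that feed each sum:

c00 = a00*b00 + a01*b10   // = m00_0 + m00_1
c01 = a00*b01 + a01*b11   // = m01_0 + m01_1
c10 = a10*b00 + a11*b10   // = m10_0 + m10_1
c11 = a10*b01 + a11*b11   // = m11_0 + m11_1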

Pipelined Testbench

Here’s an updated testbench to drive the pipelined MMU. Note the use of clock and reset, and the need to wait for two cycles before checking outputs.

module tb_mmu_2x2_pipelined;
    reg clk, rst;
    reg  [7:0] a00, a01, a10, a11;
    reg  [7:0] b00, b01, b10, b11;
    wire [16:0] c00, c01, c10, c11;

    mmu_2x2_pipelined uut (
        .clk(clk), .rst(rst),
        .a00(a00), .a01(a01), .a10(a10), .a11(a11),
        .b00(b00), .b01(b01), .b10(b10), .b11(b11),
        .c00(c00), .c01(c01), .c10(c10), .c11(c11)
    );

    // Clock generation
    initial clk = 0;
    always #5 clk = ~clk;

    initial begin
        rst = 1;
        a00 = 0; a01 = 0; a10 = 0; a11 = 0;
        b00 = 0; b01 = 0; b10 = 0; b11 = 0;
        #12; // Hold reset for a bit
        rst = 0;

        // Test Case 1
        a00 = 1; a01 = 2; a10 = 3; a11 = 4;
        b00 = 5; b01 = 6; b10 = 7; b11 = 8;
        #10; // Next clock: no valid output yet, as the pipeline has not filled

        // Test Case 2 (new data, pipelined)
        a00 = 0; a01 = 1; a10 = 1; a11 = 0;
        b00 = 1; b01 = 0; b10 = 0; b11 = 1;
        #10; // Pipeline is now filled; the Test Case 1 result reaches the outputs after two clock cycles

        $display("Test Case 1 Output: C = [%d %d; %d %d]", c00, c01, c10, c11);
        // Expected: [19 22; 43 50]

        #10; // From now on, a new result is produced every clock cycle

        $display("Test Case 2 Output: C = [%d %d; %d %d]", c00, c01, c10, c11);
        // Expected: [0 1; 1 0]

        $finish;
    end
endmodule

How Pipelining Improves Performance

  • Throughput: After the initial latency (2 clock cycles), the MMU can accept a new input matrix every clock cycle and produce a new result every clock cycle (a streaming testbench sketch illustrating this follows the list below).
  • Overlap: While Stage 2 is adding the results of the previous input, Stage 1 can already be multiplying the next input.
  • Efficiency: The hardware is always busy – no stage is idle waiting for the others to finish.
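
To illustrate the throughput point, here is a minimal streaming-testbench sketch (not part of the original post) that feeds the pipelined MMU a new input matrix on every clock cycle; after the initial latency, each subsequent cycle delivers a new result. The module name tb_mmu_2x2_streaming and the input pattern (A = i*I with a fixed B, so the expected result is C = i*B) are illustrative assumptions.

module tb_mmu_2x2_streaming;
    reg clk, rst;
    reg  [7:0] a00, a01, a10, a11;
    reg  [7:0] b00, b01, b10, b11;
    wire [16:0] c00, c01, c10, c11;
    integer i;

    mmu_2x2_pipelined uut (
        .clk(clk), .rst(rst),
        .a00(a00), .a01(a01), .a10(a10), .a11(a11),
        .b00(b00), .b01(b01), .b10(b10), .b11(b11),
        .c00(c00), .c01(c01), .c10(c10), .c11(c11)
    );

    initial clk = 0;
    always #5 clk = ~clk;

    initial begin
        rst = 1;
        a00 = 0; a01 = 0; a10 = 0; a11 = 0;
        b00 = 0; b01 = 0; b10 = 0; b11 = 0;
        @(negedge clk);
        rst = 0;

        // Drive a new input matrix A = i*I on every clock cycle; B stays fixed,
        // so the expected result is simply C = i*B.
        for (i = 1; i <= 5; i = i + 1) begin
            a00 = i; a01 = 0; a10 = 0; a11 = i;
            b00 = 1; b01 = 2; b10 = 3; b11 = 4;
            @(negedge clk);
            // Each displayed C corresponds to the input applied one loop
            // iteration (two clock edges) earlier; the first print still
            // shows the reset value of the output registers.
            $display("cycle %0d: C = [%d %d; %d %d]", i, c00, c01, c10, c11);
        end
        $finish;
    end
endmodule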

Timing Diagram (Conceptual)

Clock Cycle | Stage 1 (Multiplication) | Stage 2 (Addition) and Output Generation
1           | Input 1                  | –
2           | Input 2                  | Input 1 and Output 1
3           | Input 3                  | Input 2 and Output 2
4           | Input 4                  | Input 3 and Output 3
…           | …                        | …

Summary

  • Pipelining splits the computation into stages, each handled in a different clock cycle.
  • Registers between stages store intermediate results, allowing new data to enter the pipeline before previous computations are finished.
  • This design increases throughput and makes the MMU much more efficient for continuous data processing, which is essential in AI accelerators.
  • Because pipelining splits the data path into multiple shorter segments, a complete operation takes more clock cycles (one per pipeline stage). However, each segment's combinational path is now shorter, which relaxes the setup-timing constraint and allows a higher achievable clock frequency.
  • Pipeline stages should be added with care: the extra pipeline registers cost area, and their toggling adds dynamic power (a three-stage sketch illustrating this tradeoff follows below).
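
As a sketch of that tradeoff (an illustrative variant, not part of the original design), the version below adds a third stage that registers the inputs before the multipliers. Latency grows to three cycles and sixteen extra 8-bit registers are spent, but the path from the input pins to the multipliers becomes a short register-to-register hop, which can help close timing at a higher clock frequency.

module mmu_2x2_pipelined_3stage (
    input         clk,
    input         rst,
    input  [7:0]  a00, a01, a10, a11,
    input  [7:0]  b00, b01, b10, b11,
    output [16:0] c00, c01, c10, c11
);

    // Stage 0: input registers (extra area and toggling power)
    reg [7:0] a00_r, a01_r, a10_r, a11_r;
    reg [7:0] b00_r, b01_r, b10_r, b11_r;

    // Stage 1: product registers
    reg [15:0] m00_0, m01_0, m10_0, m11_0;
    reg [15:0] m00_1, m01_1, m10_1, m11_1;

    // Stage 2: sum (output) registers
    reg [16:0] c00_r, c01_r, c10_r, c11_r;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            {a00_r, a01_r, a10_r, a11_r} <= 0;
            {b00_r, b01_r, b10_r, b11_r} <= 0;
            {m00_0, m01_0, m10_0, m11_0} <= 0;
            {m00_1, m01_1, m10_1, m11_1} <= 0;
            {c00_r, c01_r, c10_r, c11_r} <= 0;
        end else begin
            // Stage 0: register the inputs
            a00_r <= a00; a01_r <= a01; a10_r <= a10; a11_r <= a11;
            b00_r <= b00; b01_r <= b01; b10_r <= b10; b11_r <= b11;
            // Stage 1: multiply the registered inputs
            m00_0 <= a00_r * b00_r;  m00_1 <= a01_r * b10_r;
            m01_0 <= a00_r * b01_r;  m01_1 <= a01_r * b11_r;
            m10_0 <= a10_r * b00_r;  m10_1 <= a11_r * b10_r;
            m11_0 <= a10_r * b01_r;  m11_1 <= a11_r * b11_r;
            // Stage 2: add the products
            c00_r <= m00_0 + m00_1;  c01_r <= m01_0 + m01_1;
            c10_r <= m10_0 + m10_1;  c11_r <= m11_0 + m11_1;
        end
    end

    assign c00 = c00_r;
    assign c01 = c01_r;
    assign c10 = c10_r;
    assign c11 = c11_r;

endmodule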

Want to scale the above pipelined MMU to a 4×4 MMU? See the follow-up post: Scaling the Pipelined Matrix Multiply Unit (MMU) for 4×4 Matrices in Verilog.
