Design and Verification of Pipelined 2×2 Matrix Multiply Unit in Verilog

Posted on December 2, 2025 by vlsifacts

Pipelining is a classic technique for improving throughput by overlapping operations, and it is especially effective in hardware such as an MMU (Matrix Multiply Unit). This article shows how to add pipelining to the 2×2 MMU Verilog design, explains the changes, and provides an updated testbench. You can refer to the non-pipelined design here.

Pipelining is a hardware design technique where multiple operations are overlapped in time, much like an assembly line. Instead of waiting for one operation to finish before starting the next, pipelining allows new data to enter the system at every clock cycle, with each stage of the pipeline working on a different part of the computation.

Design Approach

We’ll break the computation into two pipeline stages:

  • Stage 1: Perform the four multiplications for each output element.
  • Stage 2: Perform the additions to produce the final results.

This allows the MMU to accept new inputs every clock cycle, with results available after a two-cycle latency.

module mmu_2x2_pipelined (
    input         clk,
    input         rst,
    input  [7:0]  a00, a01, a10, a11,
    input  [7:0]  b00, b01, b10, b11,
    output [16:0] c00, c01, c10, c11
);

    // Stage 1: Multiplication
    reg [15:0] m00_0, m01_0, m10_0, m11_0;
    reg [15:0] m00_1, m01_1, m10_1, m11_1;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            m00_0 <= 0; m01_0 <= 0; m10_0 <= 0; m11_0 <= 0;
            m00_1 <= 0; m01_1 <= 0; m10_1 <= 0; m11_1 <= 0;
        end else begin
            m00_0 <= a00 * b00;
            m01_0 <= a00 * b01;
            m10_0 <= a10 * b00;
            m11_0 <= a10 * b01;

            m00_1 <= a01 * b10;
            m01_1 <= a01 * b11;
            m10_1 <= a11 * b10;
            m11_1 <= a11 * b11;
        end
    end

    // Stage 2: Addition (register outputs for pipelining)
    reg [16:0] c00_r, c01_r, c10_r, c11_r;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            c00_r <= 0; c01_r <= 0; c10_r <= 0; c11_r <= 0;
        end else begin
            c00_r <= m00_0 + m00_1;
            c01_r <= m01_0 + m01_1;
            c10_r <= m10_0 + m10_1;
            c11_r <= m11_0 + m11_1;
        end
    end

    assign c00 = c00_r;
    assign c01 = c01_r;
    assign c10 = c10_r;
    assign c11 = c11_r;

endmodule

Stage 1: Multiplication

  • All eight multiplications required for the 2×2 matrix product are performed in parallel.
  • The results of these multiplications are stored in registers (m00_0, m01_0, etc.) on the rising edge of the clock.
  • Purpose: This stage captures the results of all multiplications and holds them for the next stage.

Stage 2: Addition

  • The second always block takes the multiplication results from Stage 1 and performs the additions required for matrix multiplication.
  • The results are stored in another set of registers (c00_r, c01_r, etc.) on the next clock edge.
  • Purpose: This stage completes the matrix multiplication by summing the products.
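
For reference, the standard 2×2 matrix-multiplication equations are listed below, annotated with the Stage 1 product registers (from the module above) that feed each sum:

c00 = a00*b00 + a01*b10   // = m00_0 + m00_1
c01 = a00*b01 + a01*b11   // = m01_0 + m01_1
c10 = a10*b00 + a11*b10   // = m10_0 + m10_1
c11 = a10*b01 + a11*b11   // = m11_0 + m11_1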

Pipelined Testbench

Here’s an updated testbench to drive the pipelined MMU. Note the use of clock and reset, and the need to wait for two cycles before checking outputs.

module tb_mmu_2x2_pipelined;
    reg clk, rst;
    reg  [7:0] a00, a01, a10, a11;
    reg  [7:0] b00, b01, b10, b11;
    wire [16:0] c00, c01, c10, c11;

    mmu_2x2_pipelined uut (
        .clk(clk), .rst(rst),
        .a00(a00), .a01(a01), .a10(a10), .a11(a11),
        .b00(b00), .b01(b01), .b10(b10), .b11(b11),
        .c00(c00), .c01(c01), .c10(c10), .c11(c11)
    );

    // Clock generation
    initial clk = 0;
    always #5 clk = ~clk;

    initial begin
        rst = 1;
        a00 = 0; a01 = 0; a10 = 0; a11 = 0;
        b00 = 0; b01 = 0; b10 = 0; b11 = 0;
        #12; // Hold reset for a bit
        rst = 0;

        // Test Case 1
        a00 = 1; a01 = 2; a10 = 3; a11 = 4;
        b00 = 5; b01 = 6; b10 = 7; b11 = 8;
        #10; // Next clock: no valid output yet, as the pipeline has not filled

        // Test Case 2 (new data, pipelined)
        a00 = 0; a01 = 1; a10 = 1; a11 = 0;
        b00 = 1; b01 = 0; b10 = 0; b11 = 1;
        #10; // Pipeline is now filled; the Test Case 1 result reaches the outputs after two clock cycles

        $display("Test Case 1 Output: C = [%d %d; %d %d]", c00, c01, c10, c11);
        // Expected: [19 22; 43 50]

        #10; // From now on, a new result is produced every clock cycle

        $display("Test Case 2 Output: C = [%d %d; %d %d]", c00, c01, c10, c11);
        // Expected: [0 1; 1 0]

        $finish;
    end
endmodule

How Pipelining Improves Performance

  • Throughput: After the initial latency (2 clock cycles), the MMU can accept a new input matrix every clock cycle and produce a new result every clock cycle (a streaming testbench sketch illustrating this follows the list below).
  • Overlap: While Stage 2 is adding the results of the previous input, Stage 1 can already be multiplying the next input.
  • Efficiency: The hardware is always busy – no stage is idle waiting for the others to finish.
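
To illustrate the throughput point, here is a minimal streaming-testbench sketch (not part of the original post) that feeds the pipelined MMU a new input matrix on every clock cycle; after the initial latency, each subsequent cycle delivers a new result. The module name tb_mmu_2x2_streaming and the input pattern (A = i*I with a fixed B, so the expected result is C = i*B) are illustrative assumptions.

module tb_mmu_2x2_streaming;
    reg clk, rst;
    reg  [7:0] a00, a01, a10, a11;
    reg  [7:0] b00, b01, b10, b11;
    wire [16:0] c00, c01, c10, c11;
    integer i;

    mmu_2x2_pipelined uut (
        .clk(clk), .rst(rst),
        .a00(a00), .a01(a01), .a10(a10), .a11(a11),
        .b00(b00), .b01(b01), .b10(b10), .b11(b11),
        .c00(c00), .c01(c01), .c10(c10), .c11(c11)
    );

    initial clk = 0;
    always #5 clk = ~clk;

    initial begin
        rst = 1;
        a00 = 0; a01 = 0; a10 = 0; a11 = 0;
        b00 = 0; b01 = 0; b10 = 0; b11 = 0;
        @(negedge clk);
        rst = 0;

        // Drive a new input matrix A = i*I on every clock cycle; B stays fixed,
        // so the expected result is simply C = i*B.
        for (i = 1; i <= 5; i = i + 1) begin
            a00 = i; a01 = 0; a10 = 0; a11 = i;
            b00 = 1; b01 = 2; b10 = 3; b11 = 4;
            @(negedge clk);
            // Each displayed C corresponds to the input applied one loop
            // iteration (two clock edges) earlier; the first print still
            // shows the reset value of the output registers.
            $display("cycle %0d: C = [%d %d; %d %d]", i, c00, c01, c10, c11);
        end
        $finish;
    end
endmodule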

Timing Diagram (Conceptual)

Clock Cycle | Stage 1 (Multiplication) | Stage 2 (Addition) and Output Generation
1           | Input 1                  | –
2           | Input 2                  | Input 1 and Output 1
3           | Input 3                  | Input 2 and Output 2
4           | Input 4                  | Input 3 and Output 3
…           | …                        | …

Summary

  • Pipelining splits the computation into stages, each handled in a different clock cycle.
  • Registers between stages store intermediate results, allowing new data to enter the pipeline before previous computations are finished.
  • This design increases throughput and makes the MMU much more efficient for continuous data processing, which is essential in AI accelerators.
  • Because pipelining splits the data path into multiple shorter segments, a complete operation takes more clock cycles (one per pipeline stage). However, each segment's combinational path is now shorter, which relaxes the setup-timing constraint and allows a higher achievable clock frequency.
  • Pipeline stages should be added with care: the extra pipeline registers cost area, and their toggling adds dynamic power (a three-stage sketch illustrating this tradeoff follows below).
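
As a sketch of that tradeoff (an illustrative variant, not part of the original design), the version below adds a third stage that registers the inputs before the multipliers. Latency grows to three cycles and sixteen extra 8-bit registers are spent, but the path from the input pins to the multipliers becomes a short register-to-register hop, which can help close timing at a higher clock frequency.

module mmu_2x2_pipelined_3stage (
    input         clk,
    input         rst,
    input  [7:0]  a00, a01, a10, a11,
    input  [7:0]  b00, b01, b10, b11,
    output [16:0] c00, c01, c10, c11
);

    // Stage 0: input registers (extra area and toggling power)
    reg [7:0] a00_r, a01_r, a10_r, a11_r;
    reg [7:0] b00_r, b01_r, b10_r, b11_r;

    // Stage 1: product registers
    reg [15:0] m00_0, m01_0, m10_0, m11_0;
    reg [15:0] m00_1, m01_1, m10_1, m11_1;

    // Stage 2: sum (output) registers
    reg [16:0] c00_r, c01_r, c10_r, c11_r;

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            {a00_r, a01_r, a10_r, a11_r} <= 0;
            {b00_r, b01_r, b10_r, b11_r} <= 0;
            {m00_0, m01_0, m10_0, m11_0} <= 0;
            {m00_1, m01_1, m10_1, m11_1} <= 0;
            {c00_r, c01_r, c10_r, c11_r} <= 0;
        end else begin
            // Stage 0: register the inputs
            a00_r <= a00; a01_r <= a01; a10_r <= a10; a11_r <= a11;
            b00_r <= b00; b01_r <= b01; b10_r <= b10; b11_r <= b11;
            // Stage 1: multiply the registered inputs
            m00_0 <= a00_r * b00_r;  m00_1 <= a01_r * b10_r;
            m01_0 <= a00_r * b01_r;  m01_1 <= a01_r * b11_r;
            m10_0 <= a10_r * b00_r;  m10_1 <= a11_r * b10_r;
            m11_0 <= a10_r * b01_r;  m11_1 <= a11_r * b11_r;
            // Stage 2: add the products
            c00_r <= m00_0 + m00_1;  c01_r <= m01_0 + m01_1;
            c10_r <= m10_0 + m10_1;  c11_r <= m11_0 + m11_1;
        end
    end

    assign c00 = c00_r;
    assign c01 = c01_r;
    assign c10 = c10_r;
    assign c11 = c11_r;

endmodule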

Want to scale the above pipelined MMU to a 4×4 MMU? See the follow-up post: Scaling the Pipelined Matrix Multiply Unit (MMU) for 4×4 Matrices in Verilog.
