Pipelining is a classic technique for improving throughput by overlapping operations, especially in hardware such as a matrix multiplication unit (MMU). This article shows how to add pipelining to the 2×2 MMU Verilog design, explains the changes, and provides an updated testbench. You can refer to the non-pipelined design here.
Pipelining is a hardware design technique where multiple operations are overlapped in time, much like an assembly line. Instead of waiting for one operation to finish before starting the next, pipelining allows new data to enter the system at every clock cycle, with each stage of the pipeline working on a different part of the computation.
Design Approach
We’ll break the computation into two pipeline stages:
- Stage 1: Perform all eight multiplications (two per output element) in parallel.
- Stage 2: Perform the additions to produce the final results.
This allows the MMU to accept new inputs every clock cycle, with results available after a two-cycle latency.
module mmu_2x2_pipelined (
    input clk,
    input rst,
    input [7:0] a00, a01, a10, a11,
    input [7:0] b00, b01, b10, b11,
    output [16:0] c00, c01, c10, c11
);
    // Stage 1: Multiplication
    reg [15:0] m00_0, m01_0, m10_0, m11_0;
    reg [15:0] m00_1, m01_1, m10_1, m11_1;
    always @(posedge clk or posedge rst) begin
        if (rst) begin
            m00_0 <= 0; m01_0 <= 0; m10_0 <= 0; m11_0 <= 0;
            m00_1 <= 0; m01_1 <= 0; m10_1 <= 0; m11_1 <= 0;
        end else begin
            m00_0 <= a00 * b00;
            m01_0 <= a00 * b01;
            m10_0 <= a10 * b00;
            m11_0 <= a10 * b01;
            m00_1 <= a01 * b10;
            m01_1 <= a01 * b11;
            m10_1 <= a11 * b10;
            m11_1 <= a11 * b11;
        end
    end
    // Stage 2: Addition (register outputs for pipelining)
    reg [16:0] c00_r, c01_r, c10_r, c11_r;
    always @(posedge clk or posedge rst) begin
        if (rst) begin
            c00_r <= 0; c01_r <= 0; c10_r <= 0; c11_r <= 0;
        end else begin
            c00_r <= m00_0 + m00_1;
            c01_r <= m01_0 + m01_1;
            c10_r <= m10_0 + m10_1;
            c11_r <= m11_0 + m11_1;
        end
    end
    assign c00 = c00_r;
    assign c01 = c01_r;
    assign c10 = c10_r;
    assign c11 = c11_r;
endmodule
Stage 1: Multiplication
- All the required multiplications for the matrix multiplication are performed in parallel.
- The results of these multiplications are stored in registers (m00_0, m01_0, etc.) on the rising edge of the clock.
- Purpose: This stage captures the results of all multiplications and holds them for the next stage.
Stage 2: Addition
- The second always block takes the multiplication results from Stage 1 and performs the additions required for matrix multiplication.
- The results are stored in another set of registers (c00_r, c01_r, etc.) on the next clock edge.
- Purpose: This stage completes the matrix multiplication by summing the products.
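The register widths used in the two stages (16 bits for the products, 17 bits for the sums) follow from the 8-bit unsigned operands. As a quick sanity check, the small simulation-only sketch below computes the worst-case values and the bits they need. The module name `width_check` and its parameter names are hypothetical and not part of the original design; `$clog2` assumes a Verilog-2005 capable simulator.

```verilog
// Hypothetical helper (not part of the original design): checks that the
// 16-bit product and 17-bit sum registers are wide enough for 8-bit inputs.
module width_check;
    localparam integer MAX_IN   = 255;               // largest 8-bit unsigned value
    localparam integer MAX_PROD = MAX_IN * MAX_IN;   // 65025, fits in 16 bits
    localparam integer MAX_SUM  = 2 * MAX_PROD;      // 130050, fits in 17 bits

    initial begin
        // $clog2(x + 1) gives the number of bits needed to hold the value x
        $display("max product = %0d, bits needed = %0d", MAX_PROD, $clog2(MAX_PROD + 1));
        $display("max sum     = %0d, bits needed = %0d", MAX_SUM,  $clog2(MAX_SUM + 1));
    end
endmodule
```

If the operand width changes, the same two expressions indicate how wide the product and sum registers would need to be.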
Pipelined Testbench
Here’s an updated testbench to drive the pipelined MMU. Note the use of clock and reset, and the need to wait for two cycles before checking outputs.
module tb_mmu_2x2_pipelined;
    reg clk, rst;
    reg [7:0] a00, a01, a10, a11;
    reg [7:0] b00, b01, b10, b11;
    wire [16:0] c00, c01, c10, c11;
    mmu_2x2_pipelined uut (
        .clk(clk), .rst(rst),
        .a00(a00), .a01(a01), .a10(a10), .a11(a11),
        .b00(b00), .b01(b01), .b10(b10), .b11(b11),
        .c00(c00), .c01(c01), .c10(c10), .c11(c11)
    );
    // Clock generation
    initial clk = 0;
    always #5 clk = ~clk;
    initial begin
        rst = 1;
        a00 = 0; a01 = 0; a10 = 0; a11 = 0;
        b00 = 0; b01 = 0; b10 = 0; b11 = 0;
        #12; // Hold reset for a bit
        rst = 0;
        // Test Case 1
        a00 = 1; a01 = 2; a10 = 3; a11 = 4;
        b00 = 5; b01 = 6; b10 = 7; b11 = 8;
        #10; // One clock edge later: no valid output yet, the pipeline is still filling
        // Test Case 2 (new data, pipelined)
        a00 = 0; a01 = 1; a10 = 1; a11 = 0;
        b00 = 1; b01 = 0; b10 = 0; b11 = 1;
        #10; // Two clock edges after Test Case 1 was applied: its result is now at the outputs
        $display("Test Case 1 Output: C = [%d %d; %d %d]", c00, c01, c10, c11);
        // Expected: [19 22; 43 50]
        #10; // From here on, a new result appears every clock cycle
        $display("Test Case 2 Output: C = [%d %d; %d %d]", c00, c01, c10, c11);
        // Expected: [0 1; 1 0]
        $finish;
    end
endmodule
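The testbench above checks two inputs by eye. To exercise the one-result-per-cycle behaviour over a longer stream, a self-checking variant can help. The following is a minimal sketch, not the article's own testbench: the module name `tb_mmu_2x2_stream`, the stream length `N`, and the choice of the identity matrix for B (so that C should equal A) are all assumptions made for illustration. It expects `mmu_2x2_pipelined` to be compiled alongside it.

```verilog
// Hypothetical streaming testbench (not from the original article).
// Assumes mmu_2x2_pipelined is compiled together with this file.
module tb_mmu_2x2_stream;
    localparam N = 4;                 // number of results to check (arbitrary)

    reg         clk, rst;
    reg  [7:0]  a00, a01, a10, a11;
    reg  [7:0]  b00, b01, b10, b11;
    wire [16:0] c00, c01, c10, c11;
    integer     i, errors;

    mmu_2x2_pipelined uut (
        .clk(clk), .rst(rst),
        .a00(a00), .a01(a01), .a10(a10), .a11(a11),
        .b00(b00), .b01(b01), .b10(b10), .b11(b11),
        .c00(c00), .c01(c01), .c10(c10), .c11(c11)
    );

    initial clk = 0;
    always #5 clk = ~clk;

    initial begin
        errors = 0;
        rst = 1;
        a00 = 0; a01 = 0; a10 = 0; a11 = 0;
        b00 = 1; b01 = 0; b10 = 0; b11 = 1;   // B = identity, so C should equal A
        repeat (2) @(posedge clk);
        @(negedge clk);                        // change inputs away from the active edge
        rst = 0;

        // Iteration i applies input i, waits one cycle, then checks the result
        // of input i-1. The final iteration (i == N) only drains the pipeline.
        for (i = 0; i <= N; i = i + 1) begin
            a00 = i; a01 = i + 1; a10 = i + 2; a11 = i + 3;
            @(negedge clk);
            if (i >= 1) begin
                if (c00 !== i - 1 || c01 !== i || c10 !== i + 1 || c11 !== i + 2) begin
                    errors = errors + 1;
                    $display("Mismatch for input %0d: C = [%0d %0d; %0d %0d]",
                             i - 1, c00, c01, c10, c11);
                end
            end
        end

        if (errors == 0) $display("All %0d streamed results matched", N);
        else             $display("%0d mismatches", errors);
        $finish;
    end
endmodule
```

Driving the inputs on the falling edge keeps stimulus changes well away from the rising edge that the design samples on, which avoids simulator race conditions.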
How Pipelining Improves Performance
- Throughput: The MMU accepts a new set of input matrices every clock cycle; after the initial latency of 2 clock cycles, it also produces a new result every clock cycle.
- Overlap: While Stage 2 is adding the results of the previous input, Stage 1 can already be multiplying the next input.
- Efficiency: When inputs arrive back to back, both stages stay busy; neither sits idle waiting for the other to finish.
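Because results lag the inputs by two clock edges, downstream logic usually needs to know which output cycles carry real data. One common way to track this is to pass a valid flag through a shift register of the same depth as the pipeline. The sketch below assumes this approach; the module `pipe_valid` and the signals `in_valid` and `out_valid` are hypothetical and do not appear in the original design.

```verilog
// Hypothetical helper (not part of the original design): delays an
// input-valid flag by two clock edges to match the two-stage MMU pipeline.
module pipe_valid (
    input  clk,
    input  rst,
    input  in_valid,    // high in cycles where the MMU inputs are meaningful
    output out_valid    // high in cycles where c00..c11 hold a real result
);
    reg [1:0] valid_sr;  // bit 0 tracks Stage 1, bit 1 tracks Stage 2

    always @(posedge clk or posedge rst) begin
        if (rst)
            valid_sr <= 2'b00;
        else
            valid_sr <= {valid_sr[0], in_valid};  // shift toward the output
    end

    assign out_valid = valid_sr[1];
endmodule
```

Clocked and reset together with the MMU, `out_valid` would rise on the same edge that the corresponding sums are registered in Stage 2.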
Timing Diagram (Conceptual)
| Clock Cycle | Stage 1 (Multiplication) | Stage 2 (Addition) and Output Generation |
| --- | --- | --- |
| 1 | Input 1 | – |
| 2 | Input 2 | Input 1 and Output 1 |
| 3 | Input 3 | Input 2 and Output 2 |
| 4 | Input 4 | Input 3 and Output 3 |
| … | … | … |
Summary
- Pipelining splits the computation into stages, each handled in a different clock cycle.
- Registers between stages store intermediate results, allowing new data to enter the pipeline before previous computations are finished.
- This design increases throughput and makes the MMU much more efficient for continuous data processing, which is essential in AI accelerators.
- Because pipelining breaks a data path into multiple stages, a complete operation takes more clock cycles (one per pipeline stage). However, the clock frequency can be higher because the combinational path between registers is shorter. A shorter path relaxes the setup-time constraint, which makes a higher clock frequency achievable.
- Add pipeline stages with care: the pipeline registers introduce area overhead and also increase power consumption because they toggle.
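The latency and clock-frequency trade-off described in the summary can be written out as a rough timing model. The symbols below are assumptions introduced for illustration (they do not come from the original article): t_mul and t_add are the multiplier and adder delays, t_clk-to-q and t_setup are register timing parameters, clock skew is ignored, and the unpipelined data path is assumed to have registers at both ends.

```latex
% Setup constraint on the clock period (clock skew ignored)
T_{\text{clk}} \ge t_{\text{clk-to-q}} + t_{\text{comb,max}} + t_{\text{setup}}

% Longest combinational path between registers
t_{\text{comb,max}}^{\text{unpipelined}} = t_{\text{mul}} + t_{\text{add}},
\qquad
t_{\text{comb,max}}^{\text{pipelined}} = \max(t_{\text{mul}},\, t_{\text{add}})

% Clock cycles occupied by N back-to-back inputs in a pipeline with S stages
% (S = 2 here, giving N + 1 cycles, as in the timing diagram)
\text{cycles}(N) = N + S - 1
```

Under this model, each result takes one extra cycle to emerge, but the minimum clock period shrinks, so long input streams finish sooner overall.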
Want to scale the above pipelined MMU to a 4×4 MMU?