Hardware Verification with C++
No Introspection in SystemVerilog

Potential Flaws in Cummings' SNUG2002SJ_FIFO2 Design

December 11, 2006 Update: Check out the response from Cliff and additional info from Kevin in the comments section of this post.

Editor's Note: Today special guest Kevin Johnston from Verilab will be joining us to provide an analysis of FIFO designs published by Cliff Cummings for SNUG 2002.  Prior to joining Verilab as a consultant in 2005, Kevin spent 14 years at Motorola/Freescale working on a variety of processor and SoC designs including the PowerPC and 68k.  The following article is his first post on Cool Verification.  Hope you enjoy it! -- JL.


There are two basic async FIFO design styles:

  1. "Pointer-less", also known as "fall-through" type: Fully-asynchronous, self-timed control logic (full-custom or compiled, embedded in the data memory array design) autonomously clocks write data from any current memory location to the subsequent memory location if that subsequent location is empty; the data "falls through" the FIFO from the memory cells connected to the Write port to the memory cells connected to the Read port.
  2. "Pointer based": The data array is a standard dual-port SRAM block; all control logic is external to the memory. A Write Pointer supplies the array address on the Write port, and a separate Read Pointer supplies the array address on the Read port. The data remains in the same memory location for the duration of its residence in the FIFO; the Write Pointer and Read Pointer cycle through the memory address range in identical sequences to retrieve the data in the same order it was entered.

Cliff Cummings has written an excellent general treatment of asynchronous FIFO design in his paper, "Simulation and Synthesis Techniques for Asynchronous FIFO Design" ("FIFO1").

Cliff also wrote a follow-on paper entitled "Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons" ("FIFO2") describing two major modifications to FIFO1: A different method to detect Full/Empty status, and reduction of two multi-bit synchronizers (for the entire Read and Write pointers) down to two single-bit synchronizers (for just the Full and Empty bits themselves).

FIFO1 and FIFO2 are both pointer-based designs.

I believe FIFO2 has both an implementation flaw and a fundamental conceptual flaw; that is, the implementation flaw can be fixed, but the conceptual flaw cannot.

The FIFO2 Implementation Flaw

The implementation flaw is that possibly runt asynchronous signals are connected to the "set" pins of both the first and the second flops of the rempty and wfull synchronizers shown in FIFO2, Figure 6. The critical factor is that the aempty_n and afull_n signals can be runt pulses. If this were not possible, then the synchronizers would work exactly as described in the paper. But in the case of runt pulses, the synchronizer behavior can be different from all the cases described.

The problem scenario is described in bullet (3) on page 14, but the description of the resulting behavior is incomplete:

"The runt pulse might preset the second synchronizer flip-flop, but not the first flip-flop. This is highly unlikely, but would result in an unnecessary, but properly synchronized rempty output (as long as the empty critical timing is met), that will be set on the output of the second flip-flop until the next read clock, when it will be cleared by the zero from the first flip-flop. This is not a problem."

In fact, a runt "set" pulse may leave the second synchronizer flop in neither valid logic state; it could instead drive the flop into metastability. The same is also true for the first synchronizer flop, but it's accepted that the first flop of a synchronizer can occasionally go metastable. The second flop is supposed to guard against the metastable propagating further downstream, however, to accomplish this purpose, the second flop must not be "sensitive" to the async input.

As the FIFO2 text carefully describes, the timing of the assertion edge of the aempty_n(afull_n) signal is derived from the rclk(wclk) domain: Only a read(write) can cause the FIFO to become empty(full). The timing of the negation edge of the aempty_n(afull_n) signal derives from the wclk(rclk) domain: Only a write(read) can cause the FIFO to become no-longer-empty(no-longer-full).

In other words, the assertion edge of either set pulse is synchronous, only the negation edge is asynchronous. If there were no possibility for the set signals to be runt pulses, then the second flop would not be sensitive to the negation (async) edge, as explained a few paragraphs further down on page 14:

"But the removal of the setting signal on the second flip-flop will violate the recovery time of the second flip-flop. Will this cause the second flip-flop to go metastable? The authors do not believe this can happen ..."

However, if the set pulse can possibly violate minimum pulse width, then the second synchronizer flop is sensitive to the negation edge, an async event.

How big a problem is this really?

For a non-runt aempty_n(afull_n) pulse, the total prop delay of rclk(wclk) -> rptr(wptr) -> CMP output -> aempty_n(afull_n) assertion -> rempty(wfull) assertion -> any further path delay to final destination regs in the rclk(wclk) domain, including setup time at the destination regs, must be less than the clock period. (See FIFO2, Figure 10.) Synthesis will tend to optimize the entire path to consume the entire clock period, using the smallest/slowest/lowest power gates possible.

Then for a runt aempty_n(afull_n) pulse, there may well be little or no timing margin left in the path for settling time if the second synchronizer flop goes metastable.

If this is simply an implementation flaw, then what's the fix?

There really is no way to prevent aempty_n/afull_n from ever being runt pulses; therefore these signals must not connect to the second stage of the rempty/wfull synchronizers. The set signals must connect only to the first stage set pin, and the second stage must connect only to the first stage via q -> d. And this is where the concept flaw comes into play.

The FIFO2 Concept Flaw

Both the aempty_n and afull_n signals are generated from terms from both the read and write clock domains. If these signals are not allowed to connect to the second stage of the synchronizers, then the consequences of both read and write operations suffer a synchronizer latency to be visible within *either* clock domain.

Contrast this with the FIFO1 design: In FIFO1 there is no synchronizer delay of any read domain signals to any destination within the read clock domain, nor delay of write signals within the wclk domain. The consequence of every read and write operation is immediately visible within its own originating domain. In FIFO1, rclk domain signals are delayed only where they cross to the wclk domain, and vice versa. In FIFO2, however, the consequence of a read(write) operation is not visible within its own originating domain until after a synchronizer latency, which requires that both domains stall after every operation to wait for the consequence (empty/full) to be valid.

Granted, the consequence of a read(write) operation is actually synchronous to the rempty(wfull) synchronizer, but it still requires an extra rclk(wclk) period to propagate through the second stage, cutting the maximum possible throughput of the FIFO precisely in half.

Questions or comments?  I can be reached at kevin dot johnston at verilab dot com, or via the comments section below.

Comments