NetFPGA: Reusable Router Architecture for Experimental Research

Jad Naous, Glen Gibb, Sara Bolouki, Nick McKeown
Stanford University, California, USA
ABSTRACT
Our goal is to enable fast prototyping of networking hardware (e.g. modified Ethernet switches and IP routers) for teaching and research. To this end, we built and made available the NetFPGA platform. Starting from open-source reference designs, students and researchers create their designs in Verilog, and then download them to the NetFPGA board where they can process packets at line-rate for 4 ports of 1GE. The board is becoming widely used for teaching and research, and so it has become important to make it easy to re-use modules and designs. We have created a standard interface between modules, making it easier to plug modules together in pipelines, and to create new re-usable designs. In this paper we describe our modular design, and how we have used it to build several systems, including our IP router reference design and some extensions to it.

Categories and Subject Descriptors
B.6.1 [Logic Design]: Design Styles - sequential circuits, parallel circuits; C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks - ethernet, high-speed, internet; C.2.6 [Computer-Communication Networks]: Internetworking - routers

General Terms
Design

Keywords
NetFPGA, modular design, reuse

1. INTRODUCTION
The benefits of re-use are well understood: it allows developers to quickly build on the work of others, reduces time to market, and increases the scrutiny (and therefore the quality) of code. Software re-use is widely practiced and wildly successful, particularly in the open-source community, as well as in corporate development practices and commercial tools.
The key to software re-use is to create a good API: an interface that is intuitive and simple to use, and useful to a large number of developers. Indeed, the whole field of networking is built on the re-use of layers interconnected by well-documented and well-designed interfaces and APIs.
On the other hand, the history of re-use in hardware design is mixed; there are no hugely successful open-source hardware projects analogous to Linux, mostly because complicated designs only recently started to fit on FPGAs. Still, this might seem surprising as there are very strong incentives to re-use Verilog code (or other HDLs): the cost of developing Verilog is much higher (per line of code) than for software development languages, and the importance of correctly verifying the code is much higher, as even small bugs can cost millions of dollars to fix. Indeed, companies who build ASICs place great importance on building reusable blocks or macros. And some companies exist to produce and sell expensive IP (intellectual property) blocks for use by others. To date, the successes with open-source re-usable hardware have been smaller, with Opencores.org being the most well-known.
Re-using hardware is difficult because of the dependencies on the particular design it is part of; e.g. clock speed, I/Os, and so on. Unlike software projects, there is no underlying unifying operating system to provide a common platform for all contributed code.
Our goal is to make networking hardware design more re-usable for teachers, students and researchers, particularly on the low-cost NetFPGA platform. We have created a simple modular design methodology that allows networking hardware designers to write re-usable code. We are creating a library of modules that can be strung together in different ways to create new systems.
NetFPGA is a sandbox for networking hardware: it allows students and researchers to experiment with new ways to process packets at line-rate. For example, a student in a class might create a 4-port Gigabit Ethernet switch; or a researcher might add a novel feature to a 4-port Gigabit IP router. Packets can be processed in arbitrary ways, under the control of the user.
Figure 1: Stages in the modular pipeline are connected using two buses: the Packet Bus and the Register Bus.

Figure 2: The IPv4 Router is built using the Reference Pipeline - a simple canonical five-stage pipeline that can be applied to a variety of networking hardware.
NetFPGA uses an industry-standard design flow (users write programs in Verilog, synthesize them, and then download them to programmable hardware). Designs typically run at line-rate, allowing experimental deployment in real networks. A number of classes are taught using NetFPGA, and it is used by a growing number of networking researchers.
NetFPGA is a PCI card that plugs into a standard PC. The card contains an FPGA, four 1GigE ports and some buffer memory (SRAM and DRAM). The board is very low-cost (at the time of writing, boards are available for $500 for research and teaching), and software, gateware and courseware are freely available at http://NetFPGA.org. For more details, see [9].
Reusable modules require a well-defined and documented API. It has to be flexible enough to be usable on a wide variety of modules, as well as simple enough to allow both novice and experienced hardware designers to learn it in a short amount of time.
Our Approach, like many that have gone before, exploits the fact that networking hardware is generally arranged as a pipeline through which packets flow and are processed at each stage. This suggests an API that carries packets from one stage to the next along with the information needed to process the packet or results from a previous stage. Our interface does exactly that and in some ways resembles the simple processing pipeline of Click [6], which allows a user to connect modules using a generic interface. One difference is that we only use a push interface, as opposed to both push and pull.
NetFPGA modules are connected as a sequence of stages in a pipeline. Stages communicate using a simple packet-based synchronous FIFO push interface: Stage i+1 tells Stage i that it has space for a packet word (i.e. the FIFO is not full); Stage i writes a packet word into Stage i+1. Since processing results and other information at one stage are usually needed at a subsequent stage, Stage i can prepend any information it wants to convey as a word at the beginning of a packet.
Figure 1 shows the high-level modular architecture. Figure 2 shows the pipeline of a simple IPv4 router built this way.
Our Goal is to enable a wide variety of users to create new systems using NetFPGA. Less experienced users will reuse entire pre-built projects on the NetFPGA card. Some others will modify pre-built projects and add new functionality to them by inserting new modules between the available modules. Others will design completely new projects without using any pre-built modules. This paper explains how NetFPGA enables the second group of users to reuse modules built by others, and to create new modules in a short time.
The paper is organized as follows: Section 2 gives the details of the communication channels in the pipeline, Section 3 describes the reference IPv4 router and other extensions, Section 4 discusses limitations of the NetFPGA approach, and Section 5 concludes the paper.

2. PIPELINE INTERFACE DETAILS
Figure 1 shows the NetFPGA pipeline, which resides entirely on the Virtex FPGA. Stages are interconnected using two point-to-point buses: the packet bus and the register bus.
The packet bus transfers packets from one stage to the next using a synchronous FIFO packet-based push interface, over a 64-bit wide bus running at 125MHz (an aggregate rate of 8Gbps). The FIFO interface has the advantage of hiding all the internals of the module behind a few signals and allows modules to be concatenated in any order. It is arguably the simplest interface that can be used to pass information and provide flow control while still being sufficiently efficient to run designs at full line rate.
The register bus provides another channel of communication that does not consume Packet Bus bandwidth. It allows information to travel in both directions through the pipeline, but has a much lower bandwidth.

2.1 Entry and Exit Points
Packets enter and exit the pipeline through the various Receive and Transmit Queue modules, respectively. These connect the various I/O ports to the pipeline and translate from the diverse peripheral interfaces to the unified pipeline FIFO interface. This makes it simpler for designers to connect to different I/O ports without having to learn how to use each.
Currently there are two types of I/O ports implemented, with a third planned. These are: the Ethernet Rx/Tx queues, which send and receive packets via the GigE ports; the CPU DMA Rx/Tx queues, which transfer packets via DMA between the NetFPGA and the host CPU; and the Multi-gigabit serial Rx/Tx queues (to be added), which allow transferring packets over two 2.5Gbps serial links. The multi-gigabit serial links allow extending the NetFPGA by, for example, connecting it to another NetFPGA to implement an 8-port switch or a ring of NetFPGAs.
Figure 3: Format of a packet passing on the packet bus. An 8-bit ctrl bus accompanies the 64-bit data bus: module headers carry non-zero ctrl values (0xFF identifies the header holding the one-hot destination port, word length, binary source port and byte length), packet payload words carry ctrl 0x00, and a non-zero ctrl on the final word indicates which bytes are valid (e.g. 0x40 marks a last word with two valid bytes).

2.2 Packet Bus
To keep stages simple, the interface is packet-based. When Stage i sends a packet to Stage i+1, it will send the entire packet without being interleaved with another. Modules are not required to process multiple packets at a time, and they are not required to split the packet header from its data, although a module is free to choose to do so internally. We have found that the simple packet-based interface makes it easier to reason about the performance of the entire pipeline.
The packet bus consists of a ctrl bus and a data bus along with a write signal to indicate that the buses are carrying valid information. A ready signal from Stage i+1 to Stage i provides flow control using backpressure. Stage i+1 asserts the ready signal indicating it can accept at least two more words of data, and deasserts it when it can accept only one more word or less. Stage i sets the ctrl and data buses, and asserts the write signal to write a word to Stage i+1.
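To make the handshake concrete, the following is a minimal sketch of a stage written against this interface. The port names (in_data, in_ctrl, in_wr, in_rdy and their out_* counterparts) follow the convention used by the reference pipeline, but the module body is purely illustrative: it forwards every word combinationally and simply passes the downstream ready signal upstream, whereas a real stage would buffer words in a small fallthrough FIFO and deassert in_rdy once fewer than two words of space remain.

// Illustrative pass-through stage on the NetFPGA packet bus.
module passthrough_stage #(
    parameter DATA_WIDTH = 64,
    parameter CTRL_WIDTH = 8
) (
    input                   clk,
    input                   reset,

    // from Stage i (upstream writes words into us)
    input  [DATA_WIDTH-1:0] in_data,
    input  [CTRL_WIDTH-1:0] in_ctrl,
    input                   in_wr,
    output                  in_rdy,

    // to Stage i+1 (we write words downstream)
    output [DATA_WIDTH-1:0] out_data,
    output [CTRL_WIDTH-1:0] out_ctrl,
    output                  out_wr,
    input                   out_rdy
);
    // Forward each word and its write strobe unchanged, and
    // propagate backpressure upstream.
    assign out_data = in_data;
    assign out_ctrl = in_ctrl;
    assign out_wr   = in_wr;
    assign in_rdy   = out_rdy;
endmodule

Because every stage exposes exactly this interface, stages can be reordered or new ones inserted between existing ones without changing either neighbor.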
Packets on the packet bus have the format shown in Figure 3. As the packet passes from one stage to the next, a stage can modify the packet itself and/or parse the packet to obtain information that is needed by a later stage for additional processing on the packet.
This extracted information is prepended to the beginning of the packet as a 64-bit word which we call a module header, and which is distinguished from other module headers by its unique ctrl word. Subsequent stages in the pipeline can identify this module header from its ctrl word and use the header to do additional processing on the packet.
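As an illustration, the sketch below shows how a stage can pick out one module header while a packet streams past, using the ctrl value 0xFF and the field order from Figure 3 (one-hot destination port, word length, binary source port, byte length). The exact bit positions are an assumption made for illustration, not a normative layout.

// Latch fields from the module header whose ctrl value is 0xFF as the
// packet streams through a stage. Bit positions follow the left-to-right
// field order of Figure 3 but are illustrative.
module module_header_fields (
    input             clk,
    input             in_wr,
    input      [7:0]  in_ctrl,
    input      [63:0] in_data,
    output reg [15:0] dst_port_one_hot,
    output reg [15:0] word_len,
    output reg [15:0] src_port,
    output reg [15:0] byte_len
);
    always @(posedge clk)
        if (in_wr && in_ctrl == 8'hFF) begin
            dst_port_one_hot <= in_data[63:48];
            word_len         <= in_data[47:32];
            src_port         <= in_data[31:16];
            byte_len         <= in_data[15:0];
        end
endmodule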
While prepending module headers onto the packet and passing processing results in-band consumes bandwidth, the bus's 8Gbps bandwidth leaves 3Gbps to be consumed by module headers (4Gbps used by Ethernet packets and 1Gbps used by packets to/from the host). This translates to more than 64 bytes available for module headers per packet in the worst case. Compared with sending processing results over a separate bus, sending them in-band simplifies the state machines responsible for communicating between stages and leaves less room for assumptions and mistakes in the relative timing of packet data and processing results.

2.3 Register Bus
Networking hardware is more than just passing packets around between pipeline stages. The operation of these stages needs to be controllable by and visible to software that runs the more complex algorithms and protocols at a higher level, as well as by the user to configure and debug the hardware and the network. This means that we need to make the hardware's registers, counters, and tables visible and controllable. A common register interface exposes these data types to the software and allows it to modify them. This is done by memory-mapping the internal hardware registers. The memory-mapped registers then appear as I/O registers to the user software that can access them using ioctl calls.
The register bus strings together register modules in each stage in a pipelined daisy-chain that is looped back in a ring. One module in the chain initiates and responds to requests that arrive as PCI register requests on behalf of the software. However, any stage on the chain is allowed to issue register access requests, allowing information to trickle backwards in the pipeline, and allowing Stage i to get information from Stage i+k.
The daisy-chain architecture is preferable to a centralized arbiter approach because it facilitates the interconnection of stages as well as limits inter-stage dependencies.
Requests on the bus can be either a register read or a register write. The bus is pipelined, with each transaction consuming one clock cycle. As a request goes through a stage, the stage decides whether to respond to the request or send the request unmodified to the next stage in the chain. Responding to a request means modifying the request by asserting an acknowledge signal in the request and, if the request is a read, then also setting the data lines on the register bus to the value of the register.
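The sketch below illustrates one hop of this daisy-chain. The port list is simplified and the names (req, ack, addr, and a single BLOCK_ADDR decode) are assumptions for illustration; the reference pipeline's actual register interface carries a few more signals, but the behavior shown (answer a request addressed to you, otherwise forward it untouched, one cycle per hop) is the same.

// One simplified hop on the register-bus daisy-chain.
module reg_chain_stage #(
    parameter ADDR_WIDTH = 23,
    parameter DATA_WIDTH = 32,
    parameter BLOCK_ADDR = 23'h10          // address this stage answers for
) (
    input                       clk,
    input                       reset,

    input                       req_in,     // a request is on the bus
    input                       ack_in,     // already answered upstream
    input                       is_read_in, // read (1) or write (0)
    input      [ADDR_WIDTH-1:0] addr_in,
    input      [DATA_WIDTH-1:0] data_in,

    output reg                  req_out,
    output reg                  ack_out,
    output reg                  is_read_out,
    output reg [ADDR_WIDTH-1:0] addr_out,
    output reg [DATA_WIDTH-1:0] data_out
);
    reg [DATA_WIDTH-1:0] my_register;       // example state exposed to software

    wire hit = req_in && !ack_in && (addr_in == BLOCK_ADDR);

    always @(posedge clk) begin
        if (reset) begin
            req_out     <= 1'b0;
            ack_out     <= 1'b0;
            my_register <= {DATA_WIDTH{1'b0}};
        end else begin
            // every transaction spends exactly one cycle in each stage
            req_out     <= req_in;
            is_read_out <= is_read_in;
            addr_out    <= addr_in;

            if (hit) begin
                ack_out  <= 1'b1;                       // claim the request
                data_out <= is_read_in ? my_register : data_in;
                if (!is_read_in)
                    my_register <= data_in;             // register write
            end else begin
                ack_out  <= ack_in;                     // forward unmodified
                data_out <= data_in;
            end
        end
    end
endmodule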
3. USAGE EXAMPLES
This section describes the IPv4 router and two extensions to this router that are used for research: Buffer Monitoring and OpenFlow. Two other extensions, Time Synchronization and RCP, are described in the Appendix.

3.1 The IPv4 Router
Three basic designs have been implemented on NetFPGA using the interfaces described above: a 4-port NIC, an Ethernet switch, and an IPv4 router. Most projects will build on one of these designs and extend it. In this section, we will describe the IPv4 router on which the rest of the examples in this paper are based.
The basic IPv4 router can run at the full 4x1Gbps line-rate. The router project includes the forwarding path in hardware, two software packages that allow it to build routes and routing tables, and command-line and graphical user interfaces for management.
Software: The software packages allow the routing tables to be built using a routing protocol (PeeWee OSPF [14]) running in user-space completely independent of the Linux host, or by using the Linux host's own routing table. The software also handles slow path processing such as generating ICMP messages, handling ARP, IP options, etc. More information can be found in the NetFPGA Guide [12]. The rest of this subsection describes the hardware.
Hardware: The IPv4 hardware forwarding path lends itself naturally to the classic five-stage switch pipeline shown in Figure 2. The first stage, the Rx Queues, receives each packet from the board's I/O ports (such as the Ethernet ports and the CPU DMA interface), appends a module header indicating the packet's length and ingress port, and passes it using the FIFO interface into the User Datapath. The User Datapath contains three stages that perform the packet processing and is where most user modifications would occur. The Rx Queues guarantee that only good link-layer packets are pushed into the User Datapath, and so they handle all the clock domain crossings and error checking.
The first stage in the User Datapath, the Input Arbiter, uses packetized round-robin arbitration to select which of the Rx Queues to service and pushes the packet into the Output Port Lookup stage.
The Output Port Lookup stage selects which output queue to place the packet in and, if necessary, modifies the packet. In the case of the IPv4 router, the Output Port Lookup decrements the TTL, checks and updates the IP checksum, performs the forwarding table and ARP cache lookups, and decides whether to send the packet to the CPU as an exception packet or forward it out one of the Ethernet ports. The longest prefix match and the ARP cache lookups are performed using the FPGA's on-chip TCAMs. The stage also checks for non-IP packets (ARP, etc.), packets with IP options, or other exception packets to be sent up to the software to be handled in the slow path. It then modifies the module header originally added by the Rx queue to also indicate the destination output port.
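The TTL decrement and checksum update in particular reduce to a small piece of combinational arithmetic. The module below is a sketch of the standard incremental update (RFC 1141 style) under the assumption that only the TTL changed; it is not lifted from the reference design, which also verifies the received checksum, and it ignores the corner cases discussed in RFC 1624.

// Incremental IPv4 checksum update when the TTL is decremented by one.
// The TTL is the high byte of one 16-bit header word, so that word drops
// by 0x0100 and the one's-complement checksum field rises by 0x0100 with
// an end-around carry.
module ttl_csum_update (
    input  [7:0]  old_ttl,
    input  [15:0] old_csum,
    output [7:0]  new_ttl,
    output [15:0] new_csum
);
    wire [16:0] sum = {1'b0, old_csum} + 17'h00100;  // add 0x0100
    assign new_ttl  = old_ttl - 8'd1;
    assign new_csum = sum[15:0] + sum[16];           // fold the carry back in
endmodule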
After the Output Port Lookup decides what to do with the packet, it pushes it to the Output Queues stage, which puts the packet in one of eight output buffers (4 for CPU interfaces and 4 for Ethernet interfaces) using the information that is stored in the module header. For the IPv4 router, the Output Queues stage implements the output buffers in the on-board SRAM. When output ports are free, the Output Queues stage uses a packetized round-robin arbiter to select which output queue to service from the SRAM and delivers the packet to the final stage, the destination Tx Queue, which strips out the module headers and puts the packet out on the output port, to go to either the CPU via DMA or out to the Ethernet.
The ability to split up the stages of the IPv4 pipeline so cleanly and hide them under the NetFPGA's pipeline interface allows the module pipeline to be easily and efficiently extended. Developers do not need to know the details of the implementation of each pipeline stage since its results are explicitly present in the module headers. In the next few sections, we use this interface to extend the IPv4 router.
Commercial routers are not usually built this way. Even though the basic stages mentioned here do exist, they are not so easily or cleanly split apart. The main difference, though, stems from the fact that the NetFPGA router is a pure output-queued switch. An output-queued switch is work conserving and has the highest throughput and lowest average delay.
Organizations building routers with many more ports than NetFPGA cannot afford (or sometimes even design) memory that has enough bandwidth to be used in a pure output-queued switch. So, they resort to other tricks such as virtual input queueing, combined input-output queueing ([1]), smart scheduling ([10]), and distributed shared memory to approximate an output-queued switch. Since the NetFPGA router runs at line-rate and implements output queueing, the main difference between its behavior and that of a commercial router will be in terms of available buffer sizes and delays across it.

3.2 Buffer Monitoring Router
The Buffer Monitoring Router augments the IPv4 router with an Event Capture stage that allows monitoring the output buffers' occupancies in real-time with single-cycle accuracy. This extension was needed to verify the results of research on using small output buffers in switches and routers [4]. To do this, the Event Capture stage timestamps when each packet enters an output queue and when it leaves, as well as its length. The host can use these event records to reconstruct the evolution of the queue occupancy from the series.
Since packets can be arriving at up to 4Gbps with a minimum packet length of 64 bytes (84 including preamble and inter-packet gap), a packet will be arriving every 168ns. Each packet can generate two events: once going into a queue, and once when leaving. Since each packet event record is 32 bits, the required bandwidth when running at full line rate is approximately 32 bits every 84ns, or about 381Mbps! This eliminates the possibility of placing these timestamps in a queue and having the software read them via the PCI bus, since it takes approximately 1 microsecond per 32-bit read (32 Mbps). The other option would be to design a specific mechanism by which these events could be written to a queue and then sent via DMA to the CPU. This, however, would require too much effort and work for a very specific functionality. The solution we use is to collect these events into a packet which can be sent out an output port, either to the CPU via DMA or to an external host via 1 Gbps Ethernet.
To implement the solution, we need to be able to timestamp some signals from the Output Queues module indicating packet events and store these in a FIFO. When enough events are collected, the events are put into a packet that is injected into the router's pipeline with the correct module headers to be placed in an output queue to send to a host, whether local via DMA or remote.
There are mainly two possibilities for where this extension can be implemented. The first choice would be to add the Event Capture stage between the Output Port Lookup stage and the Output Queues stage. This would allow re-using the stage to monitor signals other than those coming from the Output Queues, as well as separate the monitoring logic from the monitored logic. Unfortunately, since the timestamping happens at single-cycle accuracy, signals indicating packet storage and packet removal have to be pulled out of the Output Queues into the Event Capture stage, violating the FIFO discipline and using channels other than the packet and register buses for inter-module communication.
Another possibility is to add the buffer monitoring logic into the Output Queues stage. This would not violate the NetFPGA methodology, at the cost of making it harder to re-use the monitoring logic for other purposes. The current implementation uses the first approach since we give high priority to re-use and flexibility. This design is shown in Figure 4.
The Event Capture stage consists of two parts: an Event Recorder module and a Packet Writer module. The Event Recorder captures the time when signals are asserted and serializes the events to be sent to the Packet Writer, which aggregates the events into a packet by placing them in a buffer. When an event packet is ready to be sent out, the Packet Writer adds a header to the packet and injects it into the packet bus to be sent into the Output Queues.
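A sketch of the Event Recorder's core is shown below. The 32-bit record layout (event type, queue id, packet length, truncated timestamp) and the signal names are assumptions made for illustration; the actual record format used by the Buffer Monitoring Router is not specified here. Note also that serializing simultaneous store and remove events, one of the real difficulties mentioned below, is not handled in this sketch.

// Illustrative Event Recorder core: timestamp enqueue/dequeue strobes
// pulled from the Output Queues and emit one 32-bit record per event.
module event_recorder_sketch (
    input             clk,
    input             reset,
    // strobes from the Output Queues (a single queue shown)
    input             pkt_stored,   // a packet entered the queue
    input             pkt_removed,  // a packet left the queue
    input      [2:0]  queue_id,
    input      [7:0]  pkt_words,    // packet length in bus words
    // toward the Packet Writer
    output reg        ev_wr,
    output reg [31:0] ev_record
);
    reg [19:0] timestamp;           // free-running cycle counter (truncated)

    always @(posedge clk) begin
        if (reset) begin
            timestamp <= 20'd0;
            ev_wr     <= 1'b0;
        end else begin
            timestamp <= timestamp + 20'd1;
            // 1-bit type (0 = stored, 1 = removed), 3-bit queue id,
            // 8-bit length, 20-bit timestamp
            ev_wr     <= pkt_stored | pkt_removed;
            ev_record <= {pkt_removed, queue_id, pkt_words, timestamp};
        end
    end
endmodule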
While not hard, the main difficulties encountered while implementing this system were handling simultaneous packet reads and writes from the output queues, and ordering and serializing them to get the correct timestamps. The system was easily implemented over a few weeks' time by one grad student, and verified thoroughly by another.

Figure 4: The event capture stage is inserted between the Output Port Lookup and Output Queues stages. The Event Recorder generates events while the Event Packet Writer aggregates them into packets.

Figure 5: The OpenFlow switch pipeline implements a different Output Port Lookup stage and uses DRAM for packet buffering.
3.3 OpenFlow Switch
OpenFlow is a feature on a networking switch that allows a researcher to experiment with new functionality in their own network; for example, to add a new routing protocol, a new management technique, a novel packet processing algorithm, or even, eventually, alternatives to IP [11]. The OpenFlow Switch and the OpenFlow Protocol specifications essentially provide a mechanism to allow a switch's flow table to be controlled remotely.
Packets are matched on a 10-tuple consisting of a packet's ingress port, Ethernet MAC destination and source addresses, Ethernet type, VLAN identifier (if one exists), IP destination and source addresses, IP protocol identifier, and TCP/UDP source and destination ports (if they exist). Packets can be matched exactly or using wildcards to specify fields that are Don't Cares. If no match is found for a packet, the packet is forwarded to the remote controller, which can examine the packet and decide on the next steps to take [13].
Actions on packets can be forwarding on a specific port, normal L2/L3 switch processing, sending to the local host, or dropping. Optionally, they can also include modifying VLAN tags, modifying the IP source/destination addresses, and modifying the TCP/UDP source/destination ports. Even without the optional per-packet modifications, implementing an OpenFlow switch using a general commodity PC would not allow us to achieve line-rate on four 1Gbps ports; therefore, we implemented an OpenFlow switch on NetFPGA.
The OpenFlow implementation on NetFPGA replaces two of the IPv4 router's stages, the Output Port Lookup and Output Queues stages, and adds a few other stages to implement the actions to be taken on packets, as shown in Figure 5. The OpenFlow Lookup stage implements the flow table using a combination of on-chip TCAMs and off-chip SRAM to support a large number of flow entries and allow matching on wildcards.
As a packet enters the stage, the Header Parser pulls the relevant fields from the packet and concatenates them. This forms the flow header, which is then passed to the Wildcard Lookup and Exact Lookup modules.
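The flow header itself is just the concatenation of the parsed 10-tuple. The widths and ordering below follow the field list above but are illustrative; they are not taken from the implementation's actual packing.

// Illustrative flow header formed by the Header Parser.
module flow_header_concat (
    input  [7:0]   src_port,    // ingress port
    input  [47:0]  eth_dst,     // Ethernet MAC destination
    input  [47:0]  eth_src,     // Ethernet MAC source
    input  [15:0]  eth_type,
    input  [15:0]  vlan_id,     // invalid/zero if untagged
    input  [31:0]  ip_dst,
    input  [31:0]  ip_src,
    input  [7:0]   ip_proto,
    input  [15:0]  tp_dst,      // TCP/UDP destination port, zero if absent
    input  [15:0]  tp_src,      // TCP/UDP source port, zero if absent
    output [239:0] flow_header
);
    assign flow_header = {src_port, eth_dst, eth_src, eth_type, vlan_id,
                          ip_dst, ip_src, ip_proto, tp_dst, tp_src};
endmodule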
The Exact Lookup module uses two hashing functions on the flow header to index into the SRAM and reduce collisions. In parallel, the Wildcard Lookup module performs the lookup in the TCAMs to check for any matches on flow entries with wildcards.
The results of both wildcard and exact match lookups are sent to an arbiter that decides which result to choose. Once a decision is reached on the actions to take on a packet, the counters for that flow entry are updated and the actions are specified in new module headers prepended at the beginning of the packet by the Packet Editor.
The stages between the Output Port Lookup and the Output Queues, the OpenFlow Action stages, handle packet modifications as specified by the actions in the module headers. Figure 5 only shows one OpenFlow Action stage for compactness, but it is possible to have multiple Action stages in series, each doing one of the actions from the flow entry. This allows adding more actions very easily as the specification matures.
The new Output Queues stage implements output FIFOs that are handled in round-robin order using a hierarchy of on-chip Block RAMs (BRAMs) and DRAM, as in [8]. The head and tail caches are implemented as static FIFOs in BRAM, and the larger queues are maintained in the DRAM.
Running on the software side is the OpenFlow client, which establishes an SSL connection to the controller. It provides the controller with access to the local flow table maintained in the software and hardware, and can connect to any OpenFlow-compatible controller such as NOX [5].
Pipelining two exact lookups to hide the SRAM latency turned out to be the most challenging part of implementing the OpenFlow Output Port Lookup stage. It required modifying the SRAM arbiter to synchronize its state machine to the Exact Lookup module's state machine, and modifying the Wildcard Lookup module to place its results in a shallow FIFO because it finishes earlier and doesn't need pipelining. The Match Arbiter has to handle the delays between the Exact Lookup and Wildcard Lookup modules and the delays between when the hit/miss signals are generated and when the data is available. To run at line-rate, all lookups had to complete in 16 cycles; so another challenge was compressing all the information needed from a lookup so that it could be pulled out of the SRAM, with its limited bandwidth, in less than 16 cycles.
The hardware implementation currently only uses BRAM for the Output Queues and does not implement any optional packet modifications. It was completed over a period of five weeks by one graduate student, and can handle full line-rate switching across all ports. The DRAM Output Queues are still being implemented. Integration with software is currently taking place. The exact match flow table can store up to 32,000 flow table entries, while the wildcard match flow table can hold 32. The completed DRAM Output Queues should be able to store up to 5500 maximum-sized (1514 bytes) packets per output queue.
4. LIMITATIONS
While the problems described in the previous section and in the appendix have solutions that fit nicely in NetFPGA, one has to wonder what problems do not. There are at least three issues: latency, memory, and bandwidth.
The first type of problem cannot be easily split into clean computation sections that have short enough latency to fit into a pipeline stage and allow the pipeline to run at line-rate. This includes many cryptographic applications, such as some complex message authentication codes or other public key certifications or encryptions.
Protocols that require several messages to be exchanged, and hence require messages to be stored in the NetFPGA for an arbitrary amount of time while waiting for responses, also do not lend themselves to a simple and clean solution in hardware. This includes TCP, ARP, and others.
We already saw one slight example of the third type of problem in the Buffer Monitoring extension. Most solutions that need too much feedback from stages ahead in the pipeline are difficult to implement using the NetFPGA pipeline. This includes input arbiters tightly coupled to the output queues, load-balancing, weighted fair queueing, etc.
5. CONCLUSION
Networking hardware provides fertile ground for designing highly modular and re-usable components. We have designed an interface that directly translates the way packets need to be processed into a simple, clean pipeline that has enough flexibility to allow for designing some powerful extensions to a basic IPv4 router. The packet and register buses provide a simple way to pass information around between stages while maintaining a generic enough interface to be applied across all stages in the pipeline.
By providing a simple interface between hardware stages and an easy way to interact with the software, NetFPGA makes the learning curve for networking hardware much gentler and invites students and researchers to modify or extend the projects that run on it. By providing a library of re-usable modules, NetFPGA allows developers to mix and match functionality provided by different modules and string together a new design in a very short time. In addition, it naturally allows the addition of new functionality as a stage in the pipeline without either having to understand the internals of the preceding or following stages, or having to modify any of them. We believe that we have achieved the goal we set for NetFPGA: allowing users of different levels to easily build powerful designs in a very short time.

6. ACKNOWLEDGMENTS
We wish to thank John W. Lockwood for leading the NetFPGA project through the rough times of verification and testing, as well as helping to spread the word about NetFPGA. We also wish to thank Adam Covington, David Erickson, Brandon Heller, Paul Hartke, Jianying Luo, and everyone who helped make NetFPGA a success. Finally, we wish to thank our reviewers for their helpful comments and suggestions.
7. REFERENCES
[1] S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar. Matching Output Queueing with a Combined Input Output Queued Switch. In INFOCOM (3), pages 1169-1178, 1999.
[2] N. Dukkipati, G. Gibb, N. McKeown, and J. Zhu. Building a RCP (Rate Control Protocol) Test Network. In Hot Interconnects, 2007.
[3] N. Dukkipati, M. Kobayashi, R. Zhang-Shen, and N. McKeown. Processor Sharing Flows in the Internet. In Thirteenth International Workshop on Quality of Service (IWQoS), 2005.
[4] M. Enachescu, Y. Ganjali, A. Goel, N. McKeown, and T. Roughgarden. Routers With Very Small Buffers. In IEEE Infocom, 2006.
[5] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: Towards an Operating System for Networks. To appear.
[6] M. Handley, E. Kohler, A. Ghosh, O. Hodson, and P. Radoslavov. Designing Extensible IP Router Software. In NSDI '05: Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation, pages 189-202, Berkeley, CA, USA, 2005. USENIX Association.
[7] IEEE. IEEE 1588-2002, Precision Time Protocol. Technical report, IEEE, 2002.
[8] S. Iyer, R. R. Kompella, and N. McKeown. Designing Packet Buffers for Router Line Cards. Technical report, Stanford University High Performance Networking Group, 2002.
[9] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, and J. Luo. NetFPGA - An Open Platform for Gigabit-Rate Network Switching and Routing. In MSE '07: Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, pages 160-161, Washington, DC, USA, 2007. IEEE Computer Society.
[10] N. McKeown. The iSLIP Scheduling Algorithm for Input-Queued Switches. IEEE/ACM Trans. Netw., 7(2):188-201, 1999.
[11] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in College Networks. Soon to appear in ACM Computer Communication Review.
[12] NetFPGA Development Team. NetFPGA User's and Developer's Guide. Available at http://netfpga.org/static/guide.html.
[13] OpenFlow Consortium. OpenFlow Switch Specification. Available at http://openflowswitch.org/documents.html.
[14] Stanford University. Pee-Wee OSPF Protocol Details. Available at http://yuba.stanford.edu/cs344_public/docs/pwospf_ref.txt.
APPENDIX
A. RCP ROUTER
RCP (Rate Control Protocol) is a congestion control algorithm which tries to emulate processor sharing in routers [2]. An RCP router maintains a single rate for all flows. RCP packets carry a header with a rate field, Rp, which is overwritten by the router if the value within the packet is larger than the value maintained by the router; otherwise it is left unchanged. The destination copies Rp into acknowledgment packets sent back to the source, which the source then uses to limit the rate at which it transmits. The packet header also includes the source's guess of the round trip time (RTT); a router uses the RTT to calculate the average RTT for all flows.
The additional stage, the RCP Stage, parses the RCP packet and updates Rp if required. It also calculates per-port averages and makes this information available to the software via the memory-mapped register interface. The router also includes the Buffer Monitoring stage from the Buffer Monitoring Router, allowing users to monitor the queue occupancy evolution when RCP is being used and when it is not.
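The core of that update is a single comparison, sketched below; the field width and signal names are illustrative, not taken from the implementation.

// Rate-update decision applied by the RCP Stage to a forwarded packet:
// Rp is overwritten with the router's rate only if the packet's value
// is larger, otherwise it is left unchanged.
module rcp_rate_update (
    input  [31:0] pkt_rp,       // Rp parsed from the RCP header
    input  [31:0] router_rate,  // rate currently maintained by the router
    output [31:0] new_rp
);
    assign new_rp = (pkt_rp > router_rate) ? router_rate : pkt_rp;
endmodule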
The design of the system took around two days, and the implementation and testing took around 10 days.

B. PRECISION TIME SYNCHRONIZATION ROUTER