
NetFPGA: Reusable Router Architecture for Experimental Research

Jad Naous, Glen Gibb, Sara Bolouki, Nick McKeown
Stanford University, California, USA

ABSTRACT

Our goal is to enable fast prototyping of networking hardware (e.g. modified Ethernet switches and IP routers) for teaching and research. To this end, we built and made available the NetFPGA platform. Starting from open-source reference designs, students and researchers create their designs in Verilog, and then download them to the NetFPGA board where they can process packets at line-rate for 4 ports of 1GE. The board is becoming widely used for teaching and research, and so it has become important to make it easy to re-use modules and designs. We have created a standard interface between modules, making it easier to plug modules together in pipelines, and to create new re-usable designs. In this paper we describe our modular design, and how we have used it to build several systems, including our IP router reference design and some extensions to it.

Categories and Subject Descriptors

B.6.1 [Logic Design]: Design Styles - sequential circuits, parallel circuits; C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks - ethernet, high-speed, internet; C.2.6 [Computer-Communication Networks]: Internetworking - routers

General Terms

Design

Keywords

NetFPGA, modular design, reuse

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PRESTO'08, August 22, 2008, Seattle, Washington, USA.
Copyright 2008 ACM 978-1-60558-181-1/08/08 ...$5.00.

1. INTRODUCTION

The benefits of re-use are well understood: It allows developers to quickly build on the work of others, reduces time to market, and increases the scrutiny (and therefore the quality) of code. Software re-use is widely practiced and wildly successful, particularly in the open-source community, as well as in corporate development practices and commercial tools.

The key to software re-use is to create a good API: an interface that is intuitive and simple to use, and useful to a large number of developers. Indeed, the whole field of networking is built on the re-use of layers interconnected by well-documented and well-designed interfaces and APIs.

On the other hand, the history of re-use in hardware design is mixed; there are no hugely successful open-source hardware projects analogous to Linux, mostly because complicated designs only recently started to fit on FPGAs. Still, this might seem surprising as there are very strong incentives to re-use Verilog code (or other HDLs): the cost of developing Verilog is much higher (per line of code) than for software development languages, and the importance of correctly verifying the code is much higher, as even small bugs can cost millions of dollars to fix. Indeed, companies who build ASICs place great importance on building reusable blocks or macros. And some companies exist to produce and sell expensive IP (intellectual property) blocks for use by others. To date, the successes with open-source re-usable hardware have been smaller, with Opencores.org being the most well-known.

Re-using hardware is difficult because of the dependencies on the particular design it is part of, e.g. clock speed, I/Os, and so on. Unlike software projects, there is no underlying unifying operating system to provide a common platform for all contributed code.

Our goal is to make networking hardware design more re-usable for teachers, students and researchers, particularly on the low-cost NetFPGA platform. We have created a simple modular design methodology that allows networking hardware designers to write re-usable code. We are creating a library of modules that can be strung together in different ways to create new systems.

NetFPGA is a sandbox for networking hardware: it allows students and researchers to experiment with new ways to process packets at line-rate. For example, a student in a class might create a 4-port Gigabit Ethernet switch; or a researcher might add a novel feature to a 4-port Gigabit IP router. Packets can be processed in arbitrary ways, under the control of the user.


Figure 1: Stages in the modular pipeline are connected using two buses: the Packet Bus and the Register Bus.

Figure 2: The IPv4 Router is built using the Reference Pipeline - a simple canonical five stage pipeline that can be applied to a variety of networking hardware.

NetFPGA uses an industry-standard design flow (users write programs in Verilog, synthesize them, and then download them to programmable hardware). Designs typically run at line-rate, allowing experimental deployment in real networks. A number of classes are taught using NetFPGA, and it is used by a growing number of networking researchers.

NetFPGA is a PCI card that plugs into a standard PC. The card contains an FPGA, four 1GigE ports and some buffer memory (SRAM and DRAM). The board is very low-cost (at the time of writing, boards are available for $500 for research and teaching) and software, gateware and courseware are freely available at http://NetFPGA.org. For more details, see [9].

Reusable modules require a well-defined and documented API. It has to be flexible enough to be usable on a wide variety of modules, as well as simple enough to allow both novice and experienced hardware designers to learn it in a short amount of time.

Our Approach, like many that have gone before, exploits the fact that networking hardware is generally arranged as a pipeline through which packets flow and are processed at each stage. This suggests an API that carries packets from one stage to the next along with the information needed to process the packet or results from a previous stage. Our interface does exactly that and in some ways resembles the simple processing pipeline of Click [6], which allows a user to connect modules using a generic interface. One difference is that we only use a push interface, as opposed to both push and pull.

NetFPGA modules are connected as a sequence of stages in a pipeline. Stages communicate using a simple packet-based synchronous FIFO push interface: Stage i+1 tells Stage i that it has space for a packet word (i.e. the FIFO is not full); Stage i writes a packet word into Stage i+1. Since processing results and other information at one stage are usually needed at a subsequent stage, Stage i can prepend any information it wants to convey as a word at the beginning of a packet.

Figure 1 shows the high-level modular architecture. Figure 2 shows the pipeline of a simple IPv4 router built this way.

Our Goal is to enable a wide variety of users to create new systems using NetFPGA. Less experienced users will reuse entire pre-built projects on the NetFPGA card. Some others will modify pre-built projects and add new functionality to them by inserting new modules between the available modules. Others will design completely new projects without using any pre-built modules. This paper explains how NetFPGA enables the second group of users to reuse modules built by others, and to create new modules in a short time.

The paper is organized as follows: Section 2 gives the details of the communication channels in the pipeline, Section 3 describes the reference IPv4 router and other extensions, Section 4 discusses limitations of the NetFPGA approach, and Section 5 concludes the paper.

2. PIPELINE INTERFACE DETAILS

Figure 1 shows the NetFPGA pipeline that sits entirely on the Virtex FPGA. Stages are interconnected using two point-to-point buses: the packet bus and the register bus.

The packet bus transfers packets from one stage to the next using a synchronous FIFO packet-based push interface, over a 64-bit wide bus running at 125MHz (an aggregate rate of 8Gbps). The FIFO interface has the advantage of hiding all the internals of the module behind a few signals and allows modules to be concatenated in any order. It is arguably the simplest interface that can be used to pass information and provide flow control while still being sufficiently efficient to run designs at full line rate.

The register bus provides another channel of communication that does not consume Packet Bus bandwidth. It allows information to travel in both directions through the pipeline, but has a much lower bandwidth.
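To make the packet-bus interface concrete, the following is a rough Verilog sketch of a stage that does nothing at all. The signal names and widths are illustrative stand-ins rather than the exact NetFPGA port list, and the register-bus port pair that every stage also carries is omitted here for brevity; the point is only that every stage exposes the same upstream and downstream packet-bus shape, so stages can be chained in any order.

    // A do-nothing stage built against the packet bus (illustrative names).
    module passthrough_stage #(
        parameter DATA_WIDTH = 64,   // packet bus data width: 64 bits at 125MHz
        parameter CTRL_WIDTH = 8
    ) (
        // Upstream side: Stage i-1 pushes words into this stage.
        input  [DATA_WIDTH-1:0] in_data,
        input  [CTRL_WIDTH-1:0] in_ctrl,
        input                   in_wr,    // asserted by Stage i-1 to write a word
        output                  in_rdy,   // asserted while we can accept two or more words
        // Downstream side: this stage pushes words into Stage i+1.
        output [DATA_WIDTH-1:0] out_data,
        output [CTRL_WIDTH-1:0] out_ctrl,
        output                  out_wr,
        input                   out_rdy
    );
        // This stage adds and removes nothing, so data and flow control pass
        // straight through; a real stage would place its processing logic here.
        assign out_data = in_data;
        assign out_ctrl = in_ctrl;
        assign out_wr   = in_wr;
        assign in_rdy   = out_rdy;
    endmodule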

2.1 Entry and Exit Points

Packets enter and exit the pipeline through various Receive and Transmit Queue modules respectively. These connect the various I/O ports to the pipeline and translate from the diverse peripheral interfaces to the unified pipeline FIFO interface. This makes it simpler for designers to connect to different I/O ports without having to learn how to use each.

Currently there are two types of I/O ports implemented, with a third planned. These are: the Ethernet Rx/Tx queues, which send and receive packets via GigE ports; the CPU DMA Rx/Tx queues, which transfer packets via DMA between the NetFPGA and the host CPU; and the Multi-gigabit serial Rx/Tx queues (to be added), which allow transferring packets over two 2.5Gbps serial links. The multi-gigabit serial links allow extending the NetFPGA by, for example, connecting it to another NetFPGA to implement an 8-port switch or a ring of NetFPGAs.


Figure 3: Format of a packet passing on the packet bus.

2.2 Packet Bus

To keep stages simple, the interface is packet-based. When Stage i sends a packet to Stage i+1, it will send the entire packet without it being interleaved with another. Modules are not required to process multiple packets at a time, and they are not required to split the packet header from its data, although a module is free to choose to do so internally. We have found that the simple packet-based interface makes it easier to reason about the performance of the entire pipeline.

The packet bus consists of a ctrl bus and a data bus, along with a write signal to indicate that the buses are carrying valid information. A ready signal from Stage i+1 to Stage i provides flow control using backpressure. Stage i+1 asserts the ready signal to indicate that it can accept at least two more words of data, and deasserts it when it can accept only one more word or less. Stage i sets the ctrl and data buses, and asserts the write signal to write a word to Stage i+1.

Packets on the packet bus have the format shown in Figure 3. As the packet passes from one stage to the next, a stage can modify the packet itself and/or parse the packet to obtain information that is needed by a later stage for additional processing on the packet.

This extracted information is prepended to the beginning of the packet as a 64-bit word which we call a module header, and which is uniquely identified from other module headers by its ctrl word. Subsequent stages in the pipeline can identify this module header from its ctrl word and use the header to do additional processing on the packet.

While prepending module headers onto the packet and passing processing results in-band consumes bandwidth, the bus's 8Gbps bandwidth leaves 3Gbps to be consumed by module headers (4Gbps is used by Ethernet packets and 1Gbps by packets to/from the host). This translates to more than 64 bytes available for module headers per packet in the worst case. Compared with sending processing results over a separate bus, sending them in-band simplifies the state machines responsible for communicating between stages and leaves less room for assumptions and mistakes in the relative timing of packet data and processing results.
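As a small illustration of how a stage uses the ctrl value to find and edit a module header, the sketch below rewrites one field of the per-packet header as the packet streams by (prepending a brand-new header additionally requires a word of buffering, which is omitted here for brevity). The ctrl value 0xFF and the position of the destination-port field follow the layout suggested by Figure 3, but the exact encodings are assumptions, not the NetFPGA definitions.

    // Illustrative stage that stamps a destination port into the module header.
    module set_output_port #(
        parameter DATA_WIDTH = 64,
        parameter CTRL_WIDTH = 8,
        parameter IOQ_CTRL   = 8'hFF     // assumed ctrl value marking the I/O-queue header
    ) (
        // upstream packet bus
        input  [DATA_WIDTH-1:0] in_data,
        input  [CTRL_WIDTH-1:0] in_ctrl,
        input                   in_wr,
        output                  in_rdy,
        // downstream packet bus
        output [DATA_WIDTH-1:0] out_data,
        output [CTRL_WIDTH-1:0] out_ctrl,
        output                  out_wr,
        input                   out_rdy,
        // destination chosen by this stage's lookup logic (one-hot, assumed 16 bits)
        input  [15:0]           dst_port_one_hot
    );
        // The stage neither adds nor removes words, so flow control passes
        // straight through: we can accept a word exactly when the next stage can.
        assign in_rdy = out_rdy;

        // Recognize the module header by its ctrl value and rewrite the (assumed)
        // destination-port field in its upper 16 bits; other words pass untouched.
        assign out_data = (in_ctrl == IOQ_CTRL)
                          ? {dst_port_one_hot, in_data[DATA_WIDTH-17:0]}
                          : in_data;
        assign out_ctrl = in_ctrl;
        assign out_wr   = in_wr;
    endmodule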

easier to reason about the performance of the entire pipeline.

request is a read, then also setting the data lines on the

The packet bus consists of a ctrl bus and a data bus along

register bus to the value of the register.

with a write signal to indicate that the buses are carrying

valid information. A ready signal from Stage i + 1 to Stage i

3. USAGE EXAMPLES

provides ow control using backpressure. Stage i + 1 asserts

the ready signal indicating it can accept at least two more This section describes the IPv4 router and two extensions

words of data, and deasserts it when it can accept only one to this router that are used for research: Bu er Monitoring

more word or less. Stage i sets the ctrl and data buses, and OpenFlow. Two other extensions, Time Synchroniza-

and asserts the write signal to write a word to Stage i + 1. tion and RCP, are described in the Appendix.

Packets on the packet bus have the format shown in Fig-

3.1 The IPv4 Router

ure 3. As the packet passes from one stage to the next, a

stage can modify the packet itself and/or parse the packet Three basic designs have been implemented on NetFPGA

to obtain information that is needed by a later stage for using the interfaces described above: a 4-port NIC, an Eth-

additional processing on the packet. ernet switch, and an IPv4 router. Most projects will build
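A single hop on the register ring can therefore be sketched as follows: decode the address, acknowledge and serve requests that fall in this stage's block, and forward everything else untouched. The address-decoding scheme, field widths, signal polarities, and the example counter register are assumptions made for illustration rather than the NetFPGA register-bus definition.

    // One hop of the daisy-chained register bus (illustrative sketch).
    module reg_ring_hop #(
        parameter ADDR_WIDTH = 23,
        parameter DATA_WIDTH = 32,
        parameter BLOCK_ADDR = 8'h12       // assumed tag identifying this stage's registers
    ) (
        input                       clk,
        input                       reset,
        input                       reg_req_in,
        input                       reg_ack_in,
        input                       reg_rd_wr_L_in,   // 1 = read, 0 = write (assumed polarity)
        input  [ADDR_WIDTH-1:0]     reg_addr_in,
        input  [DATA_WIDTH-1:0]     reg_data_in,
        output reg                  reg_req_out,
        output reg                  reg_ack_out,
        output reg                  reg_rd_wr_L_out,
        output reg [ADDR_WIDTH-1:0] reg_addr_out,
        output reg [DATA_WIDTH-1:0] reg_data_out
    );
        reg [DATA_WIDTH-1:0] pkt_count;   // example register exposed to software

        wire addr_match = (reg_addr_in[ADDR_WIDTH-1:ADDR_WIDTH-8] == BLOCK_ADDR);
        wire hit        = reg_req_in && !reg_ack_in && addr_match;

        // One transaction per clock: either respond to the request or pass it on.
        always @(posedge clk) begin
            if (reset) begin
                reg_req_out <= 1'b0;
                reg_ack_out <= 1'b0;
                pkt_count   <= {DATA_WIDTH{1'b0}};
            end else begin
                reg_req_out     <= reg_req_in;
                reg_rd_wr_L_out <= reg_rd_wr_L_in;
                reg_addr_out    <= reg_addr_in;
                reg_ack_out     <= reg_ack_in | hit;     // acknowledge requests we own
                if (hit && reg_rd_wr_L_in)
                    reg_data_out <= pkt_count;           // read: return the register value
                else
                    reg_data_out <= reg_data_in;         // otherwise pass data through
                if (hit && !reg_rd_wr_L_in)
                    pkt_count <= reg_data_in;            // write: update the register
            end
        end
    endmodule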

3. USAGE EXAMPLES

This section describes the IPv4 router and two extensions to this router that are used for research: Buffer Monitoring and OpenFlow. Two other extensions, Time Synchronization and RCP, are described in the Appendix.

3.1 The IPv4 Router

Three basic designs have been implemented on NetFPGA using the interfaces described above: a 4-port NIC, an Ethernet switch, and an IPv4 router. Most projects will build on one of these designs and extend it. In this section, we will describe the IPv4 router on which the rest of the examples in this paper are based.

The basic IPv4 router can run at the full 4x1Gbps line-rate. The router project includes the forwarding path in hardware, two software packages that allow it to build routes and routing tables, and command-line and graphical user interfaces for management.

Software: The software packages allow the routing tables to be built using a routing protocol (PeeWee OSPF [14]) running in user-space completely independent of the Linux host, or by using the Linux host's own routing table. The software also handles slow path processing such as generating ICMP messages, handling ARP, IP options, etc. More information can be found in the NetFPGA Guide [12]. The rest of this subsection describes the hardware.

Hardware: The IPv4 hardware forwarding path lends itself naturally to the classic five stage switch pipeline shown in Figure 2. The first stage, Rx Queues, receives each packet from the board's I/O ports (such as the Ethernet ports and the CPU DMA interface), appends a module header indicating the packet's length and ingress port, and passes it


using the FIFO interface into the User Datapath. The User Datapath contains three stages that perform the packet processing and is where most user modifications would occur. The Rx Queues guarantee that only good link-layer packets are pushed into the User Datapath, and so they handle all the clock domain crossings and error checking.

The first stage in the User Datapath, the Input Arbiter, uses packetized round-robin arbitration to select which of the Rx Queues to service and pushes the packet into the Output Port Lookup stage.

The Output Port Lookup stage selects which output queue to place the packet in and, if necessary, modifies the packet. In the case of the IPv4 router, the Output Port Lookup decrements the TTL, checks and updates the IP checksum, performs the forwarding table and ARP cache lookups, and decides whether to send the packet to the CPU as an exception packet or forward it out one of the Ethernet ports. The longest prefix match and the ARP cache lookups are performed using the FPGA's on-chip TCAMs. The stage also checks for non-IP packets (ARP, etc.), packets with IP options, or other exception packets to be sent up to the software to be handled in the slow path. It then modifies the module header originally added by the Rx queue to also indicate the destination output port.
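The TTL decrement and checksum update mentioned above reduce to a small piece of one's-complement arithmetic: the TTL shares a 16-bit header word with the protocol field, and the checksum can be patched incrementally (the rule of RFC 1624) instead of being recomputed over the whole header. The sketch below shows that arithmetic in isolation; it is not the actual Output Port Lookup code, which folds the same computation into its packet-processing pipeline.

    // Incremental IPv4 checksum update for a TTL decrement (illustrative).
    module ttl_checksum_update (
        input  [7:0]  ttl_in,
        input  [7:0]  protocol,      // shares a 16-bit header word with the TTL
        input  [15:0] checksum_in,
        output [7:0]  ttl_out,
        output [15:0] checksum_out
    );
        // One's-complement addition: wrap the carry back into the low 16 bits.
        function [15:0] ones_add(input [15:0] a, input [15:0] b);
            reg [16:0] sum;
            begin
                sum      = {1'b0, a} + {1'b0, b};
                ones_add = sum[15:0] + {15'b0, sum[16]};
            end
        endfunction

        wire [15:0] old_word = {ttl_in, protocol};
        wire [15:0] new_word = {ttl_in - 8'd1, protocol};

        assign ttl_out = ttl_in - 8'd1;
        // RFC 1624: HC' = ~( ~HC + ~m + m' ), with m the old TTL/protocol word
        // and m' the new one.
        assign checksum_out = ~ones_add(ones_add(~checksum_in, ~old_word), new_word);
    endmodule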

After the Output Port Lookup decides what to do with the packet, it pushes it to the Output Queues stage, which puts the packet in one of eight output buffers (4 for CPU interfaces and 4 for Ethernet interfaces) using the information that is stored in the module header. For the IPv4 router, the Output Port Lookup stage implements the output buffers in the on-board SRAM. When output ports are free, the Output Queues stage uses a packetized round-robin arbiter to select which output queue to service from the SRAM and delivers the packet to the final stage, the destination Tx Queue, which strips out the module headers and puts the packet out on the output port, to go either to the CPU via DMA or out to the Ethernet.

The ability to split up the stages of the IPv4 pipeline so cleanly and hide them under the NetFPGA's pipeline interface allows the module pipeline to be easily and efficiently extended. Developers do not need to know the details of the implementation of each pipeline stage since its results are explicitly present in the module headers. In the next few sections, we use this interface to extend the IPv4 router.

Commercial routers are not usually built this way. Even though the basic stages mentioned here do exist, they are not so easily or cleanly split apart. The main difference, though, stems from the fact that the NetFPGA router is a pure output-queued switch. An output-queued switch is work conserving and has the highest throughput and lowest average delay.

Organizations building routers with many more ports than NetFPGA cannot afford (or sometimes even design) memory that has enough bandwidth to be used in a pure output-queued switch. So, they resort to using other tricks such as virtual input queueing, combined input-output queueing ([1]), smart scheduling ([10]), and distributed shared memory to approximate an output-queued switch. Since the NetFPGA router runs at line-rate and implements output queueing, the main difference between its behavior and that of a commercial router will be in terms of available buffer sizes and delays across it.

3.2 Buffer Monitoring Router

The Buffer Monitoring Router augments the IPv4 router with an Event Capture stage that allows monitoring the output buffers' occupancies in real-time with single cycle accuracy. This extension was needed to verify the results of research on using small output buffers in switches and routers [4]. To do this, the Event Capture stage timestamps when each packet enters an output queue and when it leaves, as well as its length. The host can use these event records to reconstruct the evolution of the queue occupancy from the series.

Since packets can be arriving at up to 4Gbps with a minimum packet length of 64 bytes (84 including preamble and inter-packet gap), a packet will be arriving every 168ns. Each packet can generate two events: once going into a queue, and once when leaving. Since each packet event record is 32 bits, the required bandwidth when running at full line rate is approximately 32 bits / (168/2 ns) = 381Mbps! This eliminates the possibility of placing these timestamps in a queue and having the software read them via the PCI bus, since it takes approximately 1 microsecond per 32-bit read (32 Mbps). The other option would be to design a specific mechanism by which these events could be written to a queue and then sent via DMA to the CPU. This, however, would require too much effort and work for a very specific functionality. The solution we use is to collect these events into a packet which can be sent out an output port, either to the CPU via DMA or to an external host via 1 Gbps Ethernet.

To implement the solution, we need to be able to timestamp some signals from the Output Queues module indicating packet events and store these in a FIFO. When enough events are collected, the events are put into a packet that is injected into the router's pipeline with the correct module headers to be placed in an output queue to send to a host, whether local via DMA or remote.

There are mainly two possibilities for where this extension can be implemented. The first choice would be to add the Event Capture stage between the Output Port Lookup stage and the Output Queues stage. This would allow re-using the stage to monitor signals other than those coming from the Output Queues, as well as separating the monitoring logic from the monitored logic. Unfortunately, since the timestamping happens at single cycle accuracy, signals indicating packet storage and packet removal have to be pulled out of the Output Queues into the Event Capture stage, violating the FIFO discipline and using channels other than the packet and register buses for inter-module communication.

Another possibility is to add the buffer monitoring logic into the Output Queues stage. This would not violate the NetFPGA methodology, at the cost of making it harder to re-use the monitoring logic for other purposes. The current implementation uses the first approach since we give high priority to re-use and flexibility. This design is shown in Figure 4.

The Event Capture stage consists of two parts: an Event Recorder module and a Packet Writer module. The Event Recorder captures the time when signals are asserted and serializes the events to be sent to the Packet Writer, which aggregates the events into a packet by placing them in a buffer. When an event packet is ready to be sent out, the Packet Writer adds a header to the packet and injects it into the packet bus to be sent into the Output Queues.
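The Event Recorder's core job can be pictured as a free-running cycle counter plus a FIFO write whenever the Output Queues pulse a store or remove signal, as in the sketch below. The 32-bit record layout {event type, queue number, timestamp} is an assumed format for illustration, not the one used by the actual Buffer Monitoring design, and the sketch sidesteps the problem of ordering and serializing simultaneous events, which the text notes was the main implementation difficulty.

    // Illustrative event recorder: timestamp queue store/remove pulses.
    module event_recorder #(
        parameter NUM_QUEUES = 8,      // assumes at most 8 queues (3-bit queue id)
        parameter TS_WIDTH   = 28
    ) (
        input                   clk,
        input                   reset,
        input  [NUM_QUEUES-1:0] pkt_stored,   // one pulse per queue on enqueue
        input  [NUM_QUEUES-1:0] pkt_removed,  // one pulse per queue on dequeue
        output reg              evt_wr,       // write strobe toward the event FIFO
        output reg [31:0]       evt_record
    );
        reg [TS_WIDTH-1:0] timestamp;
        reg [3:0]          q;

        always @(posedge clk) begin
            if (reset) begin
                timestamp <= {TS_WIDTH{1'b0}};
                evt_wr    <= 1'b0;
            end else begin
                timestamp <= timestamp + 1;
                evt_wr    <= 1'b0;
                // For brevity, record at most one event per cycle; a real design
                // would serialize simultaneous events instead of dropping them.
                for (q = 0; q < NUM_QUEUES; q = q + 1) begin
                    if (pkt_stored[q]) begin
                        evt_record <= {1'b0, q[2:0], timestamp};   // type 0 = stored
                        evt_wr     <= 1'b1;
                    end else if (pkt_removed[q]) begin
                        evt_record <= {1'b1, q[2:0], timestamp};   // type 1 = removed
                        evt_wr     <= 1'b1;
                    end
                end
            end
        end
    endmodule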

Figure 4: The Event Capture stage is inserted between the Output Port Lookup and Output Queues stages. The Event Recorder generates events while the Event Packet Writer aggregates them into packets.

Figure 5: The OpenFlow switch pipeline implements a different Output Port Lookup stage and uses DRAM for packet buffering.

While not hard, the main difficulties encountered while implementing this system were handling simultaneous packet reads and writes from the output queues, and ordering and serializing them to get the correct timestamps. The system was easily implemented over a few weeks' time by one grad student, and verified thoroughly by another.

3.3 OpenFlow Switch

OpenFlow is a feature on a networking switch that allows a researcher to experiment with new functionality in their own network; for example, to add a new routing protocol, a new management technique, a novel packet processing algorithm, or even eventually alternatives to IP [11]. The OpenFlow Switch and the OpenFlow Protocol specifications essentially provide a mechanism to allow a switch's flow table to be controlled remotely.

Packets are matched on a 10-tuple consisting of a packet's ingress port, Ethernet MAC destination and source addresses, Ethernet type, VLAN identifier (if one exists), IP destination and source addresses, IP protocol identifier, and TCP/UDP source and destination ports (if they exist). Packets can be matched exactly or using wildcards to specify fields that are Don't Cares. If no match is found for a packet, the packet is forwarded to the remote controller that can examine the packet and decide on the next steps to take [13].
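The 10-tuple can be pictured as a single wide flow header formed by concatenating the matched fields, which is what the Header Parser described below produces. The sketch uses the natural protocol field widths and an assumed 8-bit ingress-port encoding; the packing order and widths in the real OpenFlow gateware may differ.

    // Illustrative flow-header packing of the OpenFlow 10-tuple.
    module flow_header_concat (
        input  [7:0]   src_port,    // ingress port (assumed 8-bit encoding)
        input  [47:0]  eth_dst,
        input  [47:0]  eth_src,
        input  [15:0]  eth_type,
        input  [15:0]  vlan_id,     // zero if the packet carries no VLAN tag
        input  [31:0]  ip_dst,
        input  [31:0]  ip_src,
        input  [7:0]   ip_proto,
        input  [15:0]  l4_src,      // TCP/UDP source port, zero if not present
        input  [15:0]  l4_dst,      // TCP/UDP destination port, zero if not present
        output [239:0] flow_header
    );
        assign flow_header = {src_port, eth_dst, eth_src, eth_type, vlan_id,
                              ip_dst, ip_src, ip_proto, l4_src, l4_dst};
    endmodule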

Actions on packets can be forwarding on a specific port, normal L2/L3 switch processing, sending to the local host, or dropping. Optionally, they can also include modifying VLAN tags, modifying the IP source/destination addresses, and modifying the UDP/TCP source/destination ports. Even without the optional per-packet modifications, implementing an OpenFlow switch using a general commodity PC would not allow us to achieve line-rate on four 1Gbps ports; therefore, we implemented an OpenFlow switch on NetFPGA.

The OpenFlow implementation on NetFPGA replaces two of the IPv4 router's stages, the Output Port Lookup and Output Queues stages, and adds a few other stages to implement the actions to be taken on packets, as shown in Figure 5. The OpenFlow Lookup stage implements the flow table using a combination of on-chip TCAMs and off-chip SRAM to support a large number of flow entries and allow matching on wildcards.

As a packet enters the stage, the Header Parser pulls the relevant fields from the packet and concatenates them. This forms the flow header, which is then passed to the Wildcard Lookup and Exact Lookup modules. The Exact Lookup module uses two hashing functions on the flow header to index into the SRAM and reduce collisions. In parallel, the Wildcard Lookup module performs the lookup in the TCAMs to check for any matches on flow entries with wildcards.
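A rough sketch of the double-hashed exact lookup is shown below: two different indices are derived from the same flow header, so flows that collide under one hash are unlikely to collide under the other, and both SRAM buckets are probed. The XOR-folding hash and the table size are stand-ins; the paper does not specify the hash functions actually used.

    // Derive two SRAM indices from one flow header (illustrative hashes).
    module exact_match_index #(
        parameter HEADER_WIDTH = 240,
        parameter INDEX_WIDTH  = 15      // 2^15 = 32K exact-match entries
    ) (
        input  [HEADER_WIDTH-1:0] flow_header,
        output [INDEX_WIDTH-1:0]  index_a,
        output [INDEX_WIDTH-1:0]  index_b
    );
        // XOR-fold a header down to an SRAM index (any trailing bits that do not
        // fill a whole slice are ignored in this sketch).
        function [INDEX_WIDTH-1:0] fold(input [HEADER_WIDTH-1:0] h);
            integer i;
            reg [INDEX_WIDTH-1:0] acc;
            begin
                acc = {INDEX_WIDTH{1'b0}};
                for (i = 0; i + INDEX_WIDTH <= HEADER_WIDTH; i = i + INDEX_WIDTH)
                    acc = acc ^ h[i +: INDEX_WIDTH];
                fold = acc;
            end
        endfunction

        // Two different hashes of the same header: fold it directly, and fold a
        // rotated copy so the two indices collide on different sets of flows.
        wire [HEADER_WIDTH-1:0] rotated = {flow_header[6:0],
                                           flow_header[HEADER_WIDTH-1:7]};

        assign index_a = fold(flow_header);
        assign index_b = fold(rotated);
    endmodule

Each probed bucket would store the full flow header alongside its actions and counters, and a hit requires the stored header to compare equal to the header being looked up.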

The results of both wildcard and exact match lookups are sent to an arbiter that decides which result to choose. Once a decision is reached on the actions to take on a packet, the counters for that flow entry are updated and the actions are specified in new module headers prepended at the beginning of the packet by the Packet Editor.

The stages between the Output Port Lookup and the Output Queues, the OpenFlow Action stages, handle packet modifications as specified by the actions in the module headers. Figure 5 only shows one OpenFlow Action stage for compactness, but it is possible to have multiple Action stages in series, each doing one of the actions from the flow entry. This allows adding more actions very easily as the specification matures.

The new Output Queues stage implements output FIFOs that are handled in round-robin order using a hierarchy of on-chip Block RAMs (BRAMs) and DRAM as in [8]. The head and tail caches are implemented as static FIFOs in BRAM, and the larger queues are maintained in the DRAM.

Running on the software side is the OpenFlow client, which establishes an SSL connection to the controller. It provides the controller with access to the local flow table maintained in the software and hardware, and can connect to any OpenFlow-compatible controller such as NOX [5].

Pipelining two exact lookups to hide the SRAM latency turned out to be the most challenging part of implementing the OpenFlow Output Port Lookup stage. It required modifying the SRAM arbiter to synchronize its state machine to the Exact Lookup module's state machine, and modifying the Wildcard Lookup module to place its results in a shallow FIFO because it finishes earlier and doesn't need pipelining. The Match Arbiter has to handle the delays between the Exact Lookup and Wildcard Lookup modules, and the delays between when the hit/miss signals are generated and when the data is available. To run at line-rate, all lookups had to complete in 16 cycles; so another challenge was compressing all the information needed from a lookup so that it could be pulled out of the SRAM, with its limited bandwidth, in less than 16 cycles.

The hardware implementation currently only uses BRAM for the Output Queues and does not implement any optional


packet modifications. It was completed over a period of five weeks by one graduate student, and can handle full line-rate switching across all ports. The DRAM Output Queues are still being implemented. Integration with software is currently taking place. The exact match flow table can store up to 32,000 flow table entries, while the wildcard match flow table can hold 32. The completed DRAM Output Queues should be able to store up to 5500 maximum-sized (1514 bytes) packets per output queue.

4. LIMITATIONS

While the problems described in the previous section and in the appendix have solutions that fit nicely in NetFPGA, one has to wonder what problems do not. There are at least three issues: latency, memory, and bandwidth.

The first type of problems cannot be easily split into clean computation sections that have short enough latency to fit into a pipeline stage and allow the pipeline to run at line-rate. This includes many cryptographic applications, such as some complex message authentication codes or other public key certifications or encryptions.

Protocols that require several messages to be exchanged, and hence require messages to be stored in the NetFPGA an arbitrary amount of time while waiting for responses, also do not lend themselves to a simple and clean solution in hardware. This includes TCP, ARP, and others.

We already saw one slight example of the third type of problems in the Buffer Monitoring extension. Most solutions that need too much feedback from stages ahead in the pipeline are difficult to implement using the NetFPGA pipeline. This includes input arbiters tightly coupled to the output queues, load-balancing, weighted fair queueing, etc.

5. CONCLUSION

Networking hardware provides fertile ground for designing highly modular and re-usable components. We have designed an interface that directly translates the way packets need to be processed into a simple, clean pipeline that has enough flexibility to allow for designing some powerful extensions to a basic IPv4 router. The packet and register buses provide a simple way to pass information around between stages while maintaining a generic enough interface to be applied across all stages in the pipeline.

By providing a simple interface between hardware stages and an easy way to interact with the software, NetFPGA makes the learning curve for networking hardware much gentler and invites students and researchers to modify or extend the projects that run on it. By providing a library of re-usable modules, NetFPGA allows developers to mix and match functionality provided by different modules and string together a new design in a very short time. In addition, it naturally allows the addition of new functionality as a stage in the pipeline without having to understand the internals of preceding or subsequent stages, or having to modify any of them. We believe that we have been able to achieve the goal we set for NetFPGA of allowing users of different levels to easily build powerful designs in a very short time.

6. ACKNOWLEDGMENTS

We wish to thank John W. Lockwood for leading the NetFPGA project through the rough times of verification and testing, as well as helping to spread the word about NetFPGA. We also wish to thank Adam Covington, David Erickson, Brandon Heller, Paul Hartke, Jianying Luo, and everyone who helped make NetFPGA a success. Finally, we wish to thank our reviewers for their helpful comments and suggestions.

7. REFERENCES

[1] S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar. Matching Output Queueing with a Combined Input Output Queued Switch. In INFOCOM (3), pages 1169-1178, 1999.
[2] N. Dukkipati, G. Gibb, N. McKeown, and J. Zhu. Building a RCP (Rate Control Protocol) Test Network. In Hot Interconnects, 2007.
[3] N. Dukkipati, M. Kobayashi, R. Zhang-Shen, and N. McKeown. Processor Sharing Flows in the Internet. In Thirteenth International Workshop on Quality of Service (IWQoS), 2005.
[4] M. Enachescu, Y. Ganjali, A. Goel, N. McKeown, and T. Roughgarden. Routers With Very Small Buffers. In IEEE Infocom, 2006.
[5] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: Towards an Operating System for Networks. To appear.
[6] M. Handley, E. Kohler, A. Ghosh, O. Hodson, and P. Radoslavov. Designing Extensible IP Router Software. In NSDI'05: Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation, pages 189-202, Berkeley, CA, USA, 2005. USENIX Association.
[7] IEEE. IEEE 1588-2002, Precision Time Protocol. Technical report, IEEE, 2002.
[8] S. Iyer, R. R. Kompella, and N. McKeown. Designing Packet Buffers for Router Line Cards. Technical report, Stanford University High Performance Networking Group, 2002.
[9] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, and J. Luo. NetFPGA: An Open Platform for Gigabit-Rate Network Switching and Routing. In MSE'07: Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, pages 160-161, Washington, DC, USA, 2007. IEEE Computer Society.
[10] N. McKeown. The iSLIP Scheduling Algorithm for Input-Queued Switches. IEEE/ACM Trans. Netw., 7(2):188-201, 1999.
[11] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. To appear in ACM Computer Communication Review.
[12] NetFPGA Development Team. NetFPGA User's and Developer's Guide. Available at http://netfpga.org/static/guide.html.
[13] OpenFlow Consortium. OpenFlow Switch Specification. Available at http://openflowswitch.org/documents.html.
[14] Stanford University. Pee-Wee OSPF Protocol Details. Available at http://yuba.stanford.edu/cs344_public/docs/pwospf_ref.txt.


APPENDIX

A. RCP ROUTER

RCP (Rate Control Protocol) is a congestion control algorithm which tries to emulate processor sharing in routers [2]. An RCP router maintains a single rate for all flows. RCP packets carry a header with a rate field, Rp, which is overwritten by the router if the value within the packet is larger than the value maintained by the router; otherwise it is left unchanged. The destination copies Rp into acknowledgment packets sent back to the source, which the source then uses to limit the rate at which it transmits. The packet header also includes the source's guess of the round trip time (RTT); a router uses the RTT to calculate the average RTT for all

The additional stage, the RCP Stage, parses the RCP packet and updates Rp if required. It also calculates per-port averages and makes this information available to the software via the memory-mapped register interface. The router also includes the Buffer Monitoring stage from the Buffer Monitoring Router, allowing users to monitor queue occupancy evolution when RCP is being used and when it is not.

The design of the system took around two days and the implementation and testing took around 10 days.
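The rate-stamping rule itself is a single comparison per RCP packet, as in the following sketch (the 32-bit rate width is an assumption made for illustration):

    // Overwrite Rp only when the packet's value exceeds the router's rate.
    module rcp_rate_stamp (
        input  [31:0] rp_in,         // Rp carried in the RCP header
        input  [31:0] router_rate,   // rate currently maintained by this router
        output [31:0] rp_out,
        output        rp_modified    // asserted when the header must be rewritten
    );
        assign rp_modified = (rp_in > router_rate);
        assign rp_out      = rp_modified ? router_rate : rp_in;
    endmodule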

B. PRECISION TIME SYNCHRONIZATION ROUTER


