 Published on ONLamp.com (http://www.onlamp.com/)



Running Zebra on a Unix Machine
An Alternative for a Real Router?

by Iljitsch van Beijnum, author of BGP
11/07/2002

If there is one thing I hate when reading a technical book, it's errors in the examples. A misplaced word or two in the text can be distracting, but in an example, it is often deadly. So when I set out to write BGP, I vowed to do everything I could to avoid this problem.

Unfortunately, there is no way to guarantee that errors weren't introduced during the later stages of the writing and publishing process, but as the writer, I could at least make sure the examples were correct when I put them in the manuscript. For this purpose, I had to create the actual configuration I wanted to use for each example, using my trusty old Cisco 2500 at home. But I couldn't do this on a single router: I also needed BGP peers (neighbors) for my router to talk to. I had never been a great fan of host-based routers, but I figured anything that could talk BGP would do as a conversation partner for my Cisco, at least for the basic examples. (Then I'd only need to scavenge routers elsewhere for the more complex ones.)

Enter Zebra

So, the first step was to install the Zebra routing software on my FreeBSD box. Zebra (the brainchild of Kunihiro Ishiguro) is a set of daemons, each implementing a single routing protocol. There are daemons for RIP, RIPng, OSPF, OSPFv6, and BGP, and an extra daemon called zebra that handles the interactions between the different protocols and the kernel routing table. The RIPng and OSPFv6 daemons support IPv6 routing within a single organization's network, and the BGP daemon implements the multiprotocol extensions in order to support IPv6 interdomain routing. (For more on IPv6, see Silvia Hagen's recently released book, IPv6 Essentials.)

Zebra is completely IPv6-aware: connecting to (for instance) the bgpd with the command telnet localhost 2605 works just as well when using the IPv6 loopback address ::1 as it does when using the IPv4 loopback address 127.0.0.1. Managing a program running on the local host by telnetting to it may seem bizarre at first, but this makes it possible for Zebra to closely emulate the Cisco user interface. This meant I could get to work right away: most of the BGP commands are the same as those on a Cisco router, and there's always help in the form of the question mark, which shows a list of possible completions for partially entered commands. And changing the running configuration on the fly is much more efficient than having to edit configuration files and restart the daemon.
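Each daemon listens on its own vty port (zebra on 2601, ripd on 2602, ripngd on 2603, ospfd on 2604, bgpd on 2605, ospf6d on 2606), so a bgpd session might look like the following. The exact banner and version vary with the Zebra release, and the password is whatever the daemon's configuration file sets:

```
$ telnet localhost 2605
Trying ::1...
Connected to localhost.
Escape character is '^]'.

Hello, this is zebra (version 0.93b)
Copyright 1996-2002 Kunihiro Ishiguro

User Access Verification

Password:
bgpd> enable
bgpd# show ip bgp summary
```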

One thing is a bit odd, though: in Zebra, each of the daemons must be configured separately. However, the vtysh utility provides a single shell for configuring all of the daemons at once, so this isn't a huge imposition. Whenever Zebra must be configured differently from Cisco's IOS, the Zebra way usually makes more sense. For instance, on a Cisco, enabling BGP routing for the address range 192.0.2.64/26 is done using the command network 192.0.2.64 mask 255.255.255.192, but this is done differently for OSPF: network 192.0.2.64 0.0.0.63 area 0. In Zebra, the address range is written in prefix format (for example, 192.0.2.64/26) for both BGP and OSPF.
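For example, the same /26 would be configured like this in the two daemons' configuration files (the AS number 64512 is just an illustrative private AS):

```
! bgpd.conf -- prefix notation for BGP
router bgp 64512
 network 192.0.2.64/26
!
! ospfd.conf -- the same prefix notation for OSPF
router ospf
 network 192.0.2.64/26 area 0
```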

Zebra's Advantages

Zebra closely mimics Cisco IOS behavior, even in cases where this behavior is largely arbitrary. Let me give you an example. BGP routes can have a metric, or "Multi Exit Discriminator" (MED), to use the right term. In other routing protocols, such as OSPF, the metric is the primary mechanism to select routes: the one with the lowest metric is preferred. In BGP, the metric or MED is optional, and is only used to differentiate between two or more routes that are otherwise identical, such as in the case where a customer connects to the same ISP over two links. If one of the links is faster, the customer can set a low metric on this link and a higher one on the other link to make sure the ISP sends traffic over the faster link. But the MED is optional, so what should happen if there is a MED for one route, but no MED for another? Cisco routers consider a missing MED to be the best possible one (zero).
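This tie-break is easy to sketch in a few lines of Python (the names and structure here are my own illustration, not Zebra's internals): whether a missing MED counts as zero or as the maximum value flips which route wins.

```python
# Illustrative sketch of BGP's MED tie-break between otherwise-identical
# routes; not Zebra's or IOS's actual code.

MED_MAX = 2**32 - 1  # worst possible MED value

def effective_med(med, missing_as_worst=False):
    """A missing MED counts as 0 (best) by default on Cisco and Zebra,
    or as the maximum (worst) with 'bgp bestpath med missing-as-worst'."""
    if med is None:
        return MED_MAX if missing_as_worst else 0
    return med

def prefer_by_med(routes, missing_as_worst=False):
    # Among comparable routes, the lowest (effective) MED wins.
    return min(routes, key=lambda r: effective_med(r["med"], missing_as_worst))

fast_link = {"next_hop": "10.0.0.1", "med": 50}
slow_link = {"next_hop": "10.0.1.1", "med": None}  # no MED attribute at all

print(prefer_by_med([fast_link, slow_link])["next_hop"])        # 10.0.1.1 (missing MED = best)
print(prefer_by_med([fast_link, slow_link], True)["next_hop"])  # 10.0.0.1 (missing MED = worst)
```

Note how the default behavior rewards the route that carries no MED at all, which is exactly why the missing-as-worst option exists.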

So does Zebra. (In both IOS and Zebra this behavior can be changed with the bgp bestpath med missing-as-worst command, to better conform to IETF guidelines.) This close similarity makes Zebra relatively easy to work with for someone who learned about routers on Cisco equipment (like me). Conversely, experience with Zebra is a good starting point when learning IOS. However, this doesn't mean Zebra is an "IOS emulator." Since Zebra just implements routing protocols and not the underlying packet forwarding, filtering routes is done differently than in IOS. Until a few years ago, Cisco implemented only two packet-filtering mechanisms that could also be applied to route filtering: standard access lists and extended access lists. The syntax for using extended access lists to filter routes is rather convoluted. For instance, an extended access list that matches all routes with prefix lengths of 20 to 24 bits in 192.168.0.0/16 would be:

access-list 169 permit ip 192.168.0.0 0.0.255.255 255.255.240.0 0.0.15.0

Zebra's access list syntax is much simpler, but also less powerful:

access-list test1 permit 192.168.0.0/16

This matches 192.168.0.0/16 or any more specific prefix. The only other option is to add the exact-match keyword, in which case the access list line matches just 192.168.0.0/16 and nothing else. Fortunately, route filtering with access lists is a thing of the past. IOS and Zebra now both implement prefix lists, which are as powerful as Cisco's extended access lists but as simple as Zebra's:

ip prefix-list test2 permit 192.168.0.0/16 ge 20 le 24
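The ge/le semantics can be sketched with Python's ipaddress module (a rough model, not Zebra's implementation): an entry matches when the route falls inside the listed prefix and its prefix length is within the given bounds; without ge or le, only the exact prefix matches.

```python
# Rough model of one prefix-list entry, e.g.
#   ip prefix-list test2 permit 192.168.0.0/16 ge 20 le 24
import ipaddress

def prefix_list_match(route, entry, ge=None, le=None):
    route = ipaddress.ip_network(route)
    entry = ipaddress.ip_network(entry)
    # The route must lie within the listed prefix.
    if route.version != entry.version or not route.subnet_of(entry):
        return False
    # Prefix-length bounds: exact match unless ge/le widen them.
    min_len = ge if ge is not None else entry.prefixlen
    if le is not None:
        max_len = le
    elif ge is not None:
        max_len = route.max_prefixlen  # 'ge' alone allows up to /32
    else:
        max_len = entry.prefixlen
    return min_len <= route.prefixlen <= max_len

print(prefix_list_match("192.168.16.0/20", "192.168.0.0/16", ge=20, le=24))   # True
print(prefix_list_match("192.168.0.0/16", "192.168.0.0/16", ge=20, le=24))    # False: /16 is shorter than ge 20
print(prefix_list_match("192.168.1.128/25", "192.168.0.0/16", ge=20, le=24))  # False: /25 is longer than le 24
```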

Zebra's Disadvantages

As I became more familiar with Zebra, I grew more impressed by the software. Despite the fact that version 1.0 hasn't been released yet (only beta versions are available), I can't remember a single time when the software broke down. I can't say the same thing for the hardware: I had a hard disk crash on me. Fortunately, I had the foresight to install two hard disks and use disk mirroring with the vinum volume manager. The fact that they have hard disks has always been my biggest problem with host-based routers. Even if they don't crash (which they all do eventually), having a hard disk inside of a router means problems when the power fails: before you know it, your router is doing a file system check.

"Real" routers aren't bothered by the power going away, since all of their software and configuration data is stored in flash or non-volatile memory. They don't even have a shutdown command, just a power switch. Just recently, I found out that it's fairly simple to have a Unix machine boot and run from flash memory. As it turns out, CompactFlash memory cards use an IDE interface. With the right converter, they can be attached to an IDE interface on the motherboard of a PC. The BIOS will then happily recognize the card as a hard disk and boot from it.

Another thing that always used to worry me about host-based routers is IP forwarding performance. But some tests I did with Gigabit Ethernet cards in FreeBSD boxes convinced me that a PC-based system can handle several hundred megabits per second coming in or going out. Unfortunately, I was unable to fully test the routing performance for lack of enough machines to act as source and sink for the necessary amounts of traffic. And good throughput between two boxes doesn't necessarily translate into good real-world performance as a BGP router. Currently, a full BGP feed is about 110,000 routes.

Whether or not a system can achieve good forwarding performance with so many routes in its routing table is highly dependent on the route-lookup algorithm it uses for the majority of the forwarded packets. Cisco routers implement several ways to do this. In "process switching," a regular process reads a packet from the buffer where packets are stored as they come in, and then looks up the destination in the main routing table and schedules the packet for transmission on the right output interface. On a Cisco, this is slow. On a Unix machine, this would hardly work at all: the forwarding process would have to contend with other user processes for CPU time, and may even be swapped out to disk!

Fast Switching Methods

To increase forwarding performance, IOS implements "fast switching." This forwarding algorithm uses a route cache that stores the most recently used routes in a data structure that can be searched more efficiently than the main routing table. With fast switching, packets aren't stored in a buffer for further processing, but the forwarding algorithm is executed immediately as the CPU tends to the interrupt caused by the arrival of a packet. When the packet can't be fast switched because the destination can't be found in the route cache (or for another reason), it is handed over to regular process switching. As the packet is then process switched, a route cache entry is created so subsequent packets can be fast switched.

An even faster switching method is Cisco express forwarding (CEF). CEF also operates at the interrupt level, but unlike fast switching, it employs a dedicated process for building the CEF data structures in memory. This CEF table holds a copy of the entire routing table, so there is no need to process switch the first packet towards any given destination.

The fast switching route cache uses a radix tree structure to store next hop information (MAC address and output interface). Since an IP address has 32 bits, the radix tree has a depth of 32 levels, and looking up a route requires a maximum of 32 steps, assuming the right route is present. CEF, on the other hand, uses a 256-way trie structure. This makes it possible to search the tree in at most four steps, each evaluating 8 bits in one go. And since the next hop information is no longer stored in the tree structure itself, there is additional flexibility. For instance, the CEF table can encode recursive routing information.
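The difference in lookup depth is easy to see: both structures consume the 32 destination-address bits from the top, and the stride size determines how many levels must be visited. This is a toy illustration, not the actual Cisco or BSD data structures:

```python
# Toy illustration of trie lookup depth: a bitwise (binary) trie walks up
# to 32 one-bit levels, while an 8-bit-stride (256-way) trie walks at most
# 4 levels for an IPv4 address.

def walk_bits(addr, stride):
    """Yield successive 'stride'-bit chunks of a 32-bit IPv4 address,
    most significant first -- one chunk per trie level visited."""
    for shift in range(32 - stride, -1, -stride):
        yield (addr >> shift) & ((1 << stride) - 1)

addr = 0xC0A81001  # 192.168.16.1 as a 32-bit integer

binary_levels = list(walk_bits(addr, 1))  # 32 one-bit decisions
byte_levels = list(walk_bits(addr, 8))    # 4 eight-bit (256-way) decisions

print(len(binary_levels), len(byte_levels))  # 32 4
print(byte_levels)                           # [192, 168, 16, 1]
```

The trade-off is memory: each 256-way node is much larger than a binary node, which is part of why CEF builds its tables with a dedicated process.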

So how is this done under Unix? Not all that differently: the 1990 4.3BSD-Reno interim release introduced a radix tree as the data structure for the kernel routing table. Unix IP forwarding is thus not quite as advanced as Cisco's CEF, but it improves on fast switching, because the radix tree holds the full routing table and never has to fall back to process switching to populate a cache. So a Cisco will be somewhat faster than a Unix system with a similar CPU, but since in practice Unix systems have much faster CPUs than Cisco routers, this more than makes up for the difference.

A somewhat unfortunate similarity between Cisco and Unix is the size of these tables. On a Cisco, entries in the BGP table, the main routing table, and the CEF table all take roughly 100 to 300 bytes of memory per route. The FreeBSD kernel also uses nearly 300 bytes per route, as do the Zebra main routing table and the BGP table. (Note that on some systems, the kernel has a limit on the amount of memory that the routing table may use.)
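Plugging in the numbers gives a feel for the memory involved (rough arithmetic only, and remember that the feed is held in several of these tables at once):

```python
# Back-of-the-envelope memory use for a full BGP feed, using the 2002
# figures from the text: ~110,000 routes at roughly 100-300 bytes each.
routes = 110_000
for bytes_per_route in (100, 300):
    mb = routes * bytes_per_route / 2**20
    print(f"{bytes_per_route} B/route: ~{mb:.0f} MB per table")
# 100 B/route: ~10 MB per table
# 300 B/route: ~31 MB per table
```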

Summary

All in all, I have to admit a Unix box running Zebra is a decent alternative to a "real" router. On the other hand, I still prefer the tight integration between hardware and software that router vendors offer, as long as their products aren't overpriced and underpowered. It's good to have choices.

Iljitsch van Beijnum has been working with BGP in ISP and end-user networks since 1996.


O'Reilly & Associates released BGP in September 2002.