The A to Zabbix of Trapping & Polling

Monitoring is one of the most crucial parts of running any network. There are many tools available to perform network monitoring, some of which are more flexible than others. This quirk looks at the Zabbix monitoring platform – more specifically, how you can use combined SNMP polling and trapping triggers to monitor an IP network, based on Zabbix version 3.2.

The blog assumes you’re already familiar with the workings of Zabbix. However, if you aren’t, the following section gives a whistle-stop tour, from the perspective of discovering and monitoring network devices using SNMP. If you are already familiar with Zabbix, skip to The quirk section below.

Zabbix – SNMP Monitoring Overview

Zabbix can do much (much) more than I’ll outline here, but if you’re not familiar with it, I’ll describe roughly how it works in relation to this quirk.

The Zabbix application is installed on a central server with the option of having one or more proxy servers that relay information back to the central server. Zabbix has the capability to monitor a wide range of environments from cloud storage platforms to LAN switching. It uses a variety of tools to accomplish this but here I’ll focus on its use of SNMP.

Anything that can be exposed in an SNMP MIB can be detected and monitored by Zabbix. Examples of metrics or values that you might want to monitor in a networking environment include:

  • Interface states
  • Memory and CPU levels
  • Protocol information (neighbor IPs, neighborship status, etc.)
  • System uptime
  • Spanning-Tree events
  • HA failover events

In Zabbix these metrics/values are called items. A device that is being monitored is referred to as a host.

Zabbix monitors items on hosts by both SNMP polling and trapping. It can, for example, poll a switch’s interfaces every 5 minutes and alert if a poll response comes back stating the interface is down (the ifOperStatus OID is good for this). Alternatively an item can be configured to listen for traps. If a switch interface drops, and that switch sends an SNMP trap (either to the central server or one of its proxies), Zabbix can pick this up and trigger an alert.

So how is it actually configured and setup?

The configuration of Zabbix to monitor SNMP follows these basic steps:

  • Add a new host into Zabbix – including its IP, SNMP community and name. The device in question will need to have the appropriate read-only SNMP community configured and have trapping/polling allowed to/from the Zabbix address.

blog8_image1_hostconfig

  • Configure items for that host – An item can reference a poll (e.g. poll this device for its CPU usage) or a trap (e.g. listen for an ‘interface up/down’ trap).

blog8_image2_itemconfig

  • Configure triggers that match particular expressions relating to one or more items. For example, a trigger could be configured to match against the ‘CPU usage’ item receiving a value (through polling) of 90 or more (e.g. 90% CPU). The trigger will then move from an OK state to a PROBLEM state. When the trigger clears (more on that below) it will move from a PROBLEM state back to an OK state.

blog8_image3_triggerconfig

  • Configure actions that correspond to triggers moving to a PROBLEM state – options depend on the severity level of the trigger, but could include sending an email or integrating with the API of a service like PagerDuty to send an SMS

This process is pretty simple on the face of things, but what happens if you have 30 switches with 48 interfaces each? You couldn’t very well configure 30×48 items that monitor interface states. That’s a lot of copy and pasting!

Thankfully, Zabbix has two features that allow for large scale deployments like this: 

Templates – Templates allow you to configure what are called prototype items and triggers. These prototypes are all bundled into one common template. You can then apply that template to multiple devices and they will all inherit the items and triggers without them needing to be configured individually.

Low Level Discovery – LLD allows you to discover multiple items based on SNMP tables. For example, if you create an LLD rule with the SNMP OID ifIndex (1.3.6.1.2.1.2.2.1.1) as the Key, Zabbix will walk that table and discover all of the device’s interfaces. You can then take the index of each row in the table and use it to create items and triggers based on other SNMP tables. For example, after discovering all the rows of the ifIndex table, you could use the SNMP index of each row to find the ifOperStatus of each of those interfaces. It doesn’t matter whether the host has 48 or 8 interfaces; they will all be added by this LLD rule. Here’s an example of the principle using snmpwalk:

blog8_image4_snmpsample
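To illustrate the same principle in text form (the device address and community string here are hypothetical), walking ifIndex and then ifOperStatus with the net-snmp tools looks roughly like this:

$ snmpwalk -v2c -c public 192.0.2.10 1.3.6.1.2.1.2.2.1.1
IF-MIB::ifIndex.1 = INTEGER: 1
IF-MIB::ifIndex.2 = INTEGER: 2
IF-MIB::ifIndex.3 = INTEGER: 3

$ snmpwalk -v2c -c public 192.0.2.10 1.3.6.1.2.1.2.2.1.8
IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: down(2)
IF-MIB::ifOperStatus.3 = INTEGER: up(1)

Each index discovered in the first walk can be reused to build the per-interface OID in the second.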

This is, of course, a very high-level overview of Zabbix, intended as a brief snapshot for those who haven’t worked with it before.

Before getting into the specifics of this quirk, I’ll go into a little more detail on how triggers work, since they play a crucial role here…

A trigger is an expression that is applied to an item and, as you might expect, is used to detect when a problem occurs. A trigger has two states: OK or PROBLEM. To detect when a problem occurs, a trigger uses an aptly named problem expression. The problem expression is basically a statement that describes the conditions under which the trigger should go off (e.g. move from OK to PROBLEM).

Examples of a problem expression could be “the last poll of interface x on switch y indicates it is down” or “the last trap received from switch y indicates interface x is down”.

Triggers also have a recovery expression. This is sort of the opposite of a problem expression. Once a trigger goes off, it will remain in the PROBLEM state until such time as the problem expression is no longer true. If the problem expression suddenly evaluates to false, the trigger will move to looking at the recovery expression (if one exists). At this point, the trigger will stay in a PROBLEM state until the recovery expression becomes true. The distinction to pay attention to here is that even though the original condition that caused the trigger to go off is no longer true, the trigger remains in a PROBLEM state until the recovery expression is true. Most importantly, the recovery expression is not evaluated until the problem expression is false. Remember this for later.

So, with all of that said, let’s take a look at the quirk.

The quirk

This quirk explores how to configure triggers within Zabbix to use both polling and trapping to monitor a network device such as a router or switch.

To illustrate the idea I will keep it simple – interface states. Imagine a template applied to a switch that uses LLD to discover all of the interfaces using the ifIndex table.

Two item prototypes are created:

One that polls the interface state (ifOperStatus) every 5 minutes

and

One that listens for traps about interface states – either going down (for example listening for 1.3.6.1.6.3.1.1.5.3 linkDown traps) or coming up (for example listening for 1.3.6.1.6.3.1.1.5.4 linkUp traps)
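As a sketch of how these two item prototypes might look in Zabbix 3.2 (the names, keys and intervals below are illustrative assumptions, not taken from the original screenshots):

Polling item prototype (type: SNMPv2 agent)
  Name:            Interface {#SNMPINDEX} operational status
  Key:             ifOperStatus[{#SNMPINDEX}]
  SNMP OID:        1.3.6.1.2.1.2.2.1.8.{#SNMPINDEX}
  Update interval: 300 seconds

Trap item prototype (type: SNMP trap)
  Name:                Interface {#SNMPINDEX} link traps
  Key:                 snmptrap["Link (up|down) on interface {#SNMPINDEX}"]
  Type of information: Log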

The question is, how should the trigger be configured? We do not want to miss an interface that flaps. If an interface drops, we want the trigger to move to a PROBLEM state. But if our trigger is just monitoring the polling item and the interface goes down and comes back up within a polling cycle then Zabbix won’t see the flap.

To illustrate these concepts, I’ll use a diagram that shows a timeline together with what polling and trapping information is received by Zabbix. It uses the following legend:

blog8_image5_legend

This first diagram illustrates how Zabbix could “miss” an interface flap, if it occurs between polling responses:

blog8_image6_diagram1

You can see here that, without trapping, as far as Zabbix is concerned the interface never drops.

So what if we just make our trigger monitor traps?

This also runs into trouble when you consider that SNMP runs over UDP and there is no guarantee that a trap will get through (especially if the interface drop affects routing or forwarding). Worse still, if the trap stating that the interface is down (the DOWN trap) makes it to Zabbix but the recovery trap (the UP trap) doesn’t make it to Zabbix then the trigger will never recover!

blog8_image7_diagram2

It appears that both approaches on their own have setbacks. The logical next step would be to look at combining the best of both worlds – i.e. configure a trigger that will move to a PROBLEM state if it receives a DOWN trap or a poll sees the interface as down. That way, one backs the other up. The idea looks like this:

blog8_image8_diagram3

Seems simple enough. However, the quirk arises when you realise there is still a problem with this approach: namely, if the UP trap is missed, the trigger will still not recover.

To understand why, we’ll look at the logic of the trigger expression. The trigger expression is a disjunction – an or statement. The two parts of this or statement are:

The last poll states the interface is down

OR

The last trap received indicates the interface is down

A disjunction only requires one of the parts to be true for the whole expression to be true.

Consider this scenario: A DOWN trap is received, making the second part of the expression true. The trigger moves to a PROBLEM state. So far so good. Now imagine a few minutes later the interface comes back up but the UP trap is never received by Zabbix. Because this is a disjunction, even if the last poll shows the interface as up, the second half of the expression is still true – as far as Zabbix is concerned, the last trap it received showed the interface as down. As a result the alert will never clear (meaning the trigger will never move from PROBLEM back to OK).

blog8_image9_diagram4

There needs to be some way to configure the combination of the two that doesn’t leave the trigger stuck in a PROBLEM state. This is where the recovery expression comes into play…

The search

To focus on finding a solution we will first look at solving the missing UP trap problem. For now, don’t worry about polling.

Let’s say we have a trigger with the following trigger expression:

The last trap received indicates the interface is down

Then clearly, if the trigger has gone off and we miss the UP trap when the interface recovers, this alert will never clear. So what if we combine this, using an and statement, with something else: something that will, no matter what, eventually become false? Since an and statement is a conjunction, both parts will need to be true. We can then use the recovery expression to control when the trigger moves back to an OK state.

We can leverage polling for this since, if the interface is down, polling will eventually detect it. So our trigger expression changes to this:

The last trap received indicates the interface is down

AND

The last poll states the interface is up

At first this might seem counterintuitive given what we looked at above, but consider that when an interface drops and the switch sends a trap to Zabbix, stating that the interface is down, the last poll that Zabbix made to the switch should have shown the interface as up – hence both statements are true and the trigger correctly moves to a PROBLEM state.

But as soon as polling catches up and detects that the interface is down, the second part of our trigger expression will become false. This makes the whole trigger expression false (since it is a conjunction) and the trigger will recover and move back to an OK state.

blog8_image10_diagram5

Now this is obviously not good. The interface is, after all, still down! But we can use the recovery expression to control when the trigger recovers.

Remember from earlier that if a recovery expression exists, it will be looked at once the problem expression becomes false.

We couldn’t just configure a recovery expression on its own, without the above tweak, since as long as the problem expression stays true the recovery expression will be ignored.

From here the solution is simple. Our recovery expression simply states:

The last two polls that we received stated the interface was up.

This means that as soon as polling detects that the interface is down, the problem statement becomes false and the recovery expression is looked at. Now, until two polls in a row detect that the interface is up, the trigger will stay in a PROBLEM state.

blog8_image11_diagram6

Interestingly, what we’ve essentially done is solve the missing UP trap problem by removing the need to rely on UP traps at all! After two UP polls the trigger recovers (note the blue line of the timeline in the above diagram). You could optionally add an “…or an UP trap is received” clause to the recovery statement to make the recovery quicker.

But there is a caveat to this case…

Consider what happens if an interface flaps within a polling cycle, meaning that as far as polling is concerned, the interface never goes down. This would mean that, in the event that the UP trap is missed, the problem statement will never become false. This means the trigger will never recover and we’re back to square one…

blog8_image12_diagram7

What we need is something that will inevitably cause the trigger statement to become false. Using polling doesn’t work because as we have seen, it can “miss” an interface flap.

Fortunately, Zabbix has a function called nodata which can help us. The function is documented in the Zabbix manual and works as follows:

nodata(x) returns 1 (true) if the referenced item has received no data during the last x seconds, and 0 (false) if it has.

To better understand this, let’s see what happens if we remove the statement The last poll states the interface is up and replace it with one that implements this function. Our trigger statement would then become the following:

The last trap received indicates the interface is down

AND

There has been some trap data received in the last x seconds (where x is bigger than the polling interval)

The second part of this conjunction is represented by trap.nodata(350) = 0 (i.e. “it is false that there has been no trap information received in the last 350 seconds”, which basically means “some trap information has been received in the last 350 seconds”).

Once the 350 seconds expires that statement becomes false and the trigger moves to looking at the recovery expression. Remember our polling interval was 5 minutes, or 300 seconds. 

The value x must be longer than a polling interval; this gives the polling a chance to catch up, as it were. Consider a scenario where x is less than a single polling interval and the interface drops just after the last poll. The nodata(x) expression will expire before the next poll comes through. When this happens, the trigger statement is false, so Zabbix will move to look at the Recovery Expression (which states that the last two polls are up). Zabbix will see the last two polls as up and the trigger will recover while the interface is still down!

blog8_image13_diagram8

If x is bigger than the polling interval, polling can catch up and the trigger behaves correctly.

blog8_image14_diagram9

Now that we have solved this we can reintroduce polling into the trigger. Remember that the initial DOWN trap could still be missed. We saw that there were problems when trying to integrate polling and trapping together into a trigger’s Problem Expression, but we can easily create a single poll-based trigger.

This trigger can be relatively simple. The Problem Expression simply states that the last two polls show the interface as down. There doesn’t need to be a Recovery Expression, since when the trigger sees two UP polls it can recover without problems.

Now we’ve got another problem though. We don’t want two triggers to go off for just one event. Thankfully, Zabbix has trigger dependencies. If we configure the poll-based trigger to only move to a PROBLEM state if the trap-based trigger is not in a PROBLEM state, then the poll-based trigger effectively acts as a backup to the trap-based one. I’ll explore the exact configuration of this in The Work section.

Once this has been configured you’ll have a working solution that supports both polling and trapping without having to worry about alerts not triggering or clearing when they should. Let’s take a look at how this is configured in the Zabbix UI.

The Work

In this section I will show screenshots of the triggers that are used in the aforementioned solution. I haven’t shown the configuration of the LLD or of any corresponding Actions (that will result in email or text messages being sent), but Zabbix has excellent documentation on how to configure these features.

First we’ll look at the trapping configuration:

blog8_image15_traptrigger

The Name field can use variables based on OIDs (like ifDescr and ifAlias) that are defined in the Low Level Discovery rule, so that the trigger contains meaningful information about the affected interface. The trigger expression references the trap item that listens for interface down traps.

The trap item itself looks at the log output produced by snmptrapd passing traps through an SNMPTT config file. This process parses incoming traps and creates log entries, which trap items can then match against.

In this case, the item matches against log entries containing the string

“Link up on interface {#SNMPINDEX}” – which is produced when a linkUp trap is received

Or

“Link down on interface {#SNMPINDEX}” – which is produced when a linkDown trap is received

where {#SNMPINDEX} is the index of the table entry for the ifIndex table.

In this trigger expression the trap item is referenced twice. Firstly, it checks whether the trap item contains the “Link down” substring (i.e. whether a down trap has been received for that ifIndex). Secondly, it uses nodata(350) = 0 (false) – meaning that “some trap data has been received in the past 350 seconds”.

This matches the pseudo-expression we have above:

The last trap received indicates the interface is down

AND

There has been some trap data received in the last x seconds (where x is bigger than the polling interval).
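In Zabbix 3.2 trigger syntax, the problem expression might look roughly like the sketch below. The template name and item keys are assumptions that follow the item prototypes sketched earlier, not the exact ones from the screenshots:

{Template Net Switch:snmptrap["Link (up|down) on interface {#SNMPINDEX}"].str("Link down")}=1
and
{Template Net Switch:snmptrap["Link (up|down) on interface {#SNMPINDEX}"].nodata(350)}=0

The str() function returns 1 when the latest value of the trap item contains the given substring, and nodata(350) returns 0 as long as the item has received something in the last 350 seconds.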

If a trap is received stating the interface is up, the trap item will no longer contain the string “link down” – rather it will contain “link up”, so the first part will become false.

Alternatively, if no trap is received in 350 seconds (either UP or DOWN), the second half of the AND statement will become false. The polling interval is less than 350 seconds, so if the UP trap is missed, polling will have a chance to catch up.

Either way, the trigger will eventually look at the recovery expression. The recovery expression references the ifOperStatus item and the ifAdminStatus item.

The recovery expression basically states:

IF

The last two polls of the interface’s operational state show it as up

OR

The last poll of the administrative state of the interface is down (i.e. someone has issued ‘shutdown’ on the interface, if it’s an interface on a Cisco device)

THEN recover.

The second half of the disjunction is used to account for scenarios where an engineer deliberately shut down an interface – in which case you would not want the alert to persist.
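A hedged sketch of that recovery expression, again using the assumed item keys (ifOperStatus 1 = up, ifAdminStatus 2 = down):

{Template Net Switch:ifOperStatus[{#SNMPINDEX}].count(#2,1,"eq")}=2
or
{Template Net Switch:ifAdminStatus[{#SNMPINDEX}].last()}=2

count(#2,1,"eq")=2 is true only when both of the last two polled values equal 1 (up), which matches the “two UP polls” condition described above.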

Next we’ll look at the polling trigger:

blog8_image16_polltrigger

This one is much simpler. The trigger will go off if the last two polls of the interface indicate that the operational state is down (2) AND the admin state is up (1) – meaning that it hasn’t been manually shut down by an engineer.
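Using the same assumed item keys, the poll-based problem expression might look something like this:

{Template Net Switch:ifOperStatus[{#SNMPINDEX}].count(#2,2,"eq")}=2
and
{Template Net Switch:ifAdminStatus[{#SNMPINDEX}].last()}=1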

Finally, the last trick to making this solution work is in the dependencies tab of this trigger prototype:

blog8_image17_dependency

In this screen, the trap-based trigger has been selected as a dependency for the poll-based trigger. This means that the poll-based trigger will only go off if the trap-based trigger hasn’t gone off.

So that’s the work involved in configuring the actual triggers and it brings us to the end of this quirk. It demonstrates how to combine polling and trapping into Zabbix triggers to allow for consistent and correct alerting.

Zabbix has a wide range of functions and capabilities – far more than I’ve outlined here. There may very well be another way to accomplish the same goal, so as usual, any thoughts or ideas are welcome.

The Friend of my Friend is my Enemy

Imagine you’re a provider routing a PI space prefix for one of your customers. Now imagine that one of your IX peers starts to advertise a more specific subnet of that customer network to you. How would, and how should, you forward traffic destined for that prefix? This quirk looks at just such a scenario from the point of view of an ISP that adheres to BCP38 best practice filtering policies…

The quirk

So here’s the scenario:

Blog7_image1_setup

In this setup Xellent IT Ltd is both a customer and a provider. It provides transit for ACME Consulting but it is a customer of Provider A. ACME owns PI space and chooses to implement some traffic engineering. It advertises a /23 to Xellent IT and a /24 to Provider B.

Now Provider B just happens to peer with Provider A over a public internet exchange. The quirk appears when traffic from the internet, destined to 1.1.1.1/32, enters Provider A’s network, especially when you consider that Provider A implements routing policies that adhere to BCP38.

But first, what is BCP38?

You can read it yourself (it is published as RFC 2827), but in short, it is a Best Current Practice document that advocates prefix filtering to minimise threats like DDoS attacks. It does this by proposing inbound PE filtering on customer connections that blocks traffic whose source address does not match that of a known downstream customer network. DDoS attacks commonly use spoofed source addresses, so if every provider filtered traffic from its customers to make sure that the source address was from the right subnet (and not spoofed), these kinds of DoS attacks would disappear overnight.

To quote the BCP directly:

In other words, if an ISP is aggregating routing announcements for multiple downstream networks, strict traffic filtering should be used to prohibit traffic which claims to have originated from outside of these aggregated announcements.
BCP38 – P. Ferguson, D. Senie

To put it in diagram form, the basic idea is as follows:

Blog7_image3_BCP38_inbound

A provider can also implement outbound filtering to achieve the same result. That is to say, outbound filters can be applied at peering and transit points to ensure that the source addresses of any packets sent out come from within the customer cone of the provider (a customer cone is the set of prefixes sourced by a provider, either as PI or PA space, that makes up the address space for its customer base). This can be done in conjunction with, or instead of, the inbound filtering approach.

Blog7_image4_BCP38_outbound

There are multiple ways a provider can build their network to adhere to BCP38. As an example, an automated tool could be built that references an RIR database like RIPE’s. This tool could perform recursive route object lookups on all autonomous systems listed in the provider’s AS-SET and build an ACL that blocks all outbound border traffic whose source address is not in the resulting list.
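Conceptually, the resulting outbound filter might look something like the IOS-style sketch below. The customer-cone ranges and the interface name are hypothetical placeholders:

ip access-list extended BCP38-OUT
 permit ip 198.51.100.0 0.0.0.255 any
 permit ip 203.0.113.0 0.0.0.255 any
 deny   ip any any
!
interface TenGigabitEthernet0/0/0
 description Peering / transit facing interface
 ip access-group BCP38-OUT out

Only packets whose source address falls inside the provider’s customer cone are allowed out towards peers and transits.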

Regardless of the method used, this quirk assumes that Provider A is using both inbound and outbound filtering. But as we’ll see, it is the outbound filtering that causes all the trouble… here’s the traffic flow:

Blog7_image2_traffic_blackholing

Now you might ask why the packet would follow this particular path. Isn’t Provider B advertising the more specific /24 it receives from ACME? How come the router that sent the packet to Provider A over the transit link can’t see the /24?

There are a number of reasons for this, and it depends on how the network of each autonomous system along the way is designed. However, one common reason is a traffic engineering service offered by internet providers, commonly called prefix scoping.


Prefix scoping allows a customer to essentially tell its provider how to advertise its prefix to the rest of the internet. This is done by including predetermined BGP communities in the prefix advertisements. The provider will recognise these communities and alter how they advertise that prefix to the wider internet. This could be done through something like route-map filtering on these communities.
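As a rough IOS-style sketch of how a provider might honour such a community (the ASNs, community value and neighbor address are hypothetical), the transit-facing export policy could look like this:

ip community-list standard NO-TRANSIT permit 65020:80
!
route-map TRANSIT-OUT deny 10
 match community NO-TRANSIT
!
route-map TRANSIT-OUT permit 20
!
router bgp 65020
 neighbor 192.0.2.1 remote-as 65030
 neighbor 192.0.2.1 route-map TRANSIT-OUT out

Prefixes tagged with the “no transit” community by the customer are simply dropped from the advertisements sent to transit neighbors.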

In this scenario, perhaps Provider B is offering such a service. ACME may have chosen to attach the ‘do not advertise this prefix to your transit provider x’ community to its BGP advertisement to Provider B. As a result, the /24 prefix doesn’t reach the router connecting to Provider A over its transit link, so it forwards according to the /23.

This is just one example of how traffic can end up at Provider A. For now, let’s get back to the life of this packet as it enters Provider A.

Upon receipt of the packet destined for 1.1.1.1/32, Provider A’s border router will look in its routing table to determine the next hop. Because it is more specific, the 1.1.1.0/24 learned over peering will be seen in the RIB as the best path, not the /23 from the Xellent IT link. The packet is placed in an LSP (assuming an MPLS core) with a next hop of the border router that peers with Provider B at the Internet Exchange.

You can probably see what’s going to happen. When Provider A’s border router at the Internet Exchange tries to forward the packet to Provider B it has to pass through an outbound ACL. This ACL has been built in accordance with BCP38. The ACL simply checks the source address to make sure it is from within the customer cone of Provider A. Since the source address is an unknown public address sourced from off-net, the packet is dropped.

Now this is inherently a good thing, isn’t it? Without this filtering, Provider A would be providing transit for free! However, it does pose a problem after all, since traffic for one of its customers’ subnets is being blackholed.

From here, ACME Consulting gets complaints from its customers that they can’t access their webserver. ACME contacts its transit providers and before you know it, an engineer at Provider B has done a traceroute and calls Provider A to ask why the final hop in the failed trace ends in Provider A’s network.

So where to from here? What should Provider A do? It doesn’t want to provide transit for free, and its policy states that BCP38 filtering must be in place. Let’s explore the options.

The Search

Before I look at the options available, it is worth pausing here to reference an excellent paper by Pierre Francois of the Universite catholique de Louvain entitled Exploiting BGP Scoping Services to Violate Internet Transit Policies. It describes the principles underlying what is happening in this quirk at a higher level and sheds light on why it happens. I won’t go into exhaustive detail (I highly recommend reading the paper yourself), but to summarise, there are 3 conditions that come together to cause this problem.

  1. The victim provider whose policy is violated (Provider A) receives the more specific prefix only from peers or transit providers.
  2. The victim provider also has a customer path towards the less specific prefix.
  3. Some of the victim provider’s peers or transit providers did not receive the more specific path.

This is certainly what is happening here. Provider A sees a /24 from its peer (condition 1), a /23 from its customer (condition 2) and the Transit router that forwards the packet to Provider A cannot see the /24 (condition 3). The result of these conditions is that the packet is being forwarded from AS to AS based on a combination of the more specific route and the less specific route. To quote directly from Francois’ paper:

The scoping being performed on a more specific prefix might no longer let routing information for the specific prefix be spread to all ASes of the routing system. In such cases, some ASes will route traffic falling to the range of the more specific prefix, p, according to the routing information obtained for the larger range covering it, P.
Exploiting BGP Scoping Services to Violate
Internet Transit Policies – Pierre Francois

So what options does Provider A have? How can it ensure that traffic isn’t dropped, but at the same time, make sure it can’t be abused into providing free transit for off-net traffic? Well there’s no easy answer but there are several solutions that I’ll consider:

  • Blocking the more specific route from the peer
  • Asking Xellent IT Ltd to advertise the more specific
  • Allowing the transit traffic, but with some conditions

I’ll try to argue that allowing the transit traffic, but only as an exception, is the best course of action. But before that, let’s look at the first two options.

Let’s say Provider A applies an inbound route-map on its peering with Provider B (and all other peers and transits for that matter) to block any advertised prefixes that come from its own customer cone (basically, stopping its own prefixes being advertised towards itself from a non-customer). So Provider A would see Provider B advertising 1.1.1.0/24, recognise it as part of Xellent IT’s supernet, and block it.
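A minimal IOS-style sketch of that inbound filter is shown below. I’ve assumed the Xellent IT supernet is 1.1.0.0/23 (the /23 covering 1.1.1.0/24 in this scenario); in practice the prefix-list would hold the whole customer cone:

ip prefix-list OWN-CUSTOMER-CONE seq 5 permit 1.1.0.0/23 le 32
!
route-map PEER-IN deny 10
 match ip address prefix-list OWN-CUSTOMER-CONE
!
route-map PEER-IN permit 20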

This would certainly solve the problem of attempting to forward the traffic out of the Internet Exchange. Unfortunately, there are two crushing flaws with this approach.

Firstly, it undermines the intended traffic engineering employed by ACME and comes with all the inherent problems that asymmetric routing holds. For example, traffic ingressing back into ACME via Xellent IT could get dropped by a session-based firewall that it didn’t go through on its way out. Asymmetric routing is a perfect example of the problems that can result from some ASes forwarding on the more specific route and others forwarding on the less specific route.

Second, consider what happens if the link to Xellent IT goes down, or if Xellent IT stops advertising the /23. Suddenly Provider A has no access to the /24 network. Provider A is, in essence, relying on a customer to access part of the internet (this is of course assuming Provider A is not relying on any default routing). This would not only undermine ACME’s dual homing, but would also stop Provider A’s other customers reaching ACME’s services.

Blog7_image5_block_24 

Clearly, blocking the more specific doesn’t solve anything. The traffic might get through Provider A, but it is still being forwarded on a combination of prefix lengths, and Provider A could end up preventing traffic from its other customers reaching a part of the internet. Not a good look for an internet provider.

What about asking Xellent IT to advertise the more specific? Provider A could then simply prefer the /24 from Xellent IT using local preference. This approach has problems too. ACME isn’t actually advertising the /24 to Xellent IT. Xellent IT would need to ask ACME to do so; however, they may not wish to impose such a restriction on their customer. The question then becomes, does Provider A have the right to make such a request? They certainly can’t enforce it.

There is perhaps a legal argument to be made that by not advertising the more specific, Provider A is losing revenue. This will be illustrated when we look at the third option of allowing off-net traffic. I won’t broach the topic of whether or not Provider A could approach Xellent IT and ask for advertisement of the more specific due to revenue loss, but it is certainly food for thought. For now though, asking Xellent IT to advertise the more specific is perhaps not the preferred approach.

Let’s turn to the third option, which sees Provider A adjust its border policies by adding to its BCP38 ACL. Not only should this ACL permit traffic with source addresses from its customer cone, it should also permit traffic that is destined to prefixes in its customer cone. The idea looks like this:

Blog7_image6_allow_offnet

Now this might look ok. Off-net transit traffic to random public addresses (outside of Provider A’s customer cone) is still blocked, and ACME’s traffic isn’t. But this special case of off-net transit opens the door for abuse in a way that could cause Provider A to lose money.

Here’s how it works. For the sake of this explanation, I’ve removed Xellent IT and made ACME a direct customer of Provider A. I’ve also introduced a third service provider.

Blog7_image7_abuse_potential

  • ACME dual homes itself by buying transit from Providers A and B. Provider A happens to charge more.
  • ACME advertises its /23 PI space to Provider A
  • Its /24 is then advertised to Provider B, with a prefix scoping attribute that tells Provider B not to advertise the /24 on to any transit providers.
  • As a result of this, Provider C cannot see the more specific /24. Traffic from Provider C traverses Provider A, then Provider B before arriving at ACME.

Blog7_image7_abuse_potential_2

As we’ve already discussed, this violates BCP38 principles and turns Provider A into free transit for off-net traffic. But of perhaps greater importance is the loss of revenue that Provider A experiences. No one is paying for the increased traffic volume across Provider A’s core and Provider A gains no revenue from the increase, since it only crosses free peering boundaries. Provider B benefits as it sees more chargeable bandwidth used on its downstream link to ACME. ACME benefits since it can use the cheaper connection and utilize Provider A’s peering and transit relationships for free. If ACME had a remote site connecting to Provider C, GRE tunnels across Provider A’s core could further complicate things.

If ACME was clever enough and used looking glasses and other tools to discover the forwarding path, then there clearly is potential for abuse.

Having said all of that, I would argue that if this is done on a case by case basis, in a reactive way, it would be an acceptable solution.

For example, in this scenario, as long as traffic flows don’t reach too high a volume (something that can be monitored using a tool like NetFlow) and only this single subnet is permitted, then for the sake of maintaining network reachability, this is a reasonable exception. It is not likely that ACME is being deliberately malicious, and as long as this exception is monitored, the revenue loss would be minuscule and allowing a one-off policy violation would seem to be acceptable.

Rather than try to account for these scenarios beforehand, the goal would be to add exceptions and monitor them as they crop up. There are a number of ways to detect when these policy violations occur. In this case, the phone call and traceroute from Provider B is a good way to spot the problem. Regrettably, that does require something to go wrong for it to be found and fixed (meaning a disrupted service for the customer). There are ways to detect these violations a priori, but I won’t detail them here. Francois’ paper presents the option of using an open-source IP accounting tool like pmacct, which is worth reading about.

If off-net transit traffic levels increase, or more policy violations start to appear, more aggressive tactics might need to be looked at. Though for this particular quirk, allowing the transit traffic as an exception and monitoring its throughput seems to me to be a prudent approach.

Because I’ve spoken about this at a very high level, I won’t include a work section with CLI output. I could show an ACL permitting 1.1.1.0/24 outbound but this quirk doesn’t need that level of detail to understand the concepts.

So that’s it! A really fascinating conundrum that is as interesting to figure out as it is to troubleshoot. I’d love to hear if anyone has any thoughts or possible alternatives. I toyed with the idea of using static routing at the PE facing the customer, or assigning a community to routes received from peering that are in your customer cone and reacting to that somehow, but both those ideas ran into similar problems to the ones I’ve outlined above. Let me know if you have any other ideas. Thanks for reading.

From MPLS L3VPN to PBB-EVPN

This blog introduces PBB-EVPN over an MPLS network. But rather than just describe the technology from scratch, I have tried to structure the explanation assuming the reader is familiar with plain old MPLS L3VPN and is new to PBB and/or EVPN. This was certainly the case with me when I first studied this topic and I’m hoping others in a similar position will find this approach insightful.

I won’t be exploring a specific quirk or scenario – rather I will look at EVPN followed by PBB, giving analogies and comparisons to MPLS L3VPN as I go, before combining them into PBB-EVPN. I will focus on how traffic is identified, learned and forwarded in each section.

So what is PBB-EVPN? Well, besides being hard to say 3 times fast, it is essentially an L2VPN technology. It enables a Layer 2 bridge domain to be stretched across a Service Provider core while utilizing MAC aggregation to deal with scaling issues.

Let’s look at EVPN first.

EVPN

EVPN, or Ethernet VPN, over an MPLS network works on a similar principle to MPLS L3VPN. The best way to conceptualize the difference is to draw an analogy (colour coded to highlight points of comparison)…

MPLS L3VPN assigns PE interfaces to VRFs. It then uses MP-BGP (with the vpnv4 unicast address family) to advertise customer IP Subnets as VPNv4 routes to Route Reflectors or other PEs. Remote PEs that have a VRF configured to import the correct route targets, accept the MP-BGP update and install an ipv4 route into the routing table for that VRF.

EVPN uses PE interfaces linked to bridge-domains with an EVI. It then uses MP-BGP (with the l2vpn evpn address family) to advertise customer MAC addresses as EVPN routes to Route Reflectors or other PEs. Remote PEs that have an EVI configured to import the correct route target, accept the MP-BGP update and install a MAC address into the bridge domain for that EVI.

This analogy is a little crude, but in both cases packets or frames destined for a given subnet or MAC will be imposed with two labels – an inner VPN label and an outer transport label. The transport label is typically communicated via something like LDP and will correspond to the next hop loopback of the egress PE. The VPN label is communicated in the MP-BGP updates.

These diagrams illustrate the comparison:

Blog6_image1a_and_b

In EVPN, customer devices tend to be switches rather than routers. PE-CE routing protocols, like eBGP, aren’t used since it operates over layer 2. The Service Provider appears as one big switch. In this sense, it accomplishes the same as VPLS but (among other differences) uses BGP to distribute MAC address information, rather than using a full mesh of pseudowires.

EVPN uses an EVI, or EVPN Instance identifier, to identify a specific instance of EVPN as it maps to a bridge domain. For the purposes of this overview, you can think of an EVI as being quasi-equivalent to a VRF. A customer facing interface will be put into a bridge domain (layer 2 broadcast domain), which will have an EVI identifier associated with it.

The MAC address learning that EVPN utilizes is called control-plane learning, since it is BGP (a control-plane routing protocol) that distributes the MAC address information. This is in contrast to data-plane learning, which is how a standard switch learns MAC addresses – by associating the source MAC address of a frame with the receiving interface.

The following Cisco IOS-XR config shows an EVPN bridge domain and edge interface setup, side by side with an MPLS L3VPN setup for comparison:

Blog6_output1a_and_b

NB. For the MPLS L3VPN config, the RD configuration (which is usually set under the per-VRF BGP configuration) is not shown. PBB config is shown in the EVPN bridge domain; this will be explained further into the blog.
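Since the screenshot isn’t reproduced here, the EVPN side of such a setup might look roughly like the IOS-XR sketch below. The bridge group, interface, I-SID, EVI and route-target values are hypothetical, and exact syntax varies by platform and release:

l2vpn
 bridge group CUST-A
  bridge-domain CUST-A-EDGE
   interface GigabitEthernet0/0/0/1.100
   !
   pbb edge i-sid 10100 core-bridge-domain CUST-A-CORE
  !
  bridge-domain CUST-A-CORE
   pbb core
    evpn evi 100
   !
  !
!
evpn
 evi 100
  bgp
   route-target import 500:100
   route-target export 500:100
  !
 !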

EVPN seems simple enough at first glance, but it has a scaling problem, which PBB can ultimately help with…

Any given customer site can have hundreds or even thousands of MAC addresses, as opposed to just one subnet (as in an MPLS L3VPN environment). The number of updates and withdrawals that BGP would have to send could be overwhelming if it needed to make adjustments for MAC addresses appearing and disappearing – not to mention the memory requirements. And you can’t summarise MAC addresses like you can IP ranges. It would be like an MPLS L3VPN environment advertising /32 prefixes for every host rather than just one prefix for the subnet. We need a way to summarise or aggregate the MAC addresses.

Here’s where PBB comes in…

PBB – Provider Backbone Bridging (802.1ah)

PBB can help solve the EVPN scaling issue by performing one key function – it maps each customer MAC address to the MAC address of the attaching PE. Customer MAC addresses are called C-MACs. The PE MAC addresses are called B-MACs (or Backbone MACs).

This works by adding an extra layer 2 header to the frame as it is forwarded from one site to another across the provider core. The outer layer 2 header has a destination B-MAC address of the PE device that the inner frame’s destination C-MAC is associated with. As a result, PBB is often called MAC-in-MAC. This diagram illustrates the concept:

Blog6_image2_pbb

NB. In PBB terminology the provider devices are called Bridges. So a BEB (Backbone Edge Bridge) is a PE and a BCB (Backbone Core Bridge) is a P. For the sake of simplicity, I will continue to use PE/P terminology. Also worth noting is that PBB diagrams often show service provider devices as switches, to illustrate the layer 2 nature of the technology – which I’ve done above.

In the above diagram the SID (or Service ID) represents a layer 2 broadcast domain similar to what an EVI represents in EVPN.

Frames arriving on a PE interface will be inspected and, based on certain characteristics, mapped or assigned to a particular Service ID (SID).

The characteristics that determine what SID a frame belongs to can be a number of things:

  • The customer assigned VLAN
  • The Service Provider assigned VLAN
  • Existing SID identifiers
  • The interface it arrives on
  • A combination of the above or other factors

To draw an analogy to MPLS L3VPN – the VRF that an incoming packet is assigned to is determined by whatever VRF is configured on the receiving interface (using ip vrf forwarding CUST_1 in Cisco IOS interface CLI).

Once the SID has been allocated, the entire frame is then encapsulated in the outer layer 2 header with destination MAC of the egress PE.

In this way C-MACs are mapped to either B-MACs or local attachment circuits. Most importantly, however, the core P routers do not need to learn all of the MAC addresses of the customers. They only deal with the MAC addresses of the PEs. This allows a PE to aggregate all of the attached C-MACs for a given customer behind its own B-MAC.

But how does a remote PE learn which C-MAC maps to which B-MAC?

In PBB learning is done in the data-plane, much like a regular layer 2 switch. When a PE receives a frame from the PBB core, it will strip off the outer layer 2 header and make a note of the source B-MAC (the ingress PE). It will map this source B-MAC to the source C-MAC found on the inner layer 2 header. When a frame arrives on a local attachment circuit, the PE will map the source C-MAC to the attachment circuit in the usual way.

PBB must deal with BUM traffic too. BUM traffic is Broadcast, Unknown unicast or Multicast traffic. An example of BUM traffic is the arrival of a frame for which the destination MAC address is unknown. Rather than broadcast like a regular layer 2 switch would, a PBB PE will set the destination MAC address of the outer layer 2 header to a special multicast MAC address that is built based on the SID and includes all the egress PEs that are part of the same bridge domain. EVPN uses a different method of handling BUM traffic, but I will go into that later in the blog.

Overall, PBB is more complicated than the explanation given here, but this is the general principle (if you’re interested, see section 3 of my VPLS, PBB, EVPN and VxLAN Diagrams document that details how PBB can be combined with 802.1ad to add an aggregation layer to a provider network).

Now that we have the MAC-in-MAC features of PBB at our disposal, we can use it to solve the EVPN scaling problem and combine the two…

PBB-EVPN

With the help of PBB, EVPN can be adapted so that it deals with only the B-MACs.

To accomplish this, each EVPN EVI is linked to two bridge domains. One bridge domain is dedicated to customer MAC addresses and connected to the local attachment circuits. The other is dedicated to the PE routers’ B-MAC addresses. Both of these bridge domains are combined under the same bridge group.

Blog6_image3_bridge_domains

The PE devices will use data-plane learning to build a MAC database, mapping each C-MAC to either an attachment circuit or the B-MAC of an egress PE. Source C-MAC addresses are learned and associated as traffic flows through the network, just as in plain PBB.

The overall setup would look like this:

Blog6_image4_pbb_evpn_overview

The only thing EVPN needs to concern itself with is advertising the B-MACs of the PE devices. EVPN uses control-plane learning and includes the B-MACs in the MP-BGP l2vpn evpn updates. For example, if you were to look at the MAC addresses known to a particular EVI on a route reflector, you would only see MAC addresses for PE routers.

Looking again at the configuration output that we saw above, we can get a better idea of how PBB-EVPN works:

Blog6_output2_pbb_evpn_detail

NB. I have added the concept of a BVI, or Bridged Virtual Interface, to the above output. This can be used to provide a layer 3 breakout or gateway similar to how an SVI works on a L3 switch.

You can view the MAC addresses information using the following command:

Blog6_output3_macs

Now let’s look at how PBB-EVPN handles BUM traffic. Unlike PBB on its own, which just sends to a multicast MAC address, PBB-EVPN will use unicast replication and send copies of the frame to all of the remote PEs that are in the same EVI. This is an EVPN method, and the PE knows which remote PEs belong to the same EVI by looking in what is called a flood list.

But how does it build this flood list? To learn that, we need to look at EVPN route-types…

MPLS L3VPN sends VPNv4 routes in its updates. But EVPN sends more than one “type” of update. The type of update, or route-type as it is called, denotes what kind of information is carried in the update. The route-type is part of the EVPN NLRI.

For the purposes of this blog we will only look at two route-types.

  • Route-Type 2s, which carry MAC addresses (analogous to VPNv4 updates)
  • Route-Type 3s, which carry information on the egress PEs that belong to an EVI.

It is these Route-Type 3s (or RT-3s for short) that are used to build the flood list.

When BUM traffic is received by a PE, it will send copies of the frame to all of its attachment circuits (except the one it received the frame on) and all of the PEs for which it has received a Route-Type 3 update. In other words, it will send to everything in its flood-list.

So the overall process for a BUM packet being forwarded across a PBB-EVPN backbone will look as follows:

Blog6_image5_bum_traffic

So that’s it, in a nutshell. In this way PBB and EVPN can work together to create an L2VPN network across a Service Provider.

There are other aspects of both PBB and EVPN, such as EVPN multi-homing using Ethernet Segment Identifiers or PBB MAC clearing with MIRP to name just a couple, but the purpose of this blog was to provide an introductory overview – specifically for those used to dealing with MPLS L3VPN. Thoughts are welcome, and as always, thank you for reading.

Multihoming without a PE-to-CE Dynamic Routing Protocol

This quirk looks at how a multihomed site without a CE-to-PE routing protocol, like eBGP, can run into failover problems when using a first hop redundancy protocol.

The setup is as follows:

blog5_image1_base_setup

The CE routers in this case are Cisco 887 routers. The WAN connections are ADSL lines. From the CE routers, PPP sessions connect to the provider LNS/BNG routers (PE1 and PE2). These PPP sessions run over L2TP tunnels between the LAC and LNS. RADIUS is used by the LNS routers to authenticate the PPP sessions and to obtain IP and routing attributes.

CE1 and CE2 are running HSRP. CE1 is Active. The CE LAN interfaces are switchports and the IP/HSRP configurations are on SVIs for the access VLAN. Both CEs have a static default route pointing to the dialer interface for their respective WAN connections. CE1 tracks its dialer interface so that it can lower its HSRP priority if the WAN connection fails (allowing CE2 to take over).

Outbound traffic is routed via the HSRP Active router.

Inbound traffic works as follows:

When an LNS router authenticates a PPP session, it will send an Access-Request to the RADIUS server. The RADIUS server, when sending its Access-Accept to confirm the user is valid, will also return RADIUS attributes that the LNS router parses and applies to its configuration. For example, the attributes can indicate what IP to assign to the user – a Framed-IP-Address that will show on the dialer interface of the CE. The Framed-Route attribute (or a vendor-specific equivalent) can also be used to include static routes.

In this scenario Framed-IP and Framed-Route RADIUS attributes (among others not detailed here) are returned, which gives a WAN IP to the CE and installs a static route onto the LNS router. Each PPP session has one or more LAN ranges associated with it. The static route points traffic for these LAN ranges to the Framed-IP assigned for the PPP session.

The site in this scenario has a /28 network assigned to it. The primary PPP session from CE1 receives two static routes – one for each of the two /29s that the /28 is made up of. The secondary PPP session from CE2 receives a single /28 static route.
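As an illustration of how those attributes might look in a RADIUS users file (FreeRADIUS-style syntax; the usernames, WAN addresses and exact attribute format are assumptions, while the LAN prefixes are the 123.123.123.0/28 range used later in The Work):

ce1-primary@isp  Cleartext-Password := "secret1"
        Framed-IP-Address = 10.255.0.1,
        Framed-Route = "123.123.123.0/29 0.0.0.0 1",
        Framed-Route = "123.123.123.8/29 0.0.0.0 1"

ce2-secondary@isp  Cleartext-Password := "secret2"
        Framed-IP-Address = 10.255.0.2,
        Framed-Route = "123.123.123.0/28 0.0.0.0 1"

With a gateway of 0.0.0.0, the LNS typically installs each Framed-Route via the session’s Framed-IP-Address, which is what produces the static routes described above.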

These static routes are redistributed into the iBGP running in the service provider network. In the event that a PPP session drops, the associated static routes will be removed from the LNS routers.

Under normal circumstances, incoming traffic will follow either of the two more specific /29s down the primary WAN connection.

There are other ways to prefer one WAN connection over another (using BGP attributes when redistributing, or similar) but I’ve used this subnet splitting approach for simplicity.

In the event that the primary WAN connection fails, the following occurs:

For outbound traffic: CE1 lowers its HSRP priority allowing CE2 to take over. Outgoing traffic now goes via CE2.

For inbound traffic: The PPP session on PE1 will drop and both of the static routes will be removed. This leaves the /28 down the secondary WAN connection for traffic to be forwarded down.

blog5_image2_wan_failover

But what happens if the FastEthernet0 LAN interface on CE1 fails?

HSRP will fail over, meaning outbound traffic will leave the site via the secondary WAN connection as expected.

However, because the PPP session does not drop, the two /29 static routes to CE1 remain in place. Return traffic will traverse the primary WAN link and end up at CE1. CE1 has no route to the destination and will send it back out over its default route. Traffic will then loop until the TTL decrements to zero. The site has lost connectivity.

blog5_image3_lan_failover_problem

A reconfiguration is needed in order to allow for this situation, which is sometimes called “LAN-side failover”.

The Search

The first and most obvious question might be, why not run a routing protocol, like eBGP, between the PEs and CEs? The PE router would learn about the LAN range over this protocol rather than having static routes. The CEs would use redistribute connected and in the event that the LAN failed, this advertisement would cease.
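For completeness, a minimal sketch of what that option could look like on the CE (IOS syntax; the ASNs, neighbor address and route-map name are hypothetical):

router bgp 65010
 neighbor 10.255.0.254 remote-as 64500
 redistribute connected route-map LAN-ONLY
!
route-map LAN-ONLY permit 10
 match interface Vlan10

If Vlan10 (the LAN SVI) goes down, the connected route disappears and the advertisement is withdrawn automatically.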

There are a couple of reasons why you might not want to run a dynamic PE-to-CE routing protocol. Firstly, there could be a lot of incoming subscriber sessions on the LNS routers. The overhead involved in running so many eBGP sessions might be too much compared to simply using RADIUS attributes. Secondly, not all CPEs can support BGP, or whatever PE-to-CE protocol you want to run. Granted, an 887 can, but not all devices have this capability.

So with that said, let’s look at some options for how to deal with this issue…

There are several options to resolve this quirk. I’ll explore two of them here, each of which takes a different approach.

The first option is to ensure that in the event that the LAN interface goes down, the CE router automatically brings down the WAN connection.

Depending on the CPE used, there can be multiple ways to do this. In the case of a Cisco 887, a good way to do this is with EEM scripting. The EEM script can be made to trigger based on a tracking object for the LAN interface. You will also need to make sure that a second EEM script is configured to bring the WAN link back up if the LAN link is restored. I will show an example of such a script below.

An alternative approach is to ensure that there is a direct link between the Active and Standby routers in addition to the regular LAN link. Both LAN connections into each CE router would be in the same VLAN, allowing connection to the SVI. This would mean that if Fa0 dropped, HSRP would not fail over. Traffic leaving the site would still go via CE1, but it would pass through CE2 first and use the direct link between them.

blog5_image4_lan_ce_to_ce_link

As a side note, it is worth mentioning that one might mistakenly think that CE2, upon receiving outbound traffic, would forward it directly out of its WAN interface in accordance with its default route (causing asymmetric routing when the traffic returns via CE1). But this doesn’t happen. What needs to be remembered is that the routers’ LAN interfaces are switchports and the destination MAC address will still be 0000.0c07.acxx (where xx is the HSRP group number). CE1 still holds this MAC, meaning CE2 will switch the frame onwards through its switchports rather than routing the traffic.

In my experience this option is preferable. A single cable run and access port configuration is all that is needed. EEM Scripts can be unreliable at times and might not trigger when they should. Having said that, if this needs to be done on the CPE after deployment and remote hands are not possible, the EEM script might be the best approach.

The Work

The general HSRP setup could be as follows:

hostname CE1
!
interface Vlan10
 description SVI for LAN
 ip address 123.123.123.2 255.255.255.240
 standby 10 ip 123.123.123.1
 standby 10 priority 200
 standby 10 preempt
 standby 10 track 1 decrement 150
!
track 1 interface Dialer0 ip routing
!

The EEM script described above will need to trigger when Fa0 goes down. For that, the following tracker is used:

track 2 interface FastEthernet0 line-protocol

This EEM script will shut down the WAN connection if the tracker goes down and restore it if the tracker comes back up:

event manager applet LAN_FAILOVER_DOWN
 event track 2 state down
 action 1.0 syslog msg "Fa0 down. Shutting down controller interface"
 action 2.0 cli command "enable"
 action 3.0 cli command "configure terminal"
 action 4.0 cli command "controller vdsl 0"
 action 5.0 cli command "shutdown"
 action 6.0 cli command "end"
 action 7.0 syslog msg "Controller interface shutdown complete"
!
event manager applet LAN_FAILOVER_UP
 event track 2 state up
 action 1.0 syslog msg "Fa0 up. Enabling controller interface."
 action 2.0 cli command "enable"
 action 3.0 cli command "configure terminal"
 action 4.0 cli command "controller vdsl 0"
 action 5.0 cli command "no shutdown"
 action 6.0 cli command "end"
 action 7.0 syslog msg "Controller interface enabled."

When Fa0 drops, the syslog entries look like this:

Feb 27 14:42:18 GMT: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0, changed state to down
Feb 27 14:42:19 GMT: %TRACKING-5-STATE: 2 interface Fa0 line-protocol Up->Down
Feb 27 14:42:19 GMT: %HA_EM-6-LOG: LAN_FAILOVER_DOWN: Fa0 down. Shutting down controller interface
Feb 27 14:42:19 GMT: %CONTROLLER-5-UPDOWN: Controller VDSL 0, changed state to administratively down
Feb 27 14:42:19 GMT: %SYS-5-CONFIG_I: Configured from console by on vty1 (EEM:LAN_FAILOVER_DOWN)
Feb 27 14:42:19 GMT: %HA_EM-6-LOG: LAN_FAILOVER_DOWN: Controller interface shutdown complete

And when it is restored…

Feb 27 14:43:53 GMT: %LINK-3-UPDOWN: Interface FastEthernet0, changed state to up
Feb 27 14:43:53 GMT: %HA_EM-6-LOG: LAN_FAILOVER_UP: Fa0 up. Enabling controller interface.
Feb 27 14:43:54 GMT: %SYS-5-CONFIG_I: Configured from console by on vty1 (EEM:LAN_FAILOVER_UP)
Feb 27 14:43:54 GMT: %HA_EM-6-LOG: LAN_FAILOVER_UP: Controller interface enabled.
Feb 27 14:44:54 GMT: %CONTROLLER-5-UPDOWN: Controller VDSL 0, changed state to up

The second option is simpler and does not require much configuration at all. All we’d need to do is run a cable from Fa1 on CE1 to Fa1 on CE2 and put the following configuration under Fa1:

interface fa1
 description link to other CE for LAN failover
 switchport
 switchport mode access
 switchport access vlan 10

There isn't much else to show for this solution other than to reiterate that, with this in place, HSRP would not fail over and traffic in both directions would flow via CE2's switchports.

There are other ways to tackle this problem that I have not detailed here (using EtherChannel on the LAN perhaps, or something involving floating static routes) and any alternative ideas would be good to hear about and interesting to discuss. Thanks for reading.

 

MPLS Management misconfiguration

There are many different ways for ISPs to manage MPLS devices like routers and firewalls that are deployed to customer sites. This quirk explores one such solution and looks at a scenario where a misconfiguration results in VRF route leaking between customers.

The quirk

When an ISP deploys Customer Edge (CE) devices to customer sites they might, and often do, want to maintain management of them. For customers with a simple public internet connection this is usually straightforward – the device is reachable over the internet and an ACL or similar policy will be configured, allowing access from only a list of approved ISP IP addresses (for extra security VPNs could be used).

However, when Peer-to-Peer L3VPN MPLS is used, it is more complicated. The customer network is not directly accessible from the internet without going through some kind of breakout site. The ISP will either need a link into their customer's MPLS network or must configure access through the breakout. This can become complicated as the number of customers, and the number of sites per customer, increases.

One option, presented in this quirk, is to have all MPLS customers' PE-CE WAN subnets come from a common supernet range. These WAN subnets can then be exported into a common management VRF using a specific RT. The network that will be used to demonstrate this looks as follows:

blog4_image1_base_setup

This is available for download as a GNS3 lab from here. It includes the solution to the quirk as detailed below.

The ISP's ASN is 500. The two customers have ASNs 100 and 200 (depending on the setup these would typically be private ASNs, but they have been shown here as 100 and 200 for simplicity). A management router (MGMT) in ASN 64512 has access to the PE-CE WAN ranges for all of the customers, all of which come from the supernet 172.30.0.0/16. A special subnet within this range, 172.30.254.0/24, is reserved for the Management network itself. The MGMT router, or MPLS jump box as it may also be called, is connected to this range – as would be any other devices requiring access to the MPLS customers' devices (backup or monitoring systems for instance… not shown).

The basic idea is that each customer VRF exports its PE-CE WAN ranges with an RT of 500:501. The MGMT VRF then imports this RT.

Alongside this, the MGMT VRF exports its own routes (from the 172.30.254.0/24 management subnet) with an RT of 500:500. All of the customer VRFs import 500:500.

This has two key features:

  • Customer WAN ranges will all come from the 172.30.0.0/16 supernet and must not overlap between customers.
  • WAN ranges and site subnets are not, at any point, leaked between customer VRFs.

To get a better idea of how it works, take a look at the following diagram:

blog4_image2_mpls_mgmt_concept

The CLI for each customer VRF setup looks as follows:

ip vrf CUST_1
 description Customer_1_VRF
 rd 500:1
 vpn id 500:1
 export map VRF_EXPORT_MAP
 route-target export 500:1
 route-target import 500:1
 route-target import 500:500
!
route-map VRF_EXPORT_MAP permit 10
 match ip address prefix-list VRF_WANS_EXCEPT_MGMT
 set extcommunity rt 500:501 additive
route-map VRF_EXPORT_MAP permit 20
!
ip prefix-list VRF_WANS_EXCEPT_MGMT seq 10 deny 172.30.254.0/24 le 32
ip prefix-list VRF_WANS_EXCEPT_MGMT seq 20 permit 172.30.0.0/16 le 32

Note that the export map used on customer VRFs makes a point of excluding routes from the Management subnet (172.30.254.0/24). This is done on the off chance that the range exists within the customer's VRF table.

The VRF for the Management network is configured as follows (note this is only configured on PE3 in the above lab):

ip vrf MGMT_VRF
 description VRF for Management of Customer CEs
 rd 500:500
 vpn id 500:500
 route-target export 500:500
 route-target import 500:500
 route-target import 500:501

This results in the WAN ranges for customers being tagged with the 500:501 RT but not the LAN ranges.

PE1#sh bgp vpnv4 unicast vrf CUST_1 172.30.1.0/30
BGP routing table entry for 500:1:172.30.1.0/30, version 9
Paths: (1 available, best #1, table CUST_1)
  Advertised to update-groups:
    1         3

  Local
    0.0.0.0 from 0.0.0.0 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, 
       sourced, best
      Extended Community: RT:500:1 RT:500:501
      mpls labels in/out 23/aggregate(CUST_1)

PE1#sh bgp vpnv4 unicast vrf CUST_1 192.168.50.0/24
BGP routing table entry for 500:1:192.168.50.0/24, version 3
Paths: (1 available, best #1, table CUST_1)
  Advertised to update-groups:
    3

  100
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1
      mpls labels in/out 24/nolabel
PE1#

192.168.50.0/24, above, is one of the LAN ranges and does not have the 500:501 RT.

Every VRF can see the management network and the management network can see all the PE-CE WAN ranges for every customer:

PE1#sh ip route vrf CUST_2

Routing Table: CUST_2
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1
       L2 - IS-IS level-2, ia - IS-IS inter area, * - candidate default
       U - per-user static route, o - ODR
       P - periodic downloaded static route

Gateway of last resort is not set

B       192.168.60.0/24 [20/0] via 172.30.1.10, 01:32:17
        172.30.0.0/30 is subnetted, 3 subnets
B         172.30.254.0 [200/0] via 3.3.3.3, 01:32:09
B         172.30.1.4 [200/0] via 2.2.2.2, 01:32:09
C         172.30.1.8 is directly connected, FastEthernet1/0
B       192.168.50.0/24 [200/0] via 2.2.2.2, 01:32:09

PE1#
PE3#sh ip route vrf MGMT_VRF

Routing Table: MGMT_VRF
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1
       L2 - IS-IS level-2, ia - IS-IS inter area, * - candidate default
       U - per-user static route, o - ODR
       P - periodic downloaded static route

Gateway of last resort is not set

        172.30.0.0/30 is subnetted, 4 subnets
C         172.30.254.0 is directly connected, FastEthernet0/0
B         172.30.1.0 [200/0] via 1.1.1.1, 01:32:24
B         172.30.1.4 [200/0] via 2.2.2.2, 01:32:24
B         172.30.1.8 [200/0] via 1.1.1.1, 01:32:24

PE3#

Also, note that the routing table for Customer 2 (vrf CUST_2) cannot see the 172.30.1.0/30 WAN range for Customer 1 (vrf CUST_1).

Given the proper config, the MGMT router can access the WAN ranges for customers:

MGMT#telnet 172.30.1.2
Trying 172.30.1.2 ... Open

User Access Verification
Password:
CE1-1>

NB. I’m not advocating using telnet in such an environment. Use SSH as a minimum when you can.
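On a related note, since every CE is reachable from the shared 172.30.254.0/24 management range, it is worth restricting the VTY lines on the CEs to that range. A minimal sketch (the ACL name is illustrative, not part of the lab):

ip access-list standard MGMT_ONLY
 permit 172.30.254.0 0.0.0.255
 deny   any log
!
line vty 0 4
 transport input ssh
 access-class MGMT_ONLY in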

The quirk comes in when a simple misconfiguration introduces route leaking between customer VRFs.

Consider an engineer accidentally configuring a VRF that exports all of its vpnv4 prefixes with RT 500:500 (rather than only exporting its PE-CE WAN routes with RT 500:501 as described above). The mistake is easy enough to make and will cause routes from the newly configured VRF to be imported by all other customer VRFs. This will have a severe impact for any customers with the same route within their VRF.

To demonstrate this, imagine that the CUST_1 VRF is not yet configured. A traceroute from Customer 2 Site 2 (CE2-2 on the lower left side of the diagram), sourced from 192.168.60.1, to Customer 2 Site 1 (CE1-2) at 192.168.50.1 works fine:

CE2-2#trace 192.168.50.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.50.1
 1 172.30.1.9 12 msec 24 msec 24 msec
 2 10.10.14.4 [AS 500] [MPLS: Labels 16/24 Exp 0] 92 msec 64 msec 44 msec
 3 172.30.1.5 [AS 500] [MPLS: Label 24 Exp 0] 48 msec 68 msec 52 msec
 4 172.30.1.6 [AS 500] 116 msec 88 msec 104 msec

CE2-2#

If the CUST_1 VRF is now setup with the aforementioned misconfiguration, route leaking between CUST_1 and CUST_2 will result:

PE1(config)#ip vrf CUST_1
PE1(config-vrf)# description Customer_1_VRF
PE1(config-vrf)# rd 500:1
PE1(config-vrf)# vpn id 500:1
PE1(config-vrf)# route-target export 500:1
PE1(config-vrf)# route-target import 500:1
PE1(config-vrf)# route-target export 500:500
PE1(config-vrf)#
PE1(config-vrf)# interface FastEthernet0/1
PE1(config-if)# description Link to CE 1 for Customer 1
PE1(config-if)# ip vrf forwarding CUST_1
PE1(config-if)# ip address 172.30.1.1 255.255.255.252
PE1(config-if)# duplex auto
PE1(config-if)# speed auto
PE1(config-if)# no shut
PE1(config-if)#exit
PE1(config)#router bgp 500
PE1(config-router)# address-family ipv4 vrf CUST_1
PE1(config-router-af)# redistribute connected
PE1(config-router-af)# redistribute static
PE1(config-router-af)# neighbor 172.30.1.2 remote-as 100
PE1(config-router-af)# neighbor 172.30.1.2 description Customer 1 Site 1
PE1(config-router-af)# neighbor 172.30.1.2 activate
PE1(config-router-af)# neighbor 172.30.1.2 default-originate
PE1(config-router-af)# neighbor 172.30.1.2 as-override
PE1(config-router-af)# neighbor 172.30.1.2 route-map CUST_1_SITE_1_IN in
PE1(config-router-af)# no synchronization
PE1(config-router-af)# exit-address-family
PE1(config-router)#

VRF CUST_1 will export its routes (including 192.168.50.0/24 from Customer 1 Site 1 – CE1-1) and the VRF CUST_2 will import these routes due to the RT of 500:500.

Looking at the BGP and routing tables for the CUST_2 VRF shows that the next hop for 192.168.50.0/24 is now the CE1-1 router.

PE1#sh ip route vrf CUST_2 192.168.50.0
Routing entry for 192.168.50.0/24
  Known via "bgp 500", distance 20, metric 0
  Tag 100, type external
  Last update from 172.30.1.2 00:02:45 ago
  Routing Descriptor Blocks:
  * 172.30.1.2 (CUST_1), from 172.30.1.2, 00:02:45 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 100

PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 21
Paths: (2 available, best #1, table CUST_2)
  Advertised to update-groups:
    2

  100, imported path from 500:1:192.168.50.0/24
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1 RT:500:500

  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24

PE1#

There are now two possible paths to reach 192.168.50.0/24: one imported from the CUST_1 VRF and one from its own VRF (coming from CE1-2). The path via AS 100 is preferred because it is an external (eBGP) path, which the best path algorithm chooses ahead of the internal path learned via the route reflector. Note the 500:500 RT in this path.

Once this is done, CE2-2 cannot reach its own 192.168.50.0/24 subnet on CE1-2.

CE2-2#trace 192.168.50.1 source lo1
Type escape sequence to abort.

Tracing the route to 192.168.50.1
1 172.30.1.9 8 msec 12 msec 12 msec
2 * * *
3 * * *
4 * * *
...output omitted for brevity

Granted, this issue is caused by a mistake, but the difference between the correct and incorrect commands is minimal. An engineer under pressure or working quickly could potentially disrupt a massive MPLS infrastructure resulting in outages for multiple customers.

The search

As mentioned at the beginning of this blog, there are multiple ways to manage an MPLS network.

One possibility is to have a single router that, rather than importing and exporting WAN routes based on RTs, has a single loopback address in each VRF. It is from this loopback that the router will source SSH or telnet sessions to the customer CE devices. For example:

interface loopback 1
 description Loopback source for Customer 1
 ip vrf forwarding CUST_1
 ip address 100.100.100.100 255.255.255.255
!
interface loopback 2
 description Loopback source for Customer 2
 ip vrf forwarding CUST_2
 ip address 100.100.100.100 255.255.255.255

MGMT# telnet 172.30.1.2 /vrf CUST_1

This has a number of advantages:

  • This router acts as a single jump host (rather than a subnet), which could be considered more secure
  • There is no restriction on the WAN addresses for each customer. They can be any WAN range at all and can overlap between customers.
  • The same IP address can be used for each VRF's loopback (as long as it doesn't clash with any existing IPs already in the customer's VRF).

However there are a number of disadvantages:

  • Each VRF must be configured on this jump router
  • This jump router is a single point of failure
  • The command to log on is more complex and requires the user to know the VRF's exact name rather than just the router IP.
  • Migrating to this solution, from the aforementioned RT import/export solution, would be a cumbersome and long process.
  • Centralised MPLS backups could be complicated if there is not a common subnet (like 172.30.254.0/24) reachable by all CE devices.

For these reasons it was decided not to use this solution. Instead, import filtering was used to prevent this issue from taking place even if the misconfiguration occurred. The import filtering uses a route-map that performs the following sequential checks:

    1. If a route has the RT 500:500 and is from the management range (172.30.254.0/24) allow it.
    2. If any other route has the RT 500:500, deny it.
    3. Allow the import of all other routes.

Essentially, rather than just importing 500:500, this route-map checks to make sure that a vpnv4 prefix comes from the management range of 172.30.254.0/24. The biggest issue in this scenario was the deployment of this route-map to all VRFs on all PEs. But with a little bit of scripting (I won't go into the details here), this was far more practical than the option of deploying a multi-VRF jump router.

The work

The route map described in the above section looks as follows:

ip extcommunity-list standard VRF_MGMT_COMMUNITY permit rt 500:500
ip prefix-list VRF_MGMT_LAN seq 5 permit 172.30.254.0/24 le 32
!
route-map VRF_IMPORT_MAP permit 10
 match ip address prefix-list VRF_MGMT_LAN
 match extcommunity VRF_MGMT_COMMUNITY
!
route-map VRF_IMPORT_MAP deny 20
 match extcommunity VRF_MGMT_COMMUNITY
!
route-map VRF_IMPORT_MAP permit 30

NB. This is a good example of AND/OR logic in a route-map. If the match types differ (in this case a prefix-list and an extcommunity-list), the conditions are combined as a conjunction (AND). If the match types are the same, they are treated as a disjunction (OR).
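To illustrate the point with a hypothetical example (the route-map and list names below are made up): two match statements of different types in one sequence must both be satisfied, whereas multiple values given to the same match type are alternatives:

! Sequence 10: prefix must match PFX_A AND carry an RT in RT_LIST (different types = AND)
route-map EXAMPLE permit 10
 match ip address prefix-list PFX_A
 match extcommunity RT_LIST
!
! Sequence 20: prefix may match PFX_B OR PFX_C (same type, multiple values = OR)
route-map EXAMPLE permit 20
 match ip address prefix-list PFX_B PFX_C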

This will prevent the issue from occurring as it will stop the import of any vpnv4 prefix that has an RT of 500:500 unless it is from the management range.

Here is the configuration of this import map on PE1 (the other PEs are not shown but it should be configured on them too):

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)# ip extcommunity-list standard VRF_MGMT_COMMUNITY permit 
rt 500:500
PE1(config)#ip prefix-list VRF_MGMT_LAN seq 5 permit 172.30.254.0/24 
le 32
PE1(config)#!
PE1(config)#route-map VRF_IMPORT_MAP permit 10
PE1(config-route-map)# match ip address prefix-list VRF_MGMT_LAN
PE1(config-route-map)# match extcommunity VRF_MGMT_COMMUNITY
PE1(config-route-map)#!
PE1(config-route-map)#route-map VRF_IMPORT_MAP deny 20
PE1(config-route-map)# match extcommunity VRF_MGMT_COMMUNITY
PE1(config-route-map)#!
PE1(config-route-map)#route-map VRF_IMPORT_MAP permit 30
PE1(config-route-map)#
PE1(config-route-map)#ip vrf CUST_2
PE1(config-vrf)#import map VRF_IMPORT_MAP

After this addition, in the event that the misconfiguration takes place when creating the CUST_1 VRF, the import map will block the 192.168.50.0/24 subnet. The only path that the CUST_2 VRF has to 192.168.50.0/24 is from CE1-2, which is correct. Here is the configuration and resulting verification:

PE1(config)#ip vrf CUST_1
PE1(config-vrf)# description Customer_1_VRF
PE1(config-vrf)# rd 500:1
PE1(config-vrf)# vpn id 500:1
PE1(config-vrf)# route-target export 500:1
PE1(config-vrf)# route-target import 500:1
PE1(config-vrf)# route-target export 500:500
PE1#sh ip route vrf CUST_2 192.168.50.0
Routing entry for 192.168.50.0/24
  Known via "bgp 500", distance 200, metric 0
  Tag 200, type internal
  Last update from 2.2.2.2 00:22:12 ago
  Routing Descriptor Blocks:
  * 2.2.2.2 (Default-IP-Routing-Table), from 5.5.5.5, 00:22:12 ago
    Route metric is 0, traffic share count is 1
    AS Hops 1
    Route tag 200

PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 12
Paths: (1 available, best #1, table CUST_2)
Advertised to update-groups:
    2
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#
CE2-2#trace 192.168.50.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.50.1

 1 172.30.1.9 12 msec 24 msec 8 msec
 2 10.10.14.4 [AS 500] [MPLS: Labels 18/24 Exp 0] 60 msec 68 msec 64 msec
 3 172.30.1.5 [AS 500] [MPLS: Label 24 Exp 0] 52 msec 68 msec 44 msec
 4 172.30.1.6 [AS 500] 84 msec 56 msec 56 msec

CE2-2#

Management of the correct WAN device is still working as well…

MGMT#telnet 172.30.1.10
Trying 172.30.1.10 ... Open

User Access Verification

Password:
CE2-2>

Just for good measure, and to double check that our route-map is making a difference, let’s see what happens if we remove the import map from the CUST_2 VRF.

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)#ip vrf CUST_2
PE1(config-vrf)#no import map VRF_IMPORT_MAP
PE1(config-vrf)#^Z
PE1#
*Mar 1 00:27:45.259: %SYS-5-CONFIG_I: Configured from console by console
PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 22
Paths: (2 available, best #1, table CUST_2)
Flag: 0x820
  Advertised to update-groups:
    2
  100, imported path from 500:1:192.168.50.0/24
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1 RT:500:500
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#

The offending route is imported into the CUST_2 VRF pretty quickly, showing that our route-map was doing its job. If the route-map is put back in place and we wait for the BGP scanner to run (30 seconds or less), the vpnv4 prefix is blocked again:

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)#ip vrf CUST_2
PE1(config-vrf)#import map VRF_IMPORT_MAP
PE1(config-vrf)#^Z
PE1#
*Mar 1 00:29:51.443: %SYS-5-CONFIG_I: Configured from console by console
PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 24
Paths: (1 available, best #1, table CUST_2)
Flag: 0x820
  Advertised to update-groups:
    2
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#

This quirk shows just one way to successfully configure MPLS management and protect against misconfiguration. Give me a shout if anything was unclear or if you have any thoughts. As mentioned earlier, the GNS3 lab is available for download so have a tinker and see what you think.

Site by Site MPLS Breakout Migration

This month's quirk is a bit late. I have been studying furiously and managed to pass my Deploying Cisco Service Provider Advanced Network Routing exam last week. Only two to go before I get CCNP SP. 🙂

Another plus side is that I have a tonne of study notes that I will be uploading over the next few weeks. So if you are interested in Multicast, BGP or IPv6, watch this space.

Anyways, this quirk looks at a design solution whereby a 100+ site MPLS customer needed to change the Service Provider for their primary internet breakout one site at a time…

 

The quirk     

The customer had an L3VPN MPLS cloud with a new ISP, but still had their primary internet breakout with their old ISP.

The below diagram shows a stripped down version of such a network, illustrating the basic idea:

blog3_image1_base_setup

So whilst all of the MPLS sites connected to the new ISP's core, the link to the internet was still going out through a site that connected to the old provider.

The customer needed to move the default route and primary breakout over but did not want to do a single “big bang” migration and move all of the sites at once. Rather, they wanted to migrate one site at a time.

The search

The first step in looking at how to accomplish this was to break down the requirements. The following conditions needed to be met:

  • Each site must still be able to access all other sites and the file/application servers at the primary breakout site. These servers would be moved to the new ISP connection and breakout site 2 last of all.
  • As each site moves over to the new breakout, they only need PAT to gain access to the internet – no public services are run at the remote sites.
  • The PI space held by the customer, used for public facing services on the application servers, would be moved to the new provider once all sites were migrated.
  • Sites must be able to be moved one at a time without affecting any other sites.
  • The majority of MPLS sites were single homed with a static default.

Looking at these requirements gave us a good idea of what we needed to achieve.

Policy-based routing was considered first: adjusting either the next hop or the VRF based on the source address. However, this would require too much overhead in identifying each site that had been moved, either by community value or by source prefix, combined with setting the correct next hop or VRF to use.

Ultimately, the use of a second VRF with “all but default” route leaking was decided upon. This involved creating a second VRF with a default route pointing to the new ISP breakout. All routes except the defaults were to be leaked between these VRFs.

This meant that all we needed to do to migrate a site was change the VRF to which the attachment circuit belonged.

It is worth highlighting that, had there been a significant number of multihomed sites running BGP, using policy-based routing may have been preferred. This is because a large number of BGP neighborships would otherwise need to be reconfigured into the correct VRF.
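For completeness, a rough idea of what the rejected PBR approach might have looked like, selecting the VRF based on a migrated site's source prefix. This is only a sketch: the names and interface are hypothetical, support for VRF selection with set vrf is platform dependent, and the ingress interface would need additional VRF-aware configuration beyond what is shown.

! Identify sources belonging to a site that has been migrated
ip access-list standard MIGRATED_SITE_1
 permit 192.168.1.0 0.0.0.255
!
! Push matching traffic into the new-ISP VRF
route-map VRF_SELECT permit 10
 match ip address MIGRATED_SITE_1
 set vrf CUST-A-NEW-ISP
!
interface FastEthernet0/0
 ip policy route-map VRF_SELECT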

The work

The below output has been taken from a simulation. The MPLS sites have been represented using Loopbacks 1-3 on PE_RTR.

First we will take a look at a traceroute to the internet (to IP 50.50.50.50) and the routing table for the original VRF before any changes were made: 

PE_RTR#sh ip route vrf CUST-A-OLD-ISP

Routing Table: CUST-A-OLD-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 1.1.1.1 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:15:34
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:15:34
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:15:34
C 192.168.1.0/24 is directly connected, Loopback1
C 192.168.2.0/24 is directly connected, Loopback2
C 192.168.3.0/24 is directly connected, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:15:34
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:15:34
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:19
B* 0.0.0.0/0 [200/5] via 1.1.1.1, 00:15:43
PE_RTR#

PE_RTR#trace vrf CUST-A-OLD-ISP 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 17/20 Exp 0] 116 msec 72 msec 48 msec
 2 10.10.10.1 [MPLS: Label 20 Exp 0] 24 msec 44 msec 24 msec
 3 10.10.10.2 20 msec 20 msec 36 msec
 4 192.168.50.1 28 msec 56 msec 24 msec
 5 100.100.100.1 116 msec 52 msec 72 msec
 6 100.111.111.1 64 msec 140 msec 60 msec
PE_RTR#

So the WAN range of the breakout in this simulation is 100.100.100.0/29. This is their PI space. Notice the range 192.168.101.0/24, which is the subnet that the file/application servers are on.

The VRF configuration on the PEs is straightforward.

ip vrf CUST-A-OLD-ISP
 description VRF for Old ISP Breakout
 rd 100:1
 route-target export 100:1
 route-target import 100:1

Before we created the new VRF, we needed a way to differentiate what can and cannot be leaked. For this we used filtering when exporting RTs. We designated the RT 100:100 for routes that should be leaked.

First we started by making a prefix list that catches the default route:

ip prefix-list defaultRoute seq 5 permit 0.0.0.0/0
ip prefix-list defaultRoute seq 50 deny 0.0.0.0/0 le 32

Then we specified a route-map that attaches the RT 100:100 to prefixes that are not the default route:

route-map ALL-EXCEPT-DEFAULT permit 10
 match ip address prefix-list defaultRoute
!
route-map ALL-EXCEPT-DEFAULT permit 20
 set extcommunity rt 100:100 additive

Note the use of the additive keyword so as not to overwrite any existing communities.

Once we had these set up, we created the new VRF and applied this route-map in the form of an export map to set the correct RTs. We made sure to import 100:100 and then applied the same to the original VRF.

ip vrf CUST-A-NEW-ISP
 description VRF for New ISP Breakout
 rd 100:2
 export map ALL-EXCEPT-DEFAULT
 route-target export 100:2
 route-target import 100:100
 route-target import 100:2
!
ip vrf CUST-A-OLD-ISP
 description VRF for Old ISP Breakout
 rd 100:1
 export map ALL-EXCEPT-DEFAULT
 route-target export 100:1
 route-target import 100:100
 route-target import 100:1

From here, after deploying this to all the relevant PEs and injecting a new default route, the migration from one VRF to another was fairly straightforward. Below is an example using a simulated loopback (the principle would be the same for the incoming attachment circuit to a customer site):

PE_RTR(config)#interface Loopback1
PE_RTR(config-if)# ip vrf forwarding CUST-A-NEW-ISP
% Interface Loopback1 IP address 192.168.1.1 removed due to enabling 
VRF CUST-A-NEW-ISP
PE_RTR(config-if)# ip address 192.168.1.1 255.255.255.0

If we look at the routing table for this new vrf we see the following:

PE_RTR#sh ip route vrf CUST-A-NEW-ISP

Routing Table: CUST-A-NEW-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 2.2.2.2 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:16:16
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:16:16
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:16:16
C 192.168.1.0/24 is directly connected, Loopback1
B 192.168.2.0/24 is directly connected, 00:16:17, Loopback2
B 192.168.3.0/24 is directly connected, 00:16:23, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:16:16
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:16:18
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:20
B* 0.0.0.0/0 [200/0] via 2.2.2.2, 00:16:18
PE_RTR#

An interesting side note here is that even though Loopback2 and 3 are directly connected, they are shown as having been learned through BGP. This is the result of the import from the original VRF. Indeed upon closer inspection of one of the prefixes we see the 100:100 community:

PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-NEW-ISP 192.168.3.0/24
BGP routing table entry for 100:2:192.168.3.0/24, version 47
Paths: (1 available, best #1, table CUST-A-NEW-ISP)
 Not advertised to any peer
 Local, imported path from 100:1:192.168.3.0/24
 0.0.0.0 from 0.0.0.0 (3.3.3.3)
 Origin incomplete, metric 0, localpref 100, weight 32768, valid, 
external, best
 Extended Community: RT:100:1 RT:100:100
 mpls labels in/out nolabel/aggregate(CUST-A-OLD-ISP)

And looking at the default route we see no such community and a different next hop from the original table.

PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-NEW-ISP 0.0.0.0
BGP routing table entry for 100:2:0.0.0.0/0, version 40
Paths: (1 available, best #1, table CUST-A-NEW-ISP)
 Not advertised to any peer
 65489
 2.2.2.2 (metric 3) from 2.2.2.2 (2.2.2.2)
 Origin incomplete, metric 5, localpref 200, valid, internal, best
 Extended Community: RT:100:2
 mpls labels in/out nolabel/23

The old VRF's table still shows a route for the newly migrated site (although now learned via BGP) and the default route is still as it was originally:

PE_RTR#sh ip route vrf CUST-A-OLD-ISP

Routing Table: CUST-A-OLD-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 1.1.1.1 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:15:34
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:15:34
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:15:34
B 192.168.1.0/24 is directly connected, 00:15:36, Loopback1
C 192.168.2.0/24 is directly connected, Loopback2
C 192.168.3.0/24 is directly connected, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:15:34
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:15:34
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:19
B* 0.0.0.0/0 [200/5] via 1.1.1.1, 00:15:43
PE_RTR#
PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-OLD-ISP 0.0.0.0
BGP routing table entry for 100:1:0.0.0.0/0, version 15
Paths: (1 available, best #1, table CUST-A-OLD-ISP)
 Not advertised to any peer
 65489
 1.1.1.1 (metric 3) from 1.1.1.1 (1.1.1.1)
 Origin incomplete, metric 0, localpref 100, valid, internal, best
 Extended Community: RT:100:1
 mpls labels in/out nolabel/26

Finally, a traceroute test shows that the newly migrated site accesses the internet via a different site and can still access the application server subnet:

PE_RTR#trace vrf CUST-A-NEW-ISP 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 16/20 Exp 0] 44 msec 40 msec 52 msec
 2 10.20.20.1 [MPLS: Label 20 Exp 0] 32 msec 36 msec 52 msec
 3 10.20.20.2 52 msec 40 msec 32 msec
 4 192.168.51.1 54 msec 39 msec 31 msec
 5 200.200.200.2 68 msec 60 msec 32 msec
 6 200.222.222.2 65 msec 143 msec 62 msec

PE_RTR#
PE_RTR#trace vrf CUST-A-NEW-ISP 192.168.101.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.101.1

 1 10.1.3.2 [MPLS: Labels 16/22 Exp 0] 56 msec 52 msec 44 msec
 2 10.10.10.1 [MPLS: Label 22 Exp 0] 36 msec 24 msec 24 msec
 3 10.10.10.2 40 msec 40 msec 36 msec
 4 192.168.50.1 26 msec 57 msec 23 msec
 5 192.168.101.1 32 msec 48 msec 36 msec
PE_RTR#

One final point to make is that advertising the PI space to both providers for backup purposes was a possibility. as-path prepend could have been used from breakout site 2 to make it less preferred. But complications come into play depending on how each provider advertises the PI space and whether they honour any adjustments that the customer makes. Should return traffic not follow the same path, stateful firewall sessions would also encounter difficulty.
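As a rough illustration of that option, the prepend from breakout site 2 towards its internet provider might have looked something like the following (the peer address, AS number and names here are placeholders, not taken from the lab):

ip prefix-list PI_SPACE seq 5 permit 100.100.100.0/29
!
route-map PREPEND_PI permit 10
 match ip address prefix-list PI_SPACE
 set as-path prepend 65489 65489 65489
!
router bgp 65489
 neighbor 203.0.113.1 route-map PREPEND_PI out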

So, a pretty straightforward solution in the end, but interesting from a migration standpoint. I am interested to hear whether anyone would have taken a different approach. Perhaps we should have used policy-based routing, or maybe another solution entirely? As usual, thoughts are always welcome.

Asymmetric routing caused by unfiltered redistribution

This quirk demonstrates how the different administrative distances of BGP, combined with the Best Path Selection algorithm, can cause asymmetric routing if redistribution isn't done carefully.

As a reminder, each blog will follow 3 sections: The quirk, the search and the work. The quirk describes the problem, the search shows how a solution was reached and the work shows the technical and CLI aspects.

The quirk

The scenario we will be looking at is as follows:

blog2_image1_base_setup

The network consists of an MPLS core with multiple remote sites (only one is shown here). There is a dual homed breakout site, which passes through a firewall (performing security and address translation services as normal) and onwards to an internet facing WAN connection.

A default route is learned over eBGP from the Provider Edge router (PE4) connected to the internet facing Customer Edge router (CE4). This is redistributed into OSPF. The MPLS facing Customer Edge routers (CE1 and CE2) redistribute OSPF into BGP using the redistribute ospf 1 match internal external 2 command. The default and local 10.200.0.0/24 routes are advertised to the Provider Edge Routers (PE1 and PE2) and into the MPLS core. PE1 gives the routes received from CE1 a local preference of 200 making this WAN link preferred.
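For reference, the local preference adjustment mentioned above would sit on PE1, applied inbound on the session from CE1. A minimal sketch (the VRF name, neighbor address and route-map name are assumptions for illustration and follow the lab used later in the work section; the PE configuration itself is not shown there):

route-map FROM_CE1_IN permit 10
 set local-preference 200
!
router bgp 100
 address-family ipv4 vrf CUST_A
  neighbor 10.10.1.2 remote-as 65489
  neighbor 10.10.1.2 activate
  neighbor 10.10.1.2 route-map FROM_CE1_IN in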

So that the breakout firewall has a path back to the MPLS sites, every MPLS site's range is advertised through eBGP into the MPLS core before being sent to CE1 and CE2 and redistributed into OSPF.

The quirk comes into play when you consider that, at this stage, no filtering of any kind is applied to the redistribution. Combine that with the order in which the BGP sessions of CE1 and CE2 establish and we quickly see problems with return traffic from the internet headed back to an MPLS site.

Consider the following sequence of events:

  1. CE2 establishes its eBGP neighborship to PE2 before CE1 establishes its session to PE1. CE2 learns about the MPLS LAN ranges from PE2. These eBGP learned routes have an AD of 20.
  2. CE2 redistributes these eBGP prefixes into the OSPF link state database (LSDB).
  3. CE1 receives the Type 5 LSAs and installs these prefixes into its RIB. These OSPF prefixes have an AD of 110.
  4. Without filtering, CE1 will redistribute these into BGP. BGP will give them a weight of 32,768 (because they are redistributed and thus locally sourced). Another, and sometimes overlooked, aspect is that these locally generated routes will be given an AD of 200.
  5. CE1 now establishes its neighborship to PE1 and receives the prefixes for the MPLS sites over eBGP (just as CE2 did). These eBGP prefixes are installed in the BGP RIB and have an AD of 20. They have a weight of 0 since they are learned from a neighbor.
  6. Now CE1 has to choose the best path back to any given MPLS site. One might think that the decision is easy, by comparing Administrative Distances. CE1 knows about the MPLS sites through eBGP and OSPF. eBGP's AD is 20. OSPF's AD is 110. Therefore eBGP should win, right? Not quite. When a router receives paths to a given destination from multiple routing sources, it uses the Administrative Distance to judge the trustworthiness of the protocol, with the lowest one being the most trusted. But what needs to be considered here is that each routing protocol will put its best route forward to be considered… and in the case of BGP this could result in routes with different ADs. Let's follow what happens:

OSPF has only one E2 route, which has an AD of 110. So OSPF puts this forward.

However the BGP Router process has two options to choose from. It runs through the BGP Best Path Selection Algorithm to decide (for a reminder of its steps take a look at this document).

It doesn’t get very far before a decision is made. In fact, it is on the first step! The route redistributed from OSPF has a weight of 32,768 whereas the one learned from its eBGP neighbor has a weight of 0. Higher weight wins, so BGP selects the prefix that was learned through redistribution and puts it forward. Remember this route has an AD of 200…

CE1 looks at its options and chooses the routing source with the lowest AD, which in this case is OSPF. As a result the OSPF route is installed in the IP RIB.

  7. CE1 does not even redistribute its eBGP learned prefixes into OSPF. Redistribution takes place from the IP RIB and there are no BGP routes in there.
  8. Because of this, the breakout firewall only sees routes for the MPLS sites from CE2 and sets CE2 as the next hop.

From here, we can see that traffic leaving a remote MPLS site destined for the internet, will go out via the primary CE1-PE1 link. However return traffic will go back via the CE2-PE2 link.

blog2_image2_traffic_path

Of course, if CE1 establishes its BGP session first this is not an issue; however, relying on that ordering is far from ideal. We needed to look at a way to either make sure CE1 brings up its BGP session first, prevent CE1 from learning these routes from CE2, or prevent the redistribution from OSPF back into BGP.

 

The search

There are a number of ways to tackle this issue. Some better than others.

One possible approach would be to try to make sure that CE1 was always the first to bring its BGP peering up… or rather, to make sure that CE2 clears its BGP configuration if it detects CE1 bring its BGP neighborship up. The following EEM script, configured on CE2, was used to test this idea:

event manager applet lanprimarywan
 event track 123 state up
 action 1.0 syslog msg "START_EEM_SCRIPT1: Soft clears BGP relationship 
 when Primary Routers WAN link comes up"
 action 2.0 cli command "event timer countdown time 60"
 action 3.0 cli command "enable"
 action 4.0 cli command "clear ip bgp 10.10.1.6"
 action 5.0 cli command "end"
 action 6.0 syslog msg "BGP clear by EEM"
!
ip route 10.10.1.1 255.255.255.255 10.200.0.252
!
track 123 ip sla 123
!
ip sla 123
 icmp-echo 10.10.1.1 source-ip 10.200.0.253
 frequency 10
ip sla schedule 123 life forever start-time now
!

In short, CE2 would track the PE1 WAN interface. A static route has been included to make sure that it tracks it by going through CE1 (rather than its WAN connection). If this tracking object came up, CE2 would clear its BGP session. There is a delay timer put into the script to allow a minute for CE1 to bring up its BGP session.

There is a major problem with this approach however. Just because the WAN link is up doesn’t mean the PE1-CE1 BGP neighborship is up. The neighborship could drop for some other reason, without the link failing. If this happened CE2 would never clear its BGP session.

Plus, even if the tracking worked as expected, it might be deemed too disruptive to hard clear a BGP session for such an important site. As we will see, there are better options available.

A second possible approach involves preventing CE1 from learning any OSPF routes from CE2. This can be accomplished using a distribute-list. A distribute-list sits between the Shortest Path First calculation and the IP routing table. It doesn't stop prefixes from entering the LSDB or affect the best route OSPF chooses. But it will prevent routes moving from the LSDB to the IP routing table. If a distribute-list is applied inbound and allows only the local LAN ranges and the default route, then the MPLS site prefixes will never enter CE1's IP RIB from OSPF. Since redistribution is performed from the IP RIB, they will never show up in the BGP table.

The configuration would look as follows:

router ospf 1
 redistribute bgp 65489 metric 10 subnets
 network 10.200.0.0 0.0.0.255 area 0
 distribute-list LOCALS_AND_DEFAULT in
!
ip prefix-list LOCALS_AND_DEFAULT seq 5 permit 0.0.0.0/0
ip prefix-list LOCALS_AND_DEFAULT seq 10 permit 10.200.0.0/24
ip prefix-list LOCALS_AND_DEFAULT seq 100 deny 0.0.0.0/0 le 32

This configuration works just fine but there is a third option that makes use of tagging and allows for a cleaner approach.

This third option, outlined in the work section below, combines tagging with filtering using route-maps.

When prefixes are advertised to CE2 over eBGP and redistributed into OSPF, we can tag the prefixes. We can then configure a route-map on CE1 that only allows prefixes that do not have this tag to be redistributed into BGP.

Let’s explore the configuration of how this would be achieved.

 

The work

For this scenario I have built a GNS3 lab that looks as follows (this is available for download from the GNS3 page):

gns3_mpls_breakout_bgp_and_ospf_lab_7

Three MPLS sites are represented by loopbacks on the router named LNS (representing an L2TP Network Server in name only. It is simply a 3725 running BGP and MPLS). The ranges for these MPLS sites are 192.168.1-3.0/24. A loopback with IP 50.50.50.50/32 on the INTERNET router (the cloud image) is used to simulate a public IP.

Here is the base configuration for CE1 and CE2 as far as OSPF and BGP are concerned:

hostname CE1
!
router ospf 1
 router-id 11.11.11.11
 log-adjacency-changes
 redistribute bgp 65489 metric 5 subnets
 network 10.200.0.0 0.0.0.255 area 0
!
router bgp 65489
 bgp log-neighbor-changes
 neighbor 10.10.1.1 remote-as 100
!
 address-family ipv4
  redistribute connected
  redistribute static
  redistribute ospf 1 match internal external 2
  neighbor 10.10.1.1 activate
  neighbor 10.10.1.1 allowas-in
  neighbor 10.10.1.1 soft-reconfiguration inbound
  neighbor 10.10.1.1 route-map BLOCK_LOCALS_AND_DEFAULT in
  neighbor 10.10.1.1 route-map ALLOW_LOCALS_AND_DEFAULT out
  default-information originate
  no auto-summary
  no synchronization
 exit-address-family
!
ip prefix-list LOCALS_AND_DEFAULT seq 5 permit 0.0.0.0/0
ip prefix-list LOCALS_AND_DEFAULT seq 10 permit 10.200.0.0/24
ip prefix-list LOCALS_AND_DEFAULT seq 100 deny 0.0.0.0/0 le 32
!
route-map BLOCK_LOCALS_AND_DEFAULT deny 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!
route-map BLOCK_LOCALS_AND_DEFAULT permit 20
!
route-map ALLOW_LOCALS_AND_DEFAULT permit 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!
hostname CE2
!
router ospf 1
 router-id 22.22.22.22
 log-adjacency-changes
 redistribute bgp 65489 metric 10 subnets
 network 10.200.0.0 0.0.0.255 area 0
!
router bgp 65489
 bgp log-neighbor-changes
 neighbor 10.10.1.5 remote-as 100
!
 address-family ipv4
  redistribute connected
  redistribute static
  redistribute ospf 1 match internal external 2
  neighbor 10.10.1.5 activate
  neighbor 10.10.1.5 allowas-in
  neighbor 10.10.1.5 soft-reconfiguration inbound
  neighbor 10.10.1.5 route-map BLOCK_LOCALS_AND_DEFAULT in
  neighbor 10.10.1.5 route-map ALLOW_LOCALS_AND_DEFAULT out
  default-information originate
  no auto-summary
  no synchronization
 exit-address-family
!
ip prefix-list LOCALS_AND_DEFAULT seq 5 permit 0.0.0.0/0
ip prefix-list LOCALS_AND_DEFAULT seq 10 permit 10.200.0.0/24
ip prefix-list LOCALS_AND_DEFAULT seq 100 deny 0.0.0.0/0 le 32
!
route-map BLOCK_LOCALS_AND_DEFAULT deny 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!
route-map BLOCK_LOCALS_AND_DEFAULT permit 20
!
route-map ALLOW_LOCALS_AND_DEFAULT permit 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!

Note that CE1 has a lower cost when redistributing routes. This is to ensure the breakout firewall will prefer going via CE1 given the option.

Let’s clear the BGP neighborship of CE1 and see what routes it selects:

CE1#clear ip bgp *
CE1#
*Mar 1 00:06:52.471: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Down User reset
CE1#
*Mar 1 00:06:53.759: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Up
CE1#
CE1#sh ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 10.200.0.1 to network 0.0.0.0

 100.0.0.0/30 is subnetted, 1 subnets
O E2 100.100.100.0 [110/20] via 10.200.0.1, 00:05:46, FastEthernet0/0
 99.0.0.0/29 is subnetted, 1 subnets
O 99.99.99.0 [110/2] via 10.200.0.1, 00:05:46, FastEthernet0/0
 10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.1.0/30 is directly connected, Serial0/0
B 10.10.1.4/30 [20/0] via 10.10.1.1, 00:00:21
C 10.200.0.0/24 is directly connected, FastEthernet0/0
O E2 192.168.1.0/24 [110/10] via 10.200.0.253, 00:00:24, FastEthernet0/0
O E2 192.168.2.0/24 [110/10] via 10.200.0.253, 00:00:24, FastEthernet0/0
O E2 192.168.3.0/24 [110/10] via 10.200.0.253, 00:00:24, FastEthernet0/0
O*E2 0.0.0.0/0 [110/5] via 10.200.0.1, 00:05:47, FastEthernet0/0
CE1#

So currently CE1 is preferring its E2 OSPF routes to reach the MPLS sites. When tracing from a remote MPLS site we see that traffic takes an outbound path across the PE1-CE1 link:

LNS#trace vrf CUST_A 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 18/21 Exp 0] 44 msec 48 msec 44 msec
 2 10.10.1.1 [MPLS: Label 21 Exp 0] 36 msec 32 msec 36 msec
 3 10.10.1.2 36 msec 32 msec 36 msec
 4 10.200.0.1 36 msec 84 msec 36 msec
 5 99.99.99.2 100 msec 96 msec 56 msec
 6 100.100.100.2 32 msec 80 msec 80 msec
LNS#

However the path back from the firewall traverses the CE2-PE2 link:

FW#trace 192.168.1.1

Type escape sequence to abort.
Tracing the route to 192.168.1.1

 1 10.200.0.253 32 msec 28 msec 8 msec
 2 10.10.1.5 36 msec 36 msec 36 msec
 3 10.1.2.2 [MPLS: Labels 16/20 Exp 0] 56 msec 40 msec 44 msec
 4 192.168.1.1 [MPLS: Label 20 Exp 0] 84 msec 60 msec 88 msec
FW#

The BGP table of CE1 helps to show what is happening:

CE1#sh bgp ipv4 unicast
BGP table version is 12, local router ID is 11.11.11.11
Status codes: s suppressed, d damped, h history, * valid, > best, i -
internal, r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

 Network Next Hop Metric LocPrf Weight Path
*> 0.0.0.0 10.200.0.1 5 32768 ?
*> 10.10.1.0/30 0.0.0.0 0 32768 ?
* 10.10.1.1 0 0 100 ?
*> 10.10.1.4/30 10.10.1.1 0 100 ?
*> 10.200.0.0/24 0.0.0.0 0 32768 ?
*> 99.99.99.0/29 10.200.0.1 2 32768 ?
*> 100.100.100.0/30 10.200.0.1 20 32768 ?
* 192.168.1.0 10.10.1.1 0 100 ?
*> 10.200.0.253 10 32768 ?
* 192.168.2.0 10.10.1.1 0 100 ?
*> 10.200.0.253 10 32768 ?
* 192.168.3.0 10.10.1.1 0 100 ?
*> 10.200.0.253 10 32768 ?
CE1#

We can see that there are two paths to each MPLS site. The reason for its best path selection becomes clear when taking a closer look at one of the prefixes:

CE1#sh bgp ipv4 unicast 192.168.1.0/24
BGP routing table entry for 192.168.1.0/24, version 4
Paths: (2 available, best #2, table Default-IP-Routing-Table)
 Not advertised to any peer
 100, (received & used)
 10.10.1.1 from 10.10.1.1 (1.1.1.1)
 Origin incomplete, localpref 100, valid, external
 Local
 10.200.0.253 from 0.0.0.0 (11.11.11.11)
 Origin incomplete, metric 10, localpref 100, weight 32768, valid, 
 sourced, best
CE1#

The path via 10.10.1.1 has a weight of 0 since it is learned from an eBGP neighbor (as the word external implies). The path via CE2 is locally sourced (as the word sourced and the Local AS path imply) and has a weight of 32,768. Because of this, the second path, which has an AD of 200, is chosen as the best path and ultimately loses out to OSPF.

Now let’s look at fixing this using route-maps and tagging. The first step is to configure CE2 to tag any eBGP routes that it redistributes into OSPF with tag 10.

CE2#conf t
Enter configuration commands, one per line. End with CNTL/Z.
CE2(config)#route-map SET_TAG permit 10
CE2(config-route-map)#set tag 10
CE2(config-route-map)#exit
CE2(config)#router ospf 1
CE2(config-router)#redistribute bgp 65489 metric 10 subnets route-map 
SET_TAG

The OSPF LSDB now reflects this change:

CE2#sh ip ospf database external 192.168.1.0

 OSPF Router with ID (22.22.22.22) (Process ID 1)

 Type-5 AS External Link States

 LS age: 57
 Options: (No TOS-capability, DC)
 LS Type: AS External Link
 Link State ID: 192.168.1.0 (External Network Number )
 Advertising Router: 22.22.22.22
 LS Seq Number: 8000000C
 Checksum: 0xCE04
 Length: 36
 Network Mask: /24
 Metric Type: 2 (Larger than any link state path)
 TOS: 0
 Metric: 10
 Forward Address: 0.0.0.0
 External Route Tag: 10

CE2#

The next task is to configure CE1 to block redistribution for anything that has tag 10:

CE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
CE1(config)#route-map BLOCK_TAG deny 10
CE1(config-route-map)#match tag 10
CE1(config-route-map)#route-map BLOCK_TAG permit 20
CE1(config-route-map)#exit
CE1(config)#router bgp 65489
CE1(config-router)# redistribute ospf 1 match internal external 2 
route-map BLOCK_TAG

The effect is immediate. CE1 now prefers the path learned over eBGP:

CE1#sh ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 10.200.0.1 to network 0.0.0.0

 100.0.0.0/30 is subnetted, 1 subnets
O E2 100.100.100.0 [110/20] via 10.200.0.1, 00:31:44, FastEthernet0/0
 99.0.0.0/29 is subnetted, 1 subnets
O 99.99.99.0 [110/2] via 10.200.0.1, 00:31:44, FastEthernet0/0
 10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.1.0/30 is directly connected, Serial0/0
B 10.10.1.4/30 [20/0] via 10.10.1.1, 00:26:19
C 10.200.0.0/24 is directly connected, FastEthernet0/0
B 192.168.1.0/24 [20/0] via 10.10.1.1, 00:00:26
B 192.168.2.0/24 [20/0] via 10.10.1.1, 00:00:26
B 192.168.3.0/24 [20/0] via 10.10.1.1, 00:00:26
O*E2 0.0.0.0/0 [110/5] via 10.200.0.1, 00:31:45, FastEthernet0/0
CE1#

In addition to this, there is now only one route for the MPLS sites in the BGP RIB:

CE1#show bgp ipv4 unicast
BGP table version is 15, local router ID is 11.11.11.11
Status codes: s suppressed, d damped, h history, * valid, > best, i - 
internal, r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

 Network Next Hop Metric LocPrf Weight Path
*> 0.0.0.0 10.200.0.1 5 32768 ?
*> 10.10.1.0/30 0.0.0.0 0 32768 ?
* 10.10.1.1 0 0 100 ?
*> 10.10.1.4/30 10.10.1.1 0 100 ?
*> 10.200.0.0/24 0.0.0.0 0 32768 ?
*> 99.99.99.0/29 10.200.0.1 2 32768 ?
*> 100.100.100.0/30 10.200.0.1 20 32768 ?
*> 192.168.1.0 10.10.1.1 0 100 ?
*> 192.168.2.0 10.10.1.1 0 100 ?
*> 192.168.3.0 10.10.1.1 0 100 ?
CE1#

Just to double check, we can clear the CE1 BGP session again to make sure that the change sticks:

CE1#
CE1#clear ip bgp *
CE1#
*Mar 1 00:36:04.463: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Down User reset
*Mar 1 00:36:05.283: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Up
CE1#show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 10.200.0.1 to network 0.0.0.0

 100.0.0.0/30 is subnetted, 1 subnets
O E2 100.100.100.0 [110/20] via 10.200.0.1, 00:35:32, FastEthernet0/0
 99.0.0.0/29 is subnetted, 1 subnets
O 99.99.99.0 [110/2] via 10.200.0.1, 00:35:32, FastEthernet0/0
 10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.1.0/30 is directly connected, Serial0/0
B 10.10.1.4/30 [20/0] via 10.10.1.1, 00:00:56
C 10.200.0.0/24 is directly connected, FastEthernet0/0
B 192.168.1.0/24 [20/0] via 10.10.1.1, 00:00:57
B 192.168.2.0/24 [20/0] via 10.10.1.1, 00:00:57
B 192.168.3.0/24 [20/0] via 10.10.1.1, 00:00:57
O*E2 0.0.0.0/0 [110/5] via 10.200.0.1, 00:35:33, FastEthernet0/0
CE1#

Success. The route-map has successfully blocked the OSPF routes from being redistributed into the BGP table. As a result, the route that the BGP router process puts forward is the eBGP route, which wins over OSPF with an AD of 20.

A couple of side points to note: an alternative to this approach is to adjust the redistributed routes so that the BGP Best Path Algorithm selects the eBGP route over the locally redistributed route. We could have done this using a route-map that resets the weight of the redistributed routes to zero and sets the local preference to 95 (below the default of 100). The config would look as follows:

router bgp 65489
 address-family ipv4
 redistribute ospf 1 match internal external 2 route-map 
 LOWER_WEIGHT_AND_PREF
!
route-map LOWER_WEIGHT_AND_PREF permit 10
 set local-preference 95
 set weight 0

However in this network scenario, there is no real reason to redistribute the MPLS sites back into BGP. It is safer to block them entirely.

It’s also prudent to apply this configuration in the opposite direction as well (tag redistributed routes on CE1 and block them on CE2).
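Mirroring the configuration already shown, that would look roughly like this (a sketch using the same tag value; CE1 tags what it redistributes into OSPF, and CE2 refuses to redistribute anything carrying that tag back into BGP):

hostname CE1
!
route-map SET_TAG permit 10
 set tag 10
!
router ospf 1
 redistribute bgp 65489 metric 5 subnets route-map SET_TAG
!
hostname CE2
!
route-map BLOCK_TAG deny 10
 match tag 10
route-map BLOCK_TAG permit 20
!
router bgp 65489
 address-family ipv4
  redistribute ospf 1 match internal external 2 route-map BLOCK_TAG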

And finally, you might have noticed the route-maps applied inbound and outbound on the eBGP sessions in the base config shown above. These are done to avoid routes looping from BGP to OSPF and back into BGP. MPLS solutions often have multiple sites with the same private AS number meaning allowas-in or as-override must be used to bypass BGP loop prevention (whereby a router running BGP will ignore updates for prefixes that have its own AS number in the AS_PATH attribute). This tagging could easily be used on the outbound advertisements, instead of the prefix-lists shown above. Tagging is more dynamic than manually defining the local ranges using prefix-lists.

Finally let’s confirm routing is following the same path inbound and outbound:

LNS#trace vrf CUST_A 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 18/21 Exp 0] 28 msec 36 msec 40 msec
 2 10.10.1.1 [MPLS: Label 21 Exp 0] 36 msec 36 msec 32 msec
 3 10.10.1.2 40 msec 24 msec 32 msec
 4 10.200.0.1 40 msec 48 msec 40 msec
 5 99.99.99.2 88 msec 56 msec 52 msec
 6 100.100.100.2 92 msec 84 msec 72 msec
LNS#
FW#trace 192.168.1.1 source fa0/0

Type escape sequence to abort.
Tracing the route to 192.168.1.1

 1 10.200.0.253 20 msec
 2 10.10.1.1 24 msec
 3 10.1.2.2 [MPLS: Labels 16/20 Exp 0] 60 msec
 4 192.168.1.1 [MPLS: Label 20 Exp 0] 44 msec 20 msec 44 msec
FW#

Looks good. Routing is symmetric and as expected.

There are more ways to solve this problem than I have shown here. Feel free to play around with the lab to see what you can come up with.

Feedback is more than welcome. Let me know if you found this blog useful or interesting. Thanks for reading.