The Friend of my Friend is my Enemy

Imagine you’re a provider routing a PI space prefix for one of your customers. Now imagine that one of your IX peers started to advertise a more specific subnet of that customer network to you. How would and how should you forward traffic destined for that prefix? This quirk looks at just such a scenario from the point of view of an ISP that adheres to BCP38 best practice filtering policies…

The quirk

So here’s the scenario:

Blog7_image1_setup

In this setup Xellent IT Ltd is both a customer and a provider. It provides transit for ACME Consulting but it is a customer of Provider A. ACME owns PI space and chooses to implement some traffic engineering. It advertises a /23 to Xellent IT and a /24 to Provider B.

Now Provider B just happens to peer with Provider A over a public internet exchange. The quirk appears when traffic from the internet, destined to 1.1.1.1/32, enters Provider A’s network, especially when you consider that Provider A implements routing policies that adhere to BCP38.

But first, what is BCP38?

You can read it yourself here, but in short, it is a Best Current Practice document that advocates source address filtering to minimise threats like DDoS attacks. It does this by proposing inbound filtering at the PE on customer connections, blocking traffic whose source address does not belong to a known downstream customer network. Many DDoS attacks rely on spoofed source addresses, so if every provider filtered traffic from its customers to make sure the source address came from the right subnet (and was not spoofed), these kinds of attacks would largely disappear.

To quote the BCP directly:

In other words, if an ISP is aggregating routing announcements for multiple downstream networks, strict traffic filtering should be used to prohibit traffic which claims to have originated from outside of these aggregated announcements.
BCP38 – P. Ferguson, D. Senie

To put it in diagram form, the basic idea is as follows:

Blog7_image3_BCP38_inbound

A provider can also implement outbound filtering to achieve the same result. That is to say, outbound filters can be applied at peering and transit points to ensure that the source addresses of any packets sent out come from within the customer cone of the provider (a customer cone is the set of prefixes sourced by a provider, either as PI or PA space, that makes up the address space for its customer base). This can be done in conjunction with, or instead of, the inbound filtering approach.

Blog7_image4_BCP38_outbound

There are multiple ways a provider can build their network to adhere to BCP38. As an example, an automated tool could be built that references an RIR database like RIPE. This tool could perform recursive route object lookups on all autonomous systems listed in the provider’s AS-SET and build an ACL that blocks all outbound border traffic whose source address is not in that list.
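To make the outbound check concrete, here is a minimal Python sketch of the decision such an ACL makes. The customer cone is hard-coded and purely illustrative: 1.1.0.0/23 is assumed to be ACME’s /23, and 198.51.100.0/24 is a made-up second customer. A real tool would build this list from route objects rather than by hand.

```python
import ipaddress

# Hypothetical customer cone for Provider A (illustrative prefixes only).
CUSTOMER_CONE = [
    ipaddress.ip_network("1.1.0.0/23"),       # assumed to be ACME's /23 via Xellent IT
    ipaddress.ip_network("198.51.100.0/24"),  # made-up second customer
]

def outbound_acl_permits(src, cone=CUSTOMER_CONE):
    """BCP38-style outbound check: permit only if the source is in the cone."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in cone)

if __name__ == "__main__":
    for src in ("1.1.1.10", "203.0.113.7"):
        verdict = "permit" if outbound_acl_permits(src) else "deny"
        print(src, verdict)
```

Only the source address is consulted here. The destination plays no part in the decision, which is what matters later in this quirk.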

Regardless of the method used, this quirk assumes that Provider A is using both inbound and outbound filtering. But as we’ll see, it is the outbound filtering that causes all the trouble… here’s the traffic flow:

Blog7_image2_traffic_blackholing

Now you might ask why the packet would follow this particular path. Isn’t Provider B advertising the more specific /24 it receives from ACME? How come the router that sent the packet to Provider A over the transit link can’t see the /24?

There are a number of reasons for this and it depends on how the network of each Autonomous System along the way is designed. However, one common reason could be a traffic engineering service offered by Internet Providers called prefix scoping.


Prefix scoping allows a customer to essentially tell its provider how to advertise its prefix to the rest of the internet. This is done by including predetermined BGP communities in the prefix advertisements. The provider will recognise these communities and alter how they advertise that prefix to the wider internet. This could be done through something like route-map filtering on these communities.

In this scenario, perhaps Provider B is offering such a service. ACME may have chosen to attach the ‘do not advertise this prefix to your transit provider x’ community to its BGP advertisement to Provider B. As a result, the /24 prefix doesn’t reach the router connecting to Provider A over its transit link, so it forwards according to the /23.

This is just one example of how traffic can end up at Provider A. For now, let’s get back to the life of this packet as it enters Provider A.

Upon receipt of the packet destined for 1.1.1.1/32, Provider A’s border router will look in its routing table to determine the next hop. Because it is more specific, the 1.1.1.0/24 learned over peering will be seen in the RIB as the best path, not the /23 from the Xellent IT link. The packet is placed in an LSP (assuming an MPLS core) with a next hop of the border router that peers with Provider B at the Internet Exchange.
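The longest-prefix-match decision described here can be sketched in a few lines of Python. The two RIB entries mirror the scenario (the /23 is assumed to be 1.1.0.0/23), and the next-hop labels are just descriptive strings, not real router output.

```python
import ipaddress

# Simplified view of Provider A's RIB for ACME's address space.
RIB = {
    ipaddress.ip_network("1.1.0.0/23"): "customer link to Xellent IT",
    ipaddress.ip_network("1.1.1.0/24"): "IX peering with Provider B",
}

def longest_prefix_match(dst, rib=RIB):
    """Return (prefix, next hop) for the most specific matching route, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in rib if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return best, rib[best]

if __name__ == "__main__":
    print(longest_prefix_match("1.1.1.1"))  # falls inside both routes
    print(longest_prefix_match("1.1.0.1"))  # only covered by the /23
```

Traffic for 1.1.1.1 matches both entries, so the /24 learned over peering wins regardless of the fact that the /23 comes from a paying customer.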

You can probably see what’s going to happen. When Provider A’s border router at the Internet Exchange tries to forward the packet to Provider B it has to pass through an outbound ACL. This ACL has been built in accordance with BCP38. The ACL simply checks the source address to make sure it is from within the customer cone of Provider A. Since the source address is an unknown public address sourced from off-net, the packet is dropped.

Now this is inherently a good thing isn’t it? Without this filtering, Provider A would be providing transit for free! However, it does pose a problem after all, since traffic for one of its customers’ subnets is being blackholed.

From here, ACME Consulting gets complaints from its customers that they can’t access their webserver. ACME contacts its transit providers and before you know it, an engineer at Provider B has done a traceroute and calls Provider A to ask why the final hop in the failed trace ends in Provider A’s network.

So where to from here? What should Provider A do? It doesn’t want to provide transit for free, and its policy states that BCP38 filtering must be in place. Let’s explore the options.

The Search

Before I look at the options available, it is worth pausing here to reference an excellent paper by Pierre Francois of the Université catholique de Louvain entitled Exploiting BGP Scoping Services to Violate Internet Transit Policies. It can be read here, and it describes the principles underlying this quirk at a higher, more conceptual level, which sheds light on why this is happening. I won’t go into exhaustive detail (I highly recommend reading the paper yourself), but to summarise, there are three conditions that come together to cause this problem.

  1. The victim Provider whose policy is violated (Provider A) receives the more specific prefix from only peers or transit providers.
  2. The victim Provider also has a customer path towards the less specific prefix.
  3. Some of the victim Provider’s peers or transit providers did not receive the more specific path.

This is certainly what is happening here. Provider A sees a /24 from its peer (condition 1), a /23 from its customer (condition 2) and the Transit router that forwards the packet to Provider A cannot see the /24 (condition 3). The result of these conditions is that the packet is being forwarded from AS to AS based on a combination of the more specific route and the less specific route. To quote directly from Francois’ paper:

The scoping being performed on a more specific prefix might no longer let routing information for the specific prefix be spread to all ASes of the routing system. In such cases, some ASes will route traffic falling to the range of the more specific prefix, p, according to the routing information obtained for the larger range covering it, P.
Exploiting BGP Scoping Services to Violate
Internet Transit Policies – Pierre Francois
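As a rough way of reasoning about the three conditions listed earlier, they can be expressed as a small Python check. This is only my own sketch of the logic, not something taken from the paper; the function name and the "customer", "peer" and "transit" labels are assumptions chosen for illustration.

```python
def policy_violation_possible(specific_sources, covering_sources,
                              neighbours_missing_specific):
    """Evaluate the three conditions for a (more specific, covering) prefix pair.

    specific_sources: set of relationship types the more specific was learned from.
    covering_sources: set of relationship types the less specific was learned from.
    neighbours_missing_specific: number of peers/transits lacking the more specific.
    """
    # 1. The more specific is received only from peers or transit providers.
    cond1 = bool(specific_sources) and specific_sources <= {"peer", "transit"}
    # 2. There is a customer path towards the less specific prefix.
    cond2 = "customer" in covering_sources
    # 3. At least one peer or transit provider did not receive the more specific.
    cond3 = neighbours_missing_specific > 0
    return cond1 and cond2 and cond3

if __name__ == "__main__":
    # Provider A: /24 from a peer, /23 from a customer, one upstream blind to the /24.
    print(policy_violation_possible({"peer"}, {"customer"}, 1))
```

If the customer also advertised the more specific, the first condition would fail and the traffic would stay on a customer path.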

So what options does Provider A have? How can it ensure that traffic isn’t dropped, but at the same time, make sure it can’t be abused into providing free transit for off-net traffic? Well there’s no easy answer but there are several solutions that I’ll consider:

  • Blocking the more specific route from the peer
  • Asking Xellent IT Ltd to advertise the more specific
  • Allowing the transit traffic, but with some conditions

I’ll try to argue that allowing the transit traffic, but only as an exception, is the best course of action. But before that, let’s look at the first two options.

Let’s say Provider A applies an inbound route-map on its peering with Provider B (and all other peers and transits for that matter) to block any advertised prefixes that come from its own customer cone (basically, stopping its own prefixes being advertised towards itself from a non-customer). So Provider A would see Provider B advertising 1.1.1.0/24, recognise it as part of Xellent IT’s supernet and block it.

This would certainly solve the problem of attempting to forward the traffic out of the Internet Exchange. Unfortunately, there are two crushing flaws with this approach.

Firstly, it undermines the intended traffic engineering employed by ACME and comes with all the inherent problems that asymmetric routing holds. For example, traffic ingressing back into ACME via Xellent IT could get dropped by a session-based firewall that it didn’t go through on its way out. Asymmetric routing is a perfect example of the problems that can result from some ASes forwarding on the more specific route and others forwarding on the less specific route.

Second, consider what happens if the link to Xellent IT goes down, or if Xellent IT stops advertising the /23. Suddenly Provider A has no access to the /24 network. Provider A is, in essence, relying on a customer to access part of the internet (this is of course assuming Provider A is not relying on any default routing). This would not only undermine the dual homing of ACME, but would also stop Provider A’s other customers reaching ACME’s services.

Blog7_image5_block_24 

Clearly, blocking the more specific and forwarding on the less specific doesn’t solve anything. Traffic might get through Provider A, but it is still being forwarded on a combination of prefix lengths, and Provider A could end up preventing traffic from its other customers reaching a part of the internet. Not a good look for an internet provider.

What about asking Xellent IT to advertise the more specific? Provider A could then simply prefer the /24 from Xellent IT using local preference. This approach has problems too. ACME isn’t actually advertising the /24 to Xellent IT. Xellent IT would need to ask ACME to do so, however they may not wish to impose such a restriction on their customer. The question then becomes, does Provider A have the right to make such a request? They certainly can’t enforce it.

There is perhaps a legal argument to be made that, because the more specific is not being advertised to it, Provider A is losing revenue. This will be illustrated when we look at the third option of allowing off-net traffic. I won’t broach the topic of whether or not Provider A could approach Xellent IT and ask for advertisement of the more specific due to revenue loss, but it is certainly food for thought. For now though, asking Xellent IT to advertise the more specific is perhaps not the preferred approach.

Let’s turn to the third option, which sees Provider A adjust its border policies by adding to its BCP38 ACL. Not only should this ACL permit traffic with source addresses from its customer cone, it should also permit traffic that is destined to prefixes in its customer cone. The idea looks like this:

Blog7_image6_allow_offnet
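In code terms, the adjusted policy amounts to adding a destination check alongside the source check. The following Python sketch assumes, for illustration only, that Provider A’s customer cone is just ACME’s /23 (taken to be 1.1.0.0/23); the function names are my own.

```python
import ipaddress

# Illustrative customer cone: only ACME's /23 (assumed to be 1.1.0.0/23).
CUSTOMER_CONE = [ipaddress.ip_network("1.1.0.0/23")]

def in_cone(address, cone=CUSTOMER_CONE):
    """True if the address falls inside any customer cone prefix."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in cone)

def relaxed_outbound_acl_permits(src, dst):
    """Permit if the source OR the destination is inside the customer cone."""
    return in_cone(src) or in_cone(dst)

if __name__ == "__main__":
    print(relaxed_outbound_acl_permits("203.0.113.7", "1.1.1.1"))  # off-net to ACME
    print(relaxed_outbound_acl_permits("203.0.113.7", "8.8.8.8"))  # pure transit
```

The second clause is the only change from a strict source-only ACL, and it is also the clause that creates the room for abuse.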

Now this might look ok. Off-net transit traffic to random public addresses (outside of Provider A’s customer cone) is still blocked, and ACME’s traffic isn’t. But this special case of off-net transit opens the door for abuse in a way that could cause Provider A to lose money.

Here’s how it works. For the sake of this explanation, I’ve removed Xellent IT and made ACME a direct customer of Provider A. I’ve also introduced a third service provider.

Blog7_image7_abuse_potential

  • ACME dual homes itself by buying transit from Providers A and B. Provider A happens to charge more.
  • ACME advertises its /23 PI space to Provider A.
  • Its /24 is then advertised to Provider B, with a prefix scoping attribute that tells Provider B not to advertise the /24 on to any transit providers.
  • As a result of this, Provider C cannot see the more specific /24. Traffic from Provider C traverses Provider A, then Provider B before arriving at ACME.

Blog7_image7_abuse_potential_2

As we’ve already discussed, this violates BCP38 principles and turns Provider A into free transit for off-net traffic. But of perhaps greater importance is the loss of revenue that Provider A experiences. No one is paying for the increased traffic volume across Provider A’s core and Provider A gains no revenue from the increase, since it only crosses free peering boundaries. Provider B benefits as it sees more chargeable bandwidth used on its downstream link to ACME. ACME benefits since it can use the cheaper connection and utilize Provider A’s peering and transit relationships for free. If ACME had a remote site connecting to Provider C, GRE tunnels across Provider A’s core could further complicate things.

If ACME was clever enough and used looking glasses and other tools to discover the forwarding path, then there clearly is potential for abuse.

Having said all of that, I would argue that if this is done on a case by case basis, in a reactive way, it would be an acceptable solution.

For example, in this scenario, as long as traffic flows don’t reach too high a volume (something that can be monitored using something like netflow) and only this single subnet is permitted, then for the sake of maintaining network reachability, this is a reasonable exception. It is unlikely that ACME is being deliberately malicious, and as long as this exception is monitored, the revenue loss would be minuscule and allowing a one-off policy violation would seem to be acceptable.
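A minimal sketch of that idea in Python is shown here, assuming flow records have already been exported and reduced to (source, destination, bytes) tuples. The record format, the function names and the byte limit are all hypothetical; this is not the output format of any particular netflow collector.

```python
import ipaddress

# Illustrative values: the cone is ACME's /23, the exception is the single /24.
CUSTOMER_CONE = [ipaddress.ip_network("1.1.0.0/23")]
EXCEPTION_PREFIXES = [ipaddress.ip_network("1.1.1.0/24")]

def _matches(address, prefixes):
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in prefixes)

def is_exception_flow(src, dst):
    """Off-net source heading to an explicitly excepted prefix."""
    return not _matches(src, CUSTOMER_CONE) and _matches(dst, EXCEPTION_PREFIXES)

def exception_volume(flows):
    """Total bytes carried under the exception; flows are (src, dst, bytes)."""
    return sum(nbytes for src, dst, nbytes in flows if is_exception_flow(src, dst))

def needs_review(flows, limit_bytes):
    """Flag the exception once its volume exceeds an agreed limit."""
    return exception_volume(flows) > limit_bytes

if __name__ == "__main__":
    sample = [
        ("203.0.113.7", "1.1.1.1", 4000),   # off-net to the excepted /24
        ("1.1.0.9", "8.8.8.8", 9000),       # ordinary customer traffic
        ("203.0.113.8", "1.1.1.20", 2500),  # off-net to the excepted /24
    ]
    print(exception_volume(sample), needs_review(sample, limit_bytes=5000))
```

Keeping the exception list separate from the customer cone means each permitted subnet stays visible and can be removed again without touching the main BCP38 ACL.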

Rather than try and account for these scenarios beforehand, the goal would be to add exceptions and monitor them as they crop up. There are a number of ways to detect when these policy violations occur. In this case, the phone call and traceroute from Provider B is a good way to spot the problem. Regrettably that does require something to go wrong for it to be found and fixed (meaning a disrupted service for the customer). There are ways to detect these violations a priori, but I won’t detail them here. Francois’ paper presents the option of using an open-source IP accounting tool like pmacct, which is worth reading about.

If off-net transit traffic levels increase, or more policy violations start to appear, more aggressive tactics might need to be looked at. Though for this particular quirk, allowing the transit traffic as an exception and monitoring its throughput seems to me to be a prudent approach.

Because I’ve spoken about this at a very high level, I won’t include a work section with CLI output. I could show an ACL permitting 1.1.1.0/24 outbound but this quirk doesn’t need that level of detail to understand the concepts.

So that’s it! A really fascinating conundrum that is as interesting to figure out as it is to troubleshoot. I’d love to hear if anyone has any thoughts or possible alternatives. I toyed with the idea of using static routing at the PE facing the customer or assigning a community to routes received from peering that are in your customer cone and reacting to that somehow, but both those ideas ran into similar problems to the ones I’ve outlined above. Let me know if you have any other ideas. Thanks for reading.

Asymmetric routing caused by unfiltered redistribution

This quirk demonstrates how the different administrative distances of BGP, combined with the Best Path Selection algorithm, can cause asymmetric routing if redistribution isn’t done carefully.

As a reminder, each blog will follow 3 sections: The quirk, the search and the work. The quirk describes the problem, the search shows how a solution was reached and the work shows the technical and CLI aspects.

The quirk

The scenario we will be looking at is as follows:

blog2_image1_base_setup

The network consists of an MPLS core with multiple remote sites (only one is shown here). There is a dual homed breakout site, which passes through a firewall (performing security and address translation services as normal) and onwards to an internet facing WAN connection.

A default route is learned over eBGP from the Provider Edge router (PE4) connected to the internet facing Customer Edge router (CE4). This is redistributed into OSPF. The MPLS facing Customer Edge routers (CE1 and CE2) redistribute OSPF into BGP using the redistribute ospf 1 match internal external 2 command. The default and local 10.200.0.0/24 routes are advertised to the Provider Edge Routers (PE1 and PE2) and into the MPLS core. PE1 gives the routes received from CE1 a local preference of 200 making this WAN link preferred.

So that the breakout firewall has a path back to the MPLS sites, every MPLS site’s range is advertised through eBGP into the MPLS core before being sent to CE1 and CE2 and redistributed into OSPF.

The quirk comes into play when you consider that, at this stage, no filtering of any kind is applied to the redistribution. Combine that with the order in which the BGP sessions of CE1 and CE2 establish and we quickly see problems with return traffic from the internet headed back to an MPLS site.

Consider the following sequence of events:

  1. CE2 establishes its eBGP neighborship to PE2 before CE1 establishes its session to PE1. CE2 learns about the MPLS LAN ranges from PE2. These eBGP learned routes have an AD of 20.
  2. CE2 redistributes these eBGP prefixes into the OSPF link state database (LSDB).
  3. CE1 receives the Type 5 LSAs and installs these prefixes into its RIB. These OSPF prefixes have an AD of 110.
  4. Without filtering, CE1 will redistribute these into BGP. BGP will give them a weight of 32,768 (because they are redistributed and thus locally sourced). Another, and sometimes overlooked, aspect is that these locally generated routes will be given an AD of 200.
  5. CE1 now establishes its neighborship to PE1 and receives the prefixes for the MPLS sites over eBGP (just as CE2 did). These eBGP prefixes are installed in the BGP RIB and have an AD of 20. They have a weight of 0 since they are learned from a neighbor.
  6. Now CE1 has to choose the best path back to any given MPLS site. One might think that the decision is easy, by comparing Administrative Distances. CE1 knows about the MPLS sites through eBGP and OSPF. eBGP’s AD is 20. OSPF’s AD is 110. Therefore eBGP should win, right? Not quite. When a router receives paths to a given destination from multiple routing sources, it uses the Administrative Distance to judge the trustworthiness of the protocol, with the lowest one being most trusted. But what needs to be considered here is that each routing protocol will put its best route forward to be considered… and in the case of BGP this could result in routes with different ADs. Let’s follow what happens:

OSPF has only one E2 route, which has an AD of 110. So OSPF puts this forward.

However the BGP Router process has two options to choose from. It runs through the BGP Best Path Selection Algorithm to decide (for a reminder of its steps take a look at this document).

It doesn’t get very far before a decision is made. In fact, it is on the first step! The route redistributed from OSPF has a weight of 32,768 whereas the one learned from its eBGP neighbor has a weight of 0. Higher weight wins, so BGP selects the prefix that was learned through redistribution and puts it forward. Remember this route has an AD of 200…

CE1 looks at its options and chooses the routing source with the lowest AD, which in this case is OSPF. As a result the OSPF route is installed in the IP RIB.

  7. CE1 does not even redistribute its eBGP learned prefixes into OSPF. Redistribution takes place from the IP RIB and there are no BGP routes in there.
  8. Because of this, the breakout firewall only sees routes for the MPLS sites from CE2 and sets CE2 as the next hop.

From here, we can see that traffic leaving a remote MPLS site destined for the internet, will go out via the primary CE1-PE1 link. However return traffic will go back via the CE2-PE2 link.
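The two-stage decision on CE1 described in the sequence above can be modelled with a short Python sketch. It is deliberately simplified: only the first step of BGP best path selection (highest weight) is modelled, and the class and function names are my own rather than anything from IOS.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    protocol: str
    next_hop: str
    admin_distance: int
    weight: int = 0  # only meaningful for BGP paths

def bgp_best_path(paths):
    """First step of best path selection only: highest weight wins."""
    return max(paths, key=lambda p: p.weight)

def install_in_rib(candidates):
    """Each protocol offers one route; the lowest administrative distance wins."""
    return min(candidates, key=lambda c: c.admin_distance)

# CE1's view of one MPLS site prefix, e.g. 192.168.1.0/24.
ospf_e2 = Candidate("ospf", "10.200.0.253", admin_distance=110)
bgp_from_pe1 = Candidate("bgp", "10.10.1.1", admin_distance=20, weight=0)
bgp_redistributed = Candidate("bgp", "10.200.0.253", admin_distance=200, weight=32768)

bgp_choice = bgp_best_path([bgp_from_pe1, bgp_redistributed])
rib_choice = install_in_rib([ospf_e2, bgp_choice])

if __name__ == "__main__":
    print("BGP offers:", bgp_choice)
    print("RIB installs:", rib_choice)
```

With both BGP paths present, BGP offers the AD 200 route and OSPF wins. If the redistributed path is taken out of the list passed to `bgp_best_path`, the eBGP path with AD 20 is offered instead and beats OSPF.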

blog2_image2_traffic_path

Of course if CE1 establishes its BGP session first this is not an issue, however that is far from ideal. We needed to look at a way to either make sure CE1 brings up its BGP session first, prevent CE1 from learning routes from CE2, or prevent the redistribution back into BGP from OSPF.

 

The search

There are a number of ways to tackle this issue. Some better than others.

One possible approach would be to try to make sure that CE1 was always the first to bring its BGP peering up… or rather, to make sure that CE2 clears its BGP session if it detects CE1 bringing its BGP neighborship up. The following EEM script, configured on CE2, was used to test this idea:

event manager applet lanprimarywan
 event track 123 state up
 action 1.0 syslog msg "START_EEM_SCRIPT1: Soft clears BGP relationship when Primary Routers WAN link comes up"
 action 2.0 cli command "event timer countdown time 60"
 action 3.0 cli command "enable"
 action 4.0 cli command "clear ip bgp 10.10.1.6"
 action 5.0 cli command "end"
 action 6.0 syslog msg "BGP clear by EEM"
!
ip route 10.10.1.1 255.255.255.255 10.200.0.252
!
track 123 ip sla 123
!
ip sla 123
 icmp-echo 10.10.1.1 source-ip 10.200.0.253
 frequency 10
ip sla schedule 123 life forever start-time now
!

In short, CE2 would track the PE1 WAN interface. A static route has been included to make sure that it tracks it by going through CE1 (rather than its WAN connection). If this tracking object came up, CE2 would clear its BGP session. There is a delay timer put into the script to allow a minute for CE1 to bring up its BGP session.

There is a major problem with this approach however. Just because the WAN link is up doesn’t mean the PE1-CE1 BGP neighborship is up. The neighborship could drop for some other reason, without the link failing. If this happened CE2 would never clear its BGP session.

Plus, even if the tracking worked as expected, it might be deemed too disruptive to hard clear a BGP session for such an important site. As we will see, there are better options available.

A second possible approach involves preventing CE1 from learning any OSPF routes from CE2. This can be accomplished using a distribute-list. A distribute-list sits between the Shortest Path First calculation and the IP routing table. It doesn’t stop prefixes from entering the LSDB or affect the best route OSPF chooses. But it will prevent routes moving from the LSDB to the IP Routing table. If a distribute-list is applied inbound and allows only the local LAN ranges and the default route, then the MPLS site prefixes will never enter CE1’s IP RIB from OSPF. Since redistribution is performed from the IP RIB, they will never show up in the BGP table.

The configuration would look as follows:

router ospf 1
 redistribute bgp 65489 metric 10 subnets
 network 10.200.0.0 0.0.0.255 area 0
 distribute-list prefix LOCALS_AND_DEFAULT in
!
ip prefix-list LOCALS_AND_DEFAULT seq 5 permit 0.0.0.0/0
ip prefix-list LOCALS_AND_DEFAULT seq 10 permit 10.200.0.0/24
ip prefix-list LOCALS_AND_DEFAULT seq 100 deny 0.0.0.0/0 le 32

This configuration works just fine, but there is a third option that allows for a cleaner approach.

This third option, outlined in the work section below, uses tagging and route-map filtering.

When prefixes are advertised to CE2 over eBGP and redistributed into OSPF, we can tag the prefixes. We can then configure a route-map on CE1 that only allows prefixes that do not have this tag to be redistributed into BGP.
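Before getting to the router configuration, the tag-and-filter idea can be sketched in Python. The tag value 10 matches the one used in the work section; the function names and the dictionary representation of a route are purely illustrative.

```python
TAG = 10  # tag applied by CE2 when redistributing BGP into OSPF

def ce2_redistribute_bgp_into_ospf(bgp_prefixes, tag=TAG):
    """CE2 side: every eBGP-learned prefix becomes a tagged OSPF external route."""
    return [{"prefix": prefix, "tag": tag} for prefix in bgp_prefixes]

def ce1_redistribute_ospf_into_bgp(ospf_routes, blocked_tag=TAG):
    """CE1 side: redistribute only OSPF routes that do not carry the tag."""
    return [r["prefix"] for r in ospf_routes if r.get("tag") != blocked_tag]

if __name__ == "__main__":
    mpls_sites = ["192.168.1.0/24", "192.168.2.0/24", "192.168.3.0/24"]
    from_ce2 = ce2_redistribute_bgp_into_ospf(mpls_sites)
    # Routes CE1 also learns via OSPF from the firewall side (untagged).
    untagged = [{"prefix": "0.0.0.0/0"}, {"prefix": "99.99.99.0/29"}]
    print(ce1_redistribute_ospf_into_bgp(from_ce2 + untagged))
```

Because the filter keys on the tag rather than on a list of prefixes, new MPLS sites are handled automatically without having to update a prefix-list on CE1.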

Let’s explore the configuration of how this would be achieved.

 

The work

For this scenario I have built a GNS3 lab that looks as follows (this is available for download from the GNS3 page):

gns3_mpls_breakout_bgp_and_ospf_lab_7

Three MPLS sites are represented by loopbacks on the router named LNS (an L2TP Network Server in name only; it is simply a 3725 running BGP and MPLS). The ranges for these MPLS sites are 192.168.1.0/24, 192.168.2.0/24 and 192.168.3.0/24. A loopback with IP 50.50.50.50/32 on the INTERNET router (the cloud image) is used to simulate a public IP.

Here is the base configuration for CE1 and CE2 as far as OSPF and BGP are concerned:

hostname CE1
!
router ospf 1
 router-id 11.11.11.11
 log-adjacency-changes
 redistribute bgp 65489 metric 5 subnets
 network 10.200.0.0 0.0.0.255 area 0
!
router bgp 65489
 bgp log-neighbor-changes
 neighbor 10.10.1.1 remote-as 100
!
 address-family ipv4
  redistribute connected
  redistribute static
  redistribute ospf 1 match internal external 2
  neighbor 10.10.1.1 activate
  neighbor 10.10.1.1 allowas-in
  neighbor 10.10.1.1 soft-reconfiguration inbound
  neighbor 10.10.1.1 route-map BLOCK_LOCALS_AND_DEFAULT in
  neighbor 10.10.1.1 route-map ALLOW_LOCALS_AND_DEFAULT out
  default-information originate
  no auto-summary
  no synchronization
 exit-address-family
!
ip prefix-list LOCALS_AND_DEFAULT seq 5 permit 0.0.0.0/0
ip prefix-list LOCALS_AND_DEFAULT seq 10 permit 10.200.0.0/24
ip prefix-list LOCALS_AND_DEFAULT seq 100 deny 0.0.0.0/0 le 32
!
route-map BLOCK_LOCALS_AND_DEFAULT deny 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!
route-map BLOCK_LOCALS_AND_DEFAULT permit 20
!
route-map ALLOW_LOCALS_AND_DEFAULT permit 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!

hostname CE2
!
router ospf 1
 router-id 22.22.22.22
 log-adjacency-changes
 redistribute bgp 65489 metric 10 subnets
 network 10.200.0.0 0.0.0.255 area 0
!
router bgp 65489
 bgp log-neighbor-changes
 neighbor 10.10.1.5 remote-as 100
!
 address-family ipv4
  redistribute connected
  redistribute static
  redistribute ospf 1 match internal external 2
  neighbor 10.10.1.5 activate
  neighbor 10.10.1.5 allowas-in
  neighbor 10.10.1.5 soft-reconfiguration inbound
  neighbor 10.10.1.5 route-map BLOCK_LOCALS_AND_DEFAULT in
  neighbor 10.10.1.5 route-map ALLOW_LOCALS_AND_DEFAULT out
  default-information originate
  no auto-summary
  no synchronization
 exit-address-family
!
ip prefix-list LOCALS_AND_DEFAULT seq 5 permit 0.0.0.0/0
ip prefix-list LOCALS_AND_DEFAULT seq 10 permit 10.200.0.0/24
ip prefix-list LOCALS_AND_DEFAULT seq 100 deny 0.0.0.0/0 le 32
!
route-map BLOCK_LOCALS_AND_DEFAULT deny 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!
route-map BLOCK_LOCALS_AND_DEFAULT permit 20
!
route-map ALLOW_LOCALS_AND_DEFAULT permit 10
 match ip address prefix-list LOCALS_AND_DEFAULT
!

Note that CE1 has a lower cost when redistributing routes. This is to ensure the breakout firewall will prefer going via CE1 given the option.

Let’s clear the BGP neighborship of CE1 and see what routes it selects:

CE1#clear ip bgp *
CE1#
*Mar 1 00:06:52.471: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Down User reset
CE1#
*Mar 1 00:06:53.759: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Up
CE1#
CE1#sh ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 10.200.0.1 to network 0.0.0.0

 100.0.0.0/30 is subnetted, 1 subnets
O E2 100.100.100.0 [110/20] via 10.200.0.1, 00:05:46, FastEthernet0/0
 99.0.0.0/29 is subnetted, 1 subnets
O 99.99.99.0 [110/2] via 10.200.0.1, 00:05:46, FastEthernet0/0
 10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.1.0/30 is directly connected, Serial0/0
B 10.10.1.4/30 [20/0] via 10.10.1.1, 00:00:21
C 10.200.0.0/24 is directly connected, FastEthernet0/0
O E2 192.168.1.0/24 [110/10] via 10.200.0.253, 00:00:24, FastEthernet0/0
O E2 192.168.2.0/24 [110/10] via 10.200.0.253, 00:00:24, FastEthernet0/0
O E2 192.168.3.0/24 [110/10] via 10.200.0.253, 00:00:24, FastEthernet0/0
O*E2 0.0.0.0/0 [110/5] via 10.200.0.1, 00:05:47, FastEthernet0/0
CE1#

So currently CE1 is preferring its E2 OSPF routes to reach the MPLS sites. When tracing from a remote MPLS site we see that it takes an outbound path across the PE1-CE1 link:

LNS#trace vrf CUST_A 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 18/21 Exp 0] 44 msec 48 msec 44 msec
 2 10.10.1.1 [MPLS: Label 21 Exp 0] 36 msec 32 msec 36 msec
 3 10.10.1.2 36 msec 32 msec 36 msec
 4 10.200.0.1 36 msec 84 msec 36 msec
 5 99.99.99.2 100 msec 96 msec 56 msec
 6 100.100.100.2 32 msec 80 msec 80 msec
LNS#

However the path back from the firewall traverses the CE2-PE2 link:

FW#trace 192.168.1.1

Type escape sequence to abort.
Tracing the route to 192.168.1.1

 1 10.200.0.253 32 msec 28 msec 8 msec
 2 10.10.1.5 36 msec 36 msec 36 msec
 3 10.1.2.2 [MPLS: Labels 16/20 Exp 0] 56 msec 40 msec 44 msec
 4 192.168.1.1 [MPLS: Label 20 Exp 0] 84 msec 60 msec 88 msec
FW#

The BGP table of CE1 helps to show what is happening:

CE1#sh bgp ipv4 unicast
BGP table version is 12, local router ID is 11.11.11.11
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          10.200.0.1               5         32768 ?
*> 10.10.1.0/30     0.0.0.0                  0         32768 ?
*                   10.10.1.1                0             0 100 ?
*> 10.10.1.4/30     10.10.1.1                              0 100 ?
*> 10.200.0.0/24    0.0.0.0                  0         32768 ?
*> 99.99.99.0/29    10.200.0.1               2         32768 ?
*> 100.100.100.0/30 10.200.0.1              20         32768 ?
*  192.168.1.0      10.10.1.1                              0 100 ?
*>                  10.200.0.253            10         32768 ?
*  192.168.2.0      10.10.1.1                              0 100 ?
*>                  10.200.0.253            10         32768 ?
*  192.168.3.0      10.10.1.1                              0 100 ?
*>                  10.200.0.253            10         32768 ?
CE1#

We can see that there are two paths to each MPLS site. The reason for its best path selection becomes clear when taking a closer look at one of the prefixes:

CE1#sh bgp ipv4 unicast 192.168.1.0/24
BGP routing table entry for 192.168.1.0/24, version 4
Paths: (2 available, best #2, table Default-IP-Routing-Table)
 Not advertised to any peer
 100, (received & used)
 10.10.1.1 from 10.10.1.1 (1.1.1.1)
 Origin incomplete, localpref 100, valid, external
 Local
 10.200.0.253 from 0.0.0.0 (11.11.11.11)
 Origin incomplete, metric 10, localpref 100, weight 32768, valid, 
 sourced, best
CE1#

The path via 10.10.1.1 has a weight of 0 since it is learned from an eBGP neighbor (as the word external implies). The path via CE2 is locally sourced (as the word sourced and the Local AS path imply) and has a weight of 32,768. Because of this, the second path, which has AD 200, is chosen as the best path and ultimately loses out to OSPF.

Now let’s look at fixing this using route-maps and tagging. The first step is to configure CE2 to tag any eBGP routes that it redistributes into OSPF with tag 10.

CE2#conf t
Enter configuration commands, one per line. End with CNTL/Z.
CE2(config)#route-map SET_TAG permit 10
CE2(config-route-map)#set tag 10
CE2(config-route-map)#exit
CE2(config)#router ospf 1
CE2(config-router)#redistribute bgp 65489 metric 10 subnets route-map SET_TAG

The OSPF LSDB now reflects this change:

CE2#sh ip ospf database external 192.168.1.0

 OSPF Router with ID (22.22.22.22) (Process ID 1)

 Type-5 AS External Link States

 LS age: 57
 Options: (No TOS-capability, DC)
 LS Type: AS External Link
 Link State ID: 192.168.1.0 (External Network Number )
 Advertising Router: 22.22.22.22
 LS Seq Number: 8000000C
 Checksum: 0xCE04
 Length: 36
 Network Mask: /24
 Metric Type: 2 (Larger than any link state path)
 TOS: 0
 Metric: 10
 Forward Address: 0.0.0.0
 External Route Tag: 10

CE2#

The next task is to configure CE1 to block redistribution for anything that has tag 10:

CE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
CE1(config)#route-map BLOCK_TAG deny 10
CE1(config-route-map)#match tag 10
CE1(config-route-map)#route-map BLOCK_TAG permit 20
CE1(config-route-map)#exit
CE1(config)#router bgp 65489
CE1(config-router)#redistribute ospf 1 match internal external 2 route-map BLOCK_TAG

The effect is immediate. CE1 now prefers the path over eBGP:

CE1#sh ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 10.200.0.1 to network 0.0.0.0

 100.0.0.0/30 is subnetted, 1 subnets
O E2 100.100.100.0 [110/20] via 10.200.0.1, 00:31:44, FastEthernet0/0
 99.0.0.0/29 is subnetted, 1 subnets
O 99.99.99.0 [110/2] via 10.200.0.1, 00:31:44, FastEthernet0/0
 10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.1.0/30 is directly connected, Serial0/0
B 10.10.1.4/30 [20/0] via 10.10.1.1, 00:26:19
C 10.200.0.0/24 is directly connected, FastEthernet0/0
B 192.168.1.0/24 [20/0] via 10.10.1.1, 00:00:26
B 192.168.2.0/24 [20/0] via 10.10.1.1, 00:00:26
B 192.168.3.0/24 [20/0] via 10.10.1.1, 00:00:26
O*E2 0.0.0.0/0 [110/5] via 10.200.0.1, 00:31:45, FastEthernet0/0
CE1#

In addition to this, there is now only one route for the MPLS sites in the BGP RIB:

CE1#show bgp ipv4 unicast
BGP table version is 15, local router ID is 11.11.11.11
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

 Network Next Hop Metric LocPrf Weight Path
*> 0.0.0.0 10.200.0.1 5 32768 ?
*> 10.10.1.0/30 0.0.0.0 0 32768 ?
* 10.10.1.1 0 0 100 ?
*> 10.10.1.4/30 10.10.1.1 0 100 ?
*> 10.200.0.0/24 0.0.0.0 0 32768 ?
*> 99.99.99.0/29 10.200.0.1 2 32768 ?
*> 100.100.100.0/30 10.200.0.1 20 32768 ?
*> 192.168.1.0 10.10.1.1 0 100 ?
*> 192.168.2.0 10.10.1.1 0 100 ?
*> 192.168.3.0 10.10.1.1 0 100 ?
CE1#

Just to double check, we can clear the CE1 BGP session again to make sure that the change sticks:

CE1#
CE1#clear ip bgp *
CE1#
*Mar 1 00:36:04.463: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Down User reset
*Mar 1 00:36:05.283: %BGP-5-ADJCHANGE: neighbor 10.10.1.1 Up
CE1#show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 10.200.0.1 to network 0.0.0.0

 100.0.0.0/30 is subnetted, 1 subnets
O E2 100.100.100.0 [110/20] via 10.200.0.1, 00:35:32, FastEthernet0/0
 99.0.0.0/29 is subnetted, 1 subnets
O 99.99.99.0 [110/2] via 10.200.0.1, 00:35:32, FastEthernet0/0
 10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.1.0/30 is directly connected, Serial0/0
B 10.10.1.4/30 [20/0] via 10.10.1.1, 00:00:56
C 10.200.0.0/24 is directly connected, FastEthernet0/0
B 192.168.1.0/24 [20/0] via 10.10.1.1, 00:00:57
B 192.168.2.0/24 [20/0] via 10.10.1.1, 00:00:57
B 192.168.3.0/24 [20/0] via 10.10.1.1, 00:00:57
O*E2 0.0.0.0/0 [110/5] via 10.200.0.1, 00:35:33, FastEthernet0/0
CE1#

Success. The route-map has blocked the OSPF routes from being redistributed into the BGP table. As a result, the route that the BGP router process puts forward is the eBGP route, which has an AD of 20 and therefore wins over OSPF (AD 110).

A couple of side points to note: An alternative to this approach is to adjust the attributes of the redistributed routes so that the BGP best path algorithm selects the eBGP route over the locally redistributed one. We could have done this using a route-map that resets the weight of the redistributed routes to zero and sets the local preference to 95 (below the default of 100). The config would look as follows:

router bgp 65489
 address-family ipv4
 redistribute ospf 1 match internal external 2 route-map LOWER_WEIGHT_AND_PREF
!
route-map LOWER_WEIGHT_AND_PREF permit 10
 set local-preference 95
 set weight 0

However, in this network scenario there is no real reason to redistribute the MPLS sites back into BGP. It is safer to block them entirely.

It’s also prudent to apply this configuration in the opposite direction (tag redistributed routes on CE1 and block them on CE2).
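
As a rough sketch of that reverse direction, the config below simply mirrors the commands used earlier. It assumes CE1 also redistributes BGP into OSPF process 1 and that both CEs sit in AS 65489, as the redistribute statements above suggest. Reusing tag 10 and the route-map names SET_TAG and BLOCK_TAG is my own choice for illustration, and the metric of 10 is a placeholder copied from the CE2 example rather than taken from the lab, so adjust these to match your base config:

! CE1: tag eBGP routes as they are redistributed into OSPF
! (tag value and metric are assumptions mirrored from the CE2 example)
route-map SET_TAG permit 10
 set tag 10
!
router ospf 1
 redistribute bgp 65489 metric 10 subnets route-map SET_TAG
!
! CE2: refuse to redistribute anything carrying that tag back into BGP
route-map BLOCK_TAG deny 10
 match tag 10
route-map BLOCK_TAG permit 20
!
router bgp 65489
 redistribute ospf 1 match internal external 2 route-map BLOCK_TAG

Using the same tag on both routers keeps the policy symmetric: whichever CE injects a BGP route into OSPF, the other CE will decline to redistribute it back into BGP.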

And finally, you might have noticed the route-maps applied inbound and outbound on the eBGP sessions in the base config shown above. These are there to stop routes looping from BGP to OSPF and back into BGP. MPLS solutions often have multiple sites with the same private AS number, meaning allowas-in or as-override must be used to bypass BGP loop prevention (whereby a router running BGP will ignore updates for prefixes that have its own AS number in the AS_PATH attribute). Tagging could easily be used on the outbound advertisements instead of the prefix-lists shown above, and it is more dynamic than manually defining the local ranges using prefix-lists.

Lastly, let’s confirm that routing follows the same path inbound and outbound:

LNS#trace vrf CUST_A 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 18/21 Exp 0] 28 msec 36 msec 40 msec
 2 10.10.1.1 [MPLS: Label 21 Exp 0] 36 msec 36 msec 32 msec
 3 10.10.1.2 40 msec 24 msec 32 msec
 4 10.200.0.1 40 msec 48 msec 40 msec
 5 99.99.99.2 88 msec 56 msec 52 msec
 6 100.100.100.2 92 msec 84 msec 72 msec
LNS#
FW#trace 192.168.1.1 source fa0/0

Type escape sequence to abort.
Tracing the route to 192.168.1.1

 1 10.200.0.253 20 msec
 2 10.10.1.1 24 msec
 3 10.1.2.2 [MPLS: Labels 16/20 Exp 0] 60 msec
 4 192.168.1.1 [MPLS: Label 20 Exp 0] 44 msec 20 msec 44 msec
FW#

Looks good. Routing is symmetric and as expected.

There are more ways to solve this problem than I have shown here. Feel free to play around with the lab to see what you can come up with.

Feedback is more than welcome. Let me know if you found this blog useful or interesting. Thanks for reading.