Routing loop shambles

Hey everyone! It’s been a while since I posted anything, but I’ve come across an interesting quirk in my studies which I think will be of interest to anyone studying OSPF, BGP and how they work together. Comments and thoughts are welcome as always.

This blog introduces the concept of OSPF sham-links and how they can be used to influence OSPF routes across an MPLS core. It also explores how, if not used carefully, routing loops could occur with disastrous effects. 

As a reminder, once I’ve set up the scenario, I’ll go through the quirk (explaining the problem), the search (finding a solution) and the work (implementing the solution) as usual.

Scenario

This scenario looks at a standard MPLS customer with two sites. These sites use OSPF as the PE-CE routing protocol and have a backdoor link between them over which OSPF is run – joining both sites into area 0.

The diagram looks like this:

[Diagram: initial scenario]

I’ve labbed this in GNS3 and all routers are IOS-XE devices except for XR1 and XR2 which, as the names suggest, are IOS-XR boxes.

LAN ranges have been simulated using loopbacks. Each PE is doing redistribution from OSPF into MP-BGP (internal, external 1 and external 2) and from MP-BGP into OSPF.

The design goal here is to have both sites connected in OSPF area 0 using the backdoor link as a backup – with traffic normally preferring to go over the MPLS network (or OSPF super backbone). XR1 and R1 should back each other up. Only if both of these are down should traffic traverse the backdoor link.

I’ll first introduce the problems inherent in the default behaviour as shown in the diagram above – focusing on how R4 and R5 would reach LAN1 (192.168.71.0/24) on R7. I’ll then go into how a sham-link can help solve these problems. However, as we will see in the quirk, if sham-links aren’t applied correctly some problems could appear.

OSPF and MPLS

We’ll start by looking at how OSPF and MPLS interact. For now, let’s assume the backdoor link is shut down.

OSPF is being used between the PEs and CEs, so the PEs find themselves redistributing from OSPF into MP-BGP. When this is done, MP-BGP will set the following OSPF-specific extended communities and values in the resulting VPNv4 prefix:

  • The domain ID – this is an extended community taken from the process ID on the router and is considered when redistributing back into OSPF (more on that below).
  • The route-type – an extended community broken up into 3 parts: the area, the LSA type and an additional option.
  • The OSPF router id – another extended community representing the router sourcing this VPNv4 prefix.
  • The OSPF cost is copied to the MED value.

Here we can see the output from R3 as it has redistributed the OSPF route for LAN 1 into BGP:

R3#sh run | sec router ospf
router ospf 1 vrf A
 router-id 3.3.3.3
 redistribute bgp 1 subnets
 network 10.3.7.3 0.0.0.0 area 0
R3#sh bgp vpnv4 unicast vrf A 192.168.71.0
BGP routing table entry for 1:1:192.168.71.0/24, version 77
Paths: (1 available, best #1, table A)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    10.3.7.7 (via vrf A) from 0.0.0.0 (3.3.3.3)
      Origin incomplete, metric 2, localpref 100, weight 32768, 
        valid, sourced, best
      Extended Community: RT:100:100 
              OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:3.3.3.3:0
      mpls labels in/out 24/nolabel
      rx pathid: 0, tx pathid: 0x0
R3#

You can see the Domain ID field is set to 0x0005:0x000000010200. The 00000001 section represents process ID 1. MED is 2 – this represents the OSPF cost of 2 to reach LAN1. The OSPF route-type (shown as OSPF RT, not to be confused with the route target RT:100:100) is 0.0.0.0:2:0 and the router ID is 3.3.3.3:0.
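
To make the decoding concrete, here’s a minimal Python sketch that splits the two strings from the IOS-XE output above. The function names are my own, and the split of the domain ID value into a 4-byte process ID followed by a 2-byte trailer simply follows the reading used in this post rather than any formal definition.

```python
def parse_domain_id(text):
    """Split an IOS-style 'TYPE:VALUE' string such as '0x0005:0x000000010200'."""
    type_hex, value_hex = text.split(":")
    community_type = int(type_hex, 16)
    value = value_hex[2:] if value_hex.lower().startswith("0x") else value_hex
    value = value.zfill(12)  # the value shown is 6 bytes, i.e. 12 hex digits
    # Reading used in this post: the first 4 bytes carry the process ID.
    process_id = int(value[:8], 16)
    trailer = value[8:]
    return community_type, process_id, trailer


def parse_route_type(text):
    """Split 'AREA:LSA_TYPE:OPTION' as printed after 'OSPF RT:' on IOS-XE."""
    area, lsa_type, option = text.split(":")
    return area, int(lsa_type), int(option)


print(parse_domain_id("0x0005:0x000000010200"))  # (5, 1, '0200')
print(parse_route_type("0.0.0.0:2:0"))           # ('0.0.0.0', 2, 0)
```

Note that IOS-XR prints the same communities in a slightly different format, which this sketch doesn’t attempt to handle.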

NB. IOS-XR doesn’t encode the domain ID by default. For this scenario we will assume it has been configured on XR1 using the following commands:

RP/0/RP0/CPU0:XR1(config)#router ospf 1
RP/0/RP0/CPU0:XR1(config-ospf)# vrf A
RP/0/RP0/CPU0:XR1(config-ospf-vrf)# domain-id type 0005 value 000000010200

What’s important to consider here is how the PEs on the other end of the MPLS network redistribute this back into OSPF on the other side.

When the MP-BGP prefix is redistributed back into OSPF by either R1 or XR1, the PE uses the domain ID to determine whether the route should appear as inter-area or external (I’m using colour coding here to help with differentiating between area descriptions… and because trying to read inter and intra when they occur in the same sentence makes my head hurt). If the process ID section of the domain ID in the VPNv4 prefix matches the local OSPF process ID on the PE doing the redistribution, the prefix will be sent into OSPF as an inter-area Type 3 LSA. If it doesn’t, it will be an external Type 5 LSA.
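
As a rough sketch of that decision, the rule can be written as a single comparison. The function name is hypothetical, and it only covers the case described here, where the prefix was an internal OSPF route at the originating site.

```python
def lsa_type_after_redistribution(vpnv4_domain_id, local_domain_id):
    """Return the LSA type a PE would originate for a prefix learned via MP-BGP.

    Only models the rule described in this post: a matching domain ID gives
    an inter-area Type 3 LSA, a mismatch gives an external Type 5 LSA.
    """
    if vpnv4_domain_id == local_domain_id:
        return 3
    return 5


# Domain ID carried in the VPNv4 prefix from R3, compared on the receiving PE.
print(lsa_type_after_redistribution("0x0005:0x000000010200",
                                    "0x0005:0x000000010200"))  # 3
# A hypothetical PE running a different process ID would see a mismatch.
print(lsa_type_after_redistribution("0x0005:0x000000010200",
                                    "0x0005:0x000000020200"))  # 5
```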

In our setup, the Domain ID and Process ID all match – so when R4 and R5 receive the Type 3 LSA they see it as inter-area:

R4#sh ip route 192.168.71.0
Routing entry for 192.168.71.0/24
  Known via "ospf 1", distance 110, metric 3, type inter area
  Last update from 10.4.11.11 on GigabitEthernet1.411, 00:01:13 ago
  Routing Descriptor Blocks:
  * 10.4.11.11, from 11.11.11.11, 00:01:13 ago, via GigabitEthernet1.411
      Route metric is 3, traffic share count is 1
R4#sh ip ospf database summary 192.168.71.0

            OSPF Router with ID (4.4.4.4) (Process ID 1)

                Summary Net Link States (Area 0)

  LS age: 86
  Options: (No TOS-capability, DC, Downward)
  LS Type: Summary Links(Network)
  Link State ID: 192.168.71.0 (summary Network Number)
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000001
  Checksum: 0x36CF
  Length: 28
  Network Mask: /24
        MTID: 0         Metric: 2

  LS age: 86
  Options: (No TOS-capability, DC, Downward)
  LS Type: Summary Links(Network)
  Link State ID: 192.168.71.0 (summary Network Number)
  Advertising Router: 11.11.11.11
  LS Seq Number: 80000001
  Checksum: 0x9D4
  Length: 28
  Network Mask: /24
        MTID: 0         Metric: 2

R4#

This all looks well and good. It’s worth pointing out here that OSPF has a preference for which path to select based on the route type. The order of preference is as follows:

  • Intra-Area (O)
  • Inter-Area (O IA)
  • External Type 1 (E1)
  • NSSA Type 1 (N1)
  • External Type 2 (E2)
  • NSSA Type 2 (N2)
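
As a rough illustration of this ordering, the Python sketch below ranks candidate routes by type first and only then by cost. The ranking table simply copies the list above, and the candidate routes are made-up entries using the inter-area metric of 3 seen on R4 and an intra-area path with a much higher cost.

```python
# Lower rank wins; this mirrors the preference list in this post.
ROUTE_TYPE_RANK = {"O": 0, "O IA": 1, "E1": 2, "N1": 3, "E2": 4, "N2": 5}


def best_route(candidates):
    """Pick the preferred route: route type is compared before cost."""
    return min(candidates, key=lambda r: (ROUTE_TYPE_RANK[r["type"]], r["cost"]))


candidates = [
    {"type": "O IA", "cost": 3, "via": "MPLS core"},
    {"type": "O", "cost": 101, "via": "backdoor link"},
]
print(best_route(candidates)["via"])  # backdoor link
```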

It doesn’t matter what the OSPF cost is. If OSPF has the option of an intra-area route over an inter-area or external route, it will pick the intra-area option every time. Keeping that in mind, let’s bring up the backdoor link and see what happens…

The backdoor link

You might already be able to predict that as soon as we bring up the backdoor link, R4 and R5 will immediately see LAN1 as an intra-area route:

R5#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R5(config)#
%SYS-5-CONFIG_I: Configured from console by console
R5(config)#interface gi1.57
R5(config-subif)#no shut
R5(config-subif)#
%OSPF-5-ADJCHG: Process 1, Nbr 7.7.7.7 on GigabitEthernet1.57 from DOWN to INIT, 
  Received Hello
%OSPF-5-ADJCHG: Process 1, Nbr 7.7.7.7 on GigabitEthernet1.57 from INIT to 2WAY, 
 2-Way Received
%OSPF-5-ADJCHG: Process 1, Nbr 7.7.7.7 on GigabitEthernet1.57 from 2WAY to EXSTART,
 AdjOK?
R5(config-subif)#
%OSPF-5-ADJCHG: Process 1, Nbr 7.7.7.7 on GigabitEthernet1.57 from EXSTART to 
 EXCHANGE, Negotiation Done
%OSPF-5-ADJCHG: Process 1, Nbr 7.7.7.7 on GigabitEthernet1.57 from EXCHANGE to 
 LOADING, Exchange Done
%OSPF-5-ADJCHG: Process 1, Nbr 7.7.7.7 on GigabitEthernet1.57 from LOADING to FULL,
 Loading Done
R5(config-subif)#do sh ip route 192.168.71.0
Routing entry for 192.168.71.0/24
  Known via "ospf 1", distance 110, metric 101, type intra area
  Last update from 10.5.7.7 on GigabitEthernet1.57, 00:00:17 ago
  Routing Descriptor Blocks:
  * 10.5.7.7, from 7.7.7.7, 00:00:17 ago, via GigabitEthernet1.57
      Route metric is 101, traffic share count is 1
R5(config-subif)#
R5(config-subif)#do sh ip ospf database router 7.7.7.7

            OSPF Router with ID (5.5.5.5) (Process ID 1)

                Router Link States (Area 0)

  LS age: 37
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 7.7.7.7
  Advertising Router: 7.7.7.7
  LS Seq Number: 800000C1
  Checksum: 0x840E
  Length: 60
  AS Boundary Router
  Number of Links: 3

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 192.168.71.0
     (Link Data) Network Mask: 255.255.255.0
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

    Link connected to: a Transit Network
     (Link ID) Designated Router address: 10.5.7.7
     (Link Data) Router Interface address: 10.5.7.7
      Number of MTID metrics: 0
       TOS 0 Metrics: 100

    Link connected to: a Transit Network
     (Link ID) Designated Router address: 10.3.7.7
     (Link Data) Router Interface address: 10.3.7.7
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

R5(config-subif)#

You may also have spotted that the previous Type 3 LSA is no longer present. This is because the PE routers that were doing the redistribution from MP-BGP now prefer the local OSPF path. MP-BGP (iBGP from the reflectors in this case) has an administrative distance of 200. OSPF has an administrative distance of 110. OSPF wins and since redistribution takes place from the RIB, there are no MP-BGP routes to redistribute into OSPF:

R4#sh ip ospf database summary 192.168.71.0

            OSPF Router with ID (4.4.4.4) (Process ID 1)
R4#
R1#sh ip route vrf A 192.168.71.0

Routing Table: A
Routing entry for 192.168.71.0/24
  Known via "ospf 1", distance 110, metric 102, type intra area
  Redistributing via bgp 1
  Advertised by bgp 1 match internal external 1 & 2
  Last update from 10.1.5.5 on GigabitEthernet1.15, 00:04:30 ago
  Routing Descriptor Blocks:
  * 10.1.5.5, from 7.7.7.7, 00:04:30 ago, via GigabitEthernet1.15
      Route metric is 102, traffic share count is 1
R1#
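
The administrative distance comparison on the PE can be sketched as follows. This is only a toy model of the behaviour described above, with names of my own choosing: the lowest distance wins the RIB, and MP-BGP to OSPF redistribution only has something to work with when the BGP route is the one installed.

```python
# Administrative distances quoted in this post.
ADMIN_DISTANCE = {"ospf": 110, "ibgp": 200}


def rib_winner(protocols):
    """Return the protocol whose route is installed in the (VRF) RIB."""
    return min(protocols, key=lambda p: ADMIN_DISTANCE[p])


def redistributed_into_ospf(protocols):
    """MP-BGP to OSPF redistribution only sees routes that BGP won in the RIB."""
    return rib_winner(protocols) == "ibgp"


# Backdoor link down: only the iBGP route exists, so a Type 3 LSA is created.
print(redistributed_into_ospf(["ibgp"]))          # True
# Backdoor link up: OSPF wins the RIB, so there is nothing to redistribute.
print(redistributed_into_ospf(["ibgp", "ospf"]))  # False
```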

Now you might be asking why I bothered to outline the difference between the PE redistributing the BGP prefix as inter-area versus external, if R4 and R5 are just going to pick the intra-area route regardless. Well, this becomes relevant when we consider how we are going to make the MPLS core the preferred path to reach LAN1.

As it stands at the moment, no matter how high we set the metric on the link between R5 and R7, traffic from Site 2 to LAN1 will always go over the backdoor link. In short, we need a way to make an intra-area route appear over the MPLS core. Here’s where sham-links come in.

Sham-Links

A sham-link is similar to an OSPF virtual-link, but it can be run in any area and is designed for just these types of scenarios. Essentially, the PEs at either end establish an OSPF neighborship and consider themselves to be directly connected within the same area. This allows Type 1 and Type 2 LSAs to appear over MPLS, simulating a point-to-point connection between PEs. Let’s look at how this is set up…

Each PE creates a new loopback and puts it into vrf A. The sham-link is configured between these loopbacks.

Here’s the diagram and config for the setup:

[Diagram: initial sham-link setup]

R3#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R3(config)#interface Loopback33
R3(config-if)#vrf forwarding A
R3(config-if)#ip address 33.3.3.3 255.255.255.255
R3(config-if)#exit
R3(config)#router ospf 1 vrf A
R3(config-router)#area 0 sham-link 33.3.3.3 111.11.11.11
R3(config-router)#exit
R3(config)#router bgp 1
R3(config-router)#address-family ipv4 vrf A
R3(config-router-af)#network 33.3.3.3 mask 255.255.255.255
RP/0/RP0/CPU0:XR1#conf
RP/0/RP0/CPU0:XR1(config)#interface Loopback111
RP/0/RP0/CPU0:XR1(config-if)#vrf A
RP/0/RP0/CPU0:XR1(config-if)#ipv4 address 111.11.11.11/32
RP/0/RP0/CPU0:XR1(config-if)#root
RP/0/RP0/CPU0:XR1(config)#router ospf 1
RP/0/RP0/CPU0:XR1(config-ospf)#vrf A
RP/0/RP0/CPU0:XR1(config-ospf-vrf)#address-family ipv4 unicast
RP/0/RP0/CPU0:XR1(config-ospf-vrf)#area 0
RP/0/RP0/CPU0:XR1(config-ospf-vrf-ar)#sham-link 111.11.11.11 33.3.3.3
RP/0/RP0/CPU0:XR1(config-ospf-vrf-ar)#root
RP/0/RP0/CPU0:XR1(config)#router bgp 1
RP/0/RP0/CPU0:XR1(config-bgp)#vrf A
RP/0/RP0/CPU0:XR1(config-bgp-vrf)#rd 1:1
RP/0/RP0/CPU0:XR1(config-bgp-vrf)#address-family ipv4 unicast
RP/0/RP0/CPU0:XR1(config-bgp-vrf-af)#network 111.11.11.11/32

Now it’s important to pause there and highlight a key requirement: we need to make sure that each PE has reachability to the other’s sham-link loopback over MPLS but not over OSPF. To that end, we should not enable OSPF on the PEs’ new loopbacks.

But why is this?

To answer this, consider how R3 learns about 111.11.11.11/32. If XR1 were to enable OSPF on this loopback, it would include it as a connected network in its Type 1 LSA. This would then be communicated throughout the OSPF area, across the backdoor link, and arrive at R3. All devices are in the same area so their view of the LSDB would be the same. Assuming loopback111 is also redistributed into BGP, R3 would now have two options to reach it – one via OSPF with an administrative distance of 110 and one via iBGP with an administrative distance of 200.

[Diagram: redistributing the sham-link loopbacks]

OSPF would naturally win and the sham-link would be built over the backdoor link, which defeats the very goal we are trying to achieve! As such, we have to make sure that OSPF is not enabled on loopback 111 or loopback 33.

But, I hear you ask, what if we are still redistributing from MP-BGP into OSPF? Won’t R3 still see the path to loopback 111 via an external Type 5 LSA, which will still have a lower AD than iBGP’s 200?

Well, yes, but OSPF has a loop prevention mechanism built into it to prevent just such a thing…

When an LSA is created by redistributing from MP-BGP into OSPF, an OSPF feature called the down-bit is set in the resulting LSA. The down-bit ensures that any prefixes that are redistributed from MP-BGP into OSPF are not then redistributed back into MP-BGP. So whilst R3 will see the Type 5 LSA in its LSDB, it will not consider it as a valid route, since it is already getting the prefix via MP-BGP and the down-bit indicates that it came from MP-BGP.
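
A minimal sketch of that check, assuming a simplified LSA represented as a Python dict (the field names and the second LSA are hypothetical and only there for contrast):

```python
def pe_accepts_lsa(lsa):
    """A PE skips LSAs whose down-bit is set when calculating routes."""
    return not lsa["down_bit"]


lsdb = [
    # Type 5 LSA for XR1's sham-link loopback, created by R1 from MP-BGP.
    {"prefix": "111.11.11.11/32", "adv_router": "1.1.1.1", "down_bit": True},
    # Hypothetical LSA originated inside the site, so the bit is clear.
    {"prefix": "198.51.100.0/24", "adv_router": "7.7.7.7", "down_bit": False},
]

usable = [lsa["prefix"] for lsa in lsdb if pe_accepts_lsa(lsa)]
print(usable)  # ['198.51.100.0/24']
```

Because the LSA for 111.11.11.11/32 is filtered out, the only remaining candidate on R3 is the MP-BGP route, which is exactly what we want for the sham-link endpoints.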

[Diagram: down-bit loop prevention]

Here is the LSA as seen in the LSDB.

R5#sh ip ospf database external 111.11.11.11

            OSPF Router with ID (5.5.5.5) (Process ID 1)

                Type-5 AS External Link States

  LS age: 881
  Options: (No TOS-capability, DC, Downward)
  LS Type: AS External Link
  Link State ID: 111.11.11.11 (External Network Number )
  Advertising Router: 1.1.1.1
  LS Seq Number: 8000004D
  Checksum: 0x245C
  Length: 36
  Network Mask: /32
        Metric Type: 2 (Larger than any link state path)
        MTID: 0
        Metric: 1
        Forward Address: 0.0.0.0
        External Route Tag: 3489660929

  LS age: 1998
  Options: (No TOS-capability, DC, Downward)
  LS Type: AS External Link
  Link State ID: 111.11.11.11 (External Network Number )
  Advertising Router: 3.3.3.3
  LS Seq Number: 80000055
  Checksum: 0xD798
  Length: 36
  Network Mask: /32
        Metric Type: 2 (Larger than any link state path)
        MTID: 0
        Metric: 1
        Forward Address: 0.0.0.0
        External Route Tag: 3489660929

R5#

And if we check, we find that R3’s best path is via MP-BGP.

R3#sh ip route vrf A 111.11.11.11

Routing Table: A
Routing entry for 111.11.11.11/32
  Known via "bgp 1", distance 200, metric 0, type internal
  Redistributing via ospf 1
  Advertised by ospf 1 subnets
  Last update from 11.11.11.11 19:34:53 ago
  Routing Descriptor Blocks:
  * 11.11.11.11 (default), from 2.2.2.2, 19:34:53 ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: 24018
      MPLS Flags: MPLS Required
R3#

This loop prevention mechanism isn’t crucial to understanding the operation of the sham-link but it will come into play later on when we look at a potential routing loop.

Getting back to the sham-link, once we configure everything as outlined above the link comes up:

RP/0/RP0/CPU0:XR1#sh ospf vrf A sham-links

Sham Links for OSPF 1, VRF A

Sham Link OSPF_SL0 to address 33.3.3.3 is up
Area 0, source address 111.11.11.11
IfIndex = 1
  Run as demand circuit
  DoNotAge LSA allowed., Cost of using 1
  Transmit Delay is 1 sec, State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
    Hello due in 00:00:06:794
    Adjacency State FULL (Hello suppressed)
    Number of DBD retrans during last exchange 0
    Index 2/2, retransmission queue length 0, number of retransmission 0
    First 0(0)/0(0) Next 0(0)/0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec
RP/0/RP0/CPU0:XR1#sh ospf vrf A neighbor

* Indicates MADJ interface
# Indicates Neighbor awaiting BFD session up

Neighbors for OSPF 1, VRF A

Neighbor ID     Pri   State           Dead Time   Address         Interface
3.3.3.3         1     FULL/  -           -        33.3.3.3        OSPF_SL0
    Neighbor is up for 00:01:20
4.4.4.4         1     FULL/BDR        00:00:39    10.4.11.4       Gi0/0/0/0.411
    Neighbor is up for 19:32:22

Total neighbor count: 2
RP/0/RP0/CPU0:XR1#
R3#sh ip ospf sham-links
Sham Link OSPF_SL8 to address 111.11.11.11 is up
Area 0 source address 33.3.3.3
  Run as demand circuit
  DoNotAge LSA allowed. Cost of using 1 State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40,
    Hello due in 00:00:07
    Adjacency State FULL (Hello suppressed)
    Index 1/2/2, retransmission queue length 0, number of retransmission 0
    First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec
R3#

Both routers establish an OSPF adjacency and see each other as connected over a point-to-point link:

RP/0/RP0/CPU0:XR1#sh ospf vrf A database router 11.11.11.11
Thu Oct  3 12:31:10.478 UTC

            OSPF Router with ID (11.11.11.11) (Process ID 1, VRF A)

                Router Link States (Area 0)

  LS age: 151
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 11.11.11.11
  Advertising Router: 11.11.11.11
  LS Seq Number: 800000ef
  Checksum: 0xc78
  Length: 48
  Area Border Router
  AS Boundary Router
   Number of Links: 2

    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 3.3.3.3
     (Link Data) Router Interface address: 0.0.0.1
      Number of TOS metrics: 0
       TOS 0 Metrics: 1

    Link connected to: a Transit Network
     (Link ID) Designated Router address: 10.4.11.11
     (Link Data) Router Interface address: 10.4.11.11
      Number of TOS metrics: 0
       TOS 0 Metrics: 1

RP/0/RP0/CPU0:XR1#

What’s interesting here is how XR1 sees the path to LAN1 over the sham-link:

RP/0/RP0/CPU0:XR1#sh route vrf A ipv4 192.168.71.0/24
Thu Oct  3 12:31:43.212 UTC

Routing entry for 192.168.71.0/24
  Known via "bgp 1", distance 200, metric 2, type internal
  Installed Oct  3 12:28:40.433 for 00:03:04
  Routing Descriptor Blocks
    3.3.3.3, from 2.2.2.2
     Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id:0xe0000000
     Route metric is 2
  No advertising protos.
RP/0/RP0/CPU0:XR1#

It sees it as a BGP route and not an OSPF route! If we look at its BGP entry we see this:

RP/0/RP0/CPU0:XR1#sh bgp vpnv4 unicast vrf A 192.168.71.0
Thu Oct  3 12:32:15.246 UTC
BGP routing table entry for 192.168.71.0/24,Route Distinguisher: 1:1
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                462         462
Last Modified: Oct  3 12:28:40.387 for 00:03:37
Paths: (2 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    3.3.3.3 (metric 20) from 2.2.2.2 (3.3.3.3)
      Received Label 24
      Origin incomplete, metric 2, localpref 100, valid, internal, best, 
          group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 462
      Extended community: OSPF domain-id:0x5:0x000000010200 
         OSPF route-type:0:2:0x0 OSPF router-id:3.3.3.3 RT:100:100
      Originator: 3.3.3.3, Cluster list: 2.2.2.2
      Source AFI: VPNv4 Unicast, Source VRF: A, Source Route Distinguisher: 1:1
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local
    3.3.3.3 (metric 20) from 12.12.12.12 (3.3.3.3)
      Received Label 24
      Origin incomplete, metric 2, localpref 100, valid, internal, 
        import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: OSPF domain-id:0x5:0x000000010200 
         OSPF route-type:0:2:0x0 OSPF router-id:3.3.3.3 RT:100:100
      Originator: 3.3.3.3, Cluster list: 12.12.12.12
      Source AFI: VPNv4 Unicast, Source VRF: A, Source Route Distinguisher: 1:1
RP/0/RP0/CPU0:XR1#

It is clearly an OSPF based route. The OSPF attributes are all present. But how can an OSPF path over the sham-link appear as a BGP route?

Remember that in order to send traffic across the MPLS core two labels will be needed. The top label represents the next-hop PE. This will typically be repeatedly swapped as the packet crosses the core (unless we’re using segment routing but that’s a whole other story). The second and bottom label is the VPN label used to represent this customer’s prefix or VRF. This label is needed since the core P routers won’t know anything of the customer subnets. This label is communicated in the VPNv4 update from R3 as it redistributes LAN1 into MP-BGP.

Here is the logical process that XR1 follows:

  • XR1 runs the Dijkstra algorithm to find LAN1, taking the sham-link into account as a point-to-point link.
  • If the sham-link wins, XR1 will then use a VPNv4 route for LAN1, which in this case is being redistributed by R3. The best VPNv4 route will be used and placed in the BGP RIB instead of an OSPF route.
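
The two steps above can be sketched roughly as follows. This is only a model of the behaviour described in this post, not of the actual IOS-XR implementation; the function and field names are hypothetical, and the sketch returns None when no VPNv4 route exists because that case isn’t covered here.

```python
def install_route(prefix, spf_winner, vpnv4_best):
    """Decide what lands in the VRF RIB for a prefix after SPF has run.

    spf_winner: "sham-link" if the SPF best path crosses the sham-link,
                otherwise any other string (for example "local").
    vpnv4_best: dict of prefix -> best VPNv4 path (next hop and VPN label).
    """
    if spf_winner != "sham-link":
        return {"protocol": "ospf"}
    best = vpnv4_best.get(prefix)
    if best is None:
        return None  # not covered in this post
    # The BGP route is installed in place of the OSPF one.
    return {"protocol": "bgp",
            "next_hop": best["next_hop"],
            "vpn_label": best["label"]}


vpnv4_best = {"192.168.71.0/24": {"next_hop": "3.3.3.3", "label": 24}}
print(install_route("192.168.71.0/24", "sham-link", vpnv4_best))
# {'protocol': 'bgp', 'next_hop': '3.3.3.3', 'vpn_label': 24}
```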

This logic is due to the recursion that is taking place over the sham-link:

RP/0/RP0/CPU0:XR1#show cef vrf A 192.168.71.0
Thu Oct  3 12:41:27.680 UTC
192.168.71.0/24, version 679, internal 0x5000001 0x0 (ptr 0xdf126ec) [1], 0x0 
  (0xe0d88e8), 0xa08 (0xe4dc4e8)
 Updated Oct  3 12:28:40.444
 Prefix Len 24, traffic index 0, precedence n/a, priority 3
   via 3.3.3.3/32, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xd67f4f0 0x0]
    recursion-via-/32
    next hop VRF - 'default', table - 0xe0000000
    next hop 3.3.3.3/32 via 24001/0/21
     next hop 10.2.11.2/32 Gi0/0/0/0.211 labels imposed {16 24}
     next hop 10.11.12.12/32 Gi0/0/0/0.1112 labels imposed {24000 24}
RP/0/RP0/CPU0:XR1#

So R3’s redistribution of LAN1 is needed so that XR1 has a VPN label to send traffic across the MPLS core. Here label 24 is the VPN label assigned by R3 and 16 and 24000 are the transport labels for the next hop of R3 via ECMP through Gi0/0/0/0.211 and Gi0/0/0/0.1112 respectively.
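
To illustrate how the two labels combine, here’s a small sketch that builds the imposed label stack for each ECMP path, using the values from the CEF output above. The function name is my own.

```python
def label_stacks(vpn_label, transport_labels):
    """Return the label stack per outgoing interface: transport on top, VPN at the bottom."""
    return {interface: [transport, vpn_label]
            for interface, transport in transport_labels.items()}


stacks = label_stacks(24, {"Gi0/0/0/0.211": 16, "Gi0/0/0/0.1112": 24000})
print(stacks["Gi0/0/0/0.211"])   # [16, 24]
print(stacks["Gi0/0/0/0.1112"])  # [24000, 24]
```

The VPN label is the same on both paths because it comes from R3’s VPNv4 update, while the transport label differs per next hop.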

If we verify the source of the VPN label we can see that R3 is indeed assigning label 24:

R3#sh mpls forwarding-table vrf A
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
24         No Label   192.168.71.0/24[V]   \
                                       0             Gi1.37     10.3.7.7
31         Pop Label  33.3.3.3/32[V]   0             aggregate/A
34         No Label   10.3.7.0/24[V]   0             aggregate/A
41         No Label   7.7.7.7/32[V]    0             Gi1.37     10.3.7.7
48         No Label   10.5.7.0/24[V]   0             Gi1.37     10.3.7.7
R3#

As a side note, remember that the MP-BGP prefix that XR1 recursively uses is still in competition with any other VPNv4 route to the same destination (this becomes important later).

As a result of all of this, XR1 will not redistribute any OSPF routes into MP-BGP that it prefers over the sham-link. Redistribution takes place from the global RIB (or vrf RIB in this case) and there is no OSPF prefix in the RIB for LAN1 due to this recursive process.

Looking back at our communication between sites, we can now see that if the OSPF cost is lower across this sham-link when R4 and R5 run their Dijkstra algorithms, they will prefer this path as an intra-area link.
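
Here’s a minimal Dijkstra sketch of that comparison from R5’s point of view. The per-link costs of 1 are my assumption, inferred from the metrics of 5 and 101 shown in the outputs, and the backdoor link uses the cost of 100. Real OSPF costs are per interface and directional, so treat this as a simplified, undirected model.

```python
import heapq


def shortest_cost(links, source, target):
    """Plain Dijkstra over undirected (node_a, node_b, cost) links."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, link_cost in graph.get(node, []):
            new_cost = cost + link_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return None


base_links = [
    ("R5", "R4", 1), ("R4", "XR1", 1), ("R3", "R7", 1),
    ("R7", "LAN1", 1),   # stub network on R7
    ("R5", "R7", 100),   # backdoor link
]
sham_link = ("XR1", "R3", 1)

print(shortest_cost(base_links, "R5", "LAN1"))                # 101
print(shortest_cost(base_links + [sham_link], "R5", "LAN1"))  # 5
```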

The below output shows that after increasing the metric on the backdoor link, a trace from the loopback of R5 to LAN1 goes via R4 to XR1 and over the MPLS core:

R5#conf t 
Enter configuration commands, one per line. End with CNTL/Z. 
%SYS-5-CONFIG_I: Configured from console by console 
R5(config)#interface gi1.57 
R5(config-subif)#ip ospf cost 100
R5(config-subif)#^Z
R5#sh ip route 192.168.71.0
Routing entry for 192.168.71.0/24
  Known via "ospf 1", distance 110, metric 5, type intra area
  Last update from 10.4.5.4 on GigabitEthernet1.45, 00:16:45 ago
  Routing Descriptor Blocks:
  * 10.4.5.4, from 7.7.7.7, 00:16:45 ago, via GigabitEthernet1.45
      Route metric is 5, traffic share count is 1
R5#trace 192.168.71.1 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.71.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.5.4 10 msec 5 msec 6 msec
  2 10.4.11.11 39 msec 56 msec 51 msec
  3 10.11.12.12 [MPLS: Labels 24000/24 Exp 0] 85 msec 51 msec 49 msec
  4 10.3.7.3 [MPLS: Label 24 Exp 0] 38 msec 12 msec 34 msec
  5 10.3.7.7 18 msec *  23 msec
R5#

Success! You can even see the correct label stack in the trace. Traffic will now traverse the MPLS core as its primary path. Now let’s take a look at how, if you’re not careful how you add new subnets into OSPF, connectivity problems can pop up…

The quirk

Let’s pretend an engineer is tasked with configuring a new interface on R7 to be in LAN2 with a subnet of 192.168.72.0/24. Now let’s suppose that instead of enabling OSPF on the interface, the engineer uses the redistribute connected subnets command under the OSPF process:

[Diagram: adding a second LAN]

R7#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R7(config)#interface loopback 72
R7(config-if)#ip address 192.168.72.1 255.255.255.0
R7(config-if)#ip ospf network point-to-point
R7(config-if)#router ospf 1
R7(config-router)#redistribute connected subnets

Site 2 immediately reports issues reaching this new subnet and if we repeat a traceroute from R5 we can confirm it:

R5#trace 192.168.72.0 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.72.0
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.5.4 7 msec 7 msec 2 msec
  2 10.4.11.11 48 msec 24 msec 51 msec
  3 10.1.5.1 [MPLS: Label 48 Exp 0] 9 msec 22 msec 7 msec
  4 10.1.5.5 19 msec 7 msec 17 msec
  5 10.4.5.4 21 msec 15 msec 12 msec
  6 10.4.11.11 26 msec 25 msec 28 msec
  7 10.1.5.1 [MPLS: Label 48 Exp 0] 22 msec 13 msec 12 msec
  8 10.1.5.5 25 msec 21 msec 16 msec
  9 10.4.5.4 23 msec 23 msec 9 msec
 10 10.4.11.11 21 msec 30 msec 24 msec
 11 10.1.5.1 [MPLS: Label 48 Exp 0] 19 msec 28 msec 33 msec
 12 10.1.5.5 29 msec 34 msec 21 msec
 13 10.4.5.4 19 msec 15 msec 19 msec
 14 10.4.11.11 26 msec 43 msec 32 msec
 15 10.1.5.1 [MPLS: Label 48 Exp 0] 14 msec 20 msec 23 msec
 16 10.1.5.5 31 msec 21 msec 21 msec
 17 10.4.5.4 30 msec 31 msec 23 msec
 18 10.4.11.11 43 msec 59 msec 54 msec
 19 10.1.5.1 [MPLS: Label 48 Exp 0] 44 msec 41 msec 35 msec
 20 10.1.5.5 24 msec 46 msec 28 msec
 21 10.4.5.4 84 msec 44 msec 67 msec
 22 10.4.11.11 78 msec 60 msec 35 msec
 23 10.1.5.1 [MPLS: Label 48 Exp 0] 43 msec 37 msec 33 msec
 24 10.1.5.5 58 msec 43 msec 28 msec
 25 10.4.5.4 43 msec 74 msec 35 msec
 26 10.4.11.11 37 msec 44 msec 38 msec
 27 10.1.5.1 [MPLS: Label 48 Exp 0] 44 msec 42 msec 56 msec
 28 10.1.5.5 60 msec 50 msec 40 msec
 29 10.4.5.4 35 msec 51 msec 55 msec
 30 10.4.11.11 50 msec 87 msec 86 msec
R5#
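
A quick way to spot the repeating pattern in a trace like this is to look for the first hop address that shows up twice. The sketch below is a hypothetical helper that takes the hop addresses in order and returns the cycle.

```python
def find_loop(hops):
    """Return the repeating sequence of hops, or an empty list if none repeats."""
    seen = {}
    for index, hop in enumerate(hops):
        if hop in seen:
            return hops[seen[hop]:index]
        seen[hop] = index
    return []


# First eight hops from the trace above.
hops = ["10.4.5.4", "10.4.11.11", "10.1.5.1", "10.1.5.5",
        "10.4.5.4", "10.4.11.11", "10.1.5.1", "10.1.5.5"]
print(find_loop(hops))
# ['10.4.5.4', '10.4.11.11', '10.1.5.1', '10.1.5.5']
```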

Visually it looks like this:

[Diagram: looping trace]

It looks to be headed in the right direction to begin with, but XR1 is sending it over to R1 for some reason.  LAN1 still seems to work though:

R5#trace 192.168.71.1 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.71.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.5.4 17 msec 5 msec 11 msec
  2 10.4.11.11 27 msec 14 msec 15 msec
  3 10.2.11.2 [MPLS: Labels 16/24 Exp 0] 18 msec
    10.11.12.12 [MPLS: Labels 24000/24 Exp 0] 12 msec
    10.2.11.2 [MPLS: Labels 16/24 Exp 0] 18 msec
  4 10.3.7.3 [MPLS: Label 24 Exp 0] 17 msec 26 msec 21 msec
  5 10.3.7.7 30 msec *  33 msec
R5#

Let’s start by looking at how R5 sees the path to LAN2 compared to LAN1:

R5#sh ip route 192.168.72.0 255.255.255.0
Routing entry for 192.168.72.0/24
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 4
  Last update from 10.4.5.4 on GigabitEthernet1.45, 00:22:59 ago
  Routing Descriptor Blocks:
  * 10.4.5.4, from 7.7.7.7, 00:22:59 ago, via GigabitEthernet1.45
      Route metric is 20, traffic share count is 1
R5#sh ip route 192.168.71.0 255.255.255.0
Routing entry for 192.168.71.0/24
  Known via "ospf 1", distance 110, metric 5, type intra area
  Last update from 10.4.5.4 on GigabitEthernet1.45, 00:23:02 ago
  Routing Descriptor Blocks:
  * 10.4.5.4, from 7.7.7.7, 00:23:02 ago, via GigabitEthernet1.45
      Route metric is 5, traffic share count is 1
R5#

The main difference here is that R5 sees this as an external E2 route. There is an external Type 5 LSA referencing LAN2 due to it being redistributed rather than having OSPF enabled on it:

R5#sh ip ospf database external 192.168.72.0

            OSPF Router with ID (5.5.5.5) (Process ID 1)

                Type-5 AS External Link States

  LS age: 1090
  Options: (No TOS-capability, DC, Upward)
  LS Type: AS External Link
  Link State ID: 192.168.72.0 (External Network Number )
  Advertising Router: 7.7.7.7
  LS Seq Number: 800000CE
  Checksum: 0xAC58
  Length: 36
  Network Mask: /24
        Metric Type: 2 (Larger than any link state path)
        MTID: 0
        Metric: 20
        Forward Address: 0.0.0.0
        External Route Tag: 0

R5#

The metric is 20 and the type is E2. This is the default for OSPF when redistributing connected routes. When an E2 route is used, the intra-area cost to the ASBR that originated the LSA (which in this case is R7) is not taken into consideration (outside of a tie-breaker scenario between two E2 routes). So, the metric is 20 and will stay 20. Also, note the down-bit is not set…
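
The E2 comparison can be sketched as a two-part sort key: the advertised metric first, with the forward metric (the cost to reach the ASBR) only acting as a tie-breaker. The candidate values other than R5’s 20 and 4 are made up for illustration.

```python
def best_e2(candidates):
    """Pick between E2 routes: advertised metric first, forward metric as tie-breaker."""
    return min(candidates, key=lambda r: (r["metric"], r["forward_metric"]))


# Same advertised metric, so the lower forward metric wins.
tie = [
    {"name": "via R4", "metric": 20, "forward_metric": 4},
    {"name": "hypothetical longer path", "metric": 20, "forward_metric": 104},
]
print(best_e2(tie)["name"])  # via R4

# A lower advertised metric wins even with a much worse forward metric.
no_tie = [
    {"name": "near ASBR", "metric": 20, "forward_metric": 1},
    {"name": "far ASBR", "metric": 10, "forward_metric": 200},
]
print(best_e2(no_tie)["name"])  # far ASBR
```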

Looking at the next hop, R4, we see it has the same preference for an E2 route and it is still sending traffic in the right direction:

R4#sh ip route 192.168.72.0
Routing entry for 192.168.72.0/24
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 3
  Last update from 10.4.11.11 on GigabitEthernet1.411, 00:25:02 ago
  Routing Descriptor Blocks:
  * 10.4.11.11, from 7.7.7.7, 00:25:02 ago, via GigabitEthernet1.411
      Route metric is 20, traffic share count is 1
R4#

The point where the loop seems to start is XR1. Again, let’s compare how it reaches LAN2 compared to LAN1:

RP/0/RP0/CPU0:XR1#sh route vrf A ipv4 192.168.72.0/24

Routing entry for 192.168.72.0/24
  Known via "bgp 1", distance 200, metric 20, type internal
  Installed Oct  3 12:28:40.429 for 00:26:01
  Routing Descriptor Blocks
    1.1.1.1, from 2.2.2.2
     Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id:0xe0000000
     Route metric is 20
  No advertising protos.
RP/0/RP0/CPU0:XR1#sh route vrf A ipv4 192.168.71.0/24

Routing entry for 192.168.71.0/24
  Known via "bgp 1", distance 200, metric 2, type internal
  Installed Oct  3 12:28:40.430 for 00:26:07
  Routing Descriptor Blocks
    3.3.3.3, from 2.2.2.2
     Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id:0xe0000000
     Route metric is 2
  No advertising protos.
RP/0/RP0/CPU0:XR1#

Both prefer MP-BGP, but LAN2 is unexpectedly advertised by, and preferred via, R1…

RP/0/RP0/CPU0:XR1#sh bgp vpnv4 unicast vrf A 192.168.72.0/24
Thu Oct  3 16:58:02.777 UTC
BGP routing table entry for 192.168.72.0/24, Route Distinguisher: 1:1
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                463         463
Last Modified: Oct  3 12:28:40.387 for 04:29:24
Paths: (2 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.1 (metric 10) from 2.2.2.2 (1.1.1.1)
      Received Label 48
      Origin incomplete, metric 20, localpref 100, valid, internal, best, 
         group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 463
      Extended community: OSPF domain-id:0x5:0x000000010200 
         OSPF route-type:0:5:0x1 OSPF router-id:1.1.1.1 RT:100:100
      Originator: 1.1.1.1, Cluster list: 2.2.2.2
      Source AFI: VPNv4 Unicast, Source VRF: A, Source Route Distinguisher: 1:1
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.1 (metric 10) from 12.12.12.12 (1.1.1.1)
      Received Label 48
      Origin incomplete, metric 20, localpref 100, valid, internal, 
         import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: OSPF domain-id:0x5:0x000000010200 
          OSPF route-type:0:5:0x1 OSPF router-id:1.1.1.1 RT:100:100
      Originator: 1.1.1.1, Cluster list: 12.12.12.12
      Source AFI: VPNv4 Unicast, Source VRF: A, Source Route Distinguisher: 1:1
RP/0/RP0/CPU0:XR1#

Both paths from the reflectors are pointing to R1. Let’s take a look at R1 and see what’s going on.

R1#sh ip route vrf A 192.168.72.0 255.255.255.0

Routing Table: A
Routing entry for 192.168.72.0/24
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 5
  Redistributing via bgp 1
  Advertised by bgp 1 match internal external 1 & 2
  Last update from 10.1.5.5 on GigabitEthernet1.15, 21:26:40 ago
  Routing Descriptor Blocks:
  * 10.1.5.5, from 7.7.7.7, 21:26:40 ago, via GigabitEthernet1.15
      Route metric is 20, traffic share count is 1
R1#
R1#sh bgp vpnv4 unicast vrf A 192.168.72.0 255.255.255.0
BGP routing table entry for 1:1:192.168.72.0/24, version 146
Paths: (1 available, best #1, table A)
 Advertised to update-groups:
    7
 Refresh Epoch 1
 Local
   10.1.5.5 (via vrf A) from 0.0.0.0 (1.1.1.1)
    Origin incomplete, metric 20, localpref 100, weight 32768, valid, sourced, best
    Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
      OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:1.1.1.1:0
    mpls labels in/out 48/nolabel
    rx pathid: 0, tx pathid: 0x0
R1#

Looks like R1 is using OSPF to reach LAN2.

This is simply an administrative distance decision from R1’s point of view. One path from iBGP, one from OSPF. OSPF wins. The Type 5 LSA is being seen over the backdoor link or over the sham-link. It hasn’t been through any redistribution. As such, no down-bit is being set and R1 has no reason not to redistribute it into MP-BGP as normal.
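As a rough illustration of that decision, here is a minimal Python sketch of a router choosing between candidate routes purely on administrative distance. The distances (110 for OSPF, 200 for iBGP) match the outputs above. The class and function names are made up for illustration, the iBGP candidate via 3.3.3.3 is an assumption about what R1 would otherwise hear, and a real RIB does a lot more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    protocol: str   # e.g. "ospf" or "ibgp"
    distance: int   # administrative distance; lower wins
    next_hop: str


def install_best(candidates):
    """Return the candidate with the lowest administrative distance."""
    return min(candidates, key=lambda c: c.distance)


# R1's candidates for 192.168.72.0/24 in VRF A (iBGP entry is assumed)
r1_candidates = [
    Candidate("ibgp", 200, "3.3.3.3"),
    Candidate("ospf", 110, "10.1.5.5"),
]

best = install_best(r1_candidates)
print(best.protocol, best.next_hop)
```

Because the OSPF route wins and carries no down-bit, nothing stops R1 from redistributing it back into MP-BGP.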

Now we are in a position to look at why XR1 sends the traffic to R1. Remember when the sham-link is the best OSPF path, the resulting route is a VPNv4 MP-BGP route to that destination, with the sham-link destination as the next-hop. This MP-BGP route must compete with all other MP-BGP routes using the best path selection algorithm.

To look at this process we can turn to one of the reflectors:

R2#sh bgp vpnv4 unicast rd 1:1 192.168.72.0
BGP routing table entry for 1:1:192.168.72.0/24, version 369
Paths: (3 available, best #1, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.1.1.1 (metric 10) (via default) from 1.1.1.1 (1.1.1.1)
      Origin incomplete, metric 20, localpref 100, valid, internal, best
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:1.1.1.1:0
      mpls labels in/out nolabel/48
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.1.1.1 (metric 10) (via default) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 20, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:1.1.1.1:0
      Originator: 1.1.1.1, Cluster list: 12.12.12.12
      mpls labels in/out nolabel/48
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 3.3.3.3 (3.3.3.3)
      Origin incomplete, metric 20, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:3.3.3.3:0
      mpls labels in/out nolabel/40
      rx pathid: 0, tx pathid: 0
R2#

R2 is choosing the prefix advertised by R1 as the best path. It will then reflect this on and at the same time withdraw any previous best paths – this includes the path via 3.3.3.3 which XR1 should be using to reach the other end of the sham-link. XR1, still needing to use a VPNv4 prefix, falls back to its only available option, namely the VPNv4 prefix via R1.

You might think that it would fall back to another OSPF prefix, but remember, OSPF will simply run Dijkstra’s algorithm again and see the sham-link as the best path. The sham-link would still recurse to an MP-BGP VPNv4 prefix – and the R3-originated one has lost out to the R1-originated one. The sham-link can’t check whether the VPNv4 prefix behind an OSPF path over the sham-link avoids looping back into the same site. It just tells OSPF to use a VPNv4 prefix. It’s simulating running OSPF over the MPLS core – hence the term sham.

So now we know why XR1 is looping the traffic… but why are the reflectors preferring the path that R1 advertises? For that, we can run through the BGP best path selection algorithm:

blog11_image7_BGP_analysis1

The BGP Router ID is determining the best path! This is far from ideal. We can test this by actually changing R1’s Router ID and clearing BGP (obviously never do this in a live environment):

R1#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#router bgp 1
R1(config-router)#bgp router-id 100.100.100.100
R1(config-router)#
*Oct  3 17:16:18.280: %BGP-5-ADJCHANGE: neighbor 2.2.2.2 Down Router ID changed
*Oct  3 17:16:18.280: %BGP_SESSION-5-ADJCHANGE: neighbor 2.2.2.2 VPNv4 Unicast 
  topology base removed from session  Router ID changed
*Oct  3 17:16:18.296: %BGP-5-ADJCHANGE: neighbor 12.12.12.12 Down Router ID changed
*Oct  3 17:16:18.296: %BGP_SESSION-5-ADJCHANGE: neighbor 12.12.12.12 VPNv4 
  Unicast topology base removed from session  Router ID changed
*Oct  3 17:16:19.035: %BGP-5-ADJCHANGE: neighbor 2.2.2.2 Up
*Oct  3 17:16:19.046: %BGP-5-NBR_RESET: Neighbor 12.12.12.12 active reset (Peer 
  closed the session)
R1(config-router)#
*Oct  3 17:16:19.046: %BGP_SESSION-5-ADJCHANGE: neighbor 12.12.12.12 VPNv4 Unicast 
  topology base removed from session  Peer closed the session
R1(config-router)#
*Oct  3 17:16:28.869: %BGP-5-ADJCHANGE: neighbor 12.12.12.12 Up
R1(config-router)#
R2#sh bgp vpnv4 unicast rd 1:1 192.168.72.0
BGP routing table entry for 1:1:192.168.72.0/24, version 380
Paths: (3 available, best #3, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.1.1.1 (metric 10) (via default) from 1.1.1.1 (100.100.100.100)
      Origin incomplete, metric 20, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:1.1.1.1:0
      mpls labels in/out nolabel/54
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 20, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:3.3.3.3:0
      Originator: 3.3.3.3, Cluster list: 12.12.12.12
      mpls labels in/out nolabel/40
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 3.3.3.3 (3.3.3.3)
      Origin incomplete, metric 20, localpref 100, valid, internal, best
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:3.3.3.3:0
      mpls labels in/out nolabel/40
      rx pathid: 0, tx pathid: 0x0
R2#

It’s not a good thing if the communication between sites depends on the luck of the draw on how Router IDs are assigned. For consistency I’ll move the Router ID back to its default (in this case it will just use the highest numbered loopback).

R1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)#router bgp 1
R1(config-router)#no bgp router-id 100.100.100.100
R1(config-router)#
*Oct  3 17:20:55.448: %BGP-5-ADJCHANGE: neighbor 2.2.2.2 Down Router ID changed
*Oct  3 17:20:55.452: %BGP_SESSION-5-ADJCHANGE: neighbor 2.2.2.2 VPNv4 Unicast 
  topology base removed from session  Router ID changed
*Oct  3 17:20:55.456: %BGP-5-ADJCHANGE: neighbor 12.12.12.12 Down Router ID changed
*Oct  3 17:20:55.456: %BGP_SESSION-5-ADJCHANGE: neighbor 12.12.12.12 VPNv4 Unicast 
  topology base removed from session  Router ID changed
*Oct  3 17:20:55.873: %BGP-5-ADJCHANGE: neighbor 2.2.2.2 Up
*Oct  3 17:20:55.908: %BGP-5-NBR_RESET: Neighbor 12.12.12.12 active reset (Peer 
  closed the session)
R1(config-router)#
*Oct  3 17:20:55.909: %BGP_SESSION-5-ADJCHANGE: neighbor 12.12.12.12 VPNv4 Unicast 
  topology base removed from session  Peer closed the session
R1(config-router)#
*Oct  3 17:21:01.082: %BGP-5-ADJCHANGE: neighbor 12.12.12.12 Up
R1(config-router)#do sh bgp vpnv4 unicast all summary | inc identifier
BGP router identifier 1.1.1.1, local AS number 1
R1(config-router)#

You might also ask at this stage why LAN1 doesn’t suffer from this same problem. If we take a quick look at the reflectors, we can see that R1 is redistributing LAN1 just like LAN2 but the VPNv4 route from R3 is being preferred:

R2#sh bgp vpnv4 unicast rd 1:1 192.168.71.0
BGP routing table entry for 1:1:192.168.71.0/24, version 341
Paths: (3 available, best #2, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.1.1.1 (metric 10) (via default) from 1.1.1.1 (1.1.1.1)
      Origin incomplete, metric 6, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:1.1.1.1:0
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 3.3.3.3 (3.3.3.3)
      Origin incomplete, metric 2, localpref 100, valid, internal, best
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:3.3.3.3:0
      mpls labels in/out nolabel/24
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 2, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:3.3.3.3:0
      Originator: 3.3.3.3, Cluster list: 12.12.12.12
      mpls labels in/out nolabel/24
      rx pathid: 0, tx pathid: 0
R2#

If we do the BGP best path calculation again we can see why:

blog11_image8_BGP_analysis1 2

LAN1 doesn’t loop because of the MED. The cluster list may be what finally separates the two copies of R3’s path, but the prefix from R1 has already been eliminated by its higher MED.
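To make the two outcomes easier to compare, here is a simplified Python sketch of the comparison R2 is making. It only models the steps that matter here (local preference, MED, IGP metric to the next hop, Router ID, cluster list length) and leaves out weight, AS path, origin and eBGP-vs-iBGP since they are identical on every path in the outputs. The class and function names are my own for illustration and this is not how any router actually implements it.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int
    med: int
    igp_metric: int
    router_id: str         # originator ID if reflected, else the peer's Router ID
    cluster_list_len: int


def preference_key(p):
    # Lower tuple sorts first: highest local-pref, lowest MED, lowest IGP
    # metric, lowest Router ID, shortest cluster list.
    return (-p.local_pref, p.med, p.igp_metric,
            int(IPv4Address(p.router_id)), p.cluster_list_len)


def best_path(paths):
    return min(paths, key=preference_key)


# Values copied from R2's view of LAN2 and LAN1
lan2 = [
    Path("1.1.1.1", 100, 20, 10, "1.1.1.1", 0),
    Path("1.1.1.1", 100, 20, 10, "1.1.1.1", 1),
    Path("3.3.3.3", 100, 20, 10, "3.3.3.3", 0),
]
lan1 = [
    Path("1.1.1.1", 100, 6, 10, "1.1.1.1", 0),
    Path("3.3.3.3", 100, 2, 10, "3.3.3.3", 0),
    Path("3.3.3.3", 100, 2, 10, "3.3.3.3", 1),
]

print(best_path(lan2).next_hop)  # decided by Router ID
print(best_path(lan1).next_hop)  # decided by MED
```

Swapping R1's Router ID for 100.100.100.100 in the LAN2 list flips the result to 3.3.3.3, which is the behaviour seen in the Router ID test earlier.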

Remember that when OSPF is redistributed into MP-BGP, the MED is set to the OSPF cost. When LAN2 was redistributed into MP-BGP by R1, it was an E2 route, meaning the intra-area cost to the ASBR was not taken into consideration. It stayed at 20 and thus MED was not a tie breaker.

LAN1, however, is learned via R7’s intra-area Type 1 LSA. When R1 redistributes this into MP-BGP, the MED reflects the full OSPF path cost to the prefix. In this case it is 6 (assuming each OSPF link is cost 1 since the reference-bandwidth hasn’t been changed):

  1. Link to R5
  2. Link to R4
  3. Link to XR1
  4. Cost of the sham-link
  5. Link to R7
  6. Link to the loopback

R3 will redistribute it into MP-BGP after only two of those hops, hence the lower MED.
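The arithmetic is trivial, but for completeness here it is as a few lines of Python. The cost of 1 per hop is the same assumption made above, and the hop labels are just descriptive strings.

```python
# Hops R1 crosses to reach LAN1, each with an assumed OSPF cost of 1
hops_from_r1 = [
    ("link to R5", 1),
    ("link to R4", 1),
    ("link to XR1", 1),
    ("sham-link", 1),
    ("link to R7", 1),
    ("link to the loopback", 1),
]

# The OSPF cost is copied into the MED at redistribution
med_from_r1 = sum(cost for _, cost in hops_from_r1)

# R3 only has the last two hops in front of it
med_from_r3 = sum(cost for _, cost in hops_from_r1[-2:])

print(med_from_r1, med_from_r3)
```

These are the same MED values (6 and 2) that R2 shows for the two PEs.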

Whilst this technically does work for LAN1, it is arguably not the wisest solution to the problem. Even if the engineer had enabled OSPF on the interface rather than using redistribution we could have run into problems. Maybe there’s a better solution…

The Search

When it comes to searching for a solution to this quirk we have to keep in mind what we are trying to achieve as an end goal.

Perhaps one of the simplest solutions, on the face of it, is to make sure that the PE attached to the site where the network in question originates sets a higher local preference when redistributing into MP-BGP:

blog11_image9_redist

This would ensure that the reflectors would pick the correct VPNv4 route. And indeed if we configure it like that, it does appear to work:

R3#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R3(config)#ip prefix-list R7-LANS seq 5 permit 192.168.71.0/24
R3(config)#ip prefix-list R7-LANS seq 10 permit 192.168.72.0/24
R3(config)#route-map SET-LOCAL-PREF-HIGH permit 10
R3(config-route-map)#match ip address prefix-list R7-LANS
R3(config-route-map)#set local-preference 200
R3(config-route-map)#route-map SET-LOCAL-PREF-HIGH permit 20
R3(config-route-map)#router bgp 1
R3(config-router)#address-family ipv4 vrf A
R3(config-router-af)# redistribute ospf 1 match internal external 1 
  external 2 route-map SET-LOCAL-PREF-HIGH
R3(config-router-af)#
R2#sh bgp vpnv4 unicast rd 1:1 192.168.72.0
BGP routing table entry for 1:1:192.168.72.0/24, version 416
Paths: (3 available, best #1, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 3.3.3.3 (3.3.3.3)
      Origin incomplete, metric 20, localpref 200, valid, internal, best
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:3.3.3.3:0
      mpls labels in/out nolabel/40
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.1.1.1 (metric 10) (via default) from 1.1.1.1 (1.1.1.1)
      Origin incomplete, metric 20, localpref 100, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:1.1.1.1:0
      mpls labels in/out nolabel/48
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    3.3.3.3 (metric 10) (via default) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 20, localpref 200, valid, internal
      Extended Community: RT:100:100 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:3.3.3.3:0
      Originator: 3.3.3.3, Cluster list: 12.12.12.12
      mpls labels in/out nolabel/40
      rx pathid: 0, tx pathid: 0
R2#
R5#trace 192.168.72.1 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.72.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.5.4 13 msec 7 msec 10 msec
  2 10.4.11.11 62 msec 50 msec 8 msec
  3 10.11.12.12 [MPLS: Labels 24000/40 Exp 0] 36 msec
    10.2.11.2 [MPLS: Labels 16/40 Exp 0] 27 msec 15 msec
  4 10.3.7.3 [MPLS: Label 40 Exp 0] 20 msec 28 msec 24 msec
  5 10.3.7.7 17 msec *  17 msec
R5#

It’s worth pointing out here that even though the backdoor link is also advertising an E2 Type 5 LSA, for which the intra-area cost is not taken into consideration, if two E2 routes have the same lowest cost, the intra-area cost to the ASBR is taken into consideration as a tie breaker. In this case, it is quicker to get to R7 going over the sham-link.
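That tie-break can be sketched in Python as follows. The forward metric of 3 over the sham-link is taken from R4's earlier output, while the forward metric of 100 via the backdoor is a made-up number purely for illustration (the real backdoor cost isn't shown here). The names are hypothetical too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class E2Route:
    via: str
    metric: int          # external metric carried in the Type 5 LSA
    forward_metric: int  # internal cost to reach the ASBR


def best_e2(routes):
    """Lowest external metric wins; forward metric only breaks ties."""
    return min(routes, key=lambda r: (r.metric, r.forward_metric))


routes_on_r4 = [
    E2Route("backdoor", 20, 100),   # forward metric assumed
    E2Route("sham-link", 20, 3),
]

print(best_e2(routes_on_r4).via)
```

Note that a route with a lower external metric would still win even if its forward metric were far worse.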

However we have to think about how this design is intended to work. On the one hand we want the backdoor link to be used as a backup link, but we also want Site 2 to be dual-homed. This means that if XR1 somehow becomes unavailable (perhaps because R4 or its uplink to XR1 goes down) we want R1 to be the primary path out of the site. But as things stand, if XR1 goes down we will end up using the backdoor link. This is because R1 doesn’t have a sham-link. It will prefer its local OSPF route over MP-BGP as we saw earlier.

We can simulate just such a scenario by shutting down R4’s uplink and tracing to LAN2 before bringing it back up so traffic goes back over the sham-link.

R4#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#interface gi1.411
R4(config-subif)#shut
R4(config-subif)#
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from FULL to 
  DOWN, Neighbor Down: Interface down or detached
R4(config-subif)#do trace 192.168.72.1 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.72.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.5.5 23 msec 10 msec 6 msec
  2 10.5.7.7 11 msec *  14 msec
R4(config-subif)#no shut
R4(config-subif)#
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from DOWN to 
  INIT, Received Hello
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from INIT to 
  2WAY, 2-Way Received
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from 2WAY to 
  EXSTART, AdjOK?
R4(config-subif)#
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from EXSTART to 
  EXCHANGE, Negotiation Done
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from EXCHANGE to
  LOADING, Exchange Done
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from LOADING to 
  FULL, Loading Done
R4(config-subif)#do trace 192.168.72.1 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.72.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.11.11 10 msec 11 msec 6 msec
  2 10.11.12.12 [MPLS: Labels 24000/40 Exp 0] 39 msec 46 msec
    10.2.11.2 [MPLS: Labels 16/40 Exp 0] 8 msec
  3 10.3.7.3 [MPLS: Label 40 Exp 0] 18 msec 10 msec 22 msec
  4 10.3.7.7 38 msec *  19 msec
R4(config-subif)#

You could potentially run a different protocol across the backdoor link and rely on redistribution manipulation, but that could introduce more issues – I will leave those options open to discussion.

Possibly the best solution, in order to maintain OSPF as a contiguous area 0 running between both sites, is to give R1 a sham-link as well. This will allow R1 to form an adjacency with R3 and will prevent the redistribution of any OSPF routes into MP-BGP that would be preferred over the sham-link.

The Work

The work involved in configuration of the sham-link from R1 to R3 is analogous to what we saw on the R3 to XR1 link – the only difference being that both ends are IOS-XE routers.

R1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)#interface Loopback100
R1(config-if)# vrf forwarding A
R1(config-if)# ip address 100.100.100.100 255.255.255.255
R1(config-if)#router bgp 1
R1(config-router)#address-family ipv4 unicast vrf A
R1(config-router-af)#network 100.100.100.100 mask 255.255.255.255
R1(config-router-af)#router ospf 1 vrf A
R1(config-router)# area 0 sham-link 100.100.100.100 33.3.3.3

R3(config)#router ospf 1 vrf A
R3(config-router)#area 0 sham-link 33.3.3.3 100.100.100.100
%OSPF-5-ADJCHG:Process 2, Nbr 1.1.1.1 on OSPF_SL9 from LOADING to FULL,Loading Done
R3(config-router)#

R1#sh ip ospf sham-links
Sham Link OSPF_SL0 to address 33.3.3.3 is up
Area 0 source address 100.100.100.100
  Run as demand circuit
  DoNotAge LSA allowed. Cost of using 1 State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40,
    Hello due in 00:00:06
    Adjacency State FULL (Hello suppressed)
    Index 1/2/2, retransmission queue length 0, number of retransmission 0
    First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec
R1#
R1#sh ip route vrf A 192.168.72.0

Routing Table: A
Routing entry for 192.168.72.0/24
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
  Redistributing via bgp 1
  Advertised by bgp 1 match internal external 1 & 2
  Last update from 3.3.3.3 00:00:36 ago
  Routing Descriptor Blocks:
  * 3.3.3.3 (default), from 7.7.7.7, 00:00:36 ago
      Route metric is 20, traffic share count is 1
      MPLS label: 46
      MPLS Flags: MPLS Required
R1#

blog11_image10_dual sham links
We can now test to see that if XR1 is lost, traffic will still follow the same path.

R4#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R4(config)#interface gi1.411
R4(config-subif)#shut
R4(config-subif)#do
%OSPF-5-ADJCHG: Process 1, Nbr 11.11.11.11 on GigabitEthernet1.411 from FULL to 
  DOWN, Neighbor Down: Interface down or detached
R4(config-subif)#do trace 192.168.72.1 source loopback 0
Type escape sequence to abort.
Tracing the route to 192.168.72.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.5.5 6 msec 5 msec 4 msec
  2 10.1.5.1 5 msec 7 msec 8 msec
  3 10.1.2.2 [MPLS: Labels 16/46 Exp 0] 10 msec 17 msec 11 msec
  4 10.3.7.3 [MPLS: Label 46 Exp 0] 10 msec 6 msec 7 msec
  5 10.3.7.7 13 msec *  13 msec
R4(config-subif)#

R1 is now acting as a redundant link out of Site 2. Depending on the LSA types, you could even adjust which of XR1 or R1 is the primary exit for Site 2 by adjusting the costs of the sham-links! As with nearly anything that requires a full-mesh, scalability could become an issue but for our purposes here it works well.

Sham-links aren’t the most widely used tools across Service Providers but hopefully this blog has given some insight into how they work and what to consider to avoid some possible pitfalls. Are there any alternate solutions you can see that might work? I’m always keen to hear alternate ideas or comments. I came across this scenario whilst working through an INE lab, so if you haven’t seen ine.com you should definitely check them out! Thank you for reading and until next time.

From MPLS L3VPN to PBB-EVPN

This blog introduces PBB-EVPN over an MPLS network. But rather than just describe the technology from scratch, I have tried to structure the explanation assuming the reader is familiar with plain old MPLS L3VPN and is new to PBB and/or EVPN. This was certainly the case with me when I first studied this topic and I’m hoping others in a similar position will find this approach insightful.

I won’t be exploring a specific quirk or scenario – rather I will look at EVPN followed by PBB, giving analogies and comparisons to MPLS L3VPN as I go, before combining them into PBB-EVPN. I will focus on how traffic is identified, learned and forwarded in each section.

So what is PBB-EVPN? Well, besides being hard to say 3 times fast, it is essentially an L2VPN technology. It enables a Layer 2 bridge domain to be stretched across a Service Provider core while utilizing MAC aggregation to deal with scaling issues.

Let’s look at EVPN first.

EVPN

EVPN, or Ethernet VPN, over an MPLS network works on a similar principle to MPLS L3VPN. The best way to conceptualize the difference is to draw an analogy (colour coded to highlight points of comparison)…

MPLS L3VPN assigns PE interfaces to VRFs. It then uses MP-BGP (with the vpnv4 unicast address family) to advertise customer IP Subnets as VPNv4 routes to Route Reflectors or other PEs. Remote PEs that have a VRF configured to import the correct route targets, accept the MP-BGP update and install an ipv4 route into the routing table for that VRF.

EVPN uses PE interfaces linked to bridge-domains with an EVI. It then uses MP-BGP (with the l2vpn evpn address family) to advertise customer MAC addresses as EVPN routes to Route Reflectors or other PEs. Remote PEs that have an EVI configured to import the correct route target, accept the MP-BGP update and install a MAC address into the bridge domain for that EVI.

This analogy is a little crude, but in both cases packets or frames destined for a given subnet or MAC will be imposed with two labels – an inner VPN label and an outer Transport label. The Transport label is typically communicated via something like LDP and will correspond to the next hop loopback of the egress PE. The VPN label is communicated in the MP-BGP updates.

These diagrams illustrate the comparison:

Blog6_image1a_and_b

In EVPN, customer devices tend to be switches rather than routers. PE-CE routing protocols, like eBGP, aren’t used since it operates over layer 2. The Service Provider appears as one big switch. In this sense, it accomplishes the same as VPLS but (among other differences) uses BGP to distribute MAC address information, rather than using a full mesh of pseudowires.

EVPN uses an EVI, or EVPN Instance, to identify a specific instance of EVPN as it maps to a bridge domain. For the purposes of this overview, you can think of an EVI as being quasi-equivalent to a VRF. A customer facing interface will be put into a bridge domain (layer 2 broadcast domain), which will have an EVI associated with it.

The MAC address learning that EVPN utilizes is called control-plane learning, since it is BGP (a control-plane routing protocol) that distributes the MAC address information. This is in contrast to data-plane learning, which is how a standard switch learns MAC addresses – by associating the source MAC address of a frame to the receiving interface.

The following Cisco IOS-XR config shows an EVPN bridge domain and edge interface setup, side by side with a MPLS L3VPN setup for comparison:

Blog6_output1a_and_b

NB. For the MPLS L3VPN config, the RD config (which is usually configured under the CE-PE eBGP config) is not shown. PBB config is shown in the EVPN bridge domain; this will be explained later in the blog.

EVPN seems simple enough at first glance, but it has a scaling problem, which PBB can ultimately help with…

Any given customer site can have hundreds or even thousands of MAC addresses, as opposed to just one subnet (as in an MPLS L3VPN environment). The number of updates and withdrawals that BGP would have to send could be overwhelming if it needed to make adjustments for MAC addresses appearing and disappearing – not to mention the memory requirements. And you can’t summarise MAC addresses like you can IP ranges. It would be like an MPLS L3VPN environment advertising /32 prefixes for every host rather than just one prefix for the subnet. We need a way to summarise or aggregate the MAC addresses.

Here’s where PBB comes in…

PBB – Provider Backbone Bridging (802.1ah)

PBB can help solve the EVPN scaling issue by performing one key function – it maps each customer MAC address to the MAC address of the attaching PE. Customer MAC addresses are called C-MACs. The PE MAC addresses are called B-MACs (or Backbone MACs).

This works by adding an extra layer 2 header to the frame as it is forwarded from one site to another across the provider core. The outer layer 2 header has a destination B-MAC address of the PE device that the inner frame’s destination C-MAC is associated with. As a result, PBB is often called MAC-in-MAC. This diagram illustrates the concept:

Blog6_image2_pbb

NB. In PBB terminology the provider devices are called Bridges. So a BEB (Backbone Edge Bridge) is a PE and a BCB (Backbone Core Bridge) is a P. For the sake of simplicity, I will continue to use PE/P terminology. Also worth noting is that PBB diagrams often show service provider devices as switches, to illustrate the layer 2 nature of the technology – which I’ve done above.

In the above diagram the SID (or Service ID) represents a layer 2 broadcast domain similar to what an EVI represents in EVPN.

Frames arriving on a PE interface will be inspected and, based on certain characteristics, mapped or assigned to a particular Service ID (SID).

The characteristics that determine what SID a frame belongs to can be a number of things:

  • The customer assigned VLAN
  • The Service Provider assigned VLAN
  • Existing SID identifiers
  • The interface it arrives on
  • A combination of the above or other factors

To draw an analogy to MPLS L3VPN – the VRF that an incoming packet is assigned to is determined by whatever VRF is configured on the receiving interface (using ip vrf forwarding CUST_1 in Cisco IOS interface CLI).

Once the SID has been allocated, the entire frame is then encapsulated in the outer layer 2 header with destination MAC of the egress PE.
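As a rough mental model of that MAC-in-MAC step, here is a small Python sketch. It only models the fields discussed in this blog; the real 802.1ah header carries more than this. All class names, MAC addresses and the SID value are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerFrame:
    src_cmac: str
    dst_cmac: str
    payload: bytes


@dataclass(frozen=True)
class BackboneFrame:
    src_bmac: str         # ingress PE
    dst_bmac: str         # egress PE
    sid: int              # service the frame was mapped to
    inner: CustomerFrame  # original frame, carried untouched


def encapsulate(frame, sid, local_bmac, cmac_to_bmac):
    """Wrap a customer frame in an outer header addressed to the egress PE."""
    # An unknown destination C-MAC raises KeyError; BUM handling is not modelled
    dst_bmac = cmac_to_bmac[frame.dst_cmac]
    return BackboneFrame(local_bmac, dst_bmac, sid, frame)


# Hypothetical mapping held by the ingress PE
cmac_to_bmac = {"00:00:00:cc:00:02": "00:00:00:bb:00:02"}
frame = CustomerFrame("00:00:00:cc:00:01", "00:00:00:cc:00:02", b"hello")

outer = encapsulate(frame, sid=100, local_bmac="00:00:00:bb:00:01",
                    cmac_to_bmac=cmac_to_bmac)
print(outer.dst_bmac, outer.sid)
```

The P routers in the core would only ever look at `src_bmac` and `dst_bmac`, never at the inner frame.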

In this way C-MACs are mapped to either B-MACs or local attachment circuits. Most importantly however the core P routers do not need to learn all of the MAC addresses of the customers. They only deal with the MAC addresses of the PEs. This allows a PE to aggregate all of the attached C-MACs for a given customer behind its own B-MAC.

But how does a remote PE learn which C-MAC maps to which B-MAC?

In PBB learning is done in the data-plane, much like a regular layer 2 switch. When a PE receives a frame from the PBB core, it will strip off the outer layer 2 header and make a note of the source B-MAC (the ingress PE). It will map this source B-MAC to the source C-MAC found on the inner layer 2 header. When a frame arrives on a local attachment circuit, the PE will map the source C-MAC to the attachment circuit in the usual way.
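A minimal sketch of that learning behaviour in Python might look like this. The class, method names and addresses are all hypothetical, and things like ageing timers are left out.

```python
class PbbPe:
    """Toy model of a PE's C-MAC table built by data-plane learning."""

    def __init__(self, bmac):
        self.bmac = bmac
        self.cmac_table = {}

    def learn_from_core(self, src_bmac, src_cmac):
        # Frame arrived from the PBB core: remember which PE sits in front
        # of this customer MAC.
        self.cmac_table[src_cmac] = ("bmac", src_bmac)

    def learn_from_ac(self, attachment_circuit, src_cmac):
        # Frame arrived locally: remember the attachment circuit as usual.
        self.cmac_table[src_cmac] = ("ac", attachment_circuit)

    def lookup(self, dst_cmac):
        # None means unknown unicast, i.e. BUM traffic
        return self.cmac_table.get(dst_cmac)


pe = PbbPe("00:00:00:bb:00:01")
pe.learn_from_ac("Gi0/0/0/1.10", "00:00:00:cc:00:01")
pe.learn_from_core("00:00:00:bb:00:02", "00:00:00:cc:00:02")

print(pe.lookup("00:00:00:cc:00:02"))
print(pe.lookup("00:00:00:cc:00:99"))
```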

PBB must deal with BUM traffic too. BUM traffic is Broadcast, Unknown Unicast or Multicast traffic. An example of BUM traffic is the arrival of a frame for which the destination MAC address is unknown. Rather than broadcast like a regular layer 2 switch would, a PBB PE will set the destination MAC address of the outer layer 2 header to a special multicast MAC address that is built based on the SID and includes all the egress PEs that are part of the same bridge domain. EVPN uses a different method of handling BUM traffic but I will go into that later in the blog.

Overall, PBB is more complicated than the explanation given here, but this is the general principle (if you’re interested, see section 3 of my VPLS, PBB, EVPN and VxLAN Diagrams document that details how PBB can be combined with 802.1ad to add an aggregation layer to a provider network).

Now that we have the MAC-in-MAC features of PBB at our disposal, we can use it to solve the EVPN scaling problem and combine the two…

PBB-EVPN

With the help of PBB, EVPN can be adapted so that it deals with only the B-MACs.

To accomplish this, each EVPN EVI is linked to two bridge domains. One bridge domain is dedicated to customer MAC addresses and connected to the local attachment circuits. The other is dedicated to the PE routers’ B-MAC addresses. Both of these bridge domains are combined under the same bridge group.

Blog6_image3_bridge_domains

The PE devices will use data-plane learning to build a MAC database, mapping each C-MAC to either an attachment circuit or the B-MAC of an egress PE. Source C-MAC addresses are learned and associated as traffic flows through the network just like PBB does.

The overall setup would look like this:

Blog6_image4_pbb_evpn_overview

The only thing EVPN needs to concern itself with is advertising the B-MACs of the PE devices. EVPN uses control-plane learning and includes the B-MACs in the MP-BGP l2vpn evpn updates. For example, if you were to look at the MAC addresses known to a particular EVI on a route-reflector, you would only see MAC addresses for PE routers.
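To put some (entirely made-up) numbers on the saving, the Python below counts how many MAC routes BGP would carry with and without PBB. The four PEs and 1000 C-MACs per PE are hypothetical, and it assumes a single B-MAC per PE, which may not hold in every design.

```python
def mac_routes_advertised(cmacs_per_pe, pbb):
    """Count the MAC routes BGP has to carry for one EVI."""
    if pbb:
        # PBB-EVPN: only one B-MAC per PE is advertised (assumption)
        return len(cmacs_per_pe)
    # Plain EVPN: every customer MAC becomes its own route
    return sum(cmacs_per_pe.values())


# Hypothetical customer with 1000 C-MACs behind each of four PEs
site_cmacs = {"PE1": 1000, "PE2": 1000, "PE3": 1000, "PE4": 1000}

print(mac_routes_advertised(site_cmacs, pbb=False))
print(mac_routes_advertised(site_cmacs, pbb=True))
```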

Looking again at the configuration output that we saw above, we can get a better idea of how PBB-EVPN works:

Blog6_output2_pbb_evpn_detail

NB. I have added the concept of a BVI, or Bridged Virtual Interface, to the above output. This can be used to provide a layer 3 breakout or gateway similar to how an SVI works on a L3 switch.

You can view the MAC address information using the following command:

Blog6_output3_macs

Now let’s look at how PBB-EVPN handles BUM traffic. Unlike PBB on its own, which just sends to a multicast MAC address, PBB-EVPN will use unicast replication and send copies of the frame to all of the remote PEs that are in the same EVI. This is an EVPN method and the PE knows which remote PEs belong to the same EVI by looking in what is called a flood list.

But how does it build this flood list? To learn that, we need to look at EVPN route-types…

MPLS L3VPN sends VPNv4 routes in its updates. But EVPN sends more than one “type” of update. The type of update, or route-type as it is called, will denote what kind of information is carried in the update. The route-type is part of the EVPN NLRI.

For the purposes of this blog we will only look at two route-types.

  • Route-Type 2s, which carry MAC addresses (analogous to VPNv4 updates)
  • Route-Type 3s, which carry information on the egress PEs that belong to an EVI.

It is these Route-Type 3s (or RT-3s for short) that are used to build the flood list.

When BUM traffic is received by a PE, it will send copies of the frame to all of its attachment circuits (except the one it received the frame on) and all of the PEs for which it has received a Route-Type 3 update. In other words, it will send to everything in its flood-list.
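The flood-list behaviour described above can be sketched in a few lines of Python. This is a simplified model that only covers a frame arriving on an attachment circuit, as in the paragraph above; the function names, EVI numbers and PE addresses are hypothetical.

```python
def build_flood_list(local_acs, rt3_updates, evi):
    """Return the flood list for one EVI.

    local_acs   -- names of the local attachment circuits
    rt3_updates -- (evi, originating PE address) pairs from Route-Type 3 updates
    """
    remote_pes = sorted({pe for update_evi, pe in rt3_updates if update_evi == evi})
    return [("ac", ac) for ac in local_acs] + [("pe", pe) for pe in remote_pes]


def replicate_bum(flood_list, ingress):
    # Send one copy to every flood-list member except the ingress circuit.
    return [target for target in flood_list if target != ingress]


# Two PEs have advertised an RT-3 for EVI 100, one for EVI 200.
rt3 = [(100, "10.0.0.2"), (100, "10.0.0.3"), (200, "10.0.0.4")]
flood = build_flood_list(["ac1", "ac2"], rt3, evi=100)
copies = replicate_bum(flood, ingress=("ac", "ac1"))
```

With these inputs the PE in EVI 200 never makes it into the flood list, and a frame received on ac1 is copied to ac2 and to the two remote PEs in EVI 100.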

So the overall process for a BUM packet being forwarded across a PBB-EVPN backbone will look as follows:

Blog6_image5_bum_traffic

So that’s it, in a nutshell. In this way PBB and EVPN can work together to create an L2VPN network across a Service Provider.

There are other aspects of both PBB and EVPN, such as EVPN multi-homing using Ethernet Segment Identifiers or PBB MAC clearing with MIRP to name just a couple, but the purpose of this blog was to provide an introductory overview – specifically for those used to dealing with MPLS L3VPN. Thoughts are welcome, and as always, thank you for reading.

MPLS Management misconfiguration

There are many different ways for ISPs to manage MPLS devices like routers and firewalls that are deployed to customer sites. This quirk explores one such solution and looks at a scenario where a misconfiguration results in VRF route leaking between customers.

The quirk

When an ISP deploys Customer Edge (CE) devices to customer sites they might, and often do, want to maintain management access. For customers with a simple public internet connection this is usually straightforward – the device is reachable over the internet and an ACL or similar policy will be configured, allowing access from only a list of approved ISP IP addresses (for extra security VPNs could be used).

However, when Peer-to-Peer L3VPN MPLS is used, it is more complicated. The customer network is not directly accessible from the internet without going through some kind of a breakout site. The ISP will either need a link into their customer's MPLS network or must configure access through the breakout. This can become complicated as the number of customers, and the number of sites per customer, increases.

One option, presented in this quirk, is to have all MPLS customers' PE-CE WAN subnets come from a common supernet range. These WAN subnets can then be exported into a common management VRF using a specific RT. The network that will be used to demonstrate this looks as follows:

blog4_image1_base_setup

This is available for download as a GNS3 lab from here. It includes the solution to the quirk as detailed below.

The ISP's ASN is 500. The two customers have ASNs 100 and 200 (depending on the setup these would typically be private ASNs, but they have been shown here as 100 and 200 for simplicity). A management router (MGMT) in ASN 64512 has access to the PE-CE WAN ranges for all of the customers, all of which come from the supernet 172.30.0.0/16. A special subnet within this range, 172.30.254.0/24, is reserved for the Management network itself. The MGMT router, or MPLS jump box as it may also be called, is connected to this range – as would any other devices requiring access to the MPLS customers' devices (backup or monitoring systems for instance… not shown).

The basic idea is that each customer VRF exports their PE-CE WAN ranges with an RT of 500:501. The MGMT VRF then imports this RT.

Alongside this, the MGMT VRF will export its own routes (from the 172.30.254.0/24 subnet) with an RT of 500:500. All of the customer VRFs import 500:500.

This has two key features:

  • Customer WAN ranges will all be from the 172.30.0.0/16 supernet and must not overlap between customers.
  • WAN ranges and site subnets are not, at any point, leaked between customer VRFs.
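A rough way to convince yourself of the second point is to model the route-target logic in Python. The sketch below is not router code: it simply treats a route as visible in a VRF when the route carries at least one RT that the VRF imports. The prefixes and RTs follow the lab described in this post, and the data structures and function name are my own.

```python
# Import route-targets per VRF, as described in the text.
VRFS = {
    "CUST_1": {"import": {"500:1", "500:500"}},
    "CUST_2": {"import": {"500:2", "500:500"}},
    "MGMT_VRF": {"import": {"500:500", "500:501"}},
}

# (origin VRF, prefix, route-targets attached on export)
ROUTES = [
    ("CUST_1", "172.30.1.0/30", {"500:1", "500:501"}),    # WAN range
    ("CUST_1", "192.168.50.0/24", {"500:1"}),             # LAN range
    ("CUST_2", "172.30.1.8/30", {"500:2", "500:501"}),    # WAN range
    ("MGMT_VRF", "172.30.254.0/30", {"500:500"}),         # management range
]


def visible_prefixes(vrf):
    """Prefixes a VRF ends up with: any route sharing an RT with its import list."""
    wanted = VRFS[vrf]["import"]
    return sorted(prefix for _, prefix, rts in ROUTES if rts & wanted)


for name in VRFS:
    print(name, visible_prefixes(name))
```

In this model CUST_2 sees its own WAN range and the management range but nothing from CUST_1, while MGMT_VRF sees every WAN range and no LAN ranges.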

To get a better idea of how it works, take a look at the following diagram:

blog4_image2_mpls_mgmt_concept

The CLI for each customer VRF setup looks as follows:

ip vrf CUST_1
 description Customer_1_VRF
 rd 500:1
 vpn id 500:1
 export map VRF_EXPORT_MAP
 route-target export 500:1
 route-target import 500:1
 route-target import 500:500
!
route-map VRF_EXPORT_MAP permit 10
 match ip address prefix-list VRF_WANS_EXCEPT_MGMT
 set extcommunity rt 500:501 additive
route-map VRF_EXPORT_MAP permit 20
!
ip prefix-list VRF_WANS_EXCEPT_MGMT seq 10 deny 172.30.254.0/24 le 32
ip prefix-list VRF_WANS_EXCEPT_MGMT seq 20 permit 172.30.0.0/16 le 32

Note that the export map used on customer VRFs makes a point of excluding routes from the Management subnet (172.30.254.0/24). This is done on the off chance that the range exists within the customer's VRF table.
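If the interaction between the prefix list and the export map is not obvious, the following Python sketch walks through it. It is a simplified model of prefix-list matching (first match wins, implicit deny at the end, "le" widening the accepted prefix lengths) rather than a faithful re-implementation of IOS, and the function names are my own.

```python
import ipaddress


def entry_matches(route, entry_prefix, ge=None, le=None):
    """IOS-style prefix-list entry match (simplified model)."""
    route = ipaddress.ip_network(route)
    entry = ipaddress.ip_network(entry_prefix)
    # Without ge/le the length must match exactly; ge alone implies le 32.
    low = ge if ge is not None else entry.prefixlen
    if le is not None:
        high = le
    elif ge is not None:
        high = 32
    else:
        high = entry.prefixlen
    return route.subnet_of(entry) and low <= route.prefixlen <= high


def prefix_list_permits(route, entries):
    # First matching entry wins; no match at all means an implicit deny.
    for action, prefix, ge, le in entries:
        if entry_matches(route, prefix, ge, le):
            return action == "permit"
    return False


VRF_WANS_EXCEPT_MGMT = [
    ("deny", "172.30.254.0/24", None, 32),
    ("permit", "172.30.0.0/16", None, 32),
]


def export_rts(route, base_rts):
    """Mimic VRF_EXPORT_MAP: add 500:501 only to non-management WAN prefixes."""
    rts = set(base_rts)
    if prefix_list_permits(route, VRF_WANS_EXCEPT_MGMT):
        rts.add("500:501")   # sequence 10: set extcommunity rt 500:501 additive
    return rts               # sequence 20: exported with its RTs unchanged
```

Calling export_rts("172.30.1.0/30", {"500:1"}) adds 500:501, whereas a LAN range such as 192.168.50.0/24, or anything inside 172.30.254.0/24, keeps only the RTs it already had.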

The VRF for the Management network is configured as follows (note this is only configured on PE3 in the above lab):

ip vrf MGMT_VRF
 description VRF for Management of Customer CEs
 rd 500:500
 vpn id 500:500
 route-target export 500:500
 route-target import 500:500
 route-target import 500:501

This results in the WAN ranges for customers being tagged with the 500:501 RT but not the LAN ranges.

PE1#sh bgp vpnv4 unicast vrf CUST_1 172.30.1.0/30
BGP routing table entry for 500:1:172.30.1.0/30, version 9
Paths: (1 available, best #1, table CUST_1)
  Advertised to update-groups:
    1         3

  Local
    0.0.0.0 from 0.0.0.0 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:500:1 RT:500:501
      mpls labels in/out 23/aggregate(CUST_1)

PE1#sh bgp vpnv4 unicast vrf CUST_1 192.168.50.0/24
BGP routing table entry for 500:1:192.168.50.0/24, version 3
Paths: (1 available, best #1, table CUST_1)
  Advertised to update-groups:
    3

  100
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1
      mpls labels in/out 24/nolabel
PE1#

192.168.50.0/24, above, is one of the LAN ranges and does not have the 500:501 RT.

Every VRF can see the management network and the management network can see all the PE-CE WAN ranges for every customer:

PE1#sh ip route vrf CUST_2

Routing Table: CUST_2
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1
       L2 - IS-IS level-2, ia - IS-IS inter area, * - candidate default
       U - per-user static route, o - ODR
       P - periodic downloaded static route

Gateway of last resort is not set

B       192.168.60.0/24 [20/0] via 172.30.1.10, 01:32:17
        172.30.0.0/30 is subnetted, 3 subnets
B         172.30.254.0 [200/0] via 3.3.3.3, 01:32:09
B         172.30.1.4 [200/0] via 2.2.2.2, 01:32:09
C         172.30.1.8 is directly connected, FastEthernet1/0
B       192.168.50.0/24 [200/0] via 2.2.2.2, 01:32:09

PE1#
PE3#sh ip route vrf MGMT_VRF

Routing Table: MGMT_VRF
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1
       L2 - IS-IS level-2, ia - IS-IS inter area, * - candidate default
       U - per-user static route, o - ODR
       P - periodic downloaded static route

Gateway of last resort is not set

        172.30.0.0/30 is subnetted, 4 subnets
C         172.30.254.0 is directly connected, FastEthernet0/0
B         172.30.1.0 [200/0] via 1.1.1.1, 01:32:24
B         172.30.1.4 [200/0] via 2.2.2.2, 01:32:24
B         172.30.1.8 [200/0] via 1.1.1.1, 01:32:24

PE3#

Also, note that the routing table for Customer 2 (vrf CUST_2) cannot see the 172.30.1.0/30 WAN range for Customer 1 (vrf CUST_1).

Given the proper config, the MGMT router can access the WAN ranges for customers:

MGMT#telnet 172.30.1.2
Trying 172.30.1.2 ... Open

User Access Verification
Password:
CE1-1>

NB. I’m not advocating using telnet in such an environment. Use SSH as a minimum when you can.

The quirk comes in when a simple misconfiguration introduces route leaking between customer VRFs.

Consider an engineer accidentally configuring a VRF that exports all its vpnv4 prefixes with RT 500:500 (rather than only exporting its PE-CE WAN routes with RT 500:501 as described above). The mistake is easy enough to make and will cause routes from the newly configured VRF to be imported by all other customer VRFs. This will have a severe impact on any customers with the same route within their VRF.

To demonstrate this, imagine that the CUST_1 VRF is not yet configured. A traceroute from Customer 2 Site 2 (CE2-2, on the lower left side of the diagram), sourced from 192.168.60.1, to 192.168.50.1 at Customer 2 Site 1 (CE1-2) works fine:

CE2-2#trace 192.168.50.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.50.1
 1 172.30.1.9 12 msec 24 msec 24 msec
 2 10.10.14.4 [AS 500] [MPLS: Labels 16/24 Exp 0] 92 msec 64 msec 44 msec
 3 172.30.1.5 [AS 500] [MPLS: Label 24 Exp 0] 48 msec 68 msec 52 msec
 4 172.30.1.6 [AS 500] 116 msec 88 msec 104 msec

CE2-2#

If the CUST_1 VRF is now set up with the aforementioned misconfiguration, route leaking between CUST_1 and CUST_2 will result:

PE1(config)#ip vrf CUST_1
PE1(config-vrf)# description Customer_1_VRF
PE1(config-vrf)# rd 500:1
PE1(config-vrf)# vpn id 500:1
PE1(config-vrf)# route-target export 500:1
PE1(config-vrf)# route-target import 500:1
PE1(config-vrf)# route-target export 500:500
PE1(config-vrf)#
PE1(config-vrf)# interface FastEthernet0/1
PE1(config-if)# description Link to CE 1 for Customer 1
PE1(config-if)# ip vrf forwarding CUST_1
PE1(config-if)# ip address 172.30.1.1 255.255.255.252
PE1(config-if)# duplex auto
PE1(config-if)# speed auto
PE1(config-if)# no shut
PE1(config-if)#exit
PE1(config)#router bgp 500
PE1(config-router)# address-family ipv4 vrf CUST_1
PE1(config-router-af)# redistribute connected
PE1(config-router-af)# redistribute static
PE1(config-router-af)# neighbor 172.30.1.2 remote-as 100
PE1(config-router-af)# neighbor 172.30.1.2 description Customer 1 Site 1
PE1(config-router-af)# neighbor 172.30.1.2 activate
PE1(config-router-af)# neighbor 172.30.1.2 default-originate
PE1(config-router-af)# neighbor 172.30.1.2 as-override
PE1(config-router-af)# neighbor 172.30.1.2 route-map CUST_1_SITE_1_IN in
PE1(config-router-af)# no synchronization
PE1(config-router-af)# exit-address-family
PE1(config-router)#

VRF CUST_1 will export its routes (including 192.168.50.0/24 from Customer 1 Site 1 – CE1-1) and the VRF CUST_2 will import these routes due to the RT of 500:500.

Looking at the BGP and routing table for the CUST_2 VRF shows that the next hop for 192.168.50.0/24 is now the CE1-1 router.

PE1#sh ip route vrf CUST_2 192.168.50.0
Routing entry for 192.168.50.0/24
  Known via "bgp 500", distance 20, metric 0
  Tag 100, type external
  Last update from 172.30.1.2 00:02:45 ago
  Routing Descriptor Blocks:
  * 172.30.1.2 (CUST_1), from 172.30.1.2, 00:02:45 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 100

PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 21
Paths: (2 available, best #1, table CUST_2)
  Advertised to update-groups:
    2

  100, imported path from 500:1:192.168.50.0/24
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1 RT:500:500

  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24

PE1#

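There are now two possible paths to reach 192.168.50.0/24: one imported from the VRF for CUST_1 and one from its own VRF (coming from CE1-2). With local preference, AS-path length, origin and metric all equal in the output above, the path via AS 100 is preferred because it is external (eBGP), and BGP prefers eBGP over iBGP paths before it compares the IGP metric to the next hop. Note the 500:500 RT in this path.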
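To see how the leaked path ends up winning, here is a heavily simplified Python sketch of the BGP decision process applied to the two paths from the output above. It only models weight, local preference, AS-path length, eBGP-over-iBGP and IGP metric, in that order; origin and MED are left out because they are identical for both paths here. The dictionary keys and the function name are my own.

```python
def best_path(paths):
    """Pick a best path using a simplified subset of the BGP decision steps."""
    def key(p):
        return (
            -p.get("weight", 0),          # higher weight wins
            -p["localpref"],              # higher local preference wins
            len(p["as_path"]),            # shorter AS path wins
            0 if p["external"] else 1,    # eBGP preferred over iBGP
            p.get("igp_metric", 0),       # lower IGP metric to the next hop wins
        )
    return min(paths, key=key)


leaked = {"name": "imported from CUST_1 via CE1-1", "localpref": 100,
          "as_path": [100], "external": True}
correct = {"name": "via 2.2.2.2 towards CE1-2", "localpref": 100,
           "as_path": [200], "external": False, "igp_metric": 20}

print(best_path([correct, leaked])["name"])
```

In this model the two paths tie on everything up to the external/internal comparison, at which point the leaked path from CUST_1 is selected.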

Once this is done, CE2-2 cannot reach its 192.168.50.0/24 subnet on CE1-2.

CE2-2#trace 192.168.50.1 source lo1
Type escape sequence to abort.

Tracing the route to 192.168.50.1
1 172.30.1.9 8 msec 12 msec 12 msec
2 * * *
3 * * *
4 * * *
...output omitted for brevity

Granted, this issue is caused by a mistake, but the difference between the correct and incorrect commands is minimal. An engineer under pressure or working quickly could potentially disrupt a massive MPLS infrastructure resulting in outages for multiple customers.

The search

As mentioned at the beginning of this blog, there are multiple ways to manage an MPLS network.

One possibility is to have a single router that, rather than import and export WAN routes based on RTs, has a single loopback address in each VRF. It is from this loopback that the router will source SSH or telnet sessions to the customer CE devices. For example:

interface loopback 1
 description Loopback source for Customer 1
 ip vrf forwarding CUST_1
 ip address 100.100.100.100 255.255.255.255
!
interface loopback 2
 description Loopback source for Customer 2
 ip vrf forwarding CUST_2
 ip address 100.100.100.100 255.255.255.255

MGMT# telnet 172.30.1.2 /vrf CUST_1

This has a number of advantages:

  • This router acts as a single jump host (rather than a subnet), which could be considered more secure
  • There is no restriction on the WAN addresses for each customer. They can be any WAN range at all and can overlap between customers.
  • The same IP address can be used for each VRF's loopback (as long as it doesn’t clash with any existing IPs already in the customer's VRF).

However there are a number of disadvantages:

  • Each VRF must be configured on this jump router
  • This jump router is a single point of failure
  • The command to log on is more complex and requires the users to know the VRF's exact name rather than just the router IP.
  • Migrating to this solution, from the aforementioned RT import/export solution, would be a cumbersome and long process.
  • Centralised MPLS backups could be complicated if there is not a common subnet (like 172.30.254.0/24) reachable by all CE devices.

For these reasons it was decided not to use this solution. Rather, it was decided to use import filtering to prevent this issue from taking place even if the misconfiguration occurred. The import filtering uses a route-map that makes the following sequential checks:

    1. If a route has the RT 500:500 and is from the management range (172.30.254.0/24) allow it.
    2. If any other route has the RT 500:500, deny it.
    3. Allow the import of all other routes.
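The three checks above translate almost directly into code. The following Python function is just a sketch of the decision logic (the function and constant names are mine), and it assumes the route has already been accepted by the VRF's route-target imports.

```python
import ipaddress

MGMT_RANGE = ipaddress.ip_network("172.30.254.0/24")
MGMT_RT = "500:500"


def import_allowed(prefix, rts):
    """Apply the three checks in order; the first one that matches decides."""
    net = ipaddress.ip_network(prefix)
    has_mgmt_rt = MGMT_RT in rts
    if has_mgmt_rt and net.subnet_of(MGMT_RANGE):
        return True      # 1. management route carrying 500:500
    if has_mgmt_rt:
        return False     # 2. anything else carrying 500:500
    return True          # 3. all other routes


print(import_allowed("192.168.50.0/24", {"500:1", "500:500"}))
```

A leaked customer LAN such as 192.168.50.0/24 tagged with 500:500 falls through to the second check and is rejected, while the same prefix carrying only the customer's own RT passes on the third.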

Essentially, rather than just importing 500:500, this route-map checks to make sure that a vpnv4 prefix comes from the management range of 172.30.254.0/24. The biggest issue in this scenario was the deployment of this route-map to all VRFs on all PEs. But with a little bit of scripting (I won’t go into the details here), this was far more plausible than the option of deploying a multi-VRF jump router.
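I won't reproduce the actual script, but to give a flavour of the kind of thing involved, here is a purely illustrative Python sketch that generates the per-VRF lines. The function name is hypothetical, the VRF names would come from your own inventory, and how the lines get pushed to each PE is deliberately left out.

```python
def import_map_config(vrf_names, route_map="VRF_IMPORT_MAP"):
    """Return the IOS lines that attach the import map to each VRF."""
    lines = []
    for vrf in vrf_names:
        lines.append(f"ip vrf {vrf}")
        lines.append(f" import map {route_map}")
        lines.append("!")
    return lines


print("\n".join(import_map_config(["CUST_1", "CUST_2"])))
```

Generating the text is the easy part; reviewing the output before it goes anywhere near a production PE is still strongly advised.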

The work

The route map described in the above section looks as follows:

ip extcommunity-list standard VRF_MGMT_COMMUNITY permit rt 500:500
ip prefix-list VRF_MGMT_LAN seq 5 permit 172.30.254.0/24 le 32
!
route-map VRF_IMPORT_MAP permit 10
 match ip address prefix-list VRF_MGMT_LAN
 match extcommunity VRF_MGMT_COMMUNITY
!
route-map VRF_IMPORT_MAP deny 20
 match extcommunity VRF_MGMT_COMMUNITY
!
route-map VRF_IMPORT_MAP permit 30

NB. This is a good example of AND/OR operation in a route-map. If the match types within a sequence differ (in this case a prefix list and an extcommunity list), the operation is treated as a conjunction (AND). If the types are the same, it is a disjunction (OR).
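The AND/OR behaviour can be modelled with a couple of nested loops. The Python below is a sketch of that idea only: the helper functions are simplified stand-ins for a prefix-list entry with "le 32" and an extcommunity-list entry, and all of the names are my own.

```python
import ipaddress


def in_prefix(route, prefix):
    # Stand-in for a prefix-list entry of the form '<prefix> le 32'.
    return ipaddress.ip_network(route["prefix"]).subnet_of(ipaddress.ip_network(prefix))


def has_rt(route, rt):
    # Stand-in for an extcommunity-list entry 'permit rt <rt>'.
    return rt in route["rts"]


def sequence_matches(route, match_statements):
    """match_statements is a list of (test function, configured values) pairs."""
    return all(
        any(test(route, value) for value in values)   # same type: OR
        for test, values in match_statements          # different types: AND
    )


# Sequence 10 of VRF_IMPORT_MAP: prefix list AND extcommunity list.
seq_10 = [(in_prefix, ["172.30.254.0/24"]), (has_rt, ["500:500"])]
mgmt = {"prefix": "172.30.254.0/30", "rts": {"500:500"}}
leaked = {"prefix": "192.168.50.0/24", "rts": {"500:1", "500:500"}}

print(sequence_matches(mgmt, seq_10), sequence_matches(leaked, seq_10))
```

An empty list of match statements matches everything, which is why the final "permit 30" sequence acts as a catch-all.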

This will prevent the issue from occurring as it will stop the import of any vpnv4 prefix that has an RT of 500:500 unless it is from the management range.

Here is the configuration of this import map on PE1 (the other PEs are not shown but it should be configured on them too):

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)# ip extcommunity-list standard VRF_MGMT_COMMUNITY permit rt 500:500
PE1(config)#ip prefix-list VRF_MGMT_LAN seq 5 permit 172.30.254.0/24 le 32
PE1(config)#!
PE1(config)#route-map VRF_IMPORT_MAP permit 10
PE1(config-route-map)# match ip address prefix-list VRF_MGMT_LAN
PE1(config-route-map)# match extcommunity VRF_MGMT_COMMUNITY
PE1(config-route-map)#!
PE1(config-route-map)#route-map VRF_IMPORT_MAP deny 20
PE1(config-route-map)# match extcommunity VRF_MGMT_COMMUNITY
PE1(config-route-map)#!
PE1(config-route-map)#route-map VRF_IMPORT_MAP permit 30
PE1(config-route-map)#
PE1(config-route-map)#ip vrf CUST_2
PE1(config-vrf)#import map VRF_IMPORT_MAP

After this addition, in the event that the misconfiguration takes place when creating the CUST_1 VRF, the import map will block the 192.168.50.0/24 subnet. The only path that the CUST_2 VRF has to 192.168.50.0/24 is from CE1-2, which is correct. Here is the configuration and resulting verification:

PE1(config)#ip vrf CUST_1
PE1(config-vrf)# description Customer_1_VRF
PE1(config-vrf)# rd 500:1
PE1(config-vrf)# vpn id 500:1
PE1(config-vrf)# route-target export 500:1
PE1(config-vrf)# route-target import 500:1
PE1(config-vrf)# route-target export 500:500
PE1#sh ip route vrf CUST_2 192.168.50.0
Routing entry for 192.168.50.0/24
  Known via "bgp 500", distance 200, metric 0
  Tag 200, type internal
  Last update from 2.2.2.2 00:22:12 ago
  Routing Descriptor Blocks:
  * 2.2.2.2 (Default-IP-Routing-Table), from 5.5.5.5, 00:22:12 ago
    Route metric is 0, traffic share count is 1
    AS Hops 1
    Route tag 200

PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 12
Paths: (1 available, best #1, table CUST_2)
Advertised to update-groups:
    2
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#
CE2-2#trace 192.168.50.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.50.1

 1 172.30.1.9 12 msec 24 msec 8 msec
 2 10.10.14.4 [AS 500] [MPLS: Labels 18/24 Exp 0] 60 msec 68 msec 64 msec
 3 172.30.1.5 [AS 500] [MPLS: Label 24 Exp 0] 52 msec 68 msec 44 msec
 4 172.30.1.6 [AS 500] 84 msec 56 msec 56 msec

CE2-2#

Management of the correct WAN device is still working as well…

MGMT#telnet 172.30.1.10
Trying 172.30.1.10 ... Open

User Access Verification

Password:
CE2-2>

Just for good measure, and to double check that our route-map is making a difference, let’s see what happens if we remove the import map from the CUST_2 VRF.

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)#ip vrf CUST_2
PE1(config-vrf)#no import map VRF_IMPORT_MAP
PE1(config-vrf)#^Z
PE1#
*Mar 1 00:27:45.259: %SYS-5-CONFIG_I: Configured from console by console
PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 22
Paths: (2 available, best #1, table CUST_2)
Flag: 0x820
  Advertised to update-groups:
    2
  100, imported path from 500:1:192.168.50.0/24
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1 RT:500:500
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#

The offending route is imported into the CUST_2 VRF pretty quickly, proving that our route-map works. If the route map is put back in place and we wait for the BGP Scanner to run (30 seconds or less), the vpnv4 prefix is blocked again:

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)#ip vrf CUST_2
PE1(config-vrf)#import map VRF_IMPORT_MAP
PE1(config-vrf)#^Z
PE1#
*Mar 1 00:29:51.443: %SYS-5-CONFIG_I: Configured from console by console
PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 24
Paths: (1 available, best #1, table CUST_2)
Flag: 0x820
  Advertised to update-groups:
    2
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#

This quirk shows just one way to successfully configure MPLS management and protect against misconfiguration. Give me a shout if anything was unclear or if you have any thoughts. As mentioned earlier, the GNS3 lab is available for download so have a tinker and see what you think.

Site by Site MPLS Breakout Migration

This month's quirk is a bit late. I have been studying furiously and managed to pass my Deploying Cisco Service Provider Advanced Network Routing exam last week. Only two to go before I get CCNP SP. 🙂

Another plus side is that I have a tonne of study notes that I will be uploading over the next few weeks. So anyone interested in Multicast, BGP or IPv6 watch this space.

Anyways, this quirk looks at a design solution whereby a 100+ site MPLS customer needed to change the Service Provider for their primary internet breakout one site at a time…

 

The quirk     

The customer had an L3VPN MPLS cloud with a new ISP, but still had their primary internet breakout with their old ISP.

The below diagram shows a stripped down version of such a network, illustrating the basic idea:

blog3_image1_base_setup

So whilst all of the MPLS sites connected to the new ISP's core, the link to the internet was still going out through a site that connected to the old provider.

The customer needed to move the default route and primary breakout over but did not want to do a single “big bang” migration and move all of the sites at once. Rather, they wanted to migrate one site at a time.

The search

The first step in looking at how to accomplish this was to break down the requirements. The following conditions needed to be met:

  • Each site must still be able to access all other sites and the file/application servers at the primary breakout site. These servers would be moved to the new ISP connection and breakout site 2 last of all.
  • As each site moves over to the new breakout, they only need PAT to gain access to the internet – no public services are run at the remote sites.
  • The PI space held by the customer, used for public facing services on the application servers, would be moved to the new provider once all sites were migrated.
  • Sites must be able to be moved one at a time without affecting any other sites.
  • The majority of MPLS sites were single homed with a static default.

Looking at these requirements gave us a good idea of what we needed to achieve.

Policy-based routing was considered first, adjusting either the next hop or the VRF based on the source address. However, this would require too much overhead in identifying the sites that had been moved, either by community value or by source prefix, combined with setting the next hop or VRF to use.

Ultimately, the use of a second VRF with “all but default” route leaking was decided upon. This involved creating a second VRF with a default route pointing to the new ISP breakout. All routes except the defaults were to be leaked between these VRFs.

This meant that all we needed to do to migrate a site was change the VRF to which the attachment circuit belonged.

It is worth highlighting that had there been a significant number of multihomed sites implementing BGP, using policy-based routing may have been preferred. This is because a large number of BGP neighborships would need to be reconfigured to the correct VRF.

The work

The below output has been taken from a simulation. The MPLS sites have been represented using loopbacks 1-3 on PE_RTR.

First we will take a look at a traceroute to the internet (to IP 50.50.50.50) and the routing table for the original VRF before any changes were made: 

PE_RTR#sh ip route vrf CUST-A-OLD-ISP

Routing Table: CUST-A-OLD-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 1.1.1.1 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:15:34
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:15:34
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:15:34
C 192.168.1.0/24 is directly connected, Loopback1
C 192.168.2.0/24 is directly connected, Loopback2
C 192.168.3.0/24 is directly connected, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:15:34
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:15:34
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:19
B* 0.0.0.0/0 [200/5] via 1.1.1.1, 00:15:43
PE_RTR#

PE_RTR#trace vrf CUST-A-OLD-ISP 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 17/20 Exp 0] 116 msec 72 msec 48 msec
 2 10.10.10.1 [MPLS: Label 20 Exp 0] 24 msec 44 msec 24 msec
 3 10.10.10.2 20 msec 20 msec 36 msec
 4 192.168.50.1 28 msec 56 msec 24 msec
 5 100.100.100.1 116 msec 52 msec 72 msec
 6 100.111.111.1 64 msec 140 msec 60 msec
PE_RTR#

So the WAN range of the breakout in this simulation is 100.100.100.0/29. This is their PI space. Notice the range 192.168.101.0/24, which is the subnet that the file/application servers are on.

The VRF configuration on the PEs is straightforward.

ip vrf CUST-A-OLD-ISP
 description VRF for Old ISP Breakout
 rd 100:1
 route-target export 100:1
 route-target import 100:1

Before we created the new VRF, we needed a way to differentiate what can and cannot be leaked. For this we used filtering when exporting RTs. We designated the RT 100:100 for routes that should be leaked.

First we started by making a prefix list that catches the default route:

ip prefix-list defaultRoute seq 5 permit 0.0.0.0/0
ip prefix-list defaultRoute seq 50 deny 0.0.0.0/0 le 32

Then we specified a route-map that attached the RT 100:100 to prefixes that are not the default route:

route-map ALL-EXCEPT-DEFAULT permit 10
 match ip address prefix-list defaultRoute
!
route-map ALL-EXCEPT-DEFAULT permit 20
 set extcommunity rt 100:100 additive

Note the use of the additive keyword so as not to overwrite any existing communities.
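As a quick illustration of what the export map does, and of why additive matters, here is a small Python sketch. It models a route's RTs as a set, and the function names are my own. The second function shows my understanding of what happens if additive is left off (the set action replaces the existing RTs), so treat that part as an assumption to verify on your platform.

```python
import ipaddress

DEFAULT = ipaddress.ip_network("0.0.0.0/0")


def all_except_default(prefix, rts):
    """Mimic the ALL-EXCEPT-DEFAULT export map."""
    if ipaddress.ip_network(prefix) == DEFAULT:
        return set(rts)              # sequence 10: default route, nothing set
    return set(rts) | {"100:100"}    # sequence 20: add the leak RT (additive)


def without_additive(prefix, rts):
    """Same map without 'additive' (assumed behaviour: existing RTs are replaced)."""
    if ipaddress.ip_network(prefix) == DEFAULT:
        return set(rts)
    return {"100:100"}


print(all_except_default("192.168.3.0/24", {"100:1"}))
print(all_except_default("0.0.0.0/0", {"100:1"}))
```

The default route keeps only the VRF's own RT and so is never leaked, while every other prefix carries both its own RT and 100:100.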

Once we had these set up, we created the new VRF and applied this route-map in the form of an export-map to set the correct RTs. We made sure to import 100:100 and then applied the same to the original VRF.

ip vrf CUST-A-NEW-ISP
 description VRF for New ISP Breakout
 rd 100:2
 export map ALL-EXCEPT-DEFAULT
 route-target export 100:2
 route-target import 100:100
 route-target import 100:2
!
ip vrf CUST-A-OLD-ISP
 description VRF for Old ISP Breakout
 rd 100:1
 export map ALL-EXCEPT-DEFAULT
 route-target export 100:1
 route-target import 100:100
 route-target import 100:1
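To check the logic of these two VRFs, the following Python sketch models which routes each one imports. It is a simplification that ignores everything except route-targets; the prefixes and next hops are taken from the simulation in this post, and the function names are my own.

```python
OLD, NEW, LEAK = "100:1", "100:2", "100:100"
IMPORTS = {"CUST-A-OLD-ISP": {OLD, LEAK}, "CUST-A-NEW-ISP": {NEW, LEAK}}


def exported(prefix, own_rt):
    # The VRF's own RT, plus the leak RT for everything except the default route.
    rts = {own_rt}
    if prefix != "0.0.0.0/0":
        rts.add(LEAK)
    return rts


ROUTES = [
    # (prefix, next hop or interface, RTs attached on export)
    ("0.0.0.0/0", "1.1.1.1", exported("0.0.0.0/0", OLD)),
    ("0.0.0.0/0", "2.2.2.2", exported("0.0.0.0/0", NEW)),
    ("192.168.101.0/24", "1.1.1.1", exported("192.168.101.0/24", OLD)),
    ("192.168.1.0/24", "Loopback1", exported("192.168.1.0/24", NEW)),
]


def table(vrf):
    """Prefix -> next hop for every route sharing an RT with the VRF's imports."""
    return {prefix: nh for prefix, nh, rts in ROUTES if rts & IMPORTS[vrf]}


print(table("CUST-A-NEW-ISP"))
print(table("CUST-A-OLD-ISP"))
```

Each VRF ends up with its own default route only, while the server subnet and the site prefixes appear in both tables.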

From here, after deploying this to all the relevant PEs and injecting a new default route, the migration from one VRF to another was fairly straightforward. Below is an example using a simulated loopback (the principle would be the same for the incoming attachment circuit to a customer site):

PE_RTR(config)#interface Loopback1
PE_RTR(config-if)# ip vrf forwarding CUST-A-NEW-ISP
% Interface Loopback1 IP address 192.168.1.1 removed due to enabling VRF CUST-A-NEW-ISP
PE_RTR(config-if)# ip address 192.168.1.1 255.255.255.0

If we look at the routing table for this new VRF we see the following:

PE_RTR#sh ip route vrf CUST-A-NEW-ISP

Routing Table: CUST-A-NEW-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 2.2.2.2 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:16:16
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:16:16
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:16:16
C 192.168.1.0/24 is directly connected, Loopback1
B 192.168.2.0/24 is directly connected, 00:16:17, Loopback2
B 192.168.3.0/24 is directly connected, 00:16:23, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:16:16
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:16:18
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:20
B* 0.0.0.0/0 [200/0] via 2.2.2.2, 00:16:18
PE_RTR#

An interesting side note here is that even though Loopback2 and 3 are directly connected, they are shown as having been learned through BGP. This is the result of the import from the original VRF. Indeed upon closer inspection of one of the prefixes we see the 100:100 community:

PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-NEW-ISP 192.168.3.0/24
BGP routing table entry for 100:2:192.168.3.0/24, version 47
Paths: (1 available, best #1, table CUST-A-NEW-ISP)
 Not advertised to any peer
 Local, imported path from 100:1:192.168.3.0/24
 0.0.0.0 from 0.0.0.0 (3.3.3.3)
 Origin incomplete, metric 0, localpref 100, weight 32768, valid, external, best
 Extended Community: RT:100:1 RT:100:100
 mpls labels in/out nolabel/aggregate(CUST-A-OLD-ISP)

And looking at the default route we see no such community and a different next hop from the original table.

PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-NEW-ISP 0.0.0.0
BGP routing table entry for 100:2:0.0.0.0/0, version 40
Paths: (1 available, best #1, table CUST-A-NEW-ISP)
 Not advertised to any peer
 65489
 2.2.2.2 (metric 3) from 2.2.2.2 (2.2.2.2)
 Origin incomplete, metric 5, localpref 200, valid, internal, best
 Extended Community: RT:100:2
 mpls labels in/out nolabel/23

The old VRF's table still shows a route for the newly migrated site (although now learned via BGP) and the default route is still as it was originally:

PE_RTR#sh ip route vrf CUST-A-OLD-ISP

Routing Table: CUST-A-OLD-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 1.1.1.1 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:15:34
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:15:34
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:15:34
B 192.168.1.0/24 is directly connected, 00:15:36, Loopback1
C 192.168.2.0/24 is directly connected, Loopback2
C 192.168.3.0/24 is directly connected, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:15:34
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:15:34
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:19
B* 0.0.0.0/0 [200/5] via 1.1.1.1, 00:15:43
PE_RTR#
PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-OLD-ISP 0.0.0.0
BGP routing table entry for 100:1:0.0.0.0/0, version 15
Paths: (1 available, best #1, table CUST-A-OLD-ISP)
 Not advertised to any peer
 65489
 1.1.1.1 (metric 3) from 1.1.1.1 (1.1.1.1)
 Origin incomplete, metric 0, localpref 100, valid, internal, best
 Extended Community: RT:100:1
 mpls labels in/out nolabel/26

Finally, a traceroute test shows that the newly migrated site accesses the internet via a different site and can still reach the application server subnet:

PE_RTR#trace vrf CUST-A-NEW-ISP 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 16/20 Exp 0] 44 msec 40 msec 52 msec
 2 10.20.20.1 [MPLS: Label 20 Exp 0] 32 msec 36 msec 52 msec
 3 10.20.20.2 52 msec 40 msec 32 msec
 4 192.168.51.1 54 msec 39 msec 31 msec
 5 200.200.200.2 68 msec 60 msec 32 msec
 6 200.222.222.2 65 msec 143 msec 62 msec

PE_RTR#
PE_RTR#trace vrf CUST-A-NEW-ISP 192.168.101.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.101.1

 1 10.1.3.2 [MPLS: Labels 16/22 Exp 0] 56 msec 52 msec 44 msec
 2 10.10.10.1 [MPLS: Label 22 Exp 0] 36 msec 24 msec 24 msec
 3 10.10.10.2 40 msec 40 msec 36 msec
 4 192.168.50.1 26 msec 57 msec 23 msec
 5 192.168.101.1 32 msec 48 msec 36 msec
PE_RTR#

One final point to make is that advertising the PI space to both providers for backup purposes was a possibility. AS-path prepending could have been used from breakout site 2 to make that path less preferred. But complications come into play depending on how each provider advertises the PI space and whether they honour any adjustments that the customer makes. Should return traffic not follow the same path, stateful firewall sessions would also run into difficulty.
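
For anyone curious what the prepend option might look like, here is a rough sketch of outbound AS-path prepending on the breakout site 2 edge router. This was not part of the lab, and every value in it is a placeholder assumption: the prefix 203.0.113.0/24 stands in for the PI space, AS 64496 for the customer AS, AS 64511 for the provider AS and 192.0.2.1 for the provider peer. The names PI-SPACE and PREPEND-PI-OUT are made up for illustration.

```
! Hypothetical sketch only: all prefixes, AS numbers, neighbors and names are placeholders
ip prefix-list PI-SPACE seq 5 permit 203.0.113.0/24
!
! Prepend the local AS three times on the PI prefix so this path looks longer
route-map PREPEND-PI-OUT permit 10
 match ip address prefix-list PI-SPACE
 set as-path prepend 64496 64496 64496
! Let any other prefixes through untouched
route-map PREPEND-PI-OUT permit 20
!
router bgp 64496
 neighbor 192.0.2.1 remote-as 64511
 neighbor 192.0.2.1 route-map PREPEND-PI-OUT out
```

Prepending only influences the AS-path length step of best-path selection, so a provider that sets local preference on its side, or rewrites the path, could still override it. That is exactly the "whether they honour any adjustments" caveat above.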

So a pretty straightforward solution in the end, but an interesting one from a migration standpoint. I am interested to hear whether anyone would have taken a different approach. Perhaps we should have done policy-based routing, or maybe another solution? As usual, thoughts are always welcome.