From MPLS L3VPN to PBB-EVPN

This blog introduces PBB-EVPN over an MPLS network. But rather than just describing the technology from scratch, I have tried to structure the explanation assuming the reader is familiar with plain old MPLS L3VPN and is new to PBB and/or EVPN. This was certainly the case with me when I first studied this topic, and I’m hoping others in a similar position will find this approach insightful.

I won’t be exploring a specific quirk or scenario – rather I will look at EVPN followed by PBB, giving analogies and comparisons to MPLS L3VPN as I go, before combining them into PBB-EVPN. I will focus on how traffic is identified, learned and forwarded in each section.

So what is PBB-EVPN? Well, besides being hard to say 3 times fast, it is essentially an L2VPN technology. It enables a Layer 2 bridge domain to be stretched across a Service Provider core while utilizing MAC aggregation to deal with scaling issues.

Let’s look at EVPN first.

EVPN

EVPN, or Ethernet VPN, over an MPLS network works on a similar principle to MPLS L3VPN. The best way to conceptualize the difference is to draw an analogy (colour coded to highlight points of comparison)…

MPLS L3VPN assigns PE interfaces to VRFs. It then uses MP-BGP (with the vpnv4 unicast address family) to advertise customer IP subnets as VPNv4 routes to Route Reflectors or other PEs. Remote PEs that have a VRF configured to import the correct route targets accept the MP-BGP update and install an IPv4 route into the routing table for that VRF.

EVPN uses PE interfaces linked to bridge domains with an EVI. It then uses MP-BGP (with the l2vpn evpn address family) to advertise customer MAC addresses as EVPN routes to Route Reflectors or other PEs. Remote PEs that have an EVI configured to import the correct route target accept the MP-BGP update and install a MAC address into the bridge domain for that EVI.

This analogy is a little crude, but in both cases packets or frames destined for a given subnet or MAC will have two labels imposed – an inner VPN label and an outer Transport label. The Transport label is typically communicated via something like LDP and corresponds to the next-hop loopback of the egress PE. The VPN label is communicated in the MP-BGP updates.

These diagrams illustrate the comparison:

Blog6_image1a_and_b

In EVPN, customer devices tend to be switches rather than routers. PE-CE routing protocols, like eBGP, aren’t used since EVPN operates at layer 2. The Service Provider appears as one big switch. In this sense, it accomplishes the same thing as VPLS but (among other differences) uses BGP to distribute MAC address information, rather than a full mesh of pseudowires.

EVPN uses an EVI, or EVPN Instance identifier, to identify a specific EVPN instance as it maps to a bridge domain. For the purposes of this overview, you can think of an EVI as being quasi-equivalent to a VRF. A customer-facing interface will be put into a bridge domain (a layer 2 broadcast domain), which will have an EVI identifier associated with it.

The MAC address learning that EVPN utilizes is what is called control-plane learning, since it is BGP (a control-plane routing protocol) that distributes the MAC address information. This is in contrast to data-plane learning, which is how a standard switch learns MAC addresses – by associating the source MAC address of a frame with the receiving interface.

The following Cisco IOS-XR config shows an EVPN bridge domain and edge interface setup, side by side with an MPLS L3VPN setup for comparison:

Blog6_output1a_and_b

NB. For the MPLS L3VPN config, the RD (which is usually configured under the CE-PE eBGP config) is not shown. PBB config is shown in the EVPN bridge domain; this will be explained further into the blog.
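
Since that output is embedded as an image, here is a minimal IOS-XR-style sketch of the sort of configuration being compared. The bridge group, bridge domain, interface, EVI, I-SID and RT values are invented for illustration rather than taken from the original screenshot, and the PBB edge/core split shown here is the part that gets explained later in the post:

! PBB-EVPN side: customer-facing sub-interface in an edge bridge domain,
! tied to a core bridge domain that runs EVPN for an EVI
l2vpn
 bridge group CUST_A
  bridge-domain CUST_A_EDGE
   interface GigabitEthernet0/0/0/1.100
   pbb edge i-sid 10100 core-bridge CUST_A_CORE
  !
  bridge-domain CUST_A_CORE
   pbb core
    evpn evi 100
   !
  !
 !
!
evpn
 evi 100
  bgp
   route-target import 500:100
   route-target export 500:100
  !
 !
!
! MPLS L3VPN side for comparison: a VRF with import/export RTs,
! with the RD sitting under the BGP VRF section on IOS-XR
vrf CUST_A
 address-family ipv4 unicast
  import route-target 500:1
  export route-target 500:1
 !
!
router bgp 500
 vrf CUST_A
  rd 500:1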

EVPN seems simple enough at first glance, but it has a scaling problem, which PBB can ultimately help with…

Any given customer site can have hundreds or even thousands of MAC addresses, as opposed to just one subnet (as in an MPLS L3VPN environment). The number of updates and withdrawals that BGP would have to send could be overwhelming if it needed to make adjustments for MAC addresses appearing and disappearing – not to mention the memory requirements. And you can’t summarise MAC addresses like you can IP ranges. It would be like an MPLS L3VPN environment advertising /32 prefixes for every host rather than just one prefix for the subnet. We need a way to summarise or aggregate the MAC addresses.

Here’s where PBB comes in…

PBB – Provider Backbone Bridging (802.1ah)

PBB can help solve the EVPN scaling issue by performing one key function – it maps each customer MAC address to the MAC address of the attaching PE. Customer MAC addresses are called C-MACs. The PE MAC addresses are called B-MACs (or Backbone MACs).

This works by adding an extra layer 2 header to the frame as it is forwarded from one site to another across the provider core. The outer layer 2 header has a destination B-MAC address of the PE device that the inner frame’s destination C-MAC is associated with. As a result, PBB is often called MAC-in-MAC. This diagram illustrates the concept:

Blog6_image2_pbb

NB. In PBB terminology the provider devices are called Bridges. So a BEB (Backbone Edge Bridge) is a PE and a BCB (Backbone Core Bridge) is a P. For the sake of simplicity, I will continue to use PE/P terminology. Also worth noting is that PBB diagrams often show service provider devices as switches, to illustrate the layer 2 nature of the technology – which I’ve done above.

In the above diagram the SID (or Service ID) represents a layer 2 broadcast domain similar to what an EVI represents in EVPN.

Frames arriving on a PE interface will be inspected and, based on certain characteristics, mapped or assigned to a particular Service ID (SID).

The characteristics that determine what SID a frame belongs to can be a number of things:

  • The customer assigned VLAN
  • The Service Provider assigned VLAN
  • Existing SID identifiers
  • The interface it arrives on
  • A combination of the above or other factors

To draw an analogy to MPLS L3VPN – the VRF that an incoming packet is assigned to is determined by whatever VRF is configured on the receiving interface (using ip vrf forwarding CUST_1 in Cisco IOS interface CLI).

Once the SID has been allocated, the entire frame is then encapsulated in the outer layer 2 header, with the destination MAC set to the B-MAC of the egress PE.

In this way C-MACs are mapped to either B-MACs or local attachment circuits. Most importantly, however, the core P routers do not need to learn all of the MAC addresses of the customers. They only deal with the MAC addresses of the PEs. This allows a PE to aggregate all of the attached C-MACs for a given customer behind its own B-MAC.

But how does a remote PE learn which C-MAC maps to which B-MAC?

In PBB, learning is done in the data plane, much like a regular layer 2 switch. When a PE receives a frame from the PBB core, it will strip off the outer layer 2 header and make a note of the source B-MAC (the ingress PE). It will map this source B-MAC to the source C-MAC found in the inner layer 2 header. When a frame arrives on a local attachment circuit, the PE will map the source C-MAC to the attachment circuit in the usual way.

PBB must deal with BUM traffic too. BUM traffic is Broadcast, Unknown Unicast or Multicast traffic. An example of BUM traffic is the arrival of a frame for which the destination MAC address is unknown. Rather than flooding like a regular layer 2 switch would, a PBB PE will set the destination MAC address of the outer layer 2 header to a special multicast MAC address that is built based on the SID and includes all the egress PEs that are part of the same bridge domain. EVPN uses a different method of handling BUM traffic, but I will go into that later in the blog.

Overall, PBB is more complicated than the explanation given here, but this is the general principle (if you’re interested, see section 3 of my VPLS, PBB, EVPN and VxLAN Diagrams document, which details how PBB can be combined with 802.1ad to add an aggregation layer to a provider network).

Now that we have the MAC-in-MAC features of PBB at our disposal, we can use it to solve the EVPN scaling problem and combine the two…

PBB-EVPN

With the help of PBB, EVPN can be adapted so that it deals with only the B-MACs.

To accomplish this, each EVPN EVI is linked to two bridge domains. One bridge domain is dedicated to customer MAC addresses and connected to the local attachment circuits. The other is dedicated to the PE routers’ B-MAC addresses. Both of these bridge domains are combined under the same bridge group.

Blog6_image3_bridge_domains

The PE devices will use data-plane learning to build a MAC database, mapping each C-MAC to either an attachment circuit or the B-MAC of an egress PE. Source C-MAC addresses are learned and associated as traffic flows through the network, just as in PBB.

The overall setup would look like this:

Blog6_image4_pbb_evpn_overview

The only thing EVPN needs to concern itself with is advertising the B-MACs of the PE devices. EVPN uses control-plane learning and includes the B-MACs in the MP-BGP l2vpn evpn updates. For example, if you were to look at the MAC addresses known to a particular EVI on a route reflector, you would only see MAC addresses for PE routers.
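
To make the parallel with MPLS L3VPN concrete, the relevant BGP address families on an IOS-XR PE look broadly like the sketch below. This is illustrative only: the route reflector address is a placeholder and is not taken from any lab in this post.

router bgp 500
 address-family vpnv4 unicast
 !
 address-family l2vpn evpn
 !
 neighbor 192.0.2.1
  remote-as 500
  update-source Loopback0
  ! vpnv4 unicast carries the L3VPN prefixes
  address-family vpnv4 unicast
  !
  ! l2vpn evpn carries the (PBB-)EVPN routes, i.e. the B-MACs here
  address-family l2vpn evpn
  !
 !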

Looking again at the configuration output that we saw above, we can get a better idea of how PBB-EVPN works:

Blog6_output2_pbb_evpn_detail

NB. I have added the concept of a BVI, or Bridged Virtual Interface, to the above output. This can be used to provide a layer 3 breakout or gateway, similar to how an SVI works on an L3 switch.
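
As a rough sketch of what that BVI addition looks like on IOS-XR (the names and addressing below are placeholders, not the exact lines from the screenshot):

l2vpn
 bridge group CUST_A
  bridge-domain CUST_A_EDGE
   routed interface BVI100
  !
 !
!
interface BVI100
 description L3 gateway for the customer bridge domain
 ipv4 address 192.168.10.1 255.255.255.0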

You can view the MAC address information using the following command:

Blog6_output3_macs
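
Since that output is an image, for reference the IOS-XR commands involved are typically along the lines of the following (outputs omitted here; I am assuming the screenshot shows something similar):

show evpn evi mac
show l2vpn forwarding bridge-domain mac-address location 0/0/CPU0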

Now let’s look at how PBB-EVPN handles BUM traffic. Unlike PBB on its own, which just sends to a multicast MAC address, PBB-EVPN will use unicast replication and send copies of the frame to all of the remote PEs that are in the same EVI. This is an EVPN method, and the PE knows which remote PEs belong to the same EVI by looking in what is called a flood list.

But how does it build this flood list? To learn that, we need to look at EVPN route-types…

MPLS L3VPN sends VPNv4 routes in its updates. But EVPN sends more than one “type” of update. The type of update, or route-type as it is called, denotes what kind of information is carried in the update. The route-type is part of the EVPN NLRI.

For the purposes of this blog we will only look at two route-types.

  • Route-Type 2s, which carry MAC addresses (analogous to VPNv4 updates)
  • Route-Type 3s, which carry information on the egress PEs that belong to an EVI.

It is these Route-Type 3s (or RT-3s for short) that are used to build the flood list.

When BUM traffic is received by a PE, it will send copies of the frame to all of its attachment circuits (except the one it received the frame on) and all of the PEs for which it has received a Route-Type 3 update. In other words, it will send to everything in its flood-list.

So the overall process for a BUM packet being forwarded across a PBB-EVPN backbone will look as follows:

Blog6_image5_bum_traffic

So that’s it, in a nutshell. In this way PBB and EVPN can work together to create an L2VPN network across a Service Provider.

There are other aspects of both PBB and EVPN, such as EVPN multi-homing using Ethernet Segment Identifiers or PBB MAC clearing with MIRP to name just a couple, but the purpose of this blog was to provide an introductory overview – specifically for those used to dealing with MPLS L3VPN. Thoughts are welcome, and as always, thank you for reading.

MPLS Management misconfiguration

There are many different ways for ISPs to manage MPLS devices like routers and firewalls that are deployed to customer sites. This quirk explores one such solution and looks at a scenario where a misconfiguration results in VRF route leaking between customers.

The quirk

When an ISP deploys Customer Edge (CE) devices to customer sites they might, and often do, want to maintain management. For customers with a simple public internet connection this is usually straightforward – the device is reachable over the internet and an ACL or similar policy will be configured, allowing access from only a list of approved ISP IP addresses (for extra security, VPNs could be used).
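
As a rough illustration of that kind of policy on a Cisco IOS CE (the management source range below is invented for the example), it might look something like this:

! Allow device management only from the ISP's management sources
ip access-list standard ISP_MGMT_SOURCES
 permit 198.51.100.0 0.0.0.255
 deny   any log
!
line vty 0 4
 access-class ISP_MGMT_SOURCES in
 transport input ssh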

However, when peer-to-peer MPLS L3VPN is used, it is more complicated. The customer network is not directly accessible from the internet without going through some kind of breakout site. The ISP will either need a link into their customer’s MPLS network or must configure access through the breakout. This can become complicated as the number of customers, and the number of sites per customer, increases.

One option, presented in this quirk, is to have all MPLS customers’ PE-CE WAN subnets come from a common supernet range. These WAN subnets can then be exported into a common management VRF using a specific RT. The network that will be used to demonstrate this looks as follows:

blog4_image1_base_setup

This is available for download as a GNS3 lab from here. It includes the solution to the quirk as detailed below.

The ISP’s ASN is 500. The two customers have ASNs 100 and 200 (depending on the setup these would typically be private ASNs, but they have been shown here as 100 and 200 for simplicity). A management router (MGMT) in ASN 64512 has access to the PE-CE WAN ranges for all of the customers, all of which come from the supernet 172.30.0.0/16. A special subnet within this range, 172.30.254.0/24, is reserved for the Management network itself. The MGMT router, or MPLS jump box as it may also be called, is connected to this range – as would be any other devices requiring access to the MPLS customers’ devices (backup or monitoring systems for instance… not shown).

The basic idea is that each customer VRF exports its PE-CE WAN ranges with an RT of 500:501. The MGMT VRF then imports this RT.

Alongside this, the MGMT VRF exports its own routes (from the 172.30.254.0/24 range) with an RT of 500:500. All of the customer VRFs import 500:500.

This has two key features:

  • Customer WAN ranges will all be from the 172.30.0.0/16 supernet and must not overlap between customers.
  • WAN ranges and site subnets are not, at any point, leaked between customer VRFs.

To get a better idea of how it works, take a look at the following diagram:

blog4_image2_mpls_mgmt_concept

The CLI for each customer VRF setup looks as follows:

ip vrf CUST_1
 description Customer_1_VRF
 rd 500:1
 vpn id 500:1
 export map VRF_EXPORT_MAP
 route-target export 500:1
 route-target import 500:1
 route-target import 500:500
!
route-map VRF_EXPORT_MAP permit 10
 match ip address prefix-list VRF_WANS_EXCEPT_MGMT
 set extcommunity rt 500:501 additive
route-map VRF_EXPORT_MAP permit 20
!
ip prefix-list VRF_WANS_EXCEPT_MGMT seq 10 deny 172.30.254.0/24 le 32
ip prefix-list VRF_WANS_EXCEPT_MGMT seq 20 permit 172.30.0.0/16 le 32

Note that the export map used on customer VRFs makes a point of excluding routes from the Management subnet (172.30.254.0/24). This is done on the off chance that the range exists within the customer’s VRF table.

The VRF for the Management network is configured as follows (note this is only configured on PE3 in the above lab):

ip vrf MGMT_VRF
 description VRF for Management of Customer CEs
 rd 500:500
 vpn id 500:500
 route-target export 500:500
 route-target import 500:500
 route-target import 500:501

This results in the customers’ WAN ranges, but not their LAN ranges, being tagged with the 500:501 RT.

PE1#sh bgp vpnv4 unicast vrf CUST_1 172.30.1.0/30
BGP routing table entry for 500:1:172.30.1.0/30, version 9
Paths: (1 available, best #1, table CUST_1)
  Advertised to update-groups:
    1         3

  Local
    0.0.0.0 from 0.0.0.0 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, 
       sourced, best
      Extended Community: RT:500:1 RT:500:501
      mpls labels in/out 23/aggregate(CUST_1)

PE1#sh bgp vpnv4 unicast vrf CUST_1 192.168.50.0/24
BGP routing table entry for 500:1:192.168.50.0/24, version 3
Paths: (1 available, best #1, table CUST_1)
  Advertised to update-groups:
    3

  100
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1
      mpls labels in/out 24/nolabel
PE1#

192.168.50.0/24, above, is one of the LAN ranges and does not have the 500:501 RT.

Every VRF can see the management network and the management network can see all the PE-CE WAN ranges for every customer:

PE1#sh ip route vrf CUST_2

Routing Table: CUST_2
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1
       L2 - IS-IS level-2, ia - IS-IS inter area, * - candidate default
       U - per-user static route, o - ODR
       P - periodic downloaded static route

Gateway of last resort is not set

B       192.168.60.0/24 [20/0] via 172.30.1.10, 01:32:17
        172.30.0.0/30 is subnetted, 3 subnets
B         172.30.254.0 [200/0] via 3.3.3.3, 01:32:09
B         172.30.1.4 [200/0] via 2.2.2.2, 01:32:09
C         172.30.1.8 is directly connected, FastEthernet1/0
B       192.168.50.0/24 [200/0] via 2.2.2.2, 01:32:09

PE1#
PE3#sh ip route vrf MGMT_VRF

Routing Table: MGMT_VRF
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1
       L2 - IS-IS level-2, ia - IS-IS inter area, * - candidate default
       U - per-user static route, o - ODR
       P - periodic downloaded static route

Gateway of last resort is not set

        172.30.0.0/30 is subnetted, 4 subnets
C         172.30.254.0 is directly connected, FastEthernet0/0
B         172.30.1.0 [200/0] via 1.1.1.1, 01:32:24
B         172.30.1.4 [200/0] via 2.2.2.2, 01:32:24
B         172.30.1.8 [200/0] via 1.1.1.1, 01:32:24

PE3#

Also, note that the routing table for Customer 2 (vrf CUST_2) cannot see the 172.30.1.0/30 WAN range for Customer 1 (vrf CUST_1).

Given the proper config, the MGMT router can access the WAN ranges for customers:

MGMT#telnet 172.30.1.2
Trying 172.30.1.2 ... Open

User Access Verification
Password:
CE1-1>

NB. I’m not advocating using telnet in such an environment. Use SSH as a minimum when you can.
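
For completeness, a minimal IOS SSH setup would be along these lines (the username, domain name and key size are arbitrary placeholders):

username admin secret StrongPasswordHere
ip domain-name example.net
crypto key generate rsa modulus 2048
ip ssh version 2
!
line vty 0 4
 login local
 transport input ssh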

The quirk comes in when a simple misconfiguration introduces route leaking between customer VRFs.

Consider an engineer accidentally configuring a VRF that exports all of its vpnv4 prefixes with RT 500:500 (rather than only exporting its PE-CE WAN routes with RT 500:501 as described above). The mistake is easy enough to make and will cause routes from the newly configured VRF to be imported by all other customer VRFs. This will have a severe impact on any customer with the same route within their VRF.

To demonstrate this, imagine that the CUST_1 VRF is not yet configured. A traceroute from Customer 2 Site 2 (CE2-2 on the lower left side of the diagram), sourced from 192.168.60.1, to Customer 2 Site 1 (CE1-2) at 192.168.50.1 works fine:

CE2-2#trace 192.168.50.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.50.1
 1 172.30.1.9 12 msec 24 msec 24 msec
 2 10.10.14.4 [AS 500] [MPLS: Labels 16/24 Exp 0] 92 msec 64 msec 44 msec
 3 172.30.1.5 [AS 500] [MPLS: Label 24 Exp 0] 48 msec 68 msec 52 msec
 4 172.30.1.6 [AS 500] 116 msec 88 msec 104 msec

CE2-2#

If the CUST_1 VRF is now setup with the aforementioned misconfiguration, route leaking between CUST_1 and CUST_2 will result:

PE1(config)#ip vrf CUST_1
PE1(config-vrf)# description Customer_1_VRF
PE1(config-vrf)# rd 500:1
PE1(config-vrf)# vpn id 500:1
PE1(config-vrf)# route-target export 500:1
PE1(config-vrf)# route-target import 500:1
PE1(config-vrf)# route-target export 500:500
PE1(config-vrf)#
PE1(config-vrf)# interface FastEthernet0/1
PE1(config-if)# description Link to CE 1 for Customer 1
PE1(config-if)# ip vrf forwarding CUST_1
PE1(config-if)# ip address 172.30.1.1 255.255.255.252
PE1(config-if)# duplex auto
PE1(config-if)# speed auto
PE1(config-if)# no shut
PE1(config-if)#exit
PE1(config)#router bgp 500
PE1(config-router)# address-family ipv4 vrf CUST_1
PE1(config-router-af)# redistribute connected
PE1(config-router-af)# redistribute static
PE1(config-router-af)# neighbor 172.30.1.2 remote-as 100
PE1(config-router-af)# neighbor 172.30.1.2 description Customer 1 Site 1
PE1(config-router-af)# neighbor 172.30.1.2 activate
PE1(config-router-af)# neighbor 172.30.1.2 default-originate
PE1(config-router-af)# neighbor 172.30.1.2 as-override
PE1(config-router-af)# neighbor 172.30.1.2 route-map CUST_1_SITE_1_IN in
PE1(config-router-af)# no synchronization
PE1(config-router-af)# exit-address-family
PE1(config-router)#

VRF CUST_1 will export its routes (including 192.168.50.0/24 from Customer 1 Site 1 – CE1-1) and the VRF CUST_2 will import these routes due to the RT of 500:500.

Looking at the BGP and routing table for the CUST_2 VRF shows that the next hop for 192.168.50.0/24 is now the CE1-1 router.

PE1#sh ip route vrf CUST_2 192.168.50.0
Routing entry for 192.168.50.0/24
  Known via "bgp 500", distance 20, metric 0
  Tag 100, type external
  Last update from 172.30.1.2 00:02:45 ago
  Routing Descriptor Blocks:
  * 172.30.1.2 (CUST_1), from 172.30.1.2, 00:02:45 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 100

PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 21
Paths: (2 available, best #1, table CUST_2)
  Advertised to update-groups:
    2

  100, imported path from 500:1:192.168.50.0/24
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1 RT:500:500

  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24

PE1#

There are now two possible paths to reach 192.168.50.0/24 – one imported from the VRF for CUST_1 and one from its own VRF (coming from CE1-2). The path via AS 100 is preferred because it is an external (eBGP) path, which the BGP best-path algorithm favours over the internal path via AS 200. Note the 500:500 RT in this path.

Once this is done, CE2-2 cannot reach the 192.168.50.0/24 subnet on CE1-2.

CE2-2#trace 192.168.50.1 source lo1
Type escape sequence to abort.

Tracing the route to 192.168.50.1
1 172.30.1.9 8 msec 12 msec 12 msec
2 * * *
3 * * *
4 * * *
...output omitted for brevity

Granted, this issue is caused by a mistake, but the difference between the correct and incorrect commands is minimal. An engineer under pressure or working quickly could potentially disrupt a massive MPLS infrastructure resulting in outages for multiple customers.

The search

As mentioned at the beginning of this blog, there are multiple ways to manage an MPLS network.

One possibility is to have a single router that, rather than importing and exporting WAN routes based on RTs, has a single loopback address in each VRF. It is from this loopback that the router will source SSH or telnet sessions to the customer CE devices. For example:

interface loopback 1
 description Loopback source for Customer 1
 ip vrf forwarding CUST_1
 ip address 100.100.100.100 255.255.255.255
!
interface loopback 2
 description Loopback source for Customer 2
 ip vrf forwarding CUST_2
 ip address 100.100.100.100 255.255.255.255

MGMT# telnet 172.30.1.2 /vrf CUST_1

This has a number of advantages:

  • This router acts as a single jump host (rather than a subnet), which could be considered more secure
  • There is no restriction on the WAN addresses for each customer. They can be any WAN range at all and can overlap between customers.
  • The same IP address can be used for each VRF’s loopback (as long as it doesn’t clash with any existing IPs already in the customer’s VRF).

However there are a number of disadvantages:

  • Each VRF must be configured on this jump router
  • This jump router is a single point of failure
  • The command to log on is more complex and requires the user to know the VRF’s exact name rather than just the router IP.
  • Migrating to this solution, from the aforementioned RT import/export solution, would be a cumbersome and long process.
  • Centralised MPLS backups could be complicated if there is not a common subnet (like 172.30.254.0/24) reachable by all CE devices.

For these reasons it was decided not to use this solution. Rather, it was decided to use import filtering, to prevent this issue from taking place even if the misconfiguration occurred. The import filtering uses a route-map that makes the following sequential checks:

    1. If a route has the RT 500:500 and is from the management range (172.30.254.0/24) allow it.
    2. If any other route has the RT 500:500, deny it.
    3. Allow the import of all other routes.

Essentially, rather than just importing 500:500, this route-map checks that a vpnv4 prefix carrying the 500:500 RT actually comes from the management range of 172.30.254.0/24. The biggest issue in this scenario was the deployment of this route-map to all VRFs on all PEs. But with a little bit of scripting (I won’t go into the details here), this was far more practical than the option of deploying a multi-VRF jump router.

The work

The route map described in the above section looks as follows:

ip extcommunity-list standard VRF_MGMT_COMMUNITY permit rt 500:500
ip prefix-list VRF_MGMT_LAN seq 5 permit 172.30.254.0/24 le 32
!
route-map VRF_IMPORT_MAP permit 10
 match ip address prefix-list VRF_MGMT_LAN
 match extcommunity VRF_MGMT_COMMUNITY
!
route-map VRF_IMPORT_MAP deny 20
 match extcommunity VRF_MGMT_COMMUNITY
!
route-map VRF_IMPORT_MAP permit 30

NB. This is a good example of AND/OR behaviour in a route-map. Match statements of different types (in this case a prefix list and an extcommunity list) are combined as a conjunction (AND), while multiple values of the same type are treated as a disjunction (OR).

This will prevent the issue from occurring as it will stop the import of any vpnv4 prefix that has an RT of 500:500 unless it is from the management range.

Here is the configuration of this import map on PE1 (the other PEs are not shown but it should be configured on them too):

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)# ip extcommunity-list standard VRF_MGMT_COMMUNITY permit rt 500:500
PE1(config)#ip prefix-list VRF_MGMT_LAN seq 5 permit 172.30.254.0/24 le 32
PE1(config)#!
PE1(config)#route-map VRF_IMPORT_MAP permit 10
PE1(config-route-map)# match ip address prefix-list VRF_MGMT_LAN
PE1(config-route-map)# match extcommunity VRF_MGMT_COMMUNITY
PE1(config-route-map)#!
PE1(config-route-map)#route-map VRF_IMPORT_MAP deny 20
PE1(config-route-map)# match extcommunity VRF_MGMT_COMMUNITY
PE1(config-route-map)#!
PE1(config-route-map)#route-map VRF_IMPORT_MAP permit 30
PE1(config-route-map)#
PE1(config-route-map)#ip vrf CUST_2
PE1(config-vrf)#import map VRF_IMPORT_MAP

After this addition, in the event that the misconfiguration takes place when creating the CUST_1 VRF, the import map will block the 192.168.50.0/24 subnet. The only path that the CUST_2 VRF has to 192.168.50.0/24 is from CE1-2, which is correct. Here is the configuration and resulting verification:

PE1(config)#ip vrf CUST_1
PE1(config-vrf)# description Customer_1_VRF
PE1(config-vrf)# rd 500:1
PE1(config-vrf)# vpn id 500:1
PE1(config-vrf)# route-target export 500:1
PE1(config-vrf)# route-target import 500:1
PE1(config-vrf)# route-target export 500:500
PE1#sh ip route vrf CUST_2 192.168.50.0
Routing entry for 192.168.50.0/24
  Known via "bgp 500", distance 200, metric 0
  Tag 200, type internal
  Last update from 2.2.2.2 00:22:12 ago
  Routing Descriptor Blocks:
  * 2.2.2.2 (Default-IP-Routing-Table), from 5.5.5.5, 00:22:12 ago
    Route metric is 0, traffic share count is 1
    AS Hops 1
    Route tag 200

PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 12
Paths: (1 available, best #1, table CUST_2)
Advertised to update-groups:
    2
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#
CE2-2#trace 192.168.50.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.50.1

 1 172.30.1.9 12 msec 24 msec 8 msec
 2 10.10.14.4 [AS 500] [MPLS: Labels 18/24 Exp 0] 60 msec 68 msec 64 msec
 3 172.30.1.5 [AS 500] [MPLS: Label 24 Exp 0] 52 msec 68 msec 44 msec
 4 172.30.1.6 [AS 500] 84 msec 56 msec 56 msec

CE2-2#

Management of the correct WAN device is still working as well…

MGMT#telnet 172.30.1.10
Trying 172.30.1.10 ... Open

User Access Verification

Password:
CE2-2>

Just for good measure, and to double check that our route-map is making a difference, let’s see what happens if we remove the import map from the CUST_2 VRF.

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)#ip vrf CUST_2
PE1(config-vrf)#no import map VRF_IMPORT_MAP
PE1(config-vrf)#^Z
PE1#
*Mar 1 00:27:45.259: %SYS-5-CONFIG_I: Configured from console by console
PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 22
Paths: (2 available, best #1, table CUST_2)
Flag: 0x820
  Advertised to update-groups:
    2
  100, imported path from 500:1:192.168.50.0/24
    172.30.1.2 from 172.30.1.2 (192.168.50.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:500:1 RT:500:500
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#

The offending route is imported into the CUST_2 VRF pretty quickly, proving that our route-map works. If the route map is put back in place, and we wait for the BGP Scanner to run (after 30 seconds or less) the vpnv4 prefix is blocked again:

PE1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE1(config)#ip vrf CUST_2
PE1(config-vrf)#import map VRF_IMPORT_MAP
PE1(config-vrf)#^Z
PE1#
*Mar 1 00:29:51.443: %SYS-5-CONFIG_I: Configured from console by console
PE1#sh bgp vpnv4 unicast vrf CUST_2 192.168.50.0
BGP routing table entry for 500:2:192.168.50.0/24, version 24
Paths: (1 available, best #1, table CUST_2)
Flag: 0x820
  Advertised to update-groups:
    2
  200
    2.2.2.2 (metric 20) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:500:2
      Originator: 2.2.2.2, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/24
PE1#

This quirk shows just one way to successfully configure MPLS management and protect against misconfiguration. Give me a shout if anything was unclear or if you have any thoughts. As mentioned earlier, the GNS3 lab is available for download so have a tinker and see what you think.

Site by Site MPLS Breakout Migration

This month’s quirk is a bit late. I have been studying furiously and managed to pass my Deploying Cisco Service Provider Advanced Network Routing exam last week. Only two to go before I get CCNP SP. 🙂

Another plus side is that I have a tonne of study notes that I will be uploading over the next few weeks. So anyone interested in Multicast, BGP or IPv6, watch this space.

Anyways, this quirk looks at a design solution whereby a 100+ site MPLS customer needed to change the Service Provider for their primary internet breakout one site at a time…

 

The quirk     

The customer had an L3VPN MPLS cloud with a new ISP, but still had their primary internet breakout with their old ISP.

The below diagram shows a stripped down version of such a network, illustrating the basic idea:

blog3_image1_base_setup

So whilst all of the MPLS sites connected to the new ISP’s core, the link to the internet was still going out through a site that connected to the old provider.

The customer needed to move the default route and primary breakout over but did not want to do a single “big bang” migration and move all of the sites at once. Rather, they wanted to migrate one site at a time.

The search

The first step in looking at how to accomplish this was to break down the requirements. The following conditions needed to be met:

  • Each site must still be able to access all other sites and the file/application servers at the primary breakout site. These servers would be moved to the new ISP connection and breakout site 2 last of all.
  • As each site moves over to the new breakout, they only need PAT to gain access to the internet – no public services are run at the remote sites.
  • The PI space held by the customer, used for public facing services on the application servers, would be moved to the new provider once all sites were migrated.
  • Sites must be able to be moved one at a time without affecting any other sites.
  • The majority of MPLS sites were single homed with a static default.

Looking at these requirements gave us a good idea of what we needed to achieve.

Policy-based routing was considered first: adjusting either the next hop or the VRF based on the source address. However, this would require too much overhead in identifying each site that had been moved, either by community value or by source prefix, combined with setting the next hop or VRF to use.
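
To give a sense of that overhead, a PBR-based approach would have meant maintaining per-site state like the hypothetical sketch below on each ingress PE, updating the ACL (and the next hop, or a "set vrf" variant for VRF selection) every time a site moved. The addresses and interface are placeholders; this was not the approach taken:

! Hypothetical only
ip access-list standard MIGRATED_SITES
 permit 192.168.1.0 0.0.0.255
!
route-map TO_NEW_BREAKOUT permit 10
 match ip address MIGRATED_SITES
 set ip next-hop 10.20.20.1
!
interface FastEthernet0/0
 description Attachment circuit to a customer site
 ip policy route-map TO_NEW_BREAKOUT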

Ultimately, the use of a second VRF with “all but default” route leaking was decided upon. This involved creating a second VRF with a default route pointing to the new ISP breakout. All routes except the defaults were to be leaked between these VRFs.

This meant that all we needed to do to migrate a site was change the VRF to which the attachment circuit belonged.

It is worth highlighting that, had there been a significant number of multihomed sites running BGP, policy-based routing may have been preferred, because a large number of BGP neighborships would otherwise have needed to be reconfigured into the new VRF.

The work

The below output has been taken from a simulation. The MPLS sites have been represented using loopbacks 1-3 on PE_RTR.

First we will take a look at a traceroute to the internet (to IP 50.50.50.50) and the routing table for the original VRF before any changes were made: 

PE_RTR#sh ip route vrf CUST-A-OLD-ISP

Routing Table: CUST-A-OLD-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 1.1.1.1 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:15:34
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:15:34
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:15:34
C 192.168.1.0/24 is directly connected, Loopback1
C 192.168.2.0/24 is directly connected, Loopback2
C 192.168.3.0/24 is directly connected, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:15:34
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:15:34
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:19
B* 0.0.0.0/0 [200/5] via 1.1.1.1, 00:15:43
PE_RTR#

PE_RTR#trace vrf CUST-A-OLD-ISP 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 17/20 Exp 0] 116 msec 72 msec 48 msec
 2 10.10.10.1 [MPLS: Label 20 Exp 0] 24 msec 44 msec 24 msec
 3 10.10.10.2 20 msec 20 msec 36 msec
 4 192.168.50.1 28 msec 56 msec 24 msec
 5 100.100.100.1 116 msec 52 msec 72 msec
 6 100.111.111.1 64 msec 140 msec 60 msec
PE_RTR#

So the WAN range of the breakout in this simulation is 100.100.100.0/29. This is their PI space. Notice the range 192.168.101.0/24, which is the subnet that the file/application servers are on.

The VRF configuration on the PEs is straightforward.

ip vrf CUST-A-OLD-ISP
 description VRF for Old ISP Breakout
 rd 100:1
 route-target export 100:1
 route-target import 100:1

Before we created the new VRF, we needed a way to differentiate what can and cannot be leaked. For this we used filtering when exporting RTs. We designated the RT 100:100 for routes that should be leaked.

First we started by making a prefix list that catches the default route:

ip prefix-list defaultRoute seq 5 permit 0.0.0.0/0
ip prefix-list defaultRoute seq 50 deny 0.0.0.0/0 le 32

Then we specified a route-map that attaches the RT 100:100 to prefixes that are not the default route:

route-map ALL-EXCEPT-DEFAULT permit 10
 match ip address prefix-list defaultRoute
!
route-map ALL-EXCEPT-DEFAULT permit 20
 set extcommunity rt 100:100 additive

Note the use of the additive keyword so as not to overwrite any existing communities.

Once we had these set up, we created the new VRF and applied this route-map in the form of an export map to set the correct RTs. We made sure to import 100:100 and then applied the same to the original VRF.

ip vrf CUST-A-NEW-ISP
 description VRF for New ISP Breakout
 rd 100:2
 export map ALL-EXCEPT-DEFAULT
 route-target export 100:2
 route-target import 100:100
 route-target import 100:2
!
ip vrf CUST-A-OLD-ISP
 description VRF for Old ISP Breakout
 rd 100:1
 export map ALL-EXCEPT-DEFAULT
 route-target export 100:1
 route-target import 100:100
 route-target import 100:1

From here, after deploying this to all the relevant PEs and injecting a new default route, the migration from one VRF to another was fairly straightforward. Below is an example using a simulated loopback (the principle would be the same for the incoming attachment circuit to a customer site):

PE_RTR(config)#interface Loopback1
PE_RTR(config-if)# ip vrf forwarding CUST-A-NEW-ISP
% Interface Loopback1 IP address 192.168.1.1 removed due to enabling 
VRF CUST-A-NEW-ISP
PE_RTR(config-if)# ip address 192.168.1.1 255.255.255.0

If we look at the routing table for this new vrf we see the following:

PE_RTR#sh ip route vrf CUST-A-NEW-ISP

Routing Table: CUST-A-NEW-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 2.2.2.2 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:16:16
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:16:16
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:16:16
C 192.168.1.0/24 is directly connected, Loopback1
B 192.168.2.0/24 is directly connected, 00:16:17, Loopback2
B 192.168.3.0/24 is directly connected, 00:16:23, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:16:16
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:16:18
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:20
B* 0.0.0.0/0 [200/0] via 2.2.2.2, 00:16:18
PE_RTR#

An interesting side note here is that even though Loopback2 and Loopback3 are directly connected, they are shown as having been learned through BGP. This is the result of the import from the original VRF. Indeed, upon closer inspection of one of the prefixes, we see the 100:100 community:

PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-NEW-ISP 192.168.3.0/24
BGP routing table entry for 100:2:192.168.3.0/24, version 47
Paths: (1 available, best #1, table CUST-A-NEW-ISP)
 Not advertised to any peer
 Local, imported path from 100:1:192.168.3.0/24
 0.0.0.0 from 0.0.0.0 (3.3.3.3)
 Origin incomplete, metric 0, localpref 100, weight 32768, valid, 
external, best
 Extended Community: RT:100:1 RT:100:100
 mpls labels in/out nolabel/aggregate(CUST-A-OLD-ISP)

And looking at the default route we see no such community and a different next hop from the original table.

PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-NEW-ISP 0.0.0.0
BGP routing table entry for 100:2:0.0.0.0/0, version 40
Paths: (1 available, best #1, table CUST-A-NEW-ISP)
 Not advertised to any peer
 65489
 2.2.2.2 (metric 3) from 2.2.2.2 (2.2.2.2)
 Origin incomplete, metric 5, localpref 200, valid, internal, best
 Extended Community: RT:100:2
 mpls labels in/out nolabel/23

The old VRFs table still shows a route for the newly migrated site (although now learned via BGP) and the default route is still as it was originally:

PE_RTR#sh ip route vrf CUST-A-OLD-ISP

Routing Table: CUST-A-OLD-ISP
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
 D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
 N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
 E1 - OSPF external type 1, E2 - OSPF external type 2
 i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
 ia - IS-IS inter area, * - candidate default, U - per-user static route
 o - ODR, P - periodic downloaded static route

Gateway of last resort is 1.1.1.1 to network 0.0.0.0

 10.0.0.0/30 is subnetted, 3 subnets
B 10.10.11.0/30 [200/0] via 2.2.2.2, 00:15:34
B 10.10.10.0/30 [200/0] via 1.1.1.1, 00:15:34
B 10.20.20.0/30 [200/0] via 2.2.2.2, 00:15:34
B 192.168.1.0/24 is directly connected, 00:15:36, Loopback1
C 192.168.2.0/24 is directly connected, Loopback2
C 192.168.3.0/24 is directly connected, Loopback3
B 192.168.50.0/24 [200/0] via 1.1.1.1, 00:15:34
B 192.168.51.0/24 [200/0] via 2.2.2.2, 00:15:34
B 192.168.101.0/24 [200/0] via 1.1.1.1, 00:16:19
B* 0.0.0.0/0 [200/5] via 1.1.1.1, 00:15:43
PE_RTR#
PE_RTR#sh bgp vpnv4 unicast vrf CUST-A-OLD-ISP 0.0.0.0
BGP routing table entry for 100:1:0.0.0.0/0, version 15
Paths: (1 available, best #1, table CUST-A-OLD-ISP)
 Not advertised to any peer
 65489
 1.1.1.1 (metric 3) from 1.1.1.1 (1.1.1.1)
 Origin incomplete, metric 0, localpref 100, valid, internal, best
 Extended Community: RT:100:1
 mpls labels in/out nolabel/26

Finally, a traceroute test shows that the newly migrated site accesses the internet via a different site and can still access the application server subnet:

PE_RTR#trace vrf CUST-A-NEW-ISP 50.50.50.50 source lo1

Type escape sequence to abort.
Tracing the route to 50.50.50.50

 1 10.1.3.2 [MPLS: Labels 16/20 Exp 0] 44 msec 40 msec 52 msec
 2 10.20.20.1 [MPLS: Label 20 Exp 0] 32 msec 36 msec 52 msec
 3 10.20.20.2 52 msec 40 msec 32 msec
 4 192.168.51.1 54 msec 39 msec 31 msec
 5 200.200.200.2 68 msec 60 msec 32 msec
 6 200.222.222.2 65 msec 143 msec 62 msec

PE_RTR#
PE_RTR#trace vrf CUST-A-NEW-ISP 192.168.101.1 source lo1

Type escape sequence to abort.
Tracing the route to 192.168.101.1

 1 10.1.3.2 [MPLS: Labels 16/22 Exp 0] 56 msec 52 msec 44 msec
 2 10.10.10.1 [MPLS: Label 22 Exp 0] 36 msec 24 msec 24 msec
 3 10.10.10.2 40 msec 40 msec 36 msec
 4 192.168.50.1 26 msec 57 msec 23 msec
 5 192.168.101.1 32 msec 48 msec 36 msec
PE_RTR#

One final point to make is that advertising the PI space to both providers for backup purposes was a possibility. AS-path prepending could have been used from breakout site 2 to make that path less preferred. But complications come into play depending on how each provider advertises the PI space and whether they honour any adjustments that the customer makes. Should return traffic not follow the same path, stateful firewall sessions would also encounter difficulty.
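
For reference, the prepend option mentioned above would have looked something like this on the breakout router whose advertisement should be less preferred. The ASNs and neighbour address are placeholders, not values from the lab:

route-map PI_PREPEND_OUT permit 10
 set as-path prepend 65001 65001 65001
!
router bgp 65001
 neighbor 203.0.113.1 remote-as 64999
 neighbor 203.0.113.1 route-map PI_PREPEND_OUT out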

So, a pretty straightforward solution in the end, but interesting from a migration standpoint. I am interested to hear thoughts on whether anyone would have taken a different approach. Perhaps we should have used policy-based routing, or maybe another solution? As usual, thoughts are always welcome.