Troubleshooting VLAN and Switch Problems
Let’s talk about the essentials of troubleshooting VLAN and switch problems. In this post, we’ll discuss common general switch issues, VLAN-related issues, and spanning-tree issues. We’ll also cover VLAN/switch troubleshooting techniques. Later, in Part 2, I’ll look further into a “No connectivity” issue.
Common General Switch Issues
Keep in mind that some problems can happen on just about any switch. One example is a physical or connectivity-related issue.
Physical Interface/Connectivity Issues
- Interface is down/down – This means it’s not receiving keepalives and it’s not physically connected
- Interface is up/down – Meaning, it’s physically up but the Layer 2 protocol is down
- Interface is administratively down
- Check your cabling. Always start by assuming the problem is with the cable, and swap in a known good one. In some instances, you may have to substitute a crossover cable: if the port doesn’t support auto-MDIX, the crossover has to be handled manually. You can also verify that the hardware is functional. Use the show controllers command to see if there’s something physically wrong with the port, or try a different port on the switch to see if the same problem occurs.
- Check your interface. Verify that the interface is operational and, if it isn’t, use the no shutdown command. That takes care of “administratively down” cases. A port that one of the Layer 2 protocols has put into the error-disabled state needs a shutdown followed by a no shutdown.
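As a rough illustration, the interface states listed above can be mapped to a first troubleshooting step. This is a minimal Python sketch: the function name and the hint strings are assumptions made for this example, not output from any Cisco tool.

```python
def diagnose_interface(line_status: str, protocol_status: str) -> str:
    """Map an interface's line/protocol status pair to a first thing to check."""
    line = line_status.strip().lower()
    proto = protocol_status.strip().lower()

    if line == "administratively down":
        return "shut down in the configuration: issue 'no shutdown'"
    if line == "down" and proto == "down":
        return "no physical connection: check cabling, port, and hardware"
    if line == "up" and proto == "down":
        return "physically up, Layer 2 protocol down: check Layer 2 settings"
    if line == "up" and proto == "up":
        return "interface operational"
    return "unrecognized state: inspect the interface manually"


# Hypothetical status values, as they would be read off the interface status line.
print(diagnose_interface("down", "down"))
print(diagnose_interface("up", "down"))
print(diagnose_interface("administratively down", "down"))
```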
Physical Interface Speed/Duplex Issues
Other problems that can happen frequently across two interfaces are speed and duplex issues or mismatches. This can be particularly true if you have a gigabit connection on one side and a 10/100 on the other.
- You’ll see a syslog message that says %CDP-4-DUPLEX_MISMATCH. That’s going to tell you that there’s a duplex mismatch.
- If you have something hard-coded on one side and auto on the other, or you have them hard-coded on both sides but they’re done differently, it’s not going to be able to auto-sense anything, so you can have a speed and duplex mismatch as a result.
- Set the speed and duplex settings to autonegotiate on both ends.
- Manually configure speed and duplex settings on both ends so that they’re the same (for example, if one device has issues with autonegotiation).
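The rule behind those two fixes is that both ends should either autonegotiate or be hard-coded identically. Here is a minimal Python sketch of that comparison; the LinkEnd type and the setting values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEnd:
    speed: str   # "auto", "10", "100", or "1000"
    duplex: str  # "auto", "half", or "full"


def mismatch_risks(a: LinkEnd, b: LinkEnd) -> list:
    """List the speed/duplex settings that are not configured the same way on both ends."""
    risks = []
    for name in ("speed", "duplex"):
        x, y = getattr(a, name), getattr(b, name)
        if x == y:
            continue  # both auto, or both hard-coded to the same value
        if "auto" in (x, y):
            fixed = y if x == "auto" else x
            risks.append(f"{name}: one end is auto, the other is hard-coded to {fixed}")
        else:
            risks.append(f"{name}: hard-coded differently ({x} vs {y})")
    return risks


# One side left on auto, the other hard-coded: both settings are flagged.
for risk in mismatch_risks(LinkEnd("auto", "auto"), LinkEnd("100", "full")):
    print(risk)
```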
Common VLAN Related Issues
- You notice interface flapping on a port set for access-only mode.
- Execute a show running-config command. Examine the output and verify whether the following entries are on the port that’s affected:
- switchport mode access, and
- switchport access vlan
If something’s missing from that, add what you need. Automated trunk-negotiation mechanisms such as DTP can create this type of issue if the port isn’t explicitly set to access mode and assigned to a specific VLAN.
Another reason a VLAN could be down is that there’s no physical port associated with that particular VLAN. On a Layer 3 switch, this doesn’t tend to be as big an issue. On Layer 2 switches, it can be.
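If you have the interface section of the running configuration saved as text, the check for those two entries can be scripted. The following is a small Python sketch under that assumption; the interface name and VLAN number in the sample are made up for the example.

```python
import re


def missing_access_settings(interface_config: str) -> list:
    """Return the access-port entries that are absent from one interface stanza."""
    lines = [line.strip() for line in interface_config.splitlines()]
    missing = []
    if "switchport mode access" not in lines:
        missing.append("switchport mode access")
    if not any(re.fullmatch(r"switchport access vlan \d+", line) for line in lines):
        missing.append("switchport access vlan <id>")
    return missing


# Hypothetical stanza: the VLAN is assigned, but the mode was never set.
sample = """interface FastEthernet0/5
 switchport access vlan 11
 spanning-tree portfast
"""
print(missing_access_settings(sample))
```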
- VLAN is created on the switch but in a down state.
- Execute the show vlan command. If it shows “down,” make sure there’s at least one port that’s identified as part of the specified VLAN, or a switch virtual interface in that VLAN.
VLAN Trunking Issues
- You’ve connected the cables but a trunk is still not establishing across the configured link.
- If you’re using ISL trunking, make sure the switch on the other side supports ISL. If it doesn’t, change to an encapsulation that both sides support, such as 802.1Q.
- If you’re using 802.1Q trunking, you may have different native VLANs configured on either side. If that’s the case, change the native VLANs to match.
- Verify the trunking settings on both ends of the link are the same (e.g., DTP mode, encapsulation, etc.).
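The checks above amount to comparing the two ends of the link setting by setting. Below is a minimal Python sketch of that comparison. The dictionary keys are assumptions for this example, and the DTP rule it encodes (no trunk forms if either end is an access port, or if both ends are dynamic auto) reflects commonly documented Catalyst behavior rather than anything stated in this post.

```python
def trunk_problems(local: dict, remote: dict) -> list:
    """Compare the trunk settings of two link ends and list what would block a trunk."""
    problems = []
    modes = {local["mode"], remote["mode"]}
    if "access" in modes:
        problems.append("at least one end is in access mode")
    elif modes == {"dynamic auto"}:
        problems.append("both ends are dynamic auto, so neither side initiates the trunk")
    if local["encapsulation"] != remote["encapsulation"]:
        problems.append(
            f"encapsulation mismatch: {local['encapsulation']} vs {remote['encapsulation']}"
        )
    if local["native_vlan"] != remote["native_vlan"]:
        problems.append(
            f"native VLAN mismatch: {local['native_vlan']} vs {remote['native_vlan']}"
        )
    return problems


# Hypothetical settings for the two ends of one link.
local_end = {"mode": "dynamic auto", "encapsulation": "dot1q", "native_vlan": 1}
remote_end = {"mode": "dynamic auto", "encapsulation": "dot1q", "native_vlan": 99}
for problem in trunk_problems(local_end, remote_end):
    print(problem)
```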
VLAN Trunking Protocol (VTP) Issues
- VLANs are not propagating from servers to clients the way they should be.
- The first thing you need to make sure is that the links on both sides, between the client and the server, are configured as trunks and that their trunking types match.
- Verify that the VTP domains match and adjust if necessary.
- Verify that the switch you intend to serve as master is no longer in transparent mode or client mode. Make sure it’s in server mode and that the other switch is in client mode.
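The domain and mode checks can be sketched as a simple comparison. Note that VTP domain names are case-sensitive, so the comparison has to be exact. This Python sketch assumes each switch's VTP settings have already been collected into a dictionary; the keys and the sample values are hypothetical.

```python
def vtp_problems(server: dict, client: dict) -> list:
    """List reasons VLANs may not propagate from the intended server to a client."""
    problems = []
    if server["domain"] != client["domain"]:  # exact, case-sensitive comparison
        if server["domain"].lower() == client["domain"].lower():
            problems.append("domain names differ only by letter case")
        else:
            problems.append("domain names differ")
    if server["mode"] != "server":
        problems.append(f"intended server is in {server['mode']} mode")
    if client["mode"] != "client":
        problems.append(f"intended client is in {client['mode']} mode")
    return problems


# Hypothetical settings: same domain name apart from capitalization.
intended_server = {"domain": "CCNP-TSHOOT", "mode": "server"}
intended_client = {"domain": "CCNP-Tshoot", "mode": "client"}
print(vtp_problems(intended_server, intended_client))
```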
Inter-VLAN Routing Issues
- VLANs cannot reach one another. For instance, in the figure above, VLAN 1 and VLAN 11 cannot connect.
- If you’re using an external router, first make sure that the router is reachable. Going back to our figure, if the workstation on VLAN 1 can’t reach the VLAN 1 interface on Router 1, there may be a connectivity issue or misconfiguration issue. If you’re having some other issue, you may have to troubleshoot routing. But if the VLAN 1 workstation can reach Router 1’s VLAN interface and the VLAN 11 workstation can do the same with its own VLAN interface on Router 1, then there may be something in the router you need to look at.
- If you’re using a Layer 3 Route Processor, make sure that the Switch Virtual Interfaces (SVIs) have been configured with the correct VLAN ID and IP subnet information.
- Verify that a default gateway exists on the switch.
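One quick sanity check on the IP subnet information is to confirm that a host and its gateway interface actually sit in the same subnet. Here is a minimal Python sketch using the standard ipaddress module; the host addresses and the /24 prefix length are assumptions for illustration.

```python
import ipaddress


def same_subnet(host: str, gateway: str) -> bool:
    """True if host and gateway (each given as 'address/prefix') share an IP network."""
    return ipaddress.ip_interface(host).network == ipaddress.ip_interface(gateway).network


# A VLAN 11 host checked against a matching and a non-matching gateway address.
print(same_subnet("192.168.11.25/24", "192.168.11.1/24"))
print(same_subnet("192.168.11.25/24", "192.168.1.1/24"))
```

If the second case describes your configuration, the host's default gateway points at an interface it cannot reach at Layer 2, which shows up as VLANs that cannot reach one another.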
Common Spanning Tree Issues
802.1D Spanning Tree Issues
- A port has gone into an error-disabled state or has become non-functional after a configuration event.
- If you’re using PortFast with any of the guard features enabled (such as BPDU Guard), make sure no other device is sending bridge protocol data units (BPDUs) to that port.
- Make sure no unidirectional (one-way) links exist.
- In a worst-case scenario, just issue a shutdown followed by a no shutdown to reset that port.
Another spanning-tree issue involves Etherchannel.
- Etherchannel is not forming a Port-Channel between configured links.
- Make sure the Etherchannel parameters match at both ends. The bundled interfaces have to be the same type on a given switch (e.g., FastEthernet, Gigabit Ethernet, etc.). You can have FastEthernet on one switch going into Gigabit on the other, but if you bundle a FastEthernet and a Gigabit Ethernet interface on the same switch to go to the other switch, it’s not going to work.
- Verify that the same protocol has been configured on all ports (e.g., PAgP, LACP, etc.). Make sure that they’re the same on both ends.
- Make sure you use identical trunking configurations, including native VLANs, when using 802.1Q.
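The first two checks can be expressed as a consistency test over the member ports. This Python sketch assumes each port is described by a small dictionary; the field names and sample ports are hypothetical.

```python
def etherchannel_problems(local_ports: list, remote_ports: list) -> list:
    """List inconsistencies that would keep a Port-Channel from forming."""
    problems = []
    # Within one switch, every bundled interface must be the same type.
    for side, ports in (("local", local_ports), ("remote", remote_ports)):
        if len({port["type"] for port in ports}) > 1:
            problems.append(f"{side} switch bundles mixed interface types")
    # Across both switches, every port must run the same negotiation protocol.
    protocols = {port["protocol"] for port in local_ports + remote_ports}
    if len(protocols) > 1:
        problems.append(
            "ports do not all use the same protocol: " + ", ".join(sorted(protocols))
        )
    return problems


# Hypothetical bundle: mixed port types locally, and a different protocol remotely.
local = [
    {"type": "FastEthernet", "protocol": "LACP"},
    {"type": "GigabitEthernet", "protocol": "LACP"},
]
remote = [
    {"type": "GigabitEthernet", "protocol": "PAgP"},
    {"type": "GigabitEthernet", "protocol": "PAgP"},
]
for problem in etherchannel_problems(local, remote):
    print(problem)
```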
Troubleshooting VLAN/Switch Problems
Now that we’ve covered some common problems, here are some basic ideas on how to troubleshoot switches and VLANs.
- Always start with the Physical Layer. Confirm that the interface is Up/Up. Verify that the cabling is operational. People often spend a lot of time troubleshooting other things, only to realize the problem is just the cable.
- Use the Cisco Discovery Protocol to verify Layer 2 connectivity. If you have it turned off, turn it on just for testing purposes. Execute the show cdp neighbors command and verify whether the device names you’re expecting to see and the types on both ends of the links are actually there.
- If there are no neighbors being shown and you think you have everything configured the way it should be, then you may have a Layer 2 issue of some kind. In that case, you’ll be able to isolate the problem to a specific layer in the OSI model.
- Look at your ARP mappings. Use the show arp command on both devices and watch for entries listing incorrect MAC addresses or a hardware address of Incomplete. If it’s incomplete, you may have some other kind of issue. Also, to verify ARP mappings, issue a ping command to the IP address on the opposite end of the link. If the ping fails or the ARP entries appear incorrect, examine the possible causes.
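If you capture the show arp output as text, spotting incomplete entries can be automated. The Python sketch below assumes the usual column order (Protocol, Address, Age, Hardware Addr, Type, Interface); the sample output is invented for illustration and should be checked against what your own device prints.

```python
def incomplete_arp_entries(show_arp_output: str) -> list:
    """Return the IP addresses whose hardware address column reads 'Incomplete'."""
    incomplete = []
    for line in show_arp_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol name; the header row does not match.
        if len(fields) >= 4 and fields[0] == "Internet" and fields[3].lower() == "incomplete":
            incomplete.append(fields[1])
    return incomplete


# Illustrative capture only: the MAC address and ages are made up.
sample_output = """Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.168.1.1             12  0011.2233.4455  ARPA   FastEthernet0/0
Internet  192.168.11.1             0  Incomplete      ARPA
"""
print(incomplete_arp_entries(sample_output))
```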
VLAN/Switch Lab Troubleshooting Exercises
Now it’s time to look at how this actually works in a simulated environment. We’re going to start by giving you some general background on a situation that could actually exist. Three Trouble Tickets will be involved here. You’ll get them from the system and use them for troubleshooting and resolution purposes.
The three Trouble Tickets will be: Internet is Down, No Connectivity, and Network is Slow.
As we walk you through each step of the simulated troubleshooting process, we’ll present it in a way as if you’re the one doing the troubleshooting and that you’re doing it the way an expert would.
Here’s the basic layout. Let’s call it our Site 1 Topology:
It consists of a large campus with 300 employees spread across three separate buildings. The Internet connectivity is across the WAN. In other words, this campus environment is getting Internet access from another location.
There are two routers that provide redundancy both to the WAN and the Internet. See routers R1-1 and R1-2? Those two connect to the Wide Area Network.
Now here’s the situation.
Building 3, which is being serviced by R1-3, has been experiencing a number of service outages. Your role as the Tier 1 help desk technician on duty is to receive the trouble ticket, diagnose the issue, and ultimately resolve it.
Trouble Ticket: Internet is Down
You arrive at work to find a high-priority trouble ticket assigned to you, and it says the Internet is down. The problem has been going on for over an hour without any resolution. After some investigation, you discover that someone on the network team has made an undocumented configuration change.
Your task is to pick up the ticket, assign it to yourself, contact the requestor and inform that person that you are now actively working on the problem, and then of course proceed with troubleshooting and resolution.
Here’s what greets you the moment you arrive at work:
Now, while these messages may sound really harsh (see the last one), it’s normal for tensions to run high when something isn’t working and a person’s job depends on it. So even if you don’t particularly like the way this person’s talking to you, you have to take all that into account.
Note in the upper-right corner of that last screenshot that the Status is Open and the Priority is High. The first thing you do is send the person a message assuring him/her that you are already working on the issue. After that, you proceed to your troubleshooting activity.
To begin troubleshooting, you bring up your console. Because R1-3 is the one experiencing problems, you right-click on it and select Telnet/SSH to device.
First, you check for connectivity. Since you got a Trouble Ticket from the manager indicating that although the Internet’s down, everything else seems to be working at least locally, you assume that the workstations are still able to reach you.
You proceed by issuing the command:
The first one, enclosed in a box marked #1, is something that would have required some deeper inspection. However, it’s not being used, so you skip it.
The second one (marked #2), on the other hand, is a bunch of LAN interfaces, and they’re Up. That means they’re working the way they should be. In other words, the Physical Layer is working.
Next, you execute the show interfaces command and see if everything’s working as expected. In the screenshot below, FastEthernet is showing Up/Up. That’s a good sign.
While you’re doing all this, you’re following a plan. Here’s the plan you drew up and filled out for this particular troubleshooting activity:
Next, you do show cdp neighbors.
Switch 1-3 (SW1-3) is the upstream switch, so you know that is functional. At this point, you think of ruling out both Layer 1 and Layer 2.
Next, you conduct some ping tests on VLAN1 (the Management VLAN) and VLAN11 (the Production VLAN).
Everything looks fine on the Management VLAN:
However, on the Production VLAN, you experience some problems:
You want to find out whether the upstream switch can be pinged, so you try to obtain the IP addresses by executing the show cdp neighbors detail command.
It’s not listing an IP address here, so you try pinging the switches.
Unlike Switch 1 and Switch 2, which are doing fine, Switch 3 is experiencing connectivity problems.
You try pinging the Internet, and still you can’t get outside on VLAN11. That can be the reason why the Internet is down.
So you’ve got successful connectivity on VLAN 1 to Router 1-1 and everything in between. However, you can’t get on VLAN11.
Another thing you consider looking into is routing. To check routing, you execute the command:
Seeing signs that you may have a routing problem, you investigate further by executing the show ip eigrp interfaces command.
It reveals that you have zero peers even though you can get out on your VLAN1, which is the Management VLAN. The Production VLAN isn’t getting any routing. At this point, you cannot be sure but, judging from the way things are working, it would be logical to suspect a switch related problem and that the problem is not on this router.
When you do a show cdp neighbors, you see that the next upstream is Switch1-3, so you take a look at that next.
You again execute show cdp neighbors. That output includes Router 1-3 as well as an Etherchannel (to Switch 1-2) across two interfaces, so you know that you have Layer 2 connectivity.
Next, you execute show interfaces trunk. You notice that the native VLAN settings match on both the link back to the router (Fa0/1) and the port channel (Po4) up to the next upstream switch, SW1-2. Everything appears to be in order here.
After that, you issue the show spanning-tree vlan 11 command. There you see your root port (Po4) and your designated port (Fa0/1).
So far, everything here appears to be functional, but because you want to make sure that all the necessary configurations have been carried out, you do a show vlan. The results show that both VLAN 1 and VLAN 11 have really been configured.
You then execute the command: show vtp status
It shows that the configuration has been successfully sent, the domain is correct, it’s operating in client mode, and there are 7 existing VLANs.
At this point, you eliminate Switch 1-3 from your list of possible culprits and proceed to Switch 1-2.
You try executing a show ip interface brief command. Everything looks good there.
Then you try show cdp neighbors. Same story there.
You also try a show spanning-tree vlan 11.
Still, you see that everything’s functioning the way it’s supposed to.
To make sure the vlans are there, you issue the show vlan command.
VLAN1 and VLAN11, which are the ones that are critical, are there.
Next, you do a show vtp status.
Again, the information shown tells you that everything should be working properly, but that’s before you take a much closer look. Closer inspection reveals that some of the letters of the VTP Domain Name are in lower case.
That may not sound like a big deal, but VTP domain names are case-sensitive, so to this switch it’s a different domain. Now you have what looks like a potential issue. Since everything else is working, you certainly would like to eliminate every possible cause, negligible as it may seem.
Having found a potential issue, you now investigate further in that particular direction. You remember to make only one change at a time, knowing full well that if you make multiple changes simultaneously, you run the risk of not knowing which one actually worked.
The next thing you do is issue the configure terminal command, followed by vtp domain CCNP-TSHOOT.
You then go back to your Router 1-3 and ping 192.168.1.1, which was successful earlier, and 192.168.11.1, which wasn’t. Now, you find them both reachable.
You issue configure terminal here and then execute logging on (just in case the logging got turned off), followed by show ip route.
Next, you do a show ip eigrp neighbors. Surprisingly, you still don’t see any neighbors even though you already have connectivity back up.
So you follow that with a show running-config to see if something’s out of order.
After scrolling down the results, you notice one particular interface where authentication for EIGRP has been put in place in error.
To take that out, you execute:
no ip authentication mode eigrp 100 md5
After that, things start coming back up.
You try show ip eigrp neighbors one more time. This time, you’re shown the three you were expecting.
You try pinging the Internet. It’s now back up as well.
At this point, you do a little analysis and put together the information you’ve been able to gather so far.
- The fault was identified on Device SW1-2.
- The fault was Layer 2 (Data Link Layer) in nature, specifically VLAN Trunking Protocol.
- More specifically, the fault was due to a VTP domain name mistyping (a human error).
- It was resolved by executing the vtp domain CCNP-TSHOOT command, with CCNP-TSHOOT all in capital letters.
Since the problem has been resolved, you go back to the trouble ticket sent by the requestor, change the status to resolved, and put in necessary notes.
When you go back to the Home tab, you now see the number of Requests Overdue is already down to two.
Your day has just started and you still have two more trouble tickets to resolve. I’ll go over those in Part 2 of this post.