We often say in IT that it either works or not. That it’s “1” or “0”. In the past few days, I encountered this “once in a lifetime” case, where a weird application problem had to be drilled down to the binary level to be diagnosed and solved. How lucky am I?
Cisco ACI IP fabric
I am a network architect, so I draw cool network plans and eventually supervise the deployment. I’ve drawn this network for a big governmental hosting client, a Cisco ACI IP fabric, the best of the breed, the coolest SDN technology available today. It’s still in deployment, so the Inter-Pod Network (IPN) is not redundant yet.
We had our first client in the last few weeks, who tried to build a VMware NSX farm between two sites (two different pods). However, things did not turn out well. They said they were facing “performance issues” between both sites, and that some REST API calls were taking forever, while others were going through just fine.
Intra-site, all the API calls between the VMware components were fine, and there were no performance issues, even when going through leaf-spine-leaf. But as soon as we tried to query between both sites, some calls were fine, while others either crashed or were very slow to succeed. The following schema shows what was going on:
Performance testing: what iperf doesn’t tell you…
At first, I strongly suspected an application-level problem. We had done all our tests on the fabric, and our management traffic was doing fine. So first things first, performance problems required throughput testing. I spent a whole night writing a reliable, repeatable test procedure for the network with iperf3. In the end, I decided to go with very capable, physical machines, since loading 10Gbps links is really resource intensive.
All my perf tests with iperf3 went fine. I could completely load all the links between our datacenters and I had no packet loss (UDP) or retransmits (TCP). Our WAN links are L2 links directly connected to Nexus 9000 NX-OS switches (IPN), so I assumed that the Ethernet CRC would catch corrupted frames if the WAN links were unreliable.
However, iperf does not tell the whole story. In UDP mode, the server checks how much data was received and if some packets were dropped along the way. In TCP mode, the client shows if it had to retransmit lost or dropped packets. Here is a standard iperf3 test:
What iperf does not check for is integrity. There is no checksum of the payload of the packets, since all the traffic consists of randomly generated bits using the client’s random number generator (which is very resource intensive at 10Gbps and above…). So you can have outstanding performance, yet transmit total garbage: iperf3 could not tell the difference, as long as the packets pass the CRC checks along the way.
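To illustrate what an integrity-aware test adds on top of a throughput test, here is a minimal Python sketch. It is not part of iperf3 and not what we ran; the helper names (`wrap`, `verify`, `flip_bit`) are hypothetical. The sender appends a SHA-256 digest to each payload, and the receiver recomputes it, so a single flipped bit is caught even when the payload is random.

```python
import hashlib
import os

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def wrap(payload: bytes) -> bytes:
    """Append a SHA-256 digest so the receiver can verify the payload."""
    return payload + hashlib.sha256(payload).digest()


def verify(frame: bytes) -> bool:
    """Return True if the trailing digest matches the payload."""
    payload, digest = frame[:-DIGEST_LEN], frame[-DIGEST_LEN:]
    return hashlib.sha256(payload).digest() == digest


def flip_bit(data: bytes, byte_offset: int, bit: int) -> bytes:
    """Simulate in-transit corruption by flipping one bit (test helper)."""
    corrupted = bytearray(data)
    corrupted[byte_offset] ^= 1 << bit
    return bytes(corrupted)


# Random payload, like iperf3 traffic, but with a digest attached.
frame = wrap(os.urandom(1400))
clean_ok = verify(frame)                       # untouched frame
corrupt_ok = verify(flip_bit(frame, 208, 0))   # one bit flipped at byte 208
print(clean_ok, corrupt_ok)
```

The untouched frame verifies and the corrupted one does not, which is exactly the signal a pure throughput test cannot give you.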
Troubleshooting in the dark
Since performance testing was leading me nowhere, because performance was consistently outstanding with iperf3, I had to troubleshoot the API calls themselves. The calls are HTTPS / TLS1.2 GET requests on a web service. The simplest API calls in the world. Between the two network pods, most calls would succeed, but some would fail. It was so weird that the VMware guys asked me if some IPS/IDS was messing with the packets along the way. This gave me my first troubleshooting idea:
Simultaneous tcpdump at both endpoints to make sure the payload sent is exactly the same payload that is received:
We did this packet capture at both ends and in both directions, alternately having the client and server at POD 1 and POD 2. All the curl API calls that were failing were doing so at the same step: SSL handshake. However, the failure from POD 1 to POD 2 was at CLIENT HELLO, while from POD 2 to POD 1, the call failed at SERVER HELLO.
The mystery gets fuzzier…
At first, the packets at both ends seemed identical. All the sent packets were received, the TCP flags were OK, and yet, at some point, the client or the server seemed to simply stop responding, to the point where there were retransmissions, and no reaction at the other end, even though the packet went through:
Here is the normal SSL handshake sequence (omitting a few TCP ACKs for clarity), from the first packet sent, between a client and a server:
As you can see, the first 3 packets, TCP SYN, SYN-ACK and ACK, are part of the TCP connection handshake. These are very small packets that do not contain any payload, only some TCP flags and a few other fields that the sender (client) and the receiver (server) agree on.
The first “big” packets sent are the “CLIENT HELLO”, from client to server, and “SERVER HELLO”, from server to client.
Calling from POD 1 to POD 2, when it failed, the “SERVER HELLO” was OK, but then the CLIENT KEY EXCHANGE / CHANGE CIPHER SPEC would keep retransmitting for 30 seconds, and it would fail and start over (as in the capture shown above). Usually, the next call would succeed, which could look like a performance issue. It succeeded, but only after 30 seconds, instead of less than a second.
Calling from POD 2 to POD 1, when it failed, the “CLIENT HELLO” was fine, but the SERVER HELLO was failing. In the packet capture, I got an SSL “MALFORMED PACKET” right away, the client detecting that something was wrong with the SSL Handshake packet.
Comparing end-to-end packet captures for faulty traffic
I started comparing end-to-end captures, byte for byte, bit for bit, and then my colleague Christian had this great idea: why not compare the hex values of the same packet in Notepad++? So I collected a successful call and a failing call, and got this VERY weird pattern on failing calls.
On the left, you have the original packet that was transmitted by the server in POD 2, and on the right, the received packet on the client in POD 1 (extracts of them in fact, with no privileged info 😉 ) This is the SERVER HELLO, the first big packet going from the server to the client:
What you see only partially here is the very regular bit flipping pattern. There was a bit flip almost every 208 bytes or so, all along the roughly 13 KB server hello packet.
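The same comparison can be scripted instead of eyeballed in an editor. The following Python sketch is an illustration rather than the tool we used: it XORs the sent and received bytes, lists every flipped bit, and prints the spacing between flips. The sample data is synthetic (an all-zero buffer with one bit forced to 1 every 208 bytes), not taken from the real capture, and the function names are hypothetical.

```python
def bit_flips(sent: bytes, received: bytes):
    """List (byte_offset, bit_index, sent_bit, received_bit) for each differing bit."""
    flips = []
    for offset, (a, b) in enumerate(zip(sent, received)):
        diff = a ^ b  # set bits mark positions where the two bytes disagree
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit, (a >> bit) & 1, (b >> bit) & 1))
    return flips


def spacings(flips):
    """Distance in bytes between consecutive flipped positions."""
    offsets = [f[0] for f in flips]
    return [later - earlier for earlier, later in zip(offsets, offsets[1:])]


# Synthetic example: force bit 4 to 1 every 208 bytes, starting at byte 100.
sent = bytes(1024)
damaged = bytearray(sent)
for offset in (100, 308, 516, 724):
    damaged[offset] |= 0x10
received = bytes(damaged)

flips = bit_flips(sent, received)
gaps = spacings(flips)
print(flips)
print(gaps)
```

A constant gap, and a flip that always goes from 0 to 1 on the same bit index, is the kind of regularity that points to hardware rather than to software or an IPS.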
Googling “Nexus 9000” and “208 bytes” turned up some very useful info. I discovered that Cisco uses an assembly of memory cells of exactly 208 bytes on its Nexus 9500 Fabric modules and Nexus 9300 ASICs. This could be a single defective memory bit in those cells that always turns a single bit to 1, while it should be 0.
Finding the faulty switch
The calls were failing randomly because of a superb feature in the IP fabric: Equal Cost Multi Path (ECMP). All the traffic is routed from one switch to the other in the IP fabric, and to be able to use all the paths simultaneously for added bandwidth, ECMP is used. It balances traffic over multiple equal cost links. So from Leaf 1 to Leaf 2, there are 2 paths available with a cost of 2. One through Spine 1, and one through Spine 2. The same is true from any leaf to the IPN.
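To see why ECMP makes a single bad path look like a random failure, here is a simplified Python sketch of per-flow path selection. The real hash used by the switches is not described here, so this is an assumption: SHA-256 over the 5-tuple stands in for it, and the IP addresses and the `pick_path` function are hypothetical. The point is only that each new connection gets a new ephemeral source port, so it may land on a different spine.

```python
import hashlib


def pick_path(src_ip, dst_ip, src_port, dst_port, proto, paths):
    """Choose one equal-cost path from a hash of the flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]


paths = ["SPINE 1", "SPINE 2"]

# Same client and server, but a different ephemeral source port per call.
per_call = {
    port: pick_path("10.1.0.10", "10.2.0.20", port, 443, "tcp", paths)
    for port in range(49152, 49152 + 64)
}
chosen = set(per_call.values())

# A given flow always hashes to the same path, so retransmits follow it too.
first = pick_path("10.1.0.10", "10.2.0.20", 49152, 443, "tcp", paths)
again = pick_path("10.1.0.10", "10.2.0.20", 49152, 443, "tcp", paths)
print(chosen, first == again)
```

Under this model, a call that hashes onto the healthy spine works every time, while a call that hashes onto the faulty one keeps retransmitting over the same bad path until it times out and a new connection draws a new port.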
I knew that from one leaf to the other, there were no problems. I also tested directly between IPNs, and all the connections were fine because I had been transferring stuff from IPN to IPN over the WAN for weeks with no problem. This pointed me to the IPN – SPINE links.
Trying to isolate the faulty path, we started shutting the IPN – SPINE links one at a time in each POD. The behavior was always the same, until we shut the link between the IPN and SPINE 2 in POD 2. This forced the traffic to avoid SPINE 2 and to go through SPINE 1, whichever leaf the traffic had to reach.
Then, magic happened: the problem was no longer present. Avoiding SPINE 2 in POD 2, all the calls would go through, and no “bit flip” happened on the way!
We sent all this info to our Cisco architects and TAC. They found that this exact “bit flipping” issue was encountered by another client in December 2021, and an updated ACI software version was released with code to test for these “bit flipping” issues.
Our technical guys proceeded to install the updated software on all our ACI spines and leafs, and the problem was instantly detected by the software self-test: a fabric module was faulty in SPINE 2 of POD 2. For the moment, it has been shut down, and it will be replaced in the coming weeks.
This is the type of hard-to-solve case that I will still be talking about 20 years from now. The kind of case I will never encounter again in my career. These hardware problems are some of the most challenging and tricky to solve, because we usually think our hardware can self-diagnose through software. But what happens when a sub-component of a sub-component fails? This little damn intermittent glitch? Being methodical in the troubleshooting is essential, and remember this one thing: packets never lie.