iPXE troubleshooting session #2 debrief

Just finished up a two day troubleshooting session in Cambridge. Awesome as always! So which issues were we aiming to sort out this time?

  1. NTLM only using local names and not being able to use DOMAIN for authentication.
  2. iPXE issues with USB devices – causing random lockups/resets of the machine.
  3. Issues with USB native driver dongles with the XHCI controller in play.
  4. Issues with iPXE when using HTTPS + Client Cert + BranchCache extension.
  5. Issues where Windows 10/2016 Hyper-V (later than 1607) and WinPE (later than 1703) booted from iPXE would BSOD at seemingly random times.

NTLM Using Domain Names

The first issue was well known and easy to fix. At implementation time we forgot to add a way to feed the DOMAIN part of the username:password structure from the URL into the NTLM authentication code. iPXE URL strings can now be constructed as follows:
http://[DOMAIN\]USERNAME:PASSWORD@SERVER.FQDN, using %5C for the backslash. So a valid request would look like the following:

initrd http://2PSTEST%5Cadministrator:mypassword@rig10c20.2pstest.local/myfile.bin
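
The percent-encoding is easy to get wrong by hand, so here is a minimal Python sketch that builds such a URL. The helper name `ntlm_url` is hypothetical, and running the user name and password through `urllib.parse.quote` is our assumption about a convenient way to generate the string, not something iPXE itself provides.

```python
from urllib.parse import quote

def ntlm_url(domain, username, password, host, path):
    """Build an http:// URL with DOMAIN\\USERNAME percent-encoded (hypothetical helper)."""
    # quote() with safe="" turns the backslash into %5C and also encodes
    # characters such as ":", "@" and "/" that would break URL parsing.
    account = f"{domain}\\{username}" if domain else username
    user = quote(account, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}/{path.lstrip('/')}"

print("initrd " + ntlm_url("2PSTEST", "administrator", "mypassword",
                           "rig10c20.2pstest.local", "myfile.bin"))
```

Run as-is, this prints the same initrd line as the example above. Leaving the domain empty produces a plain USERNAME:PASSWORD URL for local accounts.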

This change has been pushed and will now be part of our upcoming iPXE Anywhere 2.6 release. This is fixed in commit: https://git.ipxe.org/ipxe.git/commit/6737a8795f20c21bb48d410c2d9266f8c9c11bbc

USB Issues

There were numerous issues with USB, one big one being seemingly random reboots. To do the testing we had a spectacularly evil Intel NUC with an Atom CPU and some pretty royally crappy firmware. The first issue was that we couldn’t actually hand over the USB keyboard from the UEFI shell to iPXE. So we used a script to see if the system survived a download of a file. It did not. Troubleshooting without a keyboard is hard, so we had to get a keyboard working, but how?
To get around that, we loaded a new keyboard driver from the UEFI shell, and then loaded the iPXE binary, causing it to use the working keyboard driver. Hurrah. TGB! (This doesn’t mean “To Great Britain”; it is short for “TanGentBord”, which means keyboard in Swedish.)

Now having a keyboard, we tried to access the same file again using our USB dongle NIC, which caused iPXE to lock up and die. Something in the USB stack was not liking us. So what was the issue? After playing around, putting debug messages at key points in the code, we found that the USB stack sent the same message twice, thinking that the message was still on the bus when it wasn’t. This can result in duplicate completions and consequent corruption of the submission TRB ring structures.

With this fixed, we still had lockups: the keyboard and/or other EFI components were calling into the iPXE codebase from timers firing at random times. We knew we had to do something about that.

Going Deep

So EFI has no concept of interrupts; instead it works on timers. This is one of the reasons why its network stack is so slow: it depends on timers to know when a packet has been received or sent. Problem is, timers can fire at any time, requesting a callback into the iPXE code. So you get the worst of both worlds: no interrupts (inefficiency and inability to sleep until something interesting happens), and callbacks from timers requesting attention at any given time. Lovely. So by raising the TPL (task priority level; lovely documentation on that BTW… NOT) to CALLBACK level, we hold off the timers calling into iPXE, giving us time to finish whatever we are doing at that point.
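
To make the TPL idea concrete, here is a toy Python model of the behaviour, not the iPXE code and not real firmware. The numeric TPL values are the ones defined in the UEFI specification, where the real calls are the boot services RaiseTPL and RestoreTPL; the `ToyFirmware` class and its method names are hypothetical and exist only for illustration.

```python
# Task priority levels as defined by the UEFI specification.
TPL_APPLICATION, TPL_CALLBACK, TPL_NOTIFY, TPL_HIGH_LEVEL = 4, 8, 16, 31

class ToyFirmware:
    """Toy model of UEFI event dispatch (hypothetical, for illustration only)."""

    def __init__(self):
        self.tpl = TPL_APPLICATION
        self.pending = []   # (notify_tpl, callback) pairs waiting to run
        self.log = []

    def signal(self, notify_tpl, callback):
        """A timer fired: queue its callback and run it if the TPL allows."""
        self.pending.append((notify_tpl, callback))
        self._dispatch()

    def raise_tpl(self, new_tpl):
        """Raise the current TPL and return the previous one."""
        old, self.tpl = self.tpl, new_tpl
        return old

    def restore_tpl(self, old_tpl):
        """Drop back to the previous TPL; deferred callbacks run now."""
        self.tpl = old_tpl
        self._dispatch()

    def _dispatch(self):
        # Only callbacks whose notify TPL is above the current TPL may run.
        runnable = [p for p in self.pending if p[0] > self.tpl]
        self.pending = [p for p in self.pending if p[0] <= self.tpl]
        for _notify_tpl, callback in runnable:
            callback()

fw = ToyFirmware()
old = fw.raise_tpl(TPL_CALLBACK)                          # start of critical work
fw.signal(TPL_CALLBACK, lambda: fw.log.append("timer"))   # timer fires mid-operation
fw.log.append("usb work done")
fw.restore_tpl(old)                                       # deferred callback runs here
print(fw.log)
```

In this model the timer callback is held back until the work has finished, so the log reads "usb work done" first and "timer" second. Callbacks registered at a higher level such as TPL_NOTIFY would still get through, which is why the level chosen matters.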

So with these fixes in place, we can use the USB keyboard on ECM+NCM builds of iPXE and download data from a USB dongle just fine, even with a USB keyboard plugged in.

So why is this important to our customers?

In the scenario of using USB dongles, you can now use a USB dongle with iPXE, boot from USB media (from a USB/NIC hub combo, presumably?) on any type of hardware, using the native USB dongle interface (fast!). This means that for your laptop and tablet armada, you can use the same dongle setup for any type of hardware. Gone is the requirement to use dedicated dongles for certain types of hardware for PXE booting. Long gone is the slowness of USB dongles using SNP drivers. So use one single dongle setup for everything, epic!

These are the commit messages:

The tough one – Download Lockups with HTTPS and BranchCache

This was an odd one. We were seeing lockups in the following scenario, and only in this scenario:

  1. Customer using the upcoming HTTPS capabilities of iPXE Anywhere 2.6
  2. Customer was using BranchCache to enable P2P download of the boot.wim
  3. Customer had precached no content, or only partial content, to peers

OK, so what was going on here? When switching to HTTP in the above scenario, it all just worked. This was puzzling and led us to look at the memory usage of having a large number of HTTPS connections back to the server. See, by default we run 32 (PeerMux) simultaneous block downloads. Failing to find blocks on local peers, we then head back to the server to get them. This was causing the issue. But why? If we played with the PeerMux and put it down to 2-3, it just worked all the time. This led us to believe that the SNP driver could be having issues with a large number of HTTPS connections, but with proper debug messages we learned that this was not the case. The problem was within the iPXE code, not the SNP driver.

We were close to giving up at this point and accepting that a low PeerMux would have to be used; possibly the issue only affected SNP, and regular drivers would not experience it. We tested with a regular UNDI driver and this seemed to work better, leading us to believe the issue was still in SNP for UEFI. But further testing revealed that the UNDI version was experiencing the same issue, so most likely it would be the same for native drivers. Enter frustration.

After a bit of network traffic sniffing (hours) we were still scratching our heads, until we slowly started to clue in. Mr. Brown asked if client certificates were enabled on the server, which of course they are when downloading content from a ConfigMgr DP in HTTPS mode when the DP is set to only answer Internet clients. So we disabled the client certificate requirement on the server and retested. This time things were flying through. Voilà, we knew the failure path.

So the TLS part of iPXE that deals with the HTTPS side is superfast and optimized when it comes to the server certificate securing the SSL connection, and completes in milliseconds. The problem arises with the client certificate processing of the private key, which suffers from two issues:

  1. It’s not code optimized, which means each validation of the private key takes 2-3 seconds.
  2. It does not support session resume, which means that the whole TLS authentication process (including the private key operation) has to happen for each new connection.

With the above situation, it was easy to see where the downloads were failing: it was just the sheer number of connections re-authenticating all the time, which caused the choking effect. This explained why it worked with 100% local peer content: it never needed to mass test the client certificate, only for the first hash transfer. It also explained why setting a lower PeerMux worked: iPXE could deal with a few parallel client certificate calculations, just not 32 of them.
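
A rough back-of-envelope sketch in Python shows the scale of the problem. It uses the 2-3 seconds per private key operation and the PeerMux values quoted above. It also assumes that the key operations effectively run one after another rather than truly in parallel, which is our simplification for illustration, not a measured result. The function name `handshake_backlog` is hypothetical.

```python
def handshake_backlog(connections, seconds_per_key_op):
    """Seconds of private key work queued if every connection does a full handshake.

    Assumes the key operations are serialized (an illustrative simplification).
    """
    return connections * seconds_per_key_op

for peermux in (2, 3, 32):
    low = handshake_backlog(peermux, 2)    # 2 s per key operation
    high = handshake_backlog(peermux, 3)   # 3 s per key operation
    print(f"PeerMux {peermux:2d}: {low}-{high} s of client certificate work per round")
```

Under those assumptions a PeerMux of 32 queues up 64-96 seconds of client certificate work for every round of new connections, against 4-9 seconds at a PeerMux of 2-3.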

Sadly, the fix requires implementing TLS session resume, which is documented in RFC 5077. Although doable, it was not doable in the short time period that we had available.

Then once we have TLS session resume, we might look into speeding up the client private key calculations, as there is room for improvement, and who doesn’t have an appetite for some crypto math any day of the week, right?
As we need to ship iPXE Anywhere 2.6 yesterday, we don’t have time to wait for the TLS resume fix, so we are faced with one of two ways forward:

  1. Set a very low PeerMux, which makes P2P very slow, but still equally fast where BranchCache is not used.
  2. Keep the PeerMux at 32, and risk failure if 100% of the P2P content is not available locally.

Having to choose between two bad positions, we decided to ship with a high PeerMux value, which means that all downloads will be fast for internal sources that do not require client certificates. It will be slow for paths that require client certificates, but fast for BranchCache-aware content that is 100% available on local peers. In an upcoming release we will ship new binaries that will be faster, once we see that the TLS issues go away with TLS resume. We expect to ship that within 1-2 months.

Sadly we ran out of time to fix the issue where WinPE booted from iPXE on Hyper-V causes a BSOD at seemingly random times. We will however try to sort that out with some remote debugging over the internet, using windbg attached to a Hyper-V VM. That should be an interesting blog post.