An IPMI House of Cards
We’ve recently added features using IPMI to our ReAssure testbed, for example to support reimaging our Sun experimental PCs and rebooting them into a Live CD, so that researchers can run any OS they want on the testbed. IPMI stands for the “Intelligent Platform Management Interface”. We have a dedicated, isolated network on which commands are sent to cards in the experimental PCs. An OS running on these cards can provide status responses and perform power management actions, such as a power cycle that reboots the computer. This is supposed to be useful if, for example, the OS running on the computer locks up. So we were hoping for fewer trips to the facility where the experimental PCs are hosted, greater reliability, and more convenient management.
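To make the mechanics concrete, here is a minimal sketch of how a status query or power cycle could be sent over the management network with ipmitool's LAN interface. It is not our actual ReAssure code: the host name, user name, password file and the `ipmi_argv` and `probe` helpers are hypothetical, and it assumes the cards speak IPMI v2.0 (`-I lanplus`; older cards may need `-I lan`). The retry loop is only there because a card may not answer every time.

```python
import subprocess
import time


def ipmi_argv(host, user, password_file, *command):
    """Build an ipmitool command line for the LAN interface.

    Assumes IPMI v2.0 ("lanplus"); the password is read from a file (-f)
    so that it does not show up in the process list.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
            "-f", password_file, *command]


def probe(argv, attempts=3, timeout=10, delay=2, run=subprocess.run):
    """Run the command, retrying a few times; return the reply or None."""
    for attempt in range(attempts):
        try:
            result = run(argv, capture_output=True, text=True, timeout=timeout)
            if result.returncode == 0:
                return result.stdout.strip()
        except (subprocess.TimeoutExpired, OSError):
            pass  # the card hung, or ipmitool is not installed
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


# Hypothetical usage (placeholder host, user and password file):
# status = ipmi_argv("ipmi-node01.example", "admin", "/root/.ipmi_pass",
#                    "chassis", "power", "status")
# print(probe(status) or "no response from the IPMI card")
#
# cycle = ipmi_argv("ipmi-node01.example", "admin", "/root/.ipmi_pass",
#                   "chassis", "power", "cycle")
# probe(cycle)
```

A `None` result only tells you that the card did not answer within the allotted attempts; it says nothing about whether the OS on the computer itself is healthy.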
However, what we got was more headaches. Some IPMI cards failed entirely; as we had daisy-chained them, the IPMI cards of the other PCs became inaccessible too. Others simply locked up, requiring a trip to the facility even though the OS on the computer was fine… One of them sometimes responds to status commands and sometimes doesn’t respond at all, seemingly at random. The result is that using the IPMI cards actually made ReAssure less reliable and in need of more maintenance, because the reliability-enhancing component was so unreliable! The irony.
I don’t know if we’ve just been unlucky, but I’m now keeping an eye out for a way to make the IPMI setup more reliable, or for an alternative, hoping that it doesn’t introduce even more problems. Finding an alternative is rather unlikely, as I’ve discovered that even though the LAN interface is standard, the physical side of those cards isn’t. AFAIK you can’t take a generic IPMI card and install it; it needs to be a proprietary solution from the hardware vendor (e.g., a Tyan card for a Tyan motherboard, a Sun IPMI card for a Sun computer). So if the IPMI solution provided by your hardware vendor has flaws, you’re stuck with it; it’s not like a NIC, which you can replace with one from any vendor. I don’t know of any way to replace the software on the IPMI cards either, the way you can replace the bad firmware of consumer routers with better open source software. I suppose that the lessons from this story are that:
- You can’t make something more reliable by adding low-quality components in a “backup” role, because then you need to maintain them as well and make sure that they’ll work when they’re needed;
- Putting something on a separate card doesn’t make it more reliable;
- IPMI is a weak standard: only the exposed interfaces are standardized, which made it possible to develop, for example, OpenIPMI (on the managed OS side) and ipmitool (LAN interface). The middle of the “sandwich” isn’t standardized: the implementations and parts are proprietary, incompatible between vendors, inflexible and fragile;
- Proprietary, non-standard solutions prevent choosing better components.