The VMware vSphere Blog recently published a great post about a new storage feature in vSphere 5. The feature has the descriptive name Permanent Device Loss, or PDL.
To understand PDL, it helps to start with the condition it grew out of: All Paths Down (APD). To quote the VMware vSphere Blog:
„APD is what occurs on an ESX host when a storage device is removed in an uncontrolled manner from the host (or the device fails), and the VMkernel core storage stack does not know how long the loss of device access will last.“
„The APD condition could be transient, since the device or switch might come back. Or it could be permanent in so far as the device might never come back. In the past, we kept the I/O queued indefinitely, and this resulted in I/Os to the device hanging (not acknowledged). This became particularly problematic when someone issued a rescan of the SAN from a host or cluster which was typically the first thing customer tried when they found a device was missing. The rescan operation caused hostd to block waiting for a response from the devices (which never comes). Because hostd is blocked waiting on these responses, it can’t be used by other services, like the vpx agent (vpxa) which is responsible for communication between the host and vCenter. The end result is the host becoming disconnected from vCenter.“
I like a good Visio image to explain this kind of feature, so I created this one:
So enough explaining APD from vSphere 4. What does the PDL feature do? No need to rewrite the above-mentioned blog post, so here come some more quotes:
„In vSphere 5.0, we have made improvements to how we handle APD. First, what we’ve tried to do is differentiate between a device to which connectivity is permanently lost, and a device to which connectivity is transiently or temporarily lost. We now refer to a device which is never coming back as a Permanent Device Loss (PDL).
- APD is now considered a transient condition, i.e. the device may come back at some time in the future.
- PDL is considered a permanent condition where the device is never coming back.
As mentioned earlier, I/O to devices which were APD would be queued indefinitely. With PDL devices (those devices which are never coming back), we will now fail the I/Os to those devices immediately. This means that we will not end up in a situation where processes such as hostd get blocked waiting on I/O to these devices, which also means that we don’t end up in the situation where the host disconnects from vCenter.“
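To make the difference in I/O handling concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour described in the quote, not VMkernel code: the `ToyStorageStack` class, the `DeviceState` enum and the returned strings are all names I made up for illustration.

```python
from collections import deque
from enum import Enum


class DeviceState(Enum):
    OK = "ok"
    APD = "apd"  # transient: the device may come back
    PDL = "pdl"  # permanent: the device is never coming back


class ToyStorageStack:
    """Hypothetical model of the queue-vs-fail decision (illustration only)."""

    def __init__(self):
        self.state = DeviceState.OK
        self.pending = deque()  # I/Os held while the device is in APD

    def submit(self, io_id):
        if self.state is DeviceState.PDL:
            # PDL: fail immediately so that callers are never left waiting.
            return "failed"
        if self.state is DeviceState.APD:
            # APD: hold the I/O, since the device might still return.
            self.pending.append(io_id)
            return "queued"
        return "completed"


stack = ToyStorageStack()
stack.state = DeviceState.APD
print(stack.submit(1))  # the I/O is held in stack.pending
stack.state = DeviceState.PDL
print(stack.submit(2))  # the I/O is failed right away
```

The point of the sketch is the early return for PDL: a caller that gets an immediate failure can move on, while a caller whose I/O sits in the queue has to wait for a device that may never answer.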
How do they differentiate between APD and PDL? Bring in the quotes!
„The answer is via SCSI sense codes. SCSI devices can indicate PDL state with a number of sense codes returned on a SCSI command. One such sense code is 5h / ASC=25h / ASCQ=0 (ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED). The sense code returned is a function of the device. The array is in the best position to determine if the requests are for a device that no longer exists, or for a device that just has an error/problem. In fact, in the case of APD, we do not get any sense code back from the device.“
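The decision described in the quote can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the `SenseData` type and the `classify` function are hypothetical names, and the PDL table contains only the single sense code quoted above, whereas a real array may return others.

```python
from typing import NamedTuple, Optional


class SenseData(NamedTuple):
    sense_key: int
    asc: int   # additional sense code
    ascq: int  # additional sense code qualifier


# Only the one code named in the quote:
# 5h / ASC=25h / ASCQ=0 (ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED).
PDL_SENSE_CODES = {(0x5, 0x25, 0x00)}


def classify(sense: Optional[SenseData]) -> str:
    """Classify a failed command from the sense data that came back, if any."""
    if sense is None:
        # Assumption for this sketch: no sense data at all means APD.
        return "APD"
    if tuple(sense) in PDL_SENSE_CODES:
        return "PDL"
    # Any other sense code is an ordinary device error in this sketch.
    return "OTHER"


print(classify(SenseData(0x5, 0x25, 0x00)))
print(classify(None))
```

Note that the array, not the host, makes the call here: the host merely looks up what the device reported, which matches the quote's remark that the array is in the best position to know whether a device is gone for good.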
I made a diagram for this as well:
The VMware blog post also describes ways to remove devices without triggering PDL or APD. I recommend reading it.