File Lock on Full VMFS Volume

We recently had a VMFS volume fill up due to over-provisioning, which caused the VMs on the datastore to stop responding.  Typically the solution is easy: free up space on the volume by migrating VMs off the datastore, or increase the space on the underlying volume and expand the datastore.  Since this was just a development environment, we did not have an enterprise-grade array with features such as volume auto-grow, nor did we have the luxury of additional space to add to the volume.  We realized we would have to move files off the datastore to free up space and allow the VMs to “breathe” again.  We quickly discovered, however, that we could neither migrate VMs nor delete any files from the volume.

We were prompted with an error when attempting a VM migration or a file deletion from the vSphere client.  We also tried removing files via the service console, which returned the following error:

rm: cannot remove <filename>: Input/output error

It appeared that the files were locked.  Thankfully, we discovered a quick solution.  One of the hosts in the cluster held a lock on a file on the full volume but had no space left to release it.  The only way to manually force the release was to attempt to remove any one file from this volume from each of the hosts in the cluster.  The removal would succeed on whichever host in the cluster was holding the lock.
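In practice this just means issuing the same rm from every host until one of them succeeds.  A minimal dry-run sketch of that loop is below; the host names and file path are hypothetical, and the function only prints the per-host command (on a real cluster you would run the rm from each host's service console in turn):

```shell
#!/bin/sh
# Dry-run sketch: print the removal command to try from each host in the
# cluster.  The rm only succeeds on the host actually holding the lock.
# Host names (esx01..esx03) and the file path are hypothetical examples.
plan_removals() {
  file=$1
  shift
  for host in "$@"; do
    echo "on ${host}: rm -f ${file}"
  done
}

plan_removals /vmfs/volumes/datastore1/temp-0.log esx01 esx02 esx03
```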

VMware has a KB article describing exactly this solution:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1011592

Thankfully, this worked for us and allowed us to free up enough space to perform normal operations on the VMFS volume and get the stopped VMs running once again.

ESX or vSphere Host Not Responding

We recently discovered that one of our older host servers was in a not-responding state in vCenter.  After confirming network connectivity to the host server and its virtual machines, we determined that the host management agent service had likely hung.

The issue was resolved by running the following command after logging into the service console:

# service mgmt-vmware restart

About a minute after successfully restarting the host agent service, the host returned to a connected state and full management of the host resumed.
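Rather than watching vCenter, you can poll from the service console until the management agent answers again.  A minimal sketch is below; `check_mgmt` is a stand-in probe so the sketch is self-contained, and on a real ESX service console it would call through hostd (for example, `vmware-cmd -l >/dev/null 2>&1`, which only succeeds when the agent is responding):

```shell
#!/bin/sh
# Sketch: after `service mgmt-vmware restart`, poll until the management
# agent responds.  check_mgmt is a placeholder probe; on ESX it would be
# a call that goes through hostd, such as `vmware-cmd -l`.
check_mgmt() {
  true  # placeholder for: vmware-cmd -l >/dev/null 2>&1
}

wait_for_mgmt() {
  tries=0
  while [ "$tries" -lt 6 ]; do
    if check_mgmt; then
      echo "management agent responding"
      return 0
    fi
    sleep 10   # the host took about a minute to reconnect for us
    tries=$((tries + 1))
  done
  echo "management agent still not responding" >&2
  return 1
}

wait_for_mgmt
```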

Great VMware KB articles to reference:

Diagnosing an ESX/ESXi host that is disconnected or not responding in vCenter Server:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003409

Restarting the Management agents on an ESX or ESXi Server:  http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1003490

Additional note:

Ray Heffer noted on his blog that if the restart hangs, the process causing the issue must be killed.  We did not need to take this step, but if you run into this situation, Ray has some great notes on killing the conflicting process.
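We never exercised that step ourselves, but its gist can be sketched as below.  The function is a dry run that only reports the kill it would issue; on classic ESX the hung process would be vmware-hostd, and the `vmware-[h]ostd` pattern is the usual bracket trick to stop pgrep from matching its own command line:

```shell
#!/bin/sh
# Dry-run sketch of the "kill the conflicting process" step (we did not
# need it).  Finds the first process matching a pattern and reports the
# kill it would issue; swap the echo for `kill -9 "$pid"` on a real host,
# then run `service mgmt-vmware restart` again.
find_stuck() {
  pattern=$1
  pid=$(pgrep -f "$pattern" | head -n 1)
  if [ -n "$pid" ]; then
    echo "would run: kill -9 $pid"
  else
    echo "no matching process"
  fi
}

find_stuck 'vmware-[h]ostd'
```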