Today I had a nasty problem at a customer site. Some VMware ESX servers were disconnected from vCenter. I tried to reconnect the VMware ESX hosts using the VIC, but that was not possible. I was also unable to log in via SSH, so I connected through HP iLO to the HP blade enclosure and opened a remote session to the console. There were HA errors on the screen. I tried to log in on the console, but after typing the password and pressing Enter the login prompt simply reappeared, every time I tried. The VMware ESX servers were not manageable anymore. DAMN.
The VMs on the disconnected VMware ESX servers were still running and had RDP enabled. The only solution so far was to log in via RDP and shut down the VMs. I did this for one VMware ESX server. After shutting down all VMs, I did a cold reboot of the VMware ESX server. After several minutes the VMware ESX server reappeared in vCenter. I went through the log files and found the following error:
fork: Resource temporarily unavailable
After searching the VMware Knowledge Base I found the following article:
It looks like we have the same symptoms:
- Unable to log in through SSH to the VMware ESX host
- Unable to log in on the local service console
- HA errors.
I ran "ps -ef" on the console of the other VMware ESX servers that were still working, and it returned a couple of thousand defunct cimservera processes. Holy shit.
After restarting the pegasus service on the host, the defunct cimservera processes were gone. Now there are 134 processes active.
It seems that one iSCSI target was offline and that all the VMware ESX servers tried to connect to it every 60 seconds. The failed login attempts resulted in thousands of defunct cimservera processes. This is a bug, and VMware will release a patch for this nasty problem. Watch your patches.
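Counting those defunct processes by eye gets old quickly, so here is a minimal sketch of how the "ps -ef" output could be summarized. It assumes the usual eight-column "ps -ef" layout and that zombies are marked with "<defunct>" in the command column. The function name count_defunct and the SAMPLE text are made up for illustration and are not output from the affected hosts.

```python
from collections import Counter

# Illustrative sample only, not real output from the affected hosts.
SAMPLE = """UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Jan01 ?        00:00:02 init [3]
root      2050     1  0 09:00 ?        00:01:10 cimserver
root      2101  2050  0 10:15 ?        00:00:00 [cimservera] <defunct>
root      2102  2050  0 10:16 ?        00:00:00 [cimservera] <defunct>
"""


def count_defunct(ps_output):
    """Count defunct (zombie) processes per command name in `ps -ef` text."""
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # blank or malformed line
        cmd = fields[7]
        if "<defunct>" in cmd:
            # "[cimservera] <defunct>" becomes "cimservera"
            name = cmd.replace("<defunct>", "").strip().strip("[]")
            counts[name] += 1
    return counts


print(count_defunct(SAMPLE))
```

Feeding it the captured "ps -ef" text from each host gives one number per process name, which is easier to compare across hosts than scrolling through the raw listing.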
As a temporary fix, I scheduled a daily restart of the pegasus service on every VMware ESX server using the plink utility.
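The scheduled restart could look roughly like the sketch below. This is not the exact script I used. The host names, the root user and the key file name are placeholders, and "service pegasus restart" is assumed to be the restart command on the service console. The plink options used (-ssh, -batch, -i) are standard PuTTY ones. By default it only prints the commands, so nothing is touched until dry_run is switched off.

```python
import subprocess

# Assumed restart command on the ESX service console.
RESTART_COMMAND = "service pegasus restart"

# Placeholder host names, replace with your own ESX hosts.
ESX_HOSTS = ["esx01.example.local", "esx02.example.local"]


def build_plink_command(host, user="root", key_file="esx_admin.ppk"):
    """Build the plink argument list for one host (user and key are placeholders)."""
    return [
        "plink",
        "-ssh",            # force the SSH protocol
        "-batch",          # never prompt, so a scheduled run cannot hang
        "-i", key_file,    # private key for authentication
        "{0}@{1}".format(user, host),
        RESTART_COMMAND,
    ]


def restart_pegasus(hosts, dry_run=True):
    """Restart pegasus on each host; with dry_run only print the commands."""
    results = {}
    for host in hosts:
        cmd = build_plink_command(host)
        if dry_run:
            print(" ".join(cmd))
            results[host] = None
        else:
            # The exit code of plink tells whether the remote command succeeded.
            results[host] = subprocess.run(cmd).returncode
    return results


restart_pegasus(ESX_HOSTS)
```

Run from a management workstation once a day (Windows Task Scheduler, for example), with dry_run=False once the printed commands look right.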
Watch your processes frequently on your VMware ESX hosts!