Today I had a nasty problem at a customer. Some VMware ESX servers were disconnected from vCenter. I tried to reconnect the hosts using the VIC, but that was not possible. I was also unable to log in via SSH, so I connected to the HP blade enclosure with HP iLO and opened a remote console. There were HA errors on the screen. I tried to log in on the console, but after typing the password and pressing Enter the login prompt simply reappeared, every time I tried. The VMware ESX servers were not manageable anymore. Damn.
The VMs on the disconnected VMware ESX servers were still running and had RDP enabled. The only solution so far was to log in via RDP and shut down the VMs. I did this for one VMware ESX server. After shutting down all VMs, I did a cold reboot of the server. After several minutes it reappeared in vCenter. I went through the log files and found the following error:
fork: Resource temporarily unavailable
After searching the VMware Knowledge Base I found the following article:
Defunct cimservera processes seen on VMware ESX 3.5 running hardware management agents
It looks like we have the same symptoms:
- Unable to log in through SSH to the VMware ESX host
- Unable to log in on the local service console
- HA errors
I ran “ps -ef” on the console of the other VMware ESX servers that were still working, and it returned a couple of thousand defunct cimservera processes. Holy shit.
After restarting the pegasus service on the host, the defunct cimservera processes were gone. Now there are 134 processes active.
It seems that one iSCSI target was offline and that all the VMware ESX servers tried to connect to it every 60 seconds; the failed login attempts resulted in thousands of defunct cimservera processes. This is a bug and VMware will release a patch for this nasty problem. Watch your patches.
As a temporary fix I scheduled a daily restart of the pegasus service on every VMware ESX server using the plink utility.
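If you want a number instead of scrolling through the process list, something like the following could work. It is only a rough sketch: the function name count_defunct is made up for this example, and it assumes that zombie entries in the “ps -ef” output contain the word "defunct".

```shell
# Count defunct (zombie) processes whose ps line mentions a given name.
# Reads "ps -ef" style output on stdin. The function name is hypothetical.
count_defunct() {
    name="$1"
    # The first grep keeps lines that mention the process name, the second
    # counts those marked defunct. grep -c exits non-zero on a zero count,
    # so "|| true" keeps the function from reporting a failure in that case.
    grep -F "$name" | grep -c 'defunct' || true
}
```

On a host you would call it as: ps -ef | count_defunct cimservera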
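For those who want to do something similar, here is a rough sketch of what such a scheduled job could look like. It is not my exact script. The hosts file, the root login, the DRY_RUN switch and the function name are assumptions for this example, and it presumes that plink can authenticate without a prompt (for instance with a key) and that the service is restarted with "service pegasus restart".

```shell
# Restart the pegasus service on every ESX host listed in a file (one
# hostname per line) using plink. Hypothetical helper; set DRY_RUN=1 to
# only print the commands instead of running them.
restart_pegasus_all() {
    hosts_file="$1"
    while IFS= read -r host; do
        # Skip empty lines in the hosts file.
        [ -z "$host" ] && continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "plink -ssh -batch root@$host service pegasus restart"
        else
            # -batch disables interactive prompts; stdin is redirected so
            # plink does not swallow the rest of the hosts file.
            plink -ssh -batch "root@$host" "service pegasus restart" < /dev/null
        fi
    done < "$hosts_file"
}
```

Run it once with DRY_RUN=1 to check the host list, then call it from whatever scheduler you use, once a day.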
Watch your processes frequently on your VMware ESX hosts!
Ivo,
I had the same issue at a customer and found out that it was HP SIM causing the zombie processes.
Probably there’s a bug in the HP agents or in the pegasus service.
However, when HP SIM detects this WBEM service it tries to log in with the users defined in HP SIM for WBEM. This login process (cimservera) never shuts down, so your COS will run out of resources (max 32000 processes?).
I solved this by closing the CIMHttpServer port on the ESX firewall because I’m not using it anyway (/usr/sbin/esxcfg-firewall -d CIMHttpsServer).
If you shut down this port, HP SIM won’t detect WBEM.
After closing the port I did an Identify on HP SIM on all my ESX servers. The problem never returned.