The Errors
We recently ran into an issue with our vCenter Server Appliance (VCSA). We first noticed it when the weekend backups failed; on further inspection, the appliance itself appeared to have failed, and the login screen showed a cryptic error. A reboot seemed to work initially, but once I logged into vCenter, no hosts were displayed. I then connected to the appliance over SSH to check the services and found that a number of them were down. The command to start all of them at once did not work. If you would like to check the status of services, run the following command from your SSH prompt:
service-control --status
If you want to attempt to start all the downed services, run the following command from SSH:
service-control --start --all
In my case, that command was unable to get the other services to start. My solution was to start them individually and that seemed to do the trick. To start services individually, run the following command from the SSH prompt:
service-control --start servicename
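If many services are down, starting them one by one gets tedious. The following is a minimal sketch of how that could be scripted from the appliance's BASH shell, not something taken from VMware documentation. It assumes the status output lists stopped services under a "Stopped:" heading as space-separated names; the function names and the SERVICE_CONTROL variable are hypothetical helpers introduced here so the loop can be dry-run.

```shell
# Sketch: start each stopped service individually instead of using --all.
# SERVICE_CONTROL is a hypothetical override (set it to "echo" for a dry run).
SERVICE_CONTROL="${SERVICE_CONTROL:-service-control}"

# Print one service name per line for everything listed under "Stopped:".
# Assumption: status output has headings such as "Running:" and "Stopped:",
# with the names on the same line or on the lines that follow.
stopped_services() {
    awk '
        /^[A-Za-z]+:/  { in_stopped = ($1 == "Stopped:"); start = 2 }
        !/^[A-Za-z]+:/ { start = 1 }
        in_stopped     { for (i = start; i <= NF; i++) print $i }
    '
}

# Start every service named on stdin, one at a time, and report failures.
start_each() {
    local svc rc=0
    while read -r svc; do
        [ -n "$svc" ] || continue
        # </dev/null keeps the command from consuming the list on stdin.
        if ! "$SERVICE_CONTROL" --start "$svc" </dev/null; then
            echo "failed to start: $svc" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

With those functions loaded, `service-control --status | stopped_services | start_each` would attempt each stopped service in turn. Some services may be stopped by design, so review the list from `stopped_services` before piping it into `start_each`.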
Once I ran the command on multiple downed services, I got the appliance to work once again - or so I thought! The fix only lasted for a few minutes. After I tried and failed to rerun the backups, the login screen started displaying the same cryptic message as before. It was at this point that I started looking into the VCSA Appliance Management interface. There I noticed that some services were down and that one of the partitions was almost full.
The Fix
Once I realized the database partition was nearing capacity, I looked into expanding it to see whether that was the root cause, hoping it would fix the issue. In my case, Hard disk 8 (the seat partition) was at 90%+ and originally sized at 10GB. If you're in a bind, you can grow the VCSA disk by editing the appliance's virtual machine settings on the host where it is running and increasing the size of that disk. Once the space has been added, you will need to run a command in VCSA via SSH. Log in to your appliance and run the following command from the appliance shell to allocate the newly added space to the seat partition.
com.vmware.appliance.system.storage.resize
This command will grow the partition with the newly allocated storage. If you check the Disks section under Monitoring in Appliance Management, you might notice that the new space is not shown under Utilization. In my case, a reboot of the appliance resolved this graphical glitch. It was at this point that I was finally able to run our backups successfully and go back to normal - for about a week! Just like the previous time, the appliance started displaying the same symptoms, which did not allow us to log in and manage the hosts or back up any virtual machines. I added some additional space, but I knew I was on borrowed time and had to open a case with VMware support.
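To keep an eye on partition usage from the BASH shell rather than the Appliance Management UI, something like the following sketch could be used. It assumes the seat partition is mounted at /storage/seat; the function names and the default 80% threshold are hypothetical choices made for illustration.

```shell
# Print the use percentage (digits only) for a mount point,
# reading `df -P` output on stdin.
usage_pct() {
    awk -v mp="$1" 'NR > 1 && $6 == mp { sub(/%/, "", $5); print $5 }'
}

# Warn when a mount point is at or above a threshold.
# The default of 80 is an arbitrary illustrative value.
check_mount() {
    local mp="$1" limit="${2:-80}" pct
    pct=$(df -P "$mp" | usage_pct "$mp")
    if [ -z "$pct" ]; then
        echo "$mp: not found as a mount point" >&2
        return 2
    fi
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $mp is ${pct}% full"
        return 1
    fi
    echo "OK: $mp is ${pct}% full"
}
```

For example, `check_mount /storage/seat 90` would print a warning once the partition reaches 90%. Running it before and after the resize is a quick way to confirm the new space was actually picked up.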
The Bug & Workaround
Once I opened a case and got a call back from support, it didn't take long to determine that the root cause of this issue is a bug in vSphere 6.7 Update 3 that is currently affecting Cisco, Dell, HP and possibly other platforms. The hosts' WBEM service starts reporting a flood of events, which fills the database and forces the vpxd service to stop. If you look at the following graphic, you'll see how the growth correlates with the upgrade of our hosts. The initial hosts were upgraded to U3 on the 13th, with the rest being upgraded on the 19th. The dip displayed in the graphic is where I added space to the partition where the database resides.
From reading comments online, it seems that if you upgrade your hosts to U3 but leave VCSA on U2, you will not experience this issue - but that goes against the recommended practice of updating your VCSA before your hosts. The workaround consists of adding space to the disk and partition where the database resides, or truncating some of the tables in the database, and additionally stopping the WBEM service on the hosts. To stop the WBEM service, you'll need to SSH to every host connected to your VCSA and run the following command:
esxcli system wbem set --enable false
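With more than a handful of hosts, logging in to each one by hand is slow. A minimal sketch of a loop that runs the same command on a list of hosts follows. It assumes SSH is enabled on each host and that you log in as root; the function name, the SSH_CMD variable, and the host names in the usage note are hypothetical.

```shell
# SSH_CMD is a hypothetical override (set it to "echo" for a dry run).
SSH_CMD="${SSH_CMD:-ssh}"

# Run the WBEM disable command on every host passed as an argument.
disable_wbem() {
    local host rc=0
    for host in "$@"; do
        if ! "$SSH_CMD" "root@$host" 'esxcli system wbem set --enable false' </dev/null; then
            echo "failed on $host" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A call such as `disable_wbem esx01.example.com esx02.example.com` would then apply the setting to both hosts and return non-zero if any of them failed.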
Support warns that you will need to run this command again whenever you reboot a host, so keep that in mind. If you would like to shrink the database, you will have to truncate some of the tables by running the following commands. Start by logging on to VCSA via SSH. Before you can connect to the database, you will first need to launch BASH by running the shell command.
shell
Once you're at the BASH prompt, run the following command to access the database prompt:
/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
Once you're at the Postgres command prompt, run the following command to determine which tables you will need to truncate to clear space:
SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid))
AS "total_size" FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema') AND C.relkind <> 'i' AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC LIMIT 20;
The results from this command will look a little something like the screenshot displayed below.
From the results, pick the tables to truncate and run the following commands, replacing XX with the numeric suffix of each table:
TRUNCATE table VPX_EVENT_XX;
TRUNCATE table VPX_EVENT_ARG_XX;
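If several table pairs need truncating, the statements can be generated rather than typed. The sketch below prints a TRUNCATE pair for each numeric suffix you pass in; the function name and the suffixes in the usage note are hypothetical, and the output should be reviewed before it is run against the database.

```shell
# Print TRUNCATE statements for each numeric table suffix given.
# Non-numeric arguments are skipped so a typo cannot produce odd SQL.
truncate_sql() {
    local n
    for n in "$@"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "skipping non-numeric suffix: $n" >&2
                continue
                ;;
        esac
        printf 'TRUNCATE TABLE VPX_EVENT_%s;\n' "$n"
        printf 'TRUNCATE TABLE VPX_EVENT_ARG_%s;\n' "$n"
    done
}
```

From the BASH prompt, `truncate_sql 12 13` prints the four statements for those two suffixes, and the output can be piped to the psql command shown earlier once you are happy with it.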
Repeat the process until you're satisfied with the amount of space cleared on the affected drive, and do not forget to disable the WBEM service on your connected hosts. For additional information on this bug, check out VMware KB 74607.