As I have stated before, I love databases, especially the vCenter database. That’s why I really wanted to create another vCenter database post, this time regarding troubleshooting.
I recommend watching the video from my previous post, regarding how performance statistics are processed.
Starting with a small overview its good to know the vCenter database is quite a complex database and it’s not getting any simpler with each new version of vSphere.
- In vCenter 2.5 it had 88 tables and 8 procedures
- In vCenter 4.0 it had 196 tables and 12 procedures
- In vCenter 5.0 it has 247 tables and 27 procedures
I have had my share of database troubleshooting, even though I live in Iceland 🙂 Mostly its because of server sprawl on vCenters starting on SQL Express.
The most common problems are:
- Timeouts – Gaps in the Performance tab in the vSphere Client.
- Slow response getting performance data.
- Only real time data is available – or only Real-time and 1 day. Missing the week, month and year old performance statistics.
- No space left for database growth.
- Overloaded SQL server.
- If specific to one ESXi host – restart the hosts Management agent. KB article (1003490): Restart Management agents on ESXi and ESX.
- If showing on every host – a) Check if the SQL server is overloaded. CPU and Memory. b) Check the size of the first vpx_hist table, if larger than 10 million rows you will need to dig deeper. Most likely the rollup jobs aren’t working correctly (you should see a lot of other signs if they are not working, like older performance data not showing). You can also try to manually run the jobs.
- Check and see when the rollup jobs last ran:
select max (sample_time) from vpx_sample_time2
- If there is a long time since last run, your rollup jobs aren’t functioning correctly.
- Most likely its because of fragmented vpx_hist tables. Go through the steps in this KB article (1003990): Defragmenting VirtualCenter performance data indexes on a Microsoft SQL database.
- Large database. Truncate or purge (archive first).
- When selecting older data than 1 day, you will get “Performance data is currently not available for this entity”. Culprit: Rollup jobs. Check and see the SQL Agent Service is Started and recreate the Rollup jobs.
- Check to see when the last time a rollup job ran. Use this query:
select max (sample_time) from vpx_sample_time2
- Manually run the rollup jobs – purge the tables – truncate data from the tables.
No space for database growth:
- Most likely the database is configured for Full Recovery. This will eventually consume all the space on the SQL server. Please make sure to change the recovery mode to Simple but please note that this option will probably make you lose some days of performance data.
- To check where the growth is occuring see this KB article (1028356): Determining where growth is occurring in the vCenter Server database. Also you can run these commands:
exec sp_spaceused vpx_hist_statx or
select count(*) from vpx_hist_statx.
- Then you can purge data, KB Article (1025914): Purging old data from the database used by vCenter Server.
- Or truncate the data. Use these commands:
truncate table vpx_hist_statx or
truncate table vpx_sample_timex
Overloaded SQL server:
- Use Perfmon on the SQL server to check specific SQL counters. Most common culprits are memory or IO bottlenecks.
So that probably made you more confused than you were before so this is the short version for vCenter SQL troubleshooting:
- Loss of performance data is only specific to one host – restart the Management agent. KB article (1003490): Restart Management agents on ESXi and ESX.
- Check if there is space for database growth – SQL server drives. KB article (1028356): Determining where growth is occurring in the vCenter Server database. When you found the guilty table, purge it. KB Article (1025914): Purging old data from the database used by vCenter Server. Or truncate it, see commands above.
- Check the health of the transaction logs. What is the recovery model for database? Should be Simple. KB article (1003980): Troubleshooting transaction logs on a Microsoft SQL database server.
- Slow response means fragmented tables. KB article (1003990): Defragmenting VirtualCenter performance data indexes on a Microsoft SQL database.
- Other good housekeeping tasks include: Check SQL servers resources, no extra applications on the vCenter, keep vCenter and SQL separate, avoid using statistic level higher then 2, do not use SQL Express, check for network congestion between vCenter and SQL.
- Put the vCenter database in your SQL preventive maintenance schedule – defrag, check for huge tables etc. Also look at this great post from Chris Wahl at wahlnetwork.com where he uses the SQL’s Maintenance plan to configure daily backups, and this post as well, where he configures Log shipping for a vCenter SQL database.
Other than that you can also see my previous post about vCenter SQL database performance considerations.
Note! There are multiple other vCenter SQL troubleshooting scenarios but they all include loss of service, locked database, ODBC errors in vpxd.log (wrong authentication (KB1006482)/permission(KB1003052), time sync(important for SQLs).
VirtualCenter Database Maintenance – SQL http://www.vmware.com/files/pdf/vc_microsoft_sql_server.pdf (for vcenter 2.5… but still)
Industrial strength defrag from get-admin.com, great post for even more defragmentation.