Tag Archives: VMware

vCenter SQL Database Troubleshooting

As I have stated before, I love databases, especially the vCenter database.¬†That’s why I really wanted to create another vCenter database post, this time regarding troubleshooting.

I recommend watching the video from my previous post, regarding how performance statistics are processed.

Starting with a small overview its good to know the vCenter database is quite a complex database and it’s not getting any simpler with each new version of vSphere.

  • In vCenter 2.5 it had 88 tables and 8 procedures
  • In vCenter 4.0 it had 196 tables and 12 procedures
  • In vCenter 5.0 it has 247 tables and 27 procedures

I have had my share of database troubleshooting, even though I live in Iceland ūüôā Mostly its because of server sprawl on vCenters starting on SQL Express.

The most common problems are:

  • Timeouts – Gaps in the Performance tab in the vSphere Client.
  • Slow response getting performance data.
  • Only real time data is available – or only Real-time and 1 day. Missing the week, month and year old performance statistics.
  • No space left for database growth.
  • Overloaded SQL server.

Timeouts:

  • If specific to one ESXi host – restart the hosts Management agent. KB article (1003490): Restart Management agents on ESXi and ESX.
  • If showing on every host ¬†– a) Check if the SQL server is overloaded. CPU and Memory. b) Check the size of the first vpx_hist table, if larger than 10 million rows you will need to dig deeper. Most likely the rollup jobs aren’t working correctly (you should see a lot of other signs if they are not working, like older performance data not showing). You can also try to manually run the jobs.
  • Check and see when the rollup jobs last ran:

select max (sample_time) from vpx_sample_time2

  • If there is a long time since last run, your rollup jobs aren’t functioning correctly.
Slow response:
  • Most likely its because of fragmented vpx_hist tables. Go through the steps in this KB article (1003990):¬†Defragmenting VirtualCenter performance data indexes on a Microsoft SQL database.
  • Large database. Truncate or purge (archive first).
Only Real-time data available:
  • When selecting older data than 1 day, you will get “Performance data is currently not available for this entity”. Culprit: Rollup jobs. Check and see the SQL Agent Service is Started and recreate the Rollup jobs.
  • Check to see when the last time a rollup job ran. Use this query:
select max (sample_time) from vpx_sample_time2
  • Manually run the rollup jobs – purge the tables – truncate data from the tables.

No space for database growth:

  • Most likely the database is configured for Full Recovery. This will eventually consume all the space on the SQL server. Please make sure to change the recovery mode to Simple but please note that this option will probably make you lose some days of performance data.
  • To check where the growth is occuring see this KB article (1028356):¬†Determining where growth is occurring in the vCenter Server database. Also you can run these commands:

exec sp_spaceused vpx_hist_statx or

select count(*) from vpx_hist_statx.

  • Then you can purge data,¬†KB Article (1025914):¬†Purging old data from the database used by vCenter Server.
  • Or truncate the data. Use these commands:

truncate table vpx_hist_statx or

truncate table vpx_sample_timex

Overloaded SQL server:

  1. Use Perfmon on the SQL server to check specific SQL counters. Most common culprits are memory or IO bottlenecks.

So that probably made you more confused than you were before so this is the short version for vCenter SQL troubleshooting:

  1. Loss of performance data is only specific to one host Рrestart the Management agent. KB article (1003490): Restart Management agents on ESXi and ESX.
  2. Check if there is space for database growth РSQL server drives. KB article (1028356): Determining where growth is occurring in the vCenter Server database. When you found the guilty table, purge it. KB Article (1025914): Purging old data from the database used by vCenter Server. Or truncate it, see commands above.
  3. Check the health of the transaction logs. What is the recovery model for database? Should be Simple. KB article (1003980): Troubleshooting transaction logs on a Microsoft SQL database server.
  4. Slow response means fragmented tables. KB article (1003990): Defragmenting VirtualCenter performance data indexes on a Microsoft SQL database.
  5. Other good housekeeping tasks include: Check SQL servers resources, no extra applications on the vCenter, keep vCenter and SQL separate, avoid using statistic level higher then 2, do not use SQL Express, check for network congestion between vCenter and SQL.
  6. Put the vCenter database in your SQL preventive maintenance schedule – defrag, check for huge tables etc. Also look at this great post from Chris Wahl at wahlnetwork.com where he uses the SQL’s Maintenance plan to configure daily backups, and this post as well, where he configures Log shipping for a vCenter SQL database.

Other than that you can also see my previous post about vCenter SQL database performance considerations.

Note! There are multiple other vCenter SQL troubleshooting scenarios but they all include loss of service, locked database, ODBC errors in vpxd.log (wrong authentication (KB1006482)/permission(KB1003052), time sync(important for SQLs).

Recommended reading:

KB: Investigating the health of a vCenter Server database

VirtualCenter Database Maintenance – SQL http://www.vmware.com/files/pdf/vc_microsoft_sql_server.pdf¬†(for vcenter 2.5… but still)

Industrial strength defrag from get-admin.com, great post for even more defragmentation.

VMware Storage Basics – PSA and NMP

A few days ago I posted a blog about VMware Storage Basics, the identifiers and how paths work between them.

This post will be post two of three in explaining the VMware storage stack, ending in a post about masking paths.

So, PSA and NMP, what is that? First a short and extremely dry explanation.

PSA (Pluggable Storage Architechture)

  • is special VMKernel layer that manages storage multipathing.
  • Coordinates simultaneous operation of multiple MPPs (Multipathing plugins) and the default NMP (Native Multipath Plugin).
  • Allows 3rd party vendors to design their own MPP, with their own load balancing techniques and failover mechanisms.

NMP (Native Multipath Plugin)

  • The default MPP that comes with ESX,ESXI is the NMP. It manages some subplugins.
  • The subplugins are SATPs (Storage Array Type Plugin) and the PSPs (Path Selection Plugin).
  • Subplugins are either VMware default or 3rd party plugins (specific SATPs for specific arrays, e.g. VMW_SATP_SVC for IBM SVC Arrays).
  • It associates SATP to paths, processes I/O requests to logical devices, performs failovers using SATP.

MPP (Multipath Plugin)

  • 3rd party NMP+SATP+PSP. All in one stack.

Does that explain anything? No. Well OK, some. For my part its when I see a picture and preferrably a moving representation is when things start to seep into the grey matter.

First an overview picture – On the left you got the NMP, with its destinct subplugins, SATP and PSP. Then in the middle there is a 3rd party MPP, and last but not least is the MASK_PATH plugin.

So what do the subplugins do?

SATP

  • Manages failover of paths. Monitors, determines and implements switching between paths in case of a failure.
  • Provided for every type of array that VMware supports, e.g. VMW_SATP_LSI for LIST/NetApp arrays from Dell, IBM, Oracle and SGI to name a few.

PSP

  • Determines which path will be used for an I/O request. Thats the Fixed, Round-Robin, Most-recently-used algorithms. More on that later.

So now we got these acronyms : PSA, NMP, MPP, SATP, PSP, MRU, RR. How do they work together? Lets begin when you boot up a ESXi host.

  1. NMP assigns a SATP to every physical path to the logical device (datastore), e.g. VMW_SATP_LSI if its a IBM DS3524.
  2. NMP associates paths to logical devices – see my previous post on paths.
  3. NMP decides which PSP to use with the logical device.
  4. Storage framework (VM) tells NMP an I/O is ready to send.
  5. I/O is issued.
  6. PSP is selected. Load-balances if applicable.
  7. I/O is sent to  device.
  8. Success:Device driver (Storage array) indicates I/O is complete. Failure: NMP calls appropriate SATP.
  9. Success: NMP tells PSP I/O is complete. Failure: SATP interprets error codes and fails over to inactive paths.
  10. Failure: PSP is called again to select which path to use for I/O – excluding the failed path.

That pretty simple right? Well not exactly because there is no way to visualize that process unless you had read about the PSA before.

So here is a short video to help you out. I hope this makes the topic easier to understand.

VMware Storage Basics – Explanation + Video

I was working on masking some storage paths some days ago and after finding the VMware KB I realized that I needed to create a blog post about storage and how it relates to VMware before I go and explain how to mask paths and why its is done.

This will be the first of three posts regarding storage in VMware, leading up to a explanation of masking storage paths in VMware.

The Basics:

You can put the components in a VMware storage array connection in small list:

  1. LUN ID.
  2. Datastore UUID.
  3. LUN Paths.
  4. Adapter identifier
  5. Target identifier (Storage array)

1. Every LUN has an ID to identify between them.

  • NAA & EUI: Most common ¬†are the NAA, that’s Network Address Authority ¬†Identifier, or a EUI, that’s Extended Unique Identifier.

  • MPX : Local Datastores have another UUID, for example mpx.vmhba33:C0:T1:L0. This is only for local LUNs and can be used in the exact same way as the NAA identifier.
  • VML: Can be used interchangeably with the NAA identifier and the MPX identifier. Used for operations such as vmkfstools.

2 .Every VMFS volume has UUID, a Universally Unique Identifier that is assigned at creation.

3 .Then you got Paths to a LUN. Thats the connections from the ESXi host to the Storage Array.

Most commonly you have 4-8 paths, at least 2 to have some kind of failover.

Paths are identified with something like this: vmhba1:C0:T1:L1. What does these numbers mean?

  • vmhba1= Name of the storage adapter on the physical ESXi host. Most commonly you got two, vmhba1 and vmhba2.
  • C=Channel, or the storage channel. Some controllers support multiple connections to the same controller.
  • T=Target, this is a number made up by the host for the Datastore’s paths, so in a 4 path Datastore (with 2 HBAs) you would get :vmhba1:C0:T0:L1,vmhba1:C0:T1:L1,vmhba2:C0:T0:L1,vmhba2:C0:T1:L1.
  • Please note that these target number are NOT shared between hosts and can change between reboots. Really its the NAA(or EUI) that lets you know what datastore you are referring to.
  • L=LUN number. A number provided by the storage system.

4. Adapter identifier is made from the WWNN (world wide node name) and WWPN (world wide port name). The format is fc.200100e08ba5ff63:210100e08ba5ff63. you can see this in the vSphere client:

This is a ID (WWN) for the ESXi host adapter. This adapter ID is used mainly when zoning storage arrays and hosts together.

I want to explain what you see in the picture:

First you have the Paths (vmhba2:C….), as you can see this datastore has 4 paths to it (8 if the dead ones are counted…).

All paths has a different Target identifier as this Storage array has 4 HBA cards (IBM SVC).

5. Target identifier has the same format as the adapter identifier. Can also be seen in the vSphere client.

This is an ID for the storage array. Also used mainly in zoning.

Finally I wanted to add a video how these storage components work together.

*Sidenote : Most common ESXi 5.0 CLI commands to get some storage data are:

  • esxcli storage core path list : Generates a list of all LUN paths currently connected to the ESX host.
  • esxcli storage core device list: Generates a list of LUN’s currently connected to the ESX host.
  • esxcli storage vmfs extent list : Generates a list of extents for each volume as well as providing the mapping from the device name to the UUID.
  • esxcli storage filesystem list: Generates a compact list of the LUNs currently connected to the ESX host, including VMFS version.
  • ls -alh /vmfs/devices/disks:¬†¬†Lists the possible targets for certain storage operations.

Next post will go deeper into VMware PSA (Pluggable Storage Arcitecture) and how it works.