VCDX: Troubleshooting Scenario

***UPDATE*** This section of the VCDX defense has been removed. Instead you will get an extra 15 minutes for the Design Scenario.

This is the last part of the journey. Now you only need to troubleshoot a problem as methodically as possible and you have reached the finish line.

Let me explain.

Most real-life troubleshooting sessions with customers have a single goal: a solution, preferably with a root-cause analysis and recommendations to mitigate the risk of future incidents. For the troubleshooting scenario, this habit needs to be unlearned to a degree.

The VCDX troubleshooting scenario gives you one to three slides describing a predetermined scenario, and those slides contain a predetermined amount of information. Your role is to ask the right questions to get better information for analysing the fault. To show expert-level knowledge, explain why you asked each question or why you requested a particular change.

The VCDX blueprint also has a pretty good explanation:
Respond interactively to a presentation of a customer problem to show analytical skills and deep product knowledge, especially an understanding of how the components work and interact.

Here is an example:

  • What virtual SCSI controllers does the VM have?
    • The reason I’m asking is that the controller type can sometimes account for performance degradation on high-IO machines, and it also determines how large the IO queue for that particular disk is.
  • How is the Fibre Channel zoning configured? Is it using single initiator-single target or something specific?
    • The reason I’m asking is that arrays that do not have ALUA capabilities tend to experience path thrashing if the zoning is incorrectly configured or the Load Balancing option within the Storage Kernel is misconfigured.

As you can see, it soon becomes very hard to cover everything, so it’s best to read the slides and decide on a specific technology stack to focus on. Then move on to the next stack, even if you went for the “right” one in the beginning. Make sure to propose a plan of action and explain why that specific set of changes should be made.

I found Rene Van Den Bedem’s method, which he explains in this blog post, extremely helpful.

As for the troubleshooting process, it worked best for me to start from the VM side and troubleshoot towards the physical infrastructure.

I took the method from Rene’s post and personalised it to a degree. Since I had to troubleshoot methodically, I thought it would be best to create a whiteboard layout to use in the scenario.

Here is a picture of that layout:


I need to explain the picture or it will not make any sense… Not my favorite kind of picture but it is necessary.

On the left are the locations of ESXi and vCenter specific logs (from Rene’s post). They are there simply so I would read them again and again over the course of the troubleshooting mock scenarios. This list can also include track-specific (DT-NV-Cloud) log locations.

Below that is a box with lots of recommendations from various sources (including Rene’s post).

The large box with the colored boxes is the whiteboard diagram. The boxes cover the technology stacks: core vSphere, Management, Storage, CPU/RAM Scheduling and Network. A similar layout for other tracks could include a larger management box with the corresponding management components (vCloud/vCNS, Horizon components, vRA components etc.). The left-hand side included a Q&A area for initial questions, Notes if there were any, and a Change Log for any changes I requested to the environment.
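For anyone who prefers to rehearse the layout away from a whiteboard, here is a rough Python sketch of the same idea. It is only an illustration: the `Whiteboard` class and its method names are hypothetical and not part of Rene’s method or any VMware tool. It tracks the Q&A, Notes and Change Log areas and shows which technology stacks have not been questioned yet.

```python
from dataclasses import dataclass, field

# Technology stacks named in the post, in the order they are listed there.
STACKS = ["Core vSphere", "Management", "Storage", "CPU/RAM Scheduling", "Network"]


@dataclass
class Whiteboard:
    """Hypothetical model of the whiteboard layout (illustration only)."""
    qa: list = field(default_factory=list)          # (stack, question, reason)
    notes: list = field(default_factory=list)       # free-form notes
    change_log: list = field(default_factory=list)  # (stack, change, reason)
    asked: dict = field(default_factory=lambda: {s: 0 for s in STACKS})

    def ask(self, stack, question, reason):
        """Record a question together with the reason for asking it."""
        if stack not in self.asked:
            raise ValueError(f"unknown stack: {stack}")
        self.qa.append((stack, question, reason))
        self.asked[stack] += 1

    def request_change(self, stack, change, reason):
        """Record a requested change and why it should be made."""
        if stack not in self.asked:
            raise ValueError(f"unknown stack: {stack}")
        self.change_log.append((stack, change, reason))

    def untouched_stacks(self):
        """Return the stacks that have had no questions asked so far."""
        return [s for s in STACKS if self.asked[s] == 0]


board = Whiteboard()
board.ask(
    "Storage",
    "What virtual SCSI controllers does the VM have?",
    "Controller type can affect performance and IO queue size on high-IO VMs.",
)
print(board.untouched_stacks())
```

Forcing every question and change request to carry a reason mirrors the point of the scenario: the explanation matters as much as the question itself.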

The plan was to have at least something similar to that layout during the scenario itself. In my scenario I never had a chance to draw all of the boxes, not even close. But I used the layout during the troubleshooting mocks, and I think it helped with the methodology of the process. I also practiced drawing the layout on a smallish whiteboard so that I could draw it quickly.

The right side included boxes that are again from Rene’s blog (which I hope you have at least read by now since I’ve mentioned it 4 times now 🙂 ) and they include potential investigation paths to follow.

I also recommend doing at least 3-4 troubleshooting mocks before the defense, because in the first one I went through I just froze. You do not want that to happen on defense day.

Reading about other people’s past experiences and recommendations was also extremely helpful.
