Virtual Machine Sizing - Increasing the Number of Virtual CPUs (VMware)

Not all workloads are the same. Riva can be configured in an Active/Active/n-Active cluster. When scaling horizontally is not an option and increasing the number of virtual CPUs is preferred, it is important to factor in possible virtualization overhead. A virtual machine configured with more than 8 vCPUs is often called a "monster VM".

A "monster virtual machine" is a virtual machine (VM) that typically has more than eight virtual CPUs (vCPUs) and may be configured with more than 255 GB of virtual RAM. Monster VMs are used to virtualize applications with large resource needs, such as Microsoft Exchange, Microsoft SharePoint, Microsoft SQL Server, or Oracle databases. The term "monster VM" is also commonly used for a VM that uses more cores than are available on a single physical socket. For example, on a VM host with 4 sockets and 8 cores per socket, a virtual machine that uses more than 8 vCPUs is considered a "monster VM".
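The two usages of the term described above can be expressed as a small check. This is only an illustrative sketch: the function name is_monster_vm and its parameters are hypothetical and are not part of any VMware tool or API.

```python
def is_monster_vm(vcpus: int, cores_per_socket: int, vcpu_threshold: int = 8) -> bool:
    """Return True if a VM matches either usage of "monster VM" described above.

    vcpus            -- number of vCPUs configured on the VM
    cores_per_socket -- physical cores in one socket of the VM host
    vcpu_threshold   -- the "more than eight vCPUs" rule of thumb
    """
    exceeds_threshold = vcpus > vcpu_threshold   # more than 8 vCPUs
    spans_sockets = vcpus > cores_per_socket     # more cores than a single socket offers
    return exceeds_threshold or spans_sockets


# Host from the example above: 4 sockets, 8 cores per socket.
print(is_monster_vm(vcpus=8, cores_per_socket=8))    # fits within one socket
print(is_monster_vm(vcpus=12, cores_per_socket=8))   # needs more than one socket
```

On a host with fewer cores per socket, a VM can qualify under the second usage well before it reaches eight vCPUs.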

VMware vSphere Blog - When to overcommit VCPU:pCPU for Monster VMs - http://blogs.vmware.com/vsphere/2014/02/overcommit-vcpupcpu-monster-vms.html

The following text is summarized in case the above link becomes unavailable. For all details, refer to the original source.

Troubleshooting a Virtual Machine That Has Stopped Responding: VMM and Guest CPU Usage Comparison (1017926)

Purpose

Virtual machines depend on available host resources (CPU, Memory), and the guest operating system consumes those resources. A problem with resource availability or scheduling inside or outside the virtual machine may cause it to become unresponsive.

This article provides steps for using CPU performance metrics to determine whether a guest operating system is actually running, whether the virtual machine monitor (VMM) is running, or whether there is scheduling contention.

Note: This article is part of a series. For more information, see the parent article Troubleshooting a virtual machine that has stopped responding (1007819).

Resolution

Four virtual machine CPU performance metrics can be used together to gain insight into the responsiveness of a virtual machine or its Guest OS:

Run: Amount of time the virtual machine is consuming CPU resources.
Wait: Amount of time the virtual machine is waiting for a VMkernel resource.
Ready: Amount of time the virtual machine was ready to run, waiting in a queue to be scheduled.
Co-Stop: Amount of time an SMP virtual machine was ready to run, but incurred delay due to co-vCPU scheduling contention.

These performance metrics can be reviewed using the Performance tab in the vSphere Client or using the esxtop or resxtop command-line utilities. Choose the most appropriate method for your environment.

Reviewing performance metrics using the vSphere Client

For more information on using custom performance charts, see the Customizing Chart Views section of the Resource Management Guide or the View Advanced Performance Charts section of the vSphere Monitoring and Performance Guide.

1. Connect to vCenter Server or an ESX/ESXi host using the vSphere Client.
2. Select the target virtual machine in the inventory.
3. Click the Performance tab.
4. Click Chart Options to customize the performance chart.
5. Under the CPU heading, select Real-time.
6. Under the Chart Type heading, select Line Graph.
7. Under the Objects list, select the virtual machine by name.
8. Under the Counters list, select Co-stop, Run, Ready, and Wait.
9. Optionally, save the chart settings to make re-use easier.
10. Click OK.
11. Make a note of the four metrics displayed. Each is measured in milliseconds.
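The chart reports each counter in milliseconds per sample interval, while the interpretation that follows uses percentages (%RUN, %WAIT, %RDY, %CSTP). A minimal sketch of the conversion is shown here. It assumes the real-time chart samples every 20 seconds, which is not stated in this article and should be confirmed for your environment. The function name and the sample values are hypothetical.

```python
# Assumption: the real-time performance chart uses a 20-second sample interval.
REALTIME_INTERVAL_SECONDS = 20


def summation_to_percent(value_ms: float,
                         interval_seconds: float = REALTIME_INTERVAL_SECONDS) -> float:
    """Convert a millisecond counter value to a percentage of the sample interval."""
    return value_ms / (interval_seconds * 1000.0) * 100.0


# Hypothetical readings noted from the chart for a single-vCPU virtual machine.
sample_ms = {"Run": 9000, "Wait": 8000, "Ready": 2000, "Co-stop": 1000}

for name, ms in sample_ms.items():
    print(f"{name}: {summation_to_percent(ms):.1f}%")
```

If a chart with a different interval is used, pass that interval in seconds as the second argument.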

Interpreting CPU performance metrics

Run, %RUN:

This value represents the percentage of absolute time the virtual machine was running on the system.
If the virtual machine is unresponsive, %RUN may indicate that the guest operating system is busy conducting an operation.

Wait, %WAIT:

This value represents the percentage of time the virtual machine was waiting for some VMkernel activity to complete (such as I/O) before it can continue.
If the virtual machine is unresponsive and the %WAIT value is proportionally higher than %RUN, %RDY, and %CSTP, then it can indicate that the virtual machine's world (the VMkernel's schedulable entity) is waiting for a VMkernel operation to complete.

Ready, %RDY:

This value represents the percentage of time that the virtual machine is ready to execute commands, but has not yet been scheduled for CPU time due to contention with other virtual machines.

Co-stop, %CSTP:

This value represents the percentage of time that the virtual machine is ready to execute commands but is waiting for multiple CPUs to become available, because the virtual machine is configured to use multiple vCPUs.
If the virtual machine is unresponsive and %CSTP is proportionally high compared to %RUN, it may indicate that the ESX host has limited CPU resources and cannot simultaneously co-schedule all vCPUs in this virtual machine.
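The comparisons described above come down to asking which of the four percentages is proportionally highest. The sketch below illustrates that reasoning only. The function name, the hint wording, and the sample values are hypothetical, and a real diagnosis should weigh the metrics together rather than rely on a single maximum.

```python
# Hints paraphrased from the interpretation text above.
HINTS = {
    "%RUN": "guest operating system may be busy conducting an operation",
    "%WAIT": "waiting for a VMkernel operation (such as I/O) to complete",
    "%RDY": "ready to run but not scheduled due to contention with other virtual machines",
    "%CSTP": "host may lack CPU resources to co-schedule all vCPUs of this virtual machine",
}


def dominant_state(run: float, wait: float, ready: float, costop: float) -> str:
    """Return the name of the proportionally highest metric."""
    metrics = {"%RUN": run, "%WAIT": wait, "%RDY": ready, "%CSTP": costop}
    return max(metrics, key=metrics.get)


# Hypothetical readings for an unresponsive virtual machine.
state = dominant_state(run=10.0, wait=70.0, ready=15.0, costop=5.0)
print(state, "->", HINTS[state])
```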