Last week I received a very interesting question from a fellow blogger asking whether it was possible to "pause" (not suspend) a virtual machine running on ESXi. Today, ESXi only supports the suspend operation, which saves the current memory state of a virtual machine to disk. With a "pause" operation, the memory state of the virtual machine is not saved to disk; it is preserved in the physical memory of the ESXi host. The key difference is that the allocated memory is never released, which allows you to resume a paused virtual machine almost instantly, at the cost of holding onto that physical memory.
The use case for this particular request was also quite interesting. The user had an NFS server housing about 200 virtual machines that needed to be restarted, and the goal was to minimize the impact to his virtual machines as much as possible. He ruled out suspending the virtual machines, as it would have taken too long, and decided on a more creative solution: he filled up the remaining capacity on the datastore, which in effect caused all the virtual machines to halt their I/O operations. Though not an ideal solution IMHO, this allowed him to restart the NFS server and then run a script for the virtual machines to retry their I/O operations once the NFS server was available again.
Based on the above scenario, he asked if it was possible to "pause" the virtual machines, similar to a capability Hyper-V provides today, which would have given him a quicker way to resume them. Thinking about the question for a bit, I realized a virtual machine is just a VMX process running on ESXi, and I wondered whether this process could be paused like a UNIX/Linux process using the "kill" command. Well, it turns out it can be!
Disclaimer: This is not officially supported by VMware, use at your own risk.
Using the kill command, you can pause the VMX process by sending it the STOP signal, and resume it by sending the CONT signal. Before getting started, you will need to identify the PID (Process ID) of the virtual machine's VMX process.
There are two methods of identifying the parent VMX PID; the easiest is the following ESXCLI command:
esxcli vm process list
The PID for the virtual machine will be listed under "VMX Cartel ID". In this example, I have a virtual machine called vcenter51-1, and on the right I am pinging the system to verify it is up and running. An alternative way of identifying the PID is to use "ps" by running the following command:
ps -c | grep -v grep | grep [vmname]
Note: Make sure you identify the parent PID of the virtual machine if you are using the above command as you will see multiple entries for the different VMX sub-processes.
To pause the VMX process, run the following command (substitute your PID):
kill -STOP [pid]
To resume the VMX process, run the following command:
kill -CONT [pid]
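Putting the steps together, here is a minimal sketch of a helper that finds the parent VMX PID and pauses or resumes it. This is my own illustration, not a VMware tool: the vmx_pid function name is made up, and the awk parsing assumes the usual `esxcli vm process list` layout (display name on an unindented line, indented "Key: Value" fields below it, including "VMX Cartel ID"). Verify the output format on your own host before relying on it.

```shell
#!/bin/sh
# Minimal sketch: pause/resume a VM by its parent VMX PID.
# Unsupported by VMware; use at your own risk.

# Read `esxcli vm process list` output on stdin and print the
# "VMX Cartel ID" (parent VMX PID) for the VM named in $1.
vmx_pid() {
    awk -v vm="$1" '
        /^[^ ]/ { in_block = ($0 == vm) }   # unindented line = a VM display name
        in_block && /VMX Cartel ID:/ { sub(/.*: */, ""); print; exit }
    '
}

# Example usage on an ESXi host (commented out so the sketch stays inert):
# pid=$(esxcli vm process list | vmx_pid vcenter51-1)
# kill -STOP "$pid"   # pause: VMX process frozen, memory stays resident
# kill -CONT "$pid"   # resume: VM continues exactly where it left off
```

The helper only parses text, so it can be sanity-checked off-host by piping it saved `esxcli` output.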
Here is a screenshot of pausing and then resuming the virtual machine. You can also see where the pings stop while the virtual machine is paused and then resumed. Once the virtual machine was resumed, it picked up exactly where it left off, with no issues as far as I can tell.
Note: I have found that if you have VM monitoring enabled, there may be issues resuming the virtual machine. This should only be done if you have VM monitoring disabled, as the monitoring may not be aware that the VMX process is being paused on purpose.
Though it is possible to pause a virtual machine, I am not sure I see too many valid use cases for this capability. If you believe there are use cases where it would actually be beneficial, feel free to leave a comment. For now, this is just another neat "notsupported" trick 😉
I know another hypervisor who does it natively... 😉
I found this trick similar to the vMotion 'stun' operation where the source VM is 'paused' while QuickResume transmits the remaining memory pages to the destination VM which is now live.
Could it be that the 'stun' operation is just a kill -STOP followed by a kill -CONT when the vMotion is completed!?
William Lam says
No, that is FSR (Fast Suspend & Resume) http://www.yellow-bricks.com/2011/04/13/vmotion-and-quick-resume/#comment-23924
This completely pauses the VMX process.
BTW, you can get rid of the grep -v grep pipe by enclosing the first letter (or first few even) of your vmname in square brackets like this:
ps -c | grep [v]mname
It's hard to explain concisely why this works, but the gist is that the pattern [v]mname matches the plain string "vmname", while grep's own command line contains the literal text "[v]mname", which the pattern does not match, so grep never finds its own process entry. It works even in the ash shell on ESXi. There isn't a massive performance difference or anything, of course, but it does get rid of one pipe and the second grep invocation.
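To illustrate the commenter's trick outside of ESXi, here is a quick demo (the "vmname" string is just a stand-in): the pattern [v]mname matches the literal string "vmname", but not the bracketed text "[v]mname" that shows up in grep's own entry in ps output.

```shell
#!/bin/sh
# Demo of the grep character-class trick. [v] is a one-character class
# that matches only "v", so the pattern [v]mname matches "vmname"...
echo 'vmname' | grep -c '[v]mname'               # prints 1 (match)

# ...but the literal text "[v]mname" (as it appears in grep's own command
# line in `ps` output) does not contain "vmname", so it never matches:
echo '[v]mname' | grep -c '[v]mname' || true     # prints 0 (no match)

# Hence on ESXi:  ps -c | grep [v]mname  needs no `grep -v grep` stage.
```

The `|| true` is only there because grep exits non-zero when it finds nothing.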
Thanks for this site btw, been a big fan for a long time.
Would it work to limit the CPU for the VM to 0 MHZ?
Instead of writing those complicated commands every single time you can simply put even more complicated one-liners in /etc/profile.local and forget all that gibberish. Then you just use 'pausevm vmname' or 'unpausevm vmname' or even 'checkpausevm vmname'.
What is somewhat cool in my solution is that you can use several VMs as the arguments, e.g. 'pausevm vmname1 vmname2 vmname3' and if a VM's name got spaces you just double-quote it, like 'pausevm vmname1 "vmname with spaces" "another vmname"' etc.
Tested on the latest ESXi.
pausevm() (for vm in "$@"; do pid=`esxcli vm process list | grep -A 3 "^$vm$" | tail -n 1 | sed 's/.*: //'`; [ -n "$pid" ] && kill -STOP $pid || echo "$vm not found"; done)
unpausevm() (for vm in "$@"; do pid=`esxcli vm process list | grep -A 3 "^$vm$" | tail -n 1 | sed 's/.*: //'`; [ -n "$pid" ] && kill -CONT $pid || echo "$vm not found"; done)
checkpausevm() (for vm in "$@"; do pid=`esxcli vm process list | grep -A 3 "^$vm$" | tail -n 1 | sed 's/.*: //'`; [ -n "$pid" ] && (ps -Ccs | grep -q "^$pid.*USIG" && echo "$vm is paused" || echo "$vm is running") || echo "$vm not found"; done)
In case this blog is incapable of showing it right, here is the snippet: http://pastebin.com/si6SD6FC.
Good to know, thanks.
My boss, who uses Xen, is always talking about how quickly he can "pause" his VMs, while my VMware suspends take ages to execute and another long time to bring the VM up again.
This pausing is useful for storage or network maintenance, as we do not have vMotion to quickly get a VM out of the way.
Thanks a lot for your one-liners, anonymous from june tenth. That's very helpful.
For my messy environment, I added an "-i" to grep so it is case insensitive.
Rizul Khanna says
I read in this article, under "The .vmss file" section, that if a datastore is filled, the VMs will be suspended: http://searchvmware.techtarget.com/tip/Understanding-the-files-that-make-up-a-VMware-virtual-machine
But as per your article, "He filled up the remainder capacity on the datastore which in effect caused all virtual machines to halt their I/O operations". I believe this situation should actually suspend his VMs, and he should not be able to 'pause' his VMs. What's your opinion, William?
I appreciate your time and patience.
I wonder what repercussions this has regarding timekeeping, cpu timers and so on.
Ron Hawkins says
Would a use case like this be an appropriate use of the pause-VM methodology? We are swapping in a bypass router to allow for some fairly significant network router maintenance. A lot of our VMs reside on NetApp CDOT NFS storage. Our concern is whether the NFS timeout value (90 seconds in our case) will be enough to handle the hot swap of the routers. It has been proposed that we select any VMs with storage on that CDOT cluster and suspend them, but suspend would seem to take way too long. The pause methodology described above would seem to provide a way to quickly stop VMs before the hot swap and then resume them?
I have noticed that if a VM has been 'kill -STOP'-ed for too long, e.g. 10 minutes, then ESXi can run into a seriously bad scenario: a PSOD (Purple Screen of Death). My ESXi version is 5.5.
Here is the vmkernel coredump log:
2017-03-15T03:49:04.787Z cpu24:32823)@BlueScreen: PCPU 29: no heartbeat (2/2 IPIs received)
2017-03-15T03:49:04.787Z cpu24:32823)Code start: 0x41801b400000 VMK uptime: 0:00:12:49.057
2017-03-15T03:49:04.788Z cpu24:32823)Saved backtrace from: pcpu 29 Heartbeat NMI
2017-03-15T03:49:04.788Z cpu24:32823)0x4123af95de68:[0x41801b9894a8][email protected]#+0x190 stack: 0x4123af95df38
2017-03-15T03:49:04.788Z cpu24:32823)0x4123af95deb8:[0x41801b989bbf][email protected]#+0x14f stack: 0x80
2017-03-15T03:49:04.789Z cpu24:32823)0x4123af95df08:[0x41801b97ff0e][email protected]#+0x282 stack: 0x4123af95df28
2017-03-15T03:49:04.789Z cpu24:32823)0x4123af95df18:[0x41801b4aa78d][email protected]#nover+0x1d stack: 0x3ffcc3a4a70
2017-03-15T03:49:04.789Z cpu24:32823)0x4123af95df28:[0x41801b4f1064][email protected]#nover+0x64 stack: 0x10b
2017-03-15T03:49:04.792Z cpu24:32823)base fs=0x0 gs=0x418046000000 Kgs=0x0
2017-03-15T03:48:51.730Z cpu29:35813)NMI: 709: NMI IPI received. Was eip(base):ebp:cs [0x58949b(0x41801b400000):0x4123af95de68:0x4010](Src 0x1, CPU29)
Sorry, but this is a hack... In the first place, if you need to restart your NFS server, then you cleanly power off your VMs, period. What's next? Am I going to call BMW and explain that I want to change my rear left wheel, but I have to find a way to do it while the vehicle is moving? <_< Such ways of thinking/proceeding are nonsense, in my opinion.