UPDATE 07/13/2012 - vSphere 5.0 Update 1a has just been released and resolves this issue. Please take a look here for more details on the patch; this script is no longer required.
Duncan Epping recently wrote an article, Clarifying the SvMotion / VDS problem, in which he describes the scenario that impacts your VMs as well as a way to remediate them. I would recommend going through Duncan's article before moving any further.
The challenge now is how to easily identify all VMs that are currently impacted by this problem in your environment. The answer, of course, is automation and leveraging the vSphere API! I created the following vSphere SDK for Perl script called querySvMotionVDSIssue.pl, which searches for all VMs that are connected to a VDS and checks whether or not their expected dvPortgroup file exists in the appropriate datastore. To use the script, you just need a system with the vCLI installed, or you can use the vMA appliance.
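For readers curious about the detection logic, here is a minimal pure-Python sketch of the check the script performs, not the actual Perl code. It assumes the state files live under a .dvsData/&lt;switch-uuid&gt;/&lt;port-key&gt; path on the VM's configuration datastore; the function and parameter names are hypothetical stand-ins for data the vSphere API returns.

```python
def impacted_nics(config_datastore_files, nics):
    """Return the VDS-backed vNICs whose dvport state file is missing.

    config_datastore_files -- set of paths found on the VM's configuration
                              datastore, e.g. '.dvsData/<switch-uuid>/<port-key>'
    nics -- list of (switch_uuid, port_key) tuples, one per VDS-backed vNIC
    """
    missing = []
    for switch_uuid, port_key in nics:
        expected = ".dvsData/%s/%s" % (switch_uuid, port_key)
        if expected not in config_datastore_files:
            # No state file for this dvport: the VM is impacted on this vNIC
            missing.append((switch_uuid, port_key))
    return missing
```

A VM is reported as impacted if this returns a non-empty list for its configuration datastore.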
UPDATE: The script has now been updated to support remediation for VMs connected to both a VMware VDS and a Cisco N1KV. The solution, thanks to one of our internal engineers, was to "move" the VM's dvport from one dvport to another, all while staying within the existing dvPortgroup, which also forces the creation of the .dvsdb port file. Once the dvport move has completed successfully, we move the VM back to the original dvport it initially resided on. We no longer have to rely on creating a temporary dvPortgroup, and best of all, we can now remediate both VDS and N1KV. The script now combines both the "query" and "remediation" functions into a single script. Please take a look at the examples below on usage.
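The move-and-move-back sequence above can be modelled with a short pure-Python sketch. This is only an illustration of the ordering of the two moves, not a vSphere API call; the function name and inputs are hypothetical.

```python
def remediation_moves(vnic_port, free_ports):
    """Return the sequence of dvport moves for one vNIC:
    original -> temporary (forces the .dvsdb port file to be written),
    then temporary -> original, all within the same dvPortgroup."""
    if not free_ports:
        raise RuntimeError("no free dvport available in the dvPortgroup")
    temp_port = free_ports[0]
    return [(vnic_port, temp_port), (temp_port, vnic_port)]
```

Each tuple represents one reconfiguration of the vNIC's dvport binding; the second move restores the VM to the dvport it started on.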
Disclaimer: This script is not officially supported by VMware. Please test it in a development environment before using it on production systems.
Here is a sample output of the script running in "query" mode:
Only impacted VMs will be listed in the output. For remediation, I have combined the remediation script into the query script: if you wish to remediate ALL VMs that were listed as impacted, you can specify the --fix flag with the value "true". This will go ahead and remediate all impacted VMs that were listed before.
Here is a sample output of the script running in "remediation" mode:
In the screenshot above, you may have noticed a few interesting details with VM3 and VM4. If you run out of dvports in a dvPortgroup, the script will automatically increase the number of ports to satisfy the swap (a maximum of 10, due to the number of ethernet interfaces a VM can have). Once the VM has been remediated, the dvPortgroup will be reconfigured back to its originally configured number of ports, as shown with VM3.
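The temporary port-count bump can be sketched as a small calculation, assuming hypothetical names; the script would apply the returned value to the dvPortgroup, perform the swap, and then restore the original count.

```python
MAX_EXTRA_PORTS = 10  # cap from the post: max ethernet interfaces per VM

def temporary_num_ports(configured, used, nics_to_fix):
    """Return the numPorts value needed for the swap, or the current
    configured count if there are already enough free dvports."""
    free = configured - used
    if free >= nics_to_fix:
        return configured
    extra = min(nics_to_fix - free, MAX_EXTRA_PORTS)
    return configured + extra
```

For example, a fully used 128-port dvPortgroup with two vNICs to fix would be grown to 130 ports for the duration of the remediation.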
If you have an impacted VM that is connected to an ephemeral dvPortgroup, we will not be able to remediate it due to the nature of how ephemeral binding works. You will get a message on the specific interface, and you will need to remediate manually using the steps outlined by Duncan or using the "old" remediation script, which creates a temporary dvPortgroup (again, this works for the VMware VDS only).
If you run into any issues or have questions, feel free to leave a comment.
Luke says
Nice! The first script lists several of my VMs as victims of this. I'll need to adapt the second script for my environment, though, since we use the n1kv. I guess I can use PowerCLI to create a vSwitch on the host and do the same.
William says
@Luke,
I did not have access to a N1KV, but it looks like for the N1KV you will need to create the portgroup profiles directly on the VSM and not at the vSphere layer.
You should be able to use PowerCLI to create a vSwitch on the same host with the same VLAN configuration of the dvPortgroup and then reconfigure the VM.
William says
@Luke,
I've just updated my script; its implementation uses a different method of remediation, which will allow you to remediate both VDS and N1KV. Please refer to the updated post for more details. Thanks
Luke says
I thought about using pcli to create a svs, but changing dvports is much nicer! Thanks!
pietia7 says
Hi, I experienced this issue when svmotioning machines from a datastore that was removed later on. I got errors that dvport state info could not be saved because the file containing info about the dvport was gone. In my situation, editing port settings on the dvPortgroup for the affected machine (I added a temporary description for that particular port and then removed it) recreated the file in the proper location on the new datastore. This might be a better solution for your remediation script, since you don't need to create any temporary dvPortgroup. No VM network downtime as well.
William says
@pietia7,
I've actually tried this as well and it does NOT resolve the problem. Perhaps you had a different problem, since the dvportgroup state file could not be created initially? The only method I'm currently aware of is a complete network reconfiguration at the VM level for that file to be regenerated.
Dan Barr says
Hi William, thanks for the script. I did notice a false-positive issue with it, related to VMs that have virtual disks on multiple datastores. Take a VM with its VMX file and VMDK1 on Datastore1, and VMDK2 on Datastore2. Your script is popping positive for that VM, I'm assuming because vDS info was not found on Datastore2 (but it is present on Datastore1, which I manually confirmed).
William says
@Dan,
Thanks for reporting. I've just updated the script; you're right, it should only be looking at the VM's configuration datastore and not all of its datastores. Let me know if you're still having any issues. Thanks
Dan Barr says
Looking good, thanks William!
kwinsor says
I just finished a SAN storage migration. I have vCenter 5 and ESXi 4.1 Update 2. None of the datastores on the new storage contain the dvsData folder; the folders remained on the old datastores and are still being written to. If I run your script, it comes back with no VMs listed as having a problem. I want to remove the old datastores, but I want to make sure this will not cause any problems first. Any ideas why the script is not detecting any problems?
William says
@kwinsor,
From my understanding, this only impacts vCenter Server 5 and ESXi 5 which is also mentioned in the KB - http://kb.vmware.com/kb/2013639
I have an if statement that checks for the version of ESXi on line 96, with the corresponding closing bracket on line 143.
I would recommend contacting VMware Support to ensure you won't be impacted before removing the datastore.
kwinsor says
Thanks William....will do.
Revoklat says
Hi,
Thanks for the script. We have quite a few machines impacted by this bug and I would like to run your script to fix this. One thing; I am wondering if network connectivity is lost (briefly?) when 'fixing' a VM with the script. Our machines are used in production (mail/sql/application servers).
Thanks
William says
Hi Revoklat,
Once the task is kicked off, if you lose your connection the task will already have been sent and executed on the server. You can just re-run the script, and if you enable the fix param it will only remediate impacted VMs, so if the previous VM was fixed before the disconnect it won't need to be re-run.
Bob says
Hi,
When the VMs are fixed via the script, do they drop off the network for a short time or is the fix no impact?
Thanks,
Bob
William says
@Bob,
From my testing there was no impact; I ran a continuous ping to see if any packets were dropped, and there were none. The dvPortgroup network configuration is not modified; the VM is just moved to a different dvport ID within the same dvPortgroup.
Mark Strong says
In my environment the script runs OK but does not change the dvPort number:
vi-admin@VMA001:~> ./querySvMotionVDSIssue.pl --server 10.1.1.1 --username root
Searching for VM's with Storage vMotion / VDS Issue ...
TESTVM01 is currently impacted
vi-admin@VMA001:~> ./querySvMotionVDSIssue.pl --server 10.1.1.1 --username root --fix true
Searching for VM's with Storage vMotion / VDS Issue ...
TESTVM01 is currently impacted
Remediating TESTVM01
Moving from dvPort: 1153 to dvPort: 1153
Moving from dvPort: 1153 back to dvPort: 1153
Moving from dvPort: 1216 to dvPort: 1216
Moving from dvPort: 1216 back to dvPort: 1216
Remediation complete!
vi-admin@VMA001:~> ./querySvMotionVDSIssue.pl --server 10.1.1.1 --username root
Searching for VM's with Storage vMotion / VDS Issue ...
TESTVM01 is currently impacted
Please help. Thank you.
Antonis Kopsaftis says
Hi,
I have the same problem as Mark Strong above. I run the script and it finds some impacted VMs. I run it with the fix parameter and the remediation procedure completes.
But after I run the script again, the same VMs appear to be impacted.
I am running ESXi 5.0 (build 721882), updated yesterday with Update Manager.
I have a feeling that with the previous ESXi 5 version before the upgrade (unfortunately I cannot remember the version number) the script was working without any problems.
William says
@Mark & @Antonis,
What type of port binding is the VM connected to and how many dvPorts are in the DvPortgroup?
Antonis Kopsaftis says
I'm using static binding for all the VMs. Also, all my dvPortgroups have 128 dvports, and there are plenty of free dvports in each dvPortgroup.
I would like to note that I succeeded in remediating my impacted VMs manually from the vSphere Client.
After that, the script does not report any impacted VMs.
Mike Evans says
Did either of you get the script to work for you? I just ran across this issue and found this post. I am trying to test this against a host that has 17 impacted virtual machines. The port groups are all configured for static binding. After remediation, the virtual machines still show as impacted.
Mike Evans says
I figured it out. I was trying to test by running this directly against a host. When I ran the script and specified the --fix and --vmname switches, I was able to remediate the VMs I was testing against, and I verified they were remediated by querying the host directly before and after. Thanks, this was a big help!
Manas Barooah says
On a similar note, I am finding an issue with VM creation from an OVA file. During the vSphere Client 5.5 "Deploy OVF Template" task, VM creation fails if we map the vNIC of the VM to a Distributed Virtual Portgroup for the ESX host. The ESX host/vCenter is 5.5. As a workaround, we need to move the vNIC to a vSwitch portgroup, and this creates the VM. After the VM is created, we can move the vNIC to the Distributed Virtual Portgroup on the same ESX host, power on the VM, and things work fine.
Any idea what could be the cause for this? We also use the VI Java API to create the VM, and that process also fails during OVF parsing.