vOpenData: An Open Virtualization Community Database

Recently, I had the opportunity to help out with a unique and cool project called vOpenData, which was created by Ben Thomas (a former VMware GSS Technical Engineer). The idea for the project was sparked by a very simple tweet that came from Duncan Epping:

Ben wanted to help answer Duncan’s question, but more importantly he wanted to help answer a bigger set of questions: what are some of the common virtual infrastructure deployment configurations, averages and consolidation ratios? These questions cross the minds of everyday vSphere administrators, architects and consultants, yet they are nearly impossible to answer beyond one’s own environment.

Ben reached out to me with his idea and asked if I could help develop a script to collect basic configuration information from a vSphere environment to help test out his idea. I was immediately intrigued with his idea and saw the huge potential value that Ben’s unique solution could bring to the virtualization community. The coolest thing about this project is that we were able to put together a working prototype within a week’s time!

Note: Also be sure to check out Ben's article vOpenData - Crunching everyone's data for fun and knowledge and his perspective on how he was able to quickly develop a prototype leveraging a PaaS solution.

What is vOpenData?

vOpenData is an open community project that grew from the question “What is the average VMDK size for deployed virtual machines?” We wanted to create an open community database that is driven purely by users submitting their virtual infrastructure configurations. By leveraging the powerful virtualization community and applying simple analytics, we are able to provide various trending statistics and data for virtualized environments. This is 100% community driven, the results will be available for everyone to view, and hopefully you will contribute to the overall dataset!

What information do we collect?

We made an effort to not collect specific information such as hostnames or even display names that could be used to identify a particular organization. Instead, we are using UUIDs, which are automatically generated by the virtualization platform, to uniquely identify a particular object. This allows us to keep track of changes in our database when a new data set is uploaded from an existing environment. In addition, we are collecting various configuration data, and you can find a complete list in the Data FAQs.

More info on the data we collect is here: Data FAQs
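
To illustrate why a stable UUID is enough to track changes between uploads, here is a minimal sketch. It is not the actual vOpenData backend: the `merge_upload` function, the `uuid` column name and the `num_cpu` field are all hypothetical and exist only for this example.

```python
def merge_upload(existing, upload):
    """Merge uploaded rows into `existing`, keyed by UUID.

    `existing` is a dict mapping UUID -> row (hypothetical storage),
    `upload` is a list of row dicts that each carry a "uuid" key.
    Returns (added, updated): the UUIDs that were new and the UUIDs
    whose stored row differed from the uploaded one.
    """
    added, updated = [], []
    for row in upload:
        key = row["uuid"]  # assumed column name, not from the real schema
        if key not in existing:
            added.append(key)
        elif existing[key] != row:
            updated.append(key)
        existing[key] = row  # the latest upload always wins
    return added, updated
```

Because no hostname or display name is involved, a second upload from the same environment updates the matching objects instead of duplicating them.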

What will this data be used for?

We are planning on using this data to create some interesting statistics and data modeling for the community to use in capacity planning and analysis. Most of this data will be made available through a dashboard or reports and eventually through an API to be mixed into other applications.

What about privacy concerns?

Though the data that is collected is already anonymized and non-identifying, please ensure that you are abiding by the privacy policies of your organization when uploading this data. If you are concerned about the data, we recommend that you audit the contents of the zip file, which are just CSV files, before uploading. We only ask that you do not modify the schema at all.
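
If you would like to audit the archive without unpacking or altering it, a short read-only script can list each CSV file with its column headers and row count. The sketch below is only an illustration: the `summarize_zip` helper is hypothetical, and the UTF-8 encoding is an assumption about how the scripts write their CSV files.

```python
import csv
import io
import os
import sys
import zipfile


def summarize_zip(zip_path):
    """Return {member name: (header columns, data row count)} per CSV.

    Members that are not CSV files map to None so they stand out.
    The archive is opened read-only and never modified.
    """
    summary = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                summary[name] = None
                continue
            with archive.open(name) as raw:
                # Encoding is an assumption; "utf-8-sig" also strips a BOM.
                text = io.TextIOWrapper(
                    raw, encoding="utf-8-sig", errors="replace", newline=""
                )
                reader = csv.reader(text)
                header = next(reader, [])
                rows = sum(1 for _ in reader)
            summary[name] = (header, rows)
    return summary


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "vopendata-stats.zip"
    if os.path.exists(path):
        for member, info in summarize_zip(path).items():
            print(member, info if info is not None else "(not a CSV file)")
```

Reviewing the printed headers is a quick way to confirm that no hostnames or display names are present before you upload.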

How do I get started?

Step 1 - Check out the sexy vOpenData Public Dashboard here to get a glimpse of some of the information you will find by submitting your configuration data.

Step 2 - Download either the PowerCLI or vSphere SDK for Perl script and run it against a vCenter Server. The script will produce a compressed zip file containing several CSV files. Instructions are available on the download page. You may rename the default file name vopendata-stats.zip to something else, as long as you do not modify the contents of the file.

Step 3 - Open a browser, go to http://www.vopendata.org and sign up for a new account.

Step 4 - Click on the “Infrastructures” tab at the upper left hand corner. An Infrastructure is a logical view that can help you organize the data you have collected. You can associate a single vCenter Server with an infrastructure or you can combine multiple vCenter Server data sets into a single infrastructure. The choice is really up to you on how you would like to visualize your data and whether you would like to map that to the physical location of your virtual infrastructure.

Step 5 - Once you have created your Infrastructures, upload your data files to their respective Infrastructure. This may take some time, as the data processing is executed in the background and also depends on the number of users and uploads occurring at the moment. We ask that you please be patient and check back in a bit; refreshing the page will let you know when the processing is complete.

Step 6 - After the data is uploaded to the system, a scheduled job performs the analytics and calculations in periodic batches. It can take 45 minutes to an hour before the results are reflected in the public dashboard; this is primarily governed by the single worker we have on the backend due to resource constraints. To view the results on the public dashboard, visit http://dash.vopendata.org

We hope you visit the vOpenData site regularly as the community uploads more and more data, and see how statistics are trending over time. We would also like to thank the following people who were part of our early alpha program and assisted with both testing and code contributions: Frederic Martin, Raphaël SCHITZ, Timo Sugliani and of course my Automation colleague Alan Renouf! If you would like to learn more about the vOpenData project, we have also submitted a session for VMworld 2013: 4976 - vOpenData - Crunching Everyone's Data For Fun And Knowledge. Be sure to vote for it!

You can follow @vopendata on Twitter for new updates and notifications, as well as Ben Thomas at @wazoo and William Lam at @lamw.

How can I help or contribute?

First and foremost, you can get involved by signing up for a free account and contributing your data to the open community database! We are also open to any suggestions and feedback, as they would be very valuable to us. Feel free to join the vOpenData VMTN Community Group to discuss further. We know that in this first release we are not going to be able to show everything, but we plan to show much more. Lastly, all the infrastructure that is used to provide the dashboard, the backend database and the processing is hosted and paid for out of our own pockets. If you have found this to be a useful resource and would like to contribute with a donation or sponsorship to help us continue developing this project, please contact us at vopendata[at]gmail[dot]com

Comments

Marco Broeken says

04/12/2013 at 7:09 pm

I can imagine that the guys over at CloudPhysics also have a lot of valuable data.

Perhaps they are willing to share to opendata?

- William Lam says
  
  04/12/2013 at 7:35 pm
  
  Completely agree. Would love to collaborate with them and see how we can further benefit the virtualization community as a whole!
  
Ammesiah says

04/12/2013 at 7:20 pm

That's an great idea and an amazing job !

Long live vOpenData !

- William Lam says
  
  04/12/2013 at 7:36 pm
  
  Thanks Fredric! We couldn't have done it without you and Raphael! Hopefully this will be a useful tool for everyone
  
Michael Ryom says

04/13/2013 at 8:46 pm

Would love to see network added to the stack

- William Lam says
  
  04/14/2013 at 3:52 pm
  
  Michael,
  
  Definitely. Networking is on our roadmap. Is there anything in particular that is a MUST see that would be helpful/useful?
  
Iwan 'e1' Rahabok says

04/14/2013 at 2:45 am

What does the color mean? Can't figure it out. If they don't mean anything, then my suggestion is to have 3 colors:
1 for Total. e.g. total number of LUNs in the opendata database.
1 for Average.
1 for Maximum. This is for showing how high people push it. So we know the highest or record.

Thanks! great job!

- William Lam says
  
  04/14/2013 at 3:55 pm
  
  Iwan,
  
  Yes, the tiles are color-coded to represent the specific entity types.
  
  Baby blue = Infrastructure (this is the logical view and everything in that color represents data related to that)
  
  The same goes for light green = cluster, red = clusters, yellow = hosts, etc.
  
  Hopefully you'll help contribute more data too!
  
Mohammed Raffic says

04/14/2013 at 2:17 pm

Thanks for your valuable posts

http://www.vmwarearena.com/

Anonymous says

04/15/2013 at 4:48 pm

at the moment I got the message at start:

PowerCLI S:\VMware> .\getvOpenData.ps1
The '<' operator is reserved for future use. + FullyQualifiedErrorId : RedirectionNotSupported

- William Lam says
  
  04/15/2013 at 10:42 pm
  
  Hi,
  
  Make sure your download was not corrupted. Someone ran into this when we first launched and it was due to a bad download. On the GitHub site, there is a link for the "zip" file that you can use to download the scripts.