Generative AI (GenAI) has taken the world by storm, and not just in tech: it has infiltrated every single industry, with billions of dollars (here, here, here, here, here, here, here & here) being invested to unlock its hidden potential.
I am sure many of you have already experimented with some aspect of GenAI, whether that is chat interfaces like OpenAI's ChatGPT or Google Bard, or the impressive text-to-image generation tools like DALL-E from OpenAI, Midjourney and Stable Diffusion from Stability.AI, just to name a few.
I use ChatGPT/Bard on a regular basis, from helping me debug cryptic Linux error messages to crafting complex regular expressions to generating random PowerShell snippets for automating various tasks; the possibilities, even for IT Administrators, are pretty endless. My workflow typically includes the use of ChatHub, an all-in-one chatbot browser plugin that allows me to use both ChatGPT and Bard simultaneously to compare and/or identify the best possible answer.
Until recently, solutions like ChatGPT only had access to data trained up to Sept 2021, but even with this constraint, the biggest issue that plagues all of these AI models is their hallucinations. An AI hallucination is when an AI simply makes up a response believing that it is factual, and while this problem is being worked on by the broader industry, it certainly makes it difficult to trust and validate an answer before using it yourself. I have certainly seen this first hand when asking ChatGPT to generate some code; I would say it is correct maybe 60% of the time, and I often have to verify and re-prompt when I know the syntax or answer is completely wrong.
While using these platforms, I had been thinking about a personal use case of mine, and I was curious whether other bloggers or even some of my readers might be able to relate.
Have you ever tried to find a blog post that you know you have written or know exists, but you cannot locate it because you did not use the right keywords? While Googling is still king for searching, I have found that it does not always return the best results and may require multiple search iterations. What is worse, after a bit of searching online you finally get a good result and realize it is a blog post that you had authored but completely forgotten about! 😅
While I would like to think I can recall most of the blog posts I wrote over the past 13+ years, the scenario above does happen quite frequently.
This gave me an idea! It would be really cool to have my own personal ChatGPT but instead of the model having access to random or incomplete data from the public internet, what if I could simply source that data from my own personal blog posts?!
Right before VMware Explore US, I had been doing some research and found a number of free and open source projects like privateGPT and localGPT that allow you to retrieve or summarize information using your own documents. Most of these solutions can work across multiple platforms and support different accelerators, including just a CPU, but for an ideal experience a GPU is required, and specifically an NVIDIA GPU, as most projects have implemented NVIDIA CUDA. I had experimented with a few of these projects, but some of the setup experience was pretty complex or convoluted with the different Python and package dependencies, which was something I noticed in many of these projects even with my super limited exposure.
I finally settled on a project called h2ogpt, which I was able to get up and running without too much hassle and which also includes a simple but rich graphical user interface. The project is extremely well documented and shares a lot of similarities with the other projects, but it felt much more comprehensive in terms of its documentation, with support for over 80 native file formats. With access to a decently powered NVIDIA GPU (RTX A5500 Laptop) from a recent review of the Lenovo P3 Ultra, I was curious to see if h2ogpt could enable the use case that I was looking to solve for myself and, along the way, hopefully learn a bit more about GenAI ...
Pre-Requisite:
- NVIDIA GPU for VM passthrough, as running on a CPU alone is neither performant nor ideal
Step 1 - Install Ubuntu 22.04 with the following VM configuration: 2 vCPU, 32GB memory and 60GB storage. Once the base operating system has been installed, go ahead and shut down the VM, configure passthrough for the NVIDIA GPU, and then power the VM back on.
For successful VM passthrough of the NVIDIA GPU, you will also need to add the following VM Advanced Setting to properly power on the VM:
pciPassthru.use64bitMMIO = TRUE
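You can add this setting from the vSphere UI (Edit Settings > VM Options > Advanced > Configuration Parameters) or, if you prefer the command line and already have govc configured, something like the following should work (the VM name below is just a placeholder for your own VM):

govc vm.change -vm "h2ogpt-vm" -e "pciPassthru.use64bitMMIO=TRUE"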
SSH access to the VM will be required for the remainder of the steps.
Step 2 - Recent Ubuntu releases do not automatically extend the root LVM volume to consume the full disk, so you will need to run the following to see the full 60GB:
sudo lvresize -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
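You can confirm that the root filesystem now reflects the additional space by running:

df -h /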
Step 3 - Ensure that our Ubuntu system has the latest updates:
sudo apt update -y && sudo apt upgrade -y
Step 4 - Run the following command to determine the recommended NVIDIA driver:
ubuntu-drivers devices
In the output, look for the "recommended" keyword to identify the recommended driver. In my example, the result returned "nvidia-driver-535", which we can install by running the following command:
sudo apt install -y nvidia-driver-535
Once the driver installation has completed, reboot for the changes to go into effect.
sudo reboot
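Once the VM is back up, you can confirm that the driver loaded correctly and that the GPU is visible by running:

nvidia-smi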
Step 5 - Install these additional packages, including Conda, which will help set up our Python environment:
sudo apt install -y autoconf libtool nvidia-cuda-toolkit
curl https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh --output anaconda.sh
bash anaconda.sh
Follow the rest of the directions provided by the prompt, answering yes to all questions. You will also need to open a new session for the changes to go into effect before moving on to the next step.
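Once you have re-connected, you can quickly verify that Conda is available in your new session by running:

conda --version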
Step 6 - Next, we create our Conda environment, which we will name h2ogpt, then clone the h2ogpt repo and install the remaining requirements, including llama-cpp-python, which will allow us to take advantage of both CPU & GPU resources for any llama-based models.
conda create -n h2ogpt -y
conda activate h2ogpt
conda install python=3.10 -c conda-forge -y
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt/
conda install cudatoolkit-dev -c conda-forge -y
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r reqs_optional/requirements_optional_langchain.txt
pip uninstall -y llama-cpp-python llama-cpp-python-cuda
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.73 --no-cache-dir --verbose
Note: Each time you exit or reboot the system, make sure to run "conda activate h2ogpt" to get back into the environment.
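Before moving on, an optional but useful sanity check is to confirm that PyTorch inside the h2ogpt environment can actually see the GPU. If everything is wired up correctly, the following one-liner should print True along with the name of your GPU:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"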
Step 7 - Per the h2ogpt documentation, the llama-2-7b-chat.ggmlv3.q8_0 model is the smallest and most CPU friendly, requiring 32GB of system RAM or a 9GB GPU with full GPU offloading. We will pre-download the model into the h2ogpt directory by running the following command:
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
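Once the download completes, you can confirm the file landed in the h2ogpt directory (the q8_0 quantization of the 7B model weighs in at roughly 7GB):

ls -lh llama-2-7b-chat.ggmlv3.q8_0.bin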
Note: You can experiment with other models by simply specifying the Hugging Face URI using the --base_model argument. For example, VMware has also published a number of its own models on Hugging Face, and let's say you want to try out open-llama-7b-v2-open-instruct; the --base_model value would simply be VMware/open-llama-7b-v2-open-instruct, which you can also copy from the upper left hand corner of the model webpage. Do note, most models are tens of GB if not larger, so be sure you have enough storage space to download them, and you may also want to store them on a local fileshare so you do not have to re-download these huge models each time.
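As a concrete example, launching h2ogpt against that VMware model (instead of the local llama2 model used in the next step) would look something like the following, with h2ogpt pulling the model down from Hugging Face on first launch; depending on the model, you may also need to adjust the --prompt_type argument:

python generate.py --base_model=VMware/open-llama-7b-v2-open-instruct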
Step 8 - Finally, to launch the h2ogpt application, run the following command, which will use the llama2-based model that we just downloaded in Step 7. If the model you are attempting to use does not exist locally, it will automatically be downloaded for you, but given how huge these models are, you may want to download them once and save them locally for reuse as mentioned in the note above.
python generate.py --base_model='llama' --prompt_type=llama2
The first time you launch h2ogpt, the initialization will take slightly longer (20-30 seconds typically), but after that it should be pretty quick to load, unless you do NOT have an NVIDIA GPU to offload to. At the very end of the output you should see a line stating the service is now running on http://0.0.0.0:7860, and you can then open a browser to the FQDN/IP Address of the VM on port 7860.
You now have a private ChatGPT-like solution using h2ogpt running within your own environment! Unlike some of the other alternatives, where the interface is purely command-line, h2ogpt provides a nice and relatively easy to use graphical interface where you can interact with the model and upload your documents directly from the UI. There are also a number of advanced settings, from changing the mode from Q/A to summarization to much, much more; definitely check out the documentation on the h2ogpt project for more details.
While the UI is a great way to get started with h2ogpt, if you have a bunch of documents or blog posts, you probably want a more efficient way of uploading all of that content! 🙂 Luckily, h2ogpt also supports a bulk import option that can be used to pre-build your source document database so that it is available immediately for use.
To do so, use the src/make_db.py script, where you provide the path to all of your documents and it will import the raw files and generate the source database file, which can then be used when launching the h2ogpt application. In the example below, the script is processing all of my blog posts from the /home/vmware/williamlam.com directory, and I have named this collection WilliamLamData.
python src/make_db.py --user_path=/home/vmware/williamlam.com --collection_name=WilliamLamData
Note: Even during the import and database creation, the GPU will be leveraged to help offload the process but usage only went up to 4GB of GPU memory when monitoring with the nvidia-smi utility.
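If you want to keep an eye on GPU utilization yourself while the import runs, a simple approach is to open a second SSH session and run:

watch -n 1 nvidia-smi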
As you can see from the screenshot above, it took ~9m (with some GPU offloading) to process 1500+ articles and build my database which I can now reference with the following command:
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='WilliamLamData' --langchain_modes="['LLM','WilliamLamData']"
Now, instead of starting with an empty document database, I have pre-loaded my 1500+ blog posts, which I can interact with directly using WilliamGPT 😛
If you ask me, this is pretty freaking dope! You immediately get to try out these Large Language Models (LLMs) and start experimenting before investing any additional time and resources, especially not having to do any training. For advanced users, you might want to explore fine-tuning one of these models, adjusting the prompts before and after they go to the LLM, or even building your own models and comparing and contrasting their outcomes.
One thing I really like about the h2ogpt solution is that the generated response includes the source document reference it used to derive the answer, which is not something you see with all the public ChatGPT-like systems. After standing up this solution earlier last week, I have already had more than a dozen uses, from quickly identifying an article that contains a specific piece of information to pointing to the blog post that answers a question from our field. I even ran the same question on Google, which has probably crawled my entire blog, but I still found that using just my blog articles, the results were much more accurate, which was exactly what I was expecting when I came up with this particular use case for myself!
Note: I have only tried this llama2 (llama-2-7b-chat.ggmlv3.q8_0) model so far and have not experimented with others. If you are not getting the results you expect even after fine tuning, then you may want to experiment with other prompts or other models, especially if you have more powerful GPUs capable of running larger models with greater accuracy.
Obviously, this was a very basic use case that I demonstrated for my own personal use, but imagine what you could do with something like this running in your own private environment! Some additional use cases come to mind when I think about typical tasks for an IT organization:
- Summarizing and/or searching for information (knowledge, troubleshooting, etc) across various internal document stores
- Finding code snippets or generating net new code from your existing code repository
- Building resource allocation and usage reports (application, networking, configuration) for management, sourcing from various systems of record, including a Change Management Database (CMDB)
There are no right or wrong ways to use GenAI, and everyone is literally experimenting right now to see what it can do and what its limits are. What types of use cases come to your mind, especially as you think about your day-to-day work within your organization and how GenAI could augment or improve your existing workflows?
I already have my next idea after going through this exercise myself 😀
Mauro Bassani says
Thanks William, this is very helpful! For example, a use case could be for those who need to run a pilot for an AI/ML project, but whose corporate policies don't allow them to publicly expose their company's data on ChatGPT or Bard.
Doug says
That's really cool! Thanks for sharing the step-by-step walkthrough of this experiment!
Lukas Klinger says
Hi William, thanks for putting this guide together.
I've encountered an issue with my GPU's pass-through. (RTX 4090 OC)
This message appeared on my VM's events page:
"The firmware could not allocate 33587200 KB of PCI MMIO. Increase the size of PCI MMIO and try again."
The solution was to add this VM Advanced Setting along with the parameters you mentioned:
pciPassthru.64bitMMIOSizeGB = “64"