In my previous blog post, I discussed how we can fine-tune a small language model, specifically Microsoft's Phi-2, on a single GPU and achieve remarkable results. Following that post, a comment caught my attention: a reader asked about generating a GGUF file from the fine-tuned model. While I had some knowledge of the topic, I wanted to explore it further. After a bit of research on the web, I realized there was not a lot of content on this topic and figured it might be worthwhile to write a more detailed article about it.
For those who are uninitiated, GGUF is the successor to GGML (GPT-Generated Model Language), developed by the brilliant Georgi Gerganov and the llama.cpp team. It allows for faster and more efficient use of language models for tasks like text generation, translation, and question answering. GGUF is quite popular among macOS users, and thanks to its minimal setup and solid performance you can run inference on pretty much any operating system, including inside a Docker container. How cool is that!
I am using Google Colab for my code. I will use llama.cpp to convert the fine-tuned model to GGUF, and to spice things up I am using LangChain with llama-cpp-python, a Python binding for llama.cpp, to run inference. I use a GPU instance to convert the model to GGUF format and then switch to a CPU instance to run inference, to demonstrate that it works well on a CPU.
Here are the links to the source code in the Google Colab notebook and the GitHub repo. Feel free to reuse them for your own projects. Happy coding!
Google Colab Notebook
GitHub Repository
Loading and Merging Phi-2 with fine-tuned LoRA adapters
Before we get into converting the fine-tuned model to GGUF format, let's first load the model and merge it with the LoRA adapters. I am going to reuse the adapters I created in my previous blog post. We will merge the adapters and push the result to the Hugging Face Hub. Let's start by installing the Python packages we need at this stage.
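A minimal sketch of the installs, assuming recent versions of the Hugging Face libraries (the exact pins in the notebook may differ):

```python
# Libraries needed to load Phi-2, merge the LoRA adapters,
# and push the merged model to the Hugging Face Hub.
!pip install -q transformers peft accelerate huggingface_hub
```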
Next, we will load the model in fp16. Note that in my previous blog I used bitsandbytes to load it in 4-bit to reduce memory usage; however, models that have already been quantized (8-bit or below) cannot be converted to GGUF format, so here we load the full fp16 weights.
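Here is roughly what the fp16 load looks like; the model id matches the base model from the previous post, and the remaining flags are assumptions reflecting a typical Phi-2 setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "microsoft/phi-2"

# Load the base model in fp16 -- no bitsandbytes quantization here,
# because an already-quantized model cannot be converted to GGUF.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
```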
Once the model is downloaded (this could take a while), let's load the adapters and merge them into the model.
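Something along these lines, with a hypothetical adapter repo name standing in for the adapters from the previous post:

```python
from peft import PeftModel

# Hypothetical adapter repo -- substitute the adapters from the fine-tuning post.
adapter_id = "your-username/phi2-finetuned-adapters"

# Attach the LoRA adapters to the base model, then fold them into the weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
merged_model = model.merge_and_unload()

# Save the merged model locally so it can be pushed and converted later.
merged_model.save_pretrained("phi2-finetuned-merged")
```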
After the model has been merged and saved locally (you can see the files by clicking the "Files" icon on the left in Google Colab), we will push the model to the Hugging Face Hub.
We also need to push the tokenizer along with the model, as it will be needed for the conversion later down the line; a sketch of both pushes follows below.
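A sketch of both pushes, using a hypothetical repo name and assuming you have a Hugging Face token with write access:

```python
from huggingface_hub import notebook_login

# Authenticate with a token that has write access to your account.
notebook_login()

# Hypothetical target repo -- use your own.
hub_repo = "your-username/phi2-finetuned"

# Push the merged weights and the tokenizer; both are needed for GGUF conversion.
merged_model.push_to_hub(hub_repo)
tokenizer.push_to_hub(hub_repo)
```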
Installing and building llama.cpp and converting HF model to GGUF
At this stage, if you are using Google Colab, I recommend disconnecting and deleting the runtime. Let's download the merged model, install and build llama.cpp, and convert the downloaded model to GGUF.
Below is the code to download the model along with the tokenizer.
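A sketch using snapshot_download, with the same hypothetical repo name as above:

```python
from huggingface_hub import snapshot_download

# Download the merged model and tokenizer files pushed earlier.
model_dir = snapshot_download(
    repo_id="your-username/phi2-finetuned",
    local_dir="phi2-finetuned",
)
```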
Installing and building llama.cpp
I am using cuBLAS, a library provided by NVIDIA for performing BLAS operations on NVIDIA GPUs, optimized for the CUDA architecture. BLAS (Basic Linear Algebra Subprograms) is a specification of optimized routines for common linear algebra operations. By using a BLAS backend, llama.cpp can perform its matrix math more efficiently, leading to faster processing of the input text. llama.cpp offers several different BLAS backends to choose from.
We then run "make", a tool that automates the process of building software; it's typically used for compiling C and C++ projects.
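Roughly, the build steps look like this (the LLAMA_CUBLAS flag applies to the Makefile-based build current at the time of writing; newer llama.cpp releases have moved to a CMake-based build):

```python
# Clone llama.cpp and build it with the cuBLAS backend enabled.
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && make LLAMA_CUBLAS=1

# Python dependencies for the conversion scripts.
!pip install -r llama.cpp/requirements.txt
```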
Now that we have the project compiled and the necessary Python packages installed, let's run the conversion on the downloaded model.
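A sketch of the conversion step; the directory and file names here are illustrative:

```python
# Convert the downloaded Hugging Face model to a GGUF file in fp16,
# writing the output into the phi2 directory.
!mkdir -p phi2
!python llama.cpp/convert-hf-to-gguf.py phi2-finetuned --outtype f16 --outfile phi2/phi2-finetuned-f16.bin
```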
Note that because we downloaded a Hugging Face model, I am using "convert-hf-to-gguf.py" instead of "convert.py". We save the output bin file in the phi2 directory.
Our next step is to quantize the converted model to Q5_K_M.
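The quantization step itself is a single call to the quantize binary built earlier (named llama-quantize in newer releases); file names are illustrative:

```python
# Quantize the fp16 GGUF file down to Q5_K_M.
!./llama.cpp/quantize phi2/phi2-finetuned-f16.bin phi2/phi2-finetuned-q5_k_m.gguf q5_k_m
```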
Notice the "q5_k_m" in the command. Q5 means the quantized weights are represented using 5 bits, "K" indicates the k-quant scheme (llama.cpp's block-wise quantization that groups weights into super-blocks), and "M" stands for the "medium" variant, which keeps slightly higher precision for some of the more sensitive tensors. I used this setting because it preserves model quality well while remaining memory-efficient: fewer bits can degrade output quality, while more bits take up more memory.
After the model has been quantized, let's upload the GGUF file to a separate repo on Hugging Face.
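A sketch using the HfApi client, again with a hypothetical repo name:

```python
from huggingface_hub import HfApi

api = HfApi()

# Separate repo just for the GGUF artifact (hypothetical name).
gguf_repo = "your-username/phi2-finetuned-gguf"
api.create_repo(repo_id=gguf_repo, exist_ok=True)

# Upload the quantized file.
api.upload_file(
    path_or_fileobj="phi2/phi2-finetuned-q5_k_m.gguf",
    path_in_repo="phi2-finetuned-q5_k_m.gguf",
    repo_id=gguf_repo,
)
```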
As before, if you are on Google Colab, I recommend disconnecting and deleting your runtime, then connecting to a CPU instance. First we will install the necessary packages to run inference, then download the GGUF file from Hugging Face, and finally run inference on it.
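On the fresh CPU runtime, the installs look roughly like this (exact versions may differ from the notebook):

```python
# llama-cpp-python provides the Python bindings for llama.cpp;
# LangChain provides the LlamaCpp wrapper used below.
!pip install -q llama-cpp-python langchain huggingface_hub
```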
Downloading the GGUF file from hugging face
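A sketch using hf_hub_download, matching the hypothetical repo and file names used during upload:

```python
from huggingface_hub import hf_hub_download

# Pull the quantized GGUF file from the Hub onto the CPU instance.
gguf_path = hf_hub_download(
    repo_id="your-username/phi2-finetuned-gguf",
    filename="phi2-finetuned-q5_k_m.gguf",
)
```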
Setting up LangChain, prompt and running inference
For the prompt, I am using the same prompt from my previous article on fine-tuning Microsoft's Phi-2.
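Below is a sketch of the LangChain setup. The prompt template here is a generic instruction-style placeholder rather than the exact prompt from the fine-tuning post, and the LlamaCpp parameters are reasonable defaults rather than the notebook's exact values:

```python
from langchain.llms import LlamaCpp          # langchain_community.llms in newer versions
from langchain.prompts import PromptTemplate

# Wrap the downloaded GGUF model with LangChain's LlamaCpp integration.
llm = LlamaCpp(
    model_path=gguf_path,
    temperature=0.1,
    max_tokens=512,
    n_ctx=2048,
    verbose=True,
)

# Placeholder instruction-style prompt.
template = """### Instruction:
{question}

### Response:"""
prompt = PromptTemplate.from_template(template)

# Chain the prompt into the model and run inference.
chain = prompt | llm
print(chain.invoke({"question": "Explain what the GGUF format is in one paragraph."}))
```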
And below is the output!
On a CPU, it took 138,518 milliseconds, roughly two and a quarter minutes, to run. The GGUF file itself was around 2GB, roughly 60% smaller than the fine-tuned model I downloaded (5GB).
Final Thoughts
Apart from the fact that the GGUF format is CPU-friendly and runs on Apple hardware, it offers a variety of benefits, listed below. That said, running inference on a desktop CPU is still on the slow side, so it is best suited for experimenting with an LLM locally on your own computer.
- Reduced Model Size: While not the primary focus, a GGUF file is smaller than the original format, with the exact savings depending on the model and the quantization method chosen during conversion.
- Deployment on devices with limited storage: If you plan to deploy your LLM on mobile devices or embedded systems with limited storage space, the smaller GGUF file size can be a significant advantage.
- Efficient Memory Usage: GGUF is specifically designed for efficient memory usage during inference. This allows the model to run effectively on devices with less memory, expanding deployment possibilities beyond GPUs with high computational power.
And that’s all folks! Thank you for taking time to read this article.
Resources
https://github.com/ggerganov/llama.cpp
https://towardsdatascience.com/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172
https://python.langchain.com/docs/integrations/llms/llamacpp
https://huggingface.co/docs/trl/main/en/use_model
https://huggingface.co/docs/huggingface_hub/v0.21.4/guides/download