Part 8: How to run inference on the Google Gemma model on Google Colab
Google released its open source Gemma models on Feb 21, 2024. These models are free to download and use. The Gemma model weights have been released in two sizes, Gemma 2B and Gemma 7B, and each size comes in pre-trained and instruction-tuned variants. According to Google, Gemma models outperform the open source Llama 2 models of comparable size.
You can run the Gemma models on Google Colab. In this tutorial I will show how to run both the Gemma-2B and Gemma-7B models. You need to follow these steps.
Step 1: Accepting the Google license
First, go to the Gemma model page on Hugging Face, log in to your Hugging Face account, and accept Google's license for the model. You can use the model only after accepting the license.
Step 2: Generating a Hugging Face token
Go to the Settings page on the Hugging Face website and generate a user access token. This token is needed to download the gated Gemma weights.
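You can either pass this token directly to the from_pretrained calls (as shown in the next step) or log in once per Colab session with the huggingface_hub library. Here is a minimal sketch of the login approach; the token string is a placeholder you replace with your own token:

# Authenticate the Colab session so gated downloads work without
# passing token= to every from_pretrained call
from huggingface_hub import login
login(token="your huggingface token")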
Step 3: Instantiate the models
Open a Google Colab notebook and choose the T4 GPU runtime. Then run the following code to instantiate the Gemma-2B model.
# Install the extra dependency first (note the leading ! for shell commands in Colab)
!pip install -q einops

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Create new tensors on the GPU by default
torch.set_default_device("cuda")

token = "your huggingface token"
# The Hugging Face token authorises the gated download of the Gemma-2B weights
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", trust_remote_code=True, token=token)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", trust_remote_code=True, token=token)
If you want to run the Gemma-7B model on Colab, you have to load a quantised version, because the full-precision 7B weights are too large for the T4's GPU memory. For that, run the following code.
# Install pinned versions of the required libraries first
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Create new tensors on the GPU by default
torch.set_default_device("cuda")

token = "your huggingface token"

# Load the 7B model in 8-bit so that it fits in the T4's GPU memory
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", token=token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config, token=token)
Step 4: Model Inference
The model inference code is the same for both models. Set input_text to whatever prompt you want to test.
input_text = "your prompt goes here"
# Tokenize the prompt and move the tensors to the GPU
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
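The snippets above use the pre-trained base checkpoints. If you load one of the instruction-tuned variants instead (google/gemma-2b-it or google/gemma-7b-it), format the prompt with the tokenizer's chat template. A minimal sketch, assuming the gemma-2b-it checkpoint and the same token and GPU runtime as above:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Instruction-tuned variant: the chat template adds Gemma's expected turn markers
it_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", token=token)
it_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype="auto", token=token).to("cuda")

chat = [{"role": "user", "content": "Explain quantisation in one sentence."}]
prompt_ids = it_tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to("cuda")

outputs = it_model.generate(prompt_ids, max_new_tokens=150)
print(it_tokenizer.decode(outputs[0], skip_special_tokens=True))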
In part 7 of this series, I tested the Microsoft Phi-2 model. On my example prompts, the performance of Gemma-2B was not as good as Phi-2, but the Gemma-7B model was significantly better. Among open LLMs of similar size, only the Mistral 7B model performed better than Gemma-7B.