Part 8: How to run inference on the Google Gemma model on Google Colab

Rohit Raj
Mar 4, 2024 · 2 min read


Google released the open-source Gemma models on Feb 21, 2024. These models are free to download and use. The Gemma model weights have been released in two sizes, Gemma 2B and Gemma 7B, and each size comes in pre-trained and instruction-tuned variants. According to Google, the Gemma models outperform the open-source Llama 2 models.

You can run the Gemma models on Google Colab. In this tutorial I will show how to run both the Gemma-2B and Gemma-7B models. You have to follow these steps:

Step 1: Accepting Google License

First, go to the Gemma model page on Hugging Face, log in to your Hugging Face account, and accept the Google license for the model. You can only use the model after accepting the license.

Step 2: Generating Hugging face token

Go to the settings page on the Hugging Face website and generate a user access token.
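
If you do not want to pass the token to every from_pretrained call, an alternative (not part of the original steps, so treat it as a sketch) is to log in once per Colab session with the huggingface_hub library; transformers then picks the token up automatically.

# Optional: authenticate once per session instead of passing token= to each call.
# Replace the placeholder below with the access token generated in this step.
from huggingface_hub import login

login(token="your huggingface token")  # or use notebook_login() for an interactive prompt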

Step 3: Instantiate the models

Open a Google Colab notebook and choose the T4 GPU runtime. Then run the following code to instantiate the Gemma-2B model:

# Install einops first (Colab shell command)
!pip install -q einops

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

token = 'your huggingface token'

# Load the pre-trained Gemma-2B model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", trust_remote_code=True, token=token)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", trust_remote_code=True, token=token)

If you want to run the Gemma-7B model on Colab, you have to use a quantised model. For that, you can run the following code:

# Install the required libraries first (Colab shell commands)
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

torch.set_default_device("cuda")
token = 'your huggingface token'

# Load Gemma-7B in 8-bit so it fits in the T4 GPU's memory
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", token=token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config, token=token)
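
If 8-bit loading still does not fit in your Colab GPU's memory, the same BitsAndBytesConfig API also supports 4-bit quantisation. The snippet below is only a sketch; the bnb_4bit settings are my assumptions, not part of the original code.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: load Gemma-7B in 4-bit to reduce GPU memory use further (settings are illustrative)
quantization_config_4bit = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", quantization_config=quantization_config_4bit, token=token)  # token from the cell above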

Step 4: Model Inference

The inference code is the same for both models:

input_text = "Why is the sky blue?"  # example prompt
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_length=200)
print(tokenizer.decode(outputs[0]))
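
The code above uses the pre-trained checkpoints. If you instead load an instruction-tuned variant such as google/gemma-2b-it (not covered above, so the snippet below is only a sketch), the prompt is typically wrapped with the tokenizer's chat template before generation.

# Sketch for an instruction-tuned variant (e.g. google/gemma-2b-it), assuming the
# model and tokenizer were loaded the same way as in Step 3.
chat = [{"role": "user", "content": "Explain quantisation in one paragraph."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))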

In part 7 of this series, I tested the Microsoft Phi-2 model. On my example prompts, the performance of Gemma-2B was not as good as Phi-2, but the Gemma-7B model is significantly better. Among open LLMs of similar size, only the Mistral 7B model is better than Gemma-7B.
