Google recently released a new version of the Gemini 1.5 Pro model. The new model ranks above all other models except OpenAI's GPT-4o. You can see the model rankings at the site below.
Google has also recently released a smaller model, Gemini 1.5 Flash. Its performance and inference cost are much better than those of other small models.
Gemini 1.5 Pro and Gemini 1.5 Flash are multimodal models that can accept images, audio, and video as input. In part 5 of my series, I wrote about how to use text input with the Gemini API.
In this tutorial, I will show various uses of the Gemini API with image, audio, and video inputs.
First, we import the necessary libraries and configure the model:
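import pathlib
import textwrap
import google.generativeai as genai
import PIL.Image  # needed later to open images
from IPython.display import display
from IPython.display import Markdown

def to_markdown(text):
    # Turn bullet characters into Markdown list items and render the text as a block quote
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

# GOOGLE_API_KEY must already hold your Gemini API key
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-1.5-flash-001')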
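The setup code uses GOOGLE_API_KEY without showing where it comes from. One common option is to read it from an environment variable instead of pasting it into the notebook. The helper below is a minimal sketch of that idea; the function name load_api_key and the choice of an environment variable are my assumptions, not something the Gemini library requires.

```python
import os


def load_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: the variable name is an assumption for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (assumes the variable was exported in your shell or notebook):
# GOOGLE_API_KEY = load_api_key()
# genai.configure(api_key=GOOGLE_API_KEY)
```

Failing early with a clear error is usually easier to debug than an authentication error returned by the first API call.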
We can generate a response to a text prompt with the following call:
response = model.generate_content("What is the meaning of life?")
to_markdown(response.text)
Image Inputs
1. Image Description
I am going to demonstrate several use cases for image input with these models, using an image of Jon Snow.
img = PIL.Image.open(image)
img
response = model.generate_content(['Describe this image',img])
to_markdown(response.text)
2. Face Recognition
I also tested this with images of other celebrities. The model does a good job of identifying people.
3. Geolocation
I also tested this with other images. It was correct every time.
4. Image Classification
This should be an easy problem for these models.
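As a rough sketch of how classification could be scripted, the code below builds a prompt that restricts the model to a fixed label set and then maps the free-text reply back to one of those labels. The function names, the prompt wording, and the label list are all hypothetical; only the model.generate_content([prompt, img]) call pattern comes from the code earlier in this article.

```python
def build_classification_prompt(labels):
    """Ask for exactly one label so the reply is easy to parse."""
    options = ", ".join(labels)
    return (
        "Classify this image into exactly one of these categories: "
        f"{options}. Reply with the category name only."
    )


def parse_label(reply, labels):
    """Map the model's free-text reply to one of the allowed labels, or None."""
    cleaned = reply.strip().strip(".").lower()
    # Prefer an exact match on the whole reply.
    for label in labels:
        if cleaned == label.lower():
            return label
    # Fall back to a substring match in case the model adds extra words.
    for label in labels:
        if label.lower() in cleaned:
            return label
    return None


def classify_image(model, img, labels):
    """Send the image with the prompt; `model` is assumed to be a genai.GenerativeModel."""
    response = model.generate_content([build_classification_prompt(labels), img])
    return parse_label(response.text, labels)


# Usage with the objects defined earlier (labels are illustrative):
# classify_image(model, img, ["cat", "dog", "bird"])
```

The substring fallback is deliberately simple and can misfire when one label is contained in another word, so treat it as a starting point.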
5. Object Detection
You can use the following function to crop the image to a bounding box returned by the model:
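def crop_image(im, box):
    # box is (y1, x1, y2, x2), with each value normalized to the 0-1000 range
    width, height = im.size
    y1, x1, y2, x2 = box
    # Convert the normalized coordinates to absolute pixel positions
    abs_x1 = int(x1/1000 * width)
    abs_y1 = int(y1/1000 * height)
    abs_x2 = int(x2/1000 * width)
    abs_y2 = int(y2/1000 * height)
    # PIL's crop expects (left, upper, right, lower)
    return im.crop((abs_x1, abs_y1, abs_x2, abs_y2))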
I would recommend using the Gemini 1.5 Pro model for this task. The Flash model does not give very accurate bounding boxes.
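To use crop_image you first need the four numbers out of the model's text reply. The sketch below assumes the reply contains the box as four integers in square brackets, in the same (y1, x1, y2, x2) order and 0-1000 scale that crop_image expects. The helper names and the sample reply string are made up for illustration, and real replies may be formatted differently.

```python
import re


def parse_box(reply):
    """Extract the first [y1, x1, y2, x2] group of four integers from text."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    match = re.search(pattern, reply)
    if match is None:
        raise ValueError(f"No bounding box found in: {reply!r}")
    return tuple(int(value) for value in match.groups())


def to_pixel_box(box, width, height):
    """Convert a 0-1000 normalized (y1, x1, y2, x2) box to (left, upper, right, lower) pixels."""
    y1, x1, y2, x2 = box
    return (
        int(x1 / 1000 * width),
        int(y1 / 1000 * height),
        int(x2 / 1000 * width),
        int(y2 / 1000 * height),
    )


# Made-up reply text, only to show the conversion on a 640x480 image.
box = parse_box("The sword is at [200, 100, 800, 500].")
print(to_pixel_box(box, width=640, height=480))  # (64, 96, 320, 384)
```

The tuple returned by parse_box can be passed directly as the box argument of crop_image.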
6. Optical Character Recognition
It came very close to solving this difficult captcha. Normal text in images should not be a problem for the model.
Audio and Video Input
To provide audio or video input to the Gemini API, we first have to upload the media file to Google through the File API and then pass the returned file reference in the request. We can then ask for a transcript, a translation, or any question about the content of the media file.
First, upload the file:
video_file_name = "path/to/video_or_audio_file"  # replace with the path to your video/audio file
print("Uploading file...")
video_file = genai.upload_file(path=video_file_name)
print(f"Completed upload: {video_file.uri}")
Second, wait until the uploaded file has finished processing:
import time
while video_file.state.name == "PROCESSING":
    print('.', end='')
    time.sleep(10)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
    raise ValueError(video_file.state.name)
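The loop above waits forever if the file never leaves the PROCESSING state. A minimal sketch of the same polling logic with a timeout is shown below. The function name, its parameters, and the default timeout are my own choices; get_file is passed in so that the helper works with genai.get_file or with a stand-in during testing.

```python
import time


def wait_for_processing(file, get_file, poll_seconds=10, timeout_seconds=600,
                        sleep=time.sleep):
    """Poll until the file leaves the PROCESSING state or the timeout expires.

    `get_file` is expected to behave like genai.get_file (name -> file object).
    """
    waited = 0
    while file.state.name == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"{file.name} still processing after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)
    if file.state.name == "FAILED":
        raise ValueError(f"Processing failed for {file.name}")
    return file


# Usage with the objects from the snippet above:
# video_file = wait_for_processing(video_file, genai.get_file)
```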
Then make the API request:
# Create the prompt.
prompt = "Give transcript."
# The Gemini 1.5 models are versatile and work with multimodal prompts
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")
# Make the LLM request.
print("Making LLM inference request...")
response = model.generate_content([prompt, video_file],
                                  request_options={"timeout": 600})
print(response.text)
I used the Gemini API to ask questions about the following movie trailer.
First, I downloaded the YouTube video using the following code:
from pytube import YouTube
video_url = r'https://www.youtube.com/watch?v=0UZLOpDTwO0'
# Create a YouTube object
yt = YouTube(video_url)
# Get the lowest resolution stream available
stream = yt.streams.get_lowest_resolution()
# Download the video
stream.download()
Then I asked the model to provide a transcript, a summary of the plot, and the cast.
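Since the file only has to be uploaded once, several questions can be asked against the same file reference. The helper below is a sketch of that pattern. The function name is hypothetical and the example prompts are my own phrasing of the tasks described above, while the generate_content call and the timeout option mirror the request shown earlier.

```python
def ask_about_file(model, media_file, prompts, timeout=600):
    """Send each prompt together with the same uploaded file.

    Returns a dict that maps each prompt to the text of the model's reply.
    """
    answers = {}
    for prompt in prompts:
        response = model.generate_content(
            [prompt, media_file],
            request_options={"timeout": timeout},
        )
        answers[prompt] = response.text
    return answers


# Hypothetical prompts, using the model and video_file objects from earlier:
# answers = ask_about_file(model, video_file, [
#     "Give transcript.",
#     "Summarize the plot of this trailer.",
#     "List the cast shown in this trailer.",
# ])
# for prompt, answer in answers.items():
#     print(prompt, "\n", answer, "\n")
```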
The model was almost perfect at generating the transcript but struggled to identify the movie name, as it was not clearly mentioned in the video.
As these models keep improving, their use cases will only keep growing.