Part 11: Various Uses of the Gemini 1.5 Pro API

Rohit Raj
4 min read · May 30, 2024


Google recently released a new version of the Gemini 1.5 Pro model. On the leaderboard below, it ranks above every model except OpenAI's GPT-4o.

https://chat.lmsys.org/

They have also recently released a smaller model, Gemini 1.5 Flash. Its performance and inference cost are much better than those of other small models.

Gemini 1.5 Pro and Gemini 1.5 Flash are multimodal models that can accept audio and video as input. In Part 5 of my series, I wrote about how to use text input with the Gemini API.

In this tutorial, I will show various uses of the Gemini API for image, audio, and video inputs.

First, we import the necessary libraries:

import pathlib
import textwrap

import PIL.Image
import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown


def to_markdown(text):
    # Render the response as an indented Markdown blockquote
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

# GOOGLE_API_KEY is your Gemini API key (e.g. read from an environment variable)
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-1.5-flash-001')

We can generate a response to text input using the following call:

response = model.generate_content("What is the meaning of life?")
to_markdown(response.text)

Image Inputs

1. Image Description

I am going to demonstrate several use cases for image input with these models, working with an image of Jon Snow.

# 'jon_snow.jpg' is a placeholder for the path to your image file
img = PIL.Image.open('jon_snow.jpg')
img
response = model.generate_content(['Describe this image', img])
to_markdown(response.text)

2. Face Recognition

I also tested this with images of other celebrities. It does a good job of identifying people. A minimal sketch is shown below.
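Here the file name is a placeholder; use any clear photo of a well-known person:

# 'celebrity.jpg' is a placeholder for your own image file
celeb_img = PIL.Image.open('celebrity.jpg')
response = model.generate_content(['Who is the person in this image?', celeb_img])
to_markdown(response.text)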

3. Geolocation

I also tested this with other images. It identified the location correctly every time. A sketch follows below.
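As a sketch, assuming a photo of a recognizable place (the file name is a placeholder):

# 'landmark.jpg' is a placeholder for a photo of a recognizable location
img = PIL.Image.open('landmark.jpg')
response = model.generate_content(
    ['Where was this photo taken? Point out the visual clues.', img])
to_markdown(response.text)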

4. Image Classification

This should be an easy problem for these models.
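A sketch of a simple zero-shot classification prompt; the file name and the label set are placeholders for your own task:

# 'animal.jpg' and the labels below are placeholders
img = PIL.Image.open('animal.jpg')
prompt = 'Classify this image as one of: cat, dog, horse, other. Reply with the label only.'
response = model.generate_content([prompt, img])
print(response.text)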

5. Object Detection

Gemini returns bounding boxes as (y1, x1, y2, x2) coordinates normalized to a 0–1000 scale, so you can use the following function to crop the detected box out of the image:

def crop_image(im, box):
    # box is (y1, x1, y2, x2) with coordinates normalized to 0-1000
    width, height = im.size
    y1, x1, y2, x2 = box
    abs_x1 = int(x1 / 1000 * width)
    abs_y1 = int(y1 / 1000 * height)
    abs_x2 = int(x2 / 1000 * width)
    abs_y2 = int(y2 / 1000 * height)
    # PIL's crop expects (left, upper, right, lower)
    return im.crop((abs_x1, abs_y1, abs_x2, abs_y2))

I would recommend using the Gemini 1.5 Pro model for this use case; the Flash model does not give very accurate bounding boxes.
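Here is a minimal sketch of how the Pro model can be combined with crop_image. The prompt wording, the image file name, and the assumption that the model replies with a bare [y1, x1, y2, x2] JSON list are mine:

import json

pro_model = genai.GenerativeModel('gemini-1.5-pro')
img = PIL.Image.open('dog.jpg')  # placeholder image

# Ask for coordinates on the same 0-1000 scale that crop_image expects
prompt = ('Return the bounding box of the dog as a JSON list '
          '[y1, x1, y2, x2] with coordinates scaled to 0-1000.')
response = pro_model.generate_content([prompt, img])

box = json.loads(response.text)  # assumes a bare JSON list; strip fences if needed
crop_image(img, box)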

6. Optical Character Recognition

It came very close to solving this difficult captcha. Normal text in images should not be a problem for the model.
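A sketch, using a placeholder file name:

# 'captcha.png' is a placeholder for any image containing text
img = PIL.Image.open('captcha.png')
response = model.generate_content(['Read out the text in this image.', img])
print(response.text)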

Audio and Video Input

To provide audio or video input to the Gemini API, we first upload the media file to Google's servers through the File API and then pass the returned file identifier to the model. We can then ask for a transcript, a translation, or any question about the content of the media file.

First, upload the file:

video_file_name = "video.mp4"  # placeholder: path to your video/audio file
print("Uploading file...")
video_file = genai.upload_file(path=video_file_name)
print(f"Completed upload: {video_file.uri}")

Second, wait for the uploaded file to finish processing:

import time

# Poll until the uploaded file has been processed
while video_file.state.name == "PROCESSING":
    print('.', end='')
    time.sleep(10)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(video_file.state.name)

Then make the API request:

# Create the prompt.
prompt = "Give transcript."

# The Gemini 1.5 models are versatile and work with multimodal prompts
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

# Make the LLM request.
print("Making LLM inference request...")
response = model.generate_content([prompt, video_file],
                                  request_options={"timeout": 600})
print(response.text)
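The same uploaded file handle can be reused for other prompts, for example a translation (the target language here is arbitrary):

# Reuse the uploaded file for a different question
response = model.generate_content(
    ['Translate the spoken dialogue into French.', video_file],
    request_options={"timeout": 600})
print(response.text)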

I used the Gemini API to ask questions about the following movie trailer.

First, I downloaded the YouTube video using the following code:

from pytube import YouTube
video_url = r'https://www.youtube.com/watch?v=0UZLOpDTwO0'
# Create a YouTube object
yt = YouTube(video_url)

# Get the lowest resolution stream (a smaller file uploads faster)
stream = yt.streams.get_lowest_resolution()

# Download the video
stream.download()

Then I asked the model to provide a transcript and a summary of the plot and cast; a sketch of the prompts follows below.
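The exact prompts are not shown above; something along these lines works, after uploading the downloaded trailer as before ('trailer.mp4' is a placeholder for the downloaded file name):

# Upload the downloaded trailer (placeholder file name), then query it
trailer = genai.upload_file(path='trailer.mp4')
while trailer.state.name == "PROCESSING":
    time.sleep(10)
    trailer = genai.get_file(trailer.name)

for prompt in ['Give a transcript of this trailer.',
               'Summarize the plot and list the cast.']:
    response = model.generate_content([prompt, trailer],
                                      request_options={"timeout": 600})
    print(response.text)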

The model was almost perfect in generating the transcript but struggled with the movie name, as it was not clearly mentioned in the video.

As these models keep improving, their use cases will only keep growing.

If you liked my article, please clap and subscribe to get my stories.
