Part 10: Voice Chat with Gemini

Rohit Raj
3 min read · May 22, 2024

Recently OpenAI gave a mind-blowing demo of voice mode on GPT-4o. It is a natively multimodal model that can take text, audio, and images as input and produce text, audio, and images as output. OpenAI has rolled out the model to all its users, but the voice mode of the ChatGPT app is currently still powered by the GPT-4 model, with two separate models transcribing the user's voice input to text and converting GPT-4's text response to speech.

While waiting for this, I decided to build a voice mode demo using the Google Gemini 1.5 Pro API. I chose Gemini 1.5 Pro because it is nearly as capable as GPT-4 and Google provides a large number of free requests per day.

For transcribing speech input to text, I used the Whisper library from OpenAI. It runs machine learning models locally to convert speech to text. You can also use the OpenAI API for Whisper if you don't have a good GPU.

For converting Gemini's reply to speech, I used the Bark model, a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects.

For the GUI of the voice chat application, I built on my previous blog, where I created an application for chatting with Gemini.

Let's work step by step.

Step One: Speech to Text

You can convert speech to text using the following Python code:

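import whisper

# Load the Whisper model once at startup.
# "base" is an assumed size; pick whichever size fits your GPU.
model = whisper.load_model("base")

def stt(audio):
    # audio is the path of the recorded file; no recording means nothing to transcribe
    if audio is None:
        return ""

    audio = whisper.load_audio(audio)
    result = model.transcribe(audio, language='en')
    print(result['text'])
    return result["text"]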

Step Two: Call to Gemini 1.5 Pro API

You can call the Gemini 1.5 Pro API using the following code:

import google.generativeai as genai

# key holds your Gemini API key; text is the prompt, for example the transcript from step one
genai.configure(api_key=key)
model = genai.GenerativeModel('gemini-1.5-pro-latest')
response = model.generate_content(text)
print(response.text)
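The call above is stateless, so Gemini sees every prompt in isolation. For a back-and-forth voice chat you may want it to remember earlier turns. Here is a minimal sketch of one way to do that with the library's chat session (start_chat and send_message). GeminiChat and load_gemini are hypothetical helper names introduced for illustration; they are not part of the library or of the original application.

```python
class GeminiChat:
    """Keeps one conversation open so that each voice turn sees the earlier ones."""

    def __init__(self, model):
        # start_chat() returns a session object that stores the message history.
        self._session = model.start_chat(history=[])

    def ask(self, text):
        # send_message() appends the prompt and the reply to the session history.
        return self._session.send_message(text).text


def load_gemini(api_key, model_name="gemini-1.5-pro-latest"):
    """Hypothetical helper: configure the client and return a GenerativeModel."""
    # Imported lazily so GeminiChat can be exercised with a stand-in model.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    return genai.GenerativeModel(model_name)
```

With a real key this would be used as chat = GeminiChat(load_gemini(key)), followed by chat.ask(text) for every transcribed utterance.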

Step Three: Converting Text to Speech

You can convert text to speech using the following code:

import numpy as np
from transformers import pipeline

model = pipeline(task="text-to-speech", model="suno/bark")

def tts(text):
    # Scale the floating-point audio to 16-bit integers and transpose the array
    return (np.array(model(text)['audio'])*32767).astype(np.int16).T

This function returns the audio in the format required by the Gradio audio component.
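As a hedged sketch of the full hand-off: a Gradio audio output in its default NumPy mode takes a (sampling rate, samples) pair, and the transformers text-to-speech pipeline returns the sampling rate next to the audio under the 'sampling_rate' key. The helper name to_gradio_audio is hypothetical. It also clips the samples before scaling, which is an addition to the code above so that values outside [-1, 1] cannot wrap around when cast to int16. The input is synthetic data standing in for a real pipeline result.

```python
import numpy as np


def to_gradio_audio(tts_output):
    """Turn a TTS pipeline result into the (rate, samples) pair for a Gradio audio output."""
    samples = np.asarray(tts_output["audio"], dtype=np.float32)
    # Clip to [-1, 1] so that scaling cannot overflow the int16 range.
    samples = np.clip(samples, -1.0, 1.0)
    # Same scaling and transpose as tts() above.
    pcm = (samples * 32767).astype(np.int16).T
    return int(tts_output["sampling_rate"]), pcm


# Synthetic stand-in for a pipeline result: one channel, four samples.
fake_output = {"audio": np.array([[0.0, 0.5, -1.0, 1.5]]), "sampling_rate": 24000}
rate, pcm = to_gradio_audio(fake_output)
print(rate, pcm.dtype, pcm.shape)
```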

Combining all three steps gives an application that lets me voice chat with the Gemini 1.5 Pro API.

I used a Gradio audio component as input. It records audio and sends it to the processinput function, which converts the speech to text and then sends the text to the Gemini 1.5 Pro API for a response.
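The wiring described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code of the app: make_process_input and build_app are hypothetical names, the three steps are passed in as plain functions (stt and tts as in the steps above, ask_gemini as an assumed wrapper that returns the text of the Gemini response), and sources=["microphone"] is the Gradio 4 spelling (older versions used source="microphone").

```python
def make_process_input(stt, ask_gemini, tts, sample_rate):
    """Chain speech-to-text, the Gemini call and text-to-speech into one callback."""

    def process_input(audio_path):
        text = stt(audio_path)
        if not text.strip():
            # Nothing was recorded or recognised, so there is nothing to play back.
            return None
        reply = ask_gemini(text)
        # A Gradio audio output accepts a (sampling rate, samples) pair.
        return sample_rate, tts(reply)

    return process_input


def build_app(process_input):
    """Wrap the callback in a microphone-in, audio-out Gradio interface."""
    # Imported lazily so the glue above can be tried without Gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=process_input,
        # type="filepath" hands the recording to the callback as a file path.
        inputs=gr.Audio(sources=["microphone"], type="filepath"),
        outputs=gr.Audio(),
    )
```

A possible launch line is build_app(make_process_input(stt, ask_gemini, tts, 24000)).launch(), where 24000 is Bark's output sampling rate. Passing the steps in as arguments also makes it easy to swap a local model for an API call later.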

Before converting the API response to speech, I split it into several smaller fragments so that the Suno model can process the entire response. I then concatenate the audio for each fragment with the NumPy library and feed the result into the Gradio audio output.
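A minimal sketch of this split-and-join step, with assumptions stated up front: fragments are cut at sentence boundaries, 200 characters is an arbitrary limit chosen for illustration, and tts is expected to return a (samples, channels) int16 array like the function in step three. split_into_fragments and speak are hypothetical names.

```python
import re

import numpy as np


def split_into_fragments(text, max_chars=200):
    """Group whole sentences into fragments of roughly max_chars characters.

    A single sentence longer than max_chars is kept as one fragment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    fragments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The next sentence would not fit, so close the current fragment.
            fragments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        fragments.append(current)
    return fragments


def speak(text, tts, max_chars=200):
    """Synthesize each fragment separately and join the audio end to end."""
    pieces = [tts(fragment) for fragment in split_into_fragments(text, max_chars)]
    if not pieces:
        return np.zeros((0, 1), dtype=np.int16)
    # Axis 0 is time for (samples, channels) arrays.
    return np.concatenate(pieces, axis=0)
```

Cutting at sentence boundaries keeps each fragment a natural unit of speech, so the joins fall where a speaker would pause anyway.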

This method works if you have a decent GPU. If you don't, you can call the Whisper API to convert speech to text and the ElevenLabs API to convert text to speech.
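For the speech-to-text half, a hedged sketch of the hosted route could look like this. It assumes version 1 or later of the openai Python package and an OPENAI_API_KEY environment variable. transcribe_with_api is a hypothetical name, and the optional client argument exists only so the function can be tried with a stand-in object. The ElevenLabs half is not sketched here.

```python
def transcribe_with_api(audio_path, client=None):
    """Transcribe a recording with the hosted Whisper model instead of a local one."""
    if audio_path is None:
        return ""
    if client is None:
        # Imported lazily; OpenAI() reads OPENAI_API_KEY from the environment.
        from openai import OpenAI

        client = OpenAI()
    with open(audio_path, "rb") as audio_file:
        # "whisper-1" is the model name the API uses for Whisper transcription.
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text
```

Because it takes a file path and returns a string, it can stand in for the local stt function without touching the rest of the pipeline.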

If you liked the article please clap and subscribe. Thank you for reading.
