Recently OpenAI gave a mind-blowing demo of voice mode on GPT-4o, a natively multimodal model that can take text, audio, and images as input and produce text, audio, and images as output. OpenAI has rolled out the model to all its users, but the voice mode of the ChatGPT app is still powered by GPT-4: user speech is transcribed to text, and GPT-4's text response is converted back to speech, using two separate models.
While waiting for the new voice mode, I decided to build a voice mode demo using the Google Gemini 1.5 Pro API. I chose Gemini 1.5 Pro because it is nearly as capable as GPT-4 and Google provides a large number of free requests per day.
For transcribing speech input to text, I used the Whisper library from OpenAI, which runs machine learning models locally to convert speech to text. You can also use the OpenAI API for Whisper if you don't have a good GPU.
For converting Gemini's reply to speech, I used the Bark model, a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects.
For the voice chat GUI, I built on my previous blog post, where I created an application for chatting with Gemini.
Let's work through it step by step.
Step One: Speech to Text
You can convert speech to text using the following Python code:
import whisper

# Load a Whisper model once, up front (pick a size that fits your hardware)
model = whisper.load_model("base")

def stt(audio):
    # Gradio passes None when nothing was recorded
    if audio is None:
        return ""
    audio = whisper.load_audio(audio)
    result = model.transcribe(audio, language='en')
    print(result['text'])
    return result['text']
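For example, once a clip has been recorded (the filename here is just illustrative):

text = stt("recording.wav")  # "recording.wav" is a hypothetical recorded clip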
Step Two: Call the Gemini 1.5 Pro API
You can call the Gemini 1.5 Pro API using the following code:
import google.generativeai as genai

# `key` is your Gemini API key from Google AI Studio
genai.configure(api_key=key)
model = genai.GenerativeModel('gemini-1.5-pro-latest')

response = model.generate_content(text)
print(response.text)
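Since this is a chat application, you may also want Gemini to remember earlier turns. The google.generativeai SDK supports this through a chat session; a minimal sketch:

chat = model.start_chat(history=[])
reply = chat.send_message(text)  # prior turns are kept in chat.history
print(reply.text)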
Step Three: Converting Text to Speech
You can convert text to speech using the following code:
import numpy as np
from transformers import pipeline

model = pipeline(task="text-to-speech", model="suno/bark")

def tts(text):
    output = model(text)
    # Scale Bark's float audio to 16-bit PCM and return it with its sample rate
    return output['sampling_rate'], (np.array(output['audio']) * 32767).astype(np.int16).T
This function returns the audio as a (sample rate, 16-bit NumPy array) tuple, which is the format required by the Gradio Audio component.
Combining all three steps, I get the following code, which lets me voice chat with the Gemini 1.5 Pro API.
I used a Gradio Audio component as input: it records audio and sends it to a processinput function, which converts the speech to text and then sends the text to the Gemini 1.5 Pro API for a response.
Before converting the API response to speech, I split it into several smaller fragments so that the Suno model can process the entire response. I then concatenate the audio for each fragment with the NumPy library and feed the result into the Gradio Audio output, as in the sketch below.
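Here is a minimal sketch of that combined app. It assumes the stt and tts functions from the earlier steps, renames the Gemini model to gemini_model (the snippets above each reuse the name model, which would clash in a single script), uses Gradio 4 syntax, and splits sentences naively on ". " as a simplification:

import numpy as np
import gradio as gr

# Assumes stt() and tts() from the earlier steps are defined in this script,
# and the Gemini model was created as `gemini_model` to avoid name clashes.
def processinput(audio):
    text = stt(audio)
    if not text:
        return None
    response = gemini_model.generate_content(text)
    # Split the reply into fragments short enough for Bark to handle
    fragments = [s for s in response.text.split('. ') if s.strip()]
    chunks = []
    sample_rate = 24000  # Bark's default; overwritten by tts() below
    for fragment in fragments:
        sample_rate, chunk = tts(fragment)
        chunks.append(chunk)
    # Concatenate the per-fragment audio into one clip for the output component
    return sample_rate, np.concatenate(chunks)

demo = gr.Interface(
    fn=processinput,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Audio(label="Gemini's reply"),
)
demo.launch()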
This method works if you have a decent GPU. If you don't, you can call the Whisper API to convert speech to text and the ElevenLabs API to convert text to speech.
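A sketch of that API-based fallback, assuming the official openai Python package and the ElevenLabs REST endpoint (the voice ID and API key placeholders below are illustrative, not real values):

from openai import OpenAI
import requests

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stt_api(audio_path):
    # Upload the recording to the hosted Whisper model instead of running it locally
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def tts_api(text, voice_id="YOUR_VOICE_ID"):  # placeholder voice ID
    # ElevenLabs text-to-speech REST endpoint; returns MP3 bytes
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": "YOUR_ELEVENLABS_KEY"},  # placeholder key
        json={"text": text},
    )
    resp.raise_for_status()
    return resp.content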
If you liked the article please clap and subscribe. Thank you for reading.