How to do Voice Cloning in Python

Rohit Raj
3 min read · Oct 31, 2024

Voice cloning has become so easy that you should no longer take the authenticity of any audio recording for granted.

A few weeks back, the F5-TTS library was launched, making it very easy to clone a voice with only a 15-second voice sample.

You can follow these steps to clone a voice:

  1. Run the following commands in a command shell to install the library:
conda create -n f5-tts python=3.10
conda activate f5-tts

# Install PyTorch for your CUDA version (e.g. CUDA 11.8 below). If this step fails, the library will run on the CPU.
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

pip install git+https://github.com/SWivid/F5-TTS.git

  2. Run the following command to launch the Gradio UI:

f5-tts_infer-gradio

Now navigate to http://localhost:7860/ in your browser. Upload a source audio clip about 15 seconds long and type the text you want converted to speech. The tool will generate audio of your text in the voice from the source clip. I tried it with my wife's voice. It was genuinely scary how good the output was.
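Before uploading a clip, it can help to check how long it actually is. The sketch below uses only Python's standard `wave` module, so it only reads uncompressed PCM `.wav` files. The function names and the 5-second tolerance are my own illustrative choices, not anything F5-TTS requires.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_reasonable_reference(path, target=15.0, tolerance=5.0):
    """Check that the clip is roughly `target` seconds long.

    The 15-second target comes from this article; the tolerance is an
    arbitrary assumption for illustration.
    """
    return abs(wav_duration_seconds(path) - target) <= tolerance
```

For example, `is_reasonable_reference("ref_audio.wav")` would return `False` for a 3-second clip, which is a hint to record a longer sample.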

Instead of the Gradio UI, you can use the following command-line tool:

# Leaving --ref_text "" makes an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."
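If you would rather drive the command-line tool from a Python script, a thin `subprocess` wrapper is enough. This is a minimal sketch that only uses the four flags shown above; the function names are hypothetical, and it assumes `f5-tts_infer-cli` is on your PATH (that is, the conda environment is active) and that your installed version accepts these flags.

```python
import shutil
import subprocess

CLI_NAME = "f5-tts_infer-cli"


def build_infer_command(ref_audio, gen_text, ref_text="", model="F5-TTS"):
    """Build the argument list for the CLI call shown in this article."""
    return [
        CLI_NAME,
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,  # empty string: let the tool transcribe
        "--gen_text", gen_text,
    ]


def run_infer(ref_audio, gen_text, ref_text="", model="F5-TTS"):
    """Run the CLI and raise if it is missing or exits with an error."""
    if shutil.which(CLI_NAME) is None:
        raise RuntimeError(f"{CLI_NAME} not found; is the environment active?")
    command = build_infer_command(ref_audio, gen_text, ref_text, model)
    subprocess.run(command, check=True)
```

Passing the arguments as a list (instead of one shell string) avoids quoting problems when the text contains quotes or apostrophes.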

You can also use the command-line tool to generate audio containing more than one voice.

For this, you have to create a .toml file in the following format:

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
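If you have many voices, writing this file by hand gets tedious. Here is a rough sketch that generates the same layout from a Python dictionary. The helper names are hypothetical, and it only mirrors the keys shown in the config above; it is not a general TOML writer.

```python
import json
import re


def toml_string(value):
    """Quote a simple string; JSON escaping works for plain paths and text."""
    return json.dumps(value, ensure_ascii=False)


def build_config(main_audio, gen_file, voices, output_dir="tests"):
    """Return config text with one [voices.<name>] table per extra voice."""
    lines = [
        'model = "F5-TTS"',
        f"ref_audio = {toml_string(main_audio)}",
        'ref_text = ""',  # empty: reference audio is transcribed automatically
        'gen_text = ""',
        f"gen_file = {toml_string(gen_file)}",
        "remove_silence = true",
        f"output_dir = {toml_string(output_dir)}",
    ]
    for name, audio in voices.items():
        # Table names are written unquoted, so keep them to bare-key characters.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
            raise ValueError(f"invalid voice name: {name!r}")
        lines += ["", f"[voices.{name}]",
                  f"ref_audio = {toml_string(audio)}", 'ref_text = ""']
    return "\n".join(lines) + "\n"
```

You could then save the result with `open("custom.toml", "w", encoding="utf-8").write(build_config(...))`.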

Write the text to generate (the file that gen_file points to) in the following format. Mark the voice with [main], [town], or [country] wherever you want to change the voice.

A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] “My poor dear friend, you live here no better than the ants. Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land.” [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] “Goodbye,” [main] said he, [country] “I’m off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace.”
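To see how the tags partition a story like this, and to catch a tag that has no matching voice in your config, you can split the text yourself before running the tool. This is only an illustrative sketch, not F5-TTS's own parser. It assumes that text before the first tag belongs to the main voice, and the function names are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"\[(\w+)\]")


def split_by_voice(text, default_voice="main"):
    """Split tagged text into (voice, chunk) pairs in reading order."""
    segments = []
    voice = default_voice
    position = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append((voice, chunk))
        voice = match.group(1)  # the tag applies to the text that follows it
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((voice, tail))
    return segments


def undefined_voices(segments, defined):
    """Return voice names used in the text but missing from `defined`."""
    return sorted({voice for voice, _ in segments} - set(defined))
```

For the story above, `undefined_voices(split_by_voice(story), {"main", "town", "country"})` should come back empty; a typo such as `[contry]` would show up in the returned list.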

Then pass your .toml file to the tool with the following command:

f5-tts_infer-cli -c custom.toml

I request that you not use this library for spamming. But people need to be aware of what existing machine learning tools are capable of.


Written by Rohit Raj

Studied at IIT Madras and IIM Indore. Love Data Science
