Google Speech-to-Text API in Python: Worked Example

I found this article on Medium about using the Google Speech-to-Text API.

As a Python coder I found it a good starting point, but the code was not in a state where I could just use it.

Please read the original article for the why; this post is just the how.

So how do you convert the speech in an audio file (mp3, ogg, wav) to text? I have uploaded everything you need to this git repository.


To get the code:

I recommend using virtualenv/venv to set up your own local copy of Python:

Then install the Python modules the script depends on. They are all listed in the requirements.txt file in the directory that comes from the repo.

I was able to get this working under native Windows and Linux, but not under Cygwin.

Google Cloud account

As per the original article, you will need a Google Cloud Platform account.

Once that is set up you will need to create a “bucket”, an area on Google’s servers that you can upload data to.

You will also need to set up a <credentials>.json file. The Python script uses it to authenticate against the Google servers, which lets it upload the audio file and then call the transcription service.

In my project I have called the bucket ‘throat’, and I have included an example JSON file, gcloud-123011d921d1.json. It is a dummy file, there so you can see what one looks like; you can’t use it (well, you can, but it won’t work!).
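If you want a quick sanity check that your own credentials file has the usual shape, something like the following could do it. This is only a sketch: it assumes the file is a standard Google service-account key, the function name is made up for illustration, and the sample values are placeholders rather than anything from the repo. It checks for the presence of a few well-known fields and cannot tell a working key from a dummy one.

```python
import json

# Fields that a Google service-account key file normally contains.
# This checks shape only; it cannot tell a real key from a dummy one.
EXPECTED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def missing_credential_fields(json_text):
    """Return the expected field names absent from a credentials JSON string."""
    data = json.loads(json_text)
    return sorted(EXPECTED_FIELDS - set(data))


if __name__ == "__main__":
    # Made-up placeholder values, not taken from the repository's file.
    sample = json.dumps({
        "type": "service_account",
        "project_id": "my-project",
        "client_email": "someone@my-project.iam.gserviceaccount.com",
    })
    print(missing_credential_fields(sample))
```

An empty list means all four fields are present; anything else names what is missing.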

Google charges you for the pleasure, but at the time of writing 100 minutes of transcription per month is free. When the script finishes it removes the audio file from the server. If you exit prematurely the file may be left on the server, so it does no harm to have a look when you are done and make sure the bucket is empty of files.
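Checking the bucket can also be done from Python rather than the web console. The helper below is a hypothetical sketch, not part of the repo: it assumes you pass it a `google.cloud.storage.Client` (from the google-cloud-storage package) and it simply lists whatever is still in the bucket.

```python
def leftover_files(client, bucket_name):
    """Return the names of objects still present in the bucket."""
    # `client` is assumed to be a google.cloud.storage.Client, or anything
    # else with a compatible list_blobs(bucket_name) method.
    return [blob.name for blob in client.list_blobs(bucket_name)]


# Possible usage, assuming google-cloud-storage is installed and the
# credentials file is a real one:
#
#   from google.cloud import storage
#   client = storage.Client.from_service_account_json("<credentials>.json")
#   print(leftover_files(client, "throat"))
```

An empty list means nothing was left behind.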

Once you have the bucket name and json file, edit the gcloud.ini file accordingly (no quotes):
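The real gcloud.ini is the one in the repo. As a rough illustration of how a file like it could be read, and of why quotes matter, here is a sketch that assumes the script uses Python’s standard configparser module (it may not) and uses hypothetical section and key names (`gcloud`, `bucket`, `credentials`) that are not copied from the repo.

```python
import configparser

# Hypothetical layout: the section and key names are assumptions,
# not copied from the repository's gcloud.ini.
EXAMPLE_INI = """
[gcloud]
bucket = throat
credentials = gcloud-123011d921d1.json
"""


def read_settings(ini_text):
    """Parse the ini text and return (bucket, credentials_path)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["gcloud"]
    return section["bucket"], section["credentials"]


if __name__ == "__main__":
    print(read_settings(EXAMPLE_INI))
    # configparser does not strip quotation marks, so a quoted value
    # keeps its quotes as part of the string.
    print(read_settings('[gcloud]\nbucket = "throat"\ncredentials = x.json\n'))
```

With configparser, a quoted bucket name would be looked up with the quote characters included, which is one plausible reason to leave them out.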


The Python script calls ffmpeg under the hood. Make sure it is installed on your machine and in your path:
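One way to check this from Python itself is the standard library’s `shutil.which`, which looks a program up on your PATH. The function name here is just for illustration and is not from the repo.

```python
import shutil


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path to ffmpeg if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; install it before running the script")
    else:
        print("ffmpeg found at", path)
```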

How to install ffmpeg


You should now be set up. I have included a few audio files in the audio directory. It is Thackery Binx from the movie Hocus Pocus saying the phrase, “it’s protected by magic”.

Bonus points if anyone can figure out why that snippet of audio is being used.

It’s protected by magic!

To call the script:

Or in this case you can use the one in the repo:

In the background, the script converts the audio to a single-channel wav file, uploads it to Google, transcribes it, prints the transcript to the screen, writes it to a text file in the transcript directory, and finally deletes the wav file from the Google server.
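For anyone curious what those steps can look like in code, here is a minimal sketch. It is not the script from the repo: the function names are made up, and it assumes the google-cloud-storage and google-cloud-speech packages (recent versions) plus ffmpeg on your PATH. Printing and writing the transcript to a file are left out to keep it short.

```python
import subprocess
from pathlib import Path


def mono_wav_command(src, dst):
    """Build an ffmpeg command that converts src to a single-channel wav."""
    # -y overwrites dst if it exists, -i names the input, -ac 1 means one channel.
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1", str(dst)]


def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that points the speech service at an uploaded file."""
    return f"gs://{bucket_name}/{blob_name}"


def transcribe(audio_path, bucket_name, credentials_json, language="en-US"):
    """Convert, upload, transcribe, and always delete the uploaded wav."""
    # Imported here so the helpers above work without the Google packages.
    from google.cloud import speech, storage

    src = Path(audio_path)
    wav_path = src.with_name(src.stem + "_mono.wav")
    subprocess.run(mono_wav_command(src, wav_path), check=True)

    storage_client = storage.Client.from_service_account_json(credentials_json)
    speech_client = speech.SpeechClient.from_service_account_file(credentials_json)

    blob = storage_client.bucket(bucket_name).blob(wav_path.name)
    blob.upload_from_filename(str(wav_path))
    try:
        # For wav input the encoding and sample rate can be left unset,
        # since the service reads them from the file header.
        config = speech.RecognitionConfig(language_code=language)
        audio = speech.RecognitionAudio(uri=gcs_uri(bucket_name, wav_path.name))
        operation = speech_client.long_running_recognize(config=config, audio=audio)
        response = operation.result(timeout=600)
        return " ".join(r.alternatives[0].transcript for r in response.results)
    finally:
        # Runs even if transcription fails, so the file is not left in the bucket.
        blob.delete()
```

The `try`/`finally` is a design choice in this sketch: it means the uploaded file is removed even when the transcription step raises an error.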

Sample output:

That’s it, it’s working!

Get your own audio file and try it. At the moment the script only supports mp3, ogg and wav files.
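If you want to check a file before sending it anywhere, a small test on the extension is enough. This is a sketch that assumes the format is judged by file extension alone, which may not be how the script in the repo does it; the names are hypothetical.

```python
from pathlib import Path

# The three formats named in this post.
SUPPORTED = {".mp3", ".ogg", ".wav"}


def is_supported(filename):
    """Return True if the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED
```

For example, `is_supported("clip.MP3")` is true and `is_supported("clip.flac")` is false.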


This post is just for setup. The accuracy of Google Speech-to-Text was not great for me; I will detail that in another post. I suspect it is because I have an Irish accent, while the AI (deep learning) was trained mainly on American accents.