Speech-to-text transcription is the conversion of recorded speech into text using speech recognition models.

Converting speech into text, automatically and with high accuracy, is a straightforward process. Several speech-to-text APIs are available as *fee-based services*, including Google's Cloud Speech-to-Text, IBM™ Watson, Amazon Transcribe and Microsoft™ Azure.

I use Google's Cloud Speech-to-Text API. Its official documentation is extensive and will guide you in the use of the API. In the Google Cloud Platform, you can create a project, set up your billing preferences and create a Cloud Storage bucket to which you upload your audio files.
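The upload step can also be done from the command line. The sketch below is only an illustration: the function name `upload_audio` is hypothetical, and it assumes that `gsutil` is installed and authenticated (as it is in the Cloud Shell) and that the bucket already exists (one can be created with `gsutil mb gs://BUCKET_NAME`).

```shell
# Hypothetical helper (not part of any Google tool): copy one local audio
# file into an existing Cloud Storage bucket and print the gs:// URI that
# the recognizer expects.
# Usage: upload_audio LOCAL_FILE BUCKET_NAME
upload_audio() {
  local_file="$1"
  bucket="$2"

  # gsutil messages are sent to stderr so that stdout carries only the URI.
  gsutil cp "$local_file" "gs://$bucket/" >&2 || return 1

  # The object keeps the base name of the local file.
  printf 'gs://%s/%s\n' "$bucket" "$(basename "$local_file")"
}
```

For example, `upload_audio 012_MOVIES.wav my-bucket` (with `my-bucket` as a placeholder bucket name) would print `gs://my-bucket/012_MOVIES.wav` if the copy succeeds.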

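Once you are ready, you can use the Cloud Shell (recommended) to send speech audio files for transcription from the command line. The snippet below transcribes a speech audio file that has been uploaded to a bucket (i.e. it is not local). Because the first command runs asynchronously (--async), it returns an operation ID instead of the transcript. The second command retrieves the result of that operation (here with the placeholder ID 1234567890) and saves it as a text file.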

gcloud alpha ml speech recognize-long-running gs://audios_experimento_hyperscanning/012_MOVIES.wav --language-code=en-US --async --enable-automatic-punctuation --encoding=linear16 --include-word-confidence --include-word-time-offsets --max-alternatives=0 --sample-rate=44100 --audio-topic="preferences and opinions about movies" --interaction-type=professionally-produced --microphone-distance=nearfield


gcloud alpha ml speech operations describe 1234567890 --format=text > filename.txt
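The two steps above can be chained in a small script. This is a minimal sketch rather than a definitive workflow: the function names `submit_audio` and `fetch_transcript` are hypothetical, only a subset of the flags shown above is used, and it assumes that the asynchronous call returns an operation resource with a `name` field and that the operation exposes a `done` field once it has finished.

```shell
# Hypothetical helper: submit one audio file stored in a bucket and print
# the ID of the long-running operation.
# Usage: submit_audio GCS_URI
submit_audio() {
  # Assumption: the --async response has a "name" field holding the operation ID.
  gcloud alpha ml speech recognize-long-running "$1" \
    --language-code=en-US --async \
    --enable-automatic-punctuation --encoding=linear16 \
    --include-word-confidence --include-word-time-offsets \
    --sample-rate=44100 \
    --format="value(name)"
}

# Hypothetical helper: wait for an operation to finish, then save its
# result in text format.
# Usage: fetch_transcript OPERATION_ID OUTPUT_FILE [POLL_SECONDS]
fetch_transcript() {
  op_id="$1"
  out_file="$2"
  poll_seconds="${3:-30}"

  # Poll until the operation reports done (assumed to print "True").
  while :; do
    done_flag=$(gcloud alpha ml speech operations describe "$op_id" --format="value(done)")
    case "$done_flag" in
      [Tt]rue) break ;;
    esac
    sleep "$poll_seconds"
  done

  gcloud alpha ml speech operations describe "$op_id" --format=text > "$out_file"
}
```

A possible use is `op=$(submit_audio gs://my-bucket/012_MOVIES.wav)` followed by `fetch_transcript "$op" 012_MOVIES.txt`, where `my-bucket` is a placeholder. Note that the loop has no timeout, so an operation that never completes would keep it polling.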

Note that the alpha version of the command is used here (gcloud alpha ml speech recognize-long-running). Other useful features, such as speaker diarization (recognizing multiple speakers), are included in the beta version (gcloud beta ml speech recognize-long-running).

Transcription with Google Cloud can also be done from MATLAB (R2019b and later) using the Audio Toolbox, the Text Analytics Toolbox and the speech2text functions. The function speech2text_from_MATLAB can be used for that. However, it is limited to short speech audio (< 1 min).