Converting Text to Speech with Azure Cognitive Services' REST-Based API

Artificial intelligence (AI) is being applied across an ever-increasing range of services. One of those applications is in the text to speech realm.

Text to speech software and services have been around for a long time, but they have always been known to sound monotone, robotic and emotionless. Microsoft has built a simple REST-based API and a set of language SDKs that use AI to create voices that sound close to human speech. This text to speech service is part of the Cognitive Services suite of products in Azure.

To get started using the text to speech REST API for free, head over to Microsoft's Try Cognitive Services page, click on Speech APIs and then on Get API Key in the Speech Services row. That page will walk you through getting a free API key.

Once you have an API key, you'll need to get an authentication token. To do that, you'll need to know the endpoint to call. If you got your API key via the free trial, your endpoint to get a token will be https://westus.api.cognitive.microsoft.com/sts/v1.0/issuetoken.
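If you're curious what that token request looks like on its own, here is a minimal sketch. It assumes the key is sent in the Ocp-Apim-Subscription-Key header of an empty POST, which is how Microsoft documents the token endpoint. New-TokenRequest is a hypothetical helper name of my own, not part of any module, and the actual network call is left commented out.

```powershell
# Sketch only: New-TokenRequest is a hypothetical helper, not part of AzTextToSpeech.
function New-TokenRequest {
    param(
        [Parameter(Mandatory)][string]$ApiKey,
        # Free-trial token endpoint from the article.
        [string]$TokenEndpoint = 'https://westus.api.cognitive.microsoft.com/sts/v1.0/issuetoken'
    )

    # Return a hashtable that can be splatted onto Invoke-RestMethod.
    # The key travels in a request header; the POST has no body.
    @{
        Uri     = $TokenEndpoint
        Method  = 'Post'
        Headers = @{ 'Ocp-Apim-Subscription-Key' = $ApiKey }
    }
}

# Usage (makes a real network call, so it is commented out here):
# $request = New-TokenRequest -ApiKey '<YourKeyHere>'
# $token   = Invoke-RestMethod @request
```

Building the request as a hashtable first makes it easy to inspect what will be sent before anything goes over the wire.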

To get down to a demo, I've chosen to use PowerShell and the AzTextToSpeech module. The AzTextToSpeech module makes it easy to work with the text to speech API without having to get in the weeds. First, download the AzTextToSpeech module by running Install-Module -Name AzTextToSpeech in a PowerShell console running as administrator.

Note that you will also need the Az.CognitiveServices PowerShell module, which can be installed by running Install-Module Az.
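Before installing anything, it can help to check which of these modules are already on the machine. The sketch below uses only the built-in Get-Module cmdlet. Get-MissingModule is a hypothetical helper name I made up for illustration.

```powershell
# Sketch only: Get-MissingModule is a hypothetical helper for illustration.
function Get-MissingModule {
    param(
        [Parameter(Mandatory)][string[]]$Name
    )

    foreach ($moduleName in $Name) {
        # Get-Module -ListAvailable returns nothing when the module is not installed.
        if (-not (Get-Module -ListAvailable -Name $moduleName)) {
            $moduleName
        }
    }
}

# Module names taken from the article.
$missing = Get-MissingModule -Name 'AzTextToSpeech', 'Az.CognitiveServices'
if ($missing) {
    Write-Host "Not installed yet: $($missing -join ', ')"
}

# Install commands from the article (commented out because they download from the gallery):
# Install-Module -Name AzTextToSpeech
# Install-Module Az
```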

Once the module is installed, head over to C:\Program Files\WindowsPowerShell\Modules\AzTextToSpeech\<Version>\configuration.json and open it up with your favorite text editor. In this JSON file, set the TokenEndpoint attribute to your token endpoint and the SubscriptionRegion attribute to westus, then save and close the file.
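If you'd rather script that edit than open an editor, something like the following could work. This is a sketch under one big assumption: that TokenEndpoint and SubscriptionRegion are top-level properties of a single JSON object in configuration.json. I haven't confirmed the file's full layout here, so check yours first. Set-TtsConfiguration is a hypothetical helper name.

```powershell
# Sketch only: assumes TokenEndpoint and SubscriptionRegion are top-level
# properties in configuration.json. Set-TtsConfiguration is a hypothetical helper.
function Set-TtsConfiguration {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$TokenEndpoint,
        [Parameter(Mandatory)][string]$SubscriptionRegion
    )

    # Read the whole file as one string and parse it into an object.
    $config = Get-Content -Path $Path -Raw | ConvertFrom-Json

    # -Force overwrites the property if it already exists and creates it otherwise.
    $config | Add-Member -NotePropertyName 'TokenEndpoint' -NotePropertyValue $TokenEndpoint -Force
    $config | Add-Member -NotePropertyName 'SubscriptionRegion' -NotePropertyValue $SubscriptionRegion -Force

    # Write the updated object back; -Depth keeps any nested settings intact.
    $config | ConvertTo-Json -Depth 10 | Set-Content -Path $Path
}

# Usage (replace <Version> with the folder name on your machine):
# Set-TtsConfiguration -Path 'C:\Program Files\WindowsPowerShell\Modules\AzTextToSpeech\<Version>\configuration.json' `
#     -TokenEndpoint 'https://westus.api.cognitive.microsoft.com/sts/v1.0/issuetoken' `
#     -SubscriptionRegion 'westus'
```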

Now, run Save-ApiKey -Key <YourKeyHere>, replacing <YourKeyHere> with the key you received when you signed up for the free trial. Save-ApiKey will save the key encrypted in configuration.json. At this point, you've got everything configured and are ready to get a token.

Run Connect-AzTextToSpeech. This will query configuration.json, grab a token and then save it in your current session to be used for all API calls. You're now ready to convert text to speech.

The command to convert text to speech is ConvertTo-Speech. This command requires four parameters:

Text: This is the text that will be converted to speech.
AudioOutput: This is the kind of audio that will be returned. You can cycle through all available options by hitting the Tab key.
VoiceAgent: Cognitive Services has many different voices to choose from. You can cycle through all available voices with the Tab key for this parameter as well.
OutputFile: This is the path to the file that will be created once the speech is rendered.

Altogether, an example of calling the command looks like the following. Notice that I'm also using the optional PassThru parameter, which returns the saved file, and then piping that file to the Invoke-Item command, which immediately opens it in my media player.

PS> ConvertTo-Speech -Text 'Yay! I am in an Ipswitch article!' -AudioOutput 'audio-16khz-128kbitrate-mono-mp3' -VoiceAgent 'Guy24kRUS' -OutputFile 'C:\ipswitch.mp3' -PassThru | Invoke-Item
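That one-liner gets long, so here is the same call written with splatting, which puts each parameter on its own line. The values are the ones from the example above; $speechParams is just a variable name I picked. The call itself is commented out because it needs the AzTextToSpeech module and a prior Connect-AzTextToSpeech.

```powershell
# Same parameters as the one-liner, collected in a hashtable for splatting.
$speechParams = @{
    Text        = 'Yay! I am in an Ipswitch article!'
    AudioOutput = 'audio-16khz-128kbitrate-mono-mp3'
    VoiceAgent  = 'Guy24kRUS'
    OutputFile  = 'C:\ipswitch.mp3'
    PassThru    = $true   # switch parameters are splatted as $true
}

# Requires the AzTextToSpeech module and a token from Connect-AzTextToSpeech:
# ConvertTo-Speech @speechParams | Invoke-Item
```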

Once this command is run, you will have an audio file with a human-sounding voice! If you'd like more information on the AzTextToSpeech PowerShell module, check out the GitHub repo. If you'd like to learn more about text to speech, I also encourage you to check out the Microsoft documentation.
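For the curious, here is a rough sketch of what a text to speech call might look like against the REST API directly, without the module. It is not how AzTextToSpeech is implemented. It assumes the regional endpoint pattern, SSML body and X-Microsoft-OutputFormat header described in Microsoft's REST documentation. New-SpeechRequest, the 'tts-sketch' user agent and the voice name in the usage note are my own placeholders, so look up a valid voice name for your region before trying it.

```powershell
# Sketch only: New-SpeechRequest is a hypothetical helper, not part of AzTextToSpeech.
function New-SpeechRequest {
    param(
        [Parameter(Mandatory)][string]$Token,
        [Parameter(Mandatory)][string]$Text,
        [Parameter(Mandatory)][string]$VoiceName,
        [string]$Region = 'westus',
        [string]$AudioOutput = 'audio-16khz-128kbitrate-mono-mp3'
    )

    # Escape &, <, > and quotes so the text cannot break the SSML document.
    $escapedText = [System.Security.SecurityElement]::Escape($Text)
    $ssml = "<speak version='1.0' xml:lang='en-US'><voice name='$VoiceName'>$escapedText</voice></speak>"

    # Return a hashtable that can be splatted onto Invoke-RestMethod.
    @{
        Uri         = "https://${Region}.tts.speech.microsoft.com/cognitiveservices/v1"
        Method      = 'Post'
        ContentType = 'application/ssml+xml'
        UserAgent   = 'tts-sketch'
        Headers     = @{
            'Authorization'            = "Bearer $Token"
            'X-Microsoft-OutputFormat' = $AudioOutput
        }
        Body        = $ssml
    }
}

# Usage (makes a real network call, so it is commented out; the voice name is an assumption):
# $request = New-SpeechRequest -Token $token -Text 'Hello there' -VoiceName 'en-US-GuyNeural'
# Invoke-RestMethod @request -OutFile 'C:\ipswitch.mp3'
```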

Adam Bertram

Adam Bertram is a 25+ year IT veteran and an experienced online business professional. He’s a successful blogger, consultant, 6x Microsoft MVP, trainer, published author and freelance writer for dozens of publications. For how-to tech tutorials, catch up with Adam at adamtheautomator.com, connect on LinkedIn or follow him on X at @adbertram.