Azure Speech

Azure Speech provides speech-to-text, text-to-speech, speech translation, and other speech processing capabilities. Build voice-enabled applications with high-accuracy speech recognition, natural-sounding speech synthesis, and real-time translation.

Key Capabilities

Speech-to-Text: Convert spoken audio to text with high accuracy
Text-to-Speech: Generate natural-sounding speech from text
Speech Translation: Translate spoken language in real time
Voice Live: Build conversational voice interfaces
Speaker Recognition: Identify and verify speakers by voice
Pronunciation Assessment: Evaluate and improve pronunciation

Speech-to-Text

Transcribe audio to text in real time, in asynchronous batches, or with models customized to your domain:

Real-Time Transcription

Convert microphone or streaming audio to text as it is spoken. The example below captures a single utterance with recognize_once():

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="<your-region>"
)

audio_config = speechsdk.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

print("Say something...")
result = recognizer.recognize_once()

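if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    # Surfaces problems such as an invalid key, wrong region, or network failure
    details = result.cancellation_details
    print(f"Canceled: {details.reason} ({details.error_details})")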

Batch Transcription

Process large audio files asynchronously:

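import time

import requests

# Batch transcription is a REST API; the Speech SDK does not ship a batch client.
# The v3.2 path is assumed here; check which API version your resource supports.
endpoint = "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.2"
headers = {"Ocp-Apim-Subscription-Key": "<your-key>"}

# Create transcription
response = requests.post(
    f"{endpoint}/transcriptions",
    headers=headers,
    json={
        "displayName": "My Transcription",
        "description": "Batch audio transcription",
        "locale": "en-US",
        "contentUrls": ["https://example.com/audio.wav"],
    },
)
response.raise_for_status()
transcription = response.json()

# Wait for completion, stopping on failure instead of looping forever
while transcription["status"] not in ("Succeeded", "Failed"):
    time.sleep(10)
    transcription = requests.get(transcription["self"], headers=headers).json()

if transcription["status"] == "Failed":
    raise RuntimeError("Batch transcription failed")

# List the result files
results = requests.get(transcription["links"]["files"], headers=headers).json()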
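
A polling loop like the one above never exits if the job stalls. The following is a minimal sketch of bounded polling, assuming the job reports one of the REST API status strings (NotStarted, Running, Succeeded, Failed). The function name wait_for_transcription and the fetch_status callable are hypothetical; in practice fetch_status would wrap whatever call returns the current job status.

```python
import time

# Status values after which polling should stop
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_transcription(fetch_status, poll_interval=10.0, timeout=3600.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal state is reached or the timeout expires.

    fetch_status is a hypothetical zero-argument callable returning the current
    status string. sleep and clock are injectable so the loop can be tested
    without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Transcription still '{status}' after {timeout} seconds")
        sleep(poll_interval)
```

The caller still has to check whether the returned status is "Succeeded" before requesting result files.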

Fast Transcription

Quick transcription for pre-recorded audio:

Ultra-fast processing (faster than real-time)
Optimized for recorded files
Lower latency than batch transcription
Ideal for captions and subtitles
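
For the captioning use case, transcribed phrases with timing information can be turned into a subtitle file. The sketch below writes SubRip (SRT) text from a list of (offset_ms, duration_ms, text) tuples. That tuple shape and the helper names are assumptions made for illustration; map the fields of the actual transcription response onto it.

```python
def format_timestamp(ms):
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(phrases):
    """Build SRT text from (offset_ms, duration_ms, text) tuples (hypothetical shape)."""
    blocks = []
    for index, (offset_ms, duration_ms, text) in enumerate(phrases, start=1):
        start = format_timestamp(offset_ms)
        end = format_timestamp(offset_ms + duration_ms)
        # Each cue: sequence number, time range, caption text
        blocks.append(f"{index}\n{start} --> {end}\n{text.strip()}\n")
    # Cues are separated by a blank line
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0, 1500, "Hello"), (1500, 1000, "world")]))
```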

Custom Speech

Improve accuracy for specific scenarios:

Acoustic models: Adapt to noisy environments
Language models: Add domain-specific vocabulary
Pronunciation: Define custom pronunciations
Train with your audio and transcripts

speech_config.endpoint_id = "<your-custom-model-id>"
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

Text-to-Speech

Generate natural-sounding speech from text:

Neural Voices

High-quality, natural voices powered by neural networks:

speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="<your-region>"
)

# Select voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config
)

result = synthesizer.speak_text("Hello, welcome to Azure Speech!")

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")

SSML (Speech Synthesis Markup Language)

Fine-tune speech output:

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="slow" pitch="low">
            This is spoken slowly with a low pitch.
        </prosody>
        <break time="1s"/>
        <prosody rate="fast" pitch="high">
            This is spoken quickly with a high pitch.
        </prosody>
    </voice>
</speak>
"""

result = synthesizer.speak_ssml(ssml)
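
When SSML is assembled from user-supplied text, characters such as & and < must be XML-escaped or the document becomes invalid. A minimal sketch using only the standard library follows; build_ssml is a hypothetical helper, and its defaults simply mirror the voice and language used in the example above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US",
               rate="medium", pitch="medium"):
    """Return an SSML document with the text escaped and wrapped in prosody settings."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escapes &, < and > in the spoken text
        "</prosody></voice></speak>"
    )


if __name__ == "__main__":
    print(build_ssml("Terms & conditions apply", rate="slow"))
```

The returned string can be passed to synthesizer.speak_ssml in place of a hand-written document.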

Voice Styles and Emotions

Express emotions and speaking styles:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            Great news! Your order has been confirmed.
        </mstts:express-as>
        <break time="500ms"/>
        <mstts:express-as style="sad">
            Unfortunately, we're experiencing delays.
        </mstts:express-as>
    </voice>
</speak>

Available Styles:

Cheerful, sad, angry, fearful
Customer service, newscast, assistant
Chat, poetry reading, and more

Custom Neural Voice

Create unique voices for your brand:

Record voice samples (300-2000 utterances)
Train custom neural voice model
Unique brand identity
Consistent voice across applications
Requires Limited Access approval

Batch Synthesis

Generate audio for large texts asynchronously:

# Batch synthesis is exposed through the REST API rather than the Speech SDK.
# This dictionary is the JSON body of the request that creates the job.
batch_request = {
    "displayName": "Batch Synthesis",
    "description": "Long form audio",
    "textType": "PlainText",
    "inputs": [
        {"text": "Long text content to synthesize..."},
    ],
    # For plain-text input, the voice is selected under synthesisConfig
    "synthesisConfig": {
        "voice": "en-US-JennyNeural"
    },
    "properties": {
        "outputFormat": "audio-24khz-96kbitrate-mono-mp3"
    }
}

Speech Translation

Translate speech between languages in real time:

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="<your-key>",
    region="<your-region>"
)

# Set source and target languages
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("de")
translation_config.add_target_language("fr")
translation_config.add_target_language("es")

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

print("Say something in English...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Original: {result.text}")
    for language, translation in result.translations.items():
        print(f"{language}: {translation}")

Speech-to-Speech Translation

Translate and synthesize in target language:

# Enable voice output in target language
translation_config.voice_name = "de-DE-KatjaNeural"

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

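def synthesis_callback(evt):
    # evt.result.audio holds a chunk of the synthesized translation as bytes
    audio = evt.result.audio
    print(f"Received {len(audio)} bytes of translated speech")
    # Play or save audio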

recognizer.synthesizing.connect(synthesis_callback)

Voice Live (Preview)

Build conversational voice interfaces:

Natural, human-like conversations
Fast response times (low latency)
Integration with LLMs
Real-time interaction
Context-aware responses

Language Identification

Automatically detect spoken language:

auto_detect_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "de-DE", "fr-FR"]
)

recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    auto_detect_source_language_config=auto_detect_config,
    audio_config=audio_config
)

result = recognizer.recognize_once()
auto_detect_result = speechsdk.AutoDetectSourceLanguageResult(result)

print(f"Detected language: {auto_detect_result.language}")
print(f"Recognized text: {result.text}")

Pronunciation Assessment

Evaluate speech pronunciation for language learning:

pronunciation_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="Hello world",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
    # Miscue detection (omitted or inserted words) is enabled in the constructor
    enable_miscue=True
)

pronunciation_config.apply_to(recognizer)

result = recognizer.recognize_once()
assessment_result = speechsdk.PronunciationAssessmentResult(result)

print(f"Accuracy: {assessment_result.accuracy_score}")
print(f"Fluency: {assessment_result.fluency_score}")
print(f"Completeness: {assessment_result.completeness_score}")
print(f"Pronunciation: {assessment_result.pronunciation_score}")
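
Beyond the overall scores, word-level detail is useful for telling a learner which words to practice. The sketch below assumes the detailed recognition JSON (for example, the string read from result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult)) contains NBest, Words, and per-word PronunciationAssessment entries with AccuracyScore and ErrorType. Those field names are an assumption to verify against an actual response, and low_scoring_words is a hypothetical helper.

```python
import json


def low_scoring_words(result_json, threshold=60.0):
    """Return (word, accuracy_score, error_type) for words that need practice.

    A word is flagged when its assumed ErrorType is not "None" or its
    AccuracyScore falls below the threshold.
    """
    payload = json.loads(result_json)
    best = payload["NBest"][0]  # top recognition hypothesis
    flagged = []
    for word in best.get("Words", []):
        assessment = word.get("PronunciationAssessment", {})
        score = assessment.get("AccuracyScore")
        error = assessment.get("ErrorType", "None")
        if error != "None" or (score is not None and score < threshold):
            flagged.append((word["Word"], score, error))
    return flagged


if __name__ == "__main__":
    # Hand-written sample in the assumed shape, not real service output
    sample = json.dumps({"NBest": [{"Words": [
        {"Word": "hello", "PronunciationAssessment": {"AccuracyScore": 95.0, "ErrorType": "None"}},
        {"Word": "world", "PronunciationAssessment": {"AccuracyScore": 42.0, "ErrorType": "Mispronunciation"}},
    ]}]})
    print(low_scoring_words(sample))
```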

Text-to-Speech Avatar

Generate videos of photorealistic talking avatars:

Lifelike synthetic avatars
Natural speech and lip-sync
Multiple avatar styles
Real-time or batch generation
Suitable for training, presentations, customer service

Use Cases

Call Centers

Transcribe customer calls
Real-time agent assistance
Sentiment analysis from audio
Automated quality assurance
Multi-language support

Accessibility

Voice dictation for text input
Screen reader integration
Caption generation for videos
Voice-controlled applications
Text-to-speech for visually impaired users

Content Creation

Generate audiobooks from text
Create podcast voiceovers
Produce e-learning narration
Synthesize multilingual content
Avatar-based video production

Customer Service

Voice-enabled chatbots
IVR systems
Virtual assistants
Automated responses
Multi-language support

Language Learning

Pronunciation feedback
Speaking practice
Real-time transcription
Reading assistance
Fluency assessment

SDK Support

Python

pip install azure-cognitiveservices-speech

C#

dotnet add package Microsoft.CognitiveServices.Speech

Java

Maven package: com.microsoft.cognitiveservices.speech:client-sdk

JavaScript

npm install microsoft-cognitiveservices-speech-sdk

C++

Native SDK for C++ applications

Swift/Objective-C

SDK for iOS and macOS apps

Input Requirements

Speech-to-Text

Audio formats: WAV, MP3, OGG, FLAC, OPUS
Sample rate: 8 kHz or 16 kHz (16 kHz recommended)
Channels: Mono or stereo
Bit depth: 16-bit PCM
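
A WAV file can be checked against the values listed above before it is uploaded. This is a minimal sketch using the standard-library wave module; check_wav is a hypothetical helper, and it only validates the sample rate, bit depth, and channel count named in this section.

```python
import wave

SUPPORTED_SAMPLE_RATES = {8000, 16000}  # Hz, from the list above
EXPECTED_SAMPLE_WIDTH = 2               # bytes per sample, i.e. 16-bit PCM


def check_wav(source):
    """Return a list of problems found in a WAV file (path or binary file object)."""
    problems = []
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        if rate not in SUPPORTED_SAMPLE_RATES:
            problems.append(f"sample rate {rate} Hz is not 8 kHz or 16 kHz")
        width = wav.getsampwidth()
        if width != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"bit depth is {width * 8}-bit, expected 16-bit")
        channels = wav.getnchannels()
        if channels not in (1, 2):
            problems.append(f"{channels} channels, expected mono or stereo")
    return problems
```

An empty list means the file matches the listed requirements; compressed formats such as MP3 or OGG would need a different check.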

Text-to-Speech

Text length: Up to 10,000 characters per request
SSML: Supported for fine-tuned control
Output formats: Multiple audio formats available
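
Text longer than the per-request limit has to be split before synthesis. The following sketch splits on sentence boundaries and keeps every chunk within the 10,000-character limit stated above. split_text is a hypothetical helper, and the sentence detection is a simple punctuation-based heuristic rather than a full tokenizer.

```python
import re

MAX_CHARS = 10_000  # per-request limit from the list above


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio concatenated in order.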

Containers

Run Speech services on-premises:

Speech-to-text container
Text-to-speech container
Custom speech container
Neural text-to-speech container
Maintain data privacy
Low-latency local processing

Pricing

Speech-to-Text

Free Tier (F0): 5 hours per month
Standard Tier (S0): Pay per hour of audio
Custom models: Additional costs

Text-to-Speech

Free Tier (F0): 0.5M characters per month
Standard Tier (S0): Pay per million characters
Neural voices: Higher cost than standard
Custom voices: Additional training and hosting
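
To keep an eye on the free-tier allowances above, usage can be compared against the monthly quotas. This is a small arithmetic sketch: the two constants are the figures from this section, while free_tier_remaining and its inputs (seconds of audio transcribed, characters synthesized) are hypothetical and would come from your own usage tracking.

```python
FREE_STT_HOURS = 5.0      # speech-to-text F0 allowance per month
FREE_TTS_CHARS = 500_000  # text-to-speech F0 allowance per month


def free_tier_remaining(stt_seconds_used, tts_chars_used):
    """Return the unused part of each monthly free-tier allowance (never negative)."""
    stt_hours_left = max(0.0, FREE_STT_HOURS - stt_seconds_used / 3600)
    tts_chars_left = max(0, FREE_TTS_CHARS - tts_chars_used)
    return {"stt_hours": round(stt_hours_left, 2), "tts_chars": tts_chars_left}


if __name__ == "__main__":
    # 90 minutes of audio and 120,000 characters used so far this month
    print(free_tier_remaining(5400, 120_000))
```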

Getting Started

Create Resource: Create a Speech resource in the Azure portal
Try Speech Studio: Test features with sample audio at speech.microsoft.com
Install SDK: Install the Speech SDK for your programming language
Build Application: Integrate speech capabilities into your app

Best Practices

Use appropriate audio quality (16 kHz, 16-bit)
Implement noise reduction for better accuracy
Use custom models for domain-specific vocabulary
Cache text-to-speech audio for repeated phrases
Implement retry logic for network failures
Monitor usage and costs
Test with diverse accents and speaking styles
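
Two of the practices above, caching repeated phrases and retrying after network failures, can be sketched together. The code is independent of the Speech SDK: synthesize stands for any hypothetical callable that takes (voice, text) and returns audio bytes, and the exception types that trigger a retry are an assumption to adapt to the errors your client actually raises.

```python
import hashlib
import time


def with_retries(operation, attempts=4, base_delay=0.5,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on the given errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...


class SpeechCache:
    """In-memory cache of synthesized audio keyed by voice and text."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # hypothetical callable: (voice, text) -> bytes
        self._audio = {}

    def get(self, voice, text):
        key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
        if key not in self._audio:
            # Only cache misses reach the service, and those calls are retried
            self._audio[key] = with_retries(lambda: self._synthesize(voice, text))
        return self._audio[key]
```

A persistent store (files or a database) could replace the dictionary when audio has to survive restarts.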
