
Protocol

Secure WebSocket (WSS)

All communication between the client and the server takes place over a Secure WebSocket (WSS) connection.

Message Types

  1. Initial Message (Client to Server)
  2. Audio Data (Client to Server)
  3. Transcription Results (Server to Client)

For detailed message structures, see Message Types.
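
Putting the three message types together, a client connects over WSS, sends its initial configuration, streams audio bytes, and reads back transcription results. The sketch below is illustrative only: the server URL and the fields of the initial message ("uid", "language", "task") are assumptions, not the actual schema, which is documented under Message Types.

import asyncio
import json

import websockets  # third-party: pip install websockets


async def stream_audio(audio_bytes: bytes) -> None:
    # The URL and initial-message fields below are hypothetical placeholders.
    async with websockets.connect("wss://example.com/transcribe") as ws:
        # 1. Initial message (client -> server): JSON configuration
        await ws.send(json.dumps({"uid": "client-1", "language": "en", "task": "transcribe"}))
        # 2. Audio data (client -> server): raw 16 kHz, 16-bit PCM, mono bytes
        await ws.send(audio_bytes)
        # 3. Transcription results (server -> client)
        print(await ws.recv())

A real client would send the audio in chunks as it is captured and read results concurrently rather than once at the end; this sketch only shows the order of the three message types.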

Audio Data Requirements

The server expects incoming audio data to meet the following requirements so it can be processed and transcribed correctly:

Sample Rate:

  • The audio data must have a sample rate of 16kHz (16,000 samples per second). This sample rate provides a balance between audio quality and processing efficiency, making it suitable for speech recognition tasks.

Bit Depth:

  • The audio data should be in 16-bit PCM format. This bit depth ensures that the audio data is represented with sufficient precision for accurate transcription while keeping file sizes manageable.

Channel Configuration:

  • The audio data must be mono (single channel). Stereo or multi-channel audio can introduce unnecessary complexity and data redundancy, so converting to mono ensures uniformity.
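
Before streaming, you can sanity-check whether a WAV file already satisfies all three constraints. This sketch uses Python's standard wave module; the function name is ours, not part of the protocol:

import wave


def meets_audio_requirements(path: str) -> bool:
    """Return True if the WAV file is 16 kHz, 16-bit PCM, mono."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getframerate() == 16000   # sample rate
            and wf.getsampwidth() == 2   # 2 bytes per sample = 16-bit
            and wf.getnchannels() == 1   # single channel (mono)
        )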

Example Code for Resampling Audio

To help you meet these requirements, here's an example that resamples an audio file to 16 kHz, 16-bit PCM, mono using Python and FFmpeg. It relies on the ffmpeg-python, NumPy, and SciPy packages, and requires the ffmpeg binary to be on your PATH:

import os

import ffmpeg  # third-party: ffmpeg-python
import numpy as np
import scipy.io.wavfile


def resample_audio(input_file: str, new_sample_rate: int = 16000) -> str:
    """
    Open an audio file, read it as a mono waveform, resample it to
    new_sample_rate, and save the result as a 16-bit PCM WAV file.
    """
    try:
        # Use ffmpeg to decode the audio as mono, 16-bit PCM, resampled
        output, _ = (
            ffmpeg.input(input_file, threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=new_sample_rate)
            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Error loading audio: {e.stderr.decode()}") from e

    # Interpret the raw PCM bytes as 16-bit signed integers
    np_audio_buffer = np.frombuffer(output, dtype=np.int16)

    # os.path.splitext handles paths that contain extra dots
    modified_audio_file = f"{os.path.splitext(input_file)[0]}_modified.wav"
    scipy.io.wavfile.write(modified_audio_file, new_sample_rate, np_audio_buffer)
    return modified_audio_file
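
Calling the function on any file FFmpeg can decode yields a WAV file that meets all three requirements (the input filename here is illustrative):

wav_path = resample_audio("recording.mp3")  # writes recording_modified.wav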