Our speech-to-text API supports 98 languages and has all the features you love

Written by Lars Damgaard Nielsen

Just a quick note on something cool we’ve been cooking up in-house: MediaCatch Speech-to-Text API version 2.

We initially developed it for internal use, but hey, why not share the goodness?

It’s a neat tool that converts spoken language into text, and the new version supports an impressive 98 languages.

⭐️ The real stars here are our in-house models for 🇩🇰 Danish, 🇸🇪 Swedish, 🇳🇴 Norwegian Bokmål, and 🇳🇴 Norwegian Nynorsk. We’re not bragging (okay, maybe a little), but they’re the most accurate in the world for these languages.

For the other 94 languages, we’ve harnessed the power of the best open-source models out there.

What’s it got?

The API can identify multiple languages in a single file and transcribe them automatically. It also delivers handy metadata like speaker ID, gender, and language. Plus, you get the transcript in raw format or with time-stamped sentences and words, making it a breeze to follow along. Oh, and it generates subtitles for videos, too.
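
To give a feel for the workflow, here’s a minimal sketch of submitting an audio file over HTTP and reading back utterance-level metadata. The endpoint URL, parameter names, and response fields below are illustrative assumptions, not the actual MediaCatch API; the documentation has the real details.

```python
# Illustrative sketch only: endpoint, auth scheme, and response shape are assumed.
import requests

API_URL = "https://api.example.com/speech-to-text/v2"  # hypothetical endpoint
API_KEY = "your-api-key"                               # hypothetical auth token

with open("interview.wav", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},
        data={"output": "sentences"},  # e.g. raw text, sentences, words, or srt
    )
response.raise_for_status()

# A response could carry per-utterance metadata such as speaker, gender, and language.
for utterance in response.json().get("utterances", []):
    print(utterance["speaker"], utterance["language"], utterance["text"])
```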

We’ve positioned this as an enterprise solution, already tackling thousands of hours of transcription daily for our clients. It’s all about high accuracy and stellar uptime.

If you’re interested, you can read the documentation here.

But in plain language, these are the features:

  • Speaker Identification: This feature assigns unique identifiers to each speaker in an audio recording. For instance, different speakers are tagged as SPEAKER_0, SPEAKER_1, etc., which helps in distinguishing who is speaking at any given time in multi-speaker recordings.
  • Gender Classification: The system classifies speakers by gender. It analyzes the voice characteristics and categorizes each speaker as male or female, aiding in demographic analysis and audience segmentation.
  • Language Identification: This functionality enables the system to detect and identify the language being spoken in each utterance. It's designed to recognize multiple languages, making it useful for processing diverse linguistic content.
  • Transcription in Raw Text: The system can transcribe spoken words into a continuous stream of text. This transcription is provided in raw format, meaning it's a complete textual representation of the audio without any segmentation into sentences or utterances.
  • Transcription by Sentence with Timestamps: Transcription is also available segmented by sentences, with each sentence accompanied by a timestamp. This format is useful for users who need to know the exact time a sentence was spoken in the audio.
  • Transcription by Word with Timestamps: In this mode, the system provides transcriptions at a more granular level, breaking down the speech to individual words, each with its own timestamp. This is particularly helpful for detailed analysis or when precise timing for each word is required.
  • SRT Subtitle Format: The system can generate transcripts in the SRT (SubRip Subtitle) file format, using the word timestamps. This format is widely used for creating subtitles for videos and includes both the text and the timing information for each line of dialogue. A small sketch of this conversion follows the list.
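
To make the SRT part concrete, here’s a minimal sketch of how word-level timestamps could be grouped into SubRip blocks. The input shape (a list of word/start/end dicts) is an assumption for illustration; the exact response format is described in the documentation.

```python
# Sketch: turn assumed word-level timestamps into SRT subtitle blocks.

def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def words_to_srt(words: list[dict], max_words_per_line: int = 7) -> str:
    """Group timestamped words into numbered SRT blocks."""
    blocks = []
    for i in range(0, len(words), max_words_per_line):
        chunk = words[i:i + max_words_per_line]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"

# Example with made-up timestamps:
words = [
    {"word": "Hello", "start": 0.00, "end": 0.42},
    {"word": "and", "start": 0.45, "end": 0.58},
    {"word": "welcome", "start": 0.60, "end": 1.05},
]
print(words_to_srt(words))
```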

We love sharing our passion for media and AI in our newsletter. No sales BS - just sharing challenges and insights with you.

Or contact Lars Damgaard Nielsen for more information.

Don't see the specific solution you need above?

Then reach out, and one of our skilled consultants will evaluate whether your problem can be solved with AI. If it can, we’ll find the right AI solution to supercharge your business.