Is "Good Enough" Killing Innovation in Norwegian Speech-to-Text?

As a consultant at People Made Machines, I see firsthand how organizations want to use AI for efficiency and insight. But when it comes to Norwegian speech-to-text, most models simply aren't delivering the accuracy or nuance that real businesses demand.

For companies investing in digital transformation, this is a clear signal: don't settle for generic models. The Norwegian language, with its linguistic diversity, requires robust systems tailored to our unique context.

I have evaluated nine different models, and the tests show that even the most popular and best-performing models struggle with names, dialects, and conversational flow. The result? Time-consuming manual corrections and missed opportunities for automation.

How the evaluations were done

I used two podcast clips in Norwegian to test the different models and APIs. One is a discussion between two people, and the second is four people catching up, with more crosstalk and a wider range of dialects. The samples were manually transcribed to ensure that no model's output influenced the reference transcripts.
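To make "accuracy" concrete, each model's output can be scored against the manual reference with a word error rate (WER). Below is a minimal sketch of that metric in plain Python; it is not the actual evaluation code from the repo, and the function name is mine:

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed as a word-level edit distance between reference and hypothesis.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A single misheard word in a three-word phrase gives a WER of 1/3, which is why name errors weigh so heavily in short conversational clips.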

The code for the evaluation is available here: github.com/luvogels/…

Best performing models

OpenAI: Whisper large-v3 (Local model)

Overall this is a good transcript that captures the discussions and topics in both samples. It contains several small accuracy issues, including name misspellings and word misinterpretations or substitutions. Some phrases are truncated or altered, and the ending is incomplete compared to the reference transcript. While the key ideas around the main topics are preserved, minor errors occasionally distort the intended meaning. Readability is impacted by grammatical inconsistencies and awkward phrasing that interrupt the flow. Certain colloquial expressions are retained, but interjections are missing, so the overall tone lacks the liveliness of the original and loses some of its stylistic authenticity.
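Running large-v3 locally is straightforward with the openai-whisper package. A minimal sketch of the kind of call used here; the file path is a placeholder, and forcing the language avoids misdetection on short Norwegian clips:

```python
def clean_text(raw: str) -> str:
    """Collapse whitespace in Whisper's output before comparing to the reference."""
    return " ".join(raw.split())

def transcribe_norwegian(audio_path: str) -> str:
    # Local import so the helper above works without the package installed.
    # Requires: pip install -U openai-whisper (plus ffmpeg on the system).
    import whisper
    model = whisper.load_model("large-v3")
    result = model.transcribe(audio_path, language="no")
    return clean_text(result["text"])

if __name__ == "__main__":
    print(transcribe_norwegian("podcast_clip_1.mp3"))
```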

ElevenLabs: Scribe v1 (API)

The transcription model demonstrates good overall performance. Accuracy is somewhat reduced by name errors and misheard words, and completeness suffers from missing details and incomplete segments. The main topics are retained, but preservation of meaning is impacted by awkward phrasing and unclear expressions. Readability is somewhat reduced by grammatical issues, fragmented sentences, and excessive filler words. Style and tone are maintained, capturing a conversational feel. Spelling and name errors are fewer than in the other models.
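Scribe v1 is called over a REST endpoint. This is a hedged sketch; the endpoint path, field names, language code, and response shape reflect my reading of the current ElevenLabs API and should be verified against the official docs:

```python
import os

API_URL = "https://api.elevenlabs.io/v1/speech-to-text"

def scribe_payload(language_code: str = "nor") -> dict:
    # Scribe appears to use ISO 639-3 codes; "nor" for Norwegian is an assumption.
    return {"model_id": "scribe_v1", "language_code": language_code}

def transcribe_scribe(audio_path: str) -> str:
    import requests  # local import; third-party dependency
    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            data=scribe_payload(),
            files={"file": f},
        )
    resp.raise_for_status()
    return resp.json()["text"]
```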

Lower performing models

Google: Gemini 2.0 Flash (API)

The transcript demonstrates decent accuracy, with recurring name errors. It also includes critical factual mistakes and grammatical issues. Completeness varies: while one portion covers the main topics well, another cuts off mid-sentence, leading to loss of content. The preservation of meaning is moderate. Core discussions are conveyed, but some distortions affect clarity. Readability is hindered by typos, inconsistent phrasing, and minor grammatical errors. Style and tone are partially retained, with some interjections preserved, but the overall delivery lacks the natural flow and authenticity of the original due to transcription errors.

OpenAI: GPT-4o transcribe (API)

The transcript shows moderate accuracy, with several recurring issues such as misheard or misspelled names and substitutions. Completeness is limited, with multiple instances of omitted or abruptly cut content, such as missing context and key phrases. Despite these issues, the meaning of the main topics is generally preserved intact. However, readability is undermined by grammatical errors, fragmented phrasing, and awkward constructions that hinder smooth comprehension. Style and tone retention is uneven: while some conversational elements are present, others are distorted or missing, weakening the informal and expressive quality of the original delivery. This model also behaves differently from the others in that its output is less stable between runs; each run is slightly different, with different errors.
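The run-to-run instability can be made visible by transcribing the same clip several times and measuring how much the outputs agree. A sketch using the official openai client (the model id "gpt-4o-transcribe" is taken from the API; the file path is a placeholder):

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Rough character-level agreement between two transcripts (1.0 = identical)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def transcribe_once(audio_path: str) -> str:
    from openai import OpenAI  # local import; requires the openai package and OPENAI_API_KEY
    client = OpenAI()
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="gpt-4o-transcribe", file=f)
    return result.text

if __name__ == "__main__":
    runs = [transcribe_once("podcast_clip_1.mp3") for _ in range(3)]
    for i in range(1, len(runs)):
        print(f"run 1 vs run {i + 1}: {similarity(runs[0], runs[i]):.3f}")
```

A deterministic model would score 1.0 on every pair; this model does not.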

Nasjonalbiblioteket AI Lab: nb-whisper-large (Local model)

The transcript shows moderate accuracy, with several name errors and spelling mistakes. While one portion includes all key topics, other parts show missing or condensed content, affecting overall completeness. Most of the intended meaning is preserved, though certain substitutions and errors, like incorrect units, slightly distort the message. Readability is moderately impacted by awkward phrasing and grammatical mistakes. The conversational tone and stylistic elements are mostly retained, including some informal markers, though occasional errors disrupt their natural flow. It is difficult to recommend this fine-tuned model over Whisper large-v3: it does not offer a significantly better result, and it introduces different errors and misspellings along with a more compressed output.
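For anyone who wants to compare the nb-whisper checkpoints themselves, they are published on the Hugging Face Hub and can be run through a transformers pipeline. A hedged sketch; the model id follows the NbAiLab organisation's naming, and the chunking and language parameters are my choices:

```python
MODEL_ID = "NbAiLab/nb-whisper-large"

def transcribe_nb(audio_path: str, model_id: str = MODEL_ID) -> str:
    from transformers import pipeline  # local import; heavy dependency
    asr = pipeline("automatic-speech-recognition", model=model_id, chunk_length_s=30)
    out = asr(audio_path, generate_kwargs={"language": "no", "task": "transcribe"})
    return out["text"]
```

Swapping `model_id` for the medium, small, tiny, or distil-turbo variants reproduces the rest of this list.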

Nasjonalbiblioteket AI Lab: nb-whisper-large-distil-turbo-beta (Local model)

The transcript demonstrates low accuracy, with numerous name errors and distorted phrases. Completeness is moderate, though several phrases are truncated or unclear, resulting in a loss of nuance. Meaning is partially preserved, but some of the changes relative to the reference introduce confusion and distort key concepts. Readability is notably poor due to awkward phrasing, grammatical mistakes, and misspellings. Style and tone are inconsistently conveyed, with forced translations and unnatural expressions weakening the conversational and informal character of the original.

Nasjonalbiblioteket AI Lab: nb-whisper-medium (Local model)

The transcript contains notable accuracy issues, including incorrect names and word substitutions. Some expressive elements, like repeated exclamations and informal phrases, are omitted, affecting completeness and tone. Preservation of meaning is inconsistent. Core content is mostly present, but awkward rephrasings introduce distortions. Readability is significantly hindered by grammatical errors and unnatural phrasing. The overall style and tone are weakened by mistranscribed words, altered expressions, and the loss of the casual, conversational markers that characterized the original delivery. The output is also significantly more compressed.

Nasjonalbiblioteket AI Lab: nb-whisper-small (Local model)

The transcript has moderate accuracy, with several name errors and distorted phrases. Completeness is limited by missing words and altered expressions, which omit certain nuances while preserving the overall content. Most of the intended meaning is retained, though occasional phrasing errors can lead to minor confusion. Readability is impacted by grammatical issues and awkward constructions that affect flow. The conversational and lively tone is generally maintained, but some expressions come across as forced or imprecise due to errors in transcription. This model has less compression than the nb-whisper-medium, but overall quality is lower.

Nasjonalbiblioteket AI Lab: nb-whisper-tiny (Local model)

The transcript displays low overall quality, with numerous accuracy issues such as incorrect or substituted names and mistranscribed phrases. Completeness is poor, with key names and phrases either omitted or distorted, weakening the coherence of the transcription. Preservation of meaning is inconsistent: the general themes are partially retained, but many context-altering mistakes confuse or distort the intended points. Readability is significantly affected by grammatical errors and awkward constructions, and the style and tone suffer from the loss of humor, casual phrasing, and conversational rhythm. Compression is low, but that only leaves more room for errors.

Conclusion

The best performer for Norwegian in this comparison is ElevenLabs Scribe v1, followed by OpenAI Whisper large-v3. In my tests the specially trained models from the Norwegian National Library were a disappointment: they did not perform better than the generic models, and instead introduced different errors and misspellings while compressing the output. My recommendation is ElevenLabs Scribe v1 if an API fits your needs, and Whisper large-v3 if you need to run locally. For local use the best overall solution is WhisperX: it is much faster than the OpenAI client, includes diarization, which splits the transcript by speaker, and also transcribes Norwegian names slightly better.
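The WhisperX pipeline can be sketched as follows. The function names follow the WhisperX README; the diarization step needs a Hugging Face token for the pyannote models, and the availability of a default Norwegian alignment model is an assumption worth checking:

```python
def format_segments(segments) -> str:
    """Render diarized segments as 'SPEAKER: text' lines."""
    return "\n".join(f"{s.get('speaker', '?')}: {s['text'].strip()}" for s in segments)

def transcribe_with_speakers(audio_path: str, hf_token: str, device: str = "cuda"):
    import whisperx  # local import; heavy dependency (pip install whisperx)
    model = whisperx.load_model("large-v3", device)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, language="no")
    # Align word timestamps, then attach speaker labels from diarization.
    align_model, metadata = whisperx.load_align_model(language_code="no", device=device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)
    diarizer = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    result = whisperx.assign_word_speakers(diarizer(audio), result)
    return result["segments"]  # each segment carries text, timestamps and a speaker id
```

With four speakers in the second sample, the speaker-labeled output alone saves a large share of the manual correction time.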