SubGrab

How to Transcribe a Video in Spanish, French, or 90+ Other Languages

Step-by-step guide to getting accurate transcripts of foreign-language videos using AI — including handling mixed-language content, dialects, and translation.

Most "video transcript" tools assume your video is in English. If you've ever tried to transcribe a Spanish-language YouTube interview, a Korean K-drama clip, or a Japanese tutorial, you've probably hit one of two walls: (a) the tool returns an English phonetic mess, or (b) it returns nothing at all.

The shift in the last 18 months is that AI transcription models are now genuinely multilingual — not "English plus a few extras", but trained from the ground up on dozens of languages. Here's how to use them well.

Which Languages Actually Work

SubGrab uses OpenAI's Whisper model under the hood. Whisper was trained on 680,000 hours of audio across 99 languages. Accuracy varies by language depending on how much training data was available.

In practice, languages fall into three tiers:

Tier 1 — Excellent (95%+ accuracy on clear speech):

English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Japanese, Chinese (Mandarin), Korean, Arabic, Polish, Turkish

Tier 2 — Good (85–95% accuracy):

Hindi, Indonesian, Vietnamese, Thai, Hebrew, Czech, Greek, Swedish, Danish, Norwegian, Finnish, Hungarian, Romanian, Ukrainian, Bulgarian

Tier 3 — Workable but expect to edit (70–85%):

Bengali, Tamil, Telugu, Marathi, Urdu, Persian, Swahili, and most of the long tail of supported languages

If your language isn't on this list, the transcript will likely come back garbled. The full supported list is in the Whisper paper appendix.

How to Get the Best Results

Three things matter, in order:

### 1. Audio Quality > Everything

A clean recording in Tier 3 Bengali will beat a noisy recording in Tier 1 English. Background music, multiple overlapping speakers, low bitrate, or compression artifacts all hurt accuracy more than the choice of language model does.

If the source video has clear single-speaker speech, expect close to the high end of its tier. If it's a chaotic group call with music in the background, expect the low end.

### 2. Let the AI Auto-Detect Language

You don't need to tell SubGrab what language a video is in — Whisper detects it from the first 30 seconds of audio and transcribes in that language. This works correctly more than 95% of the time.

The exception is mixed-language content — videos that switch between two languages mid-sentence. Whisper picks the dominant one. If your video is 70% English with Spanish quotes interleaved, you'll get an English transcript with the Spanish bits rendered as phonetic English. There's no perfect fix for this; the workaround is to manually transcribe the second-language segments afterwards.

### 3. Long Videos Are Fine

For videos over 45 minutes, SubGrab automatically chunks the audio and transcribes in parallel. Language detection happens once on the first chunk; the detected language is then carried into all subsequent chunks for consistency. This avoids the failure mode where a model "loses the language" partway through a long video.
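The chunking flow can be sketched like this. A minimal illustration, not SubGrab's actual code: `detect_language` and `transcribe_chunk` are stand-ins for the real Whisper calls, and the chunk size is an assumption.

```python
# Sketch of chunked, parallel transcription with the language pinned
# from the first chunk. The helper bodies are illustrative stubs.
from concurrent.futures import ThreadPoolExecutor

def detect_language(chunk: bytes) -> str:
    """Stand-in for Whisper's language detection on the first chunk."""
    return "es"  # assumed result for illustration

def transcribe_chunk(chunk: bytes, language: str) -> str:
    """Stand-in for a Whisper call forced to one language."""
    return f"[{language}] transcript of {len(chunk)} bytes"

def transcribe_long_audio(chunks: list[bytes]) -> str:
    # Detect the language once, on the first chunk only...
    language = detect_language(chunks[0])
    # ...then carry it into every chunk, so the model can't
    # "lose the language" partway through a long video.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda c: transcribe_chunk(c, language), chunks)
    return "\n".join(parts)
```

The key design choice is that only the first chunk ever influences language detection; every later chunk inherits that decision instead of re-detecting.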

A Worked Example: Spanish YouTube Interview

Let's say you want to transcribe a 25-minute Spanish-language interview from a Mexican YouTube channel.

1. Copy the YouTube URL.

2. Paste it into [SubGrab](/).

3. If YouTube has Spanish captions for the video, SubGrab will pull them for free in seconds — no AI needed. Captions on Spanish-language YouTube are often manually added by the channel and tend to be very accurate.

4. If captions aren't available, click "Use AI Transcription" (1 credit). The transcript comes back in Spanish, with timestamps.

5. Download as TXT for reading, SRT for subtitles, or VTT for web playback.

6. (Optional) Run the Spanish text through DeepL or Google Translate for a side-by-side English version.

The whole flow takes about 90 seconds for a 25-minute video.
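If you download the SRT and later decide you just want readable text, stripping the cue numbers and timestamps takes a few lines. A minimal sketch — real SRT files can also contain styling tags this doesn't handle:

```python
# Convert SRT subtitle content to plain reading text by dropping
# cue indices, timing lines, and blank separators.
import re

TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_txt(srt: str) -> str:
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue  # skip structure lines, keep only spoken text
        kept.append(line)
    return " ".join(kept)
```

Feeding it a two-cue Spanish SRT returns the joined sentences as one paragraph, ready to paste into a translator.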

Special Cases

### Chinese (Mandarin vs Cantonese)

Whisper's "Chinese" mode is trained primarily on Mandarin. It will *attempt* Cantonese videos and the result is usually understandable but imperfect — expect 75–85% accuracy. For Cantonese-heavy content (Hong Kong news, Cantonese vlogs), the transcript will need editing.

### Indian Languages

Hindi, Tamil, Telugu, Marathi, Bengali, Punjabi, and Gujarati are all supported. Hindi is the most accurate; the South Indian languages tend to have lower accuracy because there's less training data. Code-switched content (Hinglish, for example, where Hindi and English mix mid-sentence) is a known weak spot — Whisper will pick one language and transliterate the other.

### Japanese and Korean

Both work very well, including for casual conversational speech. The transcripts come back in native script (Japanese: kanji + hiragana + katakana; Korean: Hangul). If you need romanised output, run the result through a separate romaniser tool.

### Arabic

Modern Standard Arabic works well. Regional dialects (Egyptian, Levantine, Gulf) work but with lower accuracy and the model will sometimes "correct" dialect words to MSA equivalents. For dialect-heavy content, expect to edit.

What About Translation?

SubGrab transcribes in the original language of the video. We don't translate by default, because:

1. Translation quality varies hugely by language pair, and forcing it would silently degrade results

2. Many users want the original-language transcript (for citation, language learning, or to translate themselves with a tool they trust)

3. Modern translation tools (DeepL, Google Translate, ChatGPT) are very good and you probably already have a preference

If you want a translation, copy the SubGrab transcript and paste it into your translator of choice. For long transcripts, DeepL gave the cleanest results in our testing for European languages; ChatGPT and Claude tend to work better for Asian languages because they can use surrounding context.
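Most translators cap how much text you can paste at once, so a long transcript usually needs splitting first. A minimal sketch that splits on sentence boundaries so no sentence is cut in half — the 5,000-character default is an assumption, not any particular tool's documented limit:

```python
# Split a long transcript into translator-sized chunks, keeping
# whole sentences together. The limit is an assumed placeholder.
def split_for_translation(text: str, limit: int = 5000) -> list[str]:
    sentences = text.replace("\n", " ").split(". ")
    chunks, current = [], ""
    for s in sentences:
        piece = s if s.endswith(".") else s + "."
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + len(piece) + 1 > limit:
            chunks.append(current.strip())
            current = ""
        current += piece + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Translate each chunk separately, then rejoin the results in order.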

Pricing

Multilingual transcription is the same price as English: 1 credit per video up to 60 minutes (longer videos use 1 credit per 60-minute chunk). Every new SubGrab account gets 2 free credits.
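The credit arithmetic, as we read the rule above: every started 60-minute chunk costs one credit (we're assuming a partial final chunk still costs a full credit).

```python
# Credits for a video of a given length, per the 60-minute-chunk rule.
import math

def credits_needed(minutes: float) -> int:
    # One credit per started 60-minute chunk, minimum one credit.
    return max(1, math.ceil(minutes / 60))

credits_needed(25)  # → 1
credits_needed(90)  # → 2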

Try it free — paste any video in any language →