SubGrab

Caption vs Transcript vs Subtitle: What's the Difference?

Captions, transcripts, and subtitles are not the same thing. Understand which one you need for accessibility, SEO, study, or translation.

If you've ever Googled "how to get captions from a video" and ended up with a tool that gave you a transcript instead — or vice versa — you're not alone. The three terms are used interchangeably online, but they describe different things, with different formats, different uses, and different legal implications.

Here's the practical difference, and how to know which one you actually need.

The 30-Second Summary

| Term | What it is | Includes timestamps? | Includes non-speech audio? | Typical file format |
|---|---|---|---|---|
| Transcript | The full spoken text of a video, as plain text | No (or optional) | Usually no | TXT, DOCX |
| Subtitle | Text shown over a video, usually a translation of the dialogue | Yes | No (dialogue only) | SRT, VTT |
| Caption | Text shown over a video, including dialogue AND non-speech audio (laughter, music, door slams) | Yes | Yes | SRT, VTT, SCC |

When You Want a Transcript

A transcript is the right choice when you need to read or process the content as text rather than watch the video.

Common use cases:

  • Studying — review a 90-minute lecture in 5 minutes by skimming the transcript
  • Citation — quote a video accurately in an article, paper, or research note
  • Search — find the exact moment someone said a specific word
  • SEO — turn a video into a blog post that search engines can index
  • AI summarisation — feed the transcript into an LLM for a summary, key points, or action items
  • Translation — translate the text once, then use it as a starting point for subtitles in other languages
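The search use case above can be sketched in a few lines. This is a minimal illustration, assuming the transcript is available as timestamped segments; the `(start_seconds, text)` shape and the `find_mentions` helper are hypothetical and not a documented SubGrab output.

```python
from typing import List, Tuple

# Hypothetical segment shape: (start time in seconds, spoken text).
Segment = Tuple[float, str]


def find_mentions(segments: List[Segment], word: str) -> List[Tuple[str, str]]:
    """Return (mm:ss, text) for every segment whose text contains `word`.

    Matching is a case-insensitive substring test, so "entropy" also
    matches "entropy's".
    """
    needle = word.lower()
    hits = []
    for start, text in segments:
        if needle in text.lower():
            minutes, seconds = divmod(int(start), 60)
            hits.append((f"{minutes:02d}:{seconds:02d}", text))
    return hits


segments = [
    (0.0, "Welcome to the lecture."),
    (75.5, "Today we cover entropy."),
    (605.0, "Entropy always increases in an isolated system."),
]

for timestamp, text in find_mentions(segments, "entropy"):
    print(timestamp, text)
```

A plain TXT transcript supports the same search with any text editor; the timestamps only matter when you want to jump back to that moment in the video.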

If your goal is *reading* or *processing*, you want a transcript.

SubGrab gives you a transcript by default — clean text you can copy, search, or download.

When You Want Subtitles

Subtitles are translations of dialogue, displayed on top of a video. They assume the viewer can hear sound effects and music; they only translate what's being said.

Use subtitles when:

  • You're publishing a video for an audience that doesn't speak the original language (e.g., subtitling an English explainer for a Spanish-speaking audience)
  • You're a fan-translator producing translated versions of foreign content
  • You're a content creator localising your own videos for international audiences

Subtitles are typically delivered as SRT (SubRip) or VTT (WebVTT) files. Both are timestamped plain-text formats that most video players and editors can read.
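The two formats are close enough that a basic conversion is mostly mechanical. The sketch below shows the two visible differences: WebVTT files start with a `WEBVTT` header, and WebVTT timestamps use a dot rather than a comma before the milliseconds. The `srt_to_vtt` helper is a hypothetical name for illustration, and the sketch ignores styling tags, positioning, and byte-order marks.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
SRT_TIMING = re.compile(
    r"(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT.

    Adds the WEBVTT header and swaps the millisecond separator from
    ',' to '.' on timing lines only; cue text is left untouched.
    """
    body = SRT_TIMING.sub(r"\1.\2 --> \3.\4", srt_text.strip())
    return "WEBVTT\n\n" + body + "\n"


sample_srt = """1
00:00:01,000 --> 00:00:04,000
Hello, and welcome.

2
00:00:04,500 --> 00:00:07,250
Today we look at captions.
"""

print(srt_to_vtt(sample_srt))
```

The numeric cue counters from the SRT file are kept; WebVTT treats them as optional cue identifiers.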

SubGrab's transcripts can be downloaded directly as SRT or VTT, ready to drop into a video editor or upload to YouTube.

When You Want Captions

Captions are the most regulated of the three. They're designed for viewers who can't hear the audio or can't rely on it: Deaf and hard-of-hearing audiences, viewers in noisy environments (gyms, airports), or viewers who simply prefer text.

Unlike subtitles, captions include:

  • All spoken dialogue
  • Speaker identification ("ALEX:", "NARRATOR:")
  • Sound effects in brackets ("[door slams]", "[ominous music]")
  • Tone indicators where relevant ("[whispering]", "[sarcastic]")

Captions are required by law in many jurisdictions for broadcast, streaming, and government video content. The usual technical reference is WCAG 2.2 Success Criterion 1.2.2 (Captions, Prerecorded); in the US, see the ADA and Section 508 requirements.

If you're publishing video for commercial use, government use, education, or accessibility compliance, you need true captions, not subtitles.

Common Confusions Cleared Up

"Closed captions" vs "open captions" — Closed captions can be toggled off; open captions are burned into the video frame and can't be disabled. Both contain the same information.

YouTube's auto-generated text is labelled "captions", but it is closer to a basic transcript. It has no speaker labels and, at most, a few generic sound tags such as "[Music]", so it doesn't meet captioning standards on its own.

SRT files are usually called "subtitles" because the format is named after SubRip, a subtitle-extraction tool, but they can contain either subtitles or captions depending on what's inside.
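Since the file extension doesn't tell you which one you have, the only way to know is to look at the cue text. The following is a rough heuristic, not a compliance check: it flags a file as caption-like if any cue contains a bracketed or parenthesised sound description or an upper-case speaker label. The `looks_like_captions` name and both patterns are assumptions made for this example.

```python
import re

# Bracketed or parenthesised sound description: "[door slams]", "(laughter)".
SOUND_CUE = re.compile(r"\[[^\]]+\]|\([^)]+\)")
# Upper-case speaker label at the start of a line: "ALEX:", "NARRATOR:".
SPEAKER_LABEL = re.compile(r"^[A-Z][A-Z .'-]*:", re.MULTILINE)


def looks_like_captions(cue_texts):
    """Heuristic: True if any cue has a sound description or speaker label."""
    return any(
        SOUND_CUE.search(text) or SPEAKER_LABEL.search(text)
        for text in cue_texts
    )


subtitle_cues = ["Where were you last night?", "I was at home."]
caption_cues = ["[door slams]", "ALEX: Where were you last night?"]

print(looks_like_captions(subtitle_cues))  # False
print(looks_like_captions(caption_cues))   # True
```

Dialogue that happens to contain parentheses will trigger a false positive, so treat the result as a hint to inspect the file rather than a verdict.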

"Transcript" in a meeting tool — Zoom, Teams, and Google Meet all produce something called a "transcript" that includes speaker labels and timestamps. This is closer to a *captioned transcript* than a plain transcript.

Which Should You Generate from SubGrab?

When you paste a video URL into SubGrab, you get back a transcript — clean text that's been segmented with timestamps under the hood. From there, you can download in three formats:

  • TXT — plain text, no timestamps. Best for reading, citation, AI processing.
  • SRT — timestamped, ready to use as subtitles or imported into a video editor.
  • VTT — same as SRT but in WebVTT format, which is what HTML5 video players use natively.
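To make the relationship between the formats concrete, here is a minimal sketch of how timestamped SRT text reduces to a plain transcript: drop the cue numbers and timing lines, then join what is left. This is an illustration under simple assumptions (cues separated by blank lines, no styling tags), not a description of how SubGrab produces its TXT export; `srt_to_plain_text` is a hypothetical helper.

```python
def srt_to_plain_text(srt_text: str) -> str:
    """Drop cue numbers and timing lines, keeping only the spoken text."""
    # Normalise Windows line endings so cues split cleanly on blank lines.
    normalised = srt_text.replace("\r\n", "\n").strip()
    kept = []
    for block in normalised.split("\n\n"):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                # Everything after the timing line is cue text.
                kept.extend(l.strip() for l in lines[i + 1:] if l.strip())
                break
    return " ".join(kept)


sample_srt = """1
00:00:01,000 --> 00:00:04,000
Hello, and welcome.

2
00:00:04,500 --> 00:00:07,250
Today we look
at captions.
"""

print(srt_to_plain_text(sample_srt))
# Hello, and welcome. Today we look at captions.
```

Going the other way is not possible without the original timing data, which is why the timestamped formats are the ones to keep if you think you might need subtitles later.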

If you need true broadcast-grade captions with sound-effect descriptions and speaker identification, you'll want to add those manually after exporting — SubGrab gives you the speech-to-text foundation, but the editorial layer is on you.

TL;DR

  • Need to read the video as text? → Transcript
  • Need to translate spoken dialogue for foreign viewers? → Subtitles
  • Need to make a video accessible to Deaf viewers or comply with WCAG? → Captions
  • Just want the words? → SubGrab gives you all three export formats (TXT, SRT, VTT) from one URL.

Try it free →