Generated captions in ScreenKite are word-level. Instead of creating one long subtitle block for a full sentence or clip, ScreenKite creates one caption cue per spoken word. This gives the editor the timing data it needs for short, Screen Studio-style caption reveals and precise agent workflows.

Before You Generate Captions

Open Settings -> Transcription and configure the Word-Level tab:

Choose Automatic for the normal setup. ScreenKite uses ElevenLabs when an API key is configured and otherwise falls back to a downloaded WhisperKit model.
Choose ElevenLabs when you want hosted Scribe word timings.
On an Apple Silicon Mac, choose Local when you want on-device WhisperKit word timestamps from a downloaded model.
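
The selection rules above can be sketched as a small decision function. This is only an illustration of the documented fallback order, not ScreenKite's actual code: the names `Mode` and `resolve_provider` and the boolean inputs are hypothetical, and treating Apple Silicon as a requirement for the local model in Automatic mode is an assumption.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    # Hypothetical names for the three choices on the Word-Level tab.
    AUTOMATIC = "automatic"
    ELEVENLABS = "elevenlabs"
    LOCAL = "local"


def resolve_provider(
    mode: Mode,
    has_elevenlabs_key: bool,
    has_local_model: bool,
    is_apple_silicon: bool,
) -> Optional[str]:
    """Return "elevenlabs", "whisperkit", or None when nothing usable is configured."""
    # Assumption: the local WhisperKit path needs both a downloaded model and Apple Silicon.
    local_ok = has_local_model and is_apple_silicon

    if mode is Mode.ELEVENLABS:
        return "elevenlabs" if has_elevenlabs_key else None
    if mode is Mode.LOCAL:
        return "whisperkit" if local_ok else None

    # Automatic: hosted provider first when a key exists, otherwise the local model.
    if has_elevenlabs_key:
        return "elevenlabs"
    return "whisperkit" if local_ok else None
```

For example, `resolve_provider(Mode.AUTOMATIC, False, True, True)` picks the local model because no API key is configured.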

OpenAI, Groq, and Azure OpenAI are not used for generated caption timing. They can still be configured under Text & Export for AI cleanup, proofreading, or explicit transcript export workflows.

ElevenLabs Key Validation

After entering an API key under ElevenLabs, click Test Key to verify it. The result appears inline next to the button:

Label	Cause
Valid for Speech to Text	Key is accepted and has the required scope.
Invalid API key	HTTP 401 — key is malformed, revoked, or belongs to a different workspace.
Key needs ElevenLabs Speech to Text permission	HTTP 403 — key exists but lacks the required scope. Open the ElevenLabs dashboard and update your API key scopes to include Speech to Text access.
Orange warning (e.g. "ElevenLabs rate limit reached. Try again later.")	HTTP 429 — you have hit the ElevenLabs rate limit. Wait a moment and test again.

The rate-limit and other transient messages can wrap to multiple lines — the label expands vertically to show the full text.
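
As a rough illustration of the table above, the mapping from HTTP status to inline label might look like the following. The function name `label_for_status` is hypothetical, the use of HTTP 200 for the success case is an assumption (the table only says the key is accepted), and the fallback text for other statuses is invented for the sketch.

```python
def label_for_status(status_code: int) -> str:
    """Map an HTTP status from the key test to the label shown next to Test Key."""
    labels = {
        200: "Valid for Speech to Text",  # assumption: success is reported as 200
        401: "Invalid API key",
        403: "Key needs ElevenLabs Speech to Text permission",
        429: "ElevenLabs rate limit reached. Try again later.",
    }
    # Hypothetical fallback for statuses the table does not cover.
    return labels.get(status_code, f"Unexpected response (HTTP {status_code})")
```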

Generate Captions

Open a .skbundle project in the Project Editor.
Make sure the project has microphone, replacement, or main audio.
Use the caption generation action in the editor or ask an agent to generate captions.
ScreenKite transcribes the audio with the configured word-level provider.
ScreenKite imports an SRT where each cue maps to one spoken word.

The result is a caption track made of short word-timed clips instead of sentence-length chunks. If the provider returns no speech, ScreenKite reports that no speech was detected. If the provider returns only sentence segments without word timestamps, caption generation stops instead of creating approximate long captions.
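
To make the "one cue per word" idea concrete, here is a minimal sketch that turns word timings into SRT text. It is not ScreenKite's importer: the `Word` type and the `words_to_srt` function are hypothetical, and raising an error on empty input only stands in for the "no speech detected" report described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word]) -> str:
    """Emit one numbered SRT cue per word."""
    if not words:
        # Stand-in for the "no speech was detected" case.
        raise ValueError("no speech detected")
    cues = []
    for index, word in enumerate(words, start=1):
        timing = f"{srt_timestamp(word.start)} --> {srt_timestamp(word.end)}"
        cues.append(f"{index}\n{timing}\n{word.text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Calling `words_to_srt([Word("Hello", 0.0, 0.42), Word("world", 0.5, 0.9)])` yields two cues, each carrying a single word and its own time range.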

Agent Workflow

Agents use the same word-level caption path as the app. A prompt can be as direct as:

codex "Open ~/Desktop/Recording.skbundle and generate word-level captions from the microphone track"

For transcript cuts, filler-word cleanup, or B-roll planning, the agent can reuse the same word timestamps so cuts and visual beats stay aligned with speech.
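
As an illustration of why word timestamps help here, the sketch below finds time ranges to cut for filler words. The filler list, the `(text, start, end)` tuple shape, and the `filler_cut_ranges` function are all assumptions made for the example and do not describe the agent's real tools.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical filler vocabulary; a real workflow would likely make this configurable.
FILLERS: Set[str] = {"um", "uh", "er"}


def filler_cut_ranges(
    words: Iterable[Tuple[str, float, float]],
    fillers: Set[str] = FILLERS,
) -> List[Tuple[float, float]]:
    """Return merged (start, end) ranges, in seconds, that cover filler words.

    `words` is a sequence of (text, start, end) tuples sorted by start time.
    """
    ranges: List[Tuple[float, float]] = []
    for text, start, end in words:
        # Ignore trailing punctuation and case when matching fillers.
        if text.strip(".,!?").lower() not in fillers:
            continue
        if ranges and start <= ranges[-1][1]:
            # Touching or overlapping fillers collapse into one cut.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges
```

Because each range comes directly from word boundaries, a cut made at those times lines up with the spoken audio rather than with an approximate sentence boundary.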

To trigger caption generation without a terminal, use the built-in AI Chat Assistant — it has access to the same caption tools and requires only a text prompt in the chat panel.

Subtitle Segmentation

Subtitle segmentation controls let you adjust how captions are split into lines. By default, ScreenKite generates one caption cue per word, but you can configure line grouping to produce multi-word subtitle segments that are easier to read on screen.
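
One way to picture line grouping is as a pass over the word cues that starts a new segment when a word limit is reached or a pause is long enough. The sketch below works under that assumption; `group_words`, `max_words`, and `max_gap` are hypothetical names and do not correspond to documented ScreenKite settings.

```python
from typing import List, Tuple

WordCue = Tuple[str, float, float]  # (text, start, end) in seconds


def group_words(
    words: List[WordCue],
    max_words: int = 4,
    max_gap: float = 0.6,
) -> List[WordCue]:
    """Merge word cues (sorted by start) into multi-word segments."""
    groups: List[List[WordCue]] = []
    current: List[WordCue] = []
    for word in words:
        # Pause between the previous word's end and this word's start.
        too_long = len(current) >= max_words
        paused = bool(current) and word[1] - current[-1][2] > max_gap
        if current and (too_long or paused):
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    # Each segment spans from its first word's start to its last word's end.
    return [(" ".join(w[0] for w in g), g[0][1], g[-1][2]) for g in groups]
```

With `max_words=1` the function returns the original one-word cues, which matches the default behavior described above.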

Guided Transcription Setup

The first time you use an AI or transcript feature, ScreenKite shows a guided setup window that walks you through choosing and configuring a transcription provider. This ensures a transcription provider is ready before you try to generate captions or use text-based editing.

Timeline Behavior

Generated captions appear on a Captions track in the timeline. Because every word has its own cue, you can inspect and edit timing at word granularity.

See Timeline & Tracks for track navigation basics and Agentic Video Editing for transcript-driven editing workflows.
