Open your .skbundle recording in ScreenKite, then prompt your AI agent (Claude Code, Codex, Gemini CLI, or any agent with ScreenKite's MCP tools). The agent handles two things: cutting the transcript and generating B-roll with scene layouts. You review and approve; it executes.
For community workflows, prompts, and skill packs: github.com/ScreenKite/awesome-ai-video-editing
Prompting Your Agent
You don't write code. You write a sentence. The agent calls ScreenKite's CLI and MCP tools on your behalf.
Claude Code
# Start an interactive session in your project folder
claude

# Then type:
Open ~/Desktop/Recording.skbundle and do a transcript cut. Plan the cuts first.
# Or one-shot from the terminal
claude "Open ~/Desktop/Recording.skbundle, transcribe the mic with ElevenLabs, plan all cuts before executing"
Codex CLI
codex "Open ~/Desktop/Recording.skbundle and do a transcript cut — plan first, then wait for my approval"
# B-roll in one go
codex "Open ~/Desktop/Recording.skbundle, transcribe and cut, then add medium-density B-roll with a centered layout"
Gemini CLI
gemini "Open ~/Desktop/Recording.skbundle. Transcribe the mic, plan the cuts, and show me the list before touching the timeline."
What the agent actually calls
Under the hood, every session starts with:
# Open the project
'/Applications/ScreenKite.app/Contents/MacOS/ScreenKite' agent project open \
--path ~/Desktop/Recording.skbundle --json
# Read project state
'/Applications/ScreenKite.app/Contents/MacOS/ScreenKite' agent tool call \
--name getProjectState --input-json '{"scope":"summary"}' --json
You can run these yourself to inspect state at any point. --json on every call makes output machine-readable.
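If you want to script that inspection loop yourself, here is a minimal Python sketch that wraps the two calls shown above. Only the binary path, the subcommands, and the `--json` flag come from this doc; the helper names and the JSON-parsing step are illustrative assumptions.

```python
import json
import subprocess

# Path taken from the commands above
SCREENKITE = "/Applications/ScreenKite.app/Contents/MacOS/ScreenKite"

def build_argv(args):
    """Assemble a ScreenKite agent command, always appending --json."""
    return [SCREENKITE, *args, "--json"]

def agent_call(args):
    """Run a ScreenKite agent command and parse its machine-readable output."""
    out = subprocess.run(
        build_argv(args), capture_output=True, text=True, check=True
    ).stdout
    return json.loads(out)

# The same two calls as above (uncomment to run against a real install):
# project = agent_call(["agent", "project", "open", "--path", "Recording.skbundle"])
# state = agent_call(["agent", "tool", "call", "--name", "getProjectState",
#                     "--input-json", '{"scope": "summary"}'])
```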
Skills
Skills are pre-built prompt bundles that teach the agent the full workflow so you don't have to describe it from scratch. Install them once; reference them by name in any session.
Install
npx skills add ScreenKite/awesome-ai-video-editing
Available skills
use-screenkite-advanced-b-roll — Full pipeline: transcribe with ElevenLabs, pack to phrase view, proofread proper nouns, propose visual menu with density bundles, generate Hyperframes compositions in parallel, render to MP4, apply setSceneLayout DSL with magicMove transitions.
claude "use the use-screenkite-advanced-b-roll skill on ~/Desktop/Recording.skbundle. Cute visuals, centered layout, medium density."
video-use — Transcript-focused editing: transcribe, pack, plan cuts, confirm, execute. Also handles color grade, subtitles, and animation overlays via FFmpeg when working outside ScreenKite.
claude "use the video-use skill. Transcribe ~/Desktop/Recording.skbundle and plan a cut."
Invoking a skill in Claude Code
If you have Claude Code open interactively, type the skill name as a slash command:
/use-screenkite-advanced-b-roll
The skill loads its instructions and prompts you for the recording path.
Part 1 — Transcript-Driven Cuts
What the agent does
- Transcribes your microphone track with ElevenLabs Scribe — word-level timestamps, cached so it never re-uploads the same file
- Packs the raw JSON into a readable phrase view (phrases break on silences ≥ 0.5s)
- Proofreads every product name or proper noun via web search — ASR regularly mishears names (e.g. "ScreenKite" transcribed as "Screencast"); wrong names spread into every downstream caption and visual
- Proposes a cut list with exact time ranges and a plain-English reason for each cut
- Waits for your approval before touching the timeline
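The "pack" step above can be sketched in a few lines: group word-level timestamps into phrases, breaking whenever the silence between words reaches 0.5s. The word-dict shape below is an assumption for illustration; the real ElevenLabs JSON may differ.

```python
def pack_phrases(words, gap=0.5):
    """Group word-level timestamps into phrases, breaking at silences >= gap seconds.

    `words` is assumed to be [{"text": str, "start": float, "end": float}, ...]
    in transcript order (an illustrative shape, not the exact ElevenLabs schema).
    """
    groups, current = [], []
    for w in words:
        # Start a new phrase when the gap since the previous word is long enough
        if current and w["start"] - current[-1]["end"] >= gap:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return [
        {"start": g[0]["start"], "end": g[-1]["end"],
         "text": " ".join(w["text"] for w in g)}
        for g in groups
    ]

words = [
    {"text": "Hello.", "start": 0.06, "end": 0.66},
    {"text": "Hello.", "start": 2.14, "end": 2.56},   # 1.48s silence -> new phrase
    {"text": "This", "start": 2.70, "end": 2.90},     # 0.14s gap -> same phrase
    {"text": "is", "start": 2.95, "end": 3.05},
    {"text": "ScreenKite.", "start": 3.10, "end": 3.80},
]
print(pack_phrases(words))
```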
Sample prompt
Open ~/Desktop/Recording.skbundle and transcribe + cut. Show me the cut plan first, don't touch the timeline yet. ElevenLabs key is in ~/.config/env/elevenlabs.env
The agent returns something like:
[000.06–000.66] "HelloPro."      → false start
[002.14–002.56] "Hello."         → second false start
--- CUT [0 → 2.98s] ---
[011.84–012.18] "Uh,"            → filler      CUT [8.72–10.00]
[043.60–045.16] "And let's see." → transition  CUT [40.48–42.60]

Result: 58.8s → 51.2s
Apply these 3 cuts?
Reply yes and all cuts apply in one call via editTimeline(action: "cut", {ranges: [...]}).
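How `editTimeline` applies the ranges internally isn't documented here, but the arithmetic behind the "Result" line is simple to sketch: subtract the total span of the removed ranges from the original runtime. The ranges below are toy values, not the exact ones from the sample plan.

```python
def duration_after_cuts(total, ranges):
    """Runtime in seconds after removing non-overlapping (start, end) ranges."""
    removed = sum(end - start for start, end in ranges)
    return round(total - removed, 2)

# Toy ranges for illustration
cuts = [(0.0, 2.98), (8.72, 10.00), (40.48, 42.60)]
print(duration_after_cuts(58.8, cuts))  # 52.42
```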
What gets cut
- False starts — anything before the real first sentence (mic checks, repeated greetings)
- Filler words — isolated "Uh," "Um," "Like" with sufficient silence on both sides
- Transition phrases — "And let's see," "OK so," "Anyway" that pad between beats
The agent never cuts mid-word, pads every cut edge 100–150ms from word boundaries, and prefers silences ≥ 400ms as cut targets.
Part 2 — Automatic B-Roll Generation
After cuts, the agent maps the transcript to beats and generates an animated visual for each using Hyperframes (HTML + GSAP → MP4). Each visual is placed as a scene layout in ScreenKite with a magicMove transition.
Layout styles
Corner PiP — screen recording fills the canvas, B-roll appears as a corner accent (40–42% width). Best for tutorials where the screen content is the main story.
Centered B-roll — screen recording minimizes to top-left (~38%), B-roll plays centered (~56% width). Best for product intros where the visual should be prominent.
# Corner PiP (default)
claude "add B-roll with corner layout"

# Centered
claude "add B-roll — minimize the screen to top left, B-roll centered, medium density, cute visuals"
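The width figures above translate directly into pixel sizes on the 1920×1080 canvas. A small sketch, using the percentages from the two layout descriptions; the dict keys and anchor names are illustrative, not ScreenKite's actual DSL parameters.

```python
# Rough normalized geometry for the two layout styles described above.
# Widths come from the doc; anchor names are assumptions for illustration.
LAYOUTS = {
    "corner-pip": {"screen_width": 1.00, "broll_width": 0.41, "broll_anchor": "corner"},
    "centered":   {"screen_width": 0.38, "broll_width": 0.56, "broll_anchor": "center"},
}

def broll_pixels(style, canvas_width=1920):
    """Pixel width of the B-roll clip for a given layout style."""
    return round(LAYOUTS[style]["broll_width"] * canvas_width)

print(broll_pixels("corner-pip"))  # 787
print(broll_pixels("centered"))    # 1075
```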
What the agent does
- Beat mapping — maps cut transcript phrases to beats: product name, key feature, workflow, CTA
- Density choice — proposes Sparse (4), Medium (7), or Dense (10); shows a slot menu; waits for your pick
- Parallel generation — dispatches one sub-agent per slot simultaneously; each writes a full 1920×1080 Hyperframes composition
- Serial renders — renders each slot to MP4 in sequence (parallel Chrome spawns corrupt frames)
- DSL application — calls setSceneLayout for each time window with your chosen layout
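The exact setSceneLayout schema isn't shown in this doc, so here is a guessed-shape payload builder for one B-roll window. Every field name below is an assumption; only the tool name, the layout styles, and the magicMove transition come from the text.

```python
import json

def scene_layout_input(start, end, slot_video, layout="centered",
                       transition="magicMove"):
    """Build an --input-json payload for one B-roll window.

    Field names are illustrative -- the real setSceneLayout schema is
    defined by ScreenKite's MCP tools, not by this doc.
    """
    return json.dumps({
        "timeRange": {"start": start, "end": end},
        "layout": layout,
        "broll": {"source": slot_video, "transition": transition},
    })

payload = scene_layout_input(18.0, 24.5, "slot-04.mp4")
print(payload)
```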
The visual contract
Every generated visual follows these rules:
- Full-frame content — the 1920×1080 MP4 is the PiP frame; content fills it edge-to-edge (placing a small card inside a mostly empty frame buries it in a corner-of-a-corner)
- Entry → hold → no internal exit — visuals animate in (0–1.5s), settle into a readable hold, and stop. magicMove handles the exit; internal fade-outs produce a broken double-exit.
- Large typography — display text 160–220px, body 48–72px; at 40–56% width this stays legible on screen
Density bundles
| Bundle | Slots | Spacing | Feel |
|---|---|---|---|
| Sparse | 4 | ~13s apart | Clean, documentary |
| Medium | 7 | ~7s apart | Balanced (default) |
| Dense | 10 | ~5s apart | Explainer energy |
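The spacing column follows from slot count divided into the cut runtime. The real agent anchors slots to transcript beats rather than spacing them evenly, but an even-spacing sketch reproduces the table's figures for a 51.2s cut:

```python
# Slot counts from the density table above
DENSITY = {"sparse": 4, "medium": 7, "dense": 10}

def slot_times(duration, density="medium"):
    """Evenly spaced slot start times across the cut runtime (seconds).

    An approximation: the agent actually places slots on transcript beats.
    """
    n = DENSITY[density]
    spacing = duration / n
    return [round(i * spacing, 1) for i in range(n)]

print(slot_times(51.2, "medium"))  # [0.0, 7.3, 14.6, 21.9, 29.3, 36.6, 43.9]
```

At 51.2s, sparse works out to ~12.8s between slots and dense to ~5.1s, matching the "~13s" and "~5s" rows.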
Sample prompt
Recording is cut. Add B-roll:
- Centered layout (screen top-left, B-roll center)
- Medium density
- Cute, warm visuals
- All text in English
Iterating on one slot
Slot 3 should show a Swift logo instead of the Apple emoji. Re-render slot 3 and re-apply.
The agent regenerates only that slot and re-applies its DSL window. Everything else stays.
Putting It Together
# 1. Start Claude Code in your project folder
claude

# 2. Transcription cut
"Open ~/Desktop/Recording.skbundle. Transcribe and plan cuts. ElevenLabs key at ~/.config/env/elevenlabs.env"
# → review cut list → "yes"

# 3. B-roll
"Add B-roll — centered layout, medium density, cute English visuals"
# → review 7-slot beat menu → "Medium, looks good"
# → agent generates in parallel, renders serially, applies DSL (~3 min)

# 4. Spot-check
"Show me slot 4 at 18s"
# → scrub in ScreenKite

# 5. Tweak if needed
"Slot 4 — change the node diagram to use mint green for all nodes"
Total hands-on time: under 5 minutes. Render time: ~2–3 minutes for 7 slots.
For more workflows, sample prompts, and community skills: github.com/ScreenKite/awesome-ai-video-editing