Open your .skbundle recording in ScreenKite, then prompt your AI agent (Claude Code, Codex, Gemini CLI, or any agent with ScreenKite's MCP tools). The agent handles two things: cutting the transcript and generating B-roll with scene layouts. You review and approve; it executes.
For community workflows, prompts, and skill packs: github.com/ScreenKite/awesome-ai-video-editing
Prompting Your Agent
You don't write code. You write a sentence. The agent calls ScreenKite's CLI and MCP tools on your behalf.
Claude Code
# Start an interactive session in your project folder
claude

# Then type:
Open ~/Desktop/Recording.skbundle and do a transcript cut. Plan the cuts first.
# Or one-shot from the terminal
claude "Open ~/Desktop/Recording.skbundle, transcribe the mic with ElevenLabs, plan all cuts before executing"
Codex CLI
codex "Open ~/Desktop/Recording.skbundle and do a transcript cut — plan first, then wait for my approval"
# B-roll in one go
codex "Open ~/Desktop/Recording.skbundle, transcribe and cut, then add medium-density B-roll with a centered layout"
Gemini CLI
gemini "Open ~/Desktop/Recording.skbundle. Transcribe the mic, plan the cuts, and show me the list before touching the timeline."
What the agent actually calls
Under the hood, every session starts with:
# Open the project
'/Applications/ScreenKite.app/Contents/MacOS/ScreenKite' agent project open \
--path ~/Desktop/Recording.skbundle --json
# Read project state
'/Applications/ScreenKite.app/Contents/MacOS/ScreenKite' agent tool call \
--name getProjectState --input-json '{"scope":"summary"}' --json
You can run these yourself to inspect state at any point. --json on every call makes output machine-readable.
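If you want to script that inspection loop yourself, here is a minimal Python sketch that wraps the two calls shown above. Only the binary path, the subcommands, and the `--json` flag come from this doc; the helper names and the JSON-parsing step are illustrative assumptions.

```python
import json
import subprocess

# Path taken from the commands above
SCREENKITE = "/Applications/ScreenKite.app/Contents/MacOS/ScreenKite"

def build_argv(args):
    """Assemble a ScreenKite agent command, always appending --json."""
    return [SCREENKITE, *args, "--json"]

def agent_call(args):
    """Run a ScreenKite agent command and parse its machine-readable output."""
    out = subprocess.run(
        build_argv(args), capture_output=True, text=True, check=True
    ).stdout
    return json.loads(out)

# The same two calls as above (uncomment to run against a real install):
# project = agent_call(["agent", "project", "open", "--path", "Recording.skbundle"])
# state = agent_call(["agent", "tool", "call", "--name", "getProjectState",
#                     "--input-json", '{"scope": "summary"}'])
```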
Skills
Skills are pre-built prompt bundles that teach the agent the full workflow so you don't have to describe it from scratch. Install them once; reference them by name in any session.
Install
npx skills add ScreenKite/awesome-ai-video-editing
Available skills
use-screenkite-advanced-b-roll — Full pipeline: transcribe with ElevenLabs, pack to phrase view, proofread proper nouns, propose visual menu with density bundles, generate Hyperframes compositions in parallel, render to MP4, apply setSceneLayout DSL with magicMove transitions.
claude "use the use-screenkite-advanced-b-roll skill on ~/Desktop/Recording.skbundle. Cute visuals, centered layout, medium density."
video-use — Transcript-focused editing: transcribe, pack, plan cuts, confirm, execute. Also handles color grade, subtitles, and animation overlays via FFmpeg when working outside ScreenKite.
claude "use the video-use skill. Transcribe ~/Desktop/Recording.skbundle and plan a cut."
Invoking a skill in Claude Code
If you have Claude Code open interactively, type the skill name as a slash command:
/use-screenkite-advanced-b-roll
The skill loads its instructions and prompts you for the recording path.
Part 1 — Transcript-Driven Cuts
What the agent does
- Transcribes your microphone track with ElevenLabs Scribe — word-level timestamps, cached so it never re-uploads the same file
- Packs the raw JSON into a readable phrase view (phrases break on silences ≥ 0.5s)
- Proofreads every product name or proper noun via web search — ASR regularly mishears names (e.g. "ScreenKite" transcribed as "Screencast"); wrong names spread into every downstream caption and visual
- Proposes a cut list with exact time ranges and a plain-English reason for each cut
- Waits for your approval before touching the timeline
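The "pack" step above can be sketched in a few lines: group word-level timestamps into phrases, breaking whenever the silence between words reaches 0.5s. The word-dict shape below is an assumption for illustration; the real ElevenLabs JSON may differ.

```python
def pack_phrases(words, gap=0.5):
    """Group word-level timestamps into phrases, breaking at silences >= gap seconds.

    `words` is assumed to be [{"text": str, "start": float, "end": float}, ...]
    in transcript order (an illustrative shape, not the exact ElevenLabs schema).
    """
    groups, current = [], []
    for w in words:
        # Start a new phrase when the gap since the previous word is long enough
        if current and w["start"] - current[-1]["end"] >= gap:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return [
        {"start": g[0]["start"], "end": g[-1]["end"],
         "text": " ".join(w["text"] for w in g)}
        for g in groups
    ]

words = [
    {"text": "Hello.", "start": 0.06, "end": 0.66},
    {"text": "Hello.", "start": 2.14, "end": 2.56},   # 1.48s silence -> new phrase
    {"text": "This", "start": 2.70, "end": 2.90},     # 0.14s gap -> same phrase
    {"text": "is", "start": 2.95, "end": 3.05},
    {"text": "ScreenKite.", "start": 3.10, "end": 3.80},
]
print(pack_phrases(words))
```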
Sample prompt
Open ~/Desktop/Recording.skbundle and transcribe + cut. Show me the cut plan first, don't touch the timeline yet. ElevenLabs key is in ~/.config/env/elevenlabs.env
The agent returns something like:
[000.06–000.66] "HelloPro."      → false start
[002.14–002.56] "Hello."         → second false start
--- CUT [0 → 2.98s] ---
[011.84–012.18] "Uh,"            → filler      CUT [8.72–10.00]
[043.60–045.16] "And let's see." → transition  CUT [40.48–42.60]

Result: 58.8s → 51.2s
Apply these 3 cuts?
Reply yes and all cuts apply in one call via editTimeline(action: "cut", {ranges: [...]}).
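How `editTimeline` applies the ranges internally isn't documented here, but the arithmetic behind the "Result" line is simple to sketch: subtract the total span of the removed ranges from the original runtime. The ranges below are toy values, not the exact ones from the sample plan.

```python
def duration_after_cuts(total, ranges):
    """Runtime in seconds after removing non-overlapping (start, end) ranges."""
    removed = sum(end - start for start, end in ranges)
    return round(total - removed, 2)

# Toy ranges for illustration
cuts = [(0.0, 2.98), (8.72, 10.00), (40.48, 42.60)]
print(duration_after_cuts(58.8, cuts))  # 52.42
```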
What gets cut
- False starts — anything before the real first sentence (mic checks, repeated greetings)
- Filler words — isolated "Uh," "Um," "Like" with sufficient silence on both sides
- Transition phrases — "And let's see," "OK so," "Anyway" that pad between beats
The agent never cuts mid-word, pads every cut edge 100–150ms from word boundaries, and prefers silences ≥ 400ms as cut targets.
Part 2 — Automatic B-Roll Generation
After cuts, the agent maps the transcript to beats and generates an animated visual for each using Hyperframes (HTML + GSAP → MP4). Each visual is placed as a scene layout in ScreenKite with a magicMove transition.
Layout styles
Corner PiP — screen recording fills the canvas, B-roll appears as a corner accent (40–42% width). Best for tutorials where the screen content is the main story.
Centered B-roll — screen recording minimizes to top-left (~38%), B-roll plays centered (~56% width). Best for product intros where the visual should be prominent.
# Corner PiP (default)
claude "add B-roll with corner layout"

# Centered
claude "add B-roll — minimize the screen to top left, B-roll centered, medium density, cute visuals"
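The width figures above translate directly into pixel sizes on the 1920×1080 canvas. A small sketch, using the percentages from the two layout descriptions; the dict keys and anchor names are illustrative, not ScreenKite's actual DSL parameters.

```python
# Rough normalized geometry for the two layout styles described above.
# Widths come from the doc; anchor names are assumptions for illustration.
LAYOUTS = {
    "corner-pip": {"screen_width": 1.00, "broll_width": 0.41, "broll_anchor": "corner"},
    "centered":   {"screen_width": 0.38, "broll_width": 0.56, "broll_anchor": "center"},
}

def broll_pixels(style, canvas_width=1920):
    """Pixel width of the B-roll clip for a given layout style."""
    return round(LAYOUTS[style]["broll_width"] * canvas_width)

print(broll_pixels("corner-pip"))  # 787
print(broll_pixels("centered"))    # 1075
```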
What the agent does
- Beat mapping — maps cut transcript phrases to beats: product name, key feature, workflow, CTA
- Density choice — proposes Sparse (4), Medium (7), or Dense (10); shows a slot menu; waits for your pick
- Parallel generation — dispatches one sub-agent per slot simultaneously; each writes a full 1920×1080 Hyperframes composition
- Serial renders — renders each slot to MP4 in sequence (parallel Chrome spawns corrupt frames)
- DSL application — calls setSceneLayout for each time window with your chosen layout
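The exact setSceneLayout schema isn't shown in this doc, so here is a guessed-shape payload builder for one B-roll window. Every field name below is an assumption; only the tool name, the layout styles, and the magicMove transition come from the text.

```python
import json

def scene_layout_input(start, end, slot_video, layout="centered",
                       transition="magicMove"):
    """Build an --input-json payload for one B-roll window.

    Field names are illustrative -- the real setSceneLayout schema is
    defined by ScreenKite's MCP tools, not by this doc.
    """
    return json.dumps({
        "timeRange": {"start": start, "end": end},
        "layout": layout,
        "broll": {"source": slot_video, "transition": transition},
    })

payload = scene_layout_input(18.0, 24.5, "slot-04.mp4")
print(payload)
```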
The visual contract
Every generated visual follows these rules:
- Full-frame content — the 1920×1080 MP4 is the PiP frame; content fills it edge-to-edge (placing a small card inside a mostly empty frame buries it in a corner-of-a-corner)
- Entry → hold → no internal exit — visuals animate in (0–1.5s), settle into a readable hold, and stop. magicMove handles the exit; internal fade-outs produce a broken double-exit.
- Large typography — display text 160–220px, body 48–72px; at 40–56% width this stays legible on screen
Density bundles
| Bundle | Slots | Spacing | Feel |
|---|---|---|---|
| Sparse | 4 | ~13s apart | Clean, documentary |
| Medium | 7 | ~7s apart | Balanced (default) |
| Dense | 10 | ~5s apart | Explainer energy |
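The spacing column follows from slot count divided into the cut runtime. The real agent anchors slots to transcript beats rather than spacing them evenly, but an even-spacing sketch reproduces the table's figures for a 51.2s cut:

```python
# Slot counts from the density table above
DENSITY = {"sparse": 4, "medium": 7, "dense": 10}

def slot_times(duration, density="medium"):
    """Evenly spaced slot start times across the cut runtime (seconds).

    An approximation: the agent actually places slots on transcript beats.
    """
    n = DENSITY[density]
    spacing = duration / n
    return [round(i * spacing, 1) for i in range(n)]

print(slot_times(51.2, "medium"))  # [0.0, 7.3, 14.6, 21.9, 29.3, 36.6, 43.9]
```

At 51.2s, sparse works out to ~12.8s between slots and dense to ~5.1s, matching the "~13s" and "~5s" rows.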
Sample prompt
Recording is cut. Add B-roll:
- Centered layout (screen top-left, B-roll center)
- Medium density
- Cute, warm visuals
- All text in English
Iterating on one slot
Slot 3 should show a Swift logo instead of the Apple emoji. Re-render slot 3 and re-apply.
The agent regenerates only that slot and re-applies its DSL window. Everything else stays.
Putting It Together
# 1. Start Claude Code in your project folder
claude

# 2. Transcription cut
"Open ~/Desktop/Recording.skbundle. Transcribe and plan cuts. ElevenLabs key at ~/.config/env/elevenlabs.env"
# → review cut list → "yes"

# 3. B-roll
"Add B-roll — centered layout, medium density, cute English visuals"
# → review 7-slot beat menu → "Medium, looks good"
# → agent generates in parallel, renders serially, applies DSL (~3 min)

# 4. Spot-check
"Show me slot 4 at 18s"
# → scrub in ScreenKite

# 5. Tweak if needed
"Slot 4 — change the node diagram to use mint green for all nodes"
Total hands-on time: under 5 minutes. Render time: ~2–3 minutes for 7 slots.
For more workflows, sample prompts, and community skills: github.com/ScreenKite/awesome-ai-video-editing