ScreenKite | Guides
    • Install ScreenKite
    • System Requirements
    • Set Up Permissions
    • Create a New Recording
    • Record an Entire Display
    • Record a Window
    • Record an Area
    • Camera & Microphone
    • System Audio
    • Record an iOS Device
    • Keyboard Shortcuts
    • Automatic Zoom
    • Configure Zoom Options
    • Project Editor Overview
    • Timeline & Tracks
    • Trim & Split
    • Appearance Customization
    • Device Frames
    • Agentic Video Editing
    • Word-Level Generated Captions
    • Export Settings
    • FAQ
    • Permissions & Access
    ← ScreenKite Home
    Guides / Editing

    Word-Level Generated Captions


    Generated captions in ScreenKite are word-level. Instead of creating one long subtitle block for a full sentence or clip, ScreenKite creates one caption cue per spoken word. This gives the editor the timing data it needs for short, Screen Studio-style caption reveals and precise agent workflows.

    Before You Generate Captions

    Open Settings -> Transcription and configure the Word-Level tab:

    1. Choose Automatic for the normal setup. ScreenKite uses ElevenLabs when an API key is configured and otherwise falls back to a downloaded WhisperKit model.
    2. Choose ElevenLabs when you want hosted Scribe word timings.
    3. Choose Local when you want on-device WhisperKit word timestamps from a downloaded model.

    OpenAI, Groq, and Azure OpenAI are not used for generated caption timing. They can still be configured under Text & Export for AI cleanup, proofreading, or explicit transcript export workflows.

    ✅ For the most reliable generated captions, record microphone narration as its own track. ScreenKite can also generate captions from replacement or main audio when microphone audio is not available.

    Generate Captions

    1. Open a .skbundle project in the Project Editor.
    2. Make sure the project has microphone, replacement, or main audio.
    3. Use the caption generation action in the editor or ask an agent to generate captions.
    4. ScreenKite transcribes the audio with the configured word-level provider.
    5. ScreenKite imports an SRT where each cue maps to one spoken word.

    The result is a caption track made of short, word-timed clips instead of sentence-length chunks. If the provider detects no speech, ScreenKite reports that no speech was detected. If the provider returns only sentence segments without word timestamps, caption generation stops rather than producing approximate sentence-length captions.

    Agent Workflow

    Agents use the same word-level caption path as the app. A prompt can be as direct as:

    codex "Open ~/Desktop/Recording.skbundle and generate word-level captions from the microphone track"
    

    For transcript cuts, filler-word cleanup, or B-roll planning, the agent can reuse the same word timestamps so cuts and visual beats stay aligned with speech.
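For example, filler-word cleanup reduces to scanning the word cues and collecting the time ranges to cut. A minimal sketch under the same assumed `(text, start_sec, end_sec)` tuple shape; the filler list and function name are illustrative:

```python
FILLERS = {"um", "uh", "like"}

def filler_cut_ranges(words, fillers=FILLERS):
    """Return (start, end) time ranges of filler words an agent could cut,
    using the same word timestamps that drive the captions."""
    return [(start, end)
            for text, start, end in words
            if text.lower().strip(".,!?") in fillers]

words = [("So", 0.0, 0.2), ("um,", 0.2, 0.5), ("let's", 0.5, 0.8)]
print(filler_cut_ranges(words))  # -> [(0.2, 0.5)]
```

Because cuts are derived from the same timestamps as the caption cues, the remaining captions stay aligned with the trimmed audio.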

    Timeline Behavior

    Generated captions appear on a Captions track in the timeline. Because every word has its own cue, you can inspect and edit timing at word granularity.

    Use Timeline & Tracks for track navigation basics, and Agentic Video Editing for transcript-driven editing workflows.

    Previous

    ← Agentic Video Editing

    Next

    Export Settings →