How to Make a 60-Second AI Product Demo in 2026
Step-by-step guide to producing a 60-second AI product demo: script, scene breakdown, model picks, lipsync, captions, voiceover, final cut.
The 60-second AI product demo is the highest-converting short-form asset most SaaS, ecommerce and physical-product brands can produce in 2026. It's long enough to demonstrate the value, short enough to play in feed without losing retention, and the AI-generation toolchain has matured to the point where you can produce a credible one in an afternoon for under $5 in model costs. This guide walks through the full workflow on Versely: script, scene breakdown, model selection per shot, lipsync, captions, voiceover and the final cut.
We'll use a worked example throughout: a fictional SaaS app called "Lumen" that helps freelancers track invoices. The script, scene breakdown and prompt list are reproducible — swap in your own product and the structure carries.
A 60-second AI demo done right beats a 30-minute live demo for top-of-funnel.
The 60-second demo structure
Sixty seconds gives you six scenes averaging 10 seconds each, or eight scenes of 7-8 seconds. The structural template that converts:
- Scene 1 (0-8s) — Hook. The pain or the promise. Face on screen, direct address.
- Scene 2 (8-18s) — The problem in concrete terms. Show the friction.
- Scene 3 (18-30s) — Introduce the product. First clean shot of the UI or the physical item.
- Scene 4 (30-42s) — Demonstration. The single most useful action the product enables.
- Scene 5 (42-52s) — Outcome. The after-state. What the user gets.
- Scene 6 (52-60s) — Call to action. Direct ask, URL on screen, voice match to text.
Sixty seconds total, six scenes, single voice running through the whole thing for cohesion. Captions on every scene because most viewers watch silent.
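The six-scene template can be written down as data and checked before any generation starts. The following is a minimal sketch, not a Versely feature: the `Scene` class, the role labels and the `validate` helper are hypothetical names, and only the timings come from the template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    number: int
    start_s: int  # start time in seconds
    end_s: int    # end time in seconds
    role: str     # hypothetical label for the scene's job in the structure

    @property
    def duration_s(self) -> int:
        return self.end_s - self.start_s


# Timings copied from the six-scene template.
TEMPLATE = [
    Scene(1, 0, 8, "hook"),
    Scene(2, 8, 18, "problem"),
    Scene(3, 18, 30, "product"),
    Scene(4, 30, 42, "demonstration"),
    Scene(5, 42, 52, "outcome"),
    Scene(6, 52, 60, "call to action"),
]


def validate(scenes: list[Scene], total_s: int = 60) -> None:
    """Raise ValueError if scenes leave gaps, overlap, or miss the target runtime."""
    cursor = 0
    for scene in scenes:
        if scene.start_s != cursor:
            raise ValueError(
                f"scene {scene.number} starts at {scene.start_s}s, expected {cursor}s"
            )
        if scene.duration_s <= 0:
            raise ValueError(f"scene {scene.number} has no duration")
        cursor = scene.end_s
    if cursor != total_s:
        raise ValueError(f"timeline ends at {cursor}s, expected {total_s}s")


validate(TEMPLATE)
print([scene.duration_s for scene in TEMPLATE])  # [8, 10, 12, 12, 10, 8]
```

Running the check again after every script edit catches a scene that has drifted past its slot before you pay for footage.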
Step 1: Write the script
Script first, always. Write the voiceover before you generate a single frame. The script controls scene length, controls retention and controls the call to action.
Sample script for the Lumen example, 60 seconds total:
(0-8s) "If you're a freelancer chasing invoices, you already know — the work is done in an hour, the chasing takes weeks."
(8-18s) "Spreadsheets, sticky notes, awkward follow-up emails — it's a tax on your time you didn't sign up to pay."
(18-30s) "Lumen is one app for every invoice you send. Track status, automate reminders, get paid faster."
(30-42s) "Send an invoice in twenty seconds. Lumen reminds your client at day three, day seven, day fourteen — automatically, in your voice."
(42-52s) "Freelancers using Lumen get paid on average eleven days faster. That's eleven days of cashflow back in your pocket."
(52-60s) "Try Lumen free for fourteen days at lumenapp.com. Stop chasing. Start collecting."
Roughly 145 words. At a natural narration pace of 145-160 words per minute, that runs about 54-60 seconds. Time it once with your own voice or an Inworld TTS-2 dry run before you commit to scene generation.
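A rough per-scene timing check is easy to script before a dry run. This is a sketch under stated assumptions: the helper names are illustrative, the 150 words-per-minute pace is an assumed midpoint rather than a measured value, and pauses between scenes are ignored.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter or digit."""
    return sum(1 for token in text.split() if any(ch.isalnum() for ch in token))


def narration_seconds(word_count: int, words_per_minute: float) -> float:
    """Estimated speaking time in seconds, ignoring pauses between scenes."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count * 60 / words_per_minute


# Scene 1 hook line from the sample script.
hook = (
    "If you're a freelancer chasing invoices, you already know, "
    "the work is done in an hour, the chasing takes weeks."
)
words = count_words(hook)
print(words, narration_seconds(words, 150))  # 20 8.0
```

At the assumed pace the 20-word hook fills its 8-second slot exactly, so there is no room to add words to that scene without speeding up the read.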
Step 2: Break down the scenes
Each scene gets a one-line visual description, an aspect ratio, a model pick and a duration target. Vertical-native 9:16 throughout for Reels, TikTok and Shorts.
| Scene | Duration | Visual | Model | Aspect |
|---|---|---|---|---|
| 1 | 8s | Freelancer at laptop, frustrated, checking phone for payments | VEO 3.1 (with dialogue) | 9:16 |
| 2 | 10s | Cluttered desk: spreadsheets on screen, sticky notes, late-night lighting | Kling 3.0 I2V | 9:16 |
| 3 | 12s | Clean shot of the Lumen app UI on phone, smooth UI animation | Kling 3.0 I2V from screenshot | 9:16 |
| 4 | 12s | Hand creating invoice in app, send animation, notification confirmation | Kling 3.0 I2V from screenshots | 9:16 |
| 5 | 10s | Same freelancer from scene 1 but relaxed, money landing in account on phone | VEO 3.1 (with dialogue) | 9:16 |
| 6 | 8s | Clean end card with logo, URL, free trial CTA, voiceover sign-off | Kling 3.0 T2V or static frame with motion | 9:16 |
Total runtime: 60 seconds. Two scenes use VEO 3.1 for native dialogue and lipsync (the bookend human moments). Four scenes use Kling 3.0 for cost-efficient product and B-roll content.
Step 3: Pick the right model per scene
The routing logic is the load-bearing decision in 2026 AI video work. Using premium models on every shot is wasteful. Using cheap models on dialogue shots produces ugly lipsync that kills conversion.
For this demo:
- Scenes 1 and 5 (human, talking) — VEO 3.1. Native audio co-generation produces lipsync that reads as human. At ~$0.12/s for 18 seconds total, this costs around $2.16.
- Scenes 2, 3, 4 (product and environment B-roll) — Kling 3.0 image-to-video. Vertical-native, cheap, motion quality is more than sufficient for product demonstration. At ~$0.032/s for 34 seconds total, around $1.10.
- Scene 6 (end card) — Kling 3.0 text-to-video at ~$0.028/s for 8 seconds, around $0.22.
Total model cost across all six scenes: roughly $3.50. For more on routing per shot see our VEO 3.1 vs Kling 3.0 comparison and the Sora 2 vs Kling 3.0 breakdown.
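The cost arithmetic above can be reproduced in a few lines, which makes it easy to re-run when a scene length or rate changes. This is a sketch: the model keys are hypothetical labels rather than real API identifiers, and the per-second rates are the approximate figures quoted in this guide.

```python
# Approximate per-second rates quoted in this guide (USD).
RATE_PER_SECOND = {
    "veo-3.1": 0.12,
    "kling-3.0-i2v": 0.032,
    "kling-3.0-t2v": 0.028,
}

# (scene number, duration in seconds, model key) from the scene table.
SCENES = [
    (1, 8, "veo-3.1"),
    (2, 10, "kling-3.0-i2v"),
    (3, 12, "kling-3.0-i2v"),
    (4, 12, "kling-3.0-i2v"),
    (5, 10, "veo-3.1"),
    (6, 8, "kling-3.0-t2v"),
]


def cost_by_model(scenes, rates):
    """Sum generation cost per model for (number, seconds, model) tuples."""
    totals = {}
    for _number, seconds, model in scenes:
        totals[model] = totals.get(model, 0.0) + seconds * rates[model]
    return totals


totals = cost_by_model(SCENES, RATE_PER_SECOND)
for model, usd in totals.items():
    print(f"{model}: ${usd:.2f}")
print(f"total: ${sum(totals.values()):.2f}")
```

With these inputs the total comes to about $3.47, which the guide rounds to $3.50. Failed takes and regenerations are not included.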
Routing per scene: VEO for talking, Kling for product, mix in the timeline.
Step 4: Write the prompts per scene
Prompts make or break the generation. Vague prompts produce generic output. Specific prompts with camera language, lighting, mood and action produce footage you can actually use.
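One way to keep prompts specific is to assemble them from named fields, so a missing camera or lighting detail fails loudly instead of producing a vague prompt. The sketch below is illustrative only: `ShotPrompt` and its field names are hypothetical, and the example values are adapted from the Scene 1 prompt in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShotPrompt:
    framing: str                    # camera language, e.g. "Medium close-up"
    subject: str
    lighting: str
    action: str
    dialogue: Optional[str] = None  # only for native-dialogue scenes
    style: str = ""
    aspect: str = "9:16 vertical"

    def __post_init__(self) -> None:
        # Refuse to build a prompt that is missing one of the core details.
        for name in ("framing", "subject", "lighting", "action"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def render(self) -> str:
        parts = [
            f"{self.framing} of {self.subject}.",
            f"{self.lighting}.",
            f"{self.action}.",
        ]
        if self.dialogue:
            parts.append(f"Says directly to camera: '{self.dialogue}'")
        if self.style:
            parts.append(f"{self.style}.")
        parts.append(f"{self.aspect}.")
        return " ".join(parts)


scene_1 = ShotPrompt(
    framing="Medium close-up",
    subject="a 30-year-old female freelancer sitting at a laptop in a small home office",
    lighting="Late afternoon natural light from a window to her left",
    action="She glances at her phone with a frustrated expression, then back at the screen",
    dialogue="The work is done in an hour, the chasing takes weeks.",
    style="Subtle handheld camera feel",
)
print(scene_1.render())
```

For B-roll scenes, leave `dialogue` unset and the spoken line is omitted from the rendered prompt.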
Sample prompt list:
Scene 1 (VEO 3.1, 8s, 9:16):
"Medium close-up of a 30-year-old female freelancer sitting at a laptop in a small home office. Late afternoon natural light from a window to her left. She glances at her phone with a frustrated expression, then back at the screen. Says directly to camera: 'If you're a freelancer chasing invoices, you already know — the work is done in an hour, the chasing takes weeks.' Subtle handheld camera feel. 9:16 vertical."
Scene 2 (Kling 3.0 I2V, 10s, 9:16):
Input: a high-detail still of a cluttered freelancer desk — open spreadsheet on monitor, paper invoices, sticky notes saying "follow up" and "overdue", coffee mug, late-evening warm lamp light. Motion prompt: "Slow push-in on the cluttered desk. Camera drifts from the spreadsheet to the sticky notes to the unanswered email window. Warm lamp glow. Quiet, frustrated mood."
Scene 3 (Kling 3.0 I2V, 12s, 9:16):
Input: a high-fidelity screenshot of the Lumen app dashboard on an iPhone 15 Pro, dark mode, clean UI with three invoice cards visible. Motion prompt: "Phone held in portrait orientation by a single hand. Subtle hand motion. UI animates: invoice cards slide in from the right, status indicators pulse from yellow to green. Clean modern aesthetic."
Scene 4 (Kling 3.0 I2V, 12s, 9:16):
Input: screenshot of the invoice creation flow, three sequential frames. Motion prompt: "First-person view of a hand using the phone. Tap to create invoice, fields auto-fill, hit send. Notification appears: 'Invoice sent. Auto-reminder scheduled.' Smooth UI motion."
Scene 5 (VEO 3.1, 10s, 9:16):
"Same freelancer from scene 1, now at the same desk but relaxed. Bright morning light. She glances at her phone, sees a payment notification, smiles slightly. Says to camera: 'Freelancers using Lumen get paid on average eleven days faster. That's eleven days of cashflow back in your pocket.' Natural, conversational delivery."
Scene 6 (Kling 3.0 T2V, 8s, 9:16):
"Clean white end card. The Lumen logo (geometric L mark, navy blue) animates into center frame. Below the logo: 'lumenapp.com' and 'Free 14-day trial' in clean sans-serif type. Subtle parallax motion. Bright, modern, software-brand aesthetic."
Step 5: Voiceover and voice cloning
The dialogue audio for scenes 1 and 5 is co-generated by VEO 3.1 because that's the native dialogue path. For scenes 2, 3, 4 and 6 the voiceover runs over silent generated footage and needs to be produced separately.
Two voiceover paths in 2026:
- ElevenLabs v3 (GA March 14, 2026) — premium TTS quality with natural prosody. Around $0.30 per 1000 characters.
- Inworld TTS-2 (released May 5, 2026) — newer, cheaper, strong on conversational delivery. Worth testing for cost-sensitive production.
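For budgeting, per-character TTS pricing reduces to simple arithmetic. The sketch below uses the approximate $0.30 per 1000 characters quoted above as its default; the function name is illustrative, and real billing may count characters differently (for example, by including markup).

```python
def tts_cost_usd(text: str, usd_per_1k_chars: float = 0.30) -> float:
    """Estimated synthesis cost for `text` at a per-1000-character rate."""
    if usd_per_1k_chars < 0:
        raise ValueError("rate must not be negative")
    return len(text) * usd_per_1k_chars / 1000


# Closing line of the sample script.
cta = "Try Lumen free for fourteen days at lumenapp.com. Stop chasing. Start collecting."
print(f"{len(cta)} characters -> ${tts_cost_usd(cta):.3f}")
```

A 60-second script is short enough that voiceover cost stays well below the video generation cost, so retakes for pacing are cheap.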
For brand work where the spokesperson has an established voice, voice cloning gives you a single matched voice across all six scenes — even the VEO 3.1 dialogue scenes can be regenerated with the cloned voice via lipsync if the native VEO voice doesn't match the brand.
The cleanest 2026 workflow: clone the spokesperson voice once, generate the full 60-second voiceover as a single track with ElevenLabs v3 or Inworld TTS-2, then run AI lipsync against that track on the talking scenes (1 and 5) and lay it over the remaining scenes as narration. This keeps the voice consistent across the entire piece.
One cloned voice across all six scenes is the brand-consistent path.
Step 6: Lipsync the dialogue scenes
For scenes 1 and 5, you have two options:
- Path A (faster): Use VEO 3.1's native co-generated audio. The lipsync is excellent and you save the lipsync model call. Voice is the VEO-generated voice, which may or may not match your brand identity.
- Path B (brand-consistent): Generate scenes 1 and 5 silently on VEO 3.1, then run them through AI lipsync with your cloned spokesperson voice. Quality of the lipsync is very high in 2026 — sub-frame timing accuracy and natural mouth shape transitions.
For one-off demos, Path A is fine. For brand campaigns where the same spokesperson voice carries across multiple assets, Path B is worth the extra step.
Step 7: Captions on every scene
Roughly 80 percent of social viewers watch with sound off. Captions are not optional.
Generation pattern:
- Export the final 60-second voiceover track.
- Run it through Versely's AI auto-caption generator for word-by-word timing.
- Style: sans-serif bold, 38pt, white text with thin black outline, centered horizontally in the upper-middle band of the frame.
- Keep one or two words on screen at a time for maximum legibility on small phone screens.
- Avoid the lower 250 pixels — Instagram, TikTok and YouTube Shorts overlay UI there.
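If you need captions outside the platform, word-level timings can be grouped into one- or two-word cues and written out as SRT. This is a sketch under assumptions: the `(text, start_s, end_s)` tuple format is hypothetical (not a documented Versely export), and the timings in the example are made up for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunk_words(words, max_words=2):
    """Group (text, start_s, end_s) tuples into cues of at most `max_words` words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(word[0] for word in group)
        cues.append((text, group[0][1], group[-1][2]))
    return cues


def to_srt(cues) -> str:
    """Render (text, start_s, end_s) cues as an SRT document."""
    blocks = []
    for index, (text, start, end) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


# Illustrative timings for the last four words of the script.
words = [
    ("Stop", 56.0, 56.4),
    ("chasing.", 56.4, 57.0),
    ("Start", 57.2, 57.6),
    ("collecting.", 57.6, 58.4),
]
print(to_srt(chunk_words(words)))
```

Chunking by a fixed word count is the simplest rule; splitting on punctuation as well would keep phrases like "Stop chasing." together when the count does not line up.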
Step 8: Final cut and assembly
Assemble in Versely's movie maker timeline:
- Drop scenes 1 through 6 in order on the video track.
- Drop the unified voiceover track on the audio layer, locked to the scene timing.
- Add a soft music bed at -20dB underneath — custom Suno v5.5 generation or a licensed track.
- Add transitions sparingly: a half-second cross-fade between scenes 2 and 3 (problem to solution), hard cuts everywhere else.
- Layer captions on the text track, timed to the voiceover.
- Color-grade across all six scenes for visual consistency — Kling 3.0 and VEO 3.1 default to slightly different color science, so a unifying grade across the timeline pulls them together.
- Export at 1080x1920, H.264, 30fps, target file size under 100MB for clean upload to Reels and TikTok.
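Two numbers in the assembly steps are worth sanity-checking: the bitrate ceiling implied by the 100MB target, and what -20dB means as a volume multiplier. The sketch below assumes 1 MB is 1,000,000 bytes, ignores container overhead, and treats -20dB as a relative gain on the music track; the function names are illustrative.

```python
def max_total_bitrate_kbps(size_limit_mb: float, duration_s: float) -> float:
    """Highest average bitrate (video plus audio, kbit/s) that stays under the limit."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return size_limit_mb * 1_000_000 * 8 / duration_s / 1000


def db_to_gain(db: float) -> float:
    """Convert a decibel change to a linear amplitude multiplier."""
    return 10 ** (db / 20)


print(f"{max_total_bitrate_kbps(100, 60):.0f} kbit/s")  # 13333 kbit/s
print(f"{db_to_gain(-20):.2f}x amplitude")              # 0.10x amplitude
```

Roughly 13 Mbit/s is a generous ceiling for 1080x1920 at 30fps, so the 100MB target should rarely be the binding constraint for a 60-second export.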
Total production time on a first pass: 3-5 hours for someone familiar with the toolchain. Total model cost: under $5. Total per-asset cost including time at $50/hr: $155-$255 versus $5,000-$15,000 for a comparable live-shot production.
60 seconds, six scenes, under $5 in model costs — a credible 2026 production stack.
FAQ
Can I do this entire workflow on Versely without other tools?
Yes. VEO 3.1 and Kling 3.0 generation, voice cloning, lipsync, auto-captioning, custom music and timeline assembly are all on Versely as of mid-2026. The full 60-second demo lives inside one platform.
Do I need a separate voice cloning step if I use VEO 3.1's native audio?
Only if you need the same brand voice across both VEO scenes and the Kling B-roll voiceover scenes. For a one-off demo where consistency across pieces doesn't matter, VEO's native audio is fine. For ongoing brand work with a recognizable voice, clone once and reuse.
Why not generate the entire demo on VEO 3.1?
Cost. VEO 3.1 is roughly 4x more expensive per second than Kling 3.0. For B-roll and product UI shots where you don't need native audio, Kling 3.0 produces equivalent visual quality at a fraction of the price. Reserve VEO 3.1 for the dialogue scenes specifically.
Can I use Sora 2 instead of VEO 3.1 for the dialogue scenes?
You can but you shouldn't for product demos. Sora 2 produces silent video, so you'd add a separate lipsync pass and lose the native audio quality advantage. VEO 3.1 is the correct dialogue model in 2026. See the Sora 2 vs VEO 3.1 comparison for the deeper breakdown.
How do I scale this to multiple product demos?
Templatize the script structure (six scenes, hook-problem-product-demo-outcome-CTA), templatize the prompt list, swap in the product specifics. With the structural work done, you can produce 5-10 demos per week on a single product line. For volume creator work see our AI content creation playbook.
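Templatizing can be as simple as lifting the product specifics out of each script line into placeholders. The sketch below uses Python's standard `string.Template` on two lines from the Lumen script; the placeholder names and the `fill` helper are hypothetical, and a real template would cover all six scenes.

```python
from string import Template

# Two lines from the Lumen script with product specifics replaced by placeholders.
SCRIPT_TEMPLATES = {
    "product": Template(
        "$product is one app for every $unit you send. "
        "Track status, automate reminders, get paid faster."
    ),
    "cta": Template(
        "Try $product free for $trial days at $url. Stop chasing. Start collecting."
    ),
}


def fill(templates, **values):
    """Render every template; substitute() raises KeyError if a placeholder is missing."""
    return {role: template.substitute(**values) for role, template in templates.items()}


lumen = fill(
    SCRIPT_TEMPLATES,
    product="Lumen",
    unit="invoice",
    trial="fourteen",
    url="lumenapp.com",
)
print(lumen["cta"])
# Try Lumen free for fourteen days at lumenapp.com. Stop chasing. Start collecting.
```

Using `substitute` rather than `safe_substitute` is deliberate here: a forgotten value stops the run instead of shipping a script with a literal `$url` in it.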
Closing takeaway
A 60-second AI product demo in 2026 is a 3-5 hour production at under $5 in model costs that genuinely competes with traditional live-shot demos for top-of-funnel social conversion. The structure is six scenes routed across VEO 3.1 for dialogue and Kling 3.0 for product B-roll, unified by a single voice clone, captioned end-to-end and assembled in one timeline. The workflow is repeatable, the cost is decisive and the conversion lift on social over a static product image is meaningful. Build your first demo on Versely's AI movie maker and route the dialogue work through the AI video generator with VEO 3.1 selected.