What these tools actually do
HeyGen is an AI video generation platform. You provide a script and a digital avatar (either a stock avatar or a custom one trained on your likeness), and it generates a video of the avatar delivering the script. Output: MP4, ready to upload natively to LinkedIn, YouTube, or wherever.
ElevenLabs is an AI voice synthesis and cloning platform. It can generate hyper-realistic speech from text, and it can train a voice model on a few minutes of your own speech so generated audio sounds like you.
Used together, they let you produce a short explanatory video of yourself delivering a script in under an hour — at a fraction of the cost and time of filming, lighting, and editing real footage.
HeyGen — setup and workflow
There are two ways to use HeyGen: with a stock avatar (quick, no setup, less personal) or with a custom avatar trained on your likeness (more setup, more authentic, significantly better results for personal brand content).
Setting up a custom avatar
The requirements: 2–5 minutes of clean video footage of you talking directly to camera. Flat background, natural light (or a ring light), no background noise. You don't need to say anything specific — just speak naturally in varied sentences, including some with natural pauses, pitch changes, and emphasis.
Upload the footage to HeyGen → request avatar training (takes a few hours to process) → receive a notification when the avatar is ready. First render is typically 80–90% convincing. The main tells at first: unusual mouth movements on certain phonemes, occasional blink timing that's slightly off.
The production workflow
- Write the script first. 100–150 words for a 60-second video. Short sentences. No complex compound clauses — the avatar handles those less naturally.
- Paste the script into HeyGen → select your avatar and voice → set the background (plain backgrounds work better than virtual sets for professional content) → generate.
- First render: watch it through once. Check lip sync, unusual pauses, and emphasis. If the delivery is off on a key sentence, regenerate just that line with adjusted emphasis markers.
- Download the MP4 → import into Descript → add captions (Descript does this in about 2 minutes) → export.
Total time per 60-second video with this workflow: 45–60 minutes, including script writing.
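A quick way to sanity-check script length before pasting it into HeyGen is to estimate duration from word count. This is a minimal sketch: the 100 and 150 words-per-minute defaults are an assumption derived from the "100–150 words for a 60-second video" guideline above, and the function names are illustrative, not part of any tool mentioned here.

```python
def estimate_seconds(script: str, words_per_minute: float) -> float:
    """Estimate spoken duration in seconds from word count at a given pace."""
    words = len(script.split())
    return words * 60 / words_per_minute


def duration_range(script: str, slow_wpm: float = 100,
                   fast_wpm: float = 150) -> tuple[float, float]:
    """Return (shortest, longest) estimated duration in seconds.

    The default paces are assumptions taken from the 100-150 word guideline.
    """
    return estimate_seconds(script, fast_wpm), estimate_seconds(script, slow_wpm)


# Placeholder script: 4 words repeated 30 times = 120 words.
script = "Short sentences work best. " * 30
shortest, longest = duration_range(script)
print(f"{len(script.split())} words: roughly {shortest:.0f}-{longest:.0f} seconds")
```

Actual avatar pacing varies with the voice and punctuation, so treat the result as a rough range rather than a prediction.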
ElevenLabs — voice cloning
ElevenLabs produces more natural voice output than HeyGen's built-in voices — particularly for conversational content and longer-form narration. I use it in two ways:
As a voice layer for HeyGen
Generate the audio in ElevenLabs → download the MP3 → import into HeyGen as a custom voice track instead of using the platform's synthesis. This gives more natural delivery at the cost of an extra production step.
For audio-only content
Short audio clips for social posts, podcast-style narration for blog content, or voiceover for screen recordings. ElevenLabs produces natural-sounding narration from text that can replace or supplement real recording for most non-personal content.
Training the voice clone
Upload 3–5 minutes of clean speech audio. Varied sentence structures, natural delivery, no background noise. The resulting model handles new scripts with accurate inflection within a day or two of training. Edge cases where it struggles: unusual proper nouns, very long compound sentences, acronyms it hasn't seen pronounced.
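Two of those edge cases can be caught with a rough pre-check before generating audio. The sketch below flags acronym-like tokens and long sentences so you know where to listen closely. The 20-word threshold and the function name are my own assumptions, and it does not detect unusual proper nouns, which still need a manual read.

```python
import re

MAX_WORDS_PER_SENTENCE = 20  # assumed threshold, adjust to taste


def flag_risky_spots(script: str,
                     max_words: int = MAX_WORDS_PER_SENTENCE) -> dict[str, list[str]]:
    """Flag acronym-like tokens and long sentences for a manual listen."""
    # A capital letter followed by one or more capitals or digits, e.g. ROI, B2B.
    acronyms = sorted(set(re.findall(r"\b[A-Z][A-Z0-9]+\b", script)))
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    long_sentences = [s for s in sentences if len(s.split()) > max_words]
    return {"acronyms": acronyms, "long_sentences": long_sentences}


sample = "Our ROI doubled after the CRM migration. B2B buyers watch short videos."
report = flag_risky_spots(sample, max_words=5)
print(report["acronyms"])
print(report["long_sentences"])
```

For flagged acronyms, spelling them out phonetically in the script is one common workaround, though whether it helps depends on the voice model.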
The quality ceiling (and when it matters)
AI-generated video is visibly AI-generated if you look closely — particularly in close-up shots, emotional content, and anything that requires natural micro-expressions. The technology is good enough to be convincing in most B2B contexts; it is not good enough to be indistinguishable from filmed footage.
When the quality ceiling matters:
- Testimonials and personal stories — authentic human footage is significantly more credible. Don't use AI avatars for content where emotional authenticity is the entire point.
- Client-facing communications — video messages to specific clients should be real. The AI tell is too obvious at conversational distance.
- High-stakes brand moments — product launches, major announcements — worth investing in real production.
When the quality ceiling doesn't matter:
- Educational explainers — people are watching for the content, not the presenter.
- Framework walkthroughs — the slide or screen is the focus; the narrator voice is secondary.
- Top-of-funnel brand content where the goal is reach, not emotional connection.
- LinkedIn posts where the video is 60–90 seconds and watched without sound (captions carry the weight).
Best B2B use cases
The formats where AI video has the highest ROI in B2B marketing:
- Weekly "one insight" LinkedIn video: 60–90 seconds, one specific insight, produced Monday, posted Tuesday. Builds a consistent video presence without the overhead of weekly filming.
- Lead magnet explainers: the 90-second video on the thank-you page after a resource download, introducing the next step. This is where we saw the highest per-view engagement rate at Excelerate.
- Case study narration: a 2–3 minute walkthrough of a case study, with the avatar narrating while screen capture or slides show the results. Easier to consume than a written case study for some audiences.
- FAQ videos: one question per video, 30–45 seconds each, formatted as a short playlist. Excellent for reducing pre-sales objections.
- Email nurture sequence videos: short personal-style video embedded in email sequences. Higher click-to-open rates than text-only emails, even when the avatar is visibly AI.
Common mistakes
- Scripts too long: anything over 90 seconds loses attention on social. If you have more to say, break it into two videos.
- No captions: 70–80% of LinkedIn video is watched without sound. Captions are not optional. Descript generates them in under 2 minutes.
- Uploading to YouTube then sharing the link: LinkedIn suppresses external video links. Upload natively. Always.
- Using stock avatars for personal brand content: if the content is meant to represent you, use your custom avatar. Stock avatars are fine for company-branded content.
- Skipping the edit pass: always watch the full render before downloading. Lip sync errors, awkward pauses, and mispronunciations happen. They're quick to fix before you've distributed the video.
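For the first mistake in this list, splitting an overlong script can be done mechanically at sentence boundaries. This is a minimal sketch under stated assumptions: the 150-word default corresponds to about 90 seconds at a slow pace of 100 words per minute, and `split_script` is a hypothetical helper, not a feature of HeyGen or Descript.

```python
import re


def split_script(script: str, max_words: int = 150) -> list[str]:
    """Greedily pack whole sentences into parts of at most max_words words.

    A single sentence longer than max_words is kept whole as its own part.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    parts: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new part when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            parts.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        parts.append(" ".join(current))
    return parts


demo = "One two three. Four five six. Seven eight."
for i, part in enumerate(split_script(demo, max_words=6), start=1):
    print(f"Video {i}: {part}")
```

A mechanical split only finds the word boundary. Each part still needs its own opening hook and closing line to work as a standalone video.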
My production workflow
Here is the exact workflow I use to produce a batch of 4 videos in a single session (roughly 3 hours total):
- Write all 4 scripts in one sitting. 100–150 words each. Store in Notion.
- Paste all scripts into HeyGen at once. Generate all 4 simultaneously (HeyGen processes in the background).
- While generating: build the Canva thumbnail for each video (about 5 minutes each, working from a template).
- Download all 4 MP4s when ready.
- Import into Descript in batch — auto-generate captions for all, clean up any errors, export.
- Schedule all 4 in Typefully with the thumbnails and LinkedIn post copy.
Batch production cuts the per-video overhead significantly. Context switching is the real time cost — batching eliminates it.
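To put a number on that, here is the arithmetic using only the figures quoted in this post: 45–60 minutes per video produced one at a time, and roughly 3 hours for a batch of 4. Treating "roughly 3 hours" as exactly 180 minutes is an assumption, so read the output as an estimate.

```python
VIDEOS = 4
SINGLE_VIDEO_MINUTES = (45, 60)  # per-video range for the one-at-a-time workflow
BATCH_SESSION_MINUTES = 180      # "roughly 3 hours total", assumed exact here

# Total time if each video were produced separately (low, high).
sequential = tuple(VIDEOS * m for m in SINGLE_VIDEO_MINUTES)
# Average time per video inside the batch session.
per_video_batched = BATCH_SESSION_MINUTES / VIDEOS
# Minutes saved per batch compared with the low and high sequential totals.
saved = tuple(total - BATCH_SESSION_MINUTES for total in sequential)

print(f"Sequential: {sequential[0]}-{sequential[1]} min")
print(f"Batched: {BATCH_SESSION_MINUTES} min ({per_video_batched:.0f} min per video)")
print(f"Saved per batch: {saved[0]}-{saved[1]} min")
```

On these numbers the batch lands at the fast end of the single-video range every time, which fits the point about context switching: the saving comes from not restarting the setup for each video.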