thought-forge-ai

An experiment in generating 30-60 second "deep thought" TikTok-style videos, each including a spoken monologue, moving video scenes, music, and subtitles.

Examples: (make sure to turn on audio, GitHub mutes by default)

Why.Being.Bored.Might.Be.Your.Secret.Superpower.mp4
The.1.Change.That.Can.Transform.Your.Life.mp4
Why.Being.Weak.Is.Actually.Your.Greatest.Strength.sm.mp4

Some more examples are in ./data/examples.

Background / Motivation

I've recently seen a fair amount of "philosophical" and story-telling content on social media. Common traits are a calm voice, soothing music, calm pictures, and a discussion of a topic such as self-improvement, love, or the world and our place in it. The algorithms have noticed that I sometimes enjoy this content.

Some of this content clearly has AI-generated elements, for example reusing the same voice or showing images that are slightly off. So I wondered how hard it would be to create similar content in a 100% automated manner.

Turns out, it's pretty easy. I also think it's pretty hilarious to create thoughtful content about human struggles in modern life fully with AI.

Quality and Future Work

Sometimes the quality is surprisingly good, in the topic and structure as well as in the video. Mostly it is somewhat mediocre.

I think the output could be improved a bit with better prompts, and significantly with more cherry-picking or with (human) source material, or at least human inspiration. It currently also doesn't generate first-person personal stories, which I might add later.

I've also noticed that the LLM is more deterministic than I thought. Even with temperature 1 it kept generating the same or similar topics, so I had to add the previous topics to the context. The same happens with the monologues: their content is often too similar, so they would probably also need the previous texts as context to produce more varied output.
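The workaround of feeding earlier topics back into the context can be sketched roughly as follows. This is only an illustration: `buildTopicPrompt` and the prompt wording are hypothetical and not taken from step-00-find-topic.ts.

```typescript
// Hypothetical helper (not the project's actual code): builds a topic prompt
// that lists earlier topics so the LLM is nudged away from repeating them.
function buildTopicPrompt(previousTopics: string[], count: number): string {
  const lines: string[] = [
    `Suggest ${count} new topics for short "deep thought" videos.`,
  ];
  if (previousTopics.length > 0) {
    lines.push("Do not repeat or closely paraphrase any of these earlier topics:");
    for (const topic of previousTopics) {
      lines.push(`- ${topic}`);
    }
  }
  return lines.join("\n");
}

// Example: one earlier topic, asking for 40 new ones.
const prompt = buildTopicPrompt(
  ["Why Being Bored Might Be Your Secret Superpower"],
  40,
);
console.log(prompt);
```

The same pattern would presumably apply to the monologues, with earlier texts (or short summaries of them) in place of the topic list.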

Steps and Tools

This project combines several tools with custom-written code:

| Step | Task | Tool | Cost per Video | Free Alternative | Code | Example |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Choose Topics, Voice and Clickbait Title | LLM (Claude 3.5) | <$0.01 [1] | Llama 3.1 | step-00-find-topic.ts | topic.json |
| 1 | Write monologue script | LLM (Claude 3.5) | $0.01 [2] | Llama 3.1 | step-01-write-monologue.ts | monologue.txt |
| 2 | Read monologue | TTS (Elevenlabs) | $0.20 [3] | coqui-ai | step-02-text-to-speech.ts | speech.mp3 |
| 3 | Split monologue into scenes, create image prompts, calculate start+end times | LLM (Claude 3.5) | $0.01 | Llama 3.1 | step-03-text-to-image-prompt.ts | alignments.json |
| 4 | Create starting image for each scene | Text to Image (Flux.1 Pro) | $0.35 [4] | self-hosted Flux.1 Dev | step-04-text-to-image.ts | example.jpg |
| 5 | Create scene video | Image to Video (RunwayML Gen-3 Alpha) | 1x$6.00 or $80/month [5] | ? | step-05-image-to-video.ts | example.mp4 |
| 6 | Create music prompt | LLM (Claude 3.5) | $0.01 | Llama 3.1 | step-06-text-to-music-prompt.ts | music-prompt.txt |
| 7 | Create music from prompt | Text to Audio (MusicGen) | $0.17 [6] | MusicGen self-hosted | step-07-music.ts | music.mp3 |
| 8 | Create subtitles for burn in | None | $0 | - | step-08-subtitles.ts | subtitles.ass |
| 9 | Merge and cut video, normalize loudness and merge audio, burn in subtitles | ffmpeg | $0 | - | step-09-ffmpeg.ts | merged.mp4 |
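Step 8 needs no external tool because an .ass subtitle file is plain text. As a rough sketch of what that involves, the snippet below turns start and end times in seconds into an ASS `Dialogue` line. The function names and the `Default` style name are assumptions for illustration, and the actual step-08-subtitles.ts may work differently.

```typescript
// Hypothetical sketch (not the project's actual code).
// ASS timestamps have the form H:MM:SS.cc, where cc is centiseconds.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

function toAssTimestamp(seconds: number): string {
  const totalCs = Math.round(seconds * 100); // work in whole centiseconds
  const cs = totalCs % 100;
  const totalSec = Math.floor(totalCs / 100);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${h}:${pad2(m)}:${pad2(s)}.${pad2(cs)}`;
}

// Builds one event line; "Default" is an assumed style name that would have
// to be defined in the file's [V4+ Styles] section.
function toDialogueLine(start: number, end: number, text: string): string {
  return `Dialogue: 0,${toAssTimestamp(start)},${toAssTimestamp(end)},Default,,0,0,0,,${text}`;
}

console.log(toDialogueLine(1, 3.25, "Boredom is not the enemy."));
// prints: Dialogue: 0,0:00:01.00,0:00:03.25,Default,,0,0,0,,Boredom is not the enemy.
```

The start and end values would come from the timing data produced in the earlier steps; the sample sentence here is made up.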

Apart from the video generation, the whole process is pretty cheap (about $0.75 per video). The video generation is both the most enticing part and the worst one, in terms of both API and quality. I assume this will change in the coming months.
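For reference, the per-video total can be reproduced by summing the "Cost per Video" column. The sketch below uses the table's figures as-is, treats the "<$0.01" of step 0 as an upper bound of $0.01, and uses the one-off $6.00 price for the video step. The variable and type names are illustrative only.

```typescript
// Per-step costs in USD, copied from the table (steps 8 and 9 are free).
interface StepCost {
  step: number;
  usd: number;
}

const nonVideoCosts: StepCost[] = [
  { step: 0, usd: 0.01 }, // "<$0.01", taken as an upper bound
  { step: 1, usd: 0.01 },
  { step: 2, usd: 0.2 },
  { step: 3, usd: 0.01 },
  { step: 4, usd: 0.35 },
  { step: 6, usd: 0.01 },
  { step: 7, usd: 0.17 },
];

const videoUsd = 6.0; // step 5 at the pay-per-use price

function totalUsd(costs: StepCost[]): number {
  return costs.reduce((sum, c) => sum + c.usd, 0);
}

const withoutVideo = totalUsd(nonVideoCosts);
console.log(withoutVideo.toFixed(2)); // prints: 0.76
console.log((withoutVideo + videoUsd).toFixed(2)); // prints: 6.76
```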

Setup / Running

Install the dependencies using pnpm install. You can get pnpm by running corepack enable; corepack ships with Node.js.

Copy the .env.example file to .env and add all the needed API keys. This is going to be pretty annoying, since it needs five independent accounts, most of them with payment set up.

Generate a list of 40 topics using pnpm generate.

Choose a topic from that list and generate a full video for it by passing its number, for example pnpm generate 12. This will take a few minutes, depending mainly on the speed of the video AI.

Footnotes

  1. 1200 input tokens * 3$/million + 2000 output tokens * 15$/million = $0.03 for 10 videos

  2. 3000 input tokens * 3$/million + 200 output tokens * 15$/million = $0.012

  3. around $20 for 100 minutes worth of tokens

  4. $0.05 per image, ~7 images per video

  5. 100 credits for 10 seconds of video, $1 for 100 credits. For 60 seconds $6. Unlimited generation for $80/month.

  6. on replicate.com around 2min of one A100 80GB at $5/hour makes ~$0.17