Generate Voice, Music, Video, and Images with MiniMax AI: The Multimodal Toolkit Guide
I needed to add AI-generated media to my application: voice narration, background music, and short video clips. I looked at several providers and found MiniMax offers a unified API for all of these—TTS, music, video, and images. The best part? I could use it with pure bash scripts, no Python SDK required.
The Problem: Multiple APIs, Multiple Integrations
Most AI media providers specialize in one thing. You get TTS from one service, music from another, video from a third. Each has its own SDK, authentication, and workflow.
I wanted something simpler. When I found the minimax-multimodal-toolkit skill, I realized MiniMax provides all four modalities through a single API. The toolkit wraps everything in bash scripts using only curl, ffmpeg, and jq.
What MiniMax Offers
| Capability | Entry Point | Key Features |
|---|---|---|
| TTS | scripts/tts/generate_voice.sh | Voice cloning, voice design, multi-segment |
| Music | scripts/music/generate_music.sh | Songs with lyrics, instrumentals |
| Image | scripts/image/generate_image.sh | Text-to-image, character reference |
| Video | scripts/video/generate_video.sh | Text-to-video, image-to-video, templates |
| Long Video | scripts/video/generate_long_video.sh | Multi-scene with crossfade |
API Configuration
MiniMax has two endpoints depending on your account region:

    # China Mainland
    export MINIMAX_API_HOST="https://api.minimaxi.com"

    # or Global
    export MINIMAX_API_HOST="https://api.minimax.io"

    # Your API key (starts with sk-api- or sk-cp-)
    export MINIMAX_API_KEY="sk-api-your-key-here"

Get your key from the MiniMax platform for global accounts or MiniMax China for China accounts.
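
To avoid pointing a key at the wrong region, the two settings can be wrapped in small helpers. This is only a sketch: the function names `minimax_host_for_region` and `minimax_key_looks_valid` are hypothetical and not part of the toolkit, and the prefix check only mirrors the two prefixes mentioned above.

```shell
# Hypothetical helpers; not part of the toolkit.

# Map a region name to the API host listed above.
minimax_host_for_region() {
  case "$1" in
    cn|china) echo "https://api.minimaxi.com" ;;
    global)   echo "https://api.minimax.io" ;;
    *)        echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Return 0 if the key starts with one of the two documented prefixes
# and has at least one character after it.
minimax_key_looks_valid() {
  case "$1" in
    sk-api-?*|sk-cp-?*) return 0 ;;
    *)                  return 1 ;;
  esac
}
```

A typical use would be `export MINIMAX_API_HOST="$(minimax_host_for_region global)"`, followed by a `minimax_key_looks_valid "$MINIMAX_API_KEY"` check before the first API call.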
TTS: Text-to-Speech
The TTS system supports single-voice and multi-voice generation. I tested both approaches.
Single Voice (Default)
For simple narration, the tts command handles everything in one call:

    bash scripts/tts/generate_voice.sh tts "Hello world, this is a test." \
      -o minimax-output/hello.mp3

The default model is speech-2.8-hd, which auto-matches emotion from text. I didn’t need to specify emotion manually.
Multi-Segment Audiobook
For a project with narrator and dialogue, I needed multiple voices. The toolkit uses a segments.json file:

    [
      { "text": "Morning sunlight streamed into the classroom.", "voice_id": "narrator", "emotion": "" },
      { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator", "emotion": "" },
      { "text": "The weather is amazing today!", "voice_id": "tom", "emotion": "happy" },
      { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator", "emotion": "" },
      { "text": "Sure, but I need to drop off my backpack first.", "voice_id": "lisa", "emotion": "" }
    ]

Then generate with the generate command:

    bash scripts/tts/generate_voice.sh generate segments.json \
      -o minimax-output/audiobook.mp3 --crossfade 200

The script generates each segment individually and merges them with 200ms crossfade transitions.
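
Writing the segments file by hand gets tedious for longer scripts. One option is to keep the dialogue as tab-separated lines and convert it with jq. The sketch below assumes a `voice_id<TAB>emotion<TAB>text` input format, which is my own convention, and the helper name `segments_from_tsv` is hypothetical. It only reproduces the three fields shown in the example.

```shell
# Hypothetical helper: read "voice_id<TAB>emotion<TAB>text" lines on stdin
# and print a JSON array in the segments format used by this post. Requires jq.
segments_from_tsv() {
  jq -Rn '
    [ inputs
      | select(length > 0)     # skip blank lines
      | split("\t")
      | { text: .[2], voice_id: .[0], emotion: .[1] }
    ]'
}
```

Piping a dialogue file through it, for example `segments_from_tsv < dialogue.tsv > segments.json`, produces a file that can be passed to the generate command. Because jq does the string encoding, quotes inside the text are escaped correctly.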
Voice Cloning
I cloned my own voice from a 30-second sample:

    bash scripts/tts/generate_voice.sh clone my_voice_sample.mp3 \
      --voice-id my-custom-voice

After cloning, I could use my-custom-voice as a voice_id in any TTS call.
Voice Design
For a project needing a specific voice profile, I used the design feature:

    bash scripts/tts/generate_voice.sh design "A warm female narrator voice, calm and professional" \
      --voice-id narrator-voice

This created a voice matching my description. The API returns a preview audio file automatically.
TTS Models
| Model | Notes |
|---|---|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous generation, manual emotion |
Music Generation
I needed background music for videos. The music API generates both instrumental tracks and songs with lyrics.
Instrumental (For BGM)

    bash scripts/music/generate_music.sh \
      --instrumental \
      --prompt "ambient electronic, atmospheric, calming" \
      --output minimax-output/bgm.mp3 --download

For BGM, always use --instrumental. The toolkit handles the underlying API requirements automatically.
With Lyrics
When I wanted an actual song:

    bash scripts/music/generate_music.sh \
      --lyrics "[verse]\nWalking down the street\n[chorus]\nLa la la la" \
      --prompt "indie folk, acoustic guitar, melancholic" \
      --output minimax-output/song.mp3 --download

The default model is music-2.5.
Image Generation
The image API supports text-to-image (t2i) and image-to-image with character reference (i2i).
Text-to-Image

    bash scripts/image/generate_image.sh \
      --prompt "A cat on a rooftop at sunset, cinematic lighting, warm tones" \
      --aspect-ratio 16:9 \
      -o minimax-output/cat.png

The aspect ratio matters. I matched it to my use case:
| Use Case | Ratio | Resolution |
|---|---|---|
| Avatar/icon | 1:1 | 1024x1024 |
| Desktop wallpaper | 16:9 | 1280x720 |
| Phone wallpaper | 9:16 | 720x1280 |
| Photography | 3:2 | 1248x832 |
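
When the target size is known in pixels rather than as a ratio, a small helper can pick the closest ratio from the table. This is a sketch under the assumption that only the four ratios listed above are of interest; the function name `nearest_ratio` is hypothetical, and the API may well accept other ratios.

```shell
# Hypothetical helper: given a width and height in pixels, print whichever
# of the four ratios from the table is numerically closest.
# Ratios are compared as width/height scaled by 1000 (integer arithmetic).
nearest_ratio() {
  w=$1
  h=$2
  target=$(( w * 1000 / h ))
  best=""
  best_diff=0
  for entry in "1:1 1000" "16:9 1777" "9:16 562" "3:2 1500"; do
    name=${entry% *}      # text before the space, e.g. "16:9"
    value=${entry#* }     # text after the space, e.g. "1777"
    diff=$(( target - value ))
    if [ "$diff" -lt 0 ]; then
      diff=$(( -diff ))
    fi
    if [ -z "$best" ] || [ "$diff" -lt "$best_diff" ]; then
      best=$name
      best_diff=$diff
    fi
  done
  echo "$best"
}
```

The result can be passed straight to the flag, as in `--aspect-ratio "$(nearest_ratio 1920 1080)"`.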
Character Reference (Image-to-Image)
To generate images with a consistent character:

    bash scripts/image/generate_image.sh \
      --mode i2i \
      --prompt "A girl in a library, warm afternoon light, reading a book" \
      --ref-image face.jpg \
      --aspect-ratio 16:9 \
      -o minimax-output/library_scene.png

The --ref-image should be a front-facing portrait for best results. JPG, JPEG, and PNG formats are supported up to 10MB.
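
The format and size limits are easy to check locally before uploading a reference image. The sketch below is a hypothetical pre-flight check, not something the toolkit ships. It assumes "10MB" means 10 * 1024 * 1024 bytes and judges the format by file extension only.

```shell
# Hypothetical pre-flight check for a reference image.
# Assumption: the 10MB limit is read as 10 * 1024 * 1024 bytes.
check_ref_image() {
  file=$1
  max_bytes=$(( 10 * 1024 * 1024 ))

  if [ ! -f "$file" ]; then
    echo "not a file: $file" >&2
    return 1
  fi

  # Compare the extension case-insensitively.
  ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg|png) ;;
    *) echo "unsupported format: .$ext" >&2; return 1 ;;
  esac

  # Arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -gt "$max_bytes" ]; then
    echo "file too large: $size bytes" >&2
    return 1
  fi
}
```

Running `check_ref_image face.jpg || exit 1` before the i2i call stops the script early instead of waiting for the API to reject the upload.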
Video Generation
This is where MiniMax really shines. The video API supports text-to-video, image-to-video, start-end frame interpolation, and subject reference.
Text-to-Video
The default is 10 seconds at 768P:

    bash scripts/video/generate_video.sh \
      --mode t2v \
      --prompt "A golden retriever puppy runs toward camera, [跟随] tracking shot, golden hour lighting" \
      --output minimax-output/puppy.mp4

I learned the prompt formula matters: Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere.
Camera instructions are Chinese keywords wrapped in square brackets:
| Instruction | Meaning |
|---|---|
| [推进] | Push in |
| [拉远] | Pull out |
| [跟随] | Tracking shot |
| [固定] | Fixed camera |
| [左摇] | Pan left |
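
Typing the Chinese keywords by hand is error-prone, so the prompt formula and the camera table can be combined in a small builder. This is a sketch only: `camera_tag` and `build_video_prompt` are hypothetical names, the English move names are my own labels, and the comma-separated layout is one way to apply the formula rather than a required format.

```shell
# Hypothetical helper: translate an English camera move name into the
# bracketed keyword from the table above.
camera_tag() {
  case "$1" in
    push-in)  echo "[推进]" ;;
    pull-out) echo "[拉远]" ;;
    tracking) echo "[跟随]" ;;
    fixed)    echo "[固定]" ;;
    pan-left) echo "[左摇]" ;;
    *)        echo "unknown camera move: $1" >&2; return 1 ;;
  esac
}

# Assemble a prompt in the order:
# subject, scene, movement, camera motion, aesthetic atmosphere.
build_video_prompt() {
  tag=$(camera_tag "$4") || return 1
  printf '%s, %s, %s, %s, %s\n' "$1" "$2" "$3" "$tag" "$5"
}
```

For example, `build_video_prompt "A golden retriever puppy" "in a sunlit park" "runs toward camera" tracking "golden hour lighting"` yields a single prompt string that can be passed to `--prompt`.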
Image-to-Video
To animate a static image:

    bash scripts/video/generate_video.sh \
      --mode i2v \
      --prompt "Gentle ripples spread across water, soft breeze, [固定] fixed camera" \
      --first-frame photo.jpg \
      --output minimax-output/animated.mp4

For i2v mode, the prompt should focus on movement only: the image already provides the visual.
Subject Reference (Face Consistency)
To keep the same person across videos:

    bash scripts/video/generate_video.sh \
      --mode ref \
      --prompt "A woman walking through a garden, [跟随] tracking shot" \
      --subject-image face.jpg \
      --output minimax-output/person.mp4

This mode uses the S2V-01 model, limited to 6 seconds at 720P.
Long-Form Multi-Scene Video
For a video story with multiple scenes:

    bash scripts/video/generate_long_video.sh \
      --scenes \
      "An astronaut stands on a red planet, [推进] slow push in, cinematic" \
      "The astronaut walks toward a glowing structure, [跟随] tracking shot" \
      "The astronaut reaches the doorway, blue energy pulses, [推进] push in" \
      --music-prompt "cinematic orchestral ambient" \
      --output minimax-output/long_video.mp4

The first scene generates via text-to-video. Each subsequent scene uses the previous scene’s last frame as its starting point. Scenes join with 0.5s crossfade transitions.
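
To make the scene chaining concrete, here is a dry-run sketch that prints the commands such a chain would need without running anything. It is my reading of the behaviour described above, not the toolkit's actual implementation. The `generate_video.sh` flags are the ones used earlier in this post, while the function name `plan_long_video`, the scene file names, and the ffmpeg frame grab are assumptions.

```shell
# Dry run: print one command per step for a chain of scene prompts.
# Scene 1 is text-to-video; each later scene is image-to-video, seeded with
# the last frame of the previous clip.
# Assumption: the last frame is grabbed with ffmpeg by seeking to the final
# second (-sseof -1) and overwriting a single image (-update 1).
plan_long_video() {
  i=1
  for prompt in "$@"; do
    out="minimax-output/scene_${i}.mp4"
    if [ "$i" -eq 1 ]; then
      printf 'bash scripts/video/generate_video.sh --mode t2v --prompt "%s" --output %s\n' \
        "$prompt" "$out"
    else
      prev="minimax-output/scene_$(( i - 1 )).mp4"
      frame="minimax-output/scene_$(( i - 1 ))_last.jpg"
      printf 'ffmpeg -y -sseof -1 -i %s -update 1 -q:v 2 %s\n' "$prev" "$frame"
      printf 'bash scripts/video/generate_video.sh --mode i2v --prompt "%s" --first-frame %s --output %s\n' \
        "$prompt" "$frame" "$out"
    fi
    i=$(( i + 1 ))
  done
}
```

Calling `plan_long_video "scene one" "scene two" "scene three"` prints five lines: one t2v command, then a frame grab and an i2v command for each of the remaining two scenes. The crossfade and background music steps are left out of the sketch.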
Video Model Constraints
| Model | Duration | Resolution |
|---|---|---|
| MiniMax-Hailuo-2.3 | 6s or 10s | 768P, 1080P |
| MiniMax-Hailuo-02 | 6s or 10s | 512P, 768P, 1080P |
| S2V-01 (ref) | 6s only | 720P |
The key rule: 10s duration only works with 768P (or 512P on Hailuo-02). Never combine 10s with 1080P.
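
Since a rejected combination costs a round trip, the table and the key rule can be encoded as a local check. The sketch below is hypothetical (`check_video_combo` is not a toolkit command) and only encodes the constraints listed in this section.

```shell
# Hypothetical validator for model / duration (seconds) / resolution,
# based only on the table and rule above.
check_video_combo() {
  model=$1
  duration=$2
  resolution=$3

  case "$model" in
    MiniMax-Hailuo-2.3) allowed_dur="6 10"; allowed_res="768P 1080P" ;;
    MiniMax-Hailuo-02)  allowed_dur="6 10"; allowed_res="512P 768P 1080P" ;;
    S2V-01)             allowed_dur="6";    allowed_res="720P" ;;
    *) echo "unknown model: $model" >&2; return 1 ;;
  esac

  # Membership tests against the space-separated lists.
  case " $allowed_dur " in
    *" $duration "*) ;;
    *) echo "$model does not support ${duration}s" >&2; return 1 ;;
  esac
  case " $allowed_res " in
    *" $resolution "*) ;;
    *) echo "$model does not support $resolution" >&2; return 1 ;;
  esac

  # The key rule: 10s never goes with 1080P.
  if [ "$duration" = "10" ] && [ "$resolution" = "1080P" ]; then
    echo "10s cannot be combined with 1080P" >&2
    return 1
  fi
}
```

For example, `check_video_combo MiniMax-Hailuo-2.3 10 768P` succeeds, while the same call with `1080P` returns a non-zero status and prints the reason to stderr.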
Media Tools
The toolkit includes FFmpeg-based utilities for processing generated media:

    # Convert video format
    bash scripts/media_tools.sh convert-video input.webm -o output.mp4

    # Concatenate videos with crossfade
    bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4

    # Extract audio from video
    bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

    # Add background music
    bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 \
      -o output.mp4 --volume 0.3

Prerequisites
No Python required. I only needed:

    brew install ffmpeg jq   # macOS
    # or
    apt install ffmpeg jq    # Linux

    # Verify environment
    bash scripts/check_environment.sh

The scripts are pure bash using curl, ffmpeg, jq, and xxd.
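
If you want the same kind of check inside your own wrapper scripts, a few lines of shell are enough. This is a minimal sketch of a dependency check, not the contents of check_environment.sh, and `missing_tools` is a hypothetical name.

```shell
# Print the name of every listed tool that is not on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

A wrapper can call `missing=$(missing_tools curl ffmpeg jq xxd)` and stop with an error message if `$missing` is not empty.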
Output Directory
All generated files should go to minimax-output/ in your working directory:

    mkdir -p minimax-output
    bash scripts/tts/generate_voice.sh tts "Hello" -o minimax-output/hello.mp3

Summary
In this post, I showed how to use MiniMax’s multimodal AI toolkit to generate voice, music, video, and images through simple bash scripts. The toolkit provides unified access to MiniMax’s TTS (with voice cloning and design), music generation (instrumental and with lyrics), image generation (text-to-image and character reference), and video generation (text-to-video, image-to-video, and multi-scene).
The pure bash approach means no SDK dependencies—just curl, ffmpeg, and jq. This makes it easy to integrate into any workflow or scripting environment.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!