Generate Voice, Music, Video, and Images with MiniMax AI: The Multimodal Toolkit Guide

Mar 27, 2026

I needed to add AI-generated media to my application: voice narration, background music, and short video clips. I looked at several providers and found MiniMax offers a unified API for all of these—TTS, music, video, and images. The best part? I could use it with pure bash scripts, no Python SDK required.

The Problem: Multiple APIs, Multiple Integrations

Most AI media providers specialize in one thing. You get TTS from one service, music from another, video from a third. Each has its own SDK, authentication, and workflow.

I wanted something simpler. When I found the minimax-multimodal-toolkit skill, I realized MiniMax provides all four modalities through a single API. The toolkit wraps everything in bash scripts using only curl, ffmpeg, and jq.

What MiniMax Offers

| Capability | Entry Point | Key Features |
| --- | --- | --- |
| TTS | `scripts/tts/generate_voice.sh` | Voice cloning, voice design, multi-segment |
| Music | `scripts/music/generate_music.sh` | Songs with lyrics, instrumentals |
| Image | `scripts/image/generate_image.sh` | Text-to-image, character reference |
| Video | `scripts/video/generate_video.sh` | Text-to-video, image-to-video, templates |
| Long Video | `scripts/video/generate_long_video.sh` | Multi-scene with crossfade |

API Configuration

MiniMax has two endpoints depending on your account region:

# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"

# or Global
export MINIMAX_API_HOST="https://api.minimax.io"

# Your API key (starts with sk-api- or sk-cp-)
export MINIMAX_API_KEY="sk-api-your-key-here"

Get your key from the MiniMax platform for global accounts or MiniMax China for China accounts.
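
A mismatched region or a mistyped key is easier to spot before the first API call than after it. The sketch below is my own helper, not part of the toolkit: the name `check_minimax_env` is hypothetical, and it checks nothing beyond the two host URLs and the two key prefixes listed above.

```shell
# Hypothetical helper (not part of the toolkit): sanity-check the two
# environment variables described above before running any script.
check_minimax_env() {
  case "${MINIMAX_API_HOST:-}" in
    https://api.minimaxi.com|https://api.minimax.io) ;;
    *) echo "MINIMAX_API_HOST is unset or not a known endpoint" >&2; return 1 ;;
  esac
  case "${MINIMAX_API_KEY:-}" in
    sk-api-?*|sk-cp-?*) ;;
    *) echo "MINIMAX_API_KEY should start with sk-api- or sk-cp-" >&2; return 1 ;;
  esac
}
```

Calling `check_minimax_env || exit 1` at the top of a wrapper script stops early with a readable message. It cannot tell whether the key is actually valid for the chosen region; only a real request can confirm that.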

TTS: Text-to-Speech

The TTS system supports single-voice and multi-voice generation. I tested both approaches.

Single Voice (Default)

For simple narration, the tts command handles everything in one call:

bash scripts/tts/generate_voice.sh tts "Hello world, this is a test." \
  -o minimax-output/hello.mp3

The default model is speech-2.8-hd, which auto-matches emotion from text. I didn’t need to specify emotion manually.

Multi-Segment Audiobook

For a project with narrator and dialogue, I needed multiple voices. The toolkit uses a segments.json file:

[
  { "text": "Morning sunlight streamed into the classroom.", "voice_id": "narrator", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator", "emotion": "" },
  { "text": "The weather is amazing today!", "voice_id": "tom", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack first.", "voice_id": "lisa", "emotion": "" }
]
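
A malformed entry in segments.json is cheaper to catch before any audio is generated. The following is a minimal sketch, assuming every segment needs a non-empty `text` and `voice_id` as in the example above (the toolkit may accept or require other fields). `validate_segments` is a hypothetical name, not a toolkit command.

```shell
# Hypothetical pre-check (not a toolkit command): succeed only if the file
# is a non-empty JSON array whose entries all have non-empty string
# "text" and "voice_id" fields. Requires jq.
validate_segments() {
  jq -e '
    type == "array" and length > 0 and
    all(.[]; (.text | type == "string" and length > 0) and
             (.voice_id | type == "string" and length > 0))
  ' "$1" > /dev/null
}
```

With `jq -e`, the exit status reflects the final boolean, so the function can gate the real call: `validate_segments segments.json && bash scripts/tts/generate_voice.sh generate segments.json -o minimax-output/audiobook.mp3`.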

Then generate with the generate command:

bash scripts/tts/generate_voice.sh generate segments.json \
  -o minimax-output/audiobook.mp3 --crossfade 200

The script generates each segment individually and merges them with 200ms crossfade transitions.

Voice Cloning

I cloned my own voice from a 30-second sample:

bash scripts/tts/generate_voice.sh clone my_voice_sample.mp3 \
  --voice-id my-custom-voice

After cloning, I could use my-custom-voice as a voice_id in any TTS call.

Voice Design

For a project needing a specific voice profile, I used the design feature:

bash scripts/tts/generate_voice.sh design "A warm female narrator voice, calm and professional" \
  --voice-id narrator-voice

This created a voice matching my description. The API returns a preview audio file automatically.

TTS Models

| Model | Notes |
| --- | --- |
| `speech-2.8-hd` | Recommended, auto emotion matching |
| `speech-2.8-turbo` | Faster variant |
| `speech-2.6-hd` | Previous generation, manual emotion |

Music Generation

I needed background music for videos. The music API generates both instrumental tracks and songs with lyrics.

Instrumental (For BGM)

bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric, calming" \
  --output minimax-output/bgm.mp3 --download

For BGM, always use --instrumental. The toolkit handles the underlying API requirements automatically.

With Lyrics

When I wanted an actual song:

bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nWalking down the street\n[chorus]\nLa la la la" \
  --prompt "indie folk, acoustic guitar, melancholic" \
  --output minimax-output/song.mp3 --download

The default model is music-2.5.
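
One shell detail to be aware of: inside double quotes, bash keeps `\n` as two literal characters (a backslash and an `n`). I have not verified whether `generate_music.sh` converts those into line breaks itself, so the sketch below shows two standard ways to build a lyrics string with real newlines in case you need them. The variable names are only illustrative.

```shell
# Two standard ways to build a multi-line string in the shell.
# printf interprets \n in its format string:
lyrics_printf=$(printf '[verse]\nWalking down the street\n[chorus]\nLa la la la')

# A quoted literal that simply spans several lines:
lyrics_literal='[verse]
Walking down the street
[chorus]
La la la la'
```

Either variable can then be passed as `--lyrics "$lyrics_literal"`. Both hold the same four-line string.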

Image Generation

The image API supports text-to-image (t2i) and image-to-image with character reference (i2i).

Text-to-Image

bash scripts/image/generate_image.sh \
  --prompt "A cat on a rooftop at sunset, cinematic lighting, warm tones" \
  --aspect-ratio 16:9 \
  -o minimax-output/cat.png

The aspect ratio matters. I matched it to my use case:

| Use Case | Ratio | Resolution |
| --- | --- | --- |
| Avatar/icon | 1:1 | 1024x1024 |
| Desktop wallpaper | 16:9 | 1280x720 |
| Phone wallpaper | 9:16 | 720x1280 |
| Photography | 3:2 | 1248x832 |
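
The resolutions in the table follow directly from the ratios: dividing width and height by their greatest common divisor gives the ratio back. Here is a small sketch of that arithmetic, which I find handy for working out which `--aspect-ratio` value matches an existing image size. `ratio_of` is a hypothetical helper name, not something the toolkit provides.

```shell
# Hypothetical helper: reduce WIDTH x HEIGHT to its simplest ratio
# using Euclid's algorithm for the greatest common divisor.
# The ( ... ) body runs in a subshell, so the variables stay local.
ratio_of() (
  w=$1
  h=$2
  a=$w
  b=$h
  while [ "$b" -ne 0 ]; do
    t=$((a % b))
    a=$b
    b=$t
  done
  echo "$((w / a)):$((h / a))"
)

ratio_of 1024 1024   # 1:1
ratio_of 1280 720    # 16:9
ratio_of 720 1280    # 9:16
ratio_of 1248 832    # 3:2
```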

Character Reference (Image-to-Image)

To generate images with a consistent character:

bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl in a library, warm afternoon light, reading a book" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/library_scene.png

The --ref-image should be a front-facing portrait for best results. JPG, JPEG, and PNG formats are supported up to 10MB.
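
Those two limits are easy to check locally before uploading anything. The sketch below is a hypothetical pre-flight helper, not part of the toolkit. It assumes "10MB" means 10 * 1024 * 1024 bytes and judges the format by file extension only, so it cannot tell whether the file really is an image or whether the portrait is front-facing.

```shell
# Hypothetical pre-flight check for --ref-image, based on the limits stated
# above: JPG/JPEG/PNG only, at most 10MB (assumed: 10 * 1024 * 1024 bytes).
check_ref_image() {
  file=$1
  [ -f "$file" ] || { echo "not a file: $file" >&2; return 1; }
  # Match the extension case-insensitively.
  case "$(printf '%s' "$file" | tr '[:upper:]' '[:lower:]')" in
    *.jpg|*.jpeg|*.png) ;;
    *) echo "unsupported format: $file" >&2; return 1 ;;
  esac
  # wc may pad its output with spaces on some systems, so strip them.
  size=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$size" -gt $((10 * 1024 * 1024)) ]; then
    echo "file larger than 10MB: $file" >&2
    return 1
  fi
}
```

Usage: `check_ref_image face.jpg && bash scripts/image/generate_image.sh --mode i2i ...`.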

Video Generation

This is where MiniMax really shines. The video API supports text-to-video, image-to-video, start-end frame interpolation, and subject reference.

Text-to-Video

The default is 10 seconds at 768P:

bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy runs toward camera, [跟随] tracking shot, golden hour lighting" \
  --output minimax-output/puppy.mp4

I learned the prompt formula matters: Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere.

Camera instructions are Chinese keywords wrapped in square brackets:

| Instruction | Meaning |
| --- | --- |
| `[推进]` | Push in |
| `[拉远]` | Pull out |
| `[跟随]` | Tracking shot |
| `[固定]` | Fixed camera |
| `[左摇]` | Pan left |
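
To keep prompts consistent across many clips, the formula and the camera table can be wrapped in two small functions. This is only a sketch of how I would do it: `camera_tag`, `build_video_prompt`, and the English keys such as `push-in` are names I made up, and the example scene text is purely illustrative.

```shell
# Hypothetical helpers: look up a camera instruction from the table above
# and join the five parts of the prompt formula into one string.
camera_tag() {
  case "$1" in
    push-in)  echo "[推进]" ;;
    pull-out) echo "[拉远]" ;;
    tracking) echo "[跟随]" ;;
    fixed)    echo "[固定]" ;;
    pan-left) echo "[左摇]" ;;
    *) echo "unknown camera move: $1" >&2; return 1 ;;
  esac
}

# Usage: build_video_prompt SUBJECT SCENE MOVEMENT CAMERA_MOVE ATMOSPHERE
build_video_prompt() {
  tag=$(camera_tag "$4") || return 1
  printf '%s, %s, %s, %s, %s\n' "$1" "$2" "$3" "$tag" "$5"
}

build_video_prompt "A golden retriever puppy" "in a sunlit park" \
  "runs toward camera" tracking "golden hour lighting"
```

The result can be passed straight to the script, for example `--prompt "$(build_video_prompt ...)"`.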

Image-to-Video

To animate a static image:

bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "Gentle ripples spread across water, soft breeze, [固定] fixed camera" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4

For i2v mode, the prompt should focus on movement only—the image already provides the visual.

Subject Reference (Face Consistency)

To keep the same person across videos:

bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A woman walking through a garden, [跟随] tracking shot" \
  --subject-image face.jpg \
  --output minimax-output/person.mp4

This mode uses the S2V-01 model, limited to 6 seconds at 720P.

Long-Form Multi-Scene Video

For a video story with multiple scenes:

bash scripts/video/generate_long_video.sh \
  --scenes \
    "An astronaut stands on a red planet, [推进] slow push in, cinematic" \
    "The astronaut walks toward a glowing structure, [跟随] tracking shot" \
    "The astronaut reaches the doorway, blue energy pulses, [推进] push in" \
  --music-prompt "cinematic orchestral ambient" \
  --output minimax-output/long_video.mp4

The first scene generates via text-to-video. Each subsequent scene uses the previous scene’s last frame as its starting point. Scenes join with 0.5s crossfade transitions.
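
If the crossfade overlaps adjacent clips (which is how ffmpeg's `xfade` filter behaves; I am assuming the toolkit does something similar), each transition shortens the final video by the crossfade length. For N scenes of d seconds each with a crossfade of x seconds, the expected total is N*d - (N-1)*x. Here is a rough sketch of that arithmetic, with a hypothetical helper name:

```shell
# Hypothetical helper: estimated length of a multi-scene video when every
# transition overlaps neighbouring clips by the crossfade duration.
# Usage: long_video_seconds SCENE_COUNT SCENE_SECONDS [CROSSFADE_SECONDS]
long_video_seconds() {
  LC_ALL=C awk -v n="$1" -v d="$2" -v x="${3:-0.5}" \
    'BEGIN { printf "%.1f\n", n * d - (n - 1) * x }'
}

long_video_seconds 3 6    # three 6s scenes with 0.5s crossfades: 17.0
```

This is useful mainly for sizing background music: a three-scene story at 6 seconds per scene needs roughly 17 seconds of audio, not 18.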

Video Model Constraints

| Model | Duration | Resolution |
| --- | --- | --- |
| MiniMax-Hailuo-2.3 | 6s or 10s | 768P, 1080P |
| MiniMax-Hailuo-02 | 6s or 10s | 512P, 768P, 1080P |
| S2V-01 (ref) | 6s only | 720P |

The key rule: 10s duration only works with 768P (or 512P on Hailuo-02). Never combine 10s with 1080P.
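
Because a rejected combination only shows up after a request has been submitted, I find it useful to encode the table and the 10s rule in a local check. The function below is a hypothetical sketch built only from the constraints listed above. It is not how the toolkit validates input, and the API may enforce further limits I have not covered.

```shell
# Hypothetical validator encoding the table and the 10s rule above.
# Usage: check_video_combo MODEL DURATION_SECONDS RESOLUTION
check_video_combo() {
  case "$1:$2:$3" in
    MiniMax-Hailuo-2.3:6:768P|MiniMax-Hailuo-2.3:6:1080P) ;;
    MiniMax-Hailuo-2.3:10:768P) ;;
    MiniMax-Hailuo-02:6:512P|MiniMax-Hailuo-02:6:768P|MiniMax-Hailuo-02:6:1080P) ;;
    MiniMax-Hailuo-02:10:512P|MiniMax-Hailuo-02:10:768P) ;;
    S2V-01:6:720P) ;;
    *) echo "unsupported combination: $1, ${2}s, $3" >&2; return 1 ;;
  esac
}
```

For example, `check_video_combo MiniMax-Hailuo-2.3 10 1080P` returns a non-zero status and prints a message, while `check_video_combo MiniMax-Hailuo-02 10 512P` succeeds silently.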

Media Tools

The toolkit includes FFmpeg-based utilities for processing generated media:

# Convert video format
bash scripts/media_tools.sh convert-video input.webm -o output.mp4

# Concatenate videos with crossfade
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4

# Extract audio from video
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

# Add background music
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 \
  -o output.mp4 --volume 0.3

Prerequisites

No Python required. I only needed:

brew install ffmpeg jq              # macOS
# or
apt install ffmpeg jq               # Linux

# Verify environment
bash scripts/check_environment.sh

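The scripts are pure bash and call curl, ffmpeg, jq, and xxd. Only ffmpeg and jq appear in the install commands above because curl and xxd are often already present on macOS and many Linux distributions; install them separately if the environment check reports them missing.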
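
The toolkit's `check_environment.sh` does the real verification, and I have not reproduced its logic here. If you want the same kind of check inside your own wrapper script, a minimal sketch might look like this (`missing_tools` is a hypothetical name):

```shell
# Hypothetical stand-in for a dependency check: print every tool from the
# argument list that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools curl ffmpeg jq xxd)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing >&2
fi
```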

Output Directory

All generated files should go to minimax-output/ in your working directory:

mkdir -p minimax-output
bash scripts/tts/generate_voice.sh tts "Hello" -o minimax-output/hello.mp3

Summary

In this post, I showed how to use MiniMax’s multimodal AI toolkit to generate voice, music, video, and images through simple bash scripts. The toolkit provides unified access to MiniMax’s TTS (with voice cloning and design), music generation (instrumental and with lyrics), image generation (text-to-image and character reference), and video generation (text-to-video, image-to-video, and multi-scene).

The pure bash approach means no SDK dependencies—just curl, ffmpeg, and jq. This makes it easy to integrate into any workflow or scripting environment.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!