Installation Guide

This guide covers installing and setting up SteadyDancer for human image animation. Follow these steps to prepare your environment, download the necessary checkpoints, and run inference on your own images and videos.

System Requirements

Hardware Requirements

SteadyDancer requires a GPU with substantial memory for optimal performance; a quick command for checking your available GPU memory follows this list:

  • Single-GPU inference: At least 40GB GPU memory recommended for the 14B model at 1024x576 resolution
  • Multi-GPU inference: Can distribute across 2 or more GPUs using FSDP and xDiT USP parallelization
  • Storage: At least 50GB for model weights and preprocessed data
  • CPU: Modern multi-core processor for preprocessing tasks
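
If you are unsure how much memory your GPU has, nvidia-smi reports it (this assumes an NVIDIA GPU with the driver installed):

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv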

Software Requirements

Before starting, ensure you have the following software installed; the version checks after this list can help confirm your setup:

  • Python 3.8 or higher
  • CUDA 11.7 or higher for GPU support
  • Git for cloning repositories
  • Hugging Face CLI for downloading model weights
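
A quick way to confirm these prerequisites, assuming the standard command-line tools are on your PATH:

python --version   # should report 3.8 or higher
nvcc --version     # CUDA toolkit version; if nvcc is absent, nvidia-smi also reports the driver's CUDA version
git --version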

Environment Setup

Clone the Repository

First, clone the SteadyDancer repository from GitHub:

git clone https://github.com/MCG-NJU/SteadyDancer.git
cd SteadyDancer

Create Virtual Environment

Create and activate a Python virtual environment to isolate dependencies:

python -m venv steadydancer_env
source steadydancer_env/bin/activate  # On Linux/Mac
# or
steadydancer_env\Scripts\activate  # On Windows
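
To confirm the virtual environment is active, check which Python interpreter is in use; it should point inside steadydancer_env:

which python   # use "where python" on Windows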

Install Dependencies

Install the required Python packages:

pip install -r requirements.txt
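
Assuming requirements.txt installs PyTorch (which the inference scripts rely on), you can confirm it was built with CUDA support and can see your GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"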

Install MMPose Dependencies

SteadyDancer uses MMPose for pose extraction. Install the required MMPose components:

pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"

Verify the installation by running these test imports:

python -c "from mmpose.apis import inference_topdown"
python -c "from mmpose.apis import init_model as init_pose_estimator"
python -c "from mmpose.evaluation.functional import nms"
python -c "from mmpose.utils import adapt_mmdet_pipeline"
python -c "from mmpose.structures import merge_data_samples"

Note: If you encounter issues with mmcv installation, you may need to build it from source. Refer to the official MMPose documentation for troubleshooting steps.

Download Checkpoints

Install Hugging Face CLI

If not already installed, install the Hugging Face CLI:

pip install huggingface-hub
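
You can confirm the CLI is on your PATH; logging in is only required if a download needs authentication:

huggingface-cli --help
# Only needed if a repository requires authentication:
# huggingface-cli login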

Download DWPose Weights

Download the DWPose pretrained weights for pose extraction:

mkdir -p ./preprocess/pretrained_weights/dwpose
huggingface-cli download yzd-v/DWPose --local-dir ./preprocess/pretrained_weights/dwpose --include "dw-ll_ucoco_384.pth"
wget https://download.openmmlab.com/mmdetection/v2.0/yolox/yolox_l_8x8_300e_coco/yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth -O ./preprocess/pretrained_weights/dwpose/yolox_l_8x8_300e_coco.pth
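
After both downloads, the dwpose directory should contain the two checkpoint files used by the preprocessing scripts:

ls -lh ./preprocess/pretrained_weights/dwpose
# Expected: dw-ll_ucoco_384.pth  yolox_l_8x8_300e_coco.pth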

Download SteadyDancer Model

Download the SteadyDancer-14B model weights from Hugging Face:

huggingface-cli download MCG-NJU/SteadyDancer-14B --local-dir ./SteadyDancer-14B
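
You can sanity-check the download by listing the local directory; the exact file layout depends on the contents of the Hugging Face repository:

du -sh ./SteadyDancer-14B
ls ./SteadyDancer-14B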

Inference Instructions

Prepare Your Data

Organize your input data with the following structure; a layout matching the paths used in the scripts below is sketched after this list:

  • Reference image: A single portrait image of the person you want to animate
  • Driving video: A video containing the motion you want to apply
  • Text prompt: A description of the desired animation
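
A suggested layout, consistent with the paths used in the preprocessing script (your_image.png, your_video, and the prompt text are placeholders to replace with your own data):

data/
├── images/
│   └── your_image.png      # reference image
└── videos/
    └── your_video/         # one directory per driving video
        ├── video.mp4       # driving video
        └── prompt.txt      # text prompt describing the desired animation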

Step 1: Pose Extraction and Alignment

Extract and align poses from your reference image and driving video:

ref_image_path="data/images/your_image.png"
driving_video_path="data/videos/your_video"   # directory containing video.mp4 and prompt.txt
pair_id="your_pair_id"
output=./preprocess/output/${pair_id}/$(date +"%Y%m%d%H%M%S")

# Extract and align pose (Positive Condition)
outfn=$output/positive/all.mp4
outfn_align_pose_video=$output/positive/single.mp4
python preprocess/pose_align.py \
    --imgfn_refer "$ref_image_path" \
    --vidfn "${driving_video_path}/video.mp4" \
    --outfn "$outfn" \
    --outfn_align_pose_video "$outfn_align_pose_video"

outfn_align_pose_video=$output/positive/single.mp4
python preprocess/dump_video_images.py "$outfn_align_pose_video" "$(dirname "$outfn_align_pose_video")"

# Extract and align pose (Negative Condition)
outfn=$output/negative/all.mp4
outfn_align_pose_video=$output/negative/single.mp4
python preprocess/pose_align_withdiffaug.py \
    --imgfn_refer "$ref_image_path" \
    --vidfn "${driving_video_path}/video.mp4" \
    --outfn "$outfn" \
    --outfn_align_pose_video "$outfn_align_pose_video"

outfn_align_pose_video=$output/negative/single_aug.mp4
python preprocess/dump_video_images.py "$outfn_align_pose_video" "$(dirname "$outfn_align_pose_video")"

# Copy other files
cp "$ref_image_path" "$output/ref_image.png"
cp "${driving_video_path}/video.mp4" "$output/driving_video.mp4"
cp "${driving_video_path}/prompt.txt" "$output/prompt.txt"

Step 2: Generate Animation

Run the generation script to create your animated video:

ckpt_dir="./SteadyDancer-14B"

input_dir="preprocess/output/your_pair_id/your_timestamp"
image="$input_dir/ref_image.png"
cond_pos_folder="$input_dir/positive/"
cond_neg_folder="$input_dir/negative/"
prompt=$(cat "$input_dir/prompt.txt")
save_file="$(basename "$(dirname "$input_dir")")--Pair$(basename "$input_dir").mp4"

cfg_scale=5.0
condition_guide_scale=1.0
pro=0.4
base_seed=106060

# Single-GPU inference
CUDA_VISIBLE_DEVICES=0 python generate_dancer.py \
    --task i2v-14B --size 1024*576 \
    --ckpt_dir $ckpt_dir \
    --prompt "$prompt" \
    --image $image \
    --cond_pos_folder $cond_pos_folder \
    --cond_neg_folder $cond_neg_folder \
    --sample_guide_scale $cfg_scale \
    --condition_guide_scale $condition_guide_scale \
    --end_cond_cfg $pro \
    --base_seed $base_seed \
    --save_file "${save_file}--$(date +"%Y%m%d%H%M%S")"

Multi-GPU Inference (Optional)

For faster inference with multiple GPUs:

GPUs=2
torchrun --nproc_per_node=${GPUs} generate_dancer.py \
    --dit_fsdp --t5_fsdp --ulysses_size ${GPUs} \
    --task i2v-14B --size 1024*576 \
    --ckpt_dir $ckpt_dir \
    --prompt "$prompt" \
    --image $image \
    --cond_pos_folder $cond_pos_folder \
    --cond_neg_folder $cond_neg_folder \
    --sample_guide_scale $cfg_scale \
    --condition_guide_scale $condition_guide_scale \
    --end_cond_cfg $pro \
    --base_seed $base_seed \
    --save_file "${save_file}--$(date +"%Y%m%d%H%M%S")--xDiTUSP${GPUs}"

Note: Multi-GPU inference may be faster and use less memory per device, but results may differ slightly from single-GPU inference due to distributed computing characteristics. For best reproducibility, use single-GPU inference.

Configuration Parameters

Key Parameters

  • cfg_scale (5.0): Classifier-free guidance scale for generation quality
  • condition_guide_scale (1.0): Balance between positive and negative conditions
  • pro (0.4): Proportion of steps to apply conditional guidance
  • base_seed (106060): Random seed for reproducibility
  • size (1024*576): Output resolution, given as width*height

Adjusting Parameters

You can adjust these parameters to control the generation; a small seed-sweep example follows this list:

  • Higher cfg_scale values produce results that more closely follow the prompt and conditions
  • The pro parameter controls when conditional guidance ends during generation
  • Different seeds produce different variations of the animation
  • Resolution can be adjusted based on available GPU memory
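
Since different seeds give different variations, a simple loop over base_seed values produces several candidates to compare. This sketch reuses the single-GPU command from Step 2 and assumes those variables are still set; the seed values themselves are arbitrary:

for seed in 42 1234 106060; do
    CUDA_VISIBLE_DEVICES=0 python generate_dancer.py \
        --task i2v-14B --size 1024*576 \
        --ckpt_dir $ckpt_dir \
        --prompt "$prompt" \
        --image $image \
        --cond_pos_folder $cond_pos_folder \
        --cond_neg_folder $cond_neg_folder \
        --sample_guide_scale $cfg_scale \
        --condition_guide_scale $condition_guide_scale \
        --end_cond_cfg $pro \
        --base_seed $seed \
        --save_file "${save_file}--seed${seed}--$(date +"%Y%m%d%H%M%S")"
done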

Troubleshooting

Common Issues

If you encounter problems during installation or inference:

  • Out of memory errors: Reduce the output resolution (the size parameter) or use multi-GPU inference; the command after this list lets you watch GPU memory usage during a run
  • MMPose installation issues: Build mmcv from source following official documentation
  • Pose extraction failures: Ensure driving video contains clear human poses
  • Generation quality issues: Adjust cfg_scale and condition_guide_scale parameters
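
To see how close you are to the memory limit while a job runs, you can watch GPU usage in a second terminal (assuming an NVIDIA GPU):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2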

For additional help and the latest updates, refer to the SteadyDancer repository on GitHub. The community is active and can assist with specific issues.