aliyun-wan-digital-human

$npx mdskill add cinience/alicloud-skills/aliyun-wan-digital-human

Generates talking, singing, or presentation videos from image and audio using Alibaba Cloud Model Studio

  • Solves the task of creating animated characters for narration, singing, or presentations
  • Relies on Alibaba Cloud Model Studio APIs and the `wan2.2-s2v` digital-human model
  • Validates input image with `wan2.2-s2v-detect` before video generation
  • Delivers output videos by processing image and audio inputs into synchronized character animations
SKILL.md
.github/skills/aliyun-wan-digital-humanView on GitHub ↗
---
name: aliyun-wan-digital-human
description: Use when generating talking, singing, or presentation videos from a single character image and audio with Alibaba Cloud Model Studio digital-human model `wan2.2-s2v`. Use when creating narrated avatar videos, singing portraits, or broadcast-style talking-head clips.
version: 1.0.0
---

Category: provider

# Model Studio Digital Human

## Validation

```bash
mkdir -p output/aliyun-wan-digital-human
python -m py_compile skills/ai/video/aliyun-wan-digital-human/scripts/prepare_digital_human_request.py && echo "py_compile_ok" > output/aliyun-wan-digital-human/validate.txt
```

Pass criteria: command exits 0 and `output/aliyun-wan-digital-human/validate.txt` is generated.

## Output And Evidence

- Save normalized request payloads, chosen resolution, and task polling snapshots under `output/aliyun-wan-digital-human/`.
- Record image/audio URLs and whether the input image passed detection.

Use this skill for image + audio driven speaking, singing, or presenting characters.

## Critical model names

Use these exact model strings:
- `wan2.2-s2v-detect`
- `wan2.2-s2v`

Selection guidance:
- Run `wan2.2-s2v-detect` first to validate the image.
- Use `wan2.2-s2v` for the actual video generation job.

## Prerequisites

- China mainland (Beijing) only.
- Set `DASHSCOPE_API_KEY` in your environment, or add `dashscope_api_key` to `~/.alibabacloud/credentials`.
- Input audio should contain clear speech or singing, and input image should depict a clear subject.

## Normalized interface (video.digital_human)

### Detect Request
- `model` (string, optional): default `wan2.2-s2v-detect`
- `image_url` (string, required)

### Generate Request
- `model` (string, optional): default `wan2.2-s2v`
- `image_url` (string, required)
- `audio_url` (string, required)
- `resolution` (string, optional): `480P` or `720P`
- `scenario` (string, optional): `talk`, `sing`, or `perform`

### Response
- `task_id` (string)
- `task_status` (string)
- `video_url` (string, when finished)

## Quick start

```bash
python skills/ai/video/aliyun-wan-digital-human/scripts/prepare_digital_human_request.py \
  --image-url "https://example.com/anchor.png" \
  --audio-url "https://example.com/voice.mp3" \
  --resolution 720P \
  --scenario talk
```

## Operational guidance

- Use a portrait, half-body, or full-body image with a clear face and stable framing.
- Match audio length to the desired output duration; the output follows the audio length up to the model limit.
- Keep image and audio as public HTTP/HTTPS URLs.
- If the image fails detection, do not proceed directly to video generation.

## Output location

- Default output: `output/aliyun-wan-digital-human/request.json`
- Override base dir with `OUTPUT_DIR`.

## References

- `references/sources.md`
More from cinience/alicloud-skills