The ACM Multimedia Asia 2025 Grand Challenge: Multimodal Multiethnic Talking-Head Video Generation

Registration is now open. We warmly welcome your registration and participation!

📢 Important: All participants are requested to review the Submission Guidelines.

News & Updates

  • 🆕 September 9, 2025: The submission portal is now open. Please send the download link for your test results to the submission email address: 24b903117@stu.hit.edu.cn.

Introduction

Overview: Talking-Head Generation.

We propose the "Multimodal Multiethnic Talking-Head Video Generation" Grand Challenge on the ACM Multimedia Asia 2025 platform. The challenge aims to push the state of the art beyond existing models, which often generalize poorly across ethnicities and suffer from slow inference. Given a static reference face and multimodal prompts (such as audio or text), models are expected to generate high-quality, realistic, and expressive talking-head videos with strong identity consistency, accurate lip synchronization, and controllable speaking styles.

Challenge Task Definition

Task: High-Quality Talking-Head Video Generation

Challenge Task Overview

Participants are required to train a generalizable, high-performance talking-head video generation model. While we recommend using the provided DH-FaceVid-1K dataset for training, participants are free to use any additional or alternative training datasets. There are no restrictions on the technical architecture or experimental setups. This means that for each test sample, beyond the provided reference face image and speaking audio, participants are free to choose the input modalities for their model (e.g., using phoneme labels or text prompts derived from the audio). Furthermore, we do not impose any limitations on the training hardware configurations, such as the operating system or graphics cards used.

After completing model training, participants are required to use their trained model to perform inference on the test set samples, which are provided in the official Hugging Face repository: https://huggingface.co/datasets/jjuik2014/MMAsia-Challenge-Testset. Subsequently, they must submit the generated talking-head videos as instructed in the Submission Guidelines. Optionally, participants may create an online demo page to facilitate our evaluation of each method's specific inference speed.
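
For reference, the snippet below is a minimal sketch of fetching the test set with the huggingface_hub Python client; the local directory name is our own choice, and any other download method works equally well.

```python
# Minimal sketch: fetch the official test set with the huggingface_hub client.
# The local directory name is an arbitrary choice; adjust paths to your setup.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="jjuik2014/MMAsia-Challenge-Testset",
    repo_type="dataset",           # the test set is hosted as a dataset repo
    local_dir="./mmasia_testset",  # where to place the downloaded files
)
print("Test set downloaded to:", local_dir)
```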

Participation

The registration process is simple. Please submit your team name via this Google Form: https://docs.google.com/forms/d/1w2vO4VlJefpM7KPodLJr3omOMIkLgaDG8LahCKH4Eqw. You can use a placeholder for the results link during the initial submission and update it later.

Evaluation Metrics

Objective Metrics

Below are the objective metrics evaluated on the DH-FaceVid-1K test set. We will use them to automatically evaluate the submitted results; participants can also use them to track model performance during training.

  • FID (Fréchet Inception Distance)
  • FVD (Fréchet Video Distance)
  • Inference Speed (frames per second, FPS)
  • Sync-C & Sync-D (Synchronization Confidence & Distance for lip-sync quality)
  • AKD (Average Keypoint Distance for facial motion accuracy)
  • F-LMD (Facial Landmark Distance for expression fidelity)

These metrics quantitatively assess generation quality, temporal coherence, inference speed, lip synchronization, facial motion accuracy, and speaking-style transferability in face video synthesis.
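
For local sanity checks during training, metrics such as FID can be computed with off-the-shelf libraries. The snippet below is a minimal sketch using torchmetrics; it is one possible implementation, not necessarily the exact evaluation code used by the organizers, and the random tensors are placeholders for real and generated video frames.

```python
# Minimal sketch: compute FID between real and generated frames with torchmetrics
# (requires `pip install torchmetrics[image]`). Real evaluation would use frames
# decoded from the videos; random uint8 tensors stand in here as placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Placeholder batches of frames: (N, 3, H, W), uint8 in [0, 255].
real_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print("FID:", fid.compute().item())
```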


Subjective Evaluation

For manual subjective evaluation, we will assess the following aspects of the submitted videos: identity consistency, video quality, lip synchronization, and speaking style controllability. Each aspect is scored out of 5 points.


Final Ranking

The final rankings of participants will be determined based on a combination of these objective and subjective scores, along with the model's inference speed demonstrated on the optional web demo page.
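
As a purely illustrative example, the sketch below combines pre-normalized objective and subjective scores with equal weights; the actual weights, normalization, and scoring function used for the official ranking are not specified and may differ entirely.

```python
# Purely illustrative sketch of combining scores into a ranking; the organizers'
# actual weighting and normalization are NOT specified and may differ entirely.
def combined_score(objective_norm: float, subjective_norm: float,
                   w_obj: float = 0.5, w_subj: float = 0.5) -> float:
    """Both inputs are assumed pre-normalized to [0, 1] (higher = better)."""
    return w_obj * objective_norm + w_subj * subjective_norm

teams = {"team_a": combined_score(0.82, 0.75), "team_b": combined_score(0.78, 0.90)}
ranking = sorted(teams, key=teams.get, reverse=True)
print(ranking)  # hypothetical example output: ['team_b', 'team_a']
```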

Dataset

The dataset for this challenge is DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation [1].

The dataset is available for download on the official GitHub repository: https://github.com/luna-ai-lab/DH-FaceVid-1K.

Overview of the DH-FaceVid-1K dataset

For the test set samples, we provide the reference face image, speaking audio, and corresponding text annotations. An example is shown below:

Example of a test set sample
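
Programmatically, one such sample might be read as in the sketch below; the file names and folder layout are assumptions for illustration only, so please check the Hugging Face repository for the actual structure.

```python
# Minimal sketch of reading one test sample's modalities; the file names and
# layout below are assumptions -- check the Hugging Face repository for the
# actual structure.
from pathlib import Path

import soundfile as sf
from PIL import Image

sample_dir = Path("./mmasia_testset/sample_0001")           # hypothetical sample folder
reference_face = Image.open(sample_dir / "reference.png")   # static reference face
audio, sample_rate = sf.read(sample_dir / "speech.wav")     # driving speech audio
text_prompt = (sample_dir / "text.txt").read_text()         # text annotation

print(reference_face.size, audio.shape, sample_rate, text_prompt[:50])
```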
Timeline

Please note: The submission deadline is at 11:59 p.m. (Anywhere on Earth) of the stated deadline date.

  • Training data and participant instruction release: September 8, 2025
  • Registration deadline: September 30, 2025
  • Test dataset release: September 9, 2025
  • Result submission start: September 10, 2025
  • Result submission end: September 30, 2025
  • Paper invitation notification: October 1, 2025
  • Paper submission deadline: October 14, 2025
  • Camera-ready paper: October 24, 2025

Baseline

For the baseline, please refer to the provided training dataset repository, DH-FaceVid-1K: https://github.com/luna-ai-lab/DH-FaceVid-1K.

Rewards

Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to submit a technical paper to ACM MM Asia 2025.

Organizers

Donglin Di (Li Auto)

Lei Fan (The University of New South Wales)

Tonghua Su (Harbin Institute of Technology)

Xun Yang (University of Science and Technology of China)

Yongjia Ma (Li Auto)

He Feng (Harbin Institute of Technology)

References

[1] D. Di, H. Feng, W. Sun, et al. DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.