The ACM Multimedia Asia 2025 Grand Challenge: Multimodal Multiethnic Talking-Head Video Generation

Registration is now open. We warmly welcome your registration and participation!

📢 Important: All participants are requested to review the Submission Guidelines.

News & Updates

  • 🆕 September 9, 2025: The submission portal is now open. Please send the download link for your test results to the submission email address: 24b903117@stu.hit.edu.cn.

Introduction

Overview: Talking-Head Generation.

We propose the "Multimodal Multiethnic Talking-Head Video Generation" Grand Challenge on the ACM Multimedia Asia 2025 platform. The challenge aims to push the state of the art beyond existing models, which often generalize poorly across ethnicities and suffer from slow inference. Given a static reference face and multimodal prompts (such as audio or text), models are expected to generate high-quality, realistic, and expressive talking-head videos with strong identity consistency, accurate lip synchronization, and controllable speaking styles.

Challenge Task Definition

Task: High-Quality Talking-Head Video Generation

Challenge Task Overview

Participants are required to train a generalizable, high-performance talking-head video generation model. While we recommend using the provided DH-FaceVid-1K dataset for training, participants are free to use any additional or alternative training datasets. There are no restrictions on the technical architecture or experimental setups. This means that for each test sample, beyond the provided reference face image and speaking audio, participants are free to choose the input modalities for their model (e.g., using phoneme labels or text prompts derived from the audio). Furthermore, we do not impose any limitations on the training hardware configurations, such as the operating system or graphics cards used.

After completing model training, participants are required to use their trained model to perform inference on the test set samples, which are provided in the official Hugging Face repository: https://huggingface.co/datasets/jjuik2014/MMAsia-Challenge-Testset. Subsequently, they must submit the generated talking-head videos as instructed in the Submission Guidelines. Optionally, participants may create an online demo page to facilitate our evaluation of each method's specific inference speed.
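
For reference, the snippet below is a minimal sketch of fetching the test set with the huggingface_hub Python client; the local directory name is our own choice, and any other download method works equally well.

```python
# Minimal sketch: fetch the official test set with the huggingface_hub client.
# The local directory name is an arbitrary choice; adjust paths to your setup.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="jjuik2014/MMAsia-Challenge-Testset",
    repo_type="dataset",           # the test set is hosted as a dataset repo
    local_dir="./mmasia_testset",  # where to place the downloaded files
)
print("Test set downloaded to:", local_dir)
```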

Participation

The registration process is simple. Please submit your team name via this Google Form: https://docs.google.com/forms/d/1w2vO4VlJefpM7KPodLJr3omOMIkLgaDG8LahCKH4Eqw. You can use a placeholder for the results link during the initial submission and update it later.

Evaluation Metrics

Objective Metrics

Below are the objective metrics evaluated on the DH-FaceVid-1K test set. We will use them to automatically evaluate the submitted results; participants can also use them to track model performance during training.

  • FID (Fréchet Inception Distance)
  • FVD (Fréchet Video Distance)
  • Inference Speed (frames per second, FPS)
  • Sync-C & Sync-D (Synchronization Confidence & Distance for lip-sync quality)
  • AKD (Average Keypoint Distance for facial motion accuracy)
  • F-LMD (Facial Landmark Distance for expression fidelity)

These metrics quantitatively assess generation quality, temporal coherence, inference speed, lip synchronization, facial motion accuracy, and speaking-style transferability in face video synthesis.
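
For local sanity checks during training, metrics such as FID can be computed with off-the-shelf libraries. The snippet below is a minimal sketch using torchmetrics; it is one possible implementation, not necessarily the exact evaluation code used by the organizers, and the random tensors are placeholders for real and generated video frames.

```python
# Minimal sketch: compute FID between real and generated frames with torchmetrics
# (requires `pip install torchmetrics[image]`). Real evaluation would use frames
# decoded from the videos; random uint8 tensors stand in here as placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Placeholder batches of frames: (N, 3, H, W), uint8 in [0, 255].
real_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print("FID:", fid.compute().item())
```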


Subjective Evaluation

For manual subjective evaluation, we will assess the following aspects of the submitted videos: identity consistency, video quality, lip synchronization, and speaking style controllability. Each aspect is scored out of 5 points.


Final Ranking

The final rankings of participants will be determined based on a combination of these objective and subjective scores, along with the model's inference speed demonstrated on the optional web demo page.
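
As a purely illustrative example, the sketch below combines pre-normalized objective and subjective scores with equal weights; the actual weights, normalization, and scoring function used for the official ranking are not specified and may differ entirely.

```python
# Purely illustrative sketch of combining scores into a ranking; the organizers'
# actual weighting and normalization are NOT specified and may differ entirely.
def combined_score(objective_norm: float, subjective_norm: float,
                   w_obj: float = 0.5, w_subj: float = 0.5) -> float:
    """Both inputs are assumed pre-normalized to [0, 1] (higher = better)."""
    return w_obj * objective_norm + w_subj * subjective_norm

teams = {"team_a": combined_score(0.82, 0.75), "team_b": combined_score(0.78, 0.90)}
ranking = sorted(teams, key=teams.get, reverse=True)
print(ranking)  # hypothetical example output: ['team_b', 'team_a']
```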

Dataset

The dataset for this challenge is DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation [1].

The dataset is available for download on the official GitHub repository: https://github.com/luna-ai-lab/DH-FaceVid-1K.

Overview of the DH-FaceVid-1K dataset

For the test set samples, we provide the reference face image, speaking audio, and corresponding text annotations. An example is shown below:

Example of a test set sample
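
Programmatically, one such sample might be read as in the sketch below; the file names and folder layout are assumptions for illustration only, so please check the Hugging Face repository for the actual structure.

```python
# Minimal sketch of reading one test sample's modalities; the file names and
# layout below are assumptions -- check the Hugging Face repository for the
# actual structure.
from pathlib import Path

import soundfile as sf
from PIL import Image

sample_dir = Path("./mmasia_testset/sample_0001")           # hypothetical sample folder
reference_face = Image.open(sample_dir / "reference.png")   # static reference face
audio, sample_rate = sf.read(sample_dir / "speech.wav")     # driving speech audio
text_prompt = (sample_dir / "text.txt").read_text()         # text annotation

print(reference_face.size, audio.shape, sample_rate, text_prompt[:50])
```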
Timeline

Please note: The submission deadline is at 11:59 p.m. (Anywhere on Earth) of the stated deadline date.

  • Training data and participant instruction release: September 8, 2025
  • Registration deadline: September 30, 2025
  • Test dataset release: September 9, 2025
  • Result submission start: September 10, 2025
  • Result submission end: September 30, 2025
  • Paper invitation notification: October 1, 2025
  • Paper submission deadline: October 14, 2025
  • Camera-ready paper: October 24, 2025

Baseline

For the baseline, please refer to the provided training dataset repository, DH-FaceVid-1K: https://github.com/luna-ai-lab/DH-FaceVid-1K.

Rewards

Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to submit a technical paper to ACM MM Asia 2025.

Organizers

Donglin Di (Li Auto)

Lei Fan (The University of New South Wales)

Tonghua Su (Harbin Institute of Technology)

Xun Yang (University of Science and Technology of China)

Yongjia Ma (Li Auto)

He Feng (Harbin Institute of Technology)

References

[1] D. Di, H. Feng, W. Sun, et al. DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.