Registration is now open, and we warmly welcome your participation!
Overview: Talking-Head Generation
We propose the "Multimodal Multiethnic Talking-Head Video Generation" Grand Challenge on the ACM Multimedia Asia 2025 platform. This challenge aims to advance the state of the art beyond current models, which often generalize poorly across ethnicities and suffer from slow inference. Given a static reference face and multimodal prompts (such as audio or text), models are expected to generate high-quality, realistic, and expressive talking-head videos with strong identity consistency, accurate lip synchronization, and controllable speaking styles.
Task: High-Quality Talking-Head Video Generation
Participants are required to train a generalizable, high-performance talking-head video generation model. While we recommend using the provided DH-FaceVid-1K dataset for training, participants are free to use any additional or alternative training datasets, and there are no restrictions on the technical architecture or experimental setup. This means that for each test sample, beyond the provided reference face image and speaking audio, participants are free to choose the input modalities for their model (e.g., phoneme labels or text prompts derived from the audio). Furthermore, we do not impose any limitations on the training environment, such as the operating system or GPUs used.
After completing model training, participants are required to run inference on the test set samples, which are provided in the official Hugging Face repository: https://huggingface.co/datasets/jjuik2014/MMAsia-Challenge-Testset. They must then submit the generated talking-head videos as instructed in the Submission Guidelines. Optionally, participants may create an online demo page to facilitate our evaluation of each method's inference speed.
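For convenience, the test set can also be fetched programmatically. Below is a minimal sketch using the `huggingface_hub` Python package; the repository id comes from the URL above, and the layout of the downloaded files should be verified against the repository itself:

```python
# Minimal sketch: download the official test set from Hugging Face.
# Requires the huggingface_hub package (pip install huggingface_hub).
from huggingface_hub import snapshot_download

# Fetch a local copy of the dataset repository named in the challenge page.
local_dir = snapshot_download(
    repo_id="jjuik2014/MMAsia-Challenge-Testset",
    repo_type="dataset",
)
print(f"Test set downloaded to: {local_dir}")
```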
The registration process is simple. Please submit your team name via this Google Form: https://docs.google.com/forms/d/1w2vO4VlJefpM7KPodLJr3omOMIkLgaDG8LahCKH4Eqw. You can use a placeholder for the results link during the initial submission and update it later.
Objective Metrics
Below are the objective metrics evaluated on the DH-FaceVid-1K test set. We will use these metrics to automatically evaluate the submitted test set videos; participants may also use them to monitor model performance during training.
These metrics quantitatively assess generation quality, temporal coherence, speed, and speaking style transferability in face video synthesis.
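The official metric suite is defined by the organizers; purely as an illustration, the sketch below computes two widely used per-frame quality measures (PSNR and SSIM) with scikit-image, which participants might use to track progress during training. The metric choice here is an assumption for illustration, not the official evaluation code:

```python
# Illustration only: the official challenge metrics are defined by the
# organizers. PSNR and SSIM are common per-frame quality measures that
# participants might use to monitor training.
# Requires numpy and scikit-image (>= 0.19 for `channel_axis`).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(generated: np.ndarray, reference: np.ndarray):
    """PSNR and SSIM between two uint8 RGB frames of identical shape."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```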
Subjective Evaluation
For the manual subjective evaluation, we will assess the following aspects of the submitted videos: identity consistency, video quality, lip synchronization, and speaking style controllability. Each criterion is scored out of 5 points.
Final Ranking
The final ranking of participants will be determined by a combination of these objective and subjective scores, together with the inference speed demonstrated on the optional web demo page.
The dataset for this challenge is DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation.
The dataset is available for download on the official GitHub repository: https://github.com/luna-ai-lab/DH-FaceVid-1K.
For each test set sample, we provide a reference face image, the speaking audio, and corresponding text annotations.
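As a minimal sketch, the snippet below loads one such sample. The per-sample folder and file names (`sample_001`, `image.png`, `audio.wav`, `text.txt`) are hypothetical; consult the Hugging Face repository for the actual layout:

```python
# Sketch of reading one test sample. The folder and file names are
# hypothetical; check the Hugging Face repository for the real layout.
# Requires Pillow and soundfile (pip install Pillow soundfile).
from pathlib import Path

import soundfile as sf
from PIL import Image

sample_dir = Path("MMAsia-Challenge-Testset/sample_001")  # hypothetical path
reference_face = Image.open(sample_dir / "image.png")     # static reference face
audio, sample_rate = sf.read(sample_dir / "audio.wav")    # driving speech
text = (sample_dir / "text.txt").read_text().strip()      # text annotation
```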
Please note: the submission deadline is 11:59 p.m. Anywhere on Earth (AoE, UTC-12) on the stated deadline date.
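For a quick check of the current AoE time, a small sketch using only Python's standard library:

```python
# Print the current time in the Anywhere-on-Earth timezone (UTC-12),
# which governs the deadlines in the timeline below.
from datetime import datetime, timedelta, timezone

AOE = timezone(timedelta(hours=-12), "AoE")
print(datetime.now(AOE).strftime("%Y-%m-%d %H:%M %Z"))
```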
| Event | Date (AoE) |
| --- | --- |
| Registration deadline | September 30, 2025 |
| Test dataset release | September 9, 2025 |
| Result submission start | September 10, 2025 |
| Result submission end | September 30, 2025 |
| Paper invitation notification | October 1, 2025 |
| Paper submission deadline | October 14, 2025 |
| Camera-ready paper | October 24, 2025 |
The provided training dataset repository: DH-FaceVid-1K (https://github.com/luna-ai-lab/DH-FaceVid-1K).
Top-ranked participants in this competition will receive a certificate of achievement and will be invited to submit a technical paper to ACM MM Asia 2025.
Donglin Di (Li Auto)
Lei Fan (The University of New South Wales)
Tonghua Su (Harbin Institute of Technology)
Xun Yang (University of Science and Technology of China)
Yongjia Ma (Li Auto)
He Feng (Harbin Institute of Technology)
[1] D. Di, H. Feng, W. Sun, et al. DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.