Registered Teams
| ID | Team Name |
|---|---|
| 1 | 3dScanning |
| 2 | AnthGen |
| 3 | cmvs |
| 4 | DHAL |
| 5 | HairBrother_Team |
| 6 | SZFaceU |
| 7 | Talk_Sclab |
| 8 | Team_EQUES |
Registration is now open, and we warmly welcome your participation!
Overview: Talking-Head Generation
We propose the "Multimodal Multiethnic Talking-Head Video Generation" Grand Challenge on the ACM Multimedia Asia 2025 platform. This challenge aims to push the state of the art beyond current models, which often generalize poorly across ethnicities and suffer from slow inference. Given a static reference face and multimodal prompts (such as audio or text), models are expected to generate high-quality, realistic, and expressive talking-head videos with strong identity consistency, accurate lip synchronization, and controllable speaking styles.
Task: High-Quality Talking-Head Video Generation
Participants are required to train a generalizable, high-performance talking-head video generation model. While we recommend using the provided DH-FaceVid-1K dataset for training, participants are free to use any additional or alternative training datasets. There are no restrictions on the technical architecture or experimental setups. This means that for each test sample, beyond the provided reference face image and speaking audio, participants are free to choose the input modalities for their model (e.g., using phoneme labels or text prompts derived from the audio). Furthermore, we do not impose any limitations on the training hardware configurations, such as the operating system or graphics cards used.
After completing model training, participants are required to use their trained model to perform inference on the test set samples, which are provided in the official Hugging Face repository: https://huggingface.co/datasets/jjuik2014/MMAsia-Challenge-Testset. Subsequently, they must submit the generated talking-head videos as instructed in the Submission Guidelines. Optionally, participants may create an online demo page to facilitate our evaluation of each method's specific inference speed.
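For orientation, here is a minimal sketch of how a submission pipeline might fetch the test set and loop over its samples. The `snapshot_download` call targets the official Hugging Face repository above, while `run_inference` and the assumed file layout (one folder per sample containing a reference image and a WAV file) are placeholders to be replaced by each team's own model and by the actual repository structure.

```python
# Minimal test-set inference sketch (illustrative only).
# `run_inference` is a placeholder for the participant's trained model,
# and the assumed per-sample folder layout may differ from the real one.
from pathlib import Path

from huggingface_hub import snapshot_download


def run_inference(reference_image: Path, audio: Path, out_path: Path) -> None:
    """Placeholder: generate a talking-head video with your trained model."""
    raise NotImplementedError


def main() -> None:
    # Download the official test set from Hugging Face.
    testset_dir = Path(
        snapshot_download(
            repo_id="jjuik2014/MMAsia-Challenge-Testset",
            repo_type="dataset",
        )
    )
    out_dir = Path("submission_videos")
    out_dir.mkdir(exist_ok=True)

    # Assumed layout: one sub-directory per test sample with a reference
    # image (*.png or *.jpg) and a speaking-audio file (*.wav).
    for sample_dir in sorted(p for p in testset_dir.iterdir() if p.is_dir()):
        image = next(sample_dir.glob("*.png"), None) or next(sample_dir.glob("*.jpg"), None)
        audio = next(sample_dir.glob("*.wav"), None)
        if image is None or audio is None:
            continue  # skip folders that do not match the assumed layout
        run_inference(image, audio, out_dir / f"{sample_dir.name}.mp4")


if __name__ == "__main__":
    main()
```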
The registration process is simple. Please submit your team name via this Google Form: https://docs.google.com/forms/d/1w2vO4VlJefpM7KPodLJr3omOMIkLgaDG8LahCKH4Eqw. You can use a placeholder for the results link during the initial submission and update it later.
Objective Metrics
The objective metrics are computed on the test set of DH-FaceVid-1K. We will use them to automatically score the submitted test-set videos, and participants can also use them to monitor model performance during training.
These metrics quantitatively assess generation quality, temporal coherence, speed, and speaking style transferability in face video synthesis.
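For a quick local sanity check before official scoring, participants can compute simple frame-level measures between generated videos and held-out ground-truth videos (e.g., from DH-FaceVid-1K during training). The sketch below uses PSNR and SSIM as illustrative stand-ins; it is not the official metric suite.

```python
# Illustrative frame-level metrics (PSNR and SSIM) between a generated
# video and a ground-truth video of the same resolution. These are
# stand-ins for the official metric suite, not a replacement for it.
import cv2  # OpenCV, for video decoding
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def read_frames(path: str) -> list[np.ndarray]:
    """Decode a video file into a list of RGB frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def video_psnr_ssim(generated: str, reference: str) -> tuple[float, float]:
    """Average PSNR/SSIM over the frames the two videos have in common."""
    gen, ref = read_frames(generated), read_frames(reference)
    n = min(len(gen), len(ref))
    psnr = float(np.mean([peak_signal_noise_ratio(ref[i], gen[i]) for i in range(n)]))
    ssim = float(np.mean([
        structural_similarity(ref[i], gen[i], channel_axis=-1) for i in range(n)
    ]))
    return psnr, ssim
```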
Subjective Evaluation
For manual subjective evaluation, we will assess the following aspects of the submitted videos: identity consistency, video quality, lip synchronization, and speaking style controllability. Each aspect is scored out of 5 points.
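As a rough illustration only, the snippet below averages the four 5-point aspects over raters; the organizers' actual aggregation rule is not specified here and may differ.

```python
# Sketch of aggregating the four subjective aspects (each rated out of 5).
# Averaging over raters and aspects is an illustrative assumption, not the
# organizers' official aggregation rule.
ASPECTS = (
    "identity_consistency",
    "video_quality",
    "lip_synchronization",
    "speaking_style_controllability",
)


def subjective_score(ratings: list[dict[str, float]]) -> float:
    """Mean score over all raters and aspects, on a 0-5 scale."""
    per_rater = [sum(r[a] for a in ASPECTS) / len(ASPECTS) for r in ratings]
    return sum(per_rater) / len(per_rater)


# Example: two raters scoring one submission.
print(subjective_score([
    {"identity_consistency": 4.5, "video_quality": 4.0,
     "lip_synchronization": 4.5, "speaking_style_controllability": 3.5},
    {"identity_consistency": 4.0, "video_quality": 4.5,
     "lip_synchronization": 4.0, "speaking_style_controllability": 4.0},
]))  # 4.125
```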
Final Ranking
The final rankings of participants will be determined based on a combination of these objective and subjective scores, along with the model's inference speed demonstrated on the optional web demo page.
The dataset for this challenge is DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation [1].
The dataset is available for download on the official GitHub repository: https://github.com/luna-ai-lab/DH-FaceVid-1K.
For the test set samples, we provide the reference face image, speaking audio, and corresponding text annotations. An example is shown above.
Please note: The submission deadline is 11:59 p.m. (Anywhere on Earth, UTC-12) on the stated deadline date.
| Event | Date |
|---|---|
| ACM Multimedia Asia 2025 Challenge Session | December 9, 2025 |
Date: December 9, 2025
10:30 - 10:50 | 3rd Place: Team EQUES
Title: AKITalk: Audio-Implicit Keypoints for Identity-Preserving Talking-Head Video Synthesis
Presenter: Riku Takahashi (HOSEI University)
Existing talking-head video generation methods using speech audio often suffer from high computational costs or degraded identity preservation due to reliance on external super-resolution models. To address these issues, we propose a lightweight framework that predicts temporally consistent implicit 3D features from a reference image and speech audio. These features guide a generator trained on large-scale data to synthesize high-quality video frames efficiently. Experimental results demonstrate that our method achieves comparable or superior visual quality and identity preservation, while ensuring high inference efficiency. This balance of realism and computational performance suggests broad applicability in real-world scenarios such as virtual avatars and social media.
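For readers unfamiliar with implicit-keypoint pipelines, the generic PyTorch sketch below shows one way to predict per-frame implicit 3D keypoints from audio features, which would then drive a pretrained image generator. All module choices and dimensions are placeholders; this is not Team EQUES's implementation.

```python
# Generic sketch of audio-driven implicit-keypoint prediction, in the
# spirit of the approach described above. NOT Team EQUES's code; module
# shapes and names are illustrative placeholders.
import torch
import torch.nn as nn


class AudioToKeypoints(nn.Module):
    """Map a sequence of audio features to per-frame implicit 3D keypoints."""

    def __init__(self, audio_dim: int = 768, num_kp: int = 21, hidden: int = 256):
        super().__init__()
        self.temporal = nn.GRU(audio_dim, hidden, batch_first=True)  # temporal smoothing
        self.head = nn.Linear(hidden, num_kp * 3)                    # (x, y, z) per keypoint
        self.num_kp = num_kp

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim)
        h, _ = self.temporal(audio_feats)        # temporally consistent features
        kp = self.head(h)                        # (batch, frames, num_kp * 3)
        return kp.view(*kp.shape[:2], self.num_kp, 3)


# The predicted keypoints would drive a generator, e.g. by warping features
# extracted from the reference face image.
model = AudioToKeypoints()
dummy_audio = torch.randn(1, 100, 768)    # 100 frames of audio features
print(model(dummy_audio).shape)           # torch.Size([1, 100, 21, 3])
```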
10:50 - 11:10 | 2nd Place: Team SZFaceU
Title: FlowTalk: Real-Time Audio-Driven Talking Head Synthesis via Motion-Space Flow Matching
Presenter: Kaijun Deng (Shenzhen University)
Audio-driven talking head synthesis has achieved significant progress, yet existing methods face critical trade-offs among generation quality, inference efficiency, and cross-ethnic generalization. Diffusion-based approaches produce high-fidelity results but suffer from slow inference due to iterative denoising, while GAN-based methods achieve faster speed at the cost of reduced motion naturalness and limited generalization. To address these challenges, we propose FlowTalk, a novel framework that enables real-time high-fidelity talking head video synthesis. Our approach leverages Flow Matching technology to perform efficient motion modeling in a decoupled motion space rather than pixel space, achieving significant speedup while maintaining generation quality. Specifically, we adopt an off-the-shelf motion extractor to disentangle facial appearance from motion, and employ an OT-based flow matching model with a transformer architecture to predict identity-agnostic motion sequences conditioned on audio features. To improve cross-ethnic generalization, we train on a balanced combination of DH-FaceVid-1K and HDTF datasets with HuBert-CN as the audio encoder. Experimental results demonstrate that FlowTalk achieves over 100 FPS with 32 ODE solver steps, approximately 5 times faster than diffusion-based baselines with 500 steps, while preserving comparable visual quality in lip synchronization, facial expressions, and head movements.
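To make the motion-space flow-matching idea concrete, the generic sketch below performs Euler-step ODE sampling of an audio-conditioned motion latent. The velocity network, dimensions, and step count are placeholders; this is not Team SZFaceU's implementation.

```python
# Generic sketch of flow-matching sampling in a decoupled motion space,
# illustrating the idea described above. NOT Team SZFaceU's code; the
# velocity-field network and all dimensions are placeholders.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predict the flow velocity of a motion latent, conditioned on audio and time."""

    def __init__(self, motion_dim: int = 64, audio_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x: torch.Tensor, audio: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, audio, t], dim=-1))


@torch.no_grad()
def sample_motion(field: VelocityField, audio: torch.Tensor, steps: int = 32) -> torch.Tensor:
    """Integrate dx/dt = v(x, audio, t) from noise (t=0) to a motion latent (t=1)."""
    x = torch.randn(audio.shape[0], 64)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):                            # simple Euler solver
        t = torch.full((audio.shape[0], 1), i * dt)
        x = x + field(x, audio, t) * dt
    return x                                          # identity-agnostic motion latent


field = VelocityField()
motion = sample_motion(field, torch.randn(4, 768))    # audio features for 4 frames
print(motion.shape)                                   # torch.Size([4, 64])
```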
11:10 - 11:30 | 1st Place: Team CMVS
Title: Detection-Aware Inference for Robust Talking-Head Video Generation
Presenter: Yueyi Yang (In-person)
Talking-head video generation has seen significant progress with the rise of multimodal and multiethnic datasets; however, existing systems still suffer from stability issues when dealing with occluded or partially missing facial regions in the input. In this work, we present our model developed for the ACM Multimedia Asia 2025 Grand Challenge on Multimodal Multiethnic Talking Head Video Generation. Built upon the Hallo2 framework, our model introduces two key improvements to enhance inference robustness and usability. First, it detects cases where facial regions are heavily occluded or incompletely detected and automatically substitutes a static image across all frames, preventing inference failures and maintaining visual consistency when facial information is insufficient. Second, it supports multi-input inference, enabling the generation of multiple talking-head videos within a single execution process. These modifications result in a more reliable and flexible talking-head generation pipeline, suitable for diverse multimodal datasets and large-scale evaluation.
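The detection-aware fallback can be summarized with the sketch below, where `detect_face` stands in for any off-the-shelf face detector and the thresholds are illustrative assumptions; this is not Team CMVS's actual code.

```python
# Generic sketch of a detection-aware fallback. NOT Team CMVS's code:
# `detect_face` is a hypothetical stand-in for an off-the-shelf detector,
# and both thresholds are illustrative assumptions.
from typing import Optional

import numpy as np


def detect_face(frame: np.ndarray) -> Optional[tuple[tuple[int, int, int, int], float]]:
    """Hypothetical detector: returns (bbox, confidence) or None."""
    raise NotImplementedError


def prepare_frames(
    frames: list[np.ndarray],
    fallback: np.ndarray,
    min_conf: float = 0.5,
    min_valid_ratio: float = 0.8,
) -> list[np.ndarray]:
    """Substitute a static image across all frames when detection is unreliable."""
    valid = 0
    for frame in frames:
        result = detect_face(frame)
        if result is not None and result[1] >= min_conf:
            valid += 1
    if valid < min_valid_ratio * max(len(frames), 1):
        return [fallback] * len(frames)   # face heavily occluded or missing
    return frames
```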
Donglin Di (Li Auto)
Lei Fan (The University of New South Wales)
Tonghua Su (Harbin Institute of Technology)
Xun Yang (University of Science and Technology of China)
Yongjia Ma (Li Auto)
He Feng (Harbin Institute of Technology)
[1] D. Di, H. Feng, W. Sun, et al. "DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation." In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.