The ACM Multimedia Asia 2025 Grand Challenge: Multimodal Multiethnic Talking-Head Video Generation

Registration is now open. We warmly welcome your registration and participation!

📢 Important: All participants are requested to review the Submission Guidelines.

News & Updates

  • 🆕December 3, 2025: The detailed program for the challenge session has been officially released.
  • 🆕October 15, 2025: The final ranking and results of the challenge have been released. Congratulations to the winning teams!

Introduction

Overview: Talking-Head Generation.

We propose the "Multimodal Multiethnic Talking-Head Video Generation" Grand Challenge on the ACM Multimedia Asia 2025 platform. This challenge aims to advance the state-of-the-art beyond models that often exhibit poor generalization across different ethnicities and suffer from slow inference speeds. Given a static reference face and multimodal prompts (such as audio or text), models are expected to generate high-quality, realistic, and expressive talking-head videos that feature strong identity consistency, accurate lip synchronization, and controllable speaking styles.

Challenge Task Definition

Task: High-Quality Talking-Head Video Generation

Challenge Task Overview

Participants are required to train a generalizable, high-performance talking-head video generation model. While we recommend using the provided DH-FaceVid-1K dataset for training, participants are free to use any additional or alternative training datasets. There are no restrictions on the technical architecture or experimental setups. This means that for each test sample, beyond the provided reference face image and speaking audio, participants are free to choose the input modalities for their model (e.g., using phoneme labels or text prompts derived from the audio). Furthermore, we do not impose any limitations on the training environment or hardware, such as the operating system or GPUs used.

After completing model training, participants are required to use their trained model to perform inference on the test set samples, which are provided in the official Hugging Face repository: https://huggingface.co/datasets/jjuik2014/MMAsia-Challenge-Testset. Subsequently, they must submit the generated talking-head videos as instructed in the Submission Guidelines. Optionally, participants may create an online demo page to facilitate our evaluation of each method's specific inference speed.
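
For reference, the test set can also be fetched programmatically. The sketch below is one possible way to do this with the huggingface_hub client; it is illustrative rather than an official requirement.

```python
# Minimal sketch (assumes huggingface_hub is installed) for fetching the official test set.
from huggingface_hub import snapshot_download

# Downloads the whole dataset repository and returns the local path.
local_dir = snapshot_download(
    repo_id="jjuik2014/MMAsia-Challenge-Testset",
    repo_type="dataset",
)
print("Test set downloaded to:", local_dir)
```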

Participation

The registration process is simple. Please submit your team name via this Google Form: https://docs.google.com/forms/d/1w2vO4VlJefpM7KPodLJr3omOMIkLgaDG8LahCKH4Eqw. You can use a placeholder for the results link during the initial submission and update it later.

Registered Teams

ID Team Name
1 3dScanning
2 AnthGen
3 cmvs
4 DHAL
5 HairBrother_Team
6 SZFaceU
7 Talk_Sclab
8 Team_EQUES

Evaluation Metrics

Objective Metrics

Below are the objective metrics evaluated on the test set of DH-FaceVid-1K. We will use these metrics to automatically evaluate the videos submitted for the test set samples. Participants can also use them to evaluate model performance during training.

  • FID (Fréchet Inception Distance)
  • FVD (Fréchet Video Distance)
  • Inference Speed (frames per second, FPS)
  • Sync-C & Sync-D (Synchronization Confidence & Distance for lip-sync quality)
  • AKD (Average Keypoint Distance for facial motion accuracy)
  • F-LMD (Facial Landmark Distance for expression fidelity)

Together, these metrics quantitatively assess generation quality, temporal coherence, lip synchronization, facial motion and expression fidelity, and inference speed in face video synthesis.
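
For participants who want to track a subset of these metrics during training, the sketch below gives illustrative NumPy/SciPy implementations of FID (from pre-extracted Inception features), AKD (from paired facial keypoints), and inference FPS. It is not the official evaluation code, and the feature and keypoint extraction steps are assumed to be done elsewhere.

```python
# Illustrative metric sketches, not the official evaluation code.
# Assumes Inception features (N x D) and facial keypoints (T x K x 2) are extracted elsewhere.
import numpy as np
from scipy import linalg


def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet Inception Distance between two sets of Inception features."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # discard small imaginary parts from numerical error
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2.0 * covmean))


def akd(kps_real: np.ndarray, kps_fake: np.ndarray) -> float:
    """Average Keypoint Distance: mean L2 distance between paired facial keypoints."""
    return float(np.linalg.norm(kps_real - kps_fake, axis=-1).mean())


def inference_fps(num_generated_frames: int, elapsed_seconds: float) -> float:
    """Inference speed in generated frames per second."""
    return num_generated_frames / elapsed_seconds
```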


Subjective Evaluation

For manual subjective evaluation, we will assess the following aspects of the videos submitted by participants: identity consistency, video quality, lip synchronization, and speaking style controllability. Each criterion is scored out of a maximum of 5 points.


Final Ranking

The final rankings of participants will be determined based on a combination of these objective and subjective scores, along with the model's inference speed demonstrated on the optional web demo page.
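
The exact weighting of objective metrics, subjective scores, and demo inference speed is determined by the organizers and is not specified here. Purely as a hypothetical illustration of how such a combination could look, consider the sketch below; every weight and normalization in it is an assumption, not the official formula.

```python
# Hypothetical illustration only: the weights and normalizations below are assumptions,
# not the official ranking formula.
def combined_score(objective_norm: float, subjective_avg: float, fps: float,
                   w_obj: float = 0.5, w_subj: float = 0.4, w_speed: float = 0.1) -> float:
    """Combine a normalized objective score (0-1, higher is better), an average
    subjective score (0-5), and inference speed (FPS) into a single ranking score."""
    subj_norm = subjective_avg / 5.0        # map the 5-point subjective scale to 0-1
    speed_norm = min(fps / 30.0, 1.0)       # assumption: 30 FPS counts as real time
    return w_obj * objective_norm + w_subj * subj_norm + w_speed * speed_norm
```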

Dataset

The dataset for this challenge is DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation.

The dataset is available for download on the official GitHub repository: https://github.com/luna-ai-lab/DH-FaceVid-1K.

Overview of the DH-FaceVid-1K dataset

For the test set samples, we provide the reference face image, speaking audio, and corresponding text annotations. An example is shown above.
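
As a rough sketch of how a sample might be loaded for inference, see below; the directory layout and file names are assumptions and should be adapted to the actual structure of the downloaded test set.

```python
# Rough sketch for loading one test sample; the paths and file names are assumptions.
from pathlib import Path

import soundfile as sf
from PIL import Image

sample_dir = Path("MMAsia-Challenge-Testset/sample_0001")      # hypothetical sample folder
reference_image = Image.open(sample_dir / "reference.png")     # static reference face
speech, sample_rate = sf.read(sample_dir / "speech.wav")       # driving speech audio
text_prompt = (sample_dir / "annotation.txt").read_text(encoding="utf-8")  # text annotation

print(reference_image.size, sample_rate, text_prompt[:80])
```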

Timeline (AoE Time)

Please note: The submission deadline is 11:59 p.m. Anywhere on Earth (AoE, UTC-12) on the stated deadline date.

  • Training data and participant instructions release: September 8, 2025
  • Test dataset release: September 9, 2025
  • Result submission start: September 10, 2025
  • Registration deadline: September 30, 2025
  • Result submission end: September 30, 2025
  • Paper invitation notification: October 1, 2025
  • Paper submission deadline: October 14, 2025
  • Camera-ready paper deadline: October 24, 2025
  • ACM Multimedia Asia 2025 Challenge Session: December 9, 2025

Baseline

For the baseline, see the provided training dataset repository, DH-FaceVid-1K: https://github.com/luna-ai-lab/DH-FaceVid-1K.

Challenge Results

  • 🏆 1st Place: CMVS
  • 🥈 2nd Place: SZFaceU
  • 🥉 3rd Place: EQUES

Challenge Session Program

Date: December 9, 2025

09:30 - 10:00: Opening & Awards

  • 09:30 - 09:55: Opening Remarks & Challenge Introduction
  • 09:55 - 10:00: Award Presentation

10:30 - 11:30: Team Presentations

10:30 - 10:50 | 3rd Place: Team EQUES

Title: AKITalk: Audio-Implicit Keypoints for Identity-Preserving Talking-Head Video Synthesis

Presenter: Riku Takahashi (HOSEI University)

Click to view Abstract

Existing talking-head video generation methods using speech audio often suffer from high computational costs or degraded identity preservation due to reliance on external super-resolution models. To address these issues, we propose a lightweight framework that predicts temporally consistent implicit 3D features from a reference image and speech audio. These features guide a generator trained on large-scale data to synthesize high-quality video frames efficiently. Experimental results demonstrate that our method achieves comparable or superior visual quality and identity preservation, while ensuring high inference efficiency. This balance of realism and computational performance suggests broad applicability in real-world scenarios such as virtual avatars and social media.

10:50 - 11:10 | 2nd Place: Team SZFaceU

Title: FlowTalk: Real-Time Audio-Driven Talking Head Synthesis via Motion-Space Flow Matching

Presenter: Kaijun Deng (Shenzhen University)

Click to view Abstract

Audio-driven talking head synthesis has achieved significant progress, yet existing methods face critical trade-offs among generation quality, inference efficiency, and cross-ethnic generalization. Diffusion-based approaches produce high-fidelity results but suffer from slow inference due to iterative denoising, while GAN-based methods achieve faster speed at the cost of reduced motion naturalness and limited generalization. To address these challenges, we propose FlowTalk, a novel framework that enables real-time high-fidelity talking head video synthesis. Our approach leverages Flow Matching technology to perform efficient motion modeling in a decoupled motion space rather than pixel space, achieving significant speedup while maintaining generation quality. Specifically, we adopt an off-the-shelf motion extractor to disentangle facial appearance from motion, and employ an OT-based flow matching model with a transformer architecture to predict identity-agnostic motion sequences conditioned on audio features. To improve cross-ethnic generalization, we train on a balanced combination of DH-FaceVid-1K and HDTF datasets with HuBert-CN as the audio encoder. Experimental results demonstrate that FlowTalk achieves over 100 FPS with 32 ODE solver steps, approximately 5 times faster than diffusion-based baselines with 500 steps, while preserving comparable visual quality in lip synchronization, facial expressions, and head movements.

11:10 - 11:30 | 1st Place: Team CMVS

Title: Detection-Aware Inference for Robust Talking-Head Video Generation

Presenter: Yueyi Yang (In-person)

Click to view Abstract

Talking-head video generation has seen significant progress with the rise of multimodal and multiethnic datasets; however, existing systems still suffer from stability issues when dealing with occluded or partially missing facial regions in the input. In this work, we present our model developed for the ACM Multimedia Asia 2025 Grand Challenge on Multimodal Multiethnic Talking Head Video Generation. Built upon the Hallo2 framework, our model introduces two key improvements to enhance inference robustness and usability. First, it detects cases where facial regions are heavily occluded or incompletely detected and automatically substitutes a static image across all frames, preventing inference failures and maintaining visual consistency when facial information is insufficient. Second, it supports multi-input inference, enabling the generation of multiple talking-head videos within a single execution process. These modifications result in a more reliable and flexible talking-head generation pipeline, suitable for diverse multimodal datasets and large-scale evaluation.

Organizers

Donglin Di (Li Auto)

Lei Fan (The University of New South Wales)

Tonghua Su (Harbin Institute of Technology)

Xun Yang (University of Science and Technology of China)

Yongjia Ma (Li Auto)

He Feng (Harbin Institute of Technology)

References

[1] D. Di, H. Feng, W. Sun, et al. DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.