I am currently a member of the Next-gen Kaldi team at Xiaomi Corp., working under the supervision of Dr. Daniel Povey.

I received my Ph.D. degree from the Institute of Acoustics, Chinese Academy of Sciences (IOACAS) and the University of Chinese Academy of Sciences (UCAS) in 2024, and my bachelor’s degree from the University of Electronic Science and Technology of China (UESTC) in 2019.

My research interests primarily focus on text-to-speech (TTS) and automatic speech recognition (ASR). My work has been published in top speech and AI venues, including IEEE/ACM TASLP, INTERSPEECH, ICASSP, ASRU, ACL, and ICLR. I built the open-source TTS projects OmniVoice and ZipVoice.

🔥 News

  • 2026.04:  🎉 We released OmniVoice, a SOTA voice-cloning TTS model supporting 600+ languages.
  • 2026.04:  🎉 ZipVoice-Dialog was accepted to ACL 2026 (Findings).
  • 2026.01:  🎉 Flow2GAN was accepted to ICLR 2026.
  • 2025.08:  🎉 ZipVoice was accepted to ASRU 2025.
  • 2025.01:  🎉 CR-CTC was accepted to ICLR 2025.
  • 2024.07:  😄 I joined the Next-gen Kaldi team at Xiaomi Corp.
  • 2024.06:  🎓 I received my Ph.D. degree from the Institute of Acoustics, Chinese Academy of Sciences (IOACAS) and the University of Chinese Academy of Sciences (UCAS).

📝 Selected Publications

Full list on Google Scholar.

🔊 Text-to-Speech (TTS)

  • [Preprint 2026] OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
    H Zhu, L Ye, W Kang, Z Yao, L Guo, F Kuang, Z Han, W Zhuang, L Lin, D Povey
  • [ASRU 2025] ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
    H Zhu, W Kang, Z Yao, L Guo, F Kuang, Z Li, W Zhuang, L Lin, D Povey
  • [ACL 2026 Findings] ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
    H Zhu, W Kang, L Guo, Z Yao, F Kuang, W Zhuang, Z Li, Z Han, D Zhang, X Zhang, X Song, L Ye, L Lin, D Povey
  • [ICLR 2026] Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation
    Z Yao, W Kang, H Zhu, L Guo, L Ye, F Kuang, W Zhuang, Z Li, Z Han, L Lin, D Povey

🎙️ Automatic Speech Recognition (ASR)

  • [IEEE/ACM TASLP 2024] Boosting Cross-Domain Speech Recognition with Self-Supervision
    H Zhu, G Cheng, J Wang, W Hou, P Zhang, Y Yan
  • [IEEE/ACM TASLP 2023] Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition
    H Zhu, D Gao, G Cheng, D Povey, P Zhang, Y Yan
  • [INTERSPEECH 2022] Decoupled Federated Learning for ASR with Non-IID Data
    H Zhu, J Wang, G Cheng, P Zhang, Y Yan
  • [INTERSPEECH 2022] Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
    H Zhu, L Wang, J Wang, G Cheng, P Zhang, Y Yan
  • [INTERSPEECH 2020] Domain Adaptation Using Class Similarity for Robust Speech Recognition
    H Zhu, J Zhao, Y Ren, L Wang, P Zhang
  • [INTERSPEECH 2019] Multi-Accent Adaptation Based on Gate Mechanism
    H Zhu, L Wang, P Zhang, Y Yan
  • [ICLR 2025] CR-CTC: Consistency Regularization on CTC for Improved Speech Recognition
    Z Yao, W Kang, X Yang, F Kuang, L Guo, H Zhu, Z Jin, Z Li, L Lin, D Povey
  • [IEEE/ACM TASLP 2026] Pseudo-Labeling Based Unsupervised Domain Adaptation for LLM-Based ASR
    L Zheng, H Zhu, X Wang, X Li, T Li, Y Yan
  • [ICASSP 2025] Hybrid Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition
    L Zheng, H Zhu, C Yang, X Wang, G Cheng, T Li
  • [IEEE SPL 2024] Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition
    L Zheng, H Zhu, S Tian, Q Zhao, T Li
  • [IEEE/ACM TASLP 2022] Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition
    W Hou, H Zhu, Y Wang, J Wang, T Qin, R Xu, T Shinozaki
  • [ICASSP 2021] Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Text Data
    C Gao, G Cheng, R Yang, H Zhu, P Zhang, Y Yan

💻 Open Source Projects

OmniVoice

A SOTA voice-cloning TTS model supporting 600+ languages, powered by a novel diffusion language model-style architecture.

ZipVoice

A series of fast and high-quality zero-shot text-to-speech models built with the flow matching objective and the Zipformer backbone, including ZipVoice, ZipVoice-Distill and ZipVoice-Dialog.

📖 Education

  • Ph.D., Institute of Acoustics, Chinese Academy of Sciences (IOACAS), and University of Chinese Academy of Sciences (UCAS)
    2019.09 – 2024.06
  • B.Eng., University of Electronic Science and Technology of China (UESTC)
    2015.09 – 2019.06

💼 Work Experience

  • Next-gen Kaldi Team, Xiaomi Corp.
    Supervised by Dr. Daniel Povey
    2024.07 – Present