Advanced Data & Signal Processing Laboratory

Project Achievements

National Natural Science Foundation of China (NSFC) General Program, "Research on Deep Models for Dynamic Visual Scene Description Based on Multimodal Semantic Relation Reasoning and Alignment", Grant No. 62176008, January 2022 – December 2025, RMB 580,000. Principal Investigator: Prof. Yuexian Zou.

(I) Abstract (Chinese)
Dynamic visual scene description (DVSD) aims to understand complex semantics in visual scenes and to generate accurate, natural language descriptions. As a key enabling technology for intelligent perception and human–computer interaction in complex environments, it has important application prospects in smart transportation, public safety, intelligent surveillance, and decision support. However, owing to the spatiotemporal complexity of dynamic visual scene content, the nonlinear structure of video temporal information, and the pronounced semantic gap between the visual and linguistic modalities, existing DVSD models still fall far short of human-level understanding of dynamic visual scenes. To address these key challenges, this project investigated deep DVSD models based on multimodal semantic relation reasoning and alignment, systematically studying representation learning, semantic reasoning and alignment, and language generation for dynamic visual scene description. The following innovative work was carried out:
(1) For efficient modeling of temporal motion characteristics in dynamic visual scenes, we proposed PAL, an unsupervised video representation learning method based on pseudo-action localization, and LocVTP, a video–text joint representation learning method based on multi-granularity contrastive learning, improving the discriminability of temporal motion features;
(2) For efficient DVSD with adaptive multimodal alignment, we proposed the visual-relation-aware framework VRVC, the decoupled multimodal pre-training and alignment frameworks UVC-VI and ZeroNLG, and CGKT, a multimodal alignment training method based on consensus guidance and adaptive keyword weighting, significantly improving description performance in weakly supervised, zero-shot, and cross-lingual settings;
(3) For multimodal high-order semantic relation reasoning with interactive graph networks, we proposed DMP-Net and FTM, motion-information modeling methods based on graph neural networks and optical flow, strengthening intra- and inter-modality semantic relation modeling;
(4) To meet practical application needs in smart transportation, we constructed TV-CL, a high-quality Chinese video description dataset for traffic scenes, and designed and implemented an end-to-end traffic-scene video description model.
Experimental results on public video understanding and description benchmarks such as MSR-VTT and ActivityNet show that the proposed methods achieve state-of-the-art performance on dynamic visual scene description, video–text retrieval, and related temporal understanding tasks, and experiments on traffic-scene data verify their practical value. Centered on the key scientific problem of joint multimodal representation and adaptive alignment, the project systematically reveals how multimodal temporal evolution and semantic relations can be modeled in complex dynamic visual scenes, provides a new theoretical basis for mitigating visual–language semantic inconsistency and unstable alignment, enriches the theoretical framework of multimodal temporal understanding and language generation, and is of significant scientific value for advancing fundamental research in multimodal intelligent perception and human–computer interaction.

(II) Abstract (English)
Dynamic Visual Scene Description (DVSD) aims to understand complex semantics from visual scenes and generate accurate and natural language descriptions. It serves as a fundamental enabling technology for intelligent perception and human–computer interaction in complex environments, with broad application prospects in smart transportation, public safety, intelligent surveillance, and decision support. However, due to the spatiotemporal complexity of dynamic visual scenes, the nonlinear structural characteristics of video temporal information, and the significant semantic gap between visual and linguistic modalities, existing DVSD models still lag far behind human-level understanding of dynamic visual scenes. To address these challenges, this project investigates deep DVSD models driven by multimodal semantic relation reasoning and alignment. The project systematically studies representation learning, semantic reasoning and alignment, and generation for dynamic visual scene description, and has achieved the following innovative results.

  • For efficient modeling of temporal motion characteristics in dynamic visual scenes, an unsupervised video representation learning method based on pseudo-action localization (PAL) and a video–text joint representation learning method based on multi-granularity contrastive learning (LocVTP) are proposed, which enhance the discriminability of temporal motion features (a minimal sketch of such a contrastive objective follows this abstract).
  • For multimodal high-level semantic relation reasoning and cross-modal alignment, motion modeling methods based on graph neural networks and optical flow (DMP-Net and FTM), as well as a semantic alignment method based on geodesic distance and game-theoretic mechanisms (G2L), are proposed, strengthening semantic relation modeling within and across modalities.
  • For efficient generation and generalization of DVSD, decoupled multimodal pre-training and alignment frameworks, namely UVC-VI and ZeroNLG, are developed. Through multimodal semantic adaptive alignment, the proposed methods significantly improve description performance under weakly supervised, zero-shot, and cross-lingual settings.
  • To meet practical requirements in smart transportation scenarios, a high-quality Chinese traffic-scene video description dataset (TV-CL) is constructed, and an end-to-end video-based traffic scene description generation model is designed and implemented.

Experimental results on multiple public video understanding and video description benchmarks, such as MSR-VTT and ActivityNet, demonstrate that the proposed methods achieve state-of-the-art performance on tasks including dynamic visual scene description, video–text retrieval, and temporal action detection. Experiments on real-world traffic scenarios further verify the effectiveness and practical value of the proposed methods. Focusing on the key scientific problem of multimodal joint representation learning and adaptive alignment, this project systematically reveals the intrinsic mechanisms of multimodal temporal evolution and semantic relation modeling in complex dynamic visual scenes. The results provide new theoretical foundations for overcoming semantic inconsistency and unstable alignment between visual and language modalities, enrich the theoretical framework of multimodal temporal understanding and language generation, and hold significant scientific importance for advancing fundamental research in multimodal intelligent perception and human–computer interaction.
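To make the contrastive alignment idea in the first bullet concrete, here is a minimal sketch of a symmetric InfoNCE objective over paired video and caption embeddings. It illustrates only the general video–text contrastive principle; the embedding dimension, single (clip-level) granularity, and temperature value are illustrative assumptions, not the published LocVTP design.

```python
# Minimal sketch: symmetric InfoNCE for video-text contrastive alignment.
# Encoders are omitted; inputs are assumed to be projected embeddings.
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (B, D); row i of each is a positive pair,
    and every other row in the batch serves as a negative."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Average the video-to-text and text-to-video cross-entropies.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random "clip" and "caption" embeddings.
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```

A multi-granularity variant would apply a loss of the same form at additional levels, for example clip–phrase pairs, on top of this batch-level objective.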

(III) Research Outcomes
This project has produced 5 high-quality journal papers and 13 conference papers in the field, filed 5 Chinese invention patent applications, and registered 2 software copyrights. Detailed research content and the full list of outcomes are given in the appendix.
Appendix: Academic Outcomes and Main Research Content of the Project

 


National Natural Science Foundation of China (NSFC) Project, "Research on Speech Source DOA Estimation Methods Based on Acoustic Vector Sensor Arrays and Sparse Representation", Grant No. 61271309, January 2013 – December 2016, RMB 720,000. Principal Investigator: Prof. Yuexian Zou.

(I) Abstract (Chinese)
Direction-of-arrival (DOA) estimation of spatial speech sources is a key technology for the auditory systems of service robots, with great application value and market potential. Traditional DOA estimation methods face serious limitations in robustness, accuracy, system cost, and physical size that restrict their practical use. Targeting service-robot applications, this project studied new theories and methods for accurate, robust, multi-source speech DOA estimation based on sparse representation and the acoustic vector sensor (AVS). The work is summarized as follows:
(1) We studied DOA estimation based on AVS arrays and sparse representation, proposing two new DOA estimation algorithms built on AVS array/subarray data models (AVS-SS-LF and AVS-SS-ST); simulation results verified their effectiveness;
(2) We studied sparse-representation-based speech DOA estimation with a single AVS: we derived an approximate time-frequency-domain model of the inter-sensor data ratios (ISDR) of an AVS, obtained the functional relationship between DOA and ISDR, derived an ISDR-based overcomplete-dictionary sparse representation model, and proposed a new DOA estimation algorithm, AVS-ISDR-SSR; extensive simulations and real-world experiments verified its effectiveness;
(3) We studied DOA estimation based on the time-frequency sparsity of speech and a single AVS, proposing a new multi-source DOA estimation algorithm, AVS-ISDR; experiments show that it can estimate the DOAs of up to seven speech sources. Building on this, we proposed four algorithms for extracting reliable high-local-SNR time-frequency points, enabling AVS-ISDR to deliver stable, accurate multi-source DOA estimates over a wide SNR range and under reverberation;
(4) We analyzed the bispectral characteristics of multi-channel AVS speech signals and, exploiting the suppression of Gaussian noise in the bispectrum domain, proposed two DOA estimation methods based on bispectral data ratios (AVS-BISDR and AVS-MBISDR) that effectively suppress additive white Gaussian noise and directional Gaussian interference;
(5) Based on the time-frequency sparsity of speech and machine-learning strategies, we proposed two deep-learning-based robust DOA estimation methods (AVS-DNN-ISDR and AVS-WISDR-DNN) that achieve accurate DOA estimation at low SNR and under strong reverberation;
(6) We independently designed and built the AVS sensor and a DOA estimation prototype system, validated the proposed DOA algorithms on real measurements, and carried out research on related key technologies for robot audition, including speech enhancement, speaker recognition, and audio event detection.
In summary, the team completed the planned research tasks on schedule. The results have attracted attention from companies including Huawei, Haier, Guangzhou Shiyuan Co., Ltd., UBTECH, and Shenzhen Hai'an Technology Co., Ltd., and technology transfer is actively under way.

(II) Abstract (English)
Direction-of-arrival (DOA) estimation of spatial speech sources is a key technology for the auditory systems of service robots, with great application value and market potential. Traditional array-based DOA estimation methods suffer from limitations in noise robustness, estimation accuracy, hardware cost, and physical size that restrict their practical application. Targeting service-robot applications, this project developed new theories and methods for accurate, robust, multi-source speech DOA estimation based on the Acoustic Vector Sensor (AVS) and sparse representation theory. The main results are summarized as follows:

(1) Within the framework of AVS arrays and sparse representation theory, we proposed two new DOA estimation algorithms built on AVS-array/subarray data models, termed AVS-SS-LF and AVS-SS-ST; numerical simulations verified their effectiveness.
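The following is a minimal sketch of the grid-based sparse recovery idea underlying such algorithms: an overcomplete dictionary of AVS-array steering vectors is built over a candidate-azimuth grid, and a sparse solver selects the atoms that best explain the observation. The 2-D steering model, uniform linear geometry, and the use of orthogonal matching pursuit as the solver are illustrative assumptions; the AVS-SS-LF and AVS-SS-ST formulations themselves differ.

```python
# Minimal sketch: grid-based sparse DOA recovery with an overcomplete
# AVS-array dictionary, solved by orthogonal matching pursuit (OMP).
import numpy as np

def avs_array_dictionary(grid_deg, n_avs=4, spacing=0.5):
    """Columns are steering vectors on a candidate-azimuth grid. Each AVS
    contributes [pressure, x-velocity, y-velocity] channels; spacing is
    the inter-sensor distance in wavelengths (assumed 2-D far field)."""
    theta = np.deg2rad(grid_deg)
    avs_pattern = np.stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
    phase = np.exp(-2j * np.pi * spacing * np.arange(n_avs)[:, None]
                   * np.cos(theta))                      # (n_avs, G)
    return (phase[:, None, :] * avs_pattern[None, :, :]).reshape(3 * n_avs, -1)

def omp_doa(y, A, n_sources):
    """Greedily pick the atoms most correlated with the residual."""
    residual, support = y.copy(), []
    for _ in range(n_sources):
        support.append(int(np.argmax(np.abs(A.conj().T @ residual))))
        As = A[:, support]
        x, *_ = np.linalg.lstsq(As, y, rcond=None)
        residual = y - As @ x
    return support

grid = np.arange(0.0, 180.0, 1.0)
A = avs_array_dictionary(grid)
y = A[:, 60] + 0.8 * A[:, 120] + 0.05 * (np.random.randn(A.shape[0])
                                         + 1j * np.random.randn(A.shape[0]))
print([grid[i] for i in omp_doa(y, A, 2)])   # expect ~60 and ~120 degrees
```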

(2) Within the framework of a single AVS and sparse representation theory, we derived the inter-sensor data ratio (ISDR) model of the AVS and obtained the functional relationship between DOA and ISDR. We then formed an overcomplete-dictionary sparse representation model of the ISDR and developed a new DOA estimation algorithm, termed AVS-ISDR-SSR. Numerous simulations as well as real-world experiments verified the effectiveness of the proposed algorithm.
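The core of the ISDR idea can be shown in a few lines: for a far-field source in the horizontal plane, the velocity-to-pressure channel ratios of an AVS approximate cos(theta) and sin(theta) of the azimuth at time-frequency points dominated by the source. The white-noise surrogate "speech", the 2-D channel model, and the energy-based selection of reliable TF points below are illustrative assumptions, not the derived approximate model itself.

```python
# Minimal sketch: single-source azimuth from single-AVS inter-sensor
# data ratios (ISDR) in the time-frequency domain.
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
fs, theta_true = 16000, np.deg2rad(70.0)
s = rng.standard_normal(fs)                       # surrogate speech source
o = s + 0.05 * rng.standard_normal(fs)            # omni pressure channel
u = s * np.cos(theta_true) + 0.05 * rng.standard_normal(fs)  # x-velocity
v = s * np.sin(theta_true) + 0.05 * rng.standard_normal(fs)  # y-velocity

_, _, O = stft(o, fs, nperseg=512)
_, _, U = stft(u, fs, nperseg=512)
_, _, V = stft(v, fs, nperseg=512)

# Keep only high-energy TF points, where the ratios are reliable.
mask = np.abs(O) > 0.5 * np.abs(O).max()
isdr_u = np.real(U[mask] / O[mask])               # ~ cos(theta)
isdr_v = np.real(V[mask] / O[mask])               # ~ sin(theta)
doa = np.rad2deg(np.arctan2(np.median(isdr_v), np.median(isdr_u)))
print(f"estimated azimuth: {doa:.1f} deg")        # ~70
```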

(3) Based on the time-frequency sparsity of speech and a single AVS, we proposed a new DOA estimation algorithm for single and multiple speech sources, namely AVS-ISDR; experiments show that it can estimate the DOAs of up to seven speech sources. To further improve its performance, we put forward four algorithms for extracting reliable high-local-SNR time-frequency points, with which AVS-ISDR delivers accurate and robust DOA estimates over a wide SNR range and under reverberant conditions.
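A minimal sketch of the multi-source extension: because speech is sparse in the time-frequency plane, each reliable TF point votes for one azimuth, and peaks in the vote histogram reveal the individual sources. The disjoint source activity used here to emulate TF sparsity and the simple thresholding and peak picking are illustrative assumptions, not the published AVS-ISDR pipeline.

```python
# Minimal sketch: multi-source DOA from per-TF-point ISDR votes.
import numpy as np
from scipy.signal import stft, find_peaks

rng = np.random.default_rng(1)
fs, thetas = 16000, np.deg2rad([40.0, 110.0, 160.0])
srcs = np.zeros((3, fs))
for k in range(3):                                # disjoint activity
    seg = slice(k * fs // 3, (k + 1) * fs // 3)   # emulates TF sparsity
    srcs[k, seg] = rng.standard_normal(fs // 3)
o = srcs.sum(0)                                   # omni pressure channel
u = (srcs * np.cos(thetas)[:, None]).sum(0)       # x-velocity channel
v = (srcs * np.sin(thetas)[:, None]).sum(0)       # y-velocity channel

_, _, O = stft(o, fs, nperseg=512)
_, _, U = stft(u, fs, nperseg=512)
_, _, V = stft(v, fs, nperseg=512)

mask = np.abs(O) > 0.3 * np.abs(O).max()          # reliable TF points only
votes = np.rad2deg(np.arctan2(np.real(V[mask] / O[mask]),
                              np.real(U[mask] / O[mask])))
hist, edges = np.histogram(votes, bins=np.arange(0, 181, 2))
peaks, _ = find_peaks(hist, height=0.3 * hist.max(), distance=5)
print(edges[peaks] + 1.0)                         # bin centers ~ 40/110/160
```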

(4) By analyzing the bispectral characteristics of the multi-channel speech signals of a single AVS and exploiting the suppression of Gaussian noise in the bispectrum domain, we proposed two DOA estimation methods based on a bispectral ISDR data model, termed AVS-BISDR and AVS-MBISDR. Numerous simulations and real-world experiments verified that both methods reduce the effects of additive white Gaussian noise and directional Gaussian interference.
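The noise-suppression property that motivates these methods can be demonstrated directly: the third-order spectrum of a Gaussian process is asymptotically zero, so a segment-averaged estimate of X(f1)X(f2)X*(f1+f2) retains non-Gaussian signal structure while Gaussian noise averages out. The generic direct estimator below illustrates this property only; it is not the AVS-BISDR/AVS-MBISDR data model.

```python
# Minimal sketch: the bispectrum of Gaussian noise averages toward zero,
# while a skewed (non-Gaussian) process keeps a nonzero third cumulant.
import numpy as np

def bispectrum(x, nfft=128):
    """Direct (segment-averaged) bispectrum estimate on an nfft grid."""
    segs = x[: len(x) // nfft * nfft].reshape(-1, nfft)
    X = np.fft.fft(segs - segs.mean(axis=1, keepdims=True), axis=1)
    f1, f2 = np.meshgrid(np.arange(nfft // 2), np.arange(nfft // 2))
    return (X[:, f1] * X[:, f2] * np.conj(X[:, (f1 + f2) % nfft])).mean(0)

rng = np.random.default_rng(2)
n = 128 * 1000
gauss = rng.standard_normal(n)
skewed = rng.exponential(1.0, n) - 1.0            # non-Gaussian, zero mean
print(np.abs(bispectrum(gauss)).mean())           # small: noise floor only
print(np.abs(bispectrum(skewed)).mean())          # several times larger
```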

(5) Based on the time-frequency sparsity of speech and machine-learning strategies, we proposed two deep-learning-based robust DOA estimation methods, termed AVS-DNN-ISDR and AVS-WISDR-DNN, which achieve accurate DOA estimation under low-SNR and strongly reverberant conditions.
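A minimal sketch of the learning-based approach: a small MLP is trained to map noisy ISDR-style features (approximate cos/sin of azimuth) to a discretized azimuth grid, which stays robust where direct inversion of noisy ratios degrades. The synthetic training data, feature choice, network size, and plain classification objective are illustrative assumptions; the published AVS-DNN-ISDR and AVS-WISDR-DNN systems differ in features, weighting, and training.

```python
# Minimal sketch: an MLP classifies noisy (cos, sin) ISDR features
# into a discretized azimuth grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

GRID = torch.arange(0.0, 180.0, 5.0)              # 36 azimuth classes

def synth_batch(batch: int = 256, noise_std: float = 0.1):
    """Draw grid-class labels; corrupt the corresponding (cos, sin)
    features with Gaussian noise to mimic low-SNR conditions."""
    cls = torch.randint(len(GRID), (batch,))
    theta = torch.deg2rad(GRID[cls])
    feats = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)
    return feats + noise_std * torch.randn_like(feats), cls

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, len(GRID)))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):                             # brief training loop
    x, y = synth_batch()
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

x, y = synth_batch(1024)
err = (GRID[model(x).argmax(1)] - GRID[y]).abs().mean()
print(f"mean absolute azimuth error: {err:.1f} deg")  # a few degrees
```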

(6) We independently designed and implemented the AVS sensor itself and developed a DOA estimation prototype system. With this hardware, we conducted extensive experiments that validate the proposed DOA algorithms more reliably than computer simulations alone. Building on these achievements, we also carried out research on key technologies for robot audition, such as speech enhancement, speaker verification, and audio event detection.

In conclusion, we successfully completed the tasks set out in the project research plan. The project's results have drawn strong interest from several companies, including Huawei, Haier, Guangzhou Shiyuan Ltd., and Shenzhen HaiAn Speech Technology Ltd., among others, and technology transfer is under way.

(III) Research Outcomes
This project has produced 5 high-quality journal papers and 18 conference papers in the field and filed 5 Chinese invention patent applications. Detailed research content and the full list of outcomes are given in the appendix.
Appendix: Academic Outcomes and Main Research Content of the Project