Abstract
Current spoken term detection techniques are dominated by deep learning, which requires large amounts of annotated training data and is therefore difficult to apply in limited-data scenarios. This paper proposes a feature-trajectory-based spoken term detection method for such scenarios. The method starts from two observations: a word is a structured organization of smaller units such as syllables or phonemes, and every language unit has stable statistical audio features. Based on the principle of physical location, the feature distribution, the temporal information, and the local distinguishing information of keywords are constructed from speech examples. Spoken keywords are then searched for using the feature trajectory information of the speech segment under test, following a hierarchical decision strategy. The method operates in an audio feature space defined by an identifier set trained on a large unlabeled speech dataset. Several experiments show that the proposed method clearly outperforms HMM and CRNN when fewer than 100 training samples are available. For example, with 10 training samples, the FRR and FAR of the proposed method are reduced in absolute terms by 20.5% and 8.7 FP/hour, respectively, compared with the HMM-based system. With more than 300 training samples, the proposed method achieves performance comparable to that of the CRNN-based system. © 2023 Chinese Institute of Electronics.
- Affiliation