Abstract

Objective Forensics-oriented digital audio technology has developed rapidly as the volume of audio recordings has grown. Digital audio recordings are commonly used as evidence in legal disputes such as civil litigation. However, the original semantic content of a recording can easily be altered using widely available digital audio editing software and its online tutorials. Consequently, audio forensics faces the challenge of deciding whether a recording is authentic or has been tampered with. A copy-move forgery distorts the original recording by copying an audio clip and pasting it elsewhere in the same recording. Unlike splicing and synthesis forgeries, both the source and the target segments of a copy-move forgery come from the same recording. Attributes such as amplitude, frequency, length, noise, tone, and even speed are therefore well matched between the forged segments and the rest of the recording, especially for very short utterances. The need for blind audio tampering detection has driven research on blind copy-move forgery detection and localization in digital audio recordings. However, most existing methods divide the recording into multiple very short segments using voice activity detection (VAD) related techniques. Even when two similar segments can be identified within the recording, the accuracy of forgery detection and localization remains limited. We propose a multi-feature decision fusion method for detecting and localizing audio copy-move forgeries. Method First, the audio recording is segmented into voiced and unvoiced parts using spectral-entropy-based VAD. Next, every voiced segment is further split into syllables, each containing a single Chinese character, according to the energy-to-spectral-entropy ratio. Then, the pitch frequency, color auto-correlogram, and short-time energy features of each syllable are extracted.
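The frame-level quantities named in the segmentation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the naive DFT, the plain ratio with a small `eps`, and the convention that a silent frame has zero entropy are all assumptions made for illustration, and the abstract does not give the exact form of the energy-to-spectral-entropy ratio.

```python
import cmath
import math


def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def spectral_entropy(frame):
    """Shannon entropy of the normalized power spectrum of one frame."""
    n = len(frame)
    power = []
    # Naive DFT over the non-negative frequency bins (fine for short frames).
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        power.append(abs(s) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0  # assumption: a silent frame is assigned zero entropy
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def energy_entropy_ratio(frame, eps=1e-12):
    """Hypothetical plain ratio; eps only guards against division by zero."""
    return short_time_energy(frame) / (spectral_entropy(frame) + eps)
```

A pure tone concentrates its power in one frequency bin and therefore has a spectral entropy close to zero, whereas a flat spectrum (for example, a single impulse) has the maximum entropy for the number of bins. This contrast is what makes spectral entropy usable for separating voiced from unvoiced frames.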
The similarity of any two syllables in pitch frequency is measured by the dynamic time warping (DTW) distance, their similarity in color auto-correlogram features by the cosine distance, and their similarity in short-time energy by the difference between their short-time energy sums. Finally, audio forgeries are localized by multi-feature decision fusion over these three similarity measures. Specifically, for any two candidate syllables, a copy-move forgery is declared and the approximate forgery locations are preliminarily determined if all three distances fall below their pre-specified thresholds. Two new syllables are then constructed by extending both forged syllables by one frame, and the three distances are recomputed for the new syllables and compared with the thresholds. If every distance is still below its threshold, the two syllables are extended by one more frame, and this repeats until one of the three distances exceeds its threshold. The positions of the two extended syllables at that point give the precise forgery locations. Result Our copy-move forgery dataset, comprising 500 authentic recordings and 500 forged recordings, is generated from a classical database. Comparative analyses show that the proposed multi-feature decision fusion method achieves precision and recall above 97%. Specifically, relative to the compared methods, detection precision improves by roughly 16 percentage points, recall by about 26 percentage points, and localization accuracy by more than 45% on average. In addition, detection precision and recall remain above 94% under common signal processing attacks such as Gaussian noise addition, low-pass filtering, down-sampling, up-sampling, and MP3 compression.
Moreover, under these attacks, detection precision improves by about 16 percentage points and recall by about 31 percentage points. Conclusion Our method not only achieves higher detection precision, recall, and localization accuracy, but is also more robust against common signal processing attacks. © 2022 Editorial and Publishing Board of JIG.
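The three similarity measures and the fusion rule described in the Method part can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the dictionary keys (`pitch`, `correlogram`, `energy`), the `thresholds` structure, the absolute-difference local cost in the DTW, and the handling of all-zero vectors are hypothetical choices. The frame-by-frame extension step is omitted because the abstract does not specify in which direction the syllables are extended.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def cosine_distance(u, v):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # assumption: an all-zero vector matches nothing
    return 1.0 - dot / (nu * nv)


def energy_difference(e1, e2):
    """Absolute difference between the short-time energy sums."""
    return abs(sum(e1) - sum(e2))


def is_copy_move_pair(s1, s2, thresholds):
    """Fusion rule: flag the pair only if all three distances are below threshold."""
    d_pitch = dtw_distance(s1["pitch"], s2["pitch"])
    d_corr = cosine_distance(s1["correlogram"], s2["correlogram"])
    d_energy = energy_difference(s1["energy"], s2["energy"])
    return (d_pitch < thresholds["pitch"]
            and d_corr < thresholds["correlogram"]
            and d_energy < thresholds["energy"])
```

Requiring all three distances to fall below their thresholds makes the decision conservative: a pair of syllables that happens to match on one feature alone, such as pitch contour, is not flagged unless the correlogram and energy features agree as well.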

Full text