If you are an internet celebrity, actor, public figure, or company founder, or you have left a large amount of public audio and video on short-video sites, your recordings can easily be harvested by criminals to synthesize speech in your voice, a technique known in the industry as Deepfake.
In March 2019, an executive of a UK energy company received an urgent call from the CEO of its German parent company, asking him to remit funds to a Hungarian supplier. The caller said the request was urgent and asked that 220,000 euros (about 1.73 million RMB) be paid within one hour. The British executive noticed nothing wrong at first: throughout the call, the voice reproduced the CEO's slight German accent convincingly, and he had no doubts until he was asked to transfer money again. The fraudsters called three times in all. After the first transfer of 220,000 euros, they called to claim that the parent company had reimbursed the British company; later that day they held a third call, again impersonating the CEO and requesting a second transfer. Because the third call came from an Austrian number, the British company's administrative staff grew suspicious and transferred no more money. Investigation showed that the 220,000 euros had never gone to the supposed Hungarian supplier but had been routed onward to countries such as Mexico. The police found that the fraudster had used artificial-intelligence speech-synthesis software to imitate the voice of the German parent company's CEO, but the perpetrator behind the scheme was never identified.
Inspired by Aviv Ovadya, former chief technologist of the Center for Social Media Responsibility at the University of Michigan, a technology reporter ran an experiment: he imitated his own voice with AI synthesis software and then called his mother. Who in the world knows your voice best? Surely your mother. Yet the unsettling result was that his mother could not hear any difference at all.
Lyrebird, co-founded by three PhD students from the University of Montreal, developed a "speech synthesis" technology. Given just one minute of high-quality recording of the target's voice, Lyrebird processes it into a special key that can then be used to generate whatever you want the target to say. Lyrebird's voice-imitation algorithm can not only mimic anyone's voice but also add "emotional" elements to the speech to make it sound more real.
Even those of us who are not celebrities have left thousands of historical voice messages on mobile social platforms. Normally the voice messages in such an app cannot be forwarded, but "enhancement" tools circulating online can save and forward the voice files of in-app conversations. A criminal who steals a friend's account and obtains their voice messages can then easily synthesize the familiar-sounding voice of a family member or friend.
Know yourself and know your enemy: the main means of voice fraud and attack
To know the enemy, one must first study the common means of voice attack and fraud in depth. At present there are three common means: text-to-speech (TTS), voice conversion (VC), and replay. In ASVspoof, the field's top international challenge, the speech-synthesis and voice-conversion scenario is called LA (Logical Access), and the recording-and-replay scenario is called PA (Physical Access).
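The grouping above can be captured in a small lookup table. This is purely an illustrative sketch of the taxonomy, not part of any ASVspoof toolkit:

```python
# How ASVspoof groups the three attack means: synthetic or converted speech
# counts as Logical Access (LA), while playback of a recording at the
# microphone counts as Physical Access (PA).
ATTACK_SCENARIO = {
    "TTS": "LA",     # text-to-speech: fully synthetic speech
    "VC": "LA",      # voice conversion: one speaker's voice mapped to another's
    "replay": "PA",  # recorded genuine speech played back to the system
}

def scenario(attack: str) -> str:
    """Return the ASVspoof scenario label for a given attack means."""
    return ATTACK_SCENARIO[attack]
```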
The working principle of speech synthesis and voice conversion is shown in Figure 1. Speech generated with neural-network-based waveform modeling techniques such as WaveNet is now very close to real speech, and the best system in the Voice Conversion Challenge 2018 greatly improved the naturalness and similarity of converted voices.
Figure 1: Working principle of speech synthesis and voice conversion
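The two-stage structure that Figure 1 depicts can be sketched as follows: an acoustic model maps text (or, for voice conversion, a source speaker's features) to acoustic features, and a neural waveform model such as a WaveNet-style vocoder renders those features into audio. The functions below are stand-in stubs with assumed shapes, purely to show the data flow, not a real synthesizer:

```python
import numpy as np

def acoustic_model(text, n_frames_per_char=5, n_mels=80):
    """Stub acoustic model: map input text to a mel-spectrogram-like
    feature matrix of shape (frames, mel bins)."""
    rng = np.random.default_rng(len(text))  # deterministic placeholder output
    return rng.standard_normal((len(text) * n_frames_per_char, n_mels))

def neural_vocoder(features, hop_length=256):
    """Stub vocoder: render acoustic features into a waveform, emitting
    one hop of audio samples per feature frame (silence as a placeholder)."""
    n_samples = features.shape[0] * hop_length
    return np.zeros(n_samples, dtype=np.float32)

# Data flow: text -> acoustic features -> waveform.
features = acoustic_model("hello")
waveform = neural_vocoder(features)
```

A real system replaces both stubs with trained neural networks; the interface between them (a frame-level feature matrix) is the part the sketch preserves.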
The ASVspoof Challenge is a world-class competition of recent years devoted to studying voice attacks and trying to solve this problem. Its goal is to design effective anti-spoofing security systems that can accurately detect fake voices generated by the latest algorithms, by different algorithms, and even by previously unseen algorithms. To date it has been held three times, as ASVspoof 2015, ASVspoof 2017, and ASVspoof 2019, with many top research institutions and well-known companies taking part.

The training, development, and evaluation data sets provided by the ASVspoof 2019 organizers enumerate the latest attack algorithms and means in the industry, including 10 mainstream TTS algorithms, 4 mainstream VC algorithms, and 3 TTS-VC fusion algorithms; the algorithms and results are shown in Figure 2. As can be seen, the latest algorithms mainly use neural waveform models and waveform filtering, or are variants of these techniques. The newest TTS/VC algorithms also borrow core technical points from speaker recognition. Such algorithms can be built on toolkits such as Merlin, CURRENNT, and MaryTTS.

Some other important details can be observed as well. A key index for evaluating an automatic speaker verification (ASV) system is the equal error rate (EER): the lower the EER, the better the recognition performance. With no fake-voice attack, the ASV system's EER is only 2.48%, but under attack by fake voices synthesized with TTS and VC, performance drops rapidly; as Figure 2 shows, the EER can rise as high as 64.78%. This demonstrates how strongly attack voices affect speaker recognition, voiceprint recognition, and other voice systems, and how important security countermeasures are for detecting forgeries and resisting attacks.
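The EER mentioned above can be computed directly from a system's scores on genuine and spoofed trials. The minimal sketch below (function name and score convention are illustrative assumptions; a higher score means "more likely genuine") sweeps every observed score as a decision threshold and returns the operating point where the false acceptance and false rejection rates are closest:

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: the operating point where the false acceptance
    rate (spoofs accepted as genuine) equals the false rejection rate
    (genuine trials rejected as spoofs)."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)   # spoofs wrongly accepted
        frr = np.mean(genuine_scores < t)  # genuine wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is 0; as spoofed scores overlap the genuine ones, the EER climbs toward the 50%-and-above figures reported for attacked systems.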