Speech generation: siblings collaborate on machine learning hackathon project
The first recording that brothers Sam and Joe Eklund, along with their colleague Travis Chambers, played for the audience was a validation.
“I endorse Travis as president of the United States of America,” the audio clip played, in a voice resembling Barack Obama’s.
The second, in the same voice, was a declaration: “Ice is back, our brand new invention” (from the song “Ice Ice Baby” by Vanilla Ice). Ice, however, was pronounced closer to iss—a likely giveaway the two lines were fake.
The presentation, delivered at Lawrence Livermore’s High Performance Computing Innovation Center, was part of Computation’s spring hackathon. The hackathons, held three times a year, are 24-hour sessions in which participants work on new projects of their choice and then present their results to the other teams. The Eklunds and Chambers had worked together on the first installment of the project at the previous hackathon in November 2018. For the spring event, they and new team members Alan Deng and Nick Wong created the sound bites using the Deep Voice 3 speech synthesis system, which can convert written text into human-like speech.
The playfulness of the text belies the more practical ways the technology could be applied, as well as the reasons the group chose this project for the last two hackathons. Joe, a software engineer, noted that he wanted to gain more hands-on experience with machine learning software at the events.
“In the future, machine learning can be applied to a lot of different things,” said Joe, who had taken classes on the subject and on artificial intelligence in college. “I wanted to take a little bit of time to refresh myself on more of a concrete example of how it could be used. If there is a way that we could apply machine learning to some of the projects that we’re working on, that would be interesting.”
The Eklunds and Chambers had considered other projects leading up to the previous hackathon, including developing a bot to play the online strategy game Generals.io, but they were more interested in gaining experience with Deep Voice 3; they wanted to continue working on the program for the spring event and make improvements to their project.
“We’re all interested in machine learning, but we don’t have a huge amount of experience with it,” Sam, who serves as a data manager, said. “So it’s nice to get some background on it.”
Deep Voice 3
Deep Voice 3 is produced by the Chinese technology company Baidu and was released in late 2017. Employing a neural network architecture, it can clone human voices with just seconds of audio. Users can even train the system to have a particular accent and speak the text they choose. The program can reportedly learn 2,400 voices. Deep Voice 3 differs from the previous two versions in that it trains much faster and produces more realistic-sounding voices.
At the fall hackathon, the team used a Deep Voice 3 model that was pretrained on a data set of a near-monotone voice. They wanted to apply this model to another person and picked President Obama because of the amount of audio available of him speaking and because of the quality of his speeches. They downloaded over 2,000 Obama audio clips from YouTube that were 1–10 seconds long. Joe found the videos and then handed them off to Sam, who turned them into data in a format that Deep Voice 3 could process. Joe wrote the code to train the model and, with about 5 hours of audio in the new data set, trained Deep Voice 3 on it on his home computer overnight between the first and second hackathon days.
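The data preparation step described above might look something like the following minimal sketch. It is not the team’s code: the Clip record, the clip names, the placeholder transcripts, and the pipe-delimited “id|transcript” manifest layout are all assumptions made for illustration, and the format Deep Voice 3 actually expects may differ. The sketch keeps only clips in the 1–10 second range and totals the audio that remains.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str       # hypothetical identifier, e.g. a file name without extension
    duration_s: float  # clip length in seconds
    text: str          # transcript of what is spoken in the clip


MIN_S, MAX_S = 1.0, 10.0  # the 1-10 second window mentioned in the article


def select_clips(clips):
    """Keep clips whose duration is in range and whose transcript is non-empty."""
    return [c for c in clips if MIN_S <= c.duration_s <= MAX_S and c.text.strip()]


def manifest_lines(clips):
    """Format one 'id|transcript' line per clip (assumed layout), collapsing extra spaces."""
    return [f"{c.clip_id}|{' '.join(c.text.split())}" for c in clips]


def total_hours(clips):
    """Total audio duration of the given clips, in hours."""
    return sum(c.duration_s for c in clips) / 3600.0


# Placeholder data, not taken from the real data set.
clips = [
    Clip("clip_0001", 4.2, "first placeholder sentence"),
    Clip("clip_0002", 0.4, "uh"),                          # too short, dropped
    Clip("clip_0003", 12.5, "a passage that runs long"),   # too long, dropped
    Clip("clip_0004", 9.0, "second   placeholder sentence"),
]
kept = select_clips(clips)
for line in manifest_lines(kept):
    print(line)
print(f"{total_hours(kept):.4f} hours of audio kept")
```

Filtering on duration before training is a design choice in this sketch; very short clips carry little speech and very long ones are harder to pair cleanly with their text.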
For this most recent hackathon, the team increased the size and diversity of the data set and ran the model longer, ending up with about two to three times more audio and more words/sounds that had not been previously included. They also used the program Gentle, which takes media files (in this case, the YouTube clips) and their transcripts (the text of the speeches in the YouTube clips) and returns precisely timed data for each word. For the project, Gentle broke down the words into individual phonemes (units of sound in language), which, according to Sam, provided the cleanest possible data set to feed into the model and helped avoid “muddiness” (extra noise or garbled speech) in the resulting audio.
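To make the idea of “precisely timed data” concrete, here is a minimal sketch of turning word-level alignment output into per-phoneme time intervals. The dictionary shape used here (a “words” list whose entries carry “case”, “start”, and a “phones” list of “phone” and “duration” values) is an assumption modeled on the JSON that Gentle is understood to emit; the sample values are invented, and this is not the team’s pipeline.

```python
def phoneme_intervals(alignment):
    """Return (phone, start, end) tuples for every successfully aligned word.

    Phoneme start times are not given directly in the assumed input, so they
    are accumulated from the word's start time plus the phone durations.
    """
    intervals = []
    for word in alignment.get("words", []):
        if word.get("case") != "success":
            continue  # skip words the aligner could not place in the audio
        t = word["start"]
        for ph in word.get("phones", []):
            end = t + ph["duration"]
            intervals.append((ph["phone"], round(t, 3), round(end, 3)))
            t = end
    return intervals


# Invented sample in the assumed shape; times are in seconds.
sample = {
    "words": [
        {"word": "ice", "case": "success", "start": 0.50, "end": 0.80,
         "phones": [{"phone": "ay_B", "duration": 0.18},
                    {"phone": "s_E", "duration": 0.12}]},
        {"word": "is", "case": "not-found-in-audio"},
    ]
}

for phone, start, end in phoneme_intervals(sample):
    print(f"{phone}: {start:.2f}-{end:.2f} s")
```

Dropping words that failed to align, rather than guessing their timing, is one way to keep garbled segments out of the training data.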
They added a new feature at the spring hackathon—a web interface to the trained model. A user could enter text into a field, then the site would play an audio clip in Obama’s voice after processing for about a minute. Chambers, a software engineer, headed the process, with assistance from Deng and Wong, who are also software engineers.
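A web interface of this kind can be sketched with nothing but the Python standard library. Everything below is hypothetical: the synthesize function is a stand-in that returns silence instead of calling a trained model, and the “text” form field, the sample rate, and the port are assumptions. The article does not say how the team built its site.

```python
import io
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

SAMPLE_RATE = 22050  # assumed output sample rate, in Hz


def synthesize(text):
    """Hypothetical stand-in for the trained model: returns WAV bytes.

    It emits silence whose length grows with the text, so the sketch runs
    without any model installed.
    """
    n_frames = int(SAMPLE_RATE * 0.05 * max(len(text), 1))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def respond(form_body):
    """Turn a URL-encoded form body into (status, content_type, payload)."""
    text = parse_qs(form_body).get("text", [""])[0].strip()
    if not text:
        return 400, "text/plain", b"missing 'text' field"
    return 200, "audio/wav", synthesize(text)


class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the submitted form, synthesize audio, and send it back.
        length = int(self.headers.get("Content-Length", 0))
        status, ctype, payload = respond(self.rfile.read(length).decode("utf-8"))
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8000):
    """Build (but do not start) the server; call .serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), SpeechHandler)
```

Keeping the request logic in a plain function (respond) separate from the HTTP handler makes it possible to exercise the text-to-audio path without opening a network port.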
Family History in Livermore
Joe’s speech cadence is quick and direct, and Sam’s voice is lower and more languid.
They grew up in Manteca, about 35 miles east of the Lab. Joe joined the Lab in the summer of 2016, and Sam started in January 2018. Since then, they’ve been working together on Computation’s Data Lifecycle Management (DLM) team—Joe at 50% on this team and Sam at 100%—and collaborate most weeks.
On the DLM team, Joe serves as a software developer, having a hand in the design and implementation of software, and as scrum master, working on meta-related issues to improve the software development process for the team. He spends the other 50% working on the Siboka workflow team.
Sam serves as the data manager on the DLM team and is responsible for, among other things, communicating with customers about their data. The group works on a data archive application, so he talks with customers about how their data is formatted as well as how they might want to store and organize it. He is also involved with development of the archive, primarily on deployment of the system.
Sibling rivalry, the brothers explained, is not an issue. “We worked all that stuff out when we were younger,” Sam said.
The Eklunds’ grandfather worked at Lawrence Livermore in the 1960s, which is what brought their family to the city. And their father worked at Sandia National Laboratories in Livermore in the 1980s and ’90s as an engineer.
“I think his working at Sandia had an influence on me going into science in general,” Joe said. “And I knew, based on his experience, that working for a government lab was something that was kind of cool to do.”
Joe and Sam, four and a half years apart, have two other siblings. “We’ve got another brother in between us in age,” Sam said. “And we have an older sister who’s the boss.”
Implications of Speech Synthesis
Computer-generated speech has several everyday applications. It could help individuals with visual impairments receive information and could assist individuals with speech impairments in communication.
In contrast, this and other similar technologies could also be used to spread misinformation—for example, by being leveraged to create videos of events that never occurred or speeches that were never given.
Joe pointed to NVIDIA’s Deepfakes technology, which can take two authentic images and create a composite, artificial, and incredibly realistic image from them.
“They can almost paste a face onto somebody else and make movements,” Joe explained. “So if you have an audio clip and you can make a fake video clip, you can make a complete fake video with audio of somebody saying something they never actually said.”
He noted that the ethics of such technologies’ use is a concern. “I think it’s very important that we understand how this media is created,” Joe said. “To me, the next step would be to help design software that can look at these video clips and see if there’s been alterations and see if they are machine generated.”
Sam recalled when Photoshop was becoming popular and there was a perception that no one could know what images were real. Expert users could create fake images but could also detect when images were doctored. “If you don’t have the right kind of people figuring out how all of that works,” he said, “then stuff will go out there and nobody will know if it’s fake.”
Sam also pointed to the impact speech synthesis can have on one’s identity and noted that an applied use of such technology could be constructing a voice “so that people who don’t have the ability to speak might be able to create their own voice that sounds how they want it to sound. They can really reclaim a bit of their identity that they may have lost, or that they never had, that allows them to communicate.”