Chen's job was to cart wagons full of coal and ash to and from the factory's furnace. He kept his mind nimble by listening to his coworkers speak. At night, in the workers' dormitory, he compiled a sort of linguistic ethnography for the Beijing dialect. He finished the book around 1960. Soon after, Communist Party apparatchiks confiscated it.
His fortunes improved after Mao's death, when party leaders realized that China's economy needed intellectuals in order to develop. Chen went back to school, and in 1979, at the age of 42, his test scores earned him a spot in the first group of graduate students to go abroad in decades. He moved to the US and earned a PhD in physics at Columbia University. At the time, America offered more opportunity than China, and like many of his peers, Chen stayed after graduation, getting a job with IBM working on physical science research. IBM had developed some of the world's first speech recognition software, which allowed professionals to haltingly dictate messages without touching a keyboard, and in 1994 the company started looking for someone to adapt it to Mandarin. It wasn't Chen's area, but he eagerly volunteered.
Right away, Chen realized that in China speech recognition software could offer far more than a dictation tool for office workers; he believed it stood to completely transform communication in his native tongue. As a written language in the computer age, Chinese had long posed a unique challenge: There was no obvious way to input its 50,000-plus characters on a QWERTY keyboard. By the 1980s, as the first personal computers arrived in China, programmers had come up with several workarounds. The most common method used pinyin, the system of romanized spelling for Mandarin that Chinese students learn in school. Using this approach, to write cat you would type “m-a-o,” then choose 猫 from a drop-down menu that also included characters meaning “trade” and “hat,” and the surname of Mao Zedong. Because Mandarin has so many homophones, typing became an inefficient exercise in word selection.
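The homophone problem described above can be made concrete with a toy sketch. This is only an illustration, not how any real input method is built: the `CANDIDATES` table and the `candidates_for` and `pick` helpers are hypothetical names, and the table holds just the four "mao" characters the passage mentions.

```python
# Hypothetical toy candidate table: toneless pinyin -> (character, gloss).
# A real input method draws on a far larger dictionary and ranks candidates.
CANDIDATES = {
    "mao": [
        ("猫", "cat"),
        ("贸", "trade"),
        ("帽", "hat"),
        ("毛", "Mao, the surname"),
    ],
}


def candidates_for(pinyin: str) -> list[tuple[str, str]]:
    """Return every character the typist must choose among for this spelling."""
    key = pinyin.lower().replace("-", "")
    return CANDIDATES.get(key, [])


def pick(pinyin: str, index: int) -> str:
    """Simulate selecting the index-th entry from the drop-down menu."""
    return candidates_for(pinyin)[index][0]


if __name__ == "__main__":
    # Typing "m-a-o" yields a menu; the user still has to pick the right entry.
    for position, (char, gloss) in enumerate(candidates_for("m-a-o"), start=1):
        print(position, char, gloss)
```

Even in this tiny table, one spelling maps to four unrelated characters, so every word costs the typist a keystroke sequence plus a selection step.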
To build his dictation engine, Chen broke Mandarin down into its smallest elements, called phonemes. Then he recruited 54 Chinese speakers living in New York and recorded them reading articles from People's Daily. IBM's research lab in Beijing added samples from an additional 300 speakers. In October 1996, after he had tested the system, Chen flew to China to display the resulting software, called ViaVoice, at a speech technology conference.
In a packed room festooned with gaudy wallpaper, Chen read aloud from that day's newspaper. In front of him, with a brief delay, his words appeared on a large screen. After he finished, he looked around to see people staring at him, mouths agape. A researcher raised her hand and said she wanted to give it a go. He handed over the microphone, and a murmur ran through the crowd. ViaVoice understood her too.
ViaVoice debuted in China in 1997 with a box that read, “The computer understands Mandarin! With your hands free, your thoughts will come alive.” That same year, President Jiang Zemin sat for a demonstration. Soon PC makers across China—including IBM's rivals—were preinstalling the software on their devices. The era of freely conversing with a computer was still a long way off, and ViaVoice had its limitations, but the software eased the headache of text entry in Chinese, and it caught on among China's professional class. “It was the only game in town,” Chen recalls.
But for some scholars who had stayed in China, it stung that a researcher working for an American company had been the one to make a first step toward conquering the Chinese language. China, they felt, needed to match what Chen had done.
AMONG THOSE MOTIVATED by IBM's triumph was Liu Qingfeng, a 26-year-old PhD student in a speech recognition lab at the prestigious University of Science and Technology of China, in Hefei. In 1999, while still at USTC, Liu started a voice computing company called iFlytek. The goal, it seemed, was not just to compete with IBM and other foreign firms but to create products that would recoup Chinese pride. Early on, Liu and his colleagues worked out of the USTC campus. Later they moved elsewhere in Hefei. It was a second-tier city—USTC had been relocated there during the Cultural Revolution—but staying in Hefei meant iFlytek was close to the university's intellectual talent.
When Liu explained his business concept to Kai-Fu Lee, then the head of Microsoft Research Asia, Lee warned that it would be impossible to catch up with American speech recognition giants. In the US, the industry was led by several formidable companies in addition to IBM and Microsoft, including BellSouth, Dragon, and Nuance Communications, which had recently spun off from the nonprofit research lab SRI International. These companies were locked in a slog to overcome the limitations of early-2000s computing and build a voice-computer interface that didn't exasperate users, but they were far ahead of Chinese competitors.
Liu didn't listen to Lee's warnings. Even if voice-interface technology was a crowded, unglamorous niche, Liu's ambition gave it a towering moral urgency. “Voice is the foundation of culture and the symbol of a nation,” he later said, recounting iFlytek's origin story. “Many people thought that they”—meaning foreign companies—“had us by the throat.” When some members of his team suggested that the company diversify by getting into real estate, Liu was resolute: Anyone who didn't believe in voice computing could leave. Nuance was building a healthy business helping corporate clients begin to automate their call centers, replacing human switchboard operators with voice-activated phone menus (“To make a payment, say ‘payment’”). iFlytek got off the ground by doing the same sort of work for the telecommunications company Huawei.
iFlytek went public in 2008 and launched a major consumer product, the app iFlytek Input, in 2010. That same year, Apple's iPhone began to carry Siri, which had been developed by SRI International and acquired by Apple. But while Siri was a “personal assistant”—a talking digital concierge that could answer questions—iFlytek Input was far more focused. It allowed people to dictate text anywhere on their phones: in an email, in a web search, or on WeChat, the super app that dominates both work and play in China.
Like any technology trained on interactions with human speech, Input was imprecise in the beginning. “With the first version of that product, the user experience was not that good,” said Jun Du, a scientist at USTC who oversaw technical development of the app. But as data from actual users' interactions with the app began to pour in, Input's accuracy at speech-to-text transcription improved dramatically.
As it happened, Siri and Input were relatively early arrivals in a coming onslaught of mature voice-interface technologies. First came Microsoft's Cortana, then Amazon's Alexa, and then Google Assistant. But while iFlytek launched its first generation of virtual assistant, Yudian, in 2012, the company was soon training much of its AI firepower on a different challenge: providing real-time translation to help users understand speakers of other dialects and languages. Later versions of Input allowed people to translate their face-to-face conversations and get closed captioning of phone calls in 23 Chinese dialects and four foreign languages. When combined with China's large population, the emphasis on translation has allowed the company to collect massive amounts of data.
Americans might tap Alexa or Google Assistant for specific requests, but in China people often use Input to navigate entire conversations. iFlytek Input's data privacy agreement allows it to collect and use personal information for “national security and national defense security,” without users' consent. “In the West, there are user privacy problems,” Du says. “But in China, we sign some contract with the users, and we can use their data.” Voice data can be leaky in China. The broker Data Tang, for example, describes specific data sets on its website, including one with nearly 100,000 speech samples from 3- to 5-year-old children.
In 2017, MIT Technology Review named iFlytek to its list of the world's 50 smartest companies, and the Chinese government gave it a coveted spot on its hand-picked national “AI team.” The other companies selected that year were platform giants Baidu, Alibaba, and Tencent. Soon after, iFlytek signed a five-year cooperation agreement with MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), a leading AI lab. The company's translation technology is used by the Spanish football club RCD Espanyol, and it signed an exclusive deal to provide automated translation for the 2022 Beijing Winter Olympics. As of mid-April, iFlytek was valued on the Shenzhen Stock Exchange at $10.8 billion, and it claims to have 70 percent of the Chinese voice market, with 700 million end users. Nuance was valued at $5.3 billion during the same time. In China, the company's other major competitors in voice computing are mainly platforms like Alibaba and Baidu.
Two decades after Julian Chen intuited that voice computing would revolutionize how people interact with computers in China, its impact there is indeed dramatic. Every day, WeChat users send around 6 billion voice texts, casual spoken messages that are more intimate and immediate than the typical voicemail, according to 2017 figures. Because WeChat caps the messages at one minute, people often dash them off in one long string. iFlytek makes a tablet that automatically transcribes business meetings, a digital recorder that generates instantaneous transcripts, and a voice assistant that is installed in cars across the country.
Consumer products are important to iFlytek, but about 60 percent of its profits come from what is described in the company's 2019 semiannual report as “projects involving government subsidies.” These include an “intelligent criminal investigation assistant system,” as well as big data support for the Shanghai city government. Such projects bring access to data. “That might be everything that's recorded in a court proceeding, call center data, a bunch of security stuff,” says Jeffrey Ding, a scholar at Oxford University's Future of Humanity Institute who studies AI governance in China. Liu, iFlytek's founder and CEO, is a delegate to the National People's Congress, China's rubber-stamp parliament. “He has a very good relationship with the government,” Du says.
Liu has a vision that voice computing will someday penetrate every sphere of society. He recently told an interviewer for an online state media video channel: “It will be everywhere, as common as water and electricity.” That's a dream that aligns neatly with the Chinese Communist Party's vision for a surveillance state.
ONE DAY THIS past fall, I tested out a recent model of the Translator, an instant translation device made by iFlytek, with a man I'll call Al Cheng. The Translator, a device powered by a Qualcomm Snapdragon chip, works offline for major world languages. Cheng and his wife live in a congested city in southern China, but every other year they travel to the Midwest to visit family. To get exercise, they walk half a mile each morning to the mall. But Cheng, who likes to hold forth on art and culture in Mandarin, Cantonese, and Hakka, does not speak any English. Much of the time while in the US, he is unhappily silent. He is exactly the sort of person who needs a Translator.