Dimitri Kanevsky develops products that help people communicate with loved ones, colleagues, mobile devices, and the whole world, VC.RU reports.
Having lost his hearing, he learned to lip-read, graduated from Moscow State University, earned a PhD, moved to the USA, and now works as a researcher at Google.
For the past 40 years, Dimitri has been developing devices and technologies that help people with hearing impairments. Among his inventions are a device that lets a person “hear” through the skin and an app that converts into text the speech of people with a strong accent, a stutter, or other speech peculiarities.
The inventor described how he built a device to help with lip-reading, how he got to Google, and how he helped develop the algorithm that automatically creates captions on YouTube.
The rest is told in the first person.
I lost my hearing in childhood. But I was taught to read lips, and I went to a regular school.
I had a lot of friends, and at that time I had little difficulty communicating. Things became harder in the eighth grade, when I moved to the Second Mathematical School in Moscow. There were new classmates and complex technical subjects, and I had to learn mostly from textbooks.
Nevertheless, after school I entered Moscow State University in 1969, studied mathematics for another eight years, and earned a PhD with a dissertation on algebraic geometry.
I think mathematics made me more independent. You are one on one with the problem. You can focus on it and work through it. That suits my character.
While finishing my thesis, I met my future wife. She moved with her parents to Israel, and I decided to follow her.
I understood that in a new country I would not be able to read lips as well as in the USSR and would not be able to communicate freely with people. So I developed a device that helped with lip-reading.
The device was worn on the body and let me “hear” through the skin: it picked up sounds and converted them into vibrations. The problem was that some sounds, such as “s” and “sh”, lie at high frequencies, which makes them hard to feel through the skin. So I came up with the idea of shifting high frequencies to low ones.
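As a rough illustration of shifting high frequencies to low ones, here is a minimal Python sketch. It is not the device's actual circuit or algorithm, which the article does not describe. The helper names `shift_down` and `dominant_bin`, the block length, the sample rate, and the approach of moving Fourier bins are all assumptions made for the example.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (fine for short blocks)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft_real(spectrum):
    """Inverse DFT, keeping only the real part of each output sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def shift_down(samples, shift_bins):
    """Move every frequency component down by shift_bins DFT bins.

    Hypothetical helper: components that would land below bin 0 are dropped.
    """
    n = len(samples)
    half = n // 2
    spectrum = dft(samples)
    shifted = [0j] * n
    for k in range(shift_bins, half):
        shifted[k - shift_bins] = spectrum[k]
    # Mirror the positive bins so that the output signal is real-valued.
    for k in range(1, half):
        shifted[n - k] = shifted[k].conjugate()
    return inverse_dft_real(shifted)


def dominant_bin(samples):
    """Index of the strongest DFT bin in the lower half of the spectrum."""
    spectrum = dft(samples)
    half = len(samples) // 2
    return max(range(half), key=lambda k: abs(spectrum[k]))


# Assumed example: a 5 kHz tone sampled at 16 kHz is bin 40 of a 128-sample block.
N = 128
tone = [math.sin(2 * math.pi * 40 * t / N) for t in range(N)]
lowered = shift_down(tone, shift_bins=30)
```

In this example the strongest component moves from bin 40 (5 kHz at the assumed sample rate) to bin 10 (1.25 kHz), a range that is easier to feel as vibration.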
I managed to make the device so small that it was not noticeable under clothing.
I got permission to take the device to Israel, and it helped me speak Hebrew, which has many words with “high-frequency” sounds, such as “Shabbat” and “Shalom”.
In Israel, I showed the device to a doctor. He said it was a great thing and that we should start a company to sell it.
We called it SensorAid. Alongside the startup, I worked as a mathematician at the Weizmann Institute, earning 2,000 NIS a month. That was in 1981.
The device was then used in many countries: it was universal and worked anywhere in the world. In one hospital it was compared with a product from the company Cohler, which implanted a transmitter in a person's ear so that the person could perceive sounds.
My device showed the same results as Cohler's, but theirs cost $25,000 and required major surgery, while mine was several times cheaper and needed no surgery at all.
In 1984, the American company Spectro bought the rights to the device. I first went to work at academic institutions in Germany and the US, and then moved to IBM.
Work at IBM
First, I developed a speech recognition algorithm.
To convert speech to text, the system has to take an acoustic signal and match it to the word it represents.
To do this, the sound is represented as a sequence of numbers, which is compared with each word in the dictionary using some criterion. The word that best matches the number sequence is taken as the spoken word. The criteria were polynomials in 50 million variables, or parameters.
In the 1990s, dynamic programming methods made it possible to evaluate polynomials with 50 million parameters in linear time.
Better criteria, however, were based not on polynomials but on rational functions, that is, ratios of polynomials. For a long time nobody could find a way to compute their values for 50 million parameters in linear time. I found such a method, and when it was applied, recognition accuracy improved greatly.
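The article does not say what form the polynomial criterion took. One well-known case of a polynomial in the model parameters that dynamic programming evaluates in time linear in the length of the input is the likelihood of a hidden Markov model, computed with the forward algorithm. The sketch below assumes that setting. The toy two-state models, the word list, and the function names are hypothetical and stand in for a real dictionary and real acoustic features.

```python
def forward_likelihood(obs, start, trans, emit):
    """Likelihood of a symbol sequence under a discrete hidden Markov model.

    The result is a polynomial in the entries of start, trans and emit
    (a sum over all state paths of products of parameters), but the
    recursion below evaluates it in time linear in len(obs).
    """
    n_states = len(start)
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, word_models):
    """Return the dictionary word whose model scores the sequence highest."""
    return max(word_models, key=lambda w: forward_likelihood(obs, *word_models[w]))


# Hypothetical toy dictionary: each word has (start, trans, emit) parameters.
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
WORD_MODELS = {
    "yes": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}
```

With these toy models, `recognize([0, 0, 1, 1], WORD_MODELS)` returns `"yes"` and `recognize([1, 1, 0, 0], WORD_MODELS)` returns `"no"`. The rational-function criteria mentioned above are not covered by this sketch.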
I have always worked on technologies that help people with hearing impairments. Around that time the Internet emerged, and with its help I created the world's first services that helped people understand speech.
One example is a service that turned spoken words into written text. The client called people who could type quickly and switched on the speakerphone, and they typed out what they heard during the call.
The text appeared in real time on the client's computer screen, so he knew what was being said around him. The service cost up to $120-150 per hour.
I also worked on inventions unrelated to speech recognition.
One of these technologies was Artificial Passenger. It helped drivers stay awake at the wheel. The system observed the driver and talked with him, so that the driver, answering its questions, did not fall asleep.
Another development concerned security in banks. To confirm a client's identity, consultants usually ask for the name of his mother or wife.
I developed a system that let the bank gather more information about the client, so that staff could ask a new question each time. For example: “What is your dog's name?” or “When did you return from vacation?”
At the same time, the technology identified the caller's voice and checked whether it belonged to a customer of the bank. If everything was in order and the person answered the question correctly, the bank employee knew that the caller was not a fraudster.
Speech recognition for YouTube
In 2014 I moved to Google, where I continued to work on speech recognition.
I joined the closed captions system for YouTube, which automatically recognizes speech in a video and puts it into subtitles. At the time the technology worked poorly, and my team had to improve its algorithm.
To create acoustic models of words, we needed training data: texts and their spoken versions. The words had to be pronounced by different voices.
Previously, people were hired to listen to audio and transcribe it into text. That yielded a few thousand hours of speech examples, which is not enough for a good recognition system.
YouTube is interesting because it has a huge number of videos where both the sound and the text are already available. Many users upload videos to which they have added their own subtitles with a transcript. This was done partly because search ranked videos with subtitles higher.
I had the idea of using hundreds of thousands of hours of ready-made user data to train the algorithms. The only problem was that people often not only made errors in the text but simply filled the subtitles with a random set of letters to rank higher in search. We had to add filters that distinguished good data from bad.
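The article does not describe how the filters worked. One plausible approach, shown here purely as an assumption, is to compare each user-supplied caption with the output of an existing recognizer and keep only captions that mostly agree with it. The function names, the sample strings, and the `max_wer` threshold of 0.5 are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Standard Levenshtein dynamic programming, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # word missing from hypothesis
                cur[j - 1] + 1,                        # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def keep_caption(user_caption, recognizer_output, max_wer=0.5):
    """Keep a caption only if it roughly agrees with what a recognizer heard."""
    return word_error_rate(user_caption, recognizer_output) <= max_wer


good = keep_caption("hello world this is a test", "hello word this is a test")
spam = keep_caption("asdf qwer zxcv", "welcome to my channel")
```

Here `good` is `True` (one wrong word out of six) and `spam` is `False`, so a caption made of random letters is rejected while a caption with a small mistake is kept.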
In the end, we finished development in 2016, and closed captions became much better at recognizing speech. What users see now when they turn on automatic subtitles is the result of this work.
Projects for people with disabilities
In 2017 I moved from New York to Google's California office.
Here, within six months, my team and I created Live Transcribe, which uses the same speech-to-text technology as YouTube, but in a separate app. With it, people with hearing impairments can read what is being said to them.
The system also recognizes and writes out other sounds: a dog barking, a baby crying, a guitar, a knock on the door, laughter, and so on. This part of the audio information is processed on the phone itself, while the transcription of speech works over the Internet.
One of the main creators of this app is Chet Gnegy. Google employees often develop projects to solve their colleagues' problems. Gnegy saw how I used services in which people typed out for me the speech they heard, and he decided to help.
He created the first prototype of the application. It helped us work together and eventually grew into a separate Google project called Live Transcribe.
Another project I take part in is Euphonia. This app is for people with non-standard speech: people who have ALS, deaf people, people who stutter, and people who have had a stroke.
For this project we need many examples of non-standard speech. This time they cannot be found even on YouTube. Such speech is very individual, so we need a different approach to data collection.
I dictated the first 25 hours of recordings. I wrote out in advance the talks I planned to give and then recorded them as audio. That is how I trained the system. I could then speak, and the audience saw the text of my talks.
With each new talk, the system understood me better and recognized even new phrases. Now I no longer have to write my talks in advance: the algorithm converts everything I say into text.
That made it clear that the approach works, and we began inviting people with non-standard speech to read and record texts as well.
With people who have ALS, we started by giving them typical phrases that they say to interact with, for example, Google Home. They need to repeat 100 sentences to train the system for themselves. It is hard for such people to talk, and they tire quickly, so we cannot expect a large number of recordings from them.
Gradually, however, we began to add speech samples from different people with this disease in order to create a universal system in the future. It is a slow process because there is too little data, and Euphonia is still a research project, not a finished product.
Unlike Live Transcribe, Euphonia does not require an Internet connection. A smartphone has little computing power, which makes audio decoding difficult. However, the team managed to cope with it.
Many people worry about their data being processed over the Internet. If a user goes to a doctor, both of them may worry that their conversation will be sent to remote servers. That does not happen, because Euphonia does not need a network connection.
Now we give out a link where people can register and leave samples of their speech. In some cases, Google tries to build an individual speech recognizer for them free of charge.
I am also working on a sign language recognition project. Here we work with visual information. This task is harder than speech recognition, and development is at an early stage.
In sign language, a single gesture may mean not just a separate letter but a whole phrase. Once again, we need to find a huge number of examples. For this project we cooperate with Gallaudet University, the only higher education institution in the United States for deaf and hard-of-hearing people.
In addition, I have returned to the idea of a device that shifts high frequencies to low ones. My colleagues are working on a new, more modern version that can convey more information.