Amazon Alexa’s crash under the Christmas Day overload is evidence that virtual assistants keep growing in popularity.
We want our assistant to be compact and portable.
As its brain we will use a Raspberry Pi 3 B, a small, affordable micro-computer. If you are not familiar with the Raspberry Pi, you can read about it on the Raspberry Pi Foundation’s official website, www.raspberrypi.org.
As its ears we will use an external USB microphone (unless we have an external USB sound card). The Raspberry Pi has no audio input out of the box; its 3.5 mm jack port is for output only. It does have several USB ports, though, so a plugged-in USB microphone usually works without any further configuration. You can also use a USB webcam with a built-in microphone.
Last but not least is an external speaker, which will work as its mouth. We can connect it to the Raspberry Pi via USB or, as mentioned above, the 3.5 mm jack port.
Let’s plan how we want to be using our device.
First of all, we want to be able to wake up the device to start issuing commands. This is called voice activation. We want the device to listen to its surroundings constantly, waiting to be woken up. Let’s wake it by saying “Hey Teddy”. The device then confirms with a gentle sound that it’s ready for our commands.
We can talk to Teddy and hope that he understands us. This step is called speech recognition. It differs from the previous step: instead of spotting one known phrase, Teddy has to turn whatever we say into text. Once he knows what we said, he can analyze the request. If we ask what the weather is like, he should respond out loud. To achieve this, we will use a service called text to speech.
We asked a question and got the answer. We are happy and so is Teddy. He goes to sleep now, but keeps listening with one ear so that he can be activated again and help us willingly! Now, let’s learn how to complete each step to get Teddy ready to use.
It would be impossible for Teddy to understand everything we say all the time, so in order to activate him we have to use a hotword. In our case, it will be “Hey Teddy”.
The device will be constantly listening out for voice patterns. Once it detects the hotword, it will get activated and be ready for our requests. We can specify multiple activation hotwords, but it’s not advisable to make the number too high: the more hotwords we specify, the more processing power detection will require.
To set up the voice activation, we will use a service called Snowboy, a hotword detection engine. It comes with great documentation that guides you step by step through using it. You can even record and test your hotword on their website.
Example Snowboy library usage:
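    import snowboydecoder

    # model: path to your hotword model file
    # callback_method: function to call when the hotword is detected
    detector = snowboydecoder.HotwordDetector(model, sensitivity=0.5)
    detector.start(detected_callback=callback_method)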
Remember that the device will be listening out for a voice pattern that was recorded by you and, as a result, it might not work when the hotword is said by somebody else. To solve this problem, and to get better pattern recognition, you can use one of the general models that were already trained by hundreds of people. These include hotwords such as Jarvis or Alexa.
Speech recognition and the logic
Once we’ve successfully activated the device we can begin issuing commands.
In order to recognize an issued phrase, it first needs to be recorded. Then, we need to send the voice sample to an external service that will recognize our phrase and return it as text.
There is an existing Python library called SpeechRecognition that can do all of this for us. It first listens to background noise, so it knows when we are silent, and then starts recording our voice. The recording stops automatically when we are finished. We can now have the recording recognized by one of the supported services, such as Sphinx, Google Speech Recognition, Microsoft Bing Voice Recognition or IBM Speech to Text. Once that is finished, the recognized phrase is returned in written form.
Example SpeechRecognition library usage:
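    import speech_recognition as sr

    r = sr.Recognizer()
    m = sr.Microphone()

    # Sample the background noise so the recognizer knows what silence sounds like
    with m as source:
        r.adjust_for_ambient_noise(source)

    # Record until the speaker stops talking
    with m as source:
        audio = r.listen(source)

    # Send the recording to Google Speech Recognition (Polish in this example)
    word = r.recognize_google(audio, language="pl-PL")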
The device now knows the exact command. What can it do now? Well, anything we want! But it’s our responsibility to create the logic. It won’t be as intelligent as Google Assistant out of the box. We need to parse the request and perform a certain action based on it. As we are using a Raspberry Pi, we have the possibility to control various electronic devices. The sky is the limit!
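One simple way to build that logic is a keyword lookup: split the recognized phrase into words and run the first handler whose keyword appears. The sketch below is only an illustration. The handler names, the keywords and the `read_temperature` stub (which stands in for a real sensor reading) are all assumptions, not part of any library used in this post.

```python
from typing import Callable, Dict, Optional


def read_temperature() -> float:
    """Hypothetical stand-in for a sensor wired to the Raspberry Pi."""
    return 5.0


def temperature_reply() -> str:
    return f"The temperature outside is {read_temperature():.0f} degrees Celsius"


def greeting_reply() -> str:
    return "Hello, I am Teddy"


# Keyword -> handler. Handlers are checked in insertion order,
# so the first keyword found in the phrase wins.
HANDLERS: Dict[str, Callable[[], str]] = {
    "temperature": temperature_reply,
    "hello": greeting_reply,
}


def handle_command(phrase: str) -> Optional[str]:
    """Return the text Teddy should say, or None if no keyword matched."""
    words = phrase.lower().split()
    for keyword, handler in HANDLERS.items():
        if keyword in words:
            return handler()
    return None
```

Calling `handle_command` with the text returned by the speech recognition step gives us the sentence to say out loud, or `None` when Teddy does not know the command. Matching whole words keeps the example short; a real assistant would probably need something more forgiving.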
Text to speech
Let’s say we asked Teddy what the temperature is outside (assuming there is a temperature sensor connected to the Raspberry Pi), and we want the result to be said out loud. The logic composed the text result “The temperature outside is 5 degrees Celsius”. In order to hear the answer, we need to reverse the process from the previous step. We have to send the text to an external service that will return the phrase as audio. To achieve this, we will use Google Translate text-to-speech, which lets us specify the language of the spoken text. Once it returns the result, we can save the voice sample to a file and play it through the speaker.
Example Google Translate text-to-speech library usage:
    import os
    from gtts import gTTS

    tts = gTTS(text='sentence', lang='pl')
    tts.save('/tmp/sentence.mp3')
    # Play the saved file with the mpg321 command-line player
    os.system('mpg321 /tmp/sentence.mp3')
As you can see, it’s not that complicated if we tackle each step one by one. They perform different operations, but when combined, give you limitless opportunities. You can start testing the described services by simply installing Python and the required libraries on your computer. When everything starts to work separately, pack it up to work together as your own personal assistant.
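To give an idea of how the pieces might fit together, here is a minimal sketch of the main loop. It assumes you have wrapped each step in a function of your own: `wait_for_hotword` (blocks until Snowboy reports the hotword), `listen_and_recognize` (returns the recognized text, or `None` on failure), `handle_command` (your logic) and `speak` (text to speech plus playback). All four names, and the `max_cycles` parameter, are hypothetical; they are passed in as arguments so the loop can be tried out on a computer with simple stand-ins before any hardware is connected.

```python
from typing import Callable, Optional


def run_assistant(
    wait_for_hotword: Callable[[], None],
    listen_and_recognize: Callable[[], Optional[str]],
    handle_command: Callable[[str], Optional[str]],
    speak: Callable[[str], None],
    max_cycles: Optional[int] = None,
) -> None:
    """Wake up, listen, answer, go back to sleep; repeat.

    max_cycles limits the number of wake-ups (useful for testing);
    None means run forever.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        wait_for_hotword()  # blocks until the hotword is heard
        phrase = listen_and_recognize()
        if phrase is None:
            speak("Sorry, I did not understand that")
        else:
            reply = handle_command(phrase)
            if reply is None:
                reply = "Sorry, I cannot help with that yet"
            speak(reply)
        cycles += 1
```

For a dry run, you could pass `lambda: None` as the hotword step, a function that returns a fixed phrase as the recognizer, and `print` as the speaker.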