SpeechRecognition
 

Goal

 The aim is to be able to run my home automation system with voice commands.


To this end, I'm looking at having microphone arrays at various points around the house, feeding into a speech recognition system, which is backed by an interface to the HA gateway.


Input Devices

A (hopefully) reasonable input device should generate an audio stream over TCP. This would be processed at the SBC end if there's enough CPU power to do the beam forming, otherwise on a PC.
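
A rough sketch of the capture end, assuming raw PCM is piped in from something like arecord; the host, port, and chunk size are placeholders, not decided values:

    #!/usr/bin/env python3
    # Read raw PCM from stdin (e.g. piped from arecord) and push it over
    # TCP to whichever box does the beam forming. Host, port and chunk
    # size below are placeholder values.
    import socket
    import sys

    HOST = "192.168.1.10"   # assumed address of the beam-forming machine
    PORT = 5500             # arbitrary port
    CHUNK = 4096            # bytes per send

    def main():
        sock = socket.create_connection((HOST, PORT))
        try:
            while True:
                data = sys.stdin.buffer.read(CHUNK)
                if not data:
                    break
                sock.sendall(data)
        finally:
            sock.close()

    if __name__ == "__main__":
        main()

On the SBC something like "arecord -f S16_LE -r 16000 -c 4 -t raw | ./stream_audio.py" would drive it (the 4-channel capture obviously depends on what the array hardware presents).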

 

Speech Recognition

Known options.

Sphinx

This is the application behind Gnome Voice Control. There are a variety of code bases of varying quality. Given the very limited grammar model targeted, it's probable that PocketSphinx would do the job.
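
As a sketch of how small the problem could be kept, using the keyword-spotting mode from the older pocketsphinx Python package; the phrase and threshold here are invented and would need tuning, not a tested configuration:

    # Keyword spotting with pocketsphinx's LiveSpeech helper; the phrase
    # and threshold are illustrative only.
    from pocketsphinx import LiveSpeech

    speech = LiveSpeech(
        lm=False,                 # no full language model, just a keyphrase
        keyphrase='lights on',
        kws_threshold=1e-20,
    )
    for phrase in speech:
        print('heard:', phrase)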

Julius

Apparent favorite of VoxForge, who are busy building speaker-independent models for it. It appears to be under active development, and is far less fragmented than the CMU Sphinx efforts.

Something built on HTK

Looks like it would be a reasonable amount of work.

Something custom

This is mostly wishful thinking. Speech recognition is a complex and well studied field, but I still have the itch to see what Restricted Boltzmann Machines would do in this sort of space. (In particular, doing unlabelled phoneme classification seems fairly ideal).

And maybe doing things like using a speech synth system, plus noise, plus transforms, to generate a large labelled corpus (using the noise to try to achieve speaker independence).
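
A very rough sketch of the noise-mixing half of that idea; white Gaussian noise and the example SNRs stand in for real room noise and channel transforms:

    # Mix noise into a synthesised utterance (a numpy array of samples)
    # at a chosen SNR, to fan one labelled utterance out into many
    # degraded variants. The noise model here is a placeholder.
    import numpy as np

    def mix_at_snr(clean, snr_db, rng=None):
        rng = rng or np.random.default_rng()
        noise = rng.standard_normal(len(clean))
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        # Scale noise so that 10*log10(clean_power / noise_power) == snr_db
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    # e.g. variants = [mix_at_snr(utterance, snr) for snr in (20, 10, 5, 0)]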


Interfacing

This bit should be pretty easy. With a word stream out of the speech engine, this is basically an FSA with actions arising from state transitions.
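
Something along these lines, where the states, vocabulary, and the ha_command() hook are all invented names for illustration:

    # Word-stream -> action FSA. Unknown words drop back to 'idle';
    # actions fire on the transitions that complete a command.
    TRANSITIONS = {
        # (state, word) -> (next_state, action or None)
        ('idle',   'computer'): ('armed',  None),
        ('armed',  'lights'):   ('lights', None),
        ('lights', 'on'):       ('idle',   ('lights', 'on')),
        ('lights', 'off'):      ('idle',   ('lights', 'off')),
    }

    def run(word_stream, ha_command):
        state = 'idle'
        for word in word_stream:
            state, action = TRANSITIONS.get((state, word), ('idle', None))
            if action is not None:
                ha_command(*action)    # hand the command to the HA gateway

    # run(['computer', 'lights', 'on'], lambda dev, op: print(dev, op))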

 

Issues

Where to start! LOTS of them.

 

1. Picking up spoken voice for an HA system means that the speaker will be 1-4 meters away from the microphone. This means low signal levels, highly variable signal levels, and truly awful signal-to-noise ratios. It means high gain is needed for the mic preamp, and noise ingress will be a serious issue at every point.
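
A back-of-envelope illustration of the level problem, assuming a free-field point source (real rooms are reverberant, so this is only a rough guide):

    # Level drops 20*log10(d_far/d_near) dB with distance under a
    # free-field point-source assumption, so a speaker at 4 m is
    # roughly 12 dB down on one at 1 m.
    import math

    def level_drop_db(d_near, d_far):
        return 20 * math.log10(d_far / d_near)

    print(level_drop_db(1.0, 4.0))   # ~12.0 dB quieter at 4 m than at 1 m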

 

2. Reverb and echo are going to be large relative to the signal, and will be difficult to control. The only bright side here is that rooms themselves are unlikely to change shape, so the adaptation needed will likely only be for people moving around. But even with that, this is going to be nasty.


3. Beam forming to try to control the noise levels is highly non-trivial. Steering the beam to track speakers is cutting-edge technology. And due to the huge variation in audio wavelengths, aliasing is a serious issue with a small number of microphones (4 is really inadequate; 16 is more like a minimum for a useful system).
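
To put rough numbers on the aliasing constraint (illustrative only): for a uniform linear array the element spacing has to stay under half a wavelength at the highest frequency of interest, and with only 4 elements the resulting aperture is tiny.

    # Spatial-aliasing constraint for a uniform linear array: spacing
    # must stay under half a wavelength at the highest frequency of
    # interest, while the total aperture ((N-1)*spacing) sets the
    # low-frequency resolution.
    SPEED_OF_SOUND = 343.0            # m/s at roughly 20 C

    def max_spacing(f_max_hz):
        """Largest spacing (m) that avoids aliasing up to f_max_hz."""
        return SPEED_OF_SOUND / (2 * f_max_hz)

    d = max_spacing(4000)             # ~0.043 m for 4 kHz speech content
    print(d, 3 * d, 15 * d)           # spacing, then 4-mic and 16-mic apertures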

 

4. Crosstalk suppression will be tough. 'Crosstalk' here is used in the broader sense: voice from unwanted sources (TV, movies, radio); multiple simultaneous speakers; and multiple arrays hearing the same speaker. The best idea I have here is to do pre- and post-cueing (i.e. have a word that must be used before and after command phrases, and discard all speech without it).
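
A sketch of that filtering step, with the cue word and the word-stream interface invented for illustration:

    # Only words bracketed by the cue word are passed on; everything
    # else (TV, radio, background chatter) is dropped.
    CUE = 'simon'   # placeholder cue word

    def cued_phrases(words):
        buf = None
        for w in words:
            if w == CUE:
                if buf is None:
                    buf = []           # opening cue: start collecting
                else:
                    yield buf          # closing cue: emit the phrase
                    buf = None
            elif buf is not None:
                buf.append(w)

    # list(cued_phrases(['noise', 'simon', 'lights', 'on', 'simon', 'tv']))
    # -> [['lights', 'on']]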

 

5. Speech recognition accuracy. This generally sucks, which means that the models will need to be re-tuned to match the speakers (my family), and also re-tuned to match the awful audio pickup (microphone arrays). In addition, accuracy on existing open source engines is not great. So it's an open question whether the models can be tuned enough to make a useful system at all.

 

6. Latency. Most of the engines are compute-intensive, and don't process in parallel with speech. This means that there will be a lag in actioning commands, which is problematic both because it is generally a poor user experience, and because, given the (poor) system accuracy, the user needs to wait to see if the command has been understood so they can repeat it. Yuck.

 

7. Time. It's not unlikely that this will take a large investment of effort to get to a working system, and whether my enthusiasm lasts long enough to finish is an open question (and I should point out that my track record here isn't ideal!).