PITA Fiscal Year 2009 Projects - Product and Process Design and Optimization

Front-to-Back High-Speed Speech Recognition in Silicon

Principal Investigators: Rob A. Rutenbar, Tsuhan Chen, Dr. Edward C. Lin

Speech recognition tools translate human speech data into searchable text. Whether running on a desktop PC or an enterprise server farm, all of today’s state-of-the-art recognizers exist as complex software running on conventional computers. This is profoundly limiting for applications that require extreme recognition speed. Today’s most sophisticated recognizers fully occupy the computational resources of a high-end server to deliver results at real-time speed: each hour of audio input requires roughly an hour of computation for recognition. We need much faster recognizers to triage vast streams of audio intercepts for threats to national security; to extract searchable text from the torrent of audio/video media uploaded to the web; to transcribe physician-dictated medical diagnoses into electronic medical records; and to mine business intelligence from recorded interactions in call centers. Our solution is to move today’s best-quality speech recognition strategies directly into hardware. In prior work, the team has demonstrated fast experimental prototypes for some of the critical components of best-quality speech recognition. In this ‘seed’ proposal, we plan for the first time to integrate all the disparate components of our prior academic recognizers into a unified, front-to-back (voice-to-text) high-rate recognizer prototype, running on a commercial-grade configurable hardware platform. This recognizer will run at least 20x faster than real time (e.g., 20 hours of speech processed in 1 hour of elapsed time) on large-vocabulary benchmarks (at least 20,000 words). That vocabulary size is the entry point for practical use in applications such as homeland security, call center customer relationship management (CRM), and medical dictation.