Advanced voice

Knowles advanced voice technology, called earSmart®, dramatically improves the user experience of mobile devices. With an advanced voice-enabled processor inside your mobile device, you can enjoy:

  • Improved voice quality, for clear conversations even when it’s noisy
  • Higher accuracy for voice-enabled applications and speech recognition services
  • Enhanced audio for multimedia recording and playback

Reverse-engineering human hearing

What makes earSmart advanced voice technology unique is that it can hear just like we do. Based on pioneering work to reverse-engineer the brilliance of the human hearing system and replicate its processes on a chip, Knowles advanced voice and audio processors bring auditory intelligence to your mobile device. In the same way we can hold a conversation even in a noisy room, Knowles advanced voice and audio processors hear and distinguish sounds, isolate and enhance your voice, and suppress surrounding noise. The end result is a mobile device that delivers a better experience across three key areas:

  • Communications: clear conversations in nearly all situations, however you talk, from hands-free on your smartphone to video chat over VoIP; earSmart Advanced Voice technology enables wideband high-definition (HD) sound quality for richer, life-like calls
  • Speech Recognition: dramatically improved accuracy and performance for speech-based services, such as voice search, voice dialing, navigation, speech-to-text or voice as a user interface
  • Multimedia: improved sound quality for multimedia capture and playback, from movies to music, or recording audio for video, voice memos, and more

Use your mobile phone in more places with earSmart advanced voice technology. Talk to anyone, at any time without worrying about surrounding noise, having to find a quiet place, or missing an important call. Stay connected, accessible and productive wherever you go.

The science behind earSmart™ technology

A noisy environment is highly disruptive to all forms of mobile communication – Knowles has taken a unique approach to solve this challenge. Working with leading experts in auditory neuroscience, our engineers sought to understand the complex processes of the human auditory pathway, from the inner ear to the brain, which allows us to process and perceive sounds.

We then mapped the science of human hearing to reverse-engineer the process: from the ears, which capture all the sounds around you, to the inner ear, which translates this information into a form the brain can manage and process, to the brain, which distinguishes and identifies specific sounds in the world around you. These intricate functions provided a roadmap for our team to design a voice processor based on these biological processes, which we named earSmart™. Our earSmart Advanced Voice processor is designed to harness the complexity and performance of human hearing to deliver breakthrough voice enhancement and noise suppression in mobile devices.

Knowles is a pioneer in developing commercial products based on the science of Computational Auditory Scene Analysis (CASA) to manage the characterization, grouping and processing of complex mixtures of sound. This unique capability is what enables Knowles to provide a highly accurate, efficient and consistent solution for various types of acoustic noise, stationary and non-stationary, from both the transmit side of the call (the sound surrounding your phone) and the "far end", received from the other end of the line.

The science of CASA, deployed in our earSmart technology, enables Knowles to effectively address the various types of noise found in our everyday environment.

In human hearing, when two or more sounds occur at the same time, the various frequency components from each sound are received by the ear at basically the same time. This presents the problem of determining which components should be grouped together and identified as the same sound. To manage this, the human hearing system evaluates acoustic energy based on an array of characteristics or “cues” – such as pitch, frequency, intensity, onset, spatial location, and duration – which are coded in a rapid-fire series of electrical signals that the brain can recognize. The auditory areas of the brain can then sort and group the acoustic information to perceive a sound from one particular source or event. Through these cues, acoustic elements can also be grouped and linked together in time, producing an auditory stream, which can be interpreted as arising from the same sound-generating event, so we can perceive a dog barking, a person talking, or a piano playing.

The processes that allow the human auditory system to perceive and organize sound are known as “Auditory Scene Analysis” (ASA), a term first coined by psychologist Albert Bregman to define the principles the human auditory system employs to organize acoustic inputs into perceptually meaningful elements. Through ASA, we can accurately group sounds – even when they comprise multiple frequencies, as in music, or are heard simultaneously – and avoid blending sounds that should be perceived as coming from separate sources or events. As a result, ASA allows you to correctly distinguish and identify a sound of interest, like a voice, from other noise sources. For example, we can reject all the other voices, music and noise at a party in order to selectively listen to just one conversation.

This well-known illustration of ASA is often referred to as the “cocktail party effect”: our ability to focus attention on one person talking, or a single noise source, and block out the surrounding noise. This same ability enables us to hear and converse even in a noisy place, such as a crowded café or a busy street.

Knowles is a pioneer in developing commercial products based on the principles of ASA, employing the science of “Computational Auditory Scene Analysis” or CASA – the field of study that attempts to recreate, in machines, sound source separation in the same manner as human hearing. Using CASA, the earSmart Advanced Voice processor is able to mimic the processes of human hearing to accurately characterize and group complex mixtures of sound into “sources”, based on a diverse list of cues such as pitch, onset/offset time, spatial location and harmonicity. The processor evaluates these groupings to correctly identify and isolate the primary voice signal or conversation. Dozens of grouping principles underlie CASA, but these can be broadly categorized into those that operate across time and those that operate across frequency. In addition, learned patterns play an important role. The job of CASA is to group incoming sensory information to form an accurate representation of the environmental sounds. Whether a sound source is steady and constant, or transitory and moving, a CASA-based system handles each effectively.

Before these processes can take place, the sound signals collected by the two microphones and arriving at the earSmart voice processor must be digitized and then transformed to the frequency domain. We accomplish this transformation through the Fast Cochlea Transform™.

The cochlea is the most complex part of the ear and central to the human auditory system. It is responsible for transforming sound pressure waves into electrical information that the brain can interpret as a sound. Audience’s proprietary “Fast Cochlea Transform™” (FCT) performs a similar transformation function within the voice processor, converting a time domain representation of sound into a frequency domain representation.

Sound captured by the microphones is sent to the voice processor and digitized before entering the FCT. Operating like the human cochlea, the FCT separates the sound into its frequency components to map the digital audio stream in a three-dimensional, high-quality spectral representation of the sound mixture. The ability to transform sound into a spectral domain is what enables the characteristics or cues in the digital audio stream to be separately identified as different sound sources.

This is particularly important to manage simultaneous sounds, allowing these to be uniquely evaluated and grouped by source using the principles of CASA.

The FCT is similar to the “Fast Fourier Transform” (FFT) that is commonly used in digital signal processing. Both transform the signal into the frequency domain for audio processing, but the FCT is better able to map audio signals in several critical ways:

  • Log-Frequency Scale: The FFT transforms the audio signal into the frequency domain on a linear scale, while the FCT performs its transformation on a logarithmic-frequency scale. A log-frequency scale improves the efficiency of the transformation by putting the resolution and computational resources where the listener can hear it.
  • Direct Computation of Critical Bandwidths: The uniform bandwidth of the FFT contrasts with the well-known Critical Bandwidths of the human ear. The FCT computes its spectral transformation with the Critical Bandwidths built directly into the computation.
  • Optimal Time-Frequency Tradeoff: The FCT provides greater accuracy in representing the audio signal at both low and high frequencies. At low frequencies, the FCT provides greater spectral resolution, which allows the detection of harmonics and recreation of sound more accurately. At high frequencies, the FCT provides a faster response rate, which captures dynamic changes more accurately.
  • Continuous Signal Processing: The FFT transforms the audio signal by reading blocks or frames of data that are taken at a particular frame rate. When the audio signal spans these data frames, spurious artifacts can be introduced and wreak havoc with the audio stream by introducing extra "noise" into the signal. This "noise" can take the form of additional audio signals at additional frequencies, as well as inaccurate frequency representation of the real signal. The FCT takes a completely different approach. Instead of operating on blocks of data, the FCT continuously streams the incoming signal into the transformation. The result is that it can transform the audio signal into its appropriate representation without introducing frame artifacts, or additional "noise" that would have to be removed from the system.
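The first of these differences, linear versus logarithmic bin spacing, is easy to see numerically. A sketch comparing evenly spaced FFT-style bin centers with log-spaced centers (the sample rate, FFT size, and frequency range are illustrative values, not FCT parameters):

```python
def linear_centers(fs, n_fft, n_bins):
    """FFT-style bin centers: constant spacing in Hz across the range."""
    return [k * fs / n_fft for k in range(n_bins)]

def log_centers(f_lo, f_hi, n_bins):
    """Log-spaced centers: a constant ratio between neighbours, so
    resolution concentrates at low frequencies, as in the cochlea."""
    ratio = (f_hi / f_lo) ** (1.0 / (n_bins - 1))
    return [f_lo * ratio ** k for k in range(n_bins)]

lin_hz = linear_centers(fs=8000, n_fft=256, n_bins=9)    # 0.0, 31.25, 62.5, ...
log_hz = log_centers(f_lo=100.0, f_hi=4000.0, n_bins=9)  # ~100, ~158.6, ..., 4000
```

On the log scale, neighbouring low-frequency bins sit close together (fine resolution where harmonics cluster) while high-frequency bins spread apart, matching where the listener can actually hear the detail.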

The audio signals now move through a series of functions for processing and interpreting sounds within the CASA system, to produce clear voice communication and suppress background or ambient sounds. 

In the Characterization process, the “cues” or characteristics of sound components are computed – coding the acoustic information according to measures such as pitch, spatial location, onset time, and more, so these can be later used to inform grouping and interpretation.

One of the most powerful simultaneous grouping cues is “Pitch”. The harmonics generated from a pitched sound source form distinct frequency patterns. As a result, this is a helpful means to group and identify components as belonging to one sound versus another. For example, comparing a male to a female voice, it’s easy to separate these signals using Pitch.
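The idea that pitch separates a lower voice from a higher one can be sketched with a simplified autocorrelation pitch estimator (a textbook method, not Audience's characterization algorithm; the two test tones stand in for male- and female-range voices):

```python
import math

def estimate_pitch(signal, fs, f_min=80.0, f_max=400.0):
    """Simplified autocorrelation pitch estimate: the lag at which the
    signal best matches a shifted copy of itself gives the period."""
    best_lag, best_corr = None, float("-inf")
    for lag in range(int(fs / f_max), int(fs / f_min) + 1):
        corr = sum(signal[t] * signal[t + lag]
                   for t in range(len(signal) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return fs / best_lag

fs = 8000
lower_voice  = [math.sin(2 * math.pi * 110 * t / fs) for t in range(1600)]
higher_voice = [math.sin(2 * math.pi * 220 * t / fs) for t in range(1600)]
# estimate_pitch(lower_voice, fs)  -> roughly 110 Hz
# estimate_pitch(higher_voice, fs) -> roughly 220 Hz
```

Because the two estimates differ by an octave, components carrying these pitch cues can be assigned to different sources with confidence.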

Another important grouping cue is “Spatial Location”, which is enabled by using two microphones to function like ears and collect the sounds surrounding your mobile phone. Spatial Location allows the voice processor to determine both the direction from which a sound is coming and the distance of the sound from each of the microphones. This allows the components of a sound to be grouped based on their location relative to the two microphones.

Characterization also relies on another powerful grouping cue, “Common Onset/Offset Time”. Frequency components from a single sound source often start and/or stop at the same time, which is described as onset/offset time. When a group of frequency components arrives at the ear simultaneously, it usually indicates that they have come from the same source. These cues are combined with the raw frequency data from the Fast Cochlea Transform™ to provide the tags that allow the acoustic information to be grouped, and this Grouping process is what allows sound components to be interpreted as a particular sound.
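A common way to turn two-microphone data into a spatial cue is to estimate the inter-microphone delay by cross-correlation. A simplified sketch (a standard time-difference-of-arrival technique, not necessarily the earSmart implementation; the signals are synthetic):

```python
import random

def tdoa(mic1, mic2, max_lag):
    """Estimate the inter-microphone delay (in samples) as the lag that
    maximizes the cross-correlation between the two signals."""
    best_lag, best = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        lo = max(0, -lag)
        hi = min(len(mic1), len(mic2) - lag)
        score = sum(mic1[t] * mic2[t + lag] for t in range(lo, hi))
        if score > best:
            best, best_lag = score, lag
    return best_lag

# Hypothetical source: the same noise-like signal reaches the second
# microphone three samples later than the first.
random.seed(0)
src = [random.uniform(-1.0, 1.0) for _ in range(500)]
mic1 = src
mic2 = [0.0, 0.0, 0.0] + src[:-3]
delay = tdoa(mic1, mic2, max_lag=10)  # -> 3 samples
```

The recovered delay, together with the known microphone spacing, indicates the direction a sound is coming from, so components sharing the same delay can be grouped as one source.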


Dozens of grouping principles are involved for CASA to determine which sound or frequency components should be grouped together and treated as one sound. The grouping process involves clustering sound components with common or similar attributes so these can be interpreted as a single sound source. Similar attributes are linked to form a single auditory stream, while sound components with sufficiently dissimilar attributes are associated with different auditory streams. These streams can then be tracked in time, to be associated with persistent or recurring sound sources in the auditory environment. The grouping process combines the raw frequency data from the Fast Cochlea Transform™ with the corresponding acoustic tags for each identified stream.
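In caricature, grouping is clustering: components whose cues agree join the same stream. A toy sketch (the cue values and tolerances are invented for illustration; real CASA grouping weighs dozens of principles, not two):

```python
def group_components(components, pitch_tol=10.0, onset_tol=0.02):
    """Greedy clustering: a component joins the first stream whose
    reference component has similar pitch and onset cues."""
    streams = []
    for comp in components:
        for stream in streams:
            ref = stream[0]
            if (abs(comp["pitch"] - ref["pitch"]) < pitch_tol
                    and abs(comp["onset"] - ref["onset"]) < onset_tol):
                stream.append(comp)
                break
        else:
            streams.append([comp])
    return streams

components = [
    {"pitch": 110.0, "onset": 0.000},  # two similar, co-starting
    {"pitch": 112.0, "onset": 0.010},  # components -> one stream
    {"pitch": 440.0, "onset": 0.500},  # an unrelated, later sound
]
streams = group_components(components)  # -> 2 auditory streams
```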

Voice isolation

After all the sound components are grouped, they can be interpreted and identified as individual sound sources, which can then be prioritized to select particular sounds. For your mobile phone, the voice processor can accurately identify and isolate your voice conversation, separating it from all the other auditory streams so that they can be suppressed or filtered out.
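One crude illustration of this separation is a binary spectral mask: keep each frequency bin only where the voice is estimated to dominate. This is a standard teaching device, not Audience's algorithm, and all the per-bin magnitudes below are invented:

```python
def isolate_voice(mixture, voice_est, noise_est):
    """Binary spectral mask: keep a frequency bin of the mixture only
    where the estimated voice energy dominates the estimated noise."""
    return [m if v > n else 0.0
            for m, v, n in zip(mixture, voice_est, noise_est)]

# Hypothetical per-bin magnitudes for one 8-bin frame.
voice_est = [0.0, 5.0, 6.0, 0.5, 0.1, 0.0, 0.0, 0.0]
noise_est = [2.0, 0.5, 0.2, 3.0, 4.0, 2.5, 1.0, 0.8]
mixture   = [v + n for v, n in zip(voice_est, noise_est)]
cleaned   = isolate_voice(mixture, voice_est, noise_est)
# Noise-dominated bins are zeroed; voice-dominated bins pass through.
```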

Inverse fast cochlea transform

This last process is the inverse of the Fast Cochlea Transform, and is responsible for converting the Fast Cochlea Transform data back into reconstructed, clear, high-quality digitized audio, which is then made available for transmission over the wireless network.
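The forward and inverse steps form a matched pair, which can be illustrated with a plain DFT round trip (a generic transform, not the proprietary FCT): after a perfect forward and inverse pass, the reconstructed signal matches the original to within floating-point error.

```python
import cmath
import math

def dft(x):
    """Forward transform: time domain to complex frequency bins."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(bins):
    """Inverse transform: frequency bins back to a time-domain signal."""
    n = len(bins)
    return [sum(bins[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
roundtrip = idft(dft(signal))
max_err = max(abs(a - b) for a, b in zip(signal, roundtrip))  # ~1e-13
```

In the real pipeline, the bins are modified in between (noise bins suppressed, voice bins kept), so the inverse step reconstructs the cleaned signal rather than the original mixture.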

Listed below is a glossary of general technology terms used within this website and terms featured in various Audience diagrams.


GENERAL TERMS:

Acoustic Echo Cancellation - The elimination of an echo in a two-way voice transmission, used to enhance voice communications.

Audio Codecs - A codec is a process that compresses/decompresses audio data according to a given format. The objective of a codec algorithm is to represent the high-fidelity audio signal with the minimum number of bits while retaining quality. This effectively reduces the storage space and the bandwidth required for transmission of the stored audio.

Auditory Scene Analysis (ASA) – Refers to the processes that allow the human auditory system to perceive and organize sound. The term, first coined by psychologist Albert Bregman, is used to define the principles the human auditory system employs to organize acoustic inputs into perceptually meaningful elements.

Auto-calibration – The IC (integrated circuit) is self-calibrating. No engineering intervention required.

Baseband Chipsets – An IC (integrated circuit) that is mainly used in a mobile phone to process communication functions.

Brainstem – The stem-like part of the brain that is connected to the spinal cord. It manages messages going between the brain and the rest of the body and is involved in movement, sensation, and the control of vital functions such as breathing, heart rate and blood pressure.

CASA (Computational Auditory Scene Analysis) – Refers to the field of study that attempts to recreate sound source separation in the same manner as human hearing in machines, to group and process complex mixtures of sound.

Close-talk – The act of speaking into a mobile phone that is positioned close to the sound source, your mouth.

Cochlea – Part of the inner ear, it’s the snail-shaped tube that converts mechanical energy (sound vibrations) into nerve impulses sent to the brain.

Cortex – The outermost layer of the cerebral hemispheres of the brain. It is responsible for all forms of conscious experience, including perception, emotion, thought and planning.

Distortion – The alteration of the original shape (or other characteristic) of an object, image, sound, waveform. Distortion is usually unwanted, and often many methods are employed to minimize it in practice.

DSP or Digital Signal Processing – A category of techniques used to analyze and process signals from sources such as sound. Signals are converted into digital data and analyzed through the use of various algorithms.

Dynamic Noise Suppression – The ability to effectively suppress all types of noise, whether a sound source is steady and constant, or transitory and moving.

Far-talk – Refers to the act of speaking into a mobile phone positioned away from the sound source, your mouth, as when using the phone’s speakerphone function.

Fast Cochlea Transform™ (FCT) – Audience’s proprietary technology that separates sound into its frequency components to map the digital audio stream in a three-dimensional spectral representation of the sound mixture – operating like the cochlea in the human ear, which transforms sound pressure waves into electrical information that the brain can interpret. The FCT decomposes the sound into its frequency components, enables the cues in the audio stream to be analyzed, and the energy to be grouped by source according to the principles of Computational Auditory Scene Analysis.

Fast Fourier Transform (FFT) – A computer algorithm used in digital signal processing (DSP) to convert time domain signals into frequency domain representations.

Full Duplex – A channel providing simultaneous transmission in both directions, as in simultaneous, two-way communication.

Harmonics – Whole-number multiples of the fundamental frequency; they determine the tone quality (timbre) of a sound rather than its pitch.

Host Interface – Refers to the connection between an IC (integrated circuit) and the baseband or application voice processor.

Integrated Circuit (IC) – A miniaturized electronic circuit that has been manufactured in the surface of a thin substrate of semiconductor material. In the electronics industry, Integrated Circuit is the formal name for what is commonly known as a “chip”.

Inverse Fast Cochlea Transform – The process used by Audience technology to convert the Fast Cochlea Transform data back into reconstructed, cleaned-up, high-quality digital audio, which is then converted to an analog signal, and made available for transmission.

Log-Frequency Scale – A scale with frequency represented in logarithmic values and increments.

MMS (Multimedia Messaging Service) – An enhancement of SMS (Short Messaging Service) text messaging that allows a user to include images, audio, video and graphics files with a text message transmitted to a mobile phone.

MOS – Mean opinion score (MOS) provides a numerical indication of perceived quality. It is a subjective measurement obtained by having people listen to calls and rate the audio quality from 1 (lowest) to 5 (highest). The MOS score is generated by averaging the results for a particular codec.

Multi-Band Compander (MBC) – A compander works by compressing or expanding the dynamic range of an analog electronic signal such as sound. A multi-band companding system includes compression and expansion circuits in which each element is utilized in connection with a number of frequency bands.

Non-Stationary Noise – Noise that is characterized by rapid or random change in characteristics such as pitch, space and onset time. Examples include music, a person talking, keyboard typing, etc. By the time non-stationary noise is recognized as noise, it has already passed, so it requires more sophisticated noise-suppression techniques.

Omni Directional Mic – Microphones that pick up sounds from virtually any direction, as opposed to unidirectional mics that only pick up sounds aimed directly into their centers.

Onset Time – One of the attributes used to interpret sound, helping to identify the location or source of a sound in terms of direction. Because sound travels as physical waves through the air, the time it takes for sound waves to reach the ears will vary, providing information on the location of a sound relative to the listener. The time at which sounds arrive is referred to as onset time.

Pitch – One of the attributes used to interpret sound. Pitch refers to the frequency of a sound, or the number of vibrations per second, usually measured in hertz. A sound with a high frequency will have a high pitch, and a shorter wavelength. A low pitch indicates deep tones.

Post-Equalization Filter – Enables handset manufacturers to manage frequency response.

RoHS Requirements – RoHS stands for Restriction of Hazardous Substances. RoHS regulations were passed by the EU in 2003 and took effect in 2006. RoHS regulates the amount of six hazardous substances (lead, mercury, cadmium, hexavalent chromium, polybrominated biphenyls, polybrominated diphenyl ethers) in electronic and electrical equipment.

Schemas – Learned patterns.

Sequential Grouping Cues – Auditory grouping cues that occur across time.

Simultaneous Grouping Cues – Auditory grouping cues that occur across frequency.

Spatial Location – The location of a sound based on its distance and direction. This information can be used to group sounds and differentiate them from the voice of interest.

Stationary Noise – Noise that has a relatively constant nature in terms of pitch, space, and onset time. This type of noise can be identified and effectively subtracted through conventional signal-processing techniques.

System-on-Chip (SOC) – Refers to integrating all components of a computer or other electronic system into a single integrated circuit (IC). It may contain digital, analog, mixed-signal, and often, radio-frequency functions – all on a single IC substrate.

Thalamus – A collection of nerve cells in the brain with the primary role to relay sensory information from other parts of the brain to the cerebral cortex.

Time-Frequency Resolution – The precision with which a signal can be represented jointly in time and frequency; a key consideration when characterizing and manipulating signals whose statistics vary in time, such as transient signals.

Voice Color Tuning – Modifying subjective sound quality by frequency domain emphasis and de-emphasis.

Voice Equalization (VEQ) – The increase or decrease of voice signal strength to enable increased clarity.

Voice Search – Also called voice-enabled search, allows the user to use a voice command to search the Internet, on a portable device.

DIAGRAM TERMS:

ADC – Analog to Digital Converter.

Application Processor – An IC (integrated circuit) that processes data in contrast with one that performs control functions.

ASR – Automatic Speech Recognition, technology that enables a computer to identify the words that a person speaks into a microphone or telephone.

Bluetooth® – An open wireless technology standard for exchanging data over short distances between fixed and mobile devices.

BT PCM – Bluetooth Pulse Code Modulation, a method of encoding an audio signal in digital format.

DAC – Digital to Analog Converter.

EEPROM – Electrically Erasable Programmable Read-Only Memory, a type of non-volatile memory used in computers and other electronic devices to store small amounts of data that must be saved when power is removed.

GPIO – General Purpose Input/Output.

I2C Slave – One side of a specialized serial interface used for inter-IC communications. The “slave” device is dependent upon a “master” device to initiate communications.

Instruction ROM – Permanent memory of a computer from which application programs can be executed.

Instruction RAM – The very fast, temporary memory of a computer into which application programs currently in use can be loaded and executed.

I/O Power – Input/Output power supply.

JTAG – Joint Test Action Group, a consortium of individuals from North American companies whose objective is to tackle the challenges of testing high-density IC (integrated circuit) devices.

LDO – Low Drop-Out regulator is a DC linear voltage regulator, which can operate with a very small input–output differential voltage.

MUX – Multiplexor, a device that funnels several different streams of data over a common communications line.

NS – Noise Suppression, the reduction of unwanted sound.

PCM – Pulse Code Modulation, a method of encoding an audio signal in digital format.

PDM – Pulse-Density Modulation is a form of modulation used to represent an analog signal in the digital domain. In a PDM signal, specific amplitude values are not encoded into pulses as they would be in PCM. Instead it is the relative density of the pulses that corresponds to the analog signal's amplitude.
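The density-tracks-amplitude idea behind PDM can be sketched with a first-order modulator (a simplified, illustrative implementation; real PDM microphones use higher-order modulators at much higher rates):

```python
def pdm_modulate(samples):
    """First-order pulse-density modulator: the running density of
    1-bits tracks the input amplitude (input assumed in [-1, 1])."""
    integrator, bits = 0.0, []
    for x in samples:
        bit = 1 if integrator >= 0.0 else 0
        bits.append(bit)
        # Feed back the quantized output level (+1 or -1).
        integrator += x - (1.0 if bit else -1.0)
    return bits

# A constant input of 0.5 yields 1-bits about 75% of the time,
# since density = (amplitude + 1) / 2.
bits = pdm_modulate([0.5] * 400)
density = sum(bits) / len(bits)  # -> 0.75
```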

PLL – Phase Locked Loop, a highly stable electronic circuit that can be used in radios to give accurate, drift-free tuning.

PMU – Power Management Unit, a microcontroller that governs power functions of digital platforms.

RF – Radio Frequency, an alternating current that generates an electromagnetic field when applied to an antenna. The generated electromagnetic field is suitable for wireless broadcasting and communications.

RX – An abbreviation for receive when in reference to a communication interface.

SPI Interface – Serial Peripheral Interface, a synchronous serial data link standard that operates in full duplex mode. Devices communicate in master/slave mode, where the master device initiates the data frame.

TX – An abbreviation for transmit when in reference to a communication interface.

VEQ (Voice Equalization) – The increase or decrease of voice signal strength to enable increased clarity.


To deliver clear voice communications in all environments, a noise suppression solution has to provide strong, consistent performance across all types of conditions. And a high quality solution must also minimize voice distortion or artifacts when suppressing noise and echoes.

Audience has taken a leading role in helping the industry identify testing standards for noise suppression solutions to validate performance in real-world environments. Working closely with leading wireless service providers and handset manufacturers, Audience has helped to develop and modify International Telecommunication Union (ITU) recommendations covering both objective and subjective assessment of noise suppression quality. The ITU is the United Nations specialized agency in the field of telecommunications, information and communication technologies, and its Standardization Sector (ITU-T) is responsible for studying technical and operating questions and setting recommendations to standardize telecommunication on a worldwide basis.

Recommendation ITU-T P.835 – Subjective Testing Standard

The ITU-T P.835 recommendation (approved in November 2003) provided a methodology for evaluating the subjective quality of speech in noise and for evaluating noise suppression algorithms, using separate rating scales along three dimensions: the quality of the speech signal alone, the background noise alone, and overall quality. However, this recommendation addressed stationary noise only.

Audience worked closely with leading mobile operators and handset makers to ratify a new standard providing a comprehensive test methodology for subjective testing of non-stationary noise suppression. In October 2007, the ITU-T standardization committee approved Amendment 1, Appendix III to enhance the P.835 standard and provide test methodologies that emulate the real-world conditions of multiple, fast-changing, non-stationary noise sources. This amendment was an important step in helping the mobile industry effectively evaluate the performance of noise suppression solutions in the busy environments mobile users encounter every day, and to ensure a high-quality voice experience for customers.

Recommendation ITU-T G.160 – Objective Testing Standard

The ITU-T G.160 standard specified three objective metrics designed to appraise the performance of noise suppression algorithms: Total Noise-Level Reduction (TNLR), Signal-to-Noise Ratio Improvement (SNRI), and Delta Signal-to-Noise (DSN). Of these, SNRI is the most indicative measure of noise suppression effectiveness. SNRI is measured during speech activity and factors in the relationship of the background noise level to the level of speech. It computes the signal-to-noise ratio improvement achieved by the noise suppression algorithm, accounting for speech attenuation or amplification in addition to noise suppression. Some noise suppression algorithms attenuate the speech and the background noise at the same time, which lowers the overall SNRI.
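The final SNRI arithmetic is a simple difference of before/after signal-to-noise ratios. A sketch with hypothetical power measurements (G.160 itself specifies how the underlying levels are measured; this only illustrates the calculation, including how speech attenuation is reflected):

```python
import math

def snr_db(speech_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(speech_power / noise_power)

# Hypothetical powers before and after noise suppression. Note the
# speech is slightly attenuated (1.0 -> 0.9), which the metric
# accounts for alongside the noise reduction (0.5 -> 0.05).
snr_before = snr_db(speech_power=1.0, noise_power=0.5)   # ~3.0 dB
snr_after  = snr_db(speech_power=0.9, noise_power=0.05)  # ~12.6 dB
snri = snr_after - snr_before                            # ~9.5 dB improvement
```

Had the algorithm suppressed the speech as aggressively as the noise, `snr_after` would be unchanged and the SNRI would be zero, which is why the metric penalizes speech attenuation.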

Audience was requested by the ITU committee to lead an effort to create an amendment to Appendix II of the G.160 standard to ensure that objective measurement methods and calculations were accurate in evaluating performance based on new technology for noise reduction systems. Amendment 1 of G.160 Appendix II was approved by the ITU in November 2009.

Audience conducts rigorous testing to ensure the quality of its noise suppression solutions under a wide variety of noise conditions, using the ITU-T test protocols, and continues to set performance benchmarks. And we’re continuing to work with industry groups to support the deployment of these test methodology standards to ensure the highest quality voice communications.