CHAPTER 1:
INTRODUCTION
The signal at the output of a system depends on both the input excitation and
the response of the system. From the signal processing point of view, the
output of a system can be treated as the convolution of the input excitation
with the system response. At times, we need each of the components separately
for study and/or processing. The process of separating the two components is
termed deconvolution.
Deconvolution problems fall into three cases. In the first case, if the input
excitation is known, the system component can be separated/constructed by
exciting the system with the inputs and collecting its responses. This is what
is done in some channel estimation problems. In the second case, if the system
response is known, the input excitation can be recovered using inverse filter
theory. An example is Linear Prediction (LP) analysis of speech to recover the
excitation. In the third case, both the input excitation and the system
response are assumed to be unknown. The present study of cepstral analysis of
speech comes under this category.
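The second case can be illustrated with a minimal sketch. It assumes a simple one-pole system s(n) = e(n) + a s(n-1) whose coefficient a is known; the function names, the value of a and the impulse-train excitation are all illustrative assumptions, not part of LP analysis itself.

```python
import numpy as np

def all_pole_filter(e, a):
    """Pass e(n) through the one-pole system s(n) = e(n) + a * s(n-1)."""
    s = np.zeros(len(e))
    prev = 0.0
    for n, x in enumerate(e):
        prev = x + a * prev
        s[n] = prev
    return s

def inverse_filter(s, a):
    """Undo the one-pole system: e(n) = s(n) - a * s(n-1)."""
    e = s.copy()
    e[1:] -= a * s[:-1]
    return e

# Periodic impulse train as a stand-in for a voiced excitation (assumed).
e = np.zeros(64)
e[::16] = 1.0
a = 0.9  # assumed, known system coefficient

s = all_pole_filter(e, a)
e_hat = inverse_filter(s, a)

# Largest difference between the recovered and the original excitation.
max_error = np.max(np.abs(e_hat - e))
print(max_error)
```

Because the system is known exactly here, the inverse filter recovers the excitation up to floating-point rounding. Cepstral analysis addresses the harder case where neither component is known.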
Speech is composed of excitation source and vocal tract system components. To
analyze and model these components independently, and to use them in various
speech processing applications, they have to be separated from the speech
signal. The objective of cepstral analysis is to separate speech into its
source and system components without any a priori knowledge about the source
and/or the system. According to the source-filter theory of speech production,
voiced sounds are produced by exciting the time-varying vocal tract system with
a periodic impulse sequence, and unvoiced sounds are produced by exciting it
with a random noise sequence. The resulting speech can be considered as the
convolution of the respective excitation sequence with the vocal tract filter
characteristics. If e(n) is the excitation sequence and h(n) is the vocal tract
filter sequence, then the speech sequence s(n) can be expressed as follows:
s(n) = e(n) * h(n)    (1)
Here * denotes convolution.
This can be represented in the frequency domain as:
S(ω) = E(ω) · H(ω)    (2)
Eqn. (2) shows that convolution of the excitation and system components in the
time domain corresponds to multiplication of their spectra in the frequency
domain. The speech sequence has to be deconvolved into the excitation and vocal
tract components in the time domain. For this, the multiplication of the two
components in the frequency domain has to be converted into a linear
combination of the two components. Cepstral analysis serves this purpose: it
transforms the multiplied source and system components in the frequency domain
into a linear combination of the two components in the cepstral domain.
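The relation between convolution in time and multiplication in frequency can be checked numerically. The sketch below uses made-up stand-ins for e(n) and h(n) (random noise and a decaying exponential); these signals are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(32)      # stand-in excitation e(n)
h = 0.8 ** np.arange(16)         # stand-in vocal tract response h(n)

s = np.convolve(e, h)            # s(n) = e(n) * h(n), length 32 + 16 - 1

# Zero-pad both sequences to the length of s so that the DFT product
# corresponds to linear (not circular) convolution.
n_fft = len(s)
S = np.fft.fft(s, n_fft)
E = np.fft.fft(e, n_fft)
H = np.fft.fft(h, n_fft)

# Largest deviation between S(w) and the product E(w) H(w).
spectrum_error = np.max(np.abs(S - E * H))
print(len(s), spectrum_error)
```

The deviation stays at the level of floating-point rounding, provided the DFT length is at least the length of the convolved sequence.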
Basic Principles of Cepstral Analysis:
From Eqn. (2), the magnitude spectrum of the given speech sequence can be
represented as:
|S(ω)| = |E(ω)| · |H(ω)|    (3)
To linearly combine E(ω) and H(ω) in the frequency domain, a logarithmic
representation is used. The logarithmic representation of Eqn. (3) is:
log|S(ω)| = log|E(ω)| + log|H(ω)|    (4)
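A short numerical check of this step: since the magnitude spectrum of the speech is the product of the two component magnitudes, its logarithm should equal the sum of the component log magnitudes. The signals and the small offset eps are assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(32)      # stand-in excitation e(n)
h = 0.8 ** np.arange(16)         # stand-in vocal tract response h(n)
s = np.convolve(e, h)            # s(n) = e(n) * h(n)

n_fft = len(s)
mag_S = np.abs(np.fft.fft(s, n_fft))
mag_E = np.abs(np.fft.fft(e, n_fft))
mag_H = np.abs(np.fft.fft(h, n_fft))

eps = 1e-12                      # assumed small offset that guards against log(0)
log_S = np.log(mag_S + eps)
log_sum = np.log(mag_E + eps) + np.log(mag_H + eps)

# Largest deviation between log|S(w)| and log|E(w)| + log|H(w)|.
log_error = np.max(np.abs(log_S - log_sum))
print(log_error)
```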
As indicated in Eqn. (4), the log operation transforms the magnitude speech
spectrum, in which the excitation and vocal tract components are multiplied,
into a linear combination (summation) of these components; that is, the log
operation converts multiplication into addition in the frequency domain. The
separation can then be done by taking the inverse discrete Fourier transform
(IDFT) of the linearly combined log spectra of the excitation and vocal tract
system components. Note that the IDFT of a linear spectrum transforms back to
the time domain, whereas the IDFT of a log spectrum transforms to the quefrency
domain, or cepstral domain, which is similar to the time domain. This is
expressed mathematically in Eqn. (5):
c(n) = IDFT(log|S(ω)|) = IDFT(log|E(ω)| + log|H(ω)|)    (5)
In the quefrency domain, the vocal tract components are represented by the
slowly varying components concentrated in the lower quefrency region, and the
excitation components are represented by the fast varying components in the
higher quefrency region.
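The separation in the quefrency domain can be sketched on a synthetic example. Because the IDFT is linear, the cepstrum of s(n) should equal the sum of the cepstra of e(n) and h(n), with the system part near low quefrencies and an excitation peak at the pitch period. The helper real_cepstrum, the decaying impulse train, the damped cosine, the pitch period of 64 samples and the cutoff of 20 samples are all assumptions made for this illustration.

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """c(n) = IDFT(log|X(w)|), computed with an n_fft-point DFT."""
    log_mag = np.log(np.abs(np.fft.fft(x, n_fft)))
    return np.fft.ifft(log_mag).real

n = np.arange(512)
pitch_period = 64                # assumed pitch period in samples

# Decaying impulse train as a stand-in excitation e(n); the decay keeps
# |E(w)| away from zero so the logarithm stays finite.
e = np.zeros(512)
e[::pitch_period] = 0.9 ** np.arange(512 // pitch_period)

# Damped cosine as a stand-in vocal tract impulse response h(n).
h = 0.9 ** n * np.cos(0.3 * np.pi * n)

s = np.convolve(e, h)            # s(n) = e(n) * h(n)
n_fft = 1024                     # >= len(s), so S(w) = E(w) H(w) holds

c_s = real_cepstrum(s, n_fft)
c_e = real_cepstrum(e, n_fft)
c_h = real_cepstrum(h, n_fft)

# The cepstrum of the speech is the sum of the component cepstra.
additivity_error = np.max(np.abs(c_s - (c_e + c_h)))

# Look for the excitation peak above an assumed low-quefrency cutoff.
cutoff = 20
peak_quefrency = cutoff + np.argmax(c_s[cutoff:n_fft // 2])
print(additivity_error, peak_quefrency)
```

In this synthetic setting the peak above the cutoff falls at the assumed pitch period, while the contribution of h(n) is concentrated below the cutoff. Real speech frames are less clean, so the cutoff would need to be chosen with the expected pitch range in mind.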
Fig 1 details the steps involved in converting a given short-term speech
signal to its cepstral domain representation. The outputs obtained at the
different stages of cepstrum computation described in Fig 1 are given in Fig 2.
In Fig 2, s(n) is the voiced frame considered and x(n) is the windowed frame,
obtained by multiplying s(n) by a Hamming window. |X(ω)| in Fig 2 represents
the magnitude spectrum of the windowed sequence x(n). As the spectrum of the
given frame is symmetric, only one half of the spectral components is plotted.
log|X(ω)| represents the log magnitude spectrum obtained by taking the
logarithm of |X(ω)|. c(n) in Fig 2 shows the computed cepstrum for the voiced
frame s(n). The obtained cepstrum contains the vocal tract and excitation
components, which are linearly combined according to Eqn. (5). As the cepstrum
is derived from the log magnitude of the linear spectrum, it is also symmetric
in the quefrency domain. Here also only one symmetric half of the cepstrum is
plotted.
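The stages described above (windowing, DFT, log magnitude, IDFT) can be sketched as follows. The function name cepstrum_stages, the synthetic voiced signal, the frame length of 512 samples and the cutoff of 20 samples are assumptions for illustration; this is not the exact procedure used to produce the figures.

```python
import numpy as np

def cepstrum_stages(s):
    """Return the intermediate outputs of the cepstrum computation for one frame."""
    x = s * np.hamming(len(s))        # windowed frame x(n)
    mag = np.abs(np.fft.fft(x))       # magnitude spectrum |X(w)|
    log_mag = np.log(mag + 1e-12)     # log|X(w)|; small offset avoids log(0)
    c = np.fft.ifft(log_mag).real     # cepstrum c(n)
    return x, mag, log_mag, c

# Synthetic voiced signal (assumed): impulse train through a damped resonance.
pitch_period = 64
e = np.zeros(2048)
e[::pitch_period] = 1.0
k = np.arange(256)
h = 0.9 ** k * np.cos(0.3 * np.pi * k)
speech = np.convolve(e, h)

s = speech[1024:1536]                 # one 512-sample voiced frame s(n)
x, mag, log_mag, c = cepstrum_stages(s)

# The spectrum and the cepstrum of a real frame are symmetric,
# so one half of each is enough for plotting.
half = len(s) // 2
is_symmetric = np.allclose(c[1:], c[:0:-1])

cutoff = 20                           # assumed boundary between system and excitation
peak_quefrency = cutoff + np.argmax(c[cutoff:half])
print(is_symmetric, peak_quefrency)
```

Plotting s, x, mag[:half], log_mag[:half] and c[:half] would give panels analogous to the stages described for Fig 2, under the assumptions stated above.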