AIM:
1) To load, display and manipulate a sample speech signal.
2) To estimate the pitch of a speech signal by the autocorrelation method.
3) To estimate the pitch of a speech signal by the cepstrum method.
INTRODUCTION:
Speech signals can be classified into voiced, unvoiced and silence regions. The near-periodic vibration of the vocal folds provides the excitation for the production of voiced speech, a random noise-like excitation is present for unvoiced speech, and there is no excitation during silence regions. The majority of speech regions are voiced in nature; these include vowels, semivowels and other voiced components. Voiced regions look like a near-periodic signal in the time domain, and over short intervals we may treat voiced speech segments as periodic for all practical analysis and processing. The periodicity associated with such segments is defined as the pitch period T0 in the time domain and the pitch frequency or fundamental frequency F0 in the frequency domain; the two are related by F0 = 1/T0 (for example, a pitch period of 5 msec corresponds to a fundamental frequency of 200 Hz). Unless specified otherwise, the term 'pitch' refers to the fundamental frequency F0. Pitch is an important attribute of voiced speech: it carries speaker-specific information and is also needed for speech coding. Estimation of pitch is therefore one of the important issues in speech processing. A large set of methods has been developed in the speech processing area for the estimation of pitch. Among them, the three most widely used are autocorrelation of speech, cepstrum pitch determination and simplified inverse filtering technique (SIFT) pitch estimation. The success of these methods is largely due to the simple steps involved in estimating the pitch. Even though the autocorrelation method is mainly of theoretical interest, it provides the framework for the SIFT method.
2. PROJECT DESCRIPTION:
2.1 Autocorrelation:
Autocorrelation measures the similarity between observations as a function of the time lag between them. It is often used in signal processing for analyzing functions or series of values, such as time-domain signals, and it is a mathematical tool for finding repeating patterns, such as a periodic signal obscured by noise, or for identifying the missing fundamental frequency implied by a signal's harmonic frequencies. Initially we should have a basic understanding of how to identify the voiced, unvoiced and silence regions of speech from their time-domain and frequency-domain representations. For this we need to plot the speech signal in the time and frequency domains: the time-domain representation is termed the waveform and the frequency-domain representation is termed the spectrum. We consider speech signals over short ranges, typically 10-30 msec, for plotting their waveforms and spectra. The time-domain and frequency-domain characteristics are distinct for the three cases. A voiced segment shows periodicity in the time domain and harmonic structure in the frequency domain. An unvoiced segment is random noise-like in the time domain, with a spectrum lacking harmonic structure. A silence region has negligible energy in either domain.
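As a minimal sketch of loading a speech file and plotting the waveform and spectrum of a short segment (the file name speech.wav, the segment location and the NumPy/SciPy/Matplotlib stack are assumptions for illustration, not part of the original material):

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

fs, x = wavfile.read("speech.wav")            # assumed file name; fs in Hz
x = x.astype(np.float64)
if x.ndim > 1:                                # keep one channel if the file is stereo
    x = x[:, 0]

start = int(1.0 * fs)                         # assumed: the segment starts at 1.0 s
seg = x[start:start + int(0.030 * fs)]        # 30 msec segment

t = np.arange(len(seg)) / fs * 1000.0         # time axis in msec
spec = np.fft.rfft(seg * np.hamming(len(seg)))
f = np.fft.rfftfreq(len(seg), d=1.0 / fs)

plt.subplot(2, 1, 1)
plt.plot(t, seg)
plt.xlabel("Time (msec)"); plt.title("Waveform")

plt.subplot(2, 1, 2)
plt.plot(f, 20.0 * np.log10(np.abs(spec) + 1e-10))
plt.xlabel("Frequency (Hz)"); plt.title("Log magnitude spectrum (dB)")

plt.tight_layout()
plt.show()

The same segment `seg` and sampling rate `fs` are assumed in the later sketches.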
Analysis of voiced speech:
We should be able to identify whether a given segment of speech, typically 20-50 msec long, is voiced or not. A voiced speech segment is characterized by its periodic nature, relatively high energy, fewer zero crossings and stronger correlation among successive samples. Voiced speech can be identified by observation of the waveform in the time domain because of its periodic nature. In the frequency domain, the presence of harmonic structure is the evidence that the segment is voiced. Further, the spectrum will have more energy, typically, in the low frequency region, and will show a downward trend as we move from zero frequency towards higher frequencies. The autocorrelation of a segment of voiced speech will have a strong peak at the pitch period. The high energy can be observed as high amplitude values in the voiced segment. However, energy alone cannot decide the voicing; periodicity is crucial along with energy to identify the voiced segment unambiguously. Similarly, the relatively low number of zero crossings can be indirectly observed as smooth variations among the sequence of sample values. The autocorrelation sequence for a given segment of voiced speech, and the pitch period read from its strongest peak, can be computed as in the sketch below (the waveform and spectrum can be plotted as in the earlier sketch).
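A minimal sketch, assuming `seg` is a NumPy array holding a voiced frame sampled at `fs` Hz as above; the 50-400 Hz pitch search range and the function name are illustrative assumptions:

import numpy as np

def pitch_by_autocorrelation(seg, fs, fmin=50.0, fmax=400.0):
    # Autocorrelation sequence of the (mean-removed) frame for lags 0..N-1.
    seg = seg - np.mean(seg)
    r = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    r = r / (r[0] + 1e-12)                    # normalise so that r[0] = 1

    # Search for the strongest peak inside the expected pitch-period range.
    lag_min = int(fs / fmax)                  # shortest period (highest pitch)
    lag_max = int(fs / fmin)                  # longest period (lowest pitch)
    lag = lag_min + np.argmax(r[lag_min:lag_max])

    T0 = lag / fs                             # pitch period in seconds
    F0 = fs / lag                             # pitch (fundamental) frequency in Hz
    return T0, F0, r

For a voiced frame the strongest peak in this lag range lies at the pitch period; for an unvoiced frame no comparable peak appears.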
Analysis of unvoiced speech:
We should be able to identify whether a given segment of speech, typically 20-50 msec long, is unvoiced or not. An unvoiced speech segment is characterized by its non-periodic nature, relatively low energy compared to voiced speech, a larger number of zero crossings and relatively little correlation among successive samples. Unvoiced speech can be identified by observation of the waveform in the time domain because of its non-periodic nature. In the frequency domain, the absence of harmonic structure is the evidence that the segment is unvoiced. Further, the spectrum will have more energy, typically, in the high frequency region, and will show an upward trend as we move from zero frequency towards higher frequencies. The autocorrelation of a segment of unvoiced speech typically matches that of random noise. The low energy can be observed as low amplitude values in the unvoiced segment. However, energy alone cannot decide the voicing; the number of zero crossings is crucial along with energy to identify the unvoiced segment unambiguously. The relatively high number of zero crossings can also be indirectly observed as rapid variations among the sequence of sample values. In the autocorrelation sequence of an unvoiced segment we find no prominent peak, unlike the case of voiced speech. This is the fundamental distinction between voiced and unvoiced speech.
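The energy, zero-crossing and autocorrelation-peak cues above can be combined into a rough voiced/unvoiced decision. A minimal sketch under the same assumptions as before; the threshold values are illustrative assumptions and depend on the amplitude scale of the recording:

import numpy as np

def is_voiced(seg, fs, energy_thresh=1e-4, zcr_thresh=0.1, peak_thresh=0.3):
    seg = seg - np.mean(seg)
    energy = np.mean(seg ** 2)                            # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(seg)))) / 2.0    # zero crossings per sample
    r = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    r = r / (r[0] + 1e-12)
    lag_min, lag_max = int(fs / 400), int(fs / 50)        # 50-400 Hz pitch range
    peak = np.max(r[lag_min:lag_max])                     # periodicity evidence
    return (energy > energy_thresh) and (zcr < zcr_thresh) and (peak > peak_thresh)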
2.2 Estimation of pitch by Cepstrum method:
A cepstrum is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal. The name "cepstrum" was derived by reversing the first four letters of "spectrum". Using the cepstrum method we can separate the vocal tract and excitation source related information in the speech signal. The main limitation of pitch estimation by the autocorrelation method is that there may be peaks larger than the peak which corresponds to the pitch period, so there is a chance of wrongly estimating the pitch. The approach to minimize such errors is to separate the vocal tract and excitation source related information in the speech signal and then use the source information for pitch estimation. The cepstral analysis of speech provides such an approach. The cepstrum of speech is defined as the inverse Fourier transform of the log magnitude spectrum. In the cepstrum, the slowly varying components of the log magnitude spectrum map to the low quefrency region and the fast varying components to the high quefrency region. In the log magnitude spectrum, the slowly varying components represent the envelope corresponding to the vocal tract, and the fast varying components represent the excitation source. As a result the vocal tract and excitation source components get represented separately in the cepstrum of speech.
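A minimal sketch of computing this real cepstrum for a short frame, under the same `seg` and `fs` assumptions as the earlier sketches:

import numpy as np

def real_cepstrum(seg):
    spectrum = np.fft.fft(seg * np.hamming(len(seg)))     # short-time spectrum
    log_mag = np.log(np.abs(spectrum) + 1e-10)            # log magnitude spectrum
    return np.real(np.fft.ifft(log_mag))                  # inverse FFT gives the cepstrum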
A plot of a 30 msec segment of voiced speech, its log magnitude spectrum and cepstrum would show that the initial few cepstral values, typically 13-15 of them, represent the vocal tract information. The large peak present after these initial values represents the excitation information, and its position, measured in number of samples from the zeroth cepstral value, gives the pitch period T0. As a result, the spurious peaks that may occur in autocorrelation analysis are naturally eliminated in cepstrum pitch determination. By comparing the cepstra of voiced and unvoiced speech we can observe that there is no prominent peak in the cepstrum of unvoiced speech after the initial 13-15 cepstral values; this is the main distinction between the cepstrum of voiced and that of unvoiced speech. For the estimation of pitch, the target peak in the cepstral sequence after the initial 2 msec (about 16 cepstral values at an 8 kHz sampling rate) gives the estimate of the pitch period in the case of voiced speech. This is the procedure for computing the pitch period by high-time liftering of the cepstrum of voiced speech.
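As a minimal sketch of this high-time liftering procedure (reusing the earlier `seg`/`fs` assumptions, with the 50-400 Hz pitch range as an added assumption), the pitch period and pitch frequency can be read from the cepstral peak beyond the first ~2 msec of quefrency:

import numpy as np

def pitch_by_cepstrum(seg, fs, fmin=50.0, fmax=400.0):
    spectrum = np.fft.fft(seg * np.hamming(len(seg)))
    cep = np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)))   # real cepstrum
    low_time = int(0.002 * fs)                 # discard ~2 msec of low-time cepstrum
    q_min = max(low_time, int(fs / fmax))      # search window for the pitch peak
    q_max = int(fs / fmin)
    peak = q_min + np.argmax(cep[q_min:q_max])
    return peak / fs, fs / peak                # pitch period T0 (s), pitch F0 (Hz)

Calling pitch_by_cepstrum(seg, fs) on a voiced frame returns the estimated T0 in seconds and F0 in Hz; on an unvoiced frame no reliable cepstral peak exists, so the result is not meaningful.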