Also from Numerically Python: 
In the last five articles we performed numerical calculations using the Numerical Python module and plotted our calculations using the DISLIN plotting package. This article concludes the series with an application for analyzing sound files on your PC. This small but interesting program will bring together many of the components we have talked about. The concise nature of the application demonstrates the power of the Python language.
In the digital age, sound files are (of course) stored digitally. Leaving the world of analog cassette tapes and phonographs behind, computers and compact discs (CDs) store music in a sampled form. Essentially, the sound wave is sampled or measured at discrete points in time and recorded onto a storage media. Digitally recorded sounds are long sequences of numbers that can easily be manipulated using computers.
Before there were computers or even digital recording, back in the '20s, Harry Nyquist showed that to preserve the integrity of a sampled sound, you must sample at twice the highest frequency you wish to record. Audiophiles know that audio equipment records and reproduces sound between 20 and 20,000 Hz (or Hertz, from turn of the century scientist Heinrich Hertz, meaning cycles per second). So it was with Nyquist in mind that the music industry designed the compact disc to accurately record audio at 44,100 samples per second. Thus a CD can record and reproduce sound up to 22,050 Hz, sufficient for even the most picky audiophile! If you are interested, there are plenty of references on the Web. As a start, check digitalrecordings.com.
Our tool will analyze files from the sound recorder available on your
computer. In Win32 operating systems, the sound recorder is found under
the Accessories part of the Start menu. (Apologies to Linux users, you
will have to interpret a bit to make the tool work under Linux.) In
Win32, the output of the sound recorder is known as a wave file. These
multimedia files contain sound information as well as a header with all
the requisite information for reproduction. A Python module, aptly named
wave
, has functions to deal with these file types.
To save a recording you must choose a sample rate, a sample width, and either mono or stereo. For sample rate, your choices are 44100 Hz, 22050 Hz, 11025 Hz, or 8000 Hz. Higher sample rates preserve higher frequencies but also result in larger files. For sample types, you can choose 8 or 16bit width. To make things simple, we are going to limit ourselves to working with 8bit samples. Finally, you can choose stereo (two channel recording) or mono (single channel recording); for this article we will be limited to files recorded in mono mode.
Using the sound recorder, the parameters of the sound file are set after a recording is made. Use the 'Save As' command to select the output filename. At the bottom of the dialog box, you will see a button for altering the parameters of the recording. A good setting to get started is "PCM 11.025 kHz, 8 bit, Mono." This will give you a file sampled at 11.025 kHz (11,025 Hz; k = Kilo = *1000) with 8bit samples recorded on a single track (mono).Our tool will create a sonogram, measuring the frequency content of the signal as a function of time and plotting the results. You can use the tool to view and characterize different sounds. With the tool you can analyze voice, music, or other sounds and see the harmonic components. Do you think that people's voices are different? With this tool, you will be able to see that they are.
The fundamental algorithm we will use to perform our measurements is known as the Fourier transform. This transform takes a sequence of time samples (our sound file) and measures the frequency components, computing how much energy is present in the sequence at various frequencies. The Fourier transform is a general technique; the actual algorithm we will be using is called the fast Fourier transform, or FFT. The FFT is a very efficient method for calculating the transform and is designed for use on digital computers. The main restriction for its use is that it works on sequences that are a power of two in length (2, 4, 8, 16, 32 ...).
Let's try an example to see if we can get a handle on using the FFT module.
First we must import the modules we are going to use:
>>> from Numeric import *
>>> from FFT import *
>>> from dislin import *
Next we create a signal:
>>> x=arange(256.0)
>>> sig = sin(2*pi*(1250.0/10000.0)*x) + \
sin(2*pi*(625.0/10000.0)*x)
Now sig
contains 256 samples of a signal, which is
comprised of two tones (generated by the sin function). The numbers in
the inner parentheses (1250/10000
and
625/10000
) specify the frequency of the tones. Assuming a
sample rate of 10,000 Hz, we generated a 1250Hz and a 625Hz tone. Now
let's plot the output of the FFT of this signal.
>>> plot(x[0:129],10*log10(abs(real_fft(sig))))
Figure 1. Plot of the energy in our signal. 
You might ask, "What happened to the other 127 samples? The output
only contains 129 outputs from the original 256 input values." Because
the FFT is designed for complex data (containing both a real and
an imaginary component), and we are using real data, the output is symmetric.
The missing 127 samples are simply replicas of the first 127 in the
output sequence. The real_fft()
function we are using knows
this and saves effort by not computing them. Using
10*log10(abs(
converts the energy measurements of the FFT
into energy on a logarithmic scale called decibels. This is better
suited for displaying signal power. The two peaks tell us that this time
sequence was comprised of two tones. The xaxis represents all
frequencies from 0 to 5000 Hz divided into 128 bins, each representing
energy found in a 5000/128 =39.0625 Hz swath of the frequency spectrum.
Peak one is located at (5000/128)*16 = 625 Hz, and peak two is located at
(5000/128)*32 = 1250 Hz. These correspond to the numbers we used to
generate the tones. The values found in the remaining bins are just
numerical roundoff noise. Compared to the tones, with a power of 20 dB,
the "noise" is ~160 dB less in power  about 1E16 in amplitude compared
to unity.

Natural sounds are comprised of many frequencies with varying amplitudes. It is the complex interaction of both frequency and amplitude varied across time that makes sound pleasant to hear. But no matter how complex the sound, the FFT can be used to decompose it and to estimate the frequency content. Performing this analysis on samples of data through the length of a recording, we can get a measure of the frequency content of a signal over time. The resulting plot is known as a sonogram or a spectrogram. The plot shows the relationship between time (xaxis), frequency (yaxis), and power (color  red is high power, blue low power).
The following plot shows the analysis of a file containing chirped whistles (I was doing my bird interpretation). The green and blue colors show the lower powered frequencies. The red color shows the variable frequency of the chirps.
Figure 2. The analysis of a file containing chirped whistles. 
The code for the program follows. You can view the complete program from here. I'll explain it as we go along:
Basic imports
from Numeric import *
from MLab import *
from FFT import *
from dislin import *
import wave
import sys
import struct
We use the open
method from the wave
module
to open a wave file and assign it to a file pointer, fp
.
Using information about the file, we then calculate some basic constants
to use in our analysis of the file. The length of the fft
is
passed in as a command line argument and should always be a power of
two:
# open the wave file
fp = wave.open(sys.argv[1],"rb")
sample_rate = fp.getframerate()
total_num_samps = fp.getnframes()
fft_length = int(sys.argv[2])
num_fft = (total_num_samps / fft_length )  2
We are going to need a matrix to store temporary results. Making it here allows for more efficient operation later:
# create temporary working array
temp = zeros((num_fft,fft_length),Float)
The data from the wave file is stored as packed numbers in a binary
form. We will read the entire file into the temporary array we just
created. As each section is read in, we unpack it into a floatingpoint
value. Notice that we are storing it into a twodimensional array where
each row contains fft_length
samples.
# read in the data from the file
for i in range(num_fft):
tempb = fp.readframes(fft_length);
temp[i,:] = array(struct.unpack("%dB"%(fft_length), \
tempb),Float)  128.0
fp.close()
Although not necessary, the next step multiplies each row of the data matrix by a "window" function. The window is used to mitigate some numerical side effects of the FFT. Since we are multiplying a vector and a matrix, we are taking advantage of broadcasting here!
# Window the data
temp = temp * hamming(fft_length)
We again make use of the real_fft()
function, which is
optimized for real (as opposed to complex) data. Notice that we are
taking the FFT of the entire array, which defaults to taking the
individual FFT of each row. The 10*log10
is again used to
convert the output into the logarithmic scale. The addition of the
constant (1E20
) solves the problem of log10(0)
being undefined. Rather than check each argument for 0, we just add a
small inconsequential value to guarantee that the 0 case never arises.
There is a lot of work going on here. This is where the muscle is
applied. The result is an array that contains the energy values found in
each of the frequencies across each of the rows (sections of the time
signal).
# Transform with the FFT, Return Power
freq_pwr = 10*log10(1e20+abs(real_fft(temp,fft_length)))
Now we use DISLIN for plotting! Most of the work is to
get the scales correct. Using the contour Quickplot function, we are
going to take a look at all of the rows of the array at the same time.
The conshade
function will plot rows against columns in a
color plot where the color relates to the magnitude of the element (in
our case, the energy in a particular frequency during a particular part of
the file).
# Plot the result
n_out_pts = (fft_length / 2) + 1
y_axis = 0.5*float(sample_rate) / n_out_pts * \
arange(n_out_pts)
x_axis = (total_num_samps / float(sample_rate)) / \
num_fft * arange(num_fft)
setvar('X',"Time (sec)")
setvar('Y',"Frequency (Hertz)")
conshade(freq_pwr,x_axis,y_axis)
disfin()
That's it! The whole function is only 40 lines of code.
Either type in the commands or download the full version  I called
my program pysono.py
. Once in hand, the tool is simple to try
out. First use the sound recorder on your PC to generate a sound file
(the tool works best for files less than 60 seconds long). Then at the
command prompt type:
pysono.py <filename> <fft_size>
Use the name of the wave file you are analyzing for
filename
and the length of the FFT to use for
fft_size
. Remember it must be a
power of two. Give it a try!
Using the tool, I analyzed the first 60 seconds of a recording of The Very Thought of You sung by Natalie Cole. The loosely parallel lines show that her voice has several frequencies present  these are harmonics and are what give the sound quality.
Figure 3. the first 60 seconds of a recording of The Very Thought of You sung by Natalie Cole. 
Using techniques explored throughout this series of articles on Numeric Python, we have created a complete, useful application in only 40 lines of code. With the tool in hand, you can now analyze different wave files. Try saying the same words at different pitches (one in normal tone, the other in falsetto)  do the sonograms look the same? Analyze the same file with varying sizes of FFTs. Can you see the difference in ability to measure frequencies? The mathematics behind the code comes from the field of DSP (digital signal processing). For more information search the web for "DSP". A good place to start is Digital Signal Processing Central.
Eric Hagemann specializes in fast algorithms for crunching numbers on all varieties of computers from embedded to mainframe.
Read more Numerically Python columns.
Discuss this article in the O'Reilly Network Python Forum.
Return to the Python DevCenter.
Copyright © 2009 O'Reilly Media, Inc.