To be applied to discrete audio data, the Fourier transform must be rendered in a discrete form. This is given in the equation for the discrete Fourier transform below.

[equation caption=”Equation 2.13 Discrete Fourier transform”]

$$!F_{n}=\frac{1}{N}\left ( \sum_{k=0}^{N-1}f_{k}\cos \frac{2\pi nk}{N}-if_{k}\sin \frac{2\pi nk}{N} \right )=\frac{1}{N}\left ( \sum_{k=0}^{N-1}f_{k}e^{\frac{-i2\pi kn}{N}} \right )\; \mathrm{where}\; i=\sqrt{-1}$$

[/equation]

[aside]

The second form of the discrete Fourier transform given in Equation 2.13, $$\frac{1}{N}\left (\sum_{k=0}^{N-1}f_{k}e^{\frac{-i2\pi kn}{N}} \right )$$, uses the constant e.  It is derivable from the first by application of Euler’s identity, $$e^{i2\pi kn}=\cos \left ( 2\pi kn \right )+i\sin \left ( 2\pi kn \right )$$. To see the derivation, see (Burg 2008).

[/aside]

Notice that we’ve switched from the function notation used in Equation 2.12 ($$F\left ( n \right )$$ and $$f\left ( k \right )$$) to array index notation in Equation 2.13 ($$F_{n}$$ and $$f_{k}$$) to emphasize that we are dealing with an array of discrete audio sample points in the time domain. Casting this equation as an algorithm (Algorithm 2.1) helps us to see how we could turn it into a computer program, where the summation becomes a loop nested inside the outer for loop.

[equation caption=”Algorithm 2.1 Discrete Fourier transform” class=”algorithm”]

/*Input:

f, an array of digitized audio samples
N, the number of samples in the array
Note:  $$i=\sqrt{-1}$$

Output:

F, an array of complex numbers which give the frequency components of the sound given by f */

for (n = 0 to N – 1 )

$$F_{n}=\frac{1}{N}\left ( \sum_{k=0}^{N-1} f_{k}\, \cos \frac{2\pi nk}{N}-if_{k}\, \sin \frac{2\pi nk}{N}\right )$$

[/equation]

[wpfilebase tag=file id=149 tpl=supplement /]

[wpfilebase tag=file id=151 tpl=supplement /]

Each time through the loop, $$F_{n}$$ is computed, giving the magnitude and phase of the nth frequency component. Each $$F_{n}$$ is a complex number with a cosine term and a sine term, the sine term having the factor i in it.

We assume that you’re familiar with complex numbers, but if not, a short introduction should be enough so that you can work with the Fourier algorithm.

A complex number takes the form $$a+bi$$, where $$i=\sqrt{-1}$$. Thus, $$f_{k}\cos\frac{2\pi nk}{N}-if_{k}\sin\frac{2\pi nk}{N}$$ is a complex number. In this case, a is replaced with $$f_{k}\cos\frac{2\pi nk}{N}$$ and b with $$-f_{k}\sin\frac{2\pi nk}{N}$$.  Handling the complex numbers in an implementation of the Fourier transform is not difficult. Although i is an imaginary number, $$\sqrt{-1}$$, and you might wonder how you’re supposed to do computation with it, you really don’t have to do anything with it at all except assume it’s there. The summation in the formula can be replaced by a loop that goes from 0 through N-1.   Each time through that loop, you add another term from the summation into an accumulating total. You can do this separately for the cosine and sine parts, setting aside i. Also, in object-oriented programming languages, you may have a Complex number class to do complex number calculations for you.
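
As an illustration, here is a minimal MATLAB sketch of this idea: a literal implementation of Algorithm 2.1 that keeps separate running totals for the cosine (real) and sine (imaginary) parts. The function name dft_literal is our own, not part of the algorithm above, and note that MATLAB’s built-in fft omits the 1/N scaling factor used here.

function F = dft_literal(f)
% f is a vector of N audio samples in the time domain
% F is a vector of N complex numbers giving the frequency components of f
    N = length(f);
    F = zeros(1, N);
    for n = 0:N-1
        realpart = 0;    %running total of the cosine terms
        imagpart = 0;    %running total of the sine terms
        for k = 0:N-1
            %f(k+1) because MATLAB arrays start at 1 while k starts at 0
            realpart = realpart + f(k+1) * cos(2*pi*n*k/N);
            imagpart = imagpart - f(k+1) * sin(2*pi*n*k/N);
        end
        %combine the two running totals into one complex number, scaled by 1/N
        F(n+1) = (realpart + 1i*imagpart)/N;
    end
end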

The result of the Fourier transform is a list of complex numbers F, each of the form $$a+bi$$, where the magnitude of the frequency component is equal to $$\sqrt{a^{2}+b^{2}}$$.

The inverse Fourier transform transforms audio data from the frequency domain back to the time domain. The inverse discrete Fourier transform is given in Algorithm 2.2.

[equation caption=”Algorithm 2.2 Inverse discrete Fourier transform” class=”algorithm”]

/*Input:

F, an array of complex numbers representing audio data in the frequency domain, the elements represented by the coefficients of their real and imaginary parts, a and b respectively
N, the number of samples in the array

Note: $$i=\sqrt{-1}$$

Output: f, an array of audio samples in the time domain*/

for (k = 0 to N – 1)

$$f_{k}=\sum_{n=0}^{N-1}\left ( a_{n}\cos\frac{2\pi nk}{N}+ib_{n}\sin\frac{2\pi nk}{N} \right )$$

[/equation]

If you know how to program, it’s not difficult to write your own discrete Fourier transform and its inverse through a literal implementation of the equations above.  However, the “literal” implementation of the transform is computationally expensive.  The equation in Algorithm 2.1 has to be applied N times, where N is the number of audio samples.  The equation itself has a summation that goes over N elements.  Thus, the discrete Fourier transform takes on the order of  $$N^{2}$$ operations.

[wpfilebase tag=file id=160 tpl=supplement /]

The fast Fourier transform (FFT) is a more efficient implementation of the Fourier transform that does on the order of  $$N\ast log_{2}N$$ operations.  The algorithm is made more efficient by eliminating duplicate mathematical operations.  The FFT is the version of the Fourier transform that you’ll often see in audio software and applications.  For example, Adobe Audition uses the FFT to generate its frequency analysis view, as shown in Figure 2.41.
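
If you want a rough sense of the difference in running time, you can time the two approaches in MATLAB on a short signal. This is only a sketch, assuming the dft_literal function sketched earlier.

n = 4096; %keep N small; the literal version is slow
sig = rand(1, n);
tic; F1 = dft_literal(sig); toc %on the order of N^2 operations
tic; F2 = fft(sig); toc %on the order of N*log2(N) operations

The two results differ only by the 1/N scaling factor discussed above.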

Figure 2.41 Frequency analysis view (left) and waveform view (right) in Adobe Audition, showing audio data in the frequency domain and time domain, respectively

Generally when you work with digital audio, you don’t have to implement your own FFT. Efficient implementations already exist in many programming language libraries. For example, MATLAB has FFT and inverse FFT functions, fft and ifft, respectively. We can use these to experiment and generate graphs of sound data in the frequency domain. First, let’s use sine functions to generate arrays of numbers that simulate single-pitch sounds. We’ll make three one-second-long sounds using the standard sampling rate for CD-quality audio, 44,100 samples per second. We begin by generating an array of sr*s numbers across which we can evaluate sine functions, putting this array in the variable t.

sr = 44100; %sr is sampling rate
s = 1; %s is number of seconds
t = linspace(0, s, sr*s);

Now we use the array t as input to sine functions at three different frequencies and phases, creating the note A at three different octaves (110 Hz, 220 Hz, and 440 Hz).

x = cos(2*pi*110*t);
y = cos(2*pi*220*t + pi/3);
z = cos(2*pi*440*t + pi/6);

x, y, and z are arrays of numbers that can be used as audio samples. pi/3 and pi/6 represent phase shifts for the 220 Hz and 440 Hz waves, to make our phase response graph more interesting. The figures can be displayed with the following:

figure;
plot(t,x);
axis([0 0.05 -1.5 1.5]);
title('x');
figure;
plot(t,y);
axis([0 0.05 -1.5 1.5]);
title('y');
figure;
plot(t,z);
axis([0 0.05 -1.5 1.5]);
title('z');

We look at only the first 0.05 seconds of the waveforms in order to see their shape better. You can see the phase shifts in the figures below. The second and third waves don’t start at 0 on the vertical axis.

Figure 2.42 110 Hz, no phase offset
Figure 2.43 220 Hz, π/3 phase offset
Figure 2.44 440 Hz, π/6 phase offset

Now we add the three sine waves to create a composite wave that has three frequency components at three different phases.

a = (x + y + z)/3;

Notice that we divide the summed sound waves by three so that the sound doesn’t clip. You can graph the three-component sound wave with the following:

figure;
plot(t, a);
axis([0 0.05 -1.5 1.5]);
title('a = x + y + z');
Figure 2.45 Time domain data for a 3-component waveform

This is a graph of the sound wave in the time domain. You could call it an impulse response graph, although when you’re looking at a sound file like this, you usually just think of it as “sound data in the time domain.” The term “impulse response” is used more commonly for time domain filters, as we’ll see in Chapter 7. You might want to play the sound to be sure you have what you think you have. The sound function requires that you tell it the number of samples it should play per second, which for our simulation is 44,100.

sound(a, sr);

When you play the sound file and listen carefully, you can hear that it has three tones. MATLAB’s Fourier transform (fft) returns an array of double complex values (double-precision complex numbers) that represent the magnitudes and phases of the frequency components.

fftdata = fft(a);

In MATLAB’s workspace window, fftdata values are labeled as type double, giving the impression that they are real numbers, but this is not the case. In fact, the Fourier transform produces complex numbers, which you can verify by trying to plot them in MATLAB. The magnitudes of the complex numbers are given in the Min and Max fields, which are computed by the abs function. For a complex number $$a+bi$$, the magnitude is computed as $$\sqrt{a^{2}+b^{2}}$$. MATLAB does this computation and yields the magnitude.

Figure 2.46 Workspace in MATLAB showing values and types of variables currently in memory

To plot the results of the fft function such that the values represent the magnitudes of the frequency components, we first apply the abs function to fftdata.

fftmag = abs(fftdata);

Let’s plot the frequency components to be sure we have what we think we have. For a sampling rate of sr on an array of sample values of size N, the Fourier transform returns the magnitudes of $$N/2$$ frequency components evenly spaced between 0 and sr/2 Hz. (We’ll explain this completely in Chapter 5.)   Thus, we want to display frequencies between 0 and sr/2 on the horizontal axis, and only the first sr/2 values from the fftmag vector.

figure;
freqs = [0: (sr/2)-1];
plot(freqs, fftmag(1:sr/2));

[aside]If we were to zoom in more closely on each of these spikes at frequencies 110, 220, and 440 Hz, we would see that they are not perfectly vertical lines.  The “imperfect” results of the FFT will be discussed later in the sections on FFT windows and windowing functions.[/aside] When you do this, you’ll see that all the frequency components are way over on the left side of the graph. Since we know our frequency components should be 110 Hz, 220 Hz, and 440 Hz, we might as well look at only the first, say, 600 frequency components so that we can see the results better. One way to zoom in on the frequency response graph is to use the zoom tool in the graph window; alternatively, you can reset the axis properties in the command window, as follows.

axis([0 600 0 8000]);

This yields the frequency response graph for our composite wave, which shows the three frequency components.

Figure 2.47 Frequency response graph for a 3-component wave

To get the phase response graph, we need to extract the phase information from the fftdata. This is done with the angle function. We leave that as an exercise.

Let’s try the Fourier transform on a more complex sound wave – a sound file that we read in.

y = audioread('HornsE04Mono.wav');

As before, you can get the Fourier transform with the fft function.

fftdata = fft(y);

You can then get the magnitudes of the frequency components and generate a frequency response graph from this.

fftmag = abs(fftdata);
figure;
freqs = [0:(sr/2)-1];
plot(freqs, fftmag(1:sr/2));
axis([0 sr/2 0 4500]);
title('frequency response for HornsE04Mono.wav'); 

Let’s zoom in on frequencies up to 5000 Hz.

axis([0 5000 0 4500]);

The graph below is generated.

Figure 2.48 Frequency response for HornsE04Mono.wav

The inverse Fourier transform gives us back our original sound data in the time domain.

ynew = ifft(fftdata);

If you compare y with ynew, you’ll see that the inverse Fourier transform has recaptured the original sound data.
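
One quick way to check this (a minimal sketch) is to look at the largest difference between the original and reconstructed samples, which should be on the order of floating-point round-off error:

max(abs(y - ynew)) %should be vanishingly small, e.g., around 1e-16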

When we applied the Fourier transform in MATLAB in Section 2.3.9, we didn’t specify a window size. Thus, we were applying the FFT to the entire piece of audio. If you listen to the WAV file HornsE04Mono.wav, a three-second clip, you’ll first hear some low tubas and then some higher trumpets. Our graph of the FFT shows frequency components up to and beyond 5000 Hz, which reflects the sounds across the entire three seconds. What if we do the FFT on just the first second (44100 samples) of this WAV file, as follows? The resulting frequency components are shown in Figure 2.49.

y = audioread('HornsE04Mono.wav');
sr = 44100;
freqs = [0:(sr/2)-1];
ybegin = y(1:44100);
fftdata2 = fft(ybegin);
fftdata2 = fftdata2(1:22050);
plot(freqs, abs(fftdata2));
axis([0 5000 0 4500]);
Figure 2.49 Frequency components of first second of HornsE04Mono.wav

What we’ve done is focus on one short window of time in applying the FFT. An FFT window is a contiguous segment of audio samples on which the transform is applied. If you consider the nature of sound and music, you’ll understand why applying the transform to relatively small windows makes sense. In many of our examples in this book, we generate segments of sound that consist of one or more frequency components that do not change over time, like a single pitch note or a single chord being played without change. These sounds are good for experimenting with the mathematics of digital audio, but they aren’t representative of the music or sounds in our environment, in which the frequencies change constantly. The WAV file HornsE04Mono.wav serves as a good example. The clip is only three seconds long, but the first second is very different in frequencies (the pitches of tubas) from the last two seconds (the pitches of trumpets). When we do the FFT on the entire three seconds, we get a kind of “blurred” view of the frequency components, because the music actually changes over the three second period. It makes more sense to look at small segments of time. This is the purpose of the FFT window.
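
The idea of stepping through a sound window by window can be sketched directly in MATLAB. The code below is only an illustration; the window size of 2048 samples and the use of consecutive, non-overlapping windows are our own choices. Each column of the matrix mags holds the magnitudes of the frequency components for one window of time.

w = 2048; %window size in samples
numwindows = floor(length(y)/w); %number of complete windows in the sound
mags = zeros(w/2, numwindows);
for i = 1:numwindows
    segment = y((i-1)*w + 1 : i*w); %the ith window of audio samples
    fftseg = fft(segment);
    mags(:, i) = abs(fftseg(1:w/2)); %magnitudes of the first w/2 components
end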

Figure 2.50 shows an example of how FFT window sizes are used in audio processing programs. Notice the drop-down menu, which gives you a choice of FFT sizes ranging from 32 to 65536 samples. The FFT window size is typically a power of 2. If your sampling rate is 44,100 samples per second, then a window size of 32 samples is about 0.0007 s, and a window size of 65536 is about 1.486 s.

There’s a tradeoff in the choice of window size. A small window focuses on the frequencies present in the sound over a short period of time. However, as mentioned earlier, the number of frequency components yielded by an FFT of size N is N/2. Thus, for a window size of, say, 128, only 64 frequency bands are output, these bands spread over the frequencies from 0 Hz to sr/2 Hz, where sr is the sampling rate. (See Chapter 5.) For a window size of 65536, 32768 frequency bands are output, which seems like a good thing, except that with the large window size, the FFT is not isolating a short moment of time. A window size of around 2048 usually gives good results. If you set the size to 2048 and play the piece of music loaded into Audition, you’ll see the frequencies in the frequency analysis view bounce up and down, reflecting the changing frequencies in the music as time passes.

Figure 2.50 Choice of FFT window size in Adobe Audition

In addition to choosing the FFT window size, audio processing programs often let you choose from a number of windowing functions. The purpose of an FFT windowing function is to smooth out the discontinuities that result from applying the FFT to segments (i.e., windows) of audio data. A simplifying assumption for the FFT is that each windowed segment of audio data contains an integral number of cycles, and that this cycle repeats throughout the audio. This, of course, is not generally the case. If it were the case – that is, if the window ended exactly where a cycle ended – then the end of the window would be at exactly the same amplitude as the beginning. The beginning and end would “match up.” The actual discontinuity between the end of a window and its beginning is interpreted by the FFT as a jump from one level to another, as shown in Figure 2.51. (In this figure, we’ve cut and pasted a portion from the beginning of the window to its end to show that the ends don’t match up.)

Figure 2.51 Discontinuity between the end of a window and its beginning

In the output of the FFT, the discontinuity between the ends and the beginnings of the windows manifests itself as frequency components that don’t really exist in the audio – called spurious frequencies, or spectral leakage. You can see the spectral leakage in Figure 2.41. Although the audio signal actually contains only one frequency at 880 Hz, the frequency analysis view indicates that there is a small amount of other frequencies across the audible spectrum.

In order to smooth over this discontinuity and thereby reduce the amount of spectral leakage, the windowing functions effectively taper the ends of the segments to 0 so that they connect from beginning to end. The drop-down menu to the left of the FFT size menu in Audition is where you choose the windowing function. In Figure 2.50, the Hanning function is chosen. Four commonly-used windowing functions are given in the table below.

Figure 2.52 Windowing functions

Windowing functions are easy to apply. The segment of audio data being transformed is simply multiplied by the windowing function before the transform is applied. In MATLAB, you can accomplish this with vector multiplication, as shown in the commands below.

y = audioread('HornsE04Mono.wav');
sr = 44100; %sampling rate
w = 2048; %window size
T = w/sr; %period
% t is an array of times at which the hamming function is evaluated
t = linspace(0, 1, 44100);
twindow = t(1:2048); %first 2048 elements of t
% Create the values for the hamming function, stored in vector called hamming
hamming = 0.54 - 0.46 * cos((2 * pi * twindow)/T);
plot(hamming);
title('Hamming');

The Hamming function is shown in Figure 2.53.

Figure 2.53 Hamming windowing function
yshort = y(1:2048); %first 2048 samples from sound file
%Multiply the audio values in the window by the Hamming function values,
% using element-by-element multiplication with .*.
% hamming is a row vector and yshort is a column vector,
% so first convert hamming to a column vector with the transpose operator '
ywindowed = hamming' .* yshort;
figure;
plot(yshort);
title('First 2048 samples of audio data');
figure;
plot(ywindowed);
title('First 2048 samples of audio data, tapered by windowing function');

Before the Hamming function is applied, the first 2048 samples of audio data look like this:

Figure 2.54 Audio data

After the Hamming function is applied, the audio data look like this:

Figure 2.55 Audio data after application of Hamming windowing function

Notice that the ends of the segment are tapered toward 0.

Figure 2.56 compares the FFT results with no windowing function vs. with the Hamming windowing function applied. The windowing function eliminates some of the high frequency components that are caused by spectral leakage.

figure
plot(abs(fft(yshort)));
axis([0 300 0 60]);
hold on;
plot(abs(fft(ywindowed)),'r');
Figure 2.56 Comparing FFT results with and without windowing function

[wpfilebase tag=file id=126 tpl=supplement /]

[separator top=”1″ bottom=”0″ style=”none”]

If you want to work at an even lower level of abstraction, a good environment for experimentation is the Linux operating system using “from scratch” programs written in C++. In our first example C++ sound program, we show you how to create sound waves of a given frequency, add frequency components to get a complex wave, and play the sounds via the sound device. This program is another implementation of the exercises in Max and MATLAB in Sections 2.3.1 and 2.3.3.   The C++ program is given in Program 2.4.  [aside]In this example program, 8 bits are used to store each audio sample.  That is, the bit depth is 8. The sound library also allows a bit depth of 16.  The concept of bit depth will be explained in detail in Chapter 5.

[/aside]

//This program uses the OSS library.
#include <sys/ioctl.h> //for ioctl()
#include <math.h> //sin(), floor(), and pow()
#include <stdio.h> //perror
#include <fcntl.h> //open, O_WRONLY
#include <linux/soundcard.h> //SOUND_PCM*
#include <iostream>
#include <unistd.h>
using namespace std;

#define TYPE char
#define LENGTH 1 //number of seconds per frequency
#define RATE 44100 //sampling rate
#define SIZE sizeof(TYPE) //size of sample, in bytes
#define CHANNELS 1 //number of audio channels
#define PI 3.14159
#define NUM_FREQS 3 //total number of frequencies
#define BUFFSIZE (int) (NUM_FREQS*LENGTH*RATE*SIZE*CHANNELS) //bytes sent to audio device
#define ARRAYSIZE (int) (NUM_FREQS*LENGTH*RATE*CHANNELS) //total number of samples
#define SAMPLE_MAX (pow(2,SIZE*8 - 1) - 1) 

void writeToSoundDevice(TYPE buf[], int deviceID) {
	int status;
	status = write(deviceID, buf, BUFFSIZE);
	if (status != BUFFSIZE)
		perror("Wrote wrong number of bytes\n");
	status = ioctl(deviceID, SNDCTL_DSP_SYNC, 0);
	if (status == -1)
		perror("SNDCTL_DSP_SYNC failed\n");
}

int main() {
	int deviceID, arg, status, f, t, a, i;
	TYPE buf[ARRAYSIZE];
	deviceID = open("/dev/dsp", O_WRONLY, 0);
	if (deviceID < 0)
		perror("Opening /dev/dsp failed\n");
// working
	arg = SIZE * 8;
	status = ioctl(deviceID, SNDCTL_DSP_SETFMT, &arg);
	if (status == -1)
		perror("Unable to set sample size\n");
	arg = CHANNELS;
	status = ioctl(deviceID, SNDCTL_DSP_CHANNELS, &arg);
	if (status == -1)
		perror("Unable to set number of channels\n");
	arg = RATE;
	status = ioctl(deviceID, SNDCTL_DSP_SPEED, &arg);
	if (status == -1)
		perror("Unable to set sampling rate\n");
	a = SAMPLE_MAX;
	for (i = 0; i < NUM_FREQS; ++i) {
		switch (i) {
			case 0:
				f = 262;
				break;
			case 1:
				f = 330;
				break;
			case 2:
				f = 392;
				break;
		}
		for (t = 0; t < ARRAYSIZE/NUM_FREQS; ++t) {
			buf[t + ((ARRAYSIZE / NUM_FREQS) * i)] = floor(a * sin(2*PI*f*t/RATE));
		}
	}
	writeToSoundDevice(buf, deviceID);
}

Program 2.4 Adding sine waves and sending sound to sound device in C++

To be able to compile and run a program such as this, you need to install a sound library in your Linux environment. At the time of the writing of this chapter, the two standard low-level sound libraries for Linux are the OSS (Open Sound System) and ALSA (Advanced Linux Sound Architecture). A sound library provides a software interface that allows your program to access the sound devices, sending and receiving sound data. ALSA is the newer of the two libraries and is preferred by most users. At a slightly higher level of abstraction are PulseAudio and Jack, applications which direct multiple sound streams from their inputs to their outputs. Ultimately, PulseAudio and Jack use lower level libraries to communicate directly with the sound cards.

[wpfilebase tag=file id=73 tpl=supplement /]

Program 2.4 uses the OSS library. In a program such as this, the sound device is opened, read from, and written to in a way similar to how files are handled. The sample program shows how you open /dev/dsp, an interface to the sound card device, to ask this device to receive audio data. The variable deviceID serves as a handle to the sound device and is passed to the ioctl calls that tell the device the size of data to expect, the number of channels, and the data rate. We’ve set a size of eight bits (one byte) per audio sample, one channel, and a data rate of 44,100 samples per second. The significance of these numbers will be clearer when we talk about digitization in Chapter 5. The buffer size is a product of the sample size, data rate, and length of the recording (in this case, three seconds), yielding a buffer of 44,100 * 3 bytes.

The sound wave is created by taking the sine of the appropriate frequency (262 Hz, for example) at 44,100 evenly-spaced intervals for one second of audio data. The value returned from the sine function is between -1 and 1. However, the sound card expects a value that is stored in one byte (i.e., 8 bits), ranging from -128 to 127. To put the value into this range, we multiply by 127 and, with the floor function, round down.

The three frequencies are created and concatenated into one array of audio values. The write function has the device ID, the name of the buffer for storing the sound data, and the size of the buffer as its parameters. This function sends the sound data to the sound card to be played. The three frequencies together produce a harmonious chord in the key of C. In Chapter 3, we’ll explore what makes these frequencies harmonious.

[wpfilebase tag=file id=75 tpl=supplement /]

The program requires some header files for definitions of constants like O_WRONLY (restricting access to the sound device to writing) and SOUND_PCM_WRITE_BITS. After you install the sound libraries, you’ll need to locate the appropriate header files and adjust the #include statements accordingly. You’ll also need to check the way your compiler handles the math and sound libraries. You may need to include the option -lm on the compile line to include the math library, or the -lasound option for the ALSA library.

This program introduces you to the notion that sound must be converted to a numeric format that is communicable to a computer. The solution to the programming assignment given as a learning supplement has an explanation of the variables and constants in this program. A full understanding of the program requires that you know something about sampling and quantization, the two main steps in analog-to-digital conversion, a topic that we’ll examine in depth in Chapter 5.

The Java environment allows the programmer to take advantage of Java libraries for sound and to benefit from object-oriented programming features like encapsulation, inheritance, and interfaces. In this chapter, we are going to use the package javax.sound.sampled. This package provides functionality to capture, mix, and play sounds with classes such as SourceDataLine, AudioFormat, AudioSystem, and LineUnavailableException.

Program 2.5 uses a SourceDataLine object. This is the object to which we write audio data. Before doing that, we must set up the data line object with a specified audio format object. (See line 30.) The AudioFormat class specifies a certain arrangement of data in the sound stream, including the sampling rate, sample size in bits, and number of channels. A SourceDataLine object is created with the specified format, which in the example is 44,100 samples per second, eight bits per sample, and one channel for mono. With this setting, the line gets the required system resources and becomes operational. After the SourceDataLine is opened, data is written to the mixer using a buffer that contains data generated by a sine function. Notice that we don’t directly access the sound device because we are using a SourceDataLine object to deliver data bytes to the mixer. The mixer mixes the samples and finally delivers them to an audio output device on a sound card.

 import javax.sound.sampled.AudioFormat;
 import javax.sound.sampled.AudioSystem;
 import javax.sound.sampled.SourceDataLine;
 import javax.sound.sampled.LineUnavailableException;

 public class ExampleTone1{

   public static void main(String[] args){

     try {
         ExampleTone1.createTone(262, 100);
     } catch (LineUnavailableException lue) {
         System.out.println(lue);
     }
   }

   /** parameters are frequency in Hertz and volume
   **/
   public static void createTone(int Hertz, int volume)
     throws LineUnavailableException {
     /** Exception is thrown when line cannot be opened */

     float rate = 44100;
     byte[] buf;
     AudioFormat audioF;

     buf = new byte[1];
     audioF = new AudioFormat(rate,8,1,true,false);
     //sampleRate, sampleSizeInBits,channels,signed,bigEndian

     SourceDataLine sourceDL = AudioSystem.getSourceDataLine(audioF);
     sourceDL.open(audioF);
     sourceDL.start();

     for(int i=0; i<rate; i++){
       double angle = (i/rate)*Hertz*2.0*Math.PI;
       buf[0]=(byte)(Math.sin(angle)*volume);
       sourceDL.write(buf,0,1);
     }

     sourceDL.drain();
     sourceDL.stop();
     sourceDL.close();
   }
 }

Program 2.5 A simple sound generating program in Java

This program illustrates a simple way of generating a sound by using a sine wave and the javax.sound.sampled library. If we change the values of the createTone procedure parameters, which are 262 Hz for frequency and 100 for volume, we can produce a different tone. The second parameter, volume, is used to change the amplitude of the sound. Notice that the sine function result is multiplied by the volume parameter in line 40.

Although the purpose of this section of the book is not to demonstrate how Java graphics classes are used, it may be helpful to use some basic plot features in Java to generate sine wave drawings. An advantage of Java is that it facilitates your control of windows and containers. We inherit this functionality from the JPanel class, which is a container where we are going to paint the generated sine wave. Program 2.6 is a variation of Program 2.5. It produces a Java window by using the paintComponent method. The sine wave generated again has a frequency of 262 Hz and a volume of 100.

 import javax.sound.sampled.AudioFormat;
 import javax.sound.sampled.AudioSystem;
 import javax.sound.sampled.SourceDataLine;
 import javax.sound.sampled.LineUnavailableException;

 import java.awt.*;
 import java.awt.geom.*;
 import javax.swing.*;

 public class ExampleTone2 extends JPanel{

   static double[] sines;
   static int vol;

   public static void main(String[] args){

     try {
         ExampleTone2.createTone(262, 100);
     } catch (LineUnavailableException lue) {
         System.out.println(lue);
     }

     //Frame object for drawing
     JFrame frame = new JFrame();
     frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
     frame.add(new ExampleTone2());
     frame.setSize(800,300);
     frame.setLocation(200,200);
     frame.setVisible(true);
   }

   public static void createTone(int Hertz, int volume)
     throws LineUnavailableException {

     float rate = 44100;
     byte[] buf;
     buf = new byte[1];
     sines = new double[(int)rate];
     vol=volume;

     AudioFormat audioF;
     audioF = new AudioFormat(rate,8,1,true,false);

     SourceDataLine sourceDL = AudioSystem.getSourceDataLine(audioF);
     sourceDL.open(audioF);
     sourceDL.start();

     for(int i=0; i<rate; i++){
       double angle = (i/rate)*Hertz*2.0*Math.PI;
       buf[0]=(byte)(Math.sin(angle)*vol);
       sourceDL.write(buf,0,1);

       sines[i]=(double)(Math.sin(angle)*vol);
     }

     sourceDL.drain();
     sourceDL.stop();
     sourceDL.close();
   }

   protected void paintComponent(Graphics g) {
         super.paintComponent(g);
         Graphics2D g2 = (Graphics2D)g;
         g2.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                             RenderingHints.VALUE_ANTIALIAS_ON);

         int pointsToDraw=4000;
         double max=sines[0];
         for(int i=1;i<pointsToDraw;i++)  if (max<sines[i]) max=sines[i];
         int border=10;
         int w = getWidth();
         int h = (2*border+(int)max);

         double xInc = 0.5;

         //Draw x and y axes
         g2.draw(new Line2D.Double(border, border, border, 2*(max+border)));
         g2.draw(new Line2D.Double(border, (h-sines[0]), w-border, (h-sines[0])));

         g2.setPaint(Color.red);

         for(int i = 0; i < pointsToDraw; i++) {
             double x = border + i*xInc;
             double y = (h-sines[i]);
             g2.fill(new Ellipse2D.Double(x-2, y-2, 2, 2));
         }
    }
 }

Program 2.6 Visualizing the sound waves in a Java program

If we increase the value of the frequency in line 18 to 400 Hz, we can see how the number of cycles increases, as shown in Figure 2.57. On the other hand, by increasing the volume, we obtain a higher amplitude for the wave.

Figure 2.57 Sound waves generated in a Java program

[wpfilebase tag=file id=77 tpl=supplement /]

We can also create square, triangle, and sawtooth waves in Java by modifying the for loop in lines 49 to 52. For example, to create a square wave, we may change the for loop to something like the following:

 for(int i=0; i<rate; i++){
   double angle1 = i/rate*Hertz*1.0*2.0*Math.PI;
   double angle2 = i/rate*Hertz*3.0*2.0*Math.PI;
   double angle3 = i/rate*Hertz*5.0*2.0*Math.PI;
   double angle4 = i/rate*Hertz*7.0*2.0*Math.PI;

   buf[0]=(byte)(Math.sin(angle1)*vol+
	Math.sin(angle2)*vol/3+Math.sin(angle3)*vol/5+
	Math.sin(angle4)*vol/7);
   sourceDL.write(buf,0,1);
   sines[i]=(double)(Math.sin(angle1)*vol+
	Math.sin(angle2)*vol/3+Math.sin(angle3)*vol/5+
	Math.sin(angle4)*vol/7);
 }

This for loop produces the waveform shown in Figure 2.58. This graph doesn’t look like a perfect square wave, but the more harmonic frequencies we add, the closer we get to a square wave. (Note that you can create these waveforms more exactly by adapting the Octave programs above to Java.)

Figure 2.58 Creating a square wave in Java

In addition to references cited in previous chapters:

 

Burg, Jennifer.  The Science of Digital Media.  Prentice-Hall, 2008.

Everest, F. Alton. Critical Listening Skills for Audio Professionals. Boston, MA: Course Technology CENGAGE Learning, 2007.

Jaffee, D.  1987.  “Spectrum Analysis Tutorial, Part 1:  The Discrete Fourier Transform.”  Computer Music Journal 11 (2): 9-24.

__________.  1987.  “Spectrum Analysis Tutorial, Part 2:  Properties and Applications of the Discrete Fourier Transform.”  Computer Music Journal 11 (3): 17-35.

Kientzle, Tim. A Programmer’s Guide to Sound. Reading, MA: Addison-Wesley Developers Press, 1998.

Rossing, Thomas, F. Richard Moore, and Paul A. Wheeler. The Science of Sound. 3rd ed. San Francisco, CA: Addison-Wesley Developers Press, 2002.

Smith, David M.  Engineering Computation with MATLAB.  Boston:  Pearson/Addison Wesley, 2008.

Steiglitz, K.  A Digital Signal Processing Primer.  Prentice-Hall, 1996.

Even if you’re not a musician, if you plan to work in the realm of digital sound you’ll benefit from an understanding of the basic concepts and vocabulary of music. The purpose of this chapter is to give you this foundation.

This chapter describes the vocabulary and musical notation of the Western music tradition – the music tradition that began with classical composers like Bach, Mozart, and Beethoven and that continues as the historical and theoretic foundation of music in the United States, Europe, and Western culture. The major and minor scales and chords are taken from this context, which we refer to as Western music. Many other types of note progressions and intervals have been used in other cultures and time periods, leading to quite different characteristic sounds: the modes of ancient Greece, the Gregorian chants of the Middle Ages, the pentatonic scale of ancient Oriental music, the Hindu 22 note octave, or the whole tone scale of Debussy, for example. While we won’t cover these, we encourage the reader to explore these other musical traditions.

To give us a common language for understanding music, we focus our discussion on the musical notation used for keyboards like the piano. Keyboard music expressed and notated in the Western tradition provides a good basic knowledge of music and gives us a common vocabulary when we start working with MIDI in Chapter 6.

Musicians learn to sing, play instruments, and compose music using a symbolic language of music notation. Before we can approach this symbolic notation, we need to establish a basic vocabulary.

In the vocabulary of music, a sound with a single fundamental frequency is called a tone. The fundamental frequency of a tone is the frequency that gives the tone its essential pitch. The piccolo plays tones with higher fundamental frequencies than the frequencies of a flute, and thus it is higher pitched.

A tone that has an onset and a duration is called a note. The onset of the note is the moment when it begins. The duration is the length of time that the note remains audible. Notes can be represented symbolically in musical notation, as we’ll see in the next section. We will also use the word “note” interchangeably with “key” when referring to a key on a keyboard and the sound it makes when struck.

As described in Chapter 2, tones created by musical instruments, including the human voice, are not single-frequency. These tones have overtones at frequencies higher than the fundamental. The overtones create a timbre, which distinguishes the quality of the tone of one instrument or singer from another. Overtones add a special quality to the sound, but they don’t change our overall perception of the pitch. When the frequency of an overtone is an integer multiple of the fundamental frequency, it is a harmonic overtone. Stated mathematically for frequencies $$f_{1}$$ and $$f_{2}$$, if $$f_{2}=nf_{1}$$ and n is a positive integer, then $$f_{2}$$ is a harmonic frequency relative to fundamental frequency $$f_{1}$$. Notice that every frequency is a harmonic frequency relative to itself. It is called the first harmonic, since $$n=1$$. The second harmonic is the frequency where $$n=2$$. For example, the second harmonic of 440 Hz is 880 Hz; the third harmonic of 440 Hz is 3*440 Hz = 1320 Hz; the fourth harmonic of 440 Hz is 4*440 Hz = 1760 Hz; and so forth. Musical instruments like pianos and violins have harmonic overtones. Drum beats and other non-pitched sounds have overtones that are not harmonic.

Another special relationship among frequencies is the octave. For frequencies $$f_{1}$$ and $$f_{2}$$, if $$f_{2}=2^{n}f_{1}$$ where n is a positive integer, then $$f_{1}$$ and $$f_{2}$$ “sound the same,” except that $$f_{2}$$ is higher pitched than $$f_{1}$$. Frequencies $$f_{1}$$ and $$f_{2}$$ are separated by n octaves. Another way to describe the octave relationship is to say that each time a frequency is moved up an octave, it is multiplied by 2. A frequency of 880 Hz is one octave above 440 Hz; 1760 Hz is two octaves above 440 Hz; 3520 Hz is three octaves above 440 Hz; and so forth. Two notes separated by one or more octaves are considered equivalent in that one can replace the other in a musical composition without disturbing the harmony of the composition.

In Western music, an octave is separated into 12 frequencies corresponding to notes on a piano keyboard, named as shown in Figure 3.1. From C to B we have 12 notes, and then the next octave starts with another C, after which the sequence of letters repeats. An octave can start on any letter, as long as it ends on the same letter. (The sequence of notes is called an octave because there are eight notes in a diatonic scale, as is explained below.) The white keys are labeled with the letters. Each of the black keys can be called by one of two names. If it is named relative to the white key to its left, a sharp symbol is added to the name, denoted C#, for example. If it is named relative to the white key to its right, a flat symbol is added to the name, denoted D♭, for example.

Figure 3.1 Keyboard showing octave and key labels

Each note on a piano keyboard corresponds to a physical key that can be played. There are 88 keys on a standard piano keyboard. MIDI keyboards are usually smaller. Since the notes from A through G are repeated on the keyboard, they are sometimes named by the number of the octave that they’re in, as shown in Figure 3.2.

Figure 3.2 MIDI keyboard

Middle C on a standard piano has a frequency of approximately 262 Hz. On a piano with 88 keys, middle C is the fourth C, so it is called C4. On the smaller MIDI keyboard shown above, it is C3. Middle C is the central position for playing the piano, with regard to where the right and left hands of the pianist are placed. The standard reference point for tuning a piano is the A above middle C, which has a frequency of 440 Hz. This means that the next A going up the keys to the right has a frequency of 880 Hz. A note of 880 Hz is one octave away from 440 Hz, and both are called A on a piano keyboard.

The interval between two consecutive keys (also called notes) on a keyboard, whether the keys are black or white, is called a semitone. A semitone is the smallest frequency distance between any two notes. Neighboring notes on a piano keyboard (and equivalently, two neighboring notes on a chromatic scale) are separated by a frequency factor of approximately 1.05946. This relationship is described more precisely in the equation below.

[equation caption=”Equation 3.1″]

Let f be the frequency of a note k. Then the note one octave above f has a frequency of $$2f$$. Given this octave relationship and the fact that there are 12 notes in an octave, the frequency of the note after k on a chromatic scale is $$\sqrt[12]{2}\, f\approx 1.05946\, f$$.

[/equation]

Thus, the factor 1.05946 defines a semitone. If two notes are separated by a semitone, then the frequency of the second is 1.05946 times the frequency of the first. The other frequencies between semitones are not used in Western music (except in pitch bending).
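
For example, the note two semitones above A 440 Hz (the note B) has a frequency of $$440\ast 2^{\frac{2}{12}}\approx 493.88$$ Hz, and moving up twelve semitones gives $$440\ast 2^{\frac{12}{12}}=880$$ Hz, exactly one octave above.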

Two semitones constitute a whole tone, as illustrated in Figure 3.3. Semitones and whole tones can also be called half steps and whole steps (or just steps), respectively.

Figure 3.3 Semitones and whole tones

The symbol #, called a sharp, denotes that a note is to be raised by a semitone. When you look at the keyboard in Figure 3.3, you can see that moving up a semitone from E takes you to the F key. Thus E# denotes and sounds the same note as F. When two notes have different names but are the same pitch, they are said to be enharmonically equivalent.

The symbol ♭, called a flat, denotes that a note is to be lowered by a semitone. C♭ is enharmonically equivalent to B. A natural symbol ♮ removes a sharp or flat from a note when it follows the same note in a measure. Sharps, flats, and naturals are examples of accidentals, symbols that raise or lower a note by a semitone.
