Principles of Digital Audio
The principles underlying almost all digital audio applications, whether digital synthesis, sampling, digital recording, or CD playback, are based on the following concepts.
To understand this article, you will need a basic understanding of binary numbers, bits, and bytes. For help, see these terms: sample, ADC, sampling rate, Nyquist theorem, Nyquist frequency, aliasing, foldover, sample size (bit depth), quantization, approximation, approximation error, quantizing or digital noise, dither, DAC.
Sounds from the real world can be recorded and digitized using an Analog-to-Digital Converter (ADC). As in the diagram below, the circuit takes a sample of the instantaneous amplitude (not frequency) of the analog waveform. Frequencies will be recreated later by playing back the sequential sample amplitudes at a specified rate.

Samples are taken at regular time intervals; the number of samples taken per second is called the sampling rate. The sampling rate determines the frequency response of the digitized sound.
According to the Nyquist theorem (named after Harry Nyquist), the highest reproducible frequency of a digital system is 1/2 the sampling rate, often called the Nyquist frequency.
A sampling rate of 44,100 samples per second, the rate at which CDs are encoded, can reproduce frequencies up to 22,050 Hz, well above the 20,000 Hz limit of human hearing. Frequencies above the Nyquist frequency that enter the system fold over to a much lower frequency than the original, an artifact called aliasing.

Therefore, steep (brickwall) filters are usually put in front of an ADC input to prevent signals above the Nyquist frequency from ever entering the system. In direct synthesis, where waveforms are produced 'synthetically' by a computer program, it is desirable to use 'band-limited' waveforms whose frequency components do not exceed the Nyquist frequency. Standard sampling rates are 44.1K and 48K (and even 96K in some high-end recording systems). See the result of two different sample rates below.
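As a rough sketch of the fold-over arithmetic in Python (the function name is ours, and it assumes ideal sampling with no input filter):

```python
def alias_frequency(f, fs):
    """Return the frequency actually heard when a tone at f Hz
    is sampled at fs Hz with no anti-aliasing filter in front."""
    # Sampling maps f onto f mod fs, mirrored about the Nyquist frequency.
    f = f % fs
    return fs - f if f > fs / 2 else f

# A 30,000 Hz tone sampled at 44,100 Hz folds over to 14,100 Hz.
print(alias_frequency(30000, 44100))   # 14100
```

A tone exactly at the sampling rate would alias all the way down to 0 Hz, which is why the input filter has to do its work before the ADC, not after.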

The sample is then assigned a numeric value that the computer or digital circuit can use or store, in a process called quantization. The number of available values is determined by the number of bits (0's and 1's) used for each sample. Each additional bit doubles the number of values available (1-bit samples have 2 values, 2-bit samples have 4 values, etc.). When a sample is quantized, the analog amplitude has to be rounded off to the nearest available digital value. This rounding-off process is called approximation. The fewer bits used per sample, the farther the analog values must be rounded. The difference between the analog value and the digital value is called the approximation or quantizing error, as shown in the graph below:
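The rounding step can be sketched in a few lines of Python; `quantize` is a hypothetical helper using a simple midtread rounding scheme, purely for illustration:

```python
def quantize(value, bits):
    """Round an analog value in [-1.0, 1.0] to the nearest level
    available at the given bit depth (simple midtread quantizer)."""
    levels = 2 ** (bits - 1) - 1          # e.g. 32767 for 16-bit audio
    return round(value * levels) / levels

# The same analog amplitude, rounded at three different bit depths:
x = 0.3021
for bits in (4, 8, 16):
    print(bits, abs(x - quantize(x, bits)))   # error shrinks as bits grow
```

Each extra bit halves the worst-case rounding distance, which is the intuition behind the dynamic-range rule of thumb discussed next.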
The greater the approximation error, the greater the amount of digital or quantizing noise produced. The solution is to use a larger sample size, or bit depth, which in turn determines the dynamic range of the system, since it sets the signal-to-noise ratio (for digital systems, this is often measured as SQNR, the signal-to-quantization-noise ratio). A general rule of thumb is an added 6 dB of dynamic range for every additional bit used per sample. The CD/DAT standard's 16-bit samples, with their impressive 65,536 quantization values, provide a theoretical playback-system optimum of 96 dB of dynamic range.
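The rule of thumb follows from the math: each bit doubles the number of values, and doubling a voltage ratio adds 20·log10(2) ≈ 6.02 dB. A quick check in Python:

```python
import math

def dynamic_range_db(bits):
    """Theoretical SQNR of an ideal n-bit quantizer: 20 * log10(2^n)."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))   # 96.3 -- the CD/DAT figure
print(round(dynamic_range_db(24), 1))   # 144.5 -- well beyond any playback chain
```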
Digital audio editing programs often use 20-bit, 24-bit, even 32-bit floating-point or long-integer samples to minimize the fractional values and the noise introduced when mixing or performing other mathematical operations on the samples. Another reason higher bit-depth recording is becoming more attractive as prices come down and storage becomes less of an issue is that quantization errors are much more critical at lower amplitudes, due to the linear amplitude divisions of the quantization process. At very low amplitudes, these errors are much more apparent, acting more like distortion than noise. Higher bit-depth files can be reduced back to 16-bit for things like CD burning, often aided by a process called dither, which tries to minimize the introduction of further digital noise. Dither allows higher bit depths, such as 20- and 24-bit files, to be reduced to 16 bits by not doing the obvious, which would be rounding off the extra precision to the nearest 16-bit value. Instead, it combines the least significant bits below the most significant 16 with random values, then rounds up or down to the nearest 16-bit value. If you don't record above 16-bit resolution, try to adjust recording levels to avoid prolonged periods of very low amplitude while not exceeding the maximum amplitude of the system at peaks (digital systems do not provide the fuzzy 'headroom' of analog systems--they just run out of values and clip). You can see the effect of two different bit depths on the diagram below:
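A minimal sketch of that combine-with-random-values idea in Python, using triangular (TPDF) dither -- the function name and scaling are ours, not the exact algorithm of any particular editor:

```python
import random

def dither_to_16bit(sample_24, rng=random.Random(0)):
    """Reduce a 24-bit integer sample to 16 bits with TPDF dither.

    Instead of plain rounding, add triangular noise of about one
    16-bit step before rounding, which decorrelates the rounding
    error from the signal so it sounds like low-level hiss rather
    than distortion."""
    step = 1 << 8                                 # one 16-bit LSB in 24-bit units
    noise = (rng.random() - rng.random()) * step  # triangular PDF, zero mean
    value = int(round((sample_24 + noise) / step))
    return max(-32768, min(32767, value))         # clamp to the 16-bit range
```

A silent 24-bit input comes out as -1, 0, or +1 rather than always 0 -- that tiny randomness is exactly the trade dither makes: a constant noise floor in exchange for no signal-correlated distortion.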

Samples can be stored in a wide variety of ways -- as pits in a Compact Disc picked up by a laser, on Digital Audio Tapes (DATs), on computer hard disks, or in the flash memory of an MP3 player -- or they can be generated in real time by a computer or a digital instrument such as a sampler. No matter what the storage or creation method, digital samples must be converted back into analog voltage values to be amplified and reproduced as sound from a loudspeaker. The circuit required for such a feat is the Digital-to-Analog Converter, or DAC, as pictured below:

For the sake of simplicity, we have pictured a 4-bit DAC, while your CD player would have a 16-bit DAC to correspond to the bit depth of the samples. Each sample is 'clocked' into the DAC's register. A '1' in a register place will add a voltage to the sum for that sample proportionate to its binary place value. In this hypothetical case, we have a sample whose binary value is 5. The gates or switches for the binary places of '4' and '1' are closed, and a value of 5 millivolts is sent out of the DAC and held until the next sample is clocked into the register. Before you have visions of 16 physical switches flopping open and shut 44,100 times per second, these are now very compact and cheap electronic switches, small enough to make your digital watch beep.
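The summing behavior can be mimicked in a few lines of Python; `dac_output` is a hypothetical illustration of the register logic, not real DAC circuitry:

```python
def dac_output(sample, bits=4, lsb_millivolts=1.0):
    """Sum a voltage contribution for each '1' bit in the sample,
    weighted by its binary place value (1, 2, 4, 8, ...)."""
    return sum(
        lsb_millivolts * (1 << place)
        for place in range(bits)
        if sample & (1 << place)
    )

print(dac_output(0b0101))   # 5.0 -- the '4' and '1' switches are closed
print(dac_output(0b1111))   # 15.0 -- all four switches closed
```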
If samples are clocked into the DAC at the rate they were sampled, then the original frequencies will be reproduced up to the Nyquist frequency. If the samples are clocked in at twice the rate, the frequencies will be doubled. A common studio mishap is to play back a file recorded at 48K at 44.1K, which lowers the pitch by about a semitone and a half. Because the output of a DAC creates a staircase wave (as in the sampling-rate diagram above) instead of a smooth analog one, a smoothing (lowpass) filter set just below the Nyquist frequency acts to reduce the sharpness of those steps and the unwanted frequencies they can produce. The reason some super high-end audio applications have gone not only to 24 bits but also to a 96K sampling rate is to make sure the roll-off of these smoothing filters and the ADC's input filters is not in the audio range at all.
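The size of that pitch drop is easy to verify (a sketch, assuming equal-tempered semitones of 1/12 of an octave):

```python
import math

# Playing a 48 kHz recording back at 44.1 kHz slows every cycle down
# by the ratio 44100/48000, lowering the pitch by about 1.5 semitones.
semitones = 12 * math.log2(44100 / 48000)
print(round(semitones, 2))   # -1.47
```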
Audio File Formats
Audio files, which are storable and editable collections of samples organized in a standard form, can be stored on computer drives, transferred to other computers or samplers, and shared on the Internet to be downloaded or played back in real time. They are different from audio CD or DAT tracks, which mostly contain only the raw sample data. That is why a CD track must be 'extracted' or 'ripped' to an audio file format to be usable by a computer application. A standard 16-bit, 44.1K stereo file eats up about 10 megs of disk space per minute of sound. Audio files come in a variety of types, which can influence their bit depth, multi-channel organization, compression scheme, sampling rate, organization of bytes high to low or vice versa (called 'Endian-ness'), and the amount of non-sample information stored in an area called the header, in units called chunks. Many audio programs are capable of opening and converting several file formats, within limits. Some additional terms you may see when looking at a sound file format relate to bit depth and to how computers store different-sized numbers: common sample sizes are called 8-bit chars, 16-bit short integers, 32-bit unsigned longs, or 32-bit floating-point values (floats).
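The '10 megs per minute' figure comes straight from the numbers (a quick Python check; the function name is ours):

```python
def audio_minutes_to_bytes(minutes, rate=44100, bytes_per_sample=2, channels=2):
    """Uncompressed PCM size: rate * sample size * channels * seconds."""
    return rate * bytes_per_sample * channels * minutes * 60

# One minute of CD-quality stereo is a little over 10 million bytes.
print(audio_minutes_to_bytes(1))   # 10584000
```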
Stereo sound files can be organized as interleaved, where the sample bytes of the respective channels alternate in a single stream (LRLRLR, etc.), or as two separate files called split stereo, where one file contains the LEFT channel samples and another file contains the RIGHT channel samples. By convention, these are usually labeled with the same name plus .L and .R suffixes (e.g. myaudio.L, myaudio.R). Most programs will open both files simultaneously by default. Many programs, such as MOTU's Digital Performer and Digidesign's Pro Tools, work only with split stereo files--when importing an interleaved file, they will automatically split it into two files. However, some CD-burning programs will burn only interleaved stereo files, so the separate files must be 'bounced to disk' and exported as an interleaved file to be burned.
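The two layouts are easy to sketch in Python (hypothetical helper names, operating on sample values rather than raw bytes):

```python
def split_stereo(interleaved):
    """Separate an interleaved LRLRLR... sample stream into the two
    per-channel streams a split-stereo pair of files would hold."""
    return interleaved[0::2], interleaved[1::2]

def interleave(left, right):
    """Rebuild the single LRLR... stream, e.g. before burning to CD."""
    out = []
    for l, r in zip(left, right):
        out.extend((l, r))
    return out

samples = [10, -10, 20, -20, 30, -30]   # L, R, L, R, L, R
print(split_stereo(samples))            # ([10, 20, 30], [-10, -20, -30])
```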
Many of the file formats were designed to work with a specific processor chip. For example, the AIFF format was designed for the Macintosh-based Motorola 680x0 family (the 'A' in AIFF was originally for Apple, I believe), which stores the Most Significant Byte (MSB) first and the Least Significant Byte (LSB) last (this is called Big Endian--it takes two 8-bit bytes to make a 16-bit sample). The Microsoft WAVE format, by contrast, was designed for the Intel 80x86 processors, in which the LSB comes first (Little Endian), the order in which the processor handles most information--confirming some people's belief that Intel thinks backwards (by analogy, decimal '21' would be coded as '12' in Little Endian style). Some of the common sound file format types are:
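Python's standard struct module can show the two byte orders directly:

```python
import struct

# The same 16-bit sample value, packed both ways:
sample = 0x1234
big    = struct.pack('>h', sample)   # AIFF-style: MSB first
little = struct.pack('<h', sample)   # WAVE-style: LSB first

print(big.hex())      # 1234
print(little.hex())   # 3412
```

An audio program converting between AIFF and WAVE has to swap every pair of sample bytes this way (on top of rewriting the header).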
-
.aif or AIFF (Audio Interchange File Format), the gold standard of 16-bit audio, travels well between almost all computers and software, and includes header information such as file name, sampling rate, MIDI note number for samplers, loop points, and the number of bytes in the file. Also capable of 24-bit and 32-bit resolution. Has the capability for quad, but it is rarely used.
-
.aifc or AIFC or AIFF-C (a compressed version of AIFF--though it does not have to be compressed; Big Endian; popular with SGI computers). Was thought to be a candidate to supersede AIFF, since it had all the AIFF properties with more capabilities, but that has not happened as of this writing. Some AIFF-happy programs will choke on AIFC, particularly if compressed.
-
.sd2 or SD II (Sound Designer II--same as AIFF with added proprietary information such as markers and regions--still very popular on Macs, even though Sound Designer is defunct)
-
.mp3 (MPEG-1 Audio Layer 3 compression--In 1987, the Fraunhofer IIS-A started work on perceptual audio coding in the framework of the EUREKA project. In a joint cooperation with the University of Erlangen, the Fraunhofer IIS-A devised a very powerful algorithm, standardized as ISO-MPEG Audio Layer-3. With the proper codecs, compression rates of up to 24 times can be achieved with near- (but not) CD-quality. The beauty of MP3 is its size vs. perceived quality, as well as its ability to be downloaded and then loaded into the flash memory of MP3 players. It can also be streamed to MP3 client software and is recognized by most Web browser audio helper applications. Files are encoded at certain bit-rates for target download speeds; for example, very good quality can be attained with 160 kbps encoding. Would you want to master all your music on MP3? No, but at least you can listen to it while you're jogging.)
-
.ra or .ram (Real Audio--can be streamed on the Internet from a Real Audio server, so the sound starts playing before the file has fully downloaded. Files can be encoded at multiple rates to accommodate different user download speeds (modem, DSL, T1 lines, etc.), ranging from 8 kbps to 1.5 Mbps (don't try 1.5 Mbps on your grandmother's 28.8 modem). Can also be combined with video for Real Media streaming. Real Audio spreads compression artifacts across the spectrum so they are theoretically less noticeable. Requires a Real Audio or compatible player client.)
-
.wav or Microsoft WAVE (Designed for PCs and Windows, but now usable with most audio programs, Mac or PC. Similar to AIFF in bit depths and sample rates. As mentioned above, it stores the MSB and LSB in the reverse order of AIFF files, so Microsoft developed RIFF (the Resource Interchange File Format) to support the Little Endian scheme.)
-
WMA or Windows Media Audio, designed for use with Windows Media Player, with various compression ratios.
-
.au (a-law, used with Sun computers) or .snd (used with NeXT computers)
-
.ul (u-law, used in US telephony; headerless, usually 8-bit, low quality)
-
.sf (IRCAM)
-
all sorts of surround-sound formats for encoding and decoding multiple channels of audio, often for video, film, or DVDs (Dolby Pro Logic, Dolby Digital Surround 5.1, THX, DTS, etc.)--beyond the scope of this article. With the advent of home DVD burners, watch this take off as a viable audio medium. Also watch for AAC, or Advanced Audio Coding, being developed by Sony, AT&T, Dolby Labs, and the original MP3 folks, which may encode multi-channel 5.1 surround files, as well as mono, stereo, and other massive multi-channel formats, at lower bit-rates (down to 96 kbps) with up to 24-bit resolution.
Analog and Digital Recording
When CDs were first introduced in the early 1980s, their single purpose in life was to hold music in a digital format. In order to understand how a CD works, you need to first understand how digital and analog recording and playback works and the difference between analog and digital technologies.
In the Beginning: Etching Tin
Thomas Edison is credited with creating the first device for recording and playing back sounds in 1877. His approach used a very simple mechanism to store an analog wave mechanically. In Edison's original phonograph, a diaphragm directly controlled a needle, and the needle scratched an analog signal onto a tinfoil cylinder:
You spoke into Edison's device while rotating the cylinder, and the needle "recorded" what you said onto the tin. That is, as the diaphragm vibrated, so did the needle, and those vibrations impressed themselves onto the tin. To play the sound back, the needle moved over the groove scratched during recording. During playback, the vibrations pressed into the tin caused the needle to vibrate, causing the diaphragm to vibrate and play the sound.
This system was improved by Emil Berliner in 1887 to produce the gramophone, which is also a purely mechanical device using a needle and diaphragm. The gramophone's major improvement was the use of flat records with a spiral groove, making mass production of the records easy. The modern phonograph works the same way, but the signals read by the needle are amplified electronically rather than directly vibrating a mechanical diaphragm.
Analog Wave
What is it that the needle in Edison's phonograph is scratching onto the tin cylinder? It is an analog wave representing the vibrations created by your voice.
The problem with the simple approach is that the fidelity is not very good. For example, when you use Edison's phonograph, there is a lot of scratchy noise stored with the intended signal, and the signal is distorted in several different ways. Also, if you play a phonograph repeatedly, eventually it will wear out -- when the needle passes over the groove it changes it slightly (and eventually erases it).
Digital Data
In a CD (and any other digital recording technology), the goal is to create a recording with very high fidelity (very high similarity between the original signal and the reproduced signal) and perfect reproduction (the recording sounds the same every single time you play it no matter how many times you play it).
To accomplish these two goals, digital recording converts the analog wave into a stream of numbers and records the numbers instead of the wave. The conversion is done by a device called an analog-to-digital converter (ADC). To play back the music, the stream of numbers is converted back to an analog wave by a digital-to-analog converter (DAC). The analog wave produced by the DAC is amplified and fed to the speakers to produce the sound.
The analog wave produced by the DAC will be the same every time, as long as the numbers are not corrupted. The analog wave produced by the DAC will also be very similar to the original analog wave if the analog-to-digital converter sampled at a high rate and produced accurate numbers.
You can understand why CDs have such high fidelity if you understand the analog-to-digital conversion process better. Let's say you have a sound wave, and you wish to sample it with an ADC. Here is a typical wave (assume here that each tick on the horizontal axis represents one-thousandth of a second):
When you sample the wave with an analog-to-digital converter, you have control over two variables:
- The sampling rate - Controls how many samples are taken per second
- The sampling precision - Controls how many different gradations (quantization levels) are possible when taking the sample
In the following figure, let's assume that the sampling rate is 1,000 per second and the precision is 10:
The green rectangles represent samples. Every one-thousandth of a second, the ADC looks at the wave and picks the closest number between 0 and 9. The number chosen is shown along the bottom of the figure. These numbers are a digital representation of the original wave. When the DAC recreates the wave from these numbers, you get the blue line shown in the following figure:
You can see that the blue line lost quite a bit of the detail originally found in the red line, and that means the fidelity of the reproduced wave is not very good. This is the sampling error. You reduce sampling error by increasing both the sampling rate and the precision. In the following figure, both the rate and the precision have been improved by a factor of 2 (20 gradations at a rate of 2,000 samples per second):
In the following figure, the rate and the precision have been doubled again (40 gradations at 4,000 samples per second):
You can see that as the rate and precision increase, the fidelity (the similarity between the original wave and the DAC's output) improves. In the case of CD sound, fidelity is an important goal, so the sampling rate is 44,100 samples per second and the number of gradations is 65,536. At this level, the output of the DAC so closely matches the original waveform that the sound is essentially "perfect" to most human ears.
CD Storage Capacity
One thing about the CD's sampling rate and precision is that they produce a lot of data. On a CD, the digital numbers produced by the ADC are stored as bytes, and it takes 2 bytes to represent 65,536 gradations. There are two sound streams being recorded (one for each of the speakers on a stereo system). A CD can store up to 74 minutes of music, so the total amount of digital data that must be stored on a CD is:
44,100 samples/channel/second * 2 bytes/sample * 2 channels * 74 minutes * 60 seconds/minute = 783,216,000 bytes
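The same arithmetic, spelled out as a quick Python check:

```python
# The CD capacity figure above, step by step:
bytes_per_second = 44100 * 2 * 2          # samples/sec * bytes/sample * channels
total = bytes_per_second * 74 * 60        # 74 minutes of stereo audio
print(total)                              # 783216000
```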
That is a lot of bytes! To store that many bytes on a cheap piece of plastic that is tough enough to survive the abuse most people put a CD through is no small task, especially when you consider that the first CDs came out in the early 1980s.
Digital To Analog Conversion
Digital-to-analog conversion is a process in which signals having a few (usually two) defined levels or states (digital) are converted into signals having a theoretically infinite number of states (analog). A common example is the processing, by a modem, of computer data into audio-frequency (AF) tones that can be transmitted over a twisted-pair telephone line. The circuit that performs this function is a digital-to-analog converter (DAC).
Basically, digital-to-analog conversion is the opposite of analog-to-digital conversion. In most cases, if an analog-to-digital converter (ADC) is placed in a communications circuit after a DAC, the digital signal output is identical to the digital signal input. Also, in most instances when a DAC is placed after an ADC, the analog signal output is identical to the analog signal input.
Binary digital impulses, all by themselves, appear as long strings of ones and zeros, and have no apparent meaning to a human observer. But when a DAC is used to decode the binary digital signals, meaningful output appears. This might be a voice, a picture, a musical tune, or mechanical motion.
Both the DAC and the ADC are of significance in some applications of digital signal processing. The intelligibility or fidelity of an analog signal can often be improved by converting the analog input to digital form using an ADC, then clarifying the digital signal, and finally converting the "cleaned-up" digital impulses back to analog form using a DAC.
There will be more to come, so please check back when I've had time to compile more information.