EENG3010 (week of Apr 3 - 7)

The subject this week is AUDIO.

Audio Plugs

These plugs and jacks come in standard sizes and geometries, although I am not sure whether they are official standards managed by a standards body. There are 3 sizes: 6.3mm, 3.5mm, and 2.5mm. The size is the diameter of the plug. The largest is the oldest; it was originally used by phone operators at switchboards. In the old days, when you picked up your phone at home, a light blinked above your jack on the switchboard at the central office. The operator plugged her 6.3mm headphone plug into your blinking jack and talked to you. She asked whom you wanted to call. Then she plugged into the jack of the person you intended to call and asked him whether he wanted to accept the call. If the answer was yes, she connected the two jacks with a cable that had 6.3mm plugs on both ends.

The 6.3mm jacks are also used in full-size stereo equipment. The 3.5mm jacks are used in PCs, portable CD/MP3 players, digicams, and so on. The smallest jack (2.5mm) is used in phones in general (especially cellular and cordless phones). The earphones of most phones are interchangeable because of this standard.

The standard audio plugs are called TRS (Tip Ring Sleeve), after the positions of their three conductors. We only need 3 conductors for a phone conversation. Why? One is the ground, one is the mono speaker, and the third is the mono microphone. Phone lines distort sound to a great extent, so there is no point in having a phone conversation in stereo.

However, standard 2.5mm earphones do not work with Nokia phones although they fit in the jack. This is because Nokia phones use a 4-conductor plug whereas standard earphones use 3-conductor plugs. Nokia may have had several reasons for choosing 4-conductor jacks. They may have thought they would sell earphones as accessories and make more money. Also, cell phones have recently gained MP3 player capability, and in that case we need stereo. However, when listening to MP3s we do not have to speak, so with some small electronics 3 conductors would still be enough.

There are standard 4-conductor plugs (at least in the 3.5mm size) that are used with camcorders (i.e., handycams). The camcorder plugs are used to send video and stereo audio to TV sets. Less expensive camcorders use the 3-conductor version and send only mono audio to the TV.

See these pix for what we put on the board when we talked about this subtopic: snapshot 1, snapshot in the other section.

PC Audio

Creative Labs pioneered PC audio starting in 1989. At the time, a PC's audio capability was limited to a small single speaker and a beep sound. Creative launched some of the first widely adopted add-in sound cards. It took a while before Intel pulled some of the sound card capabilities onto the motherboard, which it did with the AC97 interface standard. Intel and Microsoft have similar tactics. Microsoft copies technology invented by other companies (such as web browsing) and pulls it into its operating system. Intel pulls features into its motherboard architecture. So when you buy an Intel-based motherboard and sound capabilities come with it (although at a higher price), you do not buy a sound card from a company such as Creative. You go to those companies only for high-end equipment. At the same time, AC97 opened up the sound card market to other players: instead of building a PCI card, you can build a riser card that plugs into an AC97 interface on the motherboard. AC97 should not be confused with AC3. AC3 is a digital audio coding scheme for surround sound from Dolby Labs.

CDDA

CDDA means Compact Disc Digital Audio and was pioneered by Philips (together with Sony). Unlike MP3, CDDA does not involve intense compression. It samples audio at 44.1kHz on two channels (stereo), quantizes each sample using 16-bit PCM, and stores the result in digital form. One minute of a CDDA song takes about 10MB.

Exercise: Please show this yourself. Make sure you find the exact number. Assume that Differential (or Delta) Pulse Code Modulation is used which compresses 100 bits to 75 bits and then Reed-Solomon coding is used as a safeguard against scratches (and that increases 100 bits to 110 bits).
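
One way to check your answer to the exercise is a few lines of Python. This is only a sketch of the arithmetic: the two-channel (stereo) assumption and the constant and function names are ours, and the 75% and 110% factors are the ones the exercise asks you to assume, not properties of real discs.

```python
SAMPLE_RATE_HZ = 44_100   # CDDA sampling rate
BITS_PER_SAMPLE = 16      # 16-bit PCM
CHANNELS = 2              # assumption: stereo


def raw_pcm_bits(seconds):
    """Bits of plain PCM audio for the given duration."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS * seconds


def stored_bits(seconds):
    """Bits after the two steps the exercise asks us to assume."""
    bits = raw_pcm_bits(seconds)
    bits = bits * 75 // 100    # exercise assumption: DPCM turns 100 bits into 75
    bits = bits * 110 // 100   # exercise assumption: Reed-Solomon turns 100 bits into 110
    return bits


if __name__ == "__main__":
    print("raw PCM bytes per minute:", raw_pcm_bits(60) // 8)
    print("stored bytes per minute: ", stored_bits(60) // 8)
```

The raw figure is what the "about 10MB" estimate above refers to; the second figure applies the exercise's assumptions on top of it.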

When I open a CDDA CD in Windows XP, each track (i.e., song) appears to be only 44 bytes. However, the CD seems to have zero free space when I right-click and look at its Properties. This is because a CDDA CD does not carry a file system that Windows understands. It is not NTFS or FAT. As a matter of fact, if you look at Properties, it says File system: Unknown. The 44-byte entries are only placeholders that Windows shows for the tracks. When a player plays a CDDA disc, it does not go through an OS file system; it reads the disc's table of contents and the audio data directly.

You may have asked why we sample the sound at 44.1kHz in CDDA. Humans can only hear between 20Hz and 20kHz. Whatever the highest frequency we want to hear is, we need to sample at least at twice that rate (i.e., at least 40kHz). This follows from the Nyquist-Shannon sampling theorem in Signal Processing and Information Theory.
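
To see what goes wrong when a tone is above half the sampling rate, the sketch below computes the frequency that the samples appear to contain (the alias). The function name `alias_frequency` is ours, and the example tones are arbitrary.

```python
def alias_frequency(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling at sample_rate_hz.

    Tones at or below half the sampling rate come back unchanged;
    higher tones fold back into the range 0 .. sample_rate_hz / 2.
    """
    nearest_multiple = round(tone_hz / sample_rate_hz)
    return abs(tone_hz - nearest_multiple * sample_rate_hz)


if __name__ == "__main__":
    fs = 44_100
    for tone in (1_000, 20_000, 30_000):
        print(tone, "Hz is heard as", alias_frequency(tone, fs), "Hz")
```

A 20kHz tone survives sampling at 44.1kHz, while a 30kHz tone would show up as a 14.1kHz tone that was never in the original sound. This is why the sampling rate has to be at least twice the highest frequency we care about.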

MP3

MP3 stands for "MPEG-1 Audio Layer 3". MPEG-1 is a video standard. However, nobody would think of sending video without audio. Hence, as part of that standard, engineers devised ways to send compressed audio. Each audio layer (i.e., compression scheme) within MPEG builds on the tools of the previous layer and uses more complex compression techniques. If you do not remember anything else from this lecture, you should at least remember the following. There is no MPEG-3. MP3 is not MPEG-3.

MP3 is a lossy compression technique. The first part of the compression removes sounds that would not be heard by a regular human ear. The second part is a general-purpose lossless compression technique called Huffman coding, which is also used in regular file compression programs.
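
To give a feel for the Huffman coding step, here is a minimal sketch that builds a Huffman code for the characters of a string. It illustrates the general technique only, not the code tables an MP3 encoder actually uses; all function names are ours.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each symbol in text to its bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups; prefix one with 0, the other with 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # no code is a prefix of another
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    message = "aaaabbc"
    code = huffman_code(message)
    bits = encode(message, code)
    print(code, bits, decode(bits, code))
```

Frequent symbols get short codes and rare symbols get long ones, so the total number of bits drops, and decoding recovers the input exactly. That is why this part of MP3 is lossless; the loss happens in the first part.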

Conversion of a song to MP3 involves the following steps: sampling, conversion to the frequency domain, masking, and Huffman coding. The sound is sampled, quantized, and converted to binary in a way very similar to CDDA. Then the audio is divided into very short clips, each a fraction of a second long. Each short clip is converted to the frequency domain, which is viewed as a collection of consecutive frequency subbands. The conversion finds the volume in dB (i.e., decibels) in each subband for every short clip. (The decibel is a logarithmic scale. Volume is measured in decibels because the human ear's perception of loudness is roughly proportional to the logarithm of the pressure applied to the eardrum.)

Once the frequency distribution of a clip is determined, it is compared to the psychoacoustic model of the human ear. Our ears have a different sensitivity in each subband. We do not hear sounds in a subband that are below a certain threshold, so there is no point in storing those sounds. The human ear is most sensitive in the 3-4 kHz range. As we move to lower or higher frequencies from this range, our sensitivity decreases.

On top of this, a loud subband masks its neighboring subbands. Even if a sound is above our threshold in a subband, it will not be heard if a neighboring subband is very loud. This masking also lingers. Even after the loud subband gets quiet, it continues to mask the neighboring bands for a little while. This is called temporal masking.
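
The threshold idea can be sketched in a few lines. The formula below is a commonly cited approximation of the hearing threshold in quiet (in dB SPL as a function of frequency); it is our choice for illustration and is not necessarily the model a real MP3 encoder uses. The sketch also ignores masking by loud neighboring subbands. Function names and the example subband levels are made up.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate level (dB SPL) below which a pure tone is inaudible.

    Assumption: a commonly cited curve fit, used here only as an example.
    """
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


def audible_subbands(subbands):
    """Keep only the (center_freq_hz, level_db) pairs above the threshold."""
    return [(f, level) for f, level in subbands
            if level > threshold_in_quiet_db(f)]


if __name__ == "__main__":
    # Hypothetical subband levels for one short clip.
    clip = [(100, 10.0), (3300, 0.0), (15000, 30.0)]
    for f, _ in clip:
        print(f, "Hz threshold:", round(threshold_in_quiet_db(f), 1), "dB")
    print("kept:", audible_subbands(clip))
```

The curve dips lowest around 3-4 kHz, so a very quiet sound there is kept while louder sounds at 100Hz and 15kHz fall below the threshold and can be dropped.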

See here for the class notes: snapshot 1, snapshot in the other section.

Audio Layer 3 of MPEG-2 is also called MP3. However, it is not identical to the MPEG-1 format. Most MP3 files you find today are MPEG-1 based. The new video standard MPEG-4 uses an audio compression method called AAC. Other easily found formats are Microsoft's WMA and RealAudio. If you follow our MP3 link above to Wikipedia, you will find links for these formats. Another common audio format is the .wav format. Note that the wave format is normally not a compressed format.

MIDI

MIDI is an early music format (as opposed to a general-purpose audio format). MIDI stores music notes. To play songs with multiple instruments, it has multiple sound tracks. Since MIDI only stores notes, MIDI tunes occupy very little space. As a result, MIDI is the format of choice for cellphone ringtones. Note that you cannot code songs with vocals in them. A two-channel (~ stereo, or more accurately two-instrument) MIDI song is around 10kB for each minute. You can use this shareware software to convert audio that plays on your PC to MIDI. Also, with or without vocals, you do not want to record or play a rock or rap song in MIDI; I have tried an Eminem song. When I tried a classical piece, however, in both the original format and MIDI, I found to my pleasant surprise that not much is lost.
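
To see why storing notes takes so little space, the sketch below builds the basic MIDI "note on" and "note off" messages, which are 3 bytes each (a status byte, a note number, and a velocity). The helper names and the example melody are ours. A MIDI file also stores timing and some header data with these events, which we leave out here.

```python
def note_on(channel, note, velocity):
    """3-byte MIDI note-on message: status 0x90 + channel, note, velocity."""
    if not (0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("channel must be 0-15; note and velocity 0-127")
    return bytes([0x90 | channel, note, velocity])


def note_off(channel, note):
    """3-byte MIDI note-off message: status 0x80 + channel, note, velocity 0."""
    if not (0 <= channel <= 15 and 0 <= note <= 127):
        raise ValueError("channel must be 0-15; note 0-127")
    return bytes([0x80 | channel, note, 0])


if __name__ == "__main__":
    # Hypothetical melody: C, D, E, C (note 60 is middle C).
    melody = [60, 62, 64, 60]
    stream = b"".join(note_on(0, n, 100) + note_off(0, n) for n in melody)
    print(len(stream), "bytes for", len(melody), "notes")
```

Each note costs a handful of bytes no matter how long it sounds, whereas CDDA spends tens of thousands of bytes on every fraction of a second of audio.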

Data Compression Fundamentals

At the end of this week's lectures, we went over the basics of data compression. For every compression format, there is an encoder and a decoder involved. Software (or hardware) that does both is called a "codec": "co" for coder and "dec" for decoder. (See the board in the class.) To illustrate the basic idea behind data compression, we described run-length coding. The idea here (and in compression in general) is to find repeating patterns and describe them in a concise way. For example, if we have 100 black pixels next to each other, we can code them with a pair: (0,100). Here 0 is the color and 100 is the run-length. The pair as a whole represents a "run". We did the following exercise to understand run-length encoding.

Exercise: We have a b/w drawing with 1000. There is a circle of height 700. How many bytes would one use if an ASCII character were used for each pixel's color? How many bytes would we need if we used just one bit per pixel, since this is a black and white picture? How many bytes are needed after run-length coding? Would run-length coding result in a bigger file if the picture were different? That is, can you think of any type of drawing or picture where run-length coding would expand the size? If run-length coding can expand the size, can you come up with a scheme where we never end up with a larger file than the uncompressed version? (See the pictures for our work on this exercise in the classroom: picture1, picture2.)
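
If you want to experiment while working on the exercise, run-length coding with (color, run-length) pairs as described before the exercise can be sketched as follows. The function names are ours, and one row of pixels is represented as a simple list of 0s and 1s.

```python
def rle_encode(pixels):
    """Turn a sequence of pixel colors into a list of (color, run_length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1        # same color as the previous pixel: extend the run
        else:
            runs.append([p, 1])     # color changed: start a new run
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (color, run_length) pairs back into the original pixel list."""
    pixels = []
    for color, length in runs:
        pixels.extend([color] * length)
    return pixels


if __name__ == "__main__":
    row = [0] * 100 + [1] * 20 + [0] * 5
    runs = rle_encode(row)
    print(runs)
    print(rle_decode(runs) == row)
```

Try feeding it different rows and counting the pairs it produces; that is a good starting point for the last two questions of the exercise.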

Reading materials:
    Follow all the links above. (You do not have to follow the links in those pages.)
    MPEG audio compression tutorial.
    Audio compression tutorial.
    See the pictures of the board in our lectures.