Ogg Vorbis Psycho-Acoustic Models Explained
The contents of this page come from Beni.
Your guide (Ripping, Page 6 of 7) says:
While MP3's version of VBR uses psycho-acoustic models, Ogg is based solely
on noise, masking, coupling and ATH thresholds. I have no idea of what that
means, but I read that's how I heard it works. =)
These *are* aspects of a psycho-acoustic model. It's like saying
"Most people eat soup, but he eats boiled water with vegetables and other
additions." I'm no specialist, but I've got a good idea of how they work. BTW,
I'm also a big fan of Vorbis, so you'll see my bias ;).
ATH - Absolute Threshold of Hearing - the ear can't physically hear a
sound quieter than this, no matter what happens on other frequencies.
Vorbis uses it more correctly than other codecs. Most codecs assume that
the volume is fixed during playback, whereas Vorbis assumes that the
volume can be adjusted (which it can). Vorbis assumes that you adjust
the volume so that the strongest frequency present won't kill your ears 8-[.
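To make the ATH idea concrete, here is a minimal sketch in Python. It uses
Terhardt's well-known approximation of the threshold in quiet, which is not
necessarily the curve Vorbis itself uses. The function `audible` and the
parameter `assumed_peak_spl` are hypothetical names of mine; they only
illustrate the "volume is adjusted to the loudest component" assumption.

```python
import math

def ath_db_spl(freq_hz: float) -> float:
    """Approximate threshold in quiet (dB SPL), Terhardt's formula.

    Valid for freq_hz > 0. This is a textbook approximation, not
    necessarily the curve any particular encoder uses.
    """
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

def audible(freq_hz: float, level_db: float, loudest_db: float,
            assumed_peak_spl: float = 100.0) -> bool:
    """Hypothetical helper: is a component above the ATH?

    Assumption: the listener sets the volume so that the loudest
    component plays at `assumed_peak_spl` dB SPL (an illustrative value),
    so every other component is placed relative to that peak.
    """
    playback_spl = level_db - loudest_db + assumed_peak_spl
    return playback_spl > ath_db_spl(freq_hz)

# A component 60 dB below the loudest one: fine at 1 kHz, lost at 16 kHz.
print(audible(1000.0, -60.0, 0.0))   # mid-range, where the ear is sensitive
print(audible(16000.0, -60.0, 0.0))  # very high, where the threshold is high
```

The ear is most sensitive around 3-4 kHz, so that is where the curve dips
lowest; at the very high end the threshold climbs steeply.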
Tone Masking - louder frequencies mask out adjacent quieter ones. The ear
and the brain cannot hear a quiet sound at one frequency if there is a loud
sound at a nearby one. The further apart two frequencies are, the louder one
has to be in order to mask the other; frequencies close to one another are
"masked" more easily.
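A toy model of tone masking might look like the sketch below. The Bark
conversion is Zwicker's standard approximation; the offset and slope values
are illustrative numbers I picked, not taken from Vorbis or any other codec,
and the function names are hypothetical.

```python
import math

def hz_to_bark(freq_hz: float) -> float:
    """Zwicker's approximation of the Bark (critical band) scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def masked_threshold_db(masker_hz: float, masker_db: float, probe_hz: float,
                        offset_db: float = 10.0,
                        slope_below: float = 27.0,
                        slope_above: float = 10.0) -> float:
    """Level below which a probe tone is hidden by the masker.

    The threshold falls off linearly (in dB per Bark) with distance
    from the masker. All three tuning values are illustrative only.
    """
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = slope_above if dz >= 0 else slope_below
    return masker_db - offset_db - slope * abs(dz)

def is_masked(masker_hz: float, masker_db: float,
              probe_hz: float, probe_db: float) -> bool:
    return probe_db < masked_threshold_db(masker_hz, masker_db, probe_hz)

# An 80 dB tone at 1 kHz hides a 50 dB tone right next to it,
# but not a 50 dB tone far away at 4 kHz.
print(is_masked(1000.0, 80.0, 1100.0, 50.0))
print(is_masked(1000.0, 80.0, 4000.0, 50.0))
```

A real encoder sums the contributions of many maskers and also takes the
ATH into account; this only shows the "closer is easier to mask" shape.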
Things aren't always that simple though. Think about what happens when you
hit a cymbal. A cymbal doesn't really play a tone, it generates broadband
noise. Noise masking basically determines how much extra noise the encoder
can introduce in that noise without it being audible.
This is generally trickier than tone masking, and it's one of the things that
improved a lot in Vorbis RC3 (though RC2 already did it much better than LAME,
for example; the LAME developers are now trying to imitate the Vorbis method).
These two are used by all perceptual codecs (including MP3) to model what
a human ear will and won't hear. They are not specific to VBR. ABR
approximates what's left after psycho-acoustics as well as possible within
given bit-rate limits. VBR approximates the same things using as many bits
as it takes.
[The gory details: in order to make the bitrate lower than full-quality VBR,
one makes ATH, masking, etc. more aggressive than necessary. This also throws
out some things that would still be heard, though not very well.]
Coupling - a more generic term for concepts like mid/side stereo, amplitude
stereo, etc., in MP3 and other codecs. The idea is that sound consists of
many channels and has redundancy between channels. This redundancy
can be exploited to lower the bit-rate if the channels are encoded in some
The simplest example is to encode the average and the difference between
channels (for a stereo sound) - this is called mid/side representation and
it requires far fewer bits for sections that are close to mono.
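The mid/side idea can be sketched in a few lines of Python. This is only the
plain lossless transform on a list of samples, with function names of my own
choosing; it says nothing about how any real codec stores the result.

```python
def to_mid_side(left, right):
    """Average and difference of two channels (sample by sample)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Exact inverse of to_mid_side."""
    left = [m + s / 2 for m, s in zip(mid, side)]
    right = [m - s / 2 for m, s in zip(mid, side)]
    return left, right

# Nearly-mono material: the two channels are almost identical.
left = [100, 102, 98, 101]
right = [100, 101, 99, 101]

mid, side = to_mid_side(left, right)
print(mid)   # looks like either original channel
print(side)  # tiny numbers, cheap to encode

# Nothing is lost by the transform itself.
assert from_mid_side(mid, side) == (left, right)
```

The side channel holds only small values when the input is close to mono,
which is where the bit savings come from.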
Psycho-acoustics appears here too: the ability of the ear and brain to
differentiate the sound coming to each ear is limited. A lossy format
can cheat to some degree (e.g. store the difference at low quality).
Compare with color television (analog). It transmits a full-resolution
intensity image (brightness), yet transmits coloring information at only
half-resolution, because the eye perceives color with less resolution
and the brain cares less for color changes. That's a lossy coupling
method based on a psycho-visual model ;-).
Vorbis supports files with up to 255 channels. That's why the term stereo
is avoided. Also, mid/side seems to be patented. For now, the encoder knows
how to use coupling for 2-channel files only, but eventually it will scale.
Vorbis uses a flexible format for coupling that allows it to cheat in different
ways at different frequencies, or not to cheat at all (lossless stereo imaging
in lossy audio - still gains on the redundancy!).
It's elegantly based on Vorbis' compression technique. Vector quantization
means that it encodes *joint* approximations to groups of numbers.
If you group together numbers describing different channels, your channels
become automatically coupled (normally a group would be picked from data
describing a single channel, so channels would be approximated independently).
Details explained here: http://www.xiph.org/ogg/vorbis/doc/stereo.html.
Noise -- I'm not sure what you refer to in this case. How is Vorbis based on
noise? Where did you see it? Formats like PlusV and MP3Pro's SBR invent
noise quite shamelessly in a way that you could believe to be the original. ;)
Perhaps this is what you mean:
The process of Vector Quantization introduces some "[vector] quantization
noise" - the difference between the approximation (a limited number of
these can be chosen) and the original group of numbers.
Don't worry, all codecs suffer from quantization noise. VQ's whole
purpose is to suffer less from it by exploiting correlations within a group
of numbers - the joint noise can be lower than the independent noise that
would otherwise be introduced into each number.
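Here is a toy comparison of joint versus independent quantization, assuming
made-up data and hand-picked codebooks (real codebooks are trained, and none
of these numbers come from Vorbis). Both schemes spend 2 bits per pair: either
1 bit for each number separately, or 2 bits for the pair as a whole.

```python
def nearest(vector, codebook):
    """Return the codebook entry closest to `vector` (squared distance)."""
    return min(codebook,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(vector, c)))

def mean_sq_error(pairs, quantize):
    """Average squared quantization noise per pair."""
    total = 0.0
    for p in pairs:
        q = quantize(p)
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(pairs)

# Correlated pairs, e.g. the same coefficient from two similar channels.
pairs = [(i / 10, i / 10 + (0.05 if i % 2 else -0.05)) for i in range(11)]

# Independent: each number gets 1 bit (2 levels), so 2 bits per pair.
scalar_levels = [(0.25,), (0.75,)]
def scalar_quantize(p):
    return tuple(nearest((v,), scalar_levels)[0] for v in p)

# Joint: 2 bits per pair, with all 4 codewords placed where the data lives.
joint_codebook = [(0.125, 0.125), (0.375, 0.375),
                  (0.625, 0.625), (0.875, 0.875)]
def joint_quantize(p):
    return nearest(p, joint_codebook)

scalar_err = mean_sq_error(pairs, scalar_quantize)
joint_err = mean_sq_error(pairs, joint_quantize)
print(scalar_err, joint_err)
```

The independent scheme wastes two of its four possible combinations on
corners where no data ever lands; the joint codebook puts every codeword
along the diagonal, so the same bits buy less noise.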
See http://data-compression.com/vq.html for an explanation (the math can
be skipped: read paragraphs I, II, and the beginnings of III and V; IV is easy
too. Be sure to click the animation at VI!). BTW, VQF also uses VQ.
Anyway, this noise is introduced *after* all psycho-acoustic modeling and
simplification of the signal. It's one of the very last steps. The trick is to
ensure that it's unnoticeable.
A good codec will predict how much of this noise can be tolerated at a given
point of the signal and decide to allocate additional bits for a more precise
approximation if necessary.
I don't know how Vorbis handles it, but for most codecs this is only a minor
problem compared to getting a good psycho-acoustic model. Basically you allow
just enough possible approximation values to make the noise unnoticeable, and
live with the number of bits that this takes.
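The "just enough approximation values" idea can be sketched for the simplest
case, a plain uniform scalar quantizer (not VQ, and not how Vorbis does it).
It relies on the standard estimate that a uniform quantizer with step size
`step` adds noise of power roughly step**2 / 12. The function name and the
example numbers are hypothetical.

```python
import math

def bits_needed(value_range: float, allowed_noise_power: float) -> int:
    """Fewest bits for a uniform quantizer whose noise stays tolerable.

    Assumption: quantization noise power is about step**2 / 12, and
    `allowed_noise_power` is whatever the psycho-acoustic model decided
    can be hidden at this point of the signal.
    """
    max_step = math.sqrt(12.0 * allowed_noise_power)
    levels = max(2, math.ceil(value_range / max_step))
    return math.ceil(math.log2(levels))

# Where a lot of noise can be hidden, few bits are enough;
# where very little can be hidden, more bits are spent.
print(bits_needed(2.0, 1e-2))
print(bits_needed(2.0, 1e-4))
```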
I hope this was helpful. It helped me organize these ideas in my head. For more
details you can ask on the Vorbis mailing-list <email@example.com> or contact Monty
<http://www.vorbis.com/contact.psp>. He's the main Vorbis developer.
Beni Cherniavsky <firstname.lastname@example.org>