In this project, the Modified Discrete Cosine Transform is used to perform and analyze audio compression. The \(n \times 2n\) MDCT matrix, entrywise, is defined by \[M_{i,j} = \sqrt{\frac{2}{n}}\cos \frac{(2i + 1)(j + \frac{n + 1}{2})\pi}{2n}\] where \(n\) is the length of a 'window'. When using the MDCT, a given audio signal is first sliced into chunks of size \(2n\) that overlap by \(n\). So, for example, for window size 32, an audio stream gets sliced into sections with entries 1-64, 33-96, 65-128, and so on. These strips of audio are multiplied by \(M\) to get \(y\) windows, which can later be multiplied by \(M^T\) to get back strips of audio that can be added together to recover the original audio almost perfectly, minus the first and last \(n\) samples.
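To make the setup concrete, here is a minimal MATLAB sketch (my own, not the project's code) of the matrix and the slicing; the window size n = 32 matches the example above, and the random signal is a stand-in for real audio.

% Build the n x 2n MDCT matrix from the formula above, using 0-based i, j.
n = 32;
[i, j] = ndgrid(0:n-1, 0:2*n-1);
M = sqrt(2/n) * cos((2*i + 1) .* (j + (n + 1)/2) * pi / (2*n));
x = randn(16*n, 1);                        % stand-in audio signal
starts = 1:n:length(x) - 2*n + 1;          % chunk starts: 1, n+1, 2n+1, ...
Y = zeros(n, numel(starts));               % one y window per column
for s = 1:numel(starts)
    Y(:, s) = M * x(starts(s) : starts(s) + 2*n - 1);
end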
The compression comes in by bit-quantizing the \(y\) windows. Given an interval of values \((-L, L)\) that the initial audio signal can take on, and a number of bits \(b\) to quantize into, \(y\) can be quantized by dividing it by \(q = \frac{2L}{2^b - 1}\) and rounding. An approximation of \(y\) can be retrieved by multiplying back by \(q\). This procedure loses some data in exchange for representing each entry of a window by a small integer, which takes less space to store than a floating-point value, and with a large enough \(b\), the accuracy lost in the compression can be imperceptible to the human ear.
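A sketch of this quantization step; the values of L and b and the stand-in window y are assumptions for illustration.

b = 4; L = 1;                 % bits per value; signal values in (-L, L)
y = randn(32, 1) * 0.1;       % stand-in for one y window
q    = 2*L / (2^b - 1);       % quantization step size
yq   = round(y / q);          % small integers: what actually gets stored
yhat = yq * q;                % approximate recovery of y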
For the first problem, we analyze the effect of the MDCT, with a quantization level of 4 bits, on tones with frequencies of the form \(64f\). In doing so, it is readily noticeable that tones with odd \(f\) are reconstructed significantly better than those with even \(f\).
Odd multiples fare far better than even ones, as the graph below (generated from the accompanying script) shows.
The first of these is odd (f = 5, RMSE = .0011) and the second is even (f = 4, RMSE = .0091). The discrepancy seems to come from the fact that, when the audio signal is sliced up, odd multiples of 64 start and end with 0, whereas even multiples start and end with peaks of waves. The former needs only one value in the \(y\) window to be represented, whereas the latter can only be approximated by adding together several values in the window, which ends up significantly less accurate.
For f = 5, a slice of its curve, when converted to a window, only requires one non-zero value.
For f = 4, a slice of its curve, when converted to a window, requires many non-zero values.
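This sparsity difference can be checked numerically. The sketch below assumes a sample rate of 8192 Hz and window size n = 32 (neither is stated above, so both are my assumptions); under them, a \(64 \cdot 5 = 320\) Hz tone lands exactly on an MDCT basis frequency \((i + \frac{1}{2})\frac{f_s}{2n}\), while \(64 \cdot 4 = 256\) Hz does not.

fs = 8192; n = 32;                         % assumed sample rate and window size
[i, j] = ndgrid(0:n-1, 0:2*n-1);
M = sqrt(2/n) * cos((2*i + 1) .* (j + (n + 1)/2) * pi / (2*n));
t = (0:2*n-1)' / fs;
for f = [5 4]
    y = M * sin(2*pi*64*f*t);              % one windowed slice of the tone
    fprintf('f = %d: %d coefficients above 1%% of peak\n', ...
            f, nnz(abs(y) > 0.01*max(abs(y))));
end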
Codec Code

One way of solving this issue is to apply a windowing function, which tapers the ends of the audio slices to 0. One suggested function (and the one used here) is \[h_i = \sqrt{2} \sin \frac{(i- \frac{1}{2})\pi}{2n}\]
for \(i = 1 \ldots 2n\). Multiplying the latter half of this vector into a reconstructed slice \(w_2\) and the former half into the next slice \(w_3\) undoes the windowing function to recover the original signal.
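Constructing \(h\) is a one-liner; this small sketch also shows that it tapers to near 0 at both ends of the slice.

n = 32;
i = (1:2*n)';
h = sqrt(2) * sin((i - 1/2) * pi / (2*n));
[h(1) h(2*n)]                 % both ~0.035: the slice ends are tapered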
The above is a slice of the f = 4 audio with the windowing function applied to it. This transformation improved the accuracy drastically, dropping the RMSE down to .0011.
The above is f = 5 with the windowing function applied. This was not as useful for this audio, and actually introduced inaccuracy, raising the RMSE to .0043, which was, to my ear, a significant audible difference.
The reason this windowing function works is that it obeys the Princen-Bradley conditions: \[h_i^2 + h_{i+n}^2 = 1\] and \[h_k = h_{2n+1-k}\] The second follows from the fact that the function is symmetric about \(n + \frac{1}{2}\). The first comes from the fact that \(\sin \frac{(i + n - 1/2)\pi}{2n} = \sin\left(\frac{(i - 1/2)\pi}{2n} + \frac{\pi}{2}\right) = \cos\frac{(i - 1/2)\pi}{2n}\) and the Pythagorean identity. With the \(\sqrt{2}\) factor included, the sum \(h_i^2 + h_{i+n}^2\) actually equals 2 rather than 1, but the 2 is used to compensate for averaging two vectors later.
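Both conditions are easy to check numerically; this sketch verifies the symmetry and the scaled Pythagorean sum for the \(\sqrt{2}\) window.

n = 32;
h = sqrt(2) * sin(((1:2*n)' - 1/2) * pi / (2*n));
max(abs(h - flipud(h)))                    % ~0: h_k = h_{2n+1-k}
max(abs(h(1:n).^2 + h(n+1:2*n).^2 - 2))    % ~0: the sum equals 2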
To elaborate on this further, say you have two column vectors, \(v_1 = [x_1, x_2, x_3, \ldots x_{n+1}, \ldots x_{2n}]\) and \(v_2 = [x_{n+1}, x_{n+2}, \ldots x_{3n}]\). The latter half of the first vector shares the same values as the former half of the second. After multiplying both vectors by \(M^TM\), the original values (or an approximation of them, if quantization was applied) can be recovered by averaging the latter half of the first with the former half of the second.
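Here is a sketch of this overlap-and-average recovery with no windowing (the rectangular case, where \(h_k^2 + h_{k+n}^2 = 1 + 1 = 2\) is what makes the averaging work out).

n = 32;
[i, j] = ndgrid(0:n-1, 0:2*n-1);
M = sqrt(2/n) * cos((2*i + 1) .* (j + (n + 1)/2) * pi / (2*n));
x  = randn(3*n, 1);
v1 = x(1:2*n);  v2 = x(n+1:3*n);           % the slices overlap in x(n+1:2n)
r1 = M' * (M * v1);  r2 = M' * (M * v2);
rec = (r1(n+1:2*n) + r2(1:n)) / 2;         % average the shared halves
max(abs(rec - x(n+1:2*n)))                 % ~0 up to round-off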
But with the windowing function applied, we instead have \(v_1 = [x_1h_1, x_2h_2, x_3h_3, \ldots x_{n+1}h_{n+1}, \ldots x_{2n}h_{2n}]\) and \(v_2 = [x_{n+1}h_1, x_{n+2}h_2, \ldots x_{3n}h_{2n}]\). Multiplying \(w_2\) by the latter half of \(h_k\) and \(w_3\) by the former, and averaging them together, yields a vector whose terms can roughly be written in the form \[\frac{x_k(h_k^2 + h_{k+n}^2) + c_k(h_kh_j - h_{2n + 1 - k}h_{2n +1 - j})}{2}\] As mentioned before, the coefficient of the first term in the numerator is 2, and the second term goes to 0, so this recovers the original value.
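The same sketch with the window applied before the MDCT and again after the inverse, as described above:

n = 32;
[i, j] = ndgrid(0:n-1, 0:2*n-1);
M = sqrt(2/n) * cos((2*i + 1) .* (j + (n + 1)/2) * pi / (2*n));
h = sqrt(2) * sin(((1:2*n)' - 1/2) * pi / (2*n));
x  = randn(3*n, 1);
r1 = M' * (M * (h .* x(1:2*n)));           % window, transform, invert
r2 = M' * (M * (h .* x(n+1:3*n)));
rec = (h(n+1:2*n) .* r1(n+1:2*n) + h(1:n) .* r2(1:n)) / 2;
max(abs(rec - x(n+1:2*n)))                 % ~0: the aliasing terms cancel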
Codec With Windowing

For this part, I created a graph of what happens to the accuracy of an MDCT (with windowing) on octaves (in red), thirds (in blue), and fifths (in green) as the number of bits given to each component of the window is increased. More bits mean more accuracy.
I ran an mp3 (source: http://music.geocities.jp/gakufuup/) through the codec with varying window sizes, with and without windowing, and with increasing numbers of bits. Increasing the bits and adding windowing made the RMSE smaller, as expected.
For this part, I implemented a version of simplecodec that creates a \(b\)-vector, as opposed to a scalar \(b\), which encodes the more important components of the \(y\) windows with more bits. I create an accumulator vector, add the absolute values of all the \(y\) windows to it, take the log of it plus 1, ceil it to make it an integer \(\geq 1\), and then scale it so its maximum is the \(b\) the user inputs; that becomes my \(b\)-vector. The \(b\)-vector is used to create a \(q\)-vector, which then does what \(q\) usually does, except with component-wise operations on the \(y\) windows instead of scalar ones.
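A sketch of this allocation scheme; the variable names and the exact scaling are my reading of the description above, not the project's code. Y holds one \(y\) window per column, and bmax and L are user parameters.

n = 32; Y = randn(n, 100);                 % stand-in for the MDCT output
L = 1; bmax = 8;
acc  = sum(abs(Y), 2);                     % accumulate |y| over all windows
w    = max(1, ceil(log(acc + 1)));         % integer importance weights >= 1
bvec = max(1, round(w * (bmax / max(w)))); % scale so the top component gets bmax
qvec = 2*L ./ (2.^bvec - 1);               % component-wise step sizes
Yq   = round(Y ./ qvec);                   % quantize each row with its own q
Yhat = Yq .* qvec;                         % dequantized approximation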
This is a useful way to save space without sacrificing accuracy. For instance, with the following call to RMSE,
>>RMSE(4, 2, 16, 8)
we get an RMSE of 2.7186e-04 with a maximum of 8 bits, whereas with
>>RMSE(4, 1, 16, 4)
we get an RMSE of .0028. Although the comparison might seem a bit unfair because the first has \(b_{max} = 8\) and the second has \(b = 4\), the former actually allots fewer bits in total in this case (58 vs. 128), because it uses only one bit for each of the near-zero components.