So I listen to podcasts using the iOS podcast app. It has options to vary the playback speed (0.5x to 2x).
I am curious how this is implemented. I remember listening to cassette tapes at higher speeds, and I remember the “chipmunk” voice effect. However, even though the voice’s frequencies were shifted, it seemed easier to understand than Apple’s implementation.
In Apple’s podcast app, it sounds like they try to maintain the person’s voice so they don’t sound like a chipmunk. That is good, right?
Well, it seems to me that it is harder to understand. Is it because they are actually taking out tiny portions of the audio and then stitching it back together? That would explain why it is harder to understand, because they are actually removing sound.
Does anyone here know the real answer?
There…are…lots…of…pauses…in…audio.
Those can be removed without seriously hurting the intelligibility of the source, or changing the pitch. If you go too far, though, your brain can’t process the information fast enough…
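To make the pause-trimming idea concrete, here’s a minimal sketch in Python/numpy. The 20 ms frame size and the silence threshold are arbitrary numbers picked for illustration; a real “smart speed” feature (like Overcast’s) is much more careful about what it cuts and how it smooths the joins.

```python
import numpy as np

def trim_pauses(samples, sample_rate, frame_ms=20, silence_threshold=0.01):
    """Drop frames whose RMS level falls below a threshold (crude pause removal)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    kept = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms >= silence_threshold:  # keep anything that isn't near-silent
            kept.append(frame)
    return np.concatenate(kept) if kept else samples[:0]

# Example: a clip with a half-second gap in the middle gets noticeably shorter,
# but the tone itself (and its pitch) is untouched.
sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
audio = np.concatenate([tone, np.zeros(sr // 2), tone])
print(len(audio) / sr, "s before,", len(trim_pauses(audio, sr)) / sr, "s after")
```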
Here’s some background from Wikipedia.
Speeding up a tape changes the pitch because when you play an analog recording faster, the frequency of the output changes.
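You can reproduce the same effect digitally by keeping the samples identical and just declaring a higher sample rate, which is roughly the digital equivalent of playing the tape faster. The file names below are placeholders:

```python
import numpy as np
from scipy.io import wavfile

sr = 44100
t = np.linspace(0, 2, 2 * sr, endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)  # 2 s of A440

# Same samples, but tell the player they were recorded at 1.5x the rate:
# the clip now lasts about 1.33 s and the tone is heard near 660 Hz -- chipmunk voice.
wavfile.write("normal.wav", sr, tone)
wavfile.write("chipmunk.wav", int(sr * 1.5), tone)
```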
Once you’re dealing with a digital signal, you can do smarter things with it. For the most part, podcast playback isn’t really removing pauses (although the podcast app Overcast does do this); it’s just shortening the duration.
Without going too deep into how signal processing works, here’s one way to think about it. Complicated sounds can be created by combining a bunch of simple tones.
So, there’s a 200 Hz tone that’s louder at some times and quieter at others, and a 400 Hz tone that’s louder at some times and quieter at others, and so on. A full representation of sound “decomposed” in this way would require an infinite number of separate tones, but because human ears aren’t perfect, we only really care about the tones in a particular range.
So, let’s say that your digital file contains a representation of each tone and how loud it is over time. If you scale just the “loudness” part for each tone, but don’t change the tone, then your total sound gets shorter, but the tones don’t change. In reality, it’s more complicated than that because changing the length of the sound actually changes how the separate tones combine to form the full sound, but that’s the gist of it.
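I don’t know what Apple actually uses under the hood, but here’s a rough sketch of that “keep the tones, compress the loudness-over-time part” idea using librosa’s off-the-shelf time stretcher (a phase-vocoder-style algorithm). The file path is just a placeholder:

```python
import librosa
import soundfile as sf

# Load a clip at its native sample rate (path is a placeholder).
y, sr = librosa.load("podcast_clip.wav", sr=None)

# Time-stretch to 1.5x speed. Each frequency band's "loudness over time" is
# compressed, but the frequencies themselves are left alone, so the voice
# gets faster without turning into a chipmunk.
y_fast = librosa.effects.time_stretch(y, rate=1.5)

sf.write("podcast_clip_1.5x.wav", y_fast, sr)
print(f"{len(y) / sr:.1f} s -> {len(y_fast) / sr:.1f} s")
```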
If you’re interested in more information, you kinda have to start with how a Fourier Transform works. Here’s a pretty good starter guide to the Fourier Transform. Don’t be dissuaded by the math. It’s just a formal definition of the “complicated signals can be constructed by adding up lots of simple signals” idea.
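To see the “add up simple tones” idea in action, here’s a tiny numpy example: build a signal out of two sine waves, then let the FFT recover which tones it contains and how loud each one is.

```python
import numpy as np

sr = 1000                      # samples per second
t = np.arange(sr) / sr         # one second of time
# A "complicated" signal: a 50 Hz tone plus a quieter 120 Hz tone.
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
magnitudes = np.abs(spectrum) / (len(signal) / 2)  # scale so peaks match amplitudes

# The two strongest components come back out at 50 Hz and 120 Hz.
top = np.argsort(magnitudes)[-2:]
for i in sorted(top):
    print(f"{freqs[i]:.0f} Hz, amplitude {magnitudes[i]:.2f}")
```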
Though I haven’t specifically used the iOS podcast app, I often use other software that digitally changes the playback speed of audio. I (usually) haven’t noticed that it makes things any harder to understand. If it does for you, it may be due to one or more of the following factors:
(1) The quality of the original sound file (bit rate, and how well it was originally digitized)
(2) Complexity of the sound (music vs voice)
(3) Sophistication of the speed-changing software in your app
(4) How fast you’re speeding it up