There comes a point when a podcaster wonders “How should I be encoding my mp3s? What’s the right quality vs. file size trade off?” Hopefully this thought occurs before releasing Episode 1, but it is common to read a podcasting tutorial or book and just go with whatever that source is recommending. There’s really nothing wrong with that — the golden rule of audio production is if it sounds good, then you’re doing it right!
I’m only talking audio podcasts here. Video is out of the scope of this post.
There is an argument that file sizes don’t really matter given high speed Internet connections and cheap disk drives. The problem with this line of thinking, however, is that a surprising number of people don’t have broadband (for example, only 60% of households in the US). A lot of people consume podcasts via limited mobile data connections, or via coffee shop wifi where they are limited in both download speeds and time.
Many others listen from data CDs burned by friends, or copy episodes to their phones or to older mp3 players which have limited storage capacity. Data discs are common for military personnel deployed overseas, for example.
Aside from this, given a little testing time up front, there is no future impact to your workflow to provide smaller files. Taking these guidelines into account, there should be minimal (if any) quality loss in providing smaller files, so what’s the downside?
Below I’m going to address each aspect of encoding a podcast and what the “industry best practices” are for each setting. I’ll also go a little deeper into the technical aspects of each setting and why they make a difference. Not too deep since most people don’t care, and each topic alone could easily fill its own podcast episode.
Hmm.. perhaps a podcast mini-series would be helpful? Let me know in the comments if there is any interest in such a thing.
Just want “the answer” without explanations?
Encode to MP3 at 44.1 kHz, 96 kbps CBR Mono (check the note about encoders toward the end of this post though to make sure you are truly getting 96 kbps). These settings are good for most spoken word podcasts and will result in small file sizes.
If you have a music podcast, release either 128k or 192k joint stereo depending on the focus.
If you are producing an audio book and intend to release on Podiobooks, their current requirement is 44.1 kHz, 128 kbps CBR, Joint Stereo, MP3.
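If you encode with the command line LAME encoder, the settings above can be sketched roughly like this. It is a minimal sketch that only builds the command, and it assumes a LAME build that accepts the usual `--cbr`, `-b`, `-m` and `--resample` options. The function name and file names are made up for illustration.

```python
import shlex

def lame_command(src, dst, kbps=96, mode="m", sample_rate_khz=44.1):
    """Build an argument list for the LAME command line encoder.

    mode: "m" = mono, "j" = joint stereo, "s" = simple stereo.
    The defaults mirror the spoken word recommendation above.
    """
    return [
        "lame",
        "--cbr",                             # constant bitrate
        "-b", str(kbps),                     # bitrate in kbps
        "-m", mode,                          # channel mode
        "--resample", str(sample_rate_khz),  # output sampling rate in kHz
        src,
        dst,
    ]

# Hypothetical file names, for illustration only.
print(shlex.join(lame_command("episode.wav", "episode.mp3")))
print(shlex.join(lame_command("music.wav", "music.mp3", kbps=128, mode="j")))
```

Pass the returned list to `subprocess.run` if you want to actually run the encode. Check the output file afterwards, since encoders don't all interpret the bitrate setting the same way.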
As a side note, most of this post is in reference to a standard audio podcast episode. Encoding a podcast promo is a little different though. You should always use the highest quality settings and forget about file size. For episodes, you want a 30 minute, 60 minute, or 3 hour podcast to have a small file size to accommodate listeners with limited bandwidth or limited storage space on their mp3 players. Promos should be short (no more than two minutes), so no matter what quality setting you choose, the files will be sufficiently small.
Most people are either going to stream your promo from your website (your promo is on the Front or About pages so that people can easily find it, right??), hear it on another podcast, or download from this site (of course 😉 ). In all of these scenarios, you want the audio quality to be as high as possible so that listeners pay full attention to your content without distractions.
When another podcaster includes your promo in their show they are going to import the file, place it, fade it in and out, and export with the rest of an episode using their preferred settings. You have no idea what their settings are, so you want to make sure that you give them the best quality to work with. In a perfect world, this means an uncompressed WAV or AIFF file, but most people can’t tell the difference between that and a high bitrate MP3 when it comes to the spoken word, so you might as well go the easy route.
If you want great sounding audio you need to start with good equipment. The better your audio source, the better the compressed end result will sound. That doesn’t mean expensive equipment, just good equipment. Every person’s voice is different, and different microphones will sound better on some voices than others. The hard part is finding out what will work best for you.
The best microphone definitely isn’t going to be the one built into the screen of your laptop! But there are lots of inexpensive headsets and USB mics with decent quality. Your best bet to find a good mic for your voice is to head to a music store and ask to try some out. Tell them that you are looking to do “voice over” or “narration” work and if they know what they’re doing they should be able to point you to a couple models in different price ranges. Frequently they’ll plug you into a PA system or headphones and let you talk into it for a bit to see how it sounds.
Another good way to go is to head to a podcaster or musician friend’s place and try out their equipment.
Similar to microphones are preamps and mixing equipment. They don’t have quite as big of an impact and get expensive fast, so I’ll leave that topic to another post.
The first consideration is what file type to use. Different file types have different features and support varies from hardware player to hardware player.
The most popular for podcasts are MP3 and AAC. Unless you have a special requirement that only AAC fulfills (primarily “enhanced” episodes with bookmarks and/or slide show) use MP3. Just about every audio player supports MP3. Many (most?) support AAC, but there is little reason to take that chance.
Other popular formats include WMA (Microsoft), Ogg Vorbis, FLAC, and Musepack. There are valid reasons to use each of them that range from philosophy to device support, but for the largest possible audience reach, MP3 is still the way to go.
That’s not to say you just have to pick one! Many podcasts have multiple feeds, each feed dedicated to a different format. There are tools to encode to multiple formats automatically, so once your feeds are set up there isn’t much work needed to fill them with the appropriate files.
Bitrate is measured in kilobits per second (kbps, or frequently shortened to just “k”). Basically, the higher the bitrate, the better sounding the audio, but also the larger the file size.
But what is the best bitrate to use? 128 kbps is “safe” for spoken audio. Depending on the program you use for encoding (more about that later) the result will either be two 64 kbps streams (stereo), or a single 128 kbps stream. Both should sound great through ear buds, computer or car speakers.
If you have good recording equipment and record at a decent sampling rate and bit depth, then you can encode to 64 kbps without any noticeable degradation in sound quality.
An example I frequently cite is the “This Week In…” podcasts by Leo Laporte. They use high quality microphones in the studio with excellent preamps and analog to digital converters. They release their shows in 64 kbps CBR, 44.1 kHz, mono and sound fantastic. A 90 minute episode is only a 42 megabyte download.
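The file size of a CBR file is simple arithmetic: bitrate times duration. Here is a small sketch (the function name is my own) that ignores ID3 tags and artwork, which add a little on top.

```python
def mp3_size_megabytes(kbps, minutes):
    """Approximate CBR file size in megabytes (1 MB = 1,000,000 bytes)."""
    bits = kbps * 1000 * minutes * 60  # kilobits per second -> total bits
    return bits / 8 / 1_000_000        # bits -> bytes -> megabytes

for kbps in (64, 96, 128):
    print(f"{kbps} kbps, 90 minutes: {mp3_size_megabytes(kbps, 90):.1f} MB")
```

At 64 kbps a 90 minute show works out to roughly 43 MB, in line with the figure above, while the same show at 128 kbps is twice that.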
Most encoders have a setting for the type of encoding as well as rate. The major types are Constant Bitrate (CBR), Variable Bitrate (VBR), and Average Bitrate (ABR).
CBR is just what it sounds like – you pick a bitrate and the encoder squishes all of your audio to fit it, no matter how sonically complex the source material is.
VBR and ABR are given an acceptable range and automatically increase or reduce the rate depending on what is going on sonically. If there are a lot of frequencies being represented (like music) then it will increase the bitrate to make that sound better. If there isn’t much going on (silence during pauses) it drops the rate since there isn’t much to reproduce. Again, this explanation is simplified as there is a lot of psychoacoustic juju happening.
It is best to use CBR because some playback devices have either no or non-standard support for the others. CBR files are bigger, but not by a huge margin, so the wider compatibility is a good trade off.
Computer audio files are digital. In the analog world (how your ears work) audio is a smooth wave. Digital is binary, so everything is either a one or a zero. This conversion is often referred to as quantization.
Sample rate determines the frequency with which the system measures the amplitude of the analog audio wave. CD audio uses a 44.1 kHz sampling rate, which means that it takes 44,100 samples every second. DVDs use 48 kHz.
Because of the crazy math involved in digital audio, your equipment can only reproduce frequencies up to about half of the sampling rate. So, if you record at 44.1 kHz you are only reproducing up to about 22.05 kHz at playback, which is fine since it’s still over the maximum that most humans can perceive (20 kHz max, 16 kHz for most people). For reference, traditional telephone systems sample at only 8 kHz, which gives you an idea of the quality you get from recording at 44.1 kHz.
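The "half the rate" rule is easy to check for yourself. This is a tiny sketch (the function name is my own) that treats the telephone figure as a sampling rate alongside the rates mentioned in this post.

```python
def highest_frequency_khz(sample_rate_khz):
    """Upper limit of reproducible frequency: half the sampling rate."""
    return sample_rate_khz / 2

for rate in (8, 32, 44.1, 48):
    print(f"{rate} kHz sampling -> up to about {highest_frequency_khz(rate):g} kHz")
```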
The MP3 standard allows for sampling rates of 32, 44.1 and 48 kHz but because of inconsistencies of media player devices (are you seeing a pattern here?) it is best to stick with 44.1 kHz.
There are good reasons, by the way, for recording at a higher rate than humans can perceive. In short, if there is an amplitude spike (or drop) between samples, then you get more noise. It’s out of the scope of this discussion, but if interested you can research oversampling and the work of Harry Nyquist. Given the limited range of human speech, this is much more important when recording singing or other audio sources.
Bit depth determines how many different volume measurements the system has to work with. In other words, if you think of the audio as being measured by a ruler, the bit depth is how many notches that ruler has. Some rulers only measure in whole inches (low bit depth), while others allow you to measure within one-sixteenth of an inch.
A ruler with more notches allows for a greater number of measurement options, and therefore a more accurate measurement.
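To put numbers on the ruler analogy, here is a simplified sketch of a uniform quantizer. The function names are my own, and real converters are more sophisticated than this, so treat it as an illustration only.

```python
def levels(bit_depth):
    """Number of notches on the ruler for a given bit depth."""
    return 2 ** bit_depth

def quantize(sample, bit_depth):
    """Round a sample in the range -1.0..1.0 to the nearest notch."""
    step = 2.0 / levels(bit_depth)  # distance between adjacent notches
    return round(sample / step) * step

sample = 0.3
for bits in (8, 16, 24):
    error = abs(quantize(sample, bits) - sample)
    print(f"{bits}-bit: {levels(bits):,} levels, rounding error {error:.10f}")
```

The rounding error can never be more than half a notch, so every extra bit cuts the worst case in half.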
Bit depth is largely irrelevant in a compressed audio file and there aren’t any settings to check, but I wanted to define it here in relation to your initial recording. The fact is that you can get away with much smaller bitrates if your initial recording is of high quality.
Therefore it is always best to record in 24bit and edit in at least 24bit (a lot of software uses 32bit float internally, so no worries). When it comes time to export, you’ll choose 16bit.
Similar to the notes above related to sampling rate, in truth human speech is only perceived over a range of about 40 decibels (dB), but the dynamic range of human hearing is about 140 dB, which comes out to just over 23 bits.
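The "just over 23 bits" figure comes from the rule of thumb that each bit is worth about 6 dB of dynamic range. Here is a quick sketch of that arithmetic (function names are my own).

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # roughly 6.02 dB per bit

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range for a given bit depth."""
    return bit_depth * DB_PER_BIT

def bits_for_range(db):
    """Bits needed to cover a given dynamic range."""
    return db / DB_PER_BIT

print(f"16-bit: about {dynamic_range_db(16):.0f} dB")
print(f"24-bit: about {dynamic_range_db(24):.0f} dB")
print(f"140 dB of hearing needs about {bits_for_range(140):.2f} bits")
print(f"40 dB of speech needs about {bits_for_range(40):.2f} bits")
```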
Stereo, Joint Stereo, Mono?
Mono means a single signal or channel that is usually sent equally to all speakers/headphones. Stereo means that there are separate signals sent to each side, which is most prevalent in music.
Most podcasts don’t need to be in stereo. If you have one host, or even a full panel you are probably not mixing it so that people are panned around the headphones – each microphone is probably straight down the middle.
By the way, if a light bulb just went off and you thought “hey, that’s a cool idea!” – don’t do it. It causes problems for people playing shows in their car (voices will seem quieter to the driver if a voice is coming out from their car’s right door) and it can be confusing or fatiguing to ears using headphones. Also, it will either double the file size, or halve the resulting quality which is bad either way.
A lot of modern computer microphones and recording software will record a stereo signal, even though your voice is inherently mono. A lot of people just roll with whatever was recorded because they don’t realize that this is unnecessarily doubling the size of their files!
When doing double ended recordings over Skype, or swapping raw files with another podcaster, you should always export in mono before sending. There’s just no need to have a stereo file from a microphone for voice. The transfers will be faster and it takes half the drive space for working with and backups.
Similarly, unless you are doing a music podcast, export your episode to mono. Your intro/bed music won’t suffer for it since it is short, or played quietly under the primary content of the show.
In your podcast encoder, if the source is mono, it shouldn’t matter what stereo setting you use – the file size will usually be the same (or halved, which I’ll cover a little later). Encoders will act differently though, so be sure to test yours and see if you have to explicitly set it to mono or not.
One thing that everybody “knows” is that lots of mp3 players have problems playing mono mp3 files, so the industry standard is to set your encoder to “joint stereo”. I haven’t been able to find definitive truth to this. It may have been true five or six years ago, but likely isn’t true any more. I think it’s telling that the TWiT network releases in mono, and is one of the most widely listened to podcasting networks. If this was a problem – they’d be hearing about it!
Regardless, as long as your source is mono, encoding to joint stereo shouldn’t result in a larger file size, so there’s no harm in playing it safe.
Here are the differences between the primary stereo settings:
- Simple Stereo – the encoder will evaluate the audio coming in from each channel of an input file and can allocate more or less bits to the side that has the most complex audio happening
- Joint Stereo – lots of magic happens (mid/side processing) which results in a better sounding file with efficient compression. Other techniques are faster, but have a higher risk of sounding bad if there are issues with the source file
- Mono – if the input file is stereo, the separate channels will be summed into a single stream, then attenuated by 6 dB. The attenuation is necessary since summing the channels will increase the amplitude which could result in distortion or clipping
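The mono case can be sketched in a few lines. This is only an illustration of the sum-then-attenuate idea using plain lists of samples between -1.0 and 1.0. Real encoders may handle the details differently, and the function name is my own.

```python
def downmix_to_mono(left, right):
    """Sum two channels, then attenuate by about 6 dB (a gain of 0.5)."""
    gain = 0.5  # 20 * log10(0.5) is roughly -6.02 dB
    return [(l + r) * gain for l, r in zip(left, right)]

# A voice recorded identically on both channels comes out unchanged,
# instead of doubling in amplitude and clipping.
left = [1.0, -1.0, 0.5]
right = [1.0, -1.0, -0.5]
print(downmix_to_mono(left, right))
```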
Codecs / Encoders
This is one of the most contentious areas, and requires some personal testing for your audio set up and content. In fact, when researching the topic I found no empirical testing with regards to podcasting. I don’t think I found a single article, blog post, or forum entry that wasn’t pure opinion, regurgitating what is “known” without anybody pointing to actual testing to prove it.
I found this to be so frustrating that I’m going to embark on my own ABX testing of the most popular encoders for podcasting and post the results. ABX testing is a lengthy process, so it will likely be a couple months before I can share results, but I think it will be worth it.
Most encoder discussions revolve around audiophile sites where they want to achieve complete audio transparency from the original source. This is great, but spoken word in general and podcasting specifically is very different from the remastered Sgt. Peppers album (in mono, of course!)
Most commercial and “pro” software, such as iTunes, Audition, Cubase, and Logic, uses the Fraunhofer encoder. Most open source, free, shareware, and independent audio software, such as Reaper, Audacity, DropMP3, and dBpoweramp, uses LAME.
One might assume that Fraunhofer is inherently better since the “big guys” use it. This is actually a legal issue, not a quality issue. Fraunhofer holds patents for some of the technology that makes MP3 work. The LAME developers don’t pay this license, so they only release source code for “educational purposes”. It is illegal for any program to include the LAME libraries unless they pay a license to Fraunhofer.
Fraunhofer negotiates the license rates based on the anticipated distribution and use of the software, so naturally, the big guys get better rates licensing the FhG encoder as opposed to alternative encoders that use some of the technologies covered by the patents.
This is why LAME isn’t actually distributed with Audacity and you have to go to a separate site to download it.
Most podcasters have heard that using Fraunhofer through iTunes gives the best resultant file, but this simply isn’t true any more. It used to be that LAME was better at high bitrates and Fraunhofer excelled at low bitrates. It is difficult to find information on just how often Fraunhofer updates their technology, but from most accounts it was in 2008 (and they had been focusing their energy on surround sound and high definition audio).
In contrast, LAME last released in November, 2011 (as of this writing in June 2012), and includes contributions from the whole gamut of audio enthusiasts, so all levels of encoding get attention.
You really need to do a test with some of your own recorded audio to see what will work best for you. Export a couple minutes from an episode (say, from intro through to part of some monologuing or discussion) and encode it using the suggested settings above with a couple different programs, then do an AB comparison of the results.
If you think it sounds better with one program or the other, then go with that!
One thing to look out for is how some programs interpret your choice of bitrate (h/t to Max Flight). Audacity (as well as Reaper, Ardour, and the command line LAME encoder) will take the target bitrate and encode to that exact number whether the source file is mono or stereo. This means that if you specify 128 kbps and the source is mono, then all 128 kbps is dedicated to the output. If you have a stereo source, then each channel is encoded at 64 kbps giving a resulting 128 kbps file overall. This is a generalization as there is some psychoacoustic and bit sharing magic happening, but the important part is that it does exactly what you told it to.
iTunes, on the other hand, will encode a stereo source file to 128 kbps (64 kbps per channel, as you’d expect) but a mono source file to only 64 kbps. So if you want a mono file to encode to the full 128 kbps, you actually have to specify 256 kbps in the options.
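Here is a small sketch of the difference between the two behaviors. It only models what I observed in my own tests, so the function and its `halves_mono` flag are hypothetical, and you should confirm how your own program behaves.

```python
def setting_for_target(target_kbps, source_channels, halves_mono):
    """Bitrate to pick in the encoder options to end up at target_kbps.

    halves_mono=True models a program that treats the setting as the
    stereo total and gives a mono source only half of it (the iTunes
    behavior described above). halves_mono=False models a program that
    encodes to exactly the number you choose (Audacity, LAME, etc.).
    """
    if halves_mono and source_channels == 1:
        return target_kbps * 2
    return target_kbps

print(setting_for_target(128, 1, halves_mono=True))   # 256
print(setting_for_target(128, 2, halves_mono=True))   # 128
print(setting_for_target(128, 1, halves_mono=False))  # 128
```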
I’m not sure about other programs, as I’ve only tested those five, which reinforces the need to spend a little time to test the programs you intend to use with your own content.
Verifying Your Files
The easiest way to verify the bitrate and other technical aspects of your files is to open them with VideoLAN (VLC). Aside from being a great media player, it is open source, and hitting Control-i (or opening the Tools drop down and choosing “Media Information”) while a file is playing will show you the internals.
The open source command line utility FFmpeg will display this information as well. Just type “ffmpeg -i [filename]” to view all sorts of information about the file.
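If you want to check a batch of files, the same information can be pulled out with a script. This is a rough sketch that assumes `ffprobe` (which ships alongside FFmpeg) is installed and on your path, and that the first audio stream reports a bitrate. The function names and the file name in the comment are made up for illustration.

```python
import json
import subprocess

def ffprobe_command(path):
    """Argument list asking ffprobe for the first audio stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,sample_rate,channels,bit_rate",
        "-of", "json",
        path,
    ]

def summarize(ffprobe_json):
    """Reduce ffprobe's JSON output to the settings discussed in this post."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return {
        "codec": stream["codec_name"],
        "sample_rate_hz": int(stream["sample_rate"]),
        "channels": stream["channels"],
        "kbps": round(int(stream["bit_rate"]) / 1000),
    }

def probe(path):
    """Run ffprobe on a file and return the summary."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return summarize(result.stdout)

# Example (hypothetical file name):
# print(probe("episode.mp3"))
```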
Many other media players will show this information, you just need to dig around the menus.
As you can see, there are a lot of variables that go into how to make the best sounding episodes with the smallest possible file sizes.
Use the best equipment you can, record at high quality, export to mono, and encode with the program that worked best in your own testing. You want to aim for 64 kbps per channel to achieve great sound with a small file size.