Checksum mp3 audio frames (the data and not the headers)

DJ 1E and I have often lamented over the difficulties in version control with our audio collections.  Particularly with our usage of MusicBrainz, there are often pre- and post- processed files that have the same audio data, but different headers.  Traditional full-file checksums (e.g. md5sum) will present different checksums given the header variations.

The ideal would be a tag program that can checksum the mp3/ogg/m4a/aac audio stream and store the generated hash as a tag, similar to how flac handles its crc sums.

I did a bit more digging into this matter today, and it seems that some code is moving in this direction. I found a good but dated thread over at Hydrogen Audio asking a similar question: Is there a Tool for MP3 checksum generation?, Audio part only.

This thread introduces LAMEtag, a Windows command-line exe with a GUI component as well. Unfortunately, this otherwise handy utility only works on LAME-encoded mp3s (by design, obviously). It is good for verifying CRC checksums if they were encoded into the frames with LAME. While some of my collection fits this bill, I need a more general solution.

The same thread goes on to mention MP3tag, another Windows utility (GUI) that offers a wealth of features, including md5 checksums of the audio only. It took me a bit of trial and error to realize that generating and validating said sums has to be done through the program’s export functionality.

$filename(txt)$loop(%_filename_ext%)%_filename_ext% %_md5audio%
$loopend()

I eyeballed the results and sure enough, they matched.

Unfortunately, my archives live in Linux, and having to transfer two similar files every time I want to compare them is inconvenient. I need command line! SebastianG offers a Java .jar called mp3d5.jar that would apparently do the trick, but the only links I can find to that bit of magical code turn up 404.

Fortunately, this post at StackOverflow is on the same path and would provide me with the silver (plated at least) bullet for this issue.

Calculate checksum of audio files without considering the header

I want to programmatically create a SHA1 checksum of audio files (MP3, Ogg Vorbis, Flac). The requirement is that the checksum should be stable even if the header (eg. ID3) changes.
Note: The audio files don’t have CRCs

The author wants the same thing I am searching for: the ability to generate a checksum of the audio stream and store it in the file header as a tag. Furthermore, he mentions his use of mp3cat! I pulled down a copy of mp3cat and compiled it on my archive box. Then the fun began.

File1 and File2 are two mp3s with the same header information and content, but differing file sizes. File3 is a known “non-matching” mp3.

% ./mp3cat - - < file1.mp3 | md5sum
e9f10503ea7afd9adf676e6f20370e45  -
% ./mp3cat - - < file2.mp3 | md5sum
e9f10503ea7afd9adf676e6f20370e45  -
% ./mp3cat - - < file3.mp3 | md5sum
793774c7956a5fcacb7062b58b8c4677  -
% eyeD3 file1.mp3 > file1.id3
% eyeD3 file2.mp3 > file2.id3
% diff file1.id3 file2.id3
2c2
< file1.mp3     [ 5.20 MB ]
---
> file2.mp3     [ 5.20 MB ]
6c6
< ID3 v2.4:
---
> ID3 v2.3:
% ls -l *.mp3
-rwx------ 1 tmo  tmo 5447834 2009-01-12 01:19 file1.mp3
-rwxr-xr-x 1 tmo tmo 5448730 2009-07-19 18:31 file2.mp3

See the diff? Conflicting ID3v2 tag versions. One file is using 2.3 and the other is using 2.4, which would likely explain the negligible difference in file size.
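The two files above differ by 896 bytes (5448730 versus 5447834). To see how much of a file the leading tag takes up, a small sketch like the one below can read the 10-byte ID3v2 header directly. It relies only on the header layout from the ID3v2 specs ("ID3", two version bytes, one flags byte, four syncsafe size bytes). The function names are my own, and I have not run it against the files above, so whether the tags account for the whole gap is an assumption.

```python
def parse_id3v2_header(data: bytes):
    """Return (version_string, total_tag_bytes), or None if no ID3v2 tag leads the data."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size_bytes = data[6:10]
    # The size is "syncsafe": four bytes of 7 bits each, high bit always clear.
    if any(b & 0x80 for b in size_bytes):
        return None
    size = 0
    for b in size_bytes:
        size = (size << 7) | b
    total = 10 + size              # 10-byte header plus the tag body
    if major == 4 and flags & 0x10:
        total += 10                # optional footer, ID3v2.4 only
    return f"2.{major}.{revision}", total


def describe(path):
    """Hypothetical helper: report the leading tag of one file on disk."""
    with open(path, "rb") as handle:
        return parse_id3v2_header(handle.read(10))
```

Calling describe() on each file and comparing the byte counts shows how much of the size difference sits in the tag (padding included) rather than in the audio.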

Firing up MusicBrainz Picard and looking at my configuration, I see that I’d checked the Write ID3v2.3 tags option (instead of the 2.4 default).

[Screenshot: Picard ID3v2 options]

I unchecked this box but still had questions.  What’s the difference between 2.3 and 2.4?  What do the different character encodings afford?

A glance at the help file for MusicBrainz reveals the following:

ID3v2 version

The ID3v2 standard, as used for MP3 files, is an example of a ambigous, misunderstood and misimplemented standard. Every MP3 player maker, either in hardware or software, implements a different state of this standard to his own understanding. To ensure compatibility with your player, you can select the ID3v2 standard version for your files. Version 2.4 is the current standard version, but its support in players is currently lacking. If your player doesn’t show the tags with version 2.4, try using version 2.3. In an extreme case, you could try including ID3v1 tags which are obsolete and don’t work with non-latin scripts properly.

ID3v2 text encoding

If you don’t know what this is about, use the default. For the ID3v2 standard, you can choose the internal encoding for the tags. Up to version 2.3, only ISO-8859-1 (also known as Latin-1) and UTF-16 has been available. UTF-8 support has been introduced with version 2.4 and is thus unavailable with 2.3. Normally, you should go with either UTF-8 or UTF-16, depending on your ID3v2 version. Use ISO-8859-1 only if you face compatibility issues with your player.

That explains a good chunk of the difference. Going forward I will use the 2.4 tags, though I have some concerns as to whether the newer tags will confuse some of my older hardware players (my 500GB iRiver, for instance).

This also introduces another variable or two that I hadn’t considered: ID3 tag versioning and the character encoding of those tags. Ideally, I’d want backward compatibility with ID3v1 for ancient devices. But for the sake of functionality and depth of options, I’d want to eventually migrate my tags (across ogg, mp3, aac, m4a) to v2.4 with either UTF-16 or UTF-8.

I am still confused about the character encoding. The always informative Hydrogen Audio forums covered the topic in this thread:  UTF-8 vs UTF-16 in ID3 Tags?

The takeaway for me being:

But together with UTF-8 ID3v2.4 has some cleanups and improvements against v2.3 so this latest format is a recommended one indeed.

My portable support only ID3v2.3 so I must use UTF-16 in v2.3 for it. But most of modern software already support v2.4. Most often mistake in ID3 tags is usage ISO-8859-1 marker with text in other 8-bit codepage, both for ID3v1 and ID3v2. This is non-standard and will get wrong text info in standard complaint software/hardware. But unicode – ether UTF-16 in ID3v2.3 or UTF-8 in ID3v2.4 – both correct and standard. It’s your choice what to use.
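To make the encoding choice concrete: as I read the ID3v2 specs, every text frame body starts with a single byte naming its encoding, and the set of allowed values grew between 2.3 and 2.4. The sketch below encodes a title under each option and refuses combinations the tag version does not allow. The table and function names are my own, not taken from any tagging library.

```python
# Encoding byte -> (Python codec, first ID3v2 version that allows it)
ENCODINGS = {
    0: ("latin-1", 3),    # ISO-8859-1
    1: ("utf-16", 3),     # UTF-16 with a byte order mark
    2: ("utf-16-be", 4),  # UTF-16BE without a byte order mark, ID3v2.4 only
    3: ("utf-8", 4),      # UTF-8, ID3v2.4 only
}


def encode_text(text: str, encoding_byte: int, tag_version: int) -> bytes:
    """Return a text frame body: the encoding byte followed by the encoded text.

    tag_version is 3 for ID3v2.3 and 4 for ID3v2.4.
    """
    codec, first_version = ENCODINGS[encoding_byte]
    if tag_version < first_version:
        raise ValueError(f"encoding {encoding_byte} is not valid in ID3v2.{tag_version}")
    # Raises UnicodeEncodeError if the text does not fit the codec,
    # e.g. Japanese text under latin-1.
    return bytes([encoding_byte]) + text.encode(codec)
```

For a title like "Björk", Latin-1 costs one byte per character, UTF-8 costs two bytes for the "ö", and UTF-16 costs two bytes per character plus a two-byte byte order mark. This matches the forum advice: UTF-16 is the only Unicode option in a 2.3 tag, while 2.4 also allows UTF-8.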

mp3cat seems to be the solution for now, and I will benefit from creating a script that uses it to verify files in my collection.
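A first pass at such a script might look like the sketch below. It pipes each file through the same `./mp3cat - -` invocation used in the session above, takes the md5 of whatever comes out, and groups files whose audio matches. The function names are placeholders of mine, and it assumes mp3cat sits in the current directory and exits with status 0 on success.

```python
import hashlib
import subprocess
from collections import defaultdict

# Same invocation as in the shell session: read stdin, write frames to stdout.
MP3CAT = ("./mp3cat", "-", "-")


def audio_md5(path, command=MP3CAT):
    """Return the md5 hex digest of the file after filtering it through `command`."""
    with open(path, "rb") as source:
        result = subprocess.run(
            command, stdin=source, stdout=subprocess.PIPE, check=True
        )
    return hashlib.md5(result.stdout).hexdigest()


def group_by_audio(paths, command=MP3CAT):
    """Map each audio digest to the list of files that produce it."""
    groups = defaultdict(list)
    for path in paths:
        groups[audio_md5(path, command)].append(path)
    return dict(groups)
```

Calling group_by_audio(["file1.mp3", "file2.mp3", "file3.mp3"]) should put file1 and file2 under one digest and file3 under another, matching the md5sum output above. Any digest with more than one file marks copies that differ only in their tags.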
