The magic of file
Ever wonder how file(1)1 works?
After reading this post, you should understand all about file and
its underpinnings.
file
Using the file tool is quite simple. It’s just file file-name-1.
You can call it with multiple arguments too. Here’s a few examples of
running it on my system.
$ file document.pdf
document.pdf: PDF document, version 1.4
$ file image.jpg
image.jpg: JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=7 ...orientation=upper-left, datetime=2019:10:30 21:44:30, width=1740], baseline, precision 8, 2017x961, components 3
$ file video.mp4
video.mp4: ISO Media, MP4 v2 [ISO 14496-14]
$ file notes.org
notes.org: UTF-8 Unicode text, with very long lines
$ file /dev/video0
/dev/video0: character special (81/0)
You can already see a few things:
-
It works on any file.
-
It tries to examine the contents of the file (not just the extensions) to figure out the file type.
-
It understands file formats in some detail. It can extract extra information from the file itself. You can see that it understands versions for mp4 and jpeg files.
-
It knows about special files like devices.
There’s a few gotchas, though.
-
filemay not do what you want in shell scripts.$ file no-such-file no-such-file: cannot open `no-such-file' (No such file or directory) $ echo $? 0This version of
filealways seems to exit with a code of 0. Even whenfilecan’t figure it out. Even if the file doesn’t exist. -
For many formats that are not common on Unix/Linux,
fileis particularly fragile:$ file Vagrantfile Vagrantfile: HTML document, ASCII text $ file microsoft.netcore.app.2.1.19.nupkg microsoft.netcore.app.2.1.19.nupkg: Microsoft OOXMLYes, it really thinks a
Vagrantfileis an HTML document and a NuGet package file is an Office document!
But how does file actually figure out the file type?
Like other classic Unix tools, file has excellent documentation.
man file
points
out that file looks for things in this order.
-
It checks to see if a file is a special type of file, like devices.
-
It checks if the
magic(5)database contains a “magic pattern” that can identify this file type. -
It checks to see if it’s a text-file. In this case, file will try and find out additional properties, such as language, encoding, line-length and the line terminator type (CR, LF, or CRLF).
file uses the result from the first check that works. If none of
these checks are successful, file simply reports the file type as
‘data’.
The most interesting one of these is the magic(5) database option.
While the other checks are fairly static, you can examine the magic
database see how file works. You can even modify the database
manually to fix bugs2 in file!
magic
The magic database is a plain text file. It is often located at
/usr/share/magic. You can view it in your favourite editor.
/usr/share/magic contains a series of test-and-output lines. A test
describes something to look for. If the test matches, the output is
printed.
Here’s what a line might look like:
0 string Draw RISC OS Draw file data
The first 3 columns are the test and the final column is the output.
This line specifies that at if the string at position 0 in the file matches the constant string Draw, then show RISC OS Draw file data as the file type.
To make things easier for the people maintaining the database, the database supports nesting tests. That is, a test can say that it only applies if the previous test matched.
Here’s an example of a test and a nested test:
0 belong 0xcafebabe
>4 belong >30 compiled Java class data,
The first line is a test that can always be performed. The second line contains a test that should only be performed if the first test is successful, indicated by a “>” in the beginning of the line.
The first test here says: does the big-endian long at file position 0
match the constant 0xcafebabe. There’s no output column, so it
doesn’t do anything directly if it’s successful.
However, if the test is successful, the nested test runs. The nested test in this example checks if the big-endian long at position 4 in the file is greater than 30. If that’s true, print compiled Java class data3.
man 5 magic contains extensive documentation that covers all the
syntax and options available if you want to learn about this in more
detail.
a real life bug
Now that you know have a good understanding of file and magic,
lets see how you can use that to figure out a real life “bug”.
As part of releasing .NET Core this week, one of the tools in our release pipeline flagged an error saying the type of a particular file had changed. The error looked something like this:
file type changed for Microsoft.NETCore.App.Ref/3.1.0/data/PlatformManifest.txt:
-ASCII text
+Message Sequence Chart (chart)
The error is telling us that the file type changed from ASCII text to Message Sequence Chart (chart).
That error looks bogus. It’s a text file. The extension is .txt.
The contents look like plain text too:
$ head Microsoft.NETCore.App.Ref/3.1.0/data/PlatformManifest.txt
mscorlib.dll|Microsoft.NETCore.App.Ref|4.0.0.0|4.700.19.56404
System.IO.Compression.Native.a|Microsoft.NETCore.App.Ref||0.0.0.0
System.IO.Compression.Native.so|Microsoft.NETCore.App.Ref||0.0.0.0
So why is file saying this is a Message Sequence Chart? Let’s see if
we can find an answer.
We know what algorithm file uses. So we will walk through it.
First, file will check if it’s a special file. We can use stat to
find out what if that’s true.
$ stat -c '%F' Microsoft.NETCore.App.Ref/3.1.0/data/PlatformManifest.txt
regular file
Here, stat confirms that this is regular file.
Next, file will use magic to guess the file type.
Does the magic database know about this file type at all?
$ grep 'Message Sequence Chart' /usr/share/magic
0 string mscdocument Message Sequence Chart (document)
0 string msc Message Sequence Chart (chart)
0 string submsc Message Sequence Chart (subchart)
Aha! It does.
For this set of tests, magic looks at the string at position 0
and if it matches with msc (or mscdocument, or submsc), it
thinks this is a Message Sequence Chart.
If you recall, the first line of the text file was:
$ head Microsoft.NETCore.App.Ref/3.1.0/data/PlatformManifest.txt
mscorlib.dll|Microsoft.NETCore.App.Ref|4.0.0.0|4.700.19.56404
The first few bytes of the file incorrectly match with msc. magic
and file see the match and think this file is a Message Sequence
Chart (chart).
This was clearly a false alarm from file (and our tooling). We
ignored it.
closing thoughts
Hopefully, you now have a good idea of how file and magic work.
You also have a good starting point to use and debug these tools if
you never need to.
-
The
thing(number)convention is pretty common in Unix/Linux documentation. Thethingpart indicates the name of the concept. Thenumberindicates the type of the thing as well as the section number of man pages where this thing is documented. For example, someone can be talking aboutexec(1p)andexec(3). Byexec(1p)they mean the one that is documented in section 1: Executable programs. Soexec(1p)is theexecprogram. Byexec(3)they mean the one that is documented in section 3: Library calls. Soexec(3)is a library call. You can read the manual pages of a particular version by runningman 1p execorman 3 exec. ↩︎ -
If you do patch the database to fix a bug, please contribute the change back upstream. ↩︎
-
It turns out that Java class files and MACH executables use the same first sequence of bytes in the file (
0xcafebabe). The next set of bytes can be used to guess whether this is Java class file or a MACH executable. Java class files have the class file version here, which is closer to 50. MACH executables use a much lower value. That’s why the first test here for0xcafebabedoesn’t print anything: at this point,magicdoesn’t have enough information to claim if this is Java or not. ↩︎