Many programs and desktops use the MIME system[MIME] to represent the types of files. Frequently, it is necessary to work out the correct MIME type for a file. This is generally done by examining the file's name or contents, and looking up the correct MIME type in a database.
For interoperability, it is useful for different programs to use the same database so that different programs agree on the type of a file and new rules for determining the type apply to all programs.
This specification attempts to unify the type-guessing systems currently in use by GNOME[GNOME], KDE[KDE] and ROX[ROX]. Only the name-to-type and contents-to-type mappings are covered by this spec; other MIME type information, such as the default handler for a particular type, or the icon to use to display it in a file manager, is not covered since these are a matter of style.
KDE uses .desktop files, with Type=MimeType, one file
per type, to determine a file's type from its name. The files are arranged in the
filesystem to mirror the two-level MIME type hierarchy.
The syntax is very similar to that of other .desktop files,
with Name=, Comment= etc.
    [Desktop Entry]
    Encoding=UTF-8
    MimeType=application/x-kword
    Comment=KWord
    Comment[af]=kword
    [... etc. other translations ]
    Icon=kword
    Type=MimeType
    Patterns=*.kwd;*.kwt;
    X-KDE-AutoEmbed=false

    [Property::X-KDE-NativeExtension]
    Type=QString
    Value=.kwd
KDE does not have a separate system for specifying extension matches, but uses case-sensitive glob patterns for everything.
A single file stores all the rules for recognising files by content. This
is almost identical to file(1)'s
database file, but without the encoding field.
The format is described in the file itself as follows:
    # The format is 4-5 columns:
    #    Column #1: byte number to begin checking from, ">" indicates continuation
    #    Column #2: type of data to match
    #    Column #3: contents of data to match
    #    Column #4: MIME type of result
GNOME uses the gnome-vfs library to determine the MIME type of a file.
This library loads name-to-type rules from files with a '.mime' extension
in a system-wide directory (set at install time), and merges them with those in the
user's directory. It loads textual descriptions for the types from
files in the same directories, ending with '.keys'. The file
gnome-vfs.mime in the system directory is always loaded
first (allowing everything else to override it). The file
user.mime in the user's directory is always loaded
last, making these settings take precedence over all others.
The format of the .mime files is described as follows:
    # Mime types as provided by the GNOME libraries for GNOME.
    #
    # Applications can provide more mime types by installing other
    # .mime files in the PREFIX/share/mime-info directory.
    #
    # The format of this file is:
    #
    # mime-type
    #       ext[,prio]:   list of extensions for this mime-type
    #       regex[,prio]: a regular expression that matches the filename
    #
    # more than one ext: and regex: fields can be present.
    #
    # prio is the priority for the match, the default is 1. This is required
    # to distinguish composed filenames, for example .gz has a priority of 1
    # and .tar.gz has a priority of 2 (thus a file having the filename
    # something.tar.gz will match the mime-type for tar.gz before the mime-type
    # for .gz
    #
    # The values in this file are kept in alphabetical order for convenience.
    # Please maintain this when adding new types. Also consider adding a
    # human-readable description to gnome-vfs.keys when adding a new type here.
    #
    # Also do please not add illegal mime types, observe the mime standard when
    # adding new types.
When looking up the type for a file, gnome-vfs looks first for an exact-case
match for the extension, then an all upper-case match, then an all lower-case
match. If no matches are found, or there is no '.' in the name, then the
regular expression matches are checked. It does this first for rules with
priority 2, then for those with priority 1. The modification time on the
directories is used to detect changes.
The .keys files contain type-to-description rules, eg:
    application/msword
        description=Microsoft Word document
        [de]description=Microsoft Word-Dokument
        ...
Guidelines for writing descriptions can be found in the
The format for magic entries is defined as:
    # The format of magic entries is:
    #
    # offset_start[:offset_end] pattern_type pattern [&pattern_mask] type
    #
    # <offset_start> and <offset_end> are decimal numbers (file offsets).
    #
    # <pattern_type> is (byte | short | long | string | date | beshort |
    # belong | bedate | leshort | lelong | ledate).
    #
    # <pattern> is an ASCII string with non-printable characters escaped
    # as hex or octal escape sequences, and spaces and other important
    # whitespace escaped with '\'.
    #
    # <pattern_mask> is a string of hex digits. The mask must be the same
    # length as the pattern.
    #
    # <type> is a valid MIME type.
    #
    # Order magic patterns such that ambiguous ones (such as
    # application/x-ms-dos-executable) are at the end of the list and
    # therefore get applied last.
    #
    # Avoid rules that require a seek deep into the examined file. If you
    # must, locate such rules at the end of the list so that they get
    # applied last
    #
    # When designing new document formats, make them easily recognizable
    # by defining a sufficiently unique magic pattern near the document
    # start. A good pattern is at least four bytes long and contains one
    # or two non-printable characters so that text files won't be
    # misidentified.
MIME-info directories in
default). Files from earlier directories override those in later ones, but
the order within a directory is not specified.
The files are in the same format as GNOME, except:
There are no .keys files, so files of all extensions are loaded.
The priority is ignored.
A case-sensitive match is tried first, then a lower-case match. No upper-case match is tried.
Multiple extensions are allowed. Eg:
    application/x-compressed-postscript
        ext: ps.gz eps.gz
When looking up the type for a file, ROX starts with the first '.' and tries a case-sensitive match of the remaining text against the extensions. Then it tries again with the filename in lower-case. It then tries again from the second '.', and so on. If no type is found, it tries the regular expressions.
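The lookup order just described can be sketched as follows. This is only an illustration of the description above, not ROX's actual code; the function name rox_lookup and the shapes of its arguments (a dict of extensions and a list of compiled regular expressions) are assumptions made for the example.

```python
import re

def rox_lookup(filename, extensions, regexes=()):
    """Return a MIME type for filename, or None.

    extensions: dict mapping extension text (no leading dot) to a type.
    regexes: iterable of (compiled pattern, type) pairs, tried last.
    Both argument shapes are assumptions made for this sketch.
    """
    pos = filename.find(".")
    while pos != -1:
        rest = filename[pos + 1:]
        # Case-sensitive attempt first, then the lower-cased text.
        for candidate in (rest, rest.lower()):
            if candidate in extensions:
                return extensions[candidate]
        # No luck: move on to the text after the next '.'.
        pos = filename.find(".", pos + 1)
    # Regular expressions are only consulted when no extension matched.
    for pattern, mime_type in regexes:
        if pattern.search(filename):
            return mime_type
    return None
```

With an extension table containing both 'ps.gz' and 'gz', a file called paper.ps.gz is matched on 'ps.gz' because the text after the first '.' is tried before the text after the second.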
ROX has no rules for determining a file's type from its contents.
In discussions about these systems, it was clear that the differences between the databases were simply a result of them being separate, and not due to any fundamental disagreements between developers. Everyone is keen to see them merged.
This spec proposes:
A standard format for these files.
Standard locations for them.
The new format is very similar to that described in the Desktop Entries Specification[DesktopEntries]. However, only the tags used in this example are valid:
    [MIME-Info text/html]
    Comment=HTML document
    Comment[af]=...
    [... etc. other translations ]
    Patterns=*.htm;*.html
    Contents=50:(string 0:64 "<HTML")
    Hidden=false
    PreferredExtension=html
All KDE-specific tags have been removed, as well as the Icon field. Although all desktops need a way to determine the icon for a particular type, the icon used will depend on the desktop, and not only on the file type. The Encoding tag is not present; all .mimeinfo files are in the UTF-8 encoding.
The type should be a standard MIME type where possible. If a special media type is required for non-file objects (directories, pipes, etc), then the media type 'inode' may be used.
The entries in Patterns are separated by semicolons. There is no trailing semicolon. PreferredExtension is the suggested extension to use when creating files of this type.
Although not part of the name-to-type mapping, the Comment field is left in for the sake of not having too many files.
The Hidden field is usually not present. It is used to indicate that this entry replaces all information for this MIME type read so far, instead of being merged with other records for the same type. The intent is to let users entirely replace existing types.
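As an illustration of the format, here is a minimal sketch of a reader for these records. The function name parse_mimeinfo is hypothetical, and skipping blank lines and lines starting with '#' is an assumption borrowed from the Desktop Entry conventions rather than something this spec states. Anything the sketch does not recognise is ignored.

```python
def parse_mimeinfo(text):
    """Parse .mimeinfo text into a list of (mime_type, fields) pairs.

    A sketch only: comment handling and the treatment of unknown lines
    are assumptions, not requirements of the spec.
    """
    records = []
    fields = None
    prefix = "[MIME-Info "
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix) and line.endswith("]"):
            # A new group header starts a new record.
            fields = {}
            records.append((line[len(prefix):-1].strip(), fields))
        elif fields is not None and "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    for _, fields in records:
        if "Patterns" in fields:
            # Entries are separated by semicolons, with no trailing one.
            fields["Patterns"] = fields["Patterns"].split(";")
        # Hidden is usually absent, which is treated as false here.
        fields["Hidden"] = fields.get("Hidden", "false") == "true"
    return records
```

Translated comments end up under keys such as 'Comment[af]', which a real implementation would probably split into a language and a text.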
Unlike the KDE system, the files are not arranged in the filesystem by type.
This approach is only possible for a tightly coordinated system. Consider,
for example, that ROX-Filer adds a mapping from
.DirIcon to 'image/png'. This cannot be specified in
a file called
image/png.desktop without conflicting
with existing definitions for the type.
Since files are not named by type, each file may contain multiple types. The files should instead be named by the package that they come from to avoid conflicts and reduce loading times.
The directories to be used to load these files are:
Each of these directories contains a number of files with the '.mimeinfo' extension. Applications MUST NOT try to load other files. This is to allow for future extensions.
Programs modifying any of these files MUST update the modification time on
the parent (
mime-info) directory so that applications can
easily detect the change. The rules from directories later in this list take
precedence over conflicting rules from earlier directories. If a directory
contains a file called
user.mimeinfo then it should be
read after all other files in that directory. This is to allow the user's
settings to take precedence over all others. GUI tools for editing the MIME
types will edit
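The per-directory read order can be sketched like this. The function name mimeinfo_files is hypothetical, and sorting the remaining files alphabetically is an assumption made only to keep the result deterministic; the spec does not define an order for them.

```python
import os

def mimeinfo_files(directory):
    """List the .mimeinfo files in directory in the order they are read."""
    try:
        names = os.listdir(directory)
    except OSError:
        return []  # a missing directory simply contributes nothing
    # Only '.mimeinfo' files may be loaded; everything else is skipped.
    names = sorted(n for n in names if n.endswith(".mimeinfo"))
    # user.mimeinfo is read last so that the user's settings win.
    if "user.mimeinfo" in names:
        names.remove("user.mimeinfo")
        names.append("user.mimeinfo")
    return [os.path.join(directory, n) for n in names]
```

An application could remember os.stat(directory).st_mtime after loading and reload when that value changes, which is why writers must touch the directory.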
KDE's Patterns field replaces GNOME's and ROX's ext/regex fields, since it is trivial to detect a pattern in the form '*.ext' and store it in an extension hash table internally. The full power of regular expressions was not being used by either desktop, and glob patterns are more suitable for filename matching anyway.
Applications MUST first try a case-sensitive match, then a case-insensitive
one. This is so that
main.C will be seen as a C++ file, but
IMAGE.GIF will still use the *.gif pattern.
If several patterns match then the longest pattern SHOULD be used. In
particular, files with multiple extensions (such as
Data.tar.gz) MUST match the longest sequence of extensions
(eg '*.tar.gz' in preference to '*.gz'). Literal patterns (eg, 'Makefile') must
be matched before all others. It is acceptable to match patterns of the form
'*.text' before other wildcarded patterns (that is, to special-case extensions
using a hash table).
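One possible reading of these matching rules is sketched below. The function names and the dict-of-patterns argument are assumptions for the example, and the MIME type names used with it are illustrative only. The sketch runs a complete case-sensitive pass before the case-insensitive one, which is one interpretation of the rules and not the only one the text allows.

```python
from fnmatch import fnmatchcase

def is_literal(pattern):
    """True if the pattern contains no glob metacharacters."""
    return not any(c in pattern for c in "*?[")

def match_name(filename, patterns):
    """patterns: dict mapping glob pattern -> MIME type. Returns a type or None."""
    # Pass 1 is case-sensitive; pass 2 lower-cases both name and pattern.
    for fold in (False, True):
        name = filename.lower() if fold else filename
        best = None  # (pattern length, type) of the longest wildcard match
        for pattern, mime_type in patterns.items():
            pat = pattern.lower() if fold else pattern
            if is_literal(pat):
                if pat == name:
                    return mime_type  # literal patterns beat all others
            elif fnmatchcase(name, pat):
                if best is None or len(pat) > best[0]:
                    best = (len(pat), mime_type)
        if best is not None:
            return best[1]
    return None
```

A production implementation would probably keep '*.ext' patterns in a hash table, as suggested above, and only fall back to this linear scan for the remaining patterns.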
If the same pattern is defined twice, then they MUST be ordered by the
directory the rule came from (this is to allow users to override the system
defaults if, for example, they are using a common extension to mean something
else). Patterns in
~/.mime/mime-info override those in
/usr/local/share/mime/mime-info, which in turn take
precedence over those from
If a pattern is defined twice within the same directory, either can be used.
If the same type is defined in several places, the Patterns and Comments MUST be merged. If two different comments are provided for the same MIME type in the same language, they should be ordered by directory as before.
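A minimal sketch of the merge, assuming records are fed in order of increasing precedence (system directories first, the user's last). The function name merge_record and the in-memory layout are hypothetical. The Hidden handling follows the description of that field earlier in this document.

```python
def merge_record(database, mime_type, patterns, comments, hidden=False):
    """Fold one record into database, a dict mapping type -> entry.

    patterns: list of glob patterns; comments: dict of language -> text.
    Call this in increasing order of precedence.
    """
    if hidden or mime_type not in database:
        # Hidden discards everything read so far for this type.
        database[mime_type] = {"patterns": [], "comments": {}}
    entry = database[mime_type]
    for pattern in patterns:
        if pattern not in entry["patterns"]:
            entry["patterns"].append(pattern)
    # A later record replaces an earlier comment in the same language.
    entry["comments"].update(comments)
    return entry
```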
Common types (such as MS Word Documents) will be provided in the X Desktop Group's package, which SHOULD be required by all applications using this specification. Since each application will then only be providing information about its own types, conflicts should be rare.
The value of the Contents attribute contains a priority and an expression. If several expressions match for one file, the one with the highest priority is used. As a guide, priorities should be between 1 and 100, with 50 being the normal case. Generic types (such as XML or GZip-compressed files) should have lower priorities.
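For illustration, the priority can be separated from the expression and used to pick a winner as sketched here. Both function names are hypothetical, and how ties between equal priorities are broken is left open by the text; this sketch simply keeps the first one seen.

```python
def split_contents(value):
    """Split a Contents value such as '50:(string 0 "x")' into its parts."""
    priority, _, expression = value.partition(":")
    return int(priority), expression.strip()

def best_type(matches):
    """matches: iterable of (priority, mime_type) for expressions that matched."""
    matches = list(matches)
    if not matches:
        return None
    # max() returns the first item with the highest priority.
    return max(matches, key=lambda m: m[0])[1]
```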
Since scanning a file's contents can be very slow, applications may choose to do pattern matching first and only fall back to content matching when no pattern matches, or not to perform content matching at all.
The basic building blocks of expressions are bracketed lists containing a type, an offset (or range of offsets), the data to match and, optionally, a mask. For example:
    (string 0 "%PDF-")
    (string 0 "\177ELF")
    (string 0:64 "<svg")
    (string 0 "BMxxxx\000\000" 0xffff00000000ffff)
The first element of the list is the type of the data (see the table below), the second is the range of offsets to check, the third is the value to match and the last, if present, is the mask.
Integers have the usual C-style prefixes (0 for octal numbers, 0x for hexadecimal).
Strings have C-style escaping. This string contains the sequence of bytes
<0, 8, 9, 10>:
A range gives the range of valid starting offsets. If the end of the range is omitted then it is assumed to be the same as the start (that is, the match is only checked at one point in the file).
The possible types of match are listed below:
| Type | Meaning |
| string | String of bytes |
| big16 | 16-bit big-endian integer |
| big32 | 32-bit big-endian integer |
| little16 | 16-bit little-endian integer |
| little32 | 32-bit little-endian integer |
| host16 | 16-bit integer in host-order |
| host32 | 32-bit integer in host-order |
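A rough sketch of how one basic expression might be tested against a file's bytes, covering the types listed above. The function name basic_match and its argument layout are assumptions. So is the mask handling: for strings the mask is treated as a big-endian run of bytes the same length as the string, and for integers it is applied to the decoded value. The text does not spell out either detail.

```python
import struct

# struct format for each integer type; '=' means host byte order.
_FORMATS = {
    "big16": ">H", "big32": ">I",
    "little16": "<H", "little32": "<I",
    "host16": "=H", "host32": "=I",
}

def basic_match(data, kind, start, end, value, mask=None):
    """Test one expression against data (bytes read from the file).

    start and end are the first and last valid starting offsets, inclusive.
    value is bytes for 'string' and an int for the integer types.
    """
    if kind == "string":
        size = len(value)
        mask_bytes = b"\xff" * size if mask is None else mask.to_bytes(size, "big")
        want = bytes(v & m for v, m in zip(value, mask_bytes))
    else:
        fmt = _FORMATS[kind]
        size = struct.calcsize(fmt)
        int_mask = (1 << (8 * size)) - 1 if mask is None else mask
    for offset in range(start, end + 1):
        chunk = data[offset:offset + size]
        if len(chunk) < size:
            break  # ran off the end of the data
        if kind == "string":
            if bytes(c & m for c, m in zip(chunk, mask_bytes)) == want:
                return True
        else:
            (found,) = struct.unpack(fmt, chunk)
            if (found & int_mask) == (value & int_mask):
                return True
    return False
```

For example, basic_match(data, "string", 0, 64, b"<svg") checks every starting offset from 0 to 64, so an implementation only needs to read the first 68 bytes of the file for that rule.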
These basic expressions may be combined using the 'and' and
'or' syntax, eg:
    (and (string 0 "\037\213")
         (string 10 "KOffice")
         (string 18 "application/x-kchart\004\006"))
The 'and' keyword corresponds to a more-deeply indented continuation
line in the original file(1) syntax, while 'or' corresponds
to elements at the same indentation. They may be nested in the obvious (scheme-like) way.
Since many formats have sub-formats (for example, KOffice stores its files in
GZip format, with a generic KOffice marker and a specific application marker),
it may be a useful optimisation to spot the same subexpression (eg
(string 10 "KOffice")) being used in several types and
only check it once.
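The optimisation can be sketched by evaluating expressions through a shared cache. In this illustration expressions are assumed to be already parsed into nested tuples such as ("and", e1, e2) or ("string", start, end, value); that representation, the function name evaluate, and the restriction to the string type are all simplifications made for the example.

```python
def evaluate(expr, data, cache):
    """Evaluate a nested expression against data, caching each result.

    cache is a dict shared across all types checked for the same file,
    so a subexpression used by several types is only tested once.
    """
    if expr in cache:
        return cache[expr]
    head = expr[0]
    if head == "and":
        result = all(evaluate(e, data, cache) for e in expr[1:])
    elif head == "or":
        result = any(evaluate(e, data, cache) for e in expr[1:])
    else:
        # Basic expression; only the string type is handled in this sketch.
        _, start, end, value = expr
        result = any(data.startswith(value, off) for off in range(start, end + 1))
    cache[expr] = result
    return result
```

The cache must be discarded when moving on to another file, since its entries describe the contents of one particular file.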
The system described in this document is intended to allow different programs to see the same file as having the same type. This is to help interoperability. The type determined in this way is only a guess, and an application MUST NOT trust a file based simply on its MIME type. For example, a downloader should not pass a file directly to a launcher application without confirmation simply because the type looks 'harmless' (eg, text/plain).
Do not rely on two applications getting the same type for the same file, even if they both use this system. The spec allows some leeway in implementation, and in any case the programs may be following different versions of the spec.
[GNOME] The GNOME desktop, http://www.gnome.org
[KDE] The KDE desktop, http://www.kde.org
[ROX] The ROX desktop, http://rox.sourceforge.net
[DesktopEntries] Desktop Entry Specification, http://www.freedesktop.org/standards/desktop-entry-spec.html