112 lines
5.5 KiB
Text
112 lines
5.5 KiB
Text
A Fast Method of Identifying Plain Text Files
|
|
=============================================
|
|
|
|
|
|
Introduction
|
|
------------
|
|
|
|
Given a file coming from an unknown source, it is generally impossible
|
|
to conclude automatically, and with 100% accuracy, whether that file is
|
|
a plain text file, without performing a heavy-duty semantic analysis on
|
|
the file contents. It is, however, possible to obtain a fairly high
|
|
degree of accuracy, by employing various simple heuristics.
|
|
|
|
Previous versions of the zip tools were using a crude detection scheme,
|
|
originally used by PKWare in its PKZip programs: if more than 80% (4/5)
|
|
of the bytes are within the range [7..127], the file is labeled as plain
|
|
text, otherwise it is labeled as binary. A prominent limitation of this
|
|
scheme is the restriction to Latin-based alphabets. Other alphabets,
|
|
like Greek, Cyrillic or Asian, make extensive use of the bytes within
|
|
the range [128..255], and texts using these alphabets are most often
|
|
mis-identified by this scheme; in other words, the rate of false
|
|
negatives is sometimes too high, which means that the recall is low.
|
|
Another weakness of this scheme is a reduced precision, due to the false
|
|
positives that may occur when binary files containing a large amount of
|
|
textual characters are mis-identified as plain text.
|
|
|
|
In this article we propose a new detection scheme, with a much increased
|
|
accuracy and precision, and a near-100% recall. This scheme is designed
|
|
to work on ASCII and ASCII-derived alphabets, and it handles single-byte
|
|
alphabets (ISO-8859, OEM, KOI-8, etc.), and variable-sized alphabets
|
|
(DBCS, UTF-8, etc.). However, it cannot handle fixed-sized, multi-byte
|
|
alphabets (UCS-2, UCS-4), nor UTF-16. The principle used by this scheme
|
|
can easily be adapted to non-ASCII alphabets like EBCDIC.
|
|
|
|
|
|
The Algorithm
|
|
-------------
|
|
|
|
The algorithm works by dividing the set of bytes [0..255] into three
|
|
categories:
|
|
- The white list of textual bytecodes:
|
|
9 (TAB), 10 (LF), 13 (CR), 20 (SPACE) to 255
|
|
- The gray list of tolerated bytecodes:
|
|
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC)
|
|
- The black list of undesired, non-textual bytecodes:
|
|
0 (NUL) to 6, 14 to 31.
|
|
|
|
If a file contains at least one byte that belongs to the white list, and
|
|
no byte that belongs to the black list, then the file is categorized as
|
|
plain text. Otherwise, it is categorized as binary.
|
|
|
|
|
|
Rationale
|
|
---------
|
|
|
|
The idea behind this algorithm relies on two observations.
|
|
|
|
The first observation is that, although the full range of 7-bit codes
|
|
(0..127) is properly specified by the ASCII standard, most control
|
|
characters in the range 0..31 are not used in practice. The only
|
|
widely-used, almost universally-portable control codes are 9 (TAB),
|
|
10 (LF), and 13 (CR). There are a few more control codes that are
|
|
recognized on a reduced range of platforms and text viewers/editors:
|
|
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), and 27 (ESC); but these
|
|
codes are rarely (if ever) used alone, without being accompanied by
|
|
some printable text. Even the newer, portable text formats, such as
|
|
XML, avoid using control characters outside the list mentioned here.
|
|
|
|
The second observation is that most of the binary files tend to contain
|
|
control characters, especially 0 (NUL); even though the older text
|
|
detection schemes observe the presence of non-ASCII codes from the range
|
|
[128..255], the precision rarely has to suffer if this upper range is
|
|
labeled as textual, because the files that are genuinely binary tend to
|
|
contain both control characters, and codes from the upper range. On the
|
|
other hand, the upper range needs to be labeled as textual, because it
|
|
is being used by virtually all ASCII extensions. In particular, this
|
|
range is being heavily used to encode non-Latin scripts.
|
|
|
|
Given the two observations, the plain text detection algorithm becomes
|
|
straightforward. There must be at least some printable material, or
|
|
some portable whitespace such as TAB, CR or LF, otherwise the file is
|
|
not labeled as plain text. (The boundary case, when the file is empty,
|
|
automatically falls into this category.) However, there must be no
|
|
non-portable control characters, otherwise it's very likely that the
|
|
intended reader of that file is a machine, rather than a human.
|
|
|
|
Since there is no counting involved, other than simply observing the
|
|
presence or the absence of some byte values, the algorithm produces
|
|
uniform results on any particular text file, no matter what alphabet
|
|
encoding is being used for that text. (In contrast, if counting were
|
|
involved, it could be possible to obtain different results on a text
|
|
encoded, say, using ISO-8859-2 versus UTF-8.) There is the category
|
|
of plain text files that are "polluted" with one or a few black-listed
|
|
codes, either by mistake, or by peculiar design considerations. In such
|
|
cases, a scheme that tolerates a small percentage of black-listed codes
|
|
would provide an increased recall (i.e. more true positives). This,
|
|
however, incurs a reduced precision, since false positives are also more
|
|
likely to appear in binary files that contain large chunks of textual
|
|
data. "Polluted" plain text may, in fact, be regarded as binary, on
|
|
which text conversions should not be performed. Under this premise, it
|
|
is safe to say that the detection method provides a near-100% recall.
|
|
|
|
Experiments have been run on a large set of files of various categories,
|
|
including plain old texts, system logs, source code, formatted office
|
|
documents, compiled object code, etcetera. The results confirm the
|
|
optimistic assumptions about the high accuracy, precision and recall
|
|
offered by this algorithm.
|
|
|
|
|
|
--
|
|
Cosmin Truta
|
|
Last updated: 2005-Feb-27
|