The Clean Theory



Old rules…

A popular proverb goes like this:

When one adds a pint of clean water to a barrel of sewer water one gets a barrel of sewer water, but when one adds a pint of sewer water to a barrel of clean water one gets… well… a new barrel of sewer water.

If the clean water is regarded as a logically true statement and the sewer water as a logically false one, the proverb expresses a long-known principle of logic:

Adding a true statement (clean water pint) to several chained false ones (sewer water barrel) with an AND operator (‘&’) results in an overall false statement. A similar situation occurs when adding a false statement to several chained true ones, resulting again in an overall false statement.

The proverb and the logic principle are useful for Symantec’s Security Response Engineers (SREs). One of the SRE’s main duties is to determine whether a given file may pose a threat to the environment where it would be deployed and to take the necessary steps to prevent any such threat from materializing. To do this, the SRE needs to look within the file for specific sequences of code or commands that may perform unwanted or malicious actions in the deployment environment.

In a similar manner to the proverb and the logic principle, such a potentially malicious file subjected to analysis may be expressed as a (long) logical sequence similar to this one:

S = P1 & P2 & P3 & … & Pi & … & Pn

In this case, each Pi represents a fundamental block in the file (performing one atomic action), or a statement in the logic parallel. Note that in reality the above expression may contain other logical operators, such as 'IF/THEN'. Nonetheless, to evaluate the whole file, one must evaluate each individual Pi.

We refer to file blocks in general terms because file types vary so widely, and blocks have different representations in different file types. For example, a file block in a script file is an atomic command executed by the script interpreter in one step, while a file block in a native executable may be regarded as a basic code block: a straight-line sequence of instructions with a single entry point (its first instruction) and a single exit point (its last).

Since a single false statement in the chain makes the overall statement false, as soon as one of the file blocks is deemed to pose a threat, the entire file is considered a threat. At that point analysis stops and a detection signature is added for the file.
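This AND-chain evaluation, including the early stop at the first bad block, can be sketched in a few lines of Python. The block contents and the cleanliness check below are illustrative stand-ins, not an actual scanner:

```python
def is_file_clean(blocks, block_is_clean):
    """A file is clean only if every one of its blocks is clean.

    all() short-circuits: evaluation stops at the first malicious
    block, mirroring how analysis can stop as soon as a single
    block is deemed a threat.
    """
    return all(block_is_clean(b) for b in blocks)

# Made-up "blocks": strings standing in for atomic file actions.
suspicious = {"delete all user files"}
clean_file = ["read config", "draw window", "exit"]
trojan = ["read config", "delete all user files", "exit"]

block_is_clean = lambda b: b not in suspicious
print(is_file_clean(clean_file, block_is_clean))  # True
print(is_file_clean(trojan, block_is_clean))      # False
```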

In cases where the whole file was created with malicious intent, as with most Trojan horses, the threat can be easily spotted: elements of the threat are found throughout the file, and obfuscation or polymorphism provide good indications that something is wrong. Currently, about three out of four files (75 percent) that Symantec receives for analysis are deemed threats and receive a detection signature.

To give a sense of the magnitude of the task, consider a simple application such as Notepad, which has roughly 1,500 file blocks. If an attacker inserted a few malicious blocks at a random location, they would be very difficult to spot among the 1,500 clean ones. It's like looking for a needle in a haystack!

When detailed information is required to document the actions of a threat, deep analysis must be performed on the whole file. To perform this deep analysis, the SRE examines almost all of the file's code blocks, whether good or malicious, in order to fill in all the pieces of the puzzle. For example, in the case of Stuxnet (one of the most complex threats ever seen), it took a team of three Senior Security Response Engineers more than four months to work through its roughly 12,000 blocks of code.

SREs can accelerate the processing of such vast amounts of information by automating the identification of known clean library code that is re-used in many binaries, or by finding the original clean file and comparing it against possible threats for differences. However, a large number of blocks will always need to be inspected manually. A rule of thumb states that the effort and time needed to make a security determination are directly proportional to the amount of information contained in a file; in other words, the analysis work scales roughly with file size.

Faced with a task of such magnitude, the SRE can also turn to other specialized tools before diving into deep analysis. Behavior examiners, for instance, base the determination on the actions the file performs when deployed in a controlled environment. But that's another story.


It could be argued that something like the command prompt (cmd.exe), or an equivalent, has at least one block of code that deletes multiple files, and can do so without interaction. On its own this behavior is regarded as malicious: if found by itself in a standalone application (say, an executable that starts deleting every file it finds on the system after execution), the file would be considered a Trojan horse. However, cmd.exe and the like are in fact clean files, so how does that work?

The parallel between true/false statements and clean/malicious files holds in this case too. The destructive code only triggers when a specific parameter is given to cmd.exe, and the interaction is suppressed only by another, separate parameter.

Basically, cmd.exe performs like this:

If the delete command is present then delete the specified files. If the silent parameter is present then suppress prompting.

Each of the two statements can be expressed as follows:

S = if P then Q

This means that when P is false, S is true, or clean: when the delete command is not present on the command line (P = false), no files will be deleted by cmd.exe, which means a clean run. It also means that if the silent parameter is omitted, there will be a prompt for each command.
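The behavior of S = if P then Q can be checked mechanically as material implication. A minimal sketch in Python (the function is a generic truth-table helper, not actual cmd.exe parsing):

```python
def implies(p, q):
    # Material implication: "if P then Q" is false only when
    # P is true and Q is false.
    return (not p) or q

# When the delete command is absent (P = False), the run is
# clean (S = True) regardless of Q:
print(implies(False, False))  # True
print(implies(False, True))   # True

# When P is true, S simply takes the value of Q:
print(implies(True, True))    # True
print(implies(True, False))   # False
```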

When P is true, Q is evaluated and, if Q is true, S is also true. In the same manner, cmd.exe, when told to delete files, acts as instructed; it may thus become part of a larger scheme, but on its own it is clean. It's similar to a knife, which can be used in the kitchen or, alternatively, for criminal activities. Accordingly, SREs investigate the purpose behind commanding files to perform actions (legitimate or malicious) in order to find out what the intent is.

Many modern threats and attacks use several modules that interact with each other. While most of the time the modules are created specifically with malicious intent, which makes them easier to identify as malicious, some of the modules employed in certain threats are, on their own, legitimate tools.

One such case is NetCat, a legitimate command line tool used by network administrators for advanced network connections. On its own NetCat is a clean tool, but it is also well suited to hacking attacks, where it is mostly used to open a back door connection. Due to its widespread malicious use, Symantec categorizes NetCat as both a security threat and a security assessment tool, giving the user the option to ignore its detection when it is used legitimately.


Given the lengthy process of analyzing each individual file and the increasing number of files to be analyzed every day, determining file origin and trustworthiness has become an important factor in the process.

For instance, legitimate companies generally produce high-quality content that fits a pattern of quality control; integrity data, including digital signatures and version information, is always present. Such information is used to trace the file back to its creator and indicates whether the file is trustworthy.

Files produced by known legitimate companies may, within certain limits of certitude, be assumed clean without going through the whole analysis process (unless there is good reason to do so, such as an observed side effect or a suspicious action performed by the file).

Trust can also be applied when flagging files. Most clean files tend to be easy to analyze, while eighty-three percent of the threat files observed today use at least one packer on top of the actual payload. If a file pretends to come from a trustworthy source and also shows signs of obfuscation, a mismatch in its digital signature, or custom packing, there is a greater than ninety-five percent chance that it poses a security threat, since a legitimate file from a trustworthy source will not normally employ obfuscation techniques or custom packers.
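As a rough sketch, the combination of a claimed trustworthy origin with red flags can be expressed as a simple rule. The signals and the decision logic below are assumptions for illustration only, not Symantec's actual triage criteria:

```python
def looks_suspicious(claims_trusted_source, obfuscated,
                     signature_mismatch, custom_packed):
    """Flag files that claim a trustworthy origin yet show signs
    that a genuine file from that origin would not normally have.
    (Illustrative heuristic; real triage uses many more signals.)
    """
    red_flags = obfuscated or signature_mismatch or custom_packed
    return claims_trusted_source and red_flags

# A file posing as a signed vendor binary, but custom-packed:
print(looks_suspicious(True, False, False, True))   # True
# A vendor binary with no red flags raises no alarm here:
print(looks_suspicious(True, False, False, False))  # False
```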

Next steps

As logic states, truth can only imply truth. In a similar manner, a clean file must be clean on all levels: it must originate from a known and reputable entity, must serve a well-defined legitimate purpose, and must be constructed only of clean blocks. These principles are not just useful for file determinations; they can also be applied to other areas.

Given the relatively slow analysis techniques currently in use and the daily increase in the number of files to be processed, security vendors need to find new ways to discover threats that reduce the reliance on one-by-one, in-depth file analysis. The tendency is to place more and more emphasis on determining trust and intent.

Mircea Ciubotariu. The Clean Theory, August 2013, Virus Bulletin. Copyright is held by Virus Bulletin Ltd, but is made available on this site for personal use free of charge by permission of Virus Bulletin.
