PhD Abstract

Accessing web sites on the Internet has become an integral part of people’s lives, exposing them to malicious software designed to steal personal information, which allows criminals to commit identity theft or gain access to bank accounts. More recently, a common attack has been to deny access to information until a ransom is paid.

Security software can be installed on computers to mitigate some of these risks. It intercepts web communications and compares what has been received against fingerprints of known malware. If a match occurs, the user can be warned or the potentially malicious element can be quarantined, protecting the user from infection.
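
As a minimal sketch of fingerprint matching, the example below assumes that a fingerprint is simply the SHA-256 digest of a received payload and that the vendor database is a set of such digests. The names `KNOWN_MALWARE_DIGESTS`, `fingerprint` and `is_known_malware` are hypothetical, and real products are likely to use richer signatures than exact hashes.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious payloads.
KNOWN_MALWARE_DIGESTS = {
    hashlib.sha256(b"<script>stealCookies()</script>").hexdigest(),
}


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 digest of a payload (assumed fingerprint scheme)."""
    return hashlib.sha256(payload).hexdigest()


def is_known_malware(payload: bytes, known=KNOWN_MALWARE_DIGESTS) -> bool:
    """Report whether the payload's fingerprint is in the database."""
    return fingerprint(payload) in known
```

Under this assumption, inserting a single space into the payload produces a different digest, so an exact-match database no longer recognises it.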

This approach relies on malicious software being collected and analysed before security vendors can update their databases. Once malware authors become aware that their code is being detected, they can evolve it to avoid detection.

An alternative strategy would be to allow access only to benign applications. However, this would also require a fingerprint database to be maintained, and malware authors would simply evolve their code to blend in and avoid detection.

A web application is made up of pages that perform specific tasks, and each page has headers, function calls, methods and objects that are specific to its function. An application fingerprint can be created by profiling attributes that are unlikely to be affected by updates. This approach is practical because, in most cases, only one profile is required for common application components.

Malicious code is also likely to have a specific nature, such as stealing or deleting information, which would be difficult for a malware author to hide. Furthermore, if existing code is modified to include malicious functionality, this may change its nature, allowing it to be detected.

Three aspects of web communications were identified for analysis from the review of the literature: packet length and timing, HTTP headers and the HTML body, and embedded scripting code.

Packet length and timing have been discounted because web server responses are likely to be fragmented to a common packet size, implying that they will not carry any useful information.

HTTP headers are used to exchange information between the web server and the browser, and carry application-specific information. Initial experimentation suggests that headers are not aligned with the nature of the application or its specific components. Furthermore, most headers can be manipulated without affecting the application, implying that they would be easy to forge.

The current focus is to collect information from the HTML body and embedded scripting languages, specifically JavaScript due to its popularity. Tokens and abstract syntax trees will be extracted, and mutual information, an information-theoretic measure widely used in natural language processing, will be calculated to show the strength of the relationships between them. Once candidate elements have been identified, they will be tested to understand how they contribute to the fingerprint of the application's nature.
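
To illustrate the mutual information calculation, the sketch below estimates the mutual information (in bits) between the presence of two tokens across a small set of scripts. It rests on several assumptions that are not taken from the research itself: the regular expression is a crude stand-in for a real JavaScript lexer, each script is reduced to the set of identifiers it contains, and the names `tokenize` and `mutual_information` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a crude identifier pattern stands in for a real JavaScript lexer.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def tokenize(script):
    """Return the set of identifier-like tokens found in a script."""
    return set(TOKEN_RE.findall(script))


def mutual_information(scripts, a, b):
    """Mutual information in bits between the presence of tokens a and b."""
    token_sets = [tokenize(s) for s in scripts]
    n = len(token_sets)
    # Joint counts over (a present?, b present?) for every script.
    joint = Counter((a in t, b in t) for t in token_sets)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Marginal probabilities recovered from the joint counts.
        p_x = sum(c for (i, _), c in joint.items() if i == x) / n
        p_y = sum(c for (_, j), c in joint.items() if j == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi
```

A value near zero suggests that the two tokens occur independently of each other, while larger values indicate that the presence of one token carries information about the presence of the other.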