The need to access web sites on the Internet has become an essential part of daily life for individuals and organisations. This increasing use exposes users to exploitation through the propagation of malware, botnet communications and more, ultimately increasing the risk of the exfiltration of stolen information such as identities or confidential data, e.g. banking details. A recent development is malware that ransoms personal data by moving it to cloud storage or encrypting it, and only returns access once a fee has been paid.

The typical defences against these threats consist of security software installed on computers, servers and appliances hosted at the network edge, which use anti-malware scanners, filters, website reputation engines and more to reduce the risks to users.

Files and network communications are scanned for matches against the fingerprints of previously identified malware; when a match is found, communications can be blocked, files quarantined, and alerts raised. To be effective, fingerprint databases need to be updated constantly as new malware is identified, making it difficult for vendors to keep up with threats. In essence, a game of cat and mouse exists between malware authors and the vendors of security software; consequently, this is not necessarily the most effective approach to protection against this type of threat, as it is often circumvented.

Alternative strategies for the detection of malware need to be considered: fingerprint databases cannot grow indefinitely and, in the current model, malware goes undetected until the databases have been updated. Instead of blacklisting, whitelists of benign applications could be created, but there are too many applications to fingerprint practically, and nothing stops a malware author forging a fingerprint to avoid detection once they understand how the fingerprint is generated.

This research aims to address these problems. The suggested approach focuses on statistical methodologies for detection, minimising the reliance on static fingerprints and their intrinsic problems, i.e. constant database updating and inaccuracy. Instead, the approach profiles the nature of HTTP communications using techniques from information theory.
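As a minimal illustration of the kind of information-theoretic measure that could feed such a profile, the sketch below computes the Shannon entropy of the bytes in an HTTP message fragment. This is not the method proposed by this research, only one candidate feature: the function name and the sample inputs are hypothetical, and a real profiler would combine many such features.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the observed byte frequencies.
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

# Hypothetical examples: readable HTTP text sits well below the maximum,
# while a uniform byte distribution (as in encrypted or packed payloads)
# reaches the 8-bit ceiling.
request_line = b"GET /index.html HTTP/1.1"
uniform = bytes(range(256))  # every byte value exactly once

print(round(shannon_entropy(request_line), 2))
print(round(shannon_entropy(uniform), 2))  # 8.0
```

High entropy alone does not indicate malice (legitimate compressed or encrypted traffic is also high-entropy), but measures like this, taken over many aspects of a conversation, could characterise what an application's traffic normally looks like.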

Filters can then be created that limit application conversations, and the data exchanged between end devices and the Internet, to only what is wanted. This could allow malware to be filtered out, because the nature of its communications is likely to be specific and different from that of benign applications.

Thus, my hypothesis is that: 

Statistical techniques, from information theory, can be used to profile the nature of HTTP communications, allowing application filters to be constructed.