1. Introduction
Digital forensic analysis is the process of identifying, preserving, analysing, and presenting digital evidence in a manner that is legally acceptable (McKemmish ). The significant growth in the size of storage media, combined with the popularity of digital devices and the falling price of these devices and storage media, has led to a major issue affecting the timely process of justice: the growing volume of data seized and presented for analysis, which now often amounts to many terabytes for each investigation.
This increase in the digital evidence presented to digital forensic laboratories for analysis has been an issue for many years, leading to lengthy backlogs of work (Justice ). Digital forensic data holdings consist of large amounts of structured and unstructured data that encompass a wide variety of file systems, operating systems, software, and user-created data, across a range of devices and media types.
Existing forensic software solutions have evolved from the first generation of tools and are now beginning to address scalability issues, but a gap remains in relation to analysis of large and disparate forensic datasets. Processing times are increasing along with the amount of data seized for investigation.
There have been many calls for research to focus on the timely analysis of large datasets (Garfinkel ). Digital forensic practitioners, especially those in law enforcement agencies, will continue to be under pressure to deliver more with less, especially in today's economic landscape.
This gives rise to a variety of needs, including:
a capacity to triage evidence prior to conducting full analysis,
a more efficient method of collecting and preserving evidence,
reduced data storage requirements,
an ability to conduct a review of information in a timely manner,
an ability to archive important data,
an ability to quickly retrieve and review archived data, and
a source of data to enable a review of current and historical cases for intelligence and knowledge management purposes.
Many policing agencies have dedicated digital forensic sections to undertake analysis of digital evidence. Within these sections, seized devices are forensically copied, processed, and analysed, and the results are communicated in a format that can be presented and understood in a legal environment. Many agencies are struggling to keep up with the growing volume of data presented for analysis, with increasingly large backlogs of cases (Justice ).
Whilst there are a variety of challenges to digital forensic analysis, including encryption, Internet-of-Things (IoT) devices, cloud storage, and anti-forensics, the growth in the volume of data is a major challenge. This growth is a result of the rapid development of storage technology, including consumer devices and cloud storage. Digital forensic software has evolved from the first generation of tools, but there remains potential to develop innovative methods of analysis that reduce the time a practitioner spends reviewing superfluous data and allow a focus on data with greater potential for evidential relevance.
A variety of research fields have the potential to address the volume data challenge, including Knowledge Discovery, Knowledge Management, Data Mining, and Criminal Intelligence Analysis. Together, knowledge discovery and knowledge management form an overall process of extracting valuable knowledge from data (Cios and Kurgan Literature Review).
Significant gaps remain in relation to applying data mining methodologies to digital forensic data. These gaps concern a methodology that can be applied to real world data, the benefits that may be observed, and the most appropriate methodology to achieve the desired results, namely a reduction in analysis time, a method of archiving and retrieving data, a rapid triage process, and a means of gaining knowledge from the seized data. It is envisioned that applying the concepts of data mining to digital forensic data will lead to a methodology to assist examiners in analysing the vast volumes of seized data.
Evidential data is the focus when establishing proof, often in a court environment, whereas intelligence is information that has been processed in some form into knowledge designed for action (UNODC ). In the digital forensic realm there is another major gap: the limited use of intelligence gained during digital forensic analysis. Digital forensic intelligence is potentially of great benefit to investigative and other agencies. The current focus of investigations is locating evidence for urgent matters, with little or no time to consider other information that may provide valuable input to current or future investigations. Historically there has been very little discussion of a methodology to use the intelligence gained from one digital forensic investigation to assist with other investigations, or to build a body of knowledge from historical digital forensic cases.
The input of open and closed source intelligence is anticipated to improve the analysis phase of a digital forensic investigation. For example, information stored on a phone seized for one investigation may provide information to other, seemingly unrelated investigations. Without an intelligence or knowledge management process, this information or linkage remains undiscovered. Using knowledge management, intelligence analysis, and data mining methodologies, it is envisioned that a large volume of information could be aggregated into common data holdings, with the capability for rapid searches to link associated information and assist in investigations. To enable this, however, we need a way to collect and process the data in a timely manner, which seems out of reach given the current and increasing volume of data on devices.
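The linkage idea described above can be illustrated with a minimal sketch. It is not the methodology developed in this book; it only assumes that identifiers such as phone numbers and email addresses have already been extracted from each case into simple records. The record layout, the case names, the sample values, and the function names (normalise, build_index, cross_case_links) are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

def normalise(kind, value):
    """Reduce an identifier to a comparable form (an illustrative rule only)."""
    if kind == "phone":
        # Keep digits only, so spacing and punctuation do not block a match.
        return "".join(ch for ch in value if ch.isdigit())
    # For other identifiers, ignore surrounding whitespace and letter case.
    return value.strip().lower()

def build_index(records):
    """Map each (kind, normalised value) pair to the set of cases containing it."""
    index = defaultdict(set)
    for case_id, kind, value in records:
        index[(kind, normalise(kind, value))].add(case_id)
    return index

def cross_case_links(index):
    """Return only the identifiers that appear in more than one case."""
    return {key: sorted(cases) for key, cases in index.items() if len(cases) > 1}

# Made-up sample records: (case_id, identifier_kind, raw_value).
RECORDS = [
    ("case-A", "phone", "+61 400 000 001"),
    ("case-A", "email", "alice@example.com"),
    ("case-B", "phone", "+61400000001"),
    ("case-B", "email", "bob@example.com"),
    ("case-C", "email", "Alice@Example.com "),
]

links = cross_case_links(build_index(RECORDS))
for (kind, value), cases in sorted(links.items()):
    print(kind, value, cases)
```

Keeping an index of normalised identifiers means that each newly processed case only needs to be looked up against the existing holdings, rather than compared pairwise with every earlier case. Real holdings would need far more careful normalisation and validation than the simple rules assumed here.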
The aim of this book is to outline a framework for digital forensic practitioners to apply data mining and data reduction techniques to digital forensic data in order to reduce the time to collect data. The focus is on applying data reduction and data mining techniques to a large volume of structured and unstructured data typical of seized evidential data. Data mining and intelligence analysis techniques are applied to test data and real world (anonymised) data to demonstrate an appropriate methodology that can be used in real world situations in an effort to address the digital forensic volume data issue.
The use of technology by criminals and/or victims means that data of relevance to an investigation may be located on a variety of devices, or may be virtualised, geographically distributed, or transient. This presents technical and jurisdictional challenges for identification and seizure by law enforcement and national security agencies, which can impede digital forensic investigators and potentially prevent agencies from acquiring digital evidence and forensically analysing digital content in a timely fashion (Taylor et al. ). The increasing volume of data is also affecting the timely location of evidence and intelligence on seized devices.
The motivation for conducting research into the growing volume of digital forensic data can be summarised as follows. Electronic devices and storage media, including cloud storage and Internet of Things devices, are increasingly being used by consumers, businesses, and government users to store growing amounts of data, which can be accessed with portable devices, computers, or mobile phones. Criminals are embracing the growth in technology, both for communication opportunities and as a means of enabling crime, such as storing illicit data on portable devices or in cloud file hosting services. Investigations can stall if potential evidence cannot be identified and preserved, or if this cannot be done in a timely manner.