By Cinzia Cappiello, April 17, 2023
The availability of large amounts of data facilitates the spread of a data-driven culture in which data are used and analyzed to support decision-making. This is also true in cybersecurity, where the increasing number of threats appearing over time, together with the related public data, has caused a “paradigm shift in understanding and defending against the evolving cyber attacks, from primarily reactive detection toward proactive prediction”.
Conventional data analysis approaches cannot cope with the complexity of the new threats and the velocity with which they are generated and spread throughout the Internet: more flexible and efficient mechanisms are needed. Artificial Intelligence (AI) systems based on Machine Learning (ML) tools, exploiting the power provided by big data architectures, are promising solutions for detecting and mitigating many of these novel cybersecurity attacks. They can analyze large volumes of data, identify anomalies and suspicious behavior, and investigate threats by correlating many data points. Techniques such as regression, classification, and clustering are already used to identify network threats, detect software vulnerabilities, monitor email, and design advanced antivirus applications.
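To make the clustering idea concrete, here is a minimal, self-contained sketch of how clustering can surface anomalous network activity. The data, the 1-D two-cluster k-means, and the connections-per-minute framing are all illustrative assumptions, not part of the original article; in practice a library such as scikit-learn would be applied to real telemetry.

```python
# Illustrative sketch (assumption): flag anomalous connection rates by
# clustering 1-D values into two groups with a tiny k-means (k = 2).

def kmeans_1d(values, iters=20):
    """Cluster 1-D values into two groups; return (centroids, labels)."""
    c = [min(values), max(values)]  # initialize centroids at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        # assign each value to its nearest centroid
        labels = [0 if abs(v - c[0]) <= abs(v - c[1]) else 1 for v in values]
        # recompute each centroid as the mean of its members
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                c[k] = sum(members) / len(members)
    return c, labels

# Synthetic connections-per-minute per host: mostly normal traffic plus a
# few hosts with rates typical of scanning or denial-of-service activity.
rates = [12, 15, 11, 14, 13, 16, 10, 480, 510, 495]
centroids, labels = kmeans_1d(rates)
anomalous = 1 if centroids[1] > centroids[0] else 0
suspects = [r for r, lab in zip(rates, labels) if lab == anomalous]
print(suspects)  # the high-rate hosts end up in their own cluster
```

The same separation-into-clusters idea generalizes to multi-dimensional features (packet sizes, ports, failure counts), where the small or distant cluster is investigated as potentially malicious.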
However, data-based decisions are effective only if the input data sources are relevant and free of quality problems and biases. For this reason, defining a proper analytics pipeline is crucial for guaranteeing appropriate output quality. The pipeline should comprise three steps: data collection, preparation, and analysis.
- Data collection: high volumes of heterogeneous data must be collected and stored. Volume, velocity, and the need for real-time analytics should all be addressed. A Lambda architecture could be a suitable choice, since it can process massive quantities of data and provides both batch-processing and stream-processing paths. Furthermore, data storage requires the design of a proper data lake architecture, in which a data catalog supports source selection, data sharing, and integration.
- Data preparation and quality assurance: the input data should be relevant and reliable. Data preparation includes different pre-processing components such as data cleaning, minimization, and sampling.
- Data analysis: data are analyzed with ML tools to detect and predict attacks.
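The three steps above can be sketched end-to-end on toy data. Everything here is a hedged illustration under stated assumptions: the log fields, the failed-login rule, and the threshold are invented for the example, and the "analysis" step is a trivial stand-in for a real ML detector.

```python
# Illustrative sketch (assumption): a three-step pipeline on synthetic
# log records — collection, preparation/quality assurance, analysis.

# 1. Data collection: raw, heterogeneous log events (some malformed).
raw_events = [
    {"host": "10.0.0.5", "event": "login_failed", "user": "root"},
    {"host": "10.0.0.5", "event": "login_failed", "user": "root"},
    {"host": "10.0.0.5", "event": "login_failed", "user": "root"},
    {"host": "10.0.0.9", "event": "login_ok", "user": "alice"},
    {"host": None, "event": "login_failed", "user": "bob"},  # malformed
]

# 2. Data preparation: cleaning (drop malformed records) and
#    minimization (keep only the fields the analysis needs).
clean = [
    {"host": e["host"], "event": e["event"]}
    for e in raw_events
    if e.get("host") and e.get("event")
]

# 3. Data analysis: a trivial stand-in for an ML detector — flag hosts
#    whose failed-login count exceeds a threshold.
THRESHOLD = 2
failures = {}
for e in clean:
    if e["event"] == "login_failed":
        failures[e["host"]] = failures.get(e["host"], 0) + 1
flagged = sorted(h for h, n in failures.items() if n > THRESHOLD)
print(flagged)
```

In a production pipeline the collection step would feed the batch and streaming layers of the architecture, preparation would also cover deduplication and sampling, and the rule in step 3 would be replaced by a trained classifier or anomaly detector.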
In summary, data-driven cybersecurity based on machine learning tools can help organizations improve their security posture by increasing the accuracy of attack detection and response.