• “Leakage” in machine learning (ML) occurs when data that an ML model should not learn on is included at training time, often in unexpected ways.
• This can cause overconfidence in ML model training results, producing cybersecurity ML models that fail to recognize threats.
• CrowdStrike data scientists employ strategic data splitting during ML model training to prevent data leakage.

Since day one, CrowdStrike’s mission has been to stop breaches. Our pioneering AI-native approach quickly set our platform apart from legacy cybersecurity vendors that relied heavily on reactive, signature-based approaches for threat detection and response.
Our use of patented models across the CrowdStrike Falcon® sensor and in the cloud enables us to quickly and proactively detect threats, even unknown or zero-day threats. This requires accurate threat prediction by the CrowdStrike Falcon® platform.
To achieve this critical requirement, CrowdStrike data scientists think carefully about how we train and evaluate our ML models. We train our models on datasets containing millions of cybersecurity events. These events can be structured in certain ways; they can have dependencies or similarities to one another. For example, we might collect multiple data points for a single malicious process tree, and those data points will be closely related to one another, or we might collect malicious scripts that are extremely similar.
Because of the kinds of relationships present in cybersecurity data, our domain requires us to carefully consider the ML concepts of train-test leakage and data splitting. When observations are not independent of one another, the data should be split in a way that does not cause overconfidence. Otherwise, we might believe our model handles malicious processes very well, only to find that it fails to recognize new threats when they appear.
In this post, we explain why CrowdStrike data scientists adopt strategic data splitting when training our ML models. Employing this approach, as discussed below, helps prevent train-test leakage in datasets with interdependent observations and helps ensure more reliable model performance against novel threats in the wild.
One tenet of ML is to split the data into train, validation, and test sets, or to perform cross-validation, where the data are repeatedly partitioned into training and testing folds. The model learns from the training data and is then evaluated on the validation/testing data. This gives us a reasonable expectation of real-world performance and lets us select a winner from competing models.
As an analogy for train-test leakage, imagine you’re evaluating a student in a class with a final test. To prepare them for the test, you give them a set of practice questions: the training set. If the practice questions are too closely related to the actual test questions (for example, only changing a few words), the student might ace the test just by memorizing the practice questions. The student performed well, but we are overconfident in how much they have learned, because information leaked from the training set into the actual test, giving us an inflated view of their true knowledge.
A common statistical assumption is that observations in the data are independent. However, in real-world scenarios, data points often relate to each other. If we train using data that are not independent, we get train-test leakage: the training data contains information it should not be expected to have. When correlated observations are mixed randomly into both the train and test sets, the model’s training data depends on the testing data in a way that may not be realistic for the model in production. Real-world performance for the ML model may therefore not match what was seen in testing.
This problem is not limited to one domain. Kapoor and Narayanan (2023) describe how different types of data leakage have contributed to a reproducibility crisis in ML-based science, spanning more than 290 papers across 17 fields, due to the overoptimism that leakage produces.
Different modeling strategies for nonindependent data are possible, such as linear mixed models or time-series approaches, but many performant predictive models, including tree-based ensembles and neural networks, may not be designed to account for these dependency structures.
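To make the conventional setup concrete, here is a minimal sketch of a random hold-out split plus random k-fold cross-validation, using generic scikit-learn code on synthetic data rather than anything from CrowdStrike’s pipeline. The next section looks at why this purely random approach can mislead when observations are dependent.

```python
# Minimal sketch of conventional random splitting: a hold-out test set plus
# random k-fold cross-validation. Synthetic data and a generic classifier
# stand in for real cybersecurity features and models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Hold out a final test set; the remainder is used for training and validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)

# Random 5-fold cross-validation on the training portion.
cv_scores = cross_val_score(
    model, X_train, y_train, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)
print("Cross-validation accuracy:", cv_scores.mean())

# Final evaluation on the held-out test set.
model.fit(X_train, y_train)
print("Held-out test accuracy:", model.score(X_test, y_test))

# Both splits above are purely random. When observations are correlated, this
# is exactly the setup that can leak information from training into testing.
```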

From Random to Strategic Data Splitting

A real-world, physical science example of these kinds of dependency structures can be found in ecological data, as noted by Roberts et al. (2016). In that domain, it is common to observe autocorrelation in space or time, or dependency among observations from the same individuals or groups.
For example, if data points are spatially autocorrelated (related by location), traditional random train-test splits can lead to misleading results. This happens because nearby locations share similar features, like climate, which can leak information between the training and test sets.
A random split may therefore inflate the performance estimate of the model. At prediction time we may get data from an entirely new region, but the data splitting method has already led to overoptimism and overfitting.
This calls for a more careful approach to data splitting. The Roberts et al. study recommends splitting the data into blocks, where each block groups together data that are dependent at some level, and assigning each block to a cross-validation fold. In the ecological example, grouping nearby locations into one block prevents data leakage and gives more accurate estimates of model performance.
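As a rough illustration of this blocking idea, the sketch below compares random k-fold with grouped (blocked) k-fold using scikit-learn’s GroupKFold, which keeps all observations from the same block in a single fold. The spatially structured data here is synthetic and of our own construction; it is not taken from the Roberts et al. study.

```python
# Sketch: random k-fold vs. blocked (grouped) k-fold on synthetic, spatially
# structured data. Sites within a region share regional features (e.g., climate),
# so random folds let information leak between training and testing.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(1)

# 50 regions with 40 sites each; each site inherits its region's feature profile.
n_regions, sites_per_region = 50, 40
n_sites = n_regions * sites_per_region
region_features = rng.normal(size=(n_regions, 5))
X = np.repeat(region_features, sites_per_region, axis=0) + 0.1 * rng.normal(size=(n_sites, 5))

# The response depends on one feature plus unmodeled regional variation.
regional_variation = rng.normal(size=n_regions)
y = X[:, 0] + np.repeat(regional_variation, sites_per_region) + 0.1 * rng.normal(size=n_sites)

blocks = np.repeat(np.arange(n_regions), sites_per_region)  # block ID = region ID
model = RandomForestRegressor(random_state=0)

# Random folds: sites from the same region land in both training and test folds.
random_cv = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Blocked folds: every site from a region stays within a single fold.
blocked_cv = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=blocks)

print("Random k-fold R^2: ", random_cv.mean())   # typically optimistic
print("Blocked k-fold R^2:", blocked_cv.mean())  # closer to performance on new regions
```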

CrowdStrike’s Solution to Data Leakage

One approach CrowdStrike takes to stop breaches is applying ML to detect malicious processes by their behaviors. However, observations from a process are correlated with other observations from that process, and with other processes from its process genealogy and machine of origin. We therefore experimented with “blocking” by machine.
There are trade-offs here, too. Blocking the data may limit what is seen in predictor space (the possible feature values), which can decrease the model’s predictive capability. Some experiments below illustrate these concepts.
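As a simplified sketch of what a machine-blocked split can look like, the example below uses scikit-learn’s GroupShuffleSplit with a hypothetical machine_ids array as the grouping key. It is not CrowdStrike’s actual pipeline or data; it only illustrates the mechanics of blocking by machine and the trade-off noted above.

```python
# Sketch of a machine-blocked train/test split: every event from a given machine
# lands on the same side of the split. Features, labels, and machine IDs are
# random placeholders, so only the splitting mechanics are meaningful here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(2)

# Hypothetical data: 5,000 events from 500 machines, 30 behavioral features.
n_events, n_machines = 5_000, 500
X = rng.normal(size=(n_events, 30))
y = rng.integers(0, 2, size=n_events)                 # placeholder labels
machine_ids = rng.integers(0, n_machines, size=n_events)

# Block by machine: hold out roughly 20% of machines, and all of their events.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=machine_ids))

# No machine appears on both sides of the split.
assert set(machine_ids[train_idx]).isdisjoint(machine_ids[test_idx])

model = RandomForestClassifier(random_state=0)
model.fit(X[train_idx], y[train_idx])
print("Accuracy on held-out machines:", model.score(X[test_idx], y[test_idx]))

# Trade-off: feature values that occur only on held-out machines are never seen
# in training, which can narrow the model's view of predictor space.
```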
