Detecting APT Attacks With Sparse Data: A Knowledge Graph Embedding Approach That Works

Let's say you're building a threat detection system. You have security logs, some telemetry, and a few threat intelligence reports. Your data is patchy. You're not seeing full attack chains. What you have is more like a collection of fragments — a compromised credential here, a suspicious IP there, maybe a technique ID from the ATT&CK framework if you're lucky. Your job is to connect those fragments and reconstruct plausible attack paths, even when most of the intermediate steps are missing.

This is exactly the problem that knowledge graph embedding (KGE) was supposed to solve. You model your security incidents as a graph, use link prediction to fill the gaps, and you have a framework for anticipatory defense. In theory, at least. In practice, it breaks almost immediately when you point it at Advanced Persistent Threats.

Why APTs Break Knowledge Graph Models

APTs are long-running, coordinated cyberattacks. They move slowly, pivot across systems, and stay hidden for weeks or months. The data around them is scarce because most of it is never shared publicly. But the problem goes deeper than just having few examples.

Standard KGE models assume structural density: they need many examples of each relation and enough connected paths to learn from. APT graphs offer neither. One step in an attack might be a phishing email. Then nothing for two days. Then a privilege escalation event from a random IP. In graph terms, these dependencies are not just underrepresented — they are systematically absent. The model never trains on them, so it never learns to detect them.

The core issue is not just that the graphs are small. It is that they are structurally deficient in ways that destroy the assumptions baked into most embedding methods. Even advanced models like ComplEx or DistMult cannot recover from this because their learning signal is tied to graph structure, not domain semantics. The gaps are concentrated exactly where you need the model to work hardest — privilege escalation, lateral movement, exfiltration.
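To see why the learning signal is tied to structure, consider how DistMult scores a candidate triple: the score is just a three-way product of learned vectors, and a vector only moves away from its random initialization when training edges actually touch it. A minimal sketch with illustrative entity and relation names (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy embeddings for a few entities and relations.
entities = {name: rng.normal(size=dim) for name in
            ["phish_email", "workstation_7", "domain_admin"]}
relations = {name: rng.normal(size=dim) for name in
             ["delivered_to", "escalates_to"]}

def distmult_score(h, r, t):
    # DistMult: score(h, r, t) = <h, r, t>, an element-wise product summed.
    return float(np.sum(entities[h] * relations[r] * entities[t]))

# If "escalates_to" edges never appear in training, the relation vector
# stays near random initialization and this score carries no signal.
s = distmult_score("phish_email", "escalates_to", "domain_admin")
```

Note that DistMult's score is symmetric in head and tail, a further structural limitation that motivated models like ComplEx — yet even those richer scorers still learn only from edges that exist.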

APT-ST-AN: Two Problems, One Architecture

A team from the University of Electronic Science and Technology of China took a different approach. Instead of changing the model, they changed the graph. Their system, APT-ST-AN, addresses two distinct failure modes: not enough positive examples, and not enough useful negative ones.

Part 1: Spatio-Temporal Attribute Reasoning (ST)

APT attacks have structure. They play out over time, across specific systems, often with repeatable sequences. If one technique tends to follow another, that relationship has predictive value. APT-ST-AN captures this by introducing a new relation type called "next". It encodes temporal ordering directly into the graph. So if technique A typically precedes technique B, a "next" edge connects them.

These relations are extracted from the MITRE ATT&CK framework, the Attack Flow project, and a lightweight rule-mining system called AnyBURL. AnyBURL discovers logical patterns from existing triples — if technique A often precedes B across multiple campaigns, this pattern becomes a rule. Inferred edges from these rules then become new positive triples in the graph.
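The rule-mining idea can be sketched with simple support counting: if technique B directly follows technique A in enough campaigns, emit an inferred "next" triple. The campaign sequences below are hypothetical, and the support-threshold scheme is a simplification of what AnyBURL's Horn-rule mining does, but the technique IDs are real ATT&CK identifiers (T1566 Phishing, T1059 Command and Scripting Interpreter, T1068 Exploitation for Privilege Escalation, T1041 Exfiltration Over C2 Channel):

```python
from collections import Counter

# Observed technique sequences from several (hypothetical) campaigns,
# written as MITRE ATT&CK technique IDs.
campaigns = [
    ["T1566", "T1059", "T1068", "T1041"],  # phish -> exec -> privesc -> exfil
    ["T1566", "T1059", "T1041"],
    ["T1566", "T1068", "T1041"],
]

# Count adjacent technique pairs across all campaigns.
pair_counts = Counter()
for seq in campaigns:
    for a, b in zip(seq, seq[1:]):
        pair_counts[(a, b)] += 1

# Pairs seen in at least MIN_SUPPORT campaigns become "next" triples.
MIN_SUPPORT = 2
next_triples = [(a, "next", b) for (a, b), c in pair_counts.items()
                if c >= MIN_SUPPORT]
```

Each emitted triple then joins the graph as a new positive example, which is how the positive side of the training signal grows without any manual labeling.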

The effect is significant. The base knowledge graph in this study had around three thousand triples. After applying spatio-temporal reasoning, it grew to almost seven thousand. That is not noise or fabricated data. It is inferred structure derived from observable patterns in real threat intelligence. The model now has far more semantic coverage to generalize from during inference.

Part 2: Adversarial Negative Sampling (AN)

Every KGE model needs negative examples during training — deliberately false triples that contrast with the real ones. Without them, the model cannot learn to distinguish plausible connections from implausible ones. Most systems use random negative sampling. But in sparse graphs, randomly selected false triples are obvious and trivial. They do not provide a useful learning signal.
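The standard baseline corrupts the head or tail of a true triple with a uniformly random entity. A minimal sketch (illustrative triples, tail corruption only):

```python
import random

random.seed(0)

triples = [("T1566", "next", "T1059"), ("T1068", "next", "T1041")]
entities = sorted({e for h, _, t in triples for e in (h, t)})

def random_negative(triple):
    # Corrupt the tail with a uniformly random other entity -- the
    # standard baseline. In a sparse graph, most such corruptions are
    # trivially false and teach the model very little.
    h, r, t = triple
    t_new = random.choice([e for e in entities if e != t])
    return (h, r, t_new)

neg = random_negative(triples[0])
```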

APT-ST-AN replaces random negatives with adversarial and synthetic hard negatives. The process has three stages:

  1. Adversarial generation: The system applies FGSM — the Fast Gradient Sign Method — to perturb entity embeddings slightly in the direction of the loss gradient. This creates examples that look plausible but are definitely false. Hard to dismiss, hard to classify correctly.
  2. Synthetic mixing: Real positive triples are blended with their adversarial variants at the feature level. The mix is biased toward the adversarial component to avoid generating examples too close to real positives.
  3. Filtering: Cosine similarity is used to retain only the most informative synthetic negatives — those close enough to the positive distribution to be challenging, but clearly not valid edges.
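The three stages above can be sketched at the embedding level. This is a stand-in, not the paper's implementation: it uses a DistMult-style scorer, whose gradient with respect to the tail embedding is simply h * r, as the target of the FGSM step, and the step size, mixing weight, and similarity band are invented hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
eps, alpha = 0.1, 0.8        # FGSM step size; mix biased toward adversarial
sim_lo, sim_hi = 0.5, 0.95   # similarity band: challenging, but not a duplicate

h = rng.normal(size=dim)     # head entity embedding
r = rng.normal(size=dim)     # relation embedding
t = rng.normal(size=dim)     # true tail embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Adversarial generation: for a DistMult-style score sum(h * r * t),
#    the gradient w.r.t. t is h * r, so FGSM perturbs along its sign.
t_adv = t + eps * np.sign(h * r)

# 2. Synthetic mixing, biased toward the adversarial component.
t_syn = alpha * t_adv + (1 - alpha) * t

# 3. Filtering: keep only synthetics inside the target similarity band.
keep = sim_lo < cosine(t_syn, t) < sim_hi
```

The filter matters: synthetics that are nearly identical to the positive would poison training, while ones far from it degenerate back into trivial negatives.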

The result is a set of negative examples that forces the model to sharpen its decision boundaries. Combined with the enriched positive side from the ST module, the model now has both more to learn from and a harder optimization target.

Does It Work?

The authors evaluated APT-ST-AN on multiple small-scale knowledge graphs, with a focus on real APT scenarios. The results were consistent. APT-ST-AN outperformed both classical models such as TransE and RotatE and domain-specific baselines designed for security graphs. Each module contributed measurably — the ST component improved results on its own, and adding the AN component pushed performance higher still.

The improvements held not only on the security-domain datasets but also on public benchmarks from film and literature. This matters because it suggests the approach generalizes. It is not just a tuned solution for one particular graph structure.

What You Can Take From This

If you are building infrastructure that relies on predictive modeling over sparse knowledge graphs — not just in security, but in any domain where full data visibility is never available — this paper offers several transferable ideas.

  • Sequential relations add real value. Most knowledge graphs treat edges as flat, unordered facts. But in threat intelligence, fraud investigation, or biological signaling, ordering matters. Encoding "this tends to follow that" as an explicit edge type is lightweight and interpretable.
  • Hard negatives are worth the effort. Random negative sampling is still the default in most production KGE pipelines. In sparse graphs, this is a significant missed opportunity. Adversarial and synthetic negatives can reshape the optimization landscape meaningfully.
  • Rule mining is cheap augmentation. AnyBURL mines logical patterns from graph structure with low computational overhead. If edge density is a problem in your graph, rule-based inference can generate high-confidence positives without manual labeling. This works for any domain with local causal or temporal patterns.
  • The approach scales to other domains. The core insight — that latent causal structure can be recovered through sequential reasoning and adversarial sampling — applies anywhere the full graph is never visible. Financial fraud sequences, supply chain risk networks, medical event chains. The same principles hold.

Conclusion

APT-ST-AN is a targeted response to a specific structural problem: what do you do when your knowledge graph is too sparse to train on reliably? The answer is to enrich both sides of the training signal — more meaningful positives through sequential reasoning, harder negatives through adversarial construction. The result is an embedding model that can make useful predictions even when the data gives it very little to work with.

If you want to dig into the full ablation results or the details of the spatio-temporal rule extraction process, the paper is published in the journal Array (DOI: 10.1016/j.array.2025.100404).
