Finding Security Vulnerabilities with Cypher and Neo4j

In this blog post I want to demonstrate how Neo4j and open source tools like jQAssistant can be used to detect vulnerabilities in software systems. Such issues are usually based on certain execution patterns of code that are harmless individually but become critical when combined.

As an example I want to use the deserialization vulnerability that was hot news in November 2015. I’d like to show how you can check for yourself whether similar issues can be triggered by any given combination of libraries.

The Java Deserialization RCE Vulnerability

At the beginning of November a wave of security alerts was raised in the Java ecosystem. A large number of application servers, CI setups and other systems were reported to exhibit a remote code execution issue.

This issue is based on the unchecked deserialization of untrusted, user-supplied data into Java objects. During or after the deserialization process, methods are executed on the objects resulting from that data, which can eventually lead to arbitrary code execution on the JVM and even on the host system (via Runtime.exec).

The issue was first presented by Chris Frohoff and Gabriel Lawrence as early as January 2015 at AppSecCali 2015.

The attack is built upon serializing an instance of a JDK class (AnnotationInvocationHandler) that happens to call methods on the deserialized data in its readObject method. This was combined with the ability of the Apache Commons Collections library to wrap collection classes in transformer facilities, one of which (InvokerTransformer) can be configured to invoke methods dynamically via reflection when transforming values. Together with Java’s ability to run operating system or other commands on the host machine via Runtime.exec, this opens the door to arbitrary code execution.
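To make the mechanics more tangible, here is a toy Python model of the chain described above. It is only an analogy under stated assumptions, not the actual JDK or Commons Collections code: `InvokerTransformer` is named after its Java counterpart, while `Chain`, `Handler`, `receive` and the harmless string methods are hypothetical stand-ins.

```python
# Toy analogy of the gadget chain; NOT the real JDK / Commons Collections code.
# All classes and functions here are hypothetical stand-ins.


class InvokerTransformer:
    """Calls a method chosen by *name* on whatever object it is given."""

    def __init__(self, method_name, args=()):
        self.method_name = method_name
        self.args = tuple(args)

    def transform(self, value):
        # Reflection-like lookup: the method name comes from (untrusted) data.
        return getattr(value, self.method_name)(*self.args)


class Chain:
    """Feeds the output of one transformer into the next."""

    def __init__(self, transformers):
        self.transformers = list(transformers)

    def transform(self, value):
        for transformer in self.transformers:
            value = transformer.transform(value)
        return value


class Handler:
    """Stand-in for a class whose readObject touches the deserialized data."""

    def __init__(self, transformer, payload):
        self.transformer = transformer
        self.payload = payload

    def read_object(self):
        # Looks innocent, but runs whatever the chain was configured to do.
        return self.transformer.transform(self.payload)


def receive(handler):
    """Models a server that 'deserializes' an object and triggers its read hook."""
    return handler.read_object()


# A harmless chain: strip, then upper-case. An attacker would pick other methods.
chain = Chain([InvokerTransformer("strip"), InvokerTransformer("upper")])
result = receive(Handler(chain, "  hello "))
print(result)
```

Each piece is harmless on its own; the risk only appears once the receiving side runs a chain whose method names were chosen by the sender.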

Many Java servers accept serialized Java objects over the wire and deserialize that data within their process. This happens for protocols like RMI, JMS and JMX but also often for custom RPC and other protocols.

Java code running on a server is not subject to the JVM security manager sandbox by default, so there is no automatic protection against this kind of code execution.

But due to the complexity of the setup and the low-key presentation of the issue, it was mostly ignored by other researchers and the public.

Only in late October did Code White researcher Matthias Kaiser demonstrate in detail how to remotely execute code on an Atlassian Bamboo installation. This was picked up by Steve Breen, who demonstrated similar vulnerabilities in other popular application servers and products (e.g. WebLogic, IBM WebSphere, JBoss, Jenkins, OpenNMS). Without notifying the vendors of these products, he wrote a rather polemical blog post on the subject, which was quickly picked up by news outlets.

Most of the mentioned products and projects have released fixed versions of their software, but there are certainly still a lot of vulnerable installations running.

The Vulnerability Pattern

The basic pattern of this vulnerability is pretty straightforward. It is more or less a long execution chain, which starts with a class that is serializable (i.e. transitively implements java.io.Serializable) and also contains a readObject method. The chain ends with a call to java.lang.reflect.Method.invoke().

Within that chain you have invocations either directly on target methods or on methods on interfaces / base-classes that are then implemented by methods transitively leading to the expected chain end.

This is a visual representation:

In a graph model you can represent code as a graph, with nodes for packages, classes, interfaces, methods, fields and parameters, and relationships for inheritance, delegation, invocation, and read and write operations.

In such a model, declaring the start and end methods becomes straightforward, and finding the arbitrarily long execution chain turns into a shortest-path operation.
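The idea can be sketched outside of Neo4j with a toy call graph and a breadth-first search. This is a minimal illustration, not how jQAssistant or Neo4j implement it; the method names in `EDGES` are made up and do not come from a real scan.

```python
from collections import deque

# Hypothetical toy call graph as (source, relationship, target) triples.
EDGES = [
    ("Cache.readObject", "INVOKES", "Map.get"),
    ("Map.get", "OVERRIDDEN_BY", "LazyMap.get"),
    ("LazyMap.get", "INVOKES", "Transformer.transform"),
    ("Transformer.transform", "OVERRIDDEN_BY", "Invoker.transform"),
    ("Invoker.transform", "INVOKES", "Method.invoke"),
    ("Cache.readObject", "INVOKES", "Logger.debug"),
]


def build_adjacency(edges, allowed=("INVOKES", "OVERRIDDEN_BY")):
    """Keep only the relationship types we want to traverse."""
    adjacency = {}
    for source, relationship, target in edges:
        if relationship in allowed:
            adjacency.setdefault(source, []).append(target)
    return adjacency


def shortest_path(adjacency, start, end):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, "Cache.readObject", "Method.invoke")
print(" -> ".join(path))
```

Following both direct invocations and overrides is what lets the search cross interface and base-class boundaries.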

Fortunately there are tools that provide this out of the box. Meet jQAssistant.

In my first encounter with Neo4j in 2008 I experimented with loading the JDK class library into Neo4j using the aforementioned model. Even without Cypher this was really fast and fun to play with. Years later I had a long conversation with my friend Dirk Mahler about software analytics with graphs, covering code quality, software erosion, architectural rules and the lack of flexibility in commercial tools.

Our discussions eventually led to the creation of the open-source project jQAssistant, a software analytics toolkit built upon Neo4j, Cypher and an extensible plugin system. Basically it uses plugins to parse information not only from source code but also from many other sources (build files, database metadata, descriptors, config servers and many more). The initial graph model from that raw data is then enriched using provided concepts as well as custom technical (e.g. Service, Test, UI-Component) and business (e.g. Order-Management, Recommendation-Engine, User-Profile) concepts. In most cases these are applied as node labels representing higher-level concepts, but they can also be more complex graph structures.

Based on the enriched graph data, you can compute metrics and output reports for your software project. But most interestingly, you can create your own architectural constraints or other rules that your entire project should adhere to. If the constraints return violations, those are reported and your project build can be made to fail.
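The "constraint returns rows, build fails" idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the constraint ids, the query keys and the stubbed `run_query` callable are hypothetical and not part of jQAssistant.

```python
def evaluate(constraints, run_query):
    """Run every constraint query and collect the rows that violate it."""
    violations = {}
    for constraint_id, query in constraints.items():
        rows = run_query(query)
        if rows:  # any returned row counts as a violation
            violations[constraint_id] = rows
    return violations


def build_result(violations):
    """Fail the build if at least one constraint reported violations."""
    return "FAILURE" if violations else "SUCCESS"


# Stubbed query results keyed by query text, replacing a live database.
FAKE_RESULTS = {
    "QUERY_UI_CALLS_DB": [{"caller": "ui.OrderView", "callee": "db.OrderTable"}],
    "QUERY_TEST_WITHOUT_ASSERT": [],
}
# Hypothetical constraint ids mapped to their queries.
CONSTRAINTS = {
    "example:UiMustNotCallDatabase": "QUERY_UI_CALLS_DB",
    "example:TestsMustAssert": "QUERY_TEST_WITHOUT_ASSERT",
}

violations = evaluate(CONSTRAINTS, lambda query: FAKE_RESULTS[query])
print(build_result(violations), sorted(violations))
```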

All concepts, constraints, metrics and reports are declared as Cypher statements, either in self-documenting text documents or XML descriptors. Each of them can depend on others, so during their application requirements are automatically resolved and you only run the minimal set of Cypher statements necessary.
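How such dependency resolution might work can be sketched as a depth-first walk over the rule requirements. This is an illustrative sketch, not jQAssistant's implementation, and the rule ids in `RULES` are invented.

```python
# Hypothetical rule catalogue: rule id -> ids of the rules it requires.
RULES = {
    "example:SerializableType": [],
    "example:ReadObjectMethod": ["example:SerializableType"],
    "example:ReflectiveCall": [],
    "example:NoReflectionFromReadObject": [
        "example:ReadObjectMethod",
        "example:ReflectiveCall",
    ],
    "example:UnrelatedMetric": [],
}


def execution_order(rules, requested):
    """Return the requested rules plus their transitive requirements,
    ordered so that every rule comes after the rules it depends on."""
    order = []
    done = set()
    in_progress = set()

    def visit(rule_id):
        if rule_id in done:
            return
        if rule_id in in_progress:
            raise ValueError(f"cyclic dependency at {rule_id}")
        in_progress.add(rule_id)
        for required in rules[rule_id]:
            visit(required)
        in_progress.discard(rule_id)
        done.add(rule_id)
        order.append(rule_id)

    for rule_id in requested:
        visit(rule_id)
    return order


order = execution_order(RULES, ["example:NoReflectionFromReadObject"])
print(order)
```

Rules that nothing depends on, such as the unrelated metric above, are never visited and therefore never executed.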

You can use the full richness of Cypher to match patterns, compute cardinalities or compare collections of elements.

More details about jQAssistant can be found on the project site, in dedicated blog posts and in the extensive documentation.

Now let’s see how we can use this toolset to find security vulnerabilities in our code.

Locate the Pattern using jQAssistant and Cypher

After downloading jQAssistant, you can simply scan a folder containing the relevant JAR files. In our case these are the rt.jar from the JDK and the commons-collections.jar from Apache Commons Collections. For your own use case it makes sense to scan the combination of JARs you use in your application server and/or project.

bin/jqassistant.sh scan -f lib/*.jar

If you just want to explore the scanned data, you can start the Neo4j server with:

bin/jqassistant.sh server

First, let’s think about what we’re looking for. We discussed it above in prose, but what would it look like as Cypher statements using the graph model of the jQAssistant Java plugin?

If we start with the endpoints of our vulnerability execution chain:

Serializable classes that implement readObject (393 classes)

MATCH (:Interface {fqn:"java.io.Serializable"})<-[:EXTENDS*]-(class),
      (class)-[:DECLARES]->(m:Method {name:"readObject"})
RETURN count(*);
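As a rough illustration of what this variable-length match computes, the following Python sketch walks a toy type hierarchy. The `example.*` types and their methods are invented, and the dictionaries only stand in for the scanned graph.

```python
# Hypothetical mini type hierarchy: type -> direct supertypes.
SUPERTYPES = {
    "java.io.Serializable": [],
    "example.BaseEvent": ["java.io.Serializable"],
    "example.OrderEvent": ["example.BaseEvent"],
    "example.Plain": [],
}
# Hypothetical declared methods per type.
DECLARED_METHODS = {
    "example.BaseEvent": {"toString"},
    "example.OrderEvent": {"readObject", "getId"},
    "example.Plain": {"readObject"},
}


def is_subtype_of(type_name, ancestor, supertypes):
    """True if ancestor is reachable over one or more supertype edges."""
    stack = list(supertypes.get(type_name, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(supertypes.get(current, []))
    return False


def candidate_entry_points(supertypes, declared_methods):
    """Types that are transitively serializable and declare readObject."""
    return sorted(
        type_name
        for type_name, methods in declared_methods.items()
        if "readObject" in methods
        and is_subtype_of(type_name, "java.io.Serializable", supertypes)
    )


candidates = candidate_entry_points(SUPERTYPES, DECLARED_METHODS)
print(candidates)
```

Only the type that both reaches `java.io.Serializable` through its supertypes and declares `readObject` qualifies as a starting point.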

Calls to the method java.lang.reflect.Method.invoke() (258 calls)

MATCH (:Class {fqn:"java.lang.reflect.Method"})-[:DECLARES]->(m:Method {name:"invoke"})
RETURN size(()-[:INVOKES]->(m)) as invocations;

To enrich our existing model so that it is more suitable for scanning across JARs and along method overrides, we need to apply two additional concepts. Both are provided by the Java plugin but are not applied by default.

bin/jqassistant.sh analyze -concepts classpath:Resolve,java:MethodOverrides

Now we have everything in place to determine the transitive connection between the readObject method of a serializable class and a call to Method.invoke(), across several intermediate steps.

MATCH (:Interface {fqn:"java.io.Serializable"})<-[:EXTENDS*]-(serialized:Class)-[:DECLARES]->(readObject:Method {name:"readObject"})
MATCH (:Class {fqn:"java.lang.reflect.Method"})-[:DECLARES]->(invoke:Method {name:"invoke"})

MATCH path = shortestPath((readObject)-[:INVOKES|:OVERRIDDEN_BY*]->(invoke))

RETURN serialized,path
ORDER BY length(path) ASC LIMIT 10;

Here you can see one of the many paths that represent a potential vulnerability and could / should be examined more closely.

[Figure: example path from a readObject method to Method.invoke(), found with jQAssistant]

This was just one example of how software analytics with Neo4j can be used to immediately acquire highly valuable information about your software projects.

There are many more applications:

manage consistency of transitive library dependencies of a multitude of projects
infer software modules from an unstructured codebase
incrementally improve software quality by using team-provided language or architectural rules
assert synchronization between database (or other) metadata and related code
structural search of "interesting" code structures
generate graph visualizations with virtual nodes and relationships that aggregate lower level metrics and dependencies
manage service or component visibility and discoverability even in absence of a module system
enrich static structural information with runtime trace information, heap structures, test execution results, build information, component interactions to discover new insights or support decisions for your team or yourself
render information derived from the graph as interactive graphs, charts, diagrams, city-maps with drill-down and comparison abilities

If you have other cool ideas about what you could achieve by treating information around software projects as a graph, please let me know. You can also join our graph-software-analytics Google group and share your ideas or related projects there.