Vlad Tsyrklevich
Notes on Octopus and Gremlin 3

Octopus is the next iteration of joern. It’s still under development, but the dev branch is functional and includes some set-up documentation. The changes include a new architecture, a move from Neo4j to TitanDB, and an upgrade to TinkerPop 3, which includes an updated version of gremlin. The general idea of how gremlin graph traversals work is unchanged, but there are a couple of differences that could trip up those familiar with gremlin v2. These are my terse notes from some experiments; they might be useful to others who are working with octopus:

  • If you hit an error like the following on an OS X Homebrew Python install:
:projects:octopus:octopusMlutils
running install
error: can't combine user with prefix, exec_prefix/home, or install_(plat)base
:projects:octopus:octopusMlutils FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':projects:octopus:octopusMlutils'.
> Process 'command 'python3'' finished with non-zero exit value 1


The solution is quite simple: remove the --user flag from the installation invocation (patch).

  • Octopus comes with a handy built-in shell; however, the default gremlin output is quite bare. I wrote a simple pretty printer that makes the user experience a bit nicer here. Coloration would make the output a lot easier on the eyes. The code is ugly, so I haven’t contributed it back to the built-in joern steps.
> g.V(618000400, 38629568, 77070352)
v[38629568]
v[77070352]
v[618000400]
> g.V(618000400, 38629568, 77070352).pp(true)
vertex id: 618000400   	[_key: 755133, childNum: 0, code: interval = cpuattr . ppattr_cpu_attr_interval * NSEC_PER_SEC, functionId: 754897, isCFGNode: True, location: 266:4:8291:8349, type: ExpressionStatement]
  in function handle_resourceuse in file /octopus/data/projects/xnu-3248.60.10.tar.gz/src/xnu-3248.60.10/bsd/kern/process_policy.c at line 266

vertex id: 38629568    	[_key: 94060, childNum: 19, code: size = sizeof ( dtrace_aggdesc_t ) + ( aggdesc . dtagd_nrecs * sizeof ( dtrace_recdesc_t ) ), functionId: 93256, isCFGNode: True, location: 16378:2:433104:433191, type: ExpressionStatement]
  in function dtrace_ioctl in file /octopus/data/projects/xnu-3248.60.10.tar.gz/src/xnu-3248.60.10/bsd/dev/dtrace/dtrace.c at line 16378

vertex id: 77070352    	[_key: 93669, childNum: 17, code: size = sizeof ( dtrace_eprobedesc_t ) + ( epdesc . dtepd_nrecs * sizeof ( dtrace_recdesc_t ) ), functionId: 93256, isCFGNode: True, location: 16287:2:430850:430939, type: ExpressionStatement]
  in function dtrace_ioctl in file /octopus/data/projects/xnu-3248.60.10.tar.gz/src/xnu-3248.60.10/bsd/dev/dtrace/dtrace.c at line 16287
  • Gremlin v3 requires parentheses for traversal steps without parameters, e.g. vertex.out is now vertex.out(). Apparently SugarPlugin allows you to continue to use the old syntax though.
  • There are now two separate keys on vertices: the default TitanDB IDs (numeric but not contiguous) and an embedded _key vertex property (numeric by default and contiguous). Property values like functionId reference the _key value, not the TitanDB ID.
  • Octopus has an exact-match string index on the _key and type properties and a Lucene (text search) index on code. The text search syntax is a little different, e.g. getCallsTo("*copyin*") is now getCallsTo(textContains("copyin")). See this documentation for supported text comparisons.
  • Gremlin v3 includes a native match step, so the joern step that performs a recursive search of AST children has been renamed to _match.
  • The identity pipe _() no longer works to transform individual vertices or Java collections into pipelines (or traversals, in gremlin v3 parlance). You can still create traversals from individual vertices or Java collections with g.V(vertex) or g.V(collection.toArray()). Alternatively, you can emulate the old behavior with the following hack (not recommended):
Collection.metaClass._ = { return g.V(delegate.toArray()) }
com.thinkaurelius.titan.graphdb.vertices.AbstractVertex.metaClass._ = { return g.V(delegate) }
  • Here is a simple working query as an example. It performs an intra-procedural search for a copyin() into a variable that is later involved in a multiplication:
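getCallsTo("copyin")                                             // all call sites of copyin()
    .as('copy')                                                  // label the call so we can return to it
    .sideEffect { start_node = g.V(it.get()).statements()[0] }   // enclosing statement (not used below)
    .ithArguments('1')                                           // second argument (indices start at 0)
    ._match { it.get().value('type') == 'Identifier' }           // identifiers in that argument's AST
    .sideEffect { var_name = it.get().value('code') }            // remember the variable name
    .select('copy')                                              // jump back to the copyin() call
    .flatMap {
        g.V(it.get()).reachableCfgNodes(true)                    // CFG nodes reachable from the call
            .has('code', textContains(var_name))                 // statements mentioning the variable
            .as('stat')
            .astNodes().has('type', 'MultiplicativeExpression')  // that contain a multiplication
            .astNodes().has('code', var_name)                    // with the variable as an operand
            .select('stat')                                      // return the matching statement
    }
    .dedup()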
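
As a smaller illustration of the _key note above: since functionId holds a _key value rather than a TitanDB ID, following it to the referenced vertex means matching on the _key property. This is only a sketch. It assumes it is run inside the Octopus shell, where g is already bound, and that _key is stored as a number. 93256 is the functionId shown in the pretty-printed output earlier.

```groovy
// Assumes the Octopus shell, where `g` is already bound to the graph.
// 93256 is the functionId of the dtrace_ioctl statements printed earlier.

// functionId holds a _key value, so match on the _key property
// instead of passing the number to g.V(), which expects a TitanDB ID.
g.V().has('_key', 93256)

// Gremlin v3 style: steps take parentheses even without arguments.
g.V().has('_key', 93256).out()
```

If the first query returns nothing, _key may be stored as a string in your database; in that case try quoting the value.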


Edit 10/31: Tim Hemel is working on a good tutorial on learning gremlin3 using octopus specifically, as he mentions here. It’s unfinished, but I’ve already picked up some useful information from it.

Update 11/13: I’ve uploaded some more Octopus steps I worked on and found useful here and here.

Published on 31 Oct 2016