Royal Caliber Consulting for High Performance Computing

# VertexAPI2: An Introduction

VertexAPI2 is a library for the rapid development of high performance GPU programs that process large graphs. Modern GPUs boast extaordinary compute capacity compared to traditional CPUs and have shown orders of magnitude greater performance in many application areas. However, it takes considerable skill and effort to effectively map even simple serial algorithms to the highly parallel architecture of GPUs. Parallel graph algorithms in particular pose many complexities and high performance GPU solutions to graph problems remain in the realm of cutting-edge art. The VertexAPI2 project hides away the complexities of programming a GPU without compromising on performance, allowing experts in the domain of graph analytics to harness the power of GPUs without requiring expertise in GPU programming. VertexAPI2 is strongly inspired by the abstraction used in Graphlab.

VertexAPI2 presently targets NVIDIA GPUs using the CUDA/C++ SDK. VertexAPI2 requires Kepler or higher architecture. Single GPU performance is quite good compared to an equivalent Graphlab program, often faring better by an order of magnitude or more compared to a multi-core CPU. (NOTE: insert performance table here) VertexAPI2 supports multi-GPU setups via MPI, but there are important open problems in load balancing and certain deficiencies in current MPI libraries that cause multi-GPU scaling in the present version to be very poor. VertexAPI2 is an active project and the API is rapidly evolving as we discover improved approaches to express algorithms and automatically map them to parallel hardware. The next few sections describe the abstract model with some examples. The User Guide provides technical documentation for getting setup and building your own code.

## Obtaining VertexAPI2

The VertexAPI2 project is open-source and available under Apache License v2.0. You can clone the repository with:

git clone https://github.com/RoyalCaliber/vertexAPI2.git

There is no installation, the API is available as a bunch of header files. A simple Makefile is provided to compile some utility functions for reading graphs and to build the included example programs. For more details refer to the User Guide.

## The Gather-Apply-Scatter (GAS) Abstraction

VertexAPI2 closely follows the programming model of Graphlab. A graph algorithm is expressed as a series of operations performed at each vertex of a graph. The operations occur in three phases (steps):

Gather phase: each vertex aggregates values associated with its incoming edges and source vertices

Apply phase: each vertex updates its state using the gather result

Scatter phase: each vertex updates the state of every outgoing edge.

No other types of operations are permitted on the vertices or edges within the GAS model. The figure below shows the entities involved in orange in each phase for some vertex v in a larger graph.

All three phases together constitute an iteration. The graph algorithm is usually run until convergence, i.e. when all the edge and vertex values stop changing. We now describe each of these phases and the API in some more detail.

### Gather Phase

A quick word on notation: associated with each vertex is some state that is relevant to the algorithm at hand. The state can be anything, a number or a flag, or a string or a struct with multiple types. For simplicity in the following description we use the symbol $v$ synonymously for both the vertex and its state. Similarly edges can have arbitrary state associated with them as well and we use the same symbol to refer to both an edge and its state.

In the Gather phase, each vertex collects some information on its immediate neighborhood. In order to define the Gather phase for an algorithm, we must define two things:

(1)a function $g(v, u, e)$, which takes as input

the state of a destination vertex $v$

the state of a source vertex $u$

the state of the edge $e$ from $u$ to $v$

(2)an associative binary operator $\oplus$ that aggregates multiple outputs of $g$ into a single value.

Shown in the above figure is some selected vertex $v$ and its immediate neighborhood from a hypothetical graph. $v$ has three incoming edges $a$, $b$ and $c$ with corresponding neighbors $0$, $1$ and $2$. The result of the gather operation on $v$ is given by $$s = g(v, 0, a) \oplus g(v, 1, b) \oplus g(v, 2, c)$$

### Apply Phase

The apply phase is defined by a function $f(v, s)$ that takes the current state of a vertex $v$ and the gather result $s$ for that vertex and returns a new state for $v$. i.e. $$v' = f(v, s)$$ where $v'$ represents the updated state of the vertex $v$.

### Scatter Phase

In the scatter phase, each vertex has the opportunity to modify the state of its outgoing edges. This is specified by a function $h(v', e)$ that takes the updated state of vertex $v$ and the state of an outgoing edge $e$ and returns the new state of that edge. i.e. for the vertex $v$ in the figure above, the scatter operation is $$e' = h(v', e)$$ $$d' = h(v', d)$$ Note that the scatter operation for $v$ cannot modify the state of outgoing neighbors 3 or 4.

To crystallize these concepts, we now look at a few examples that express graph algorithms in terms of the functions $g$, $f$, $h$ and the operator $\oplus$.

## Example: PageRank

The PageRank algorithm assigns each web page an importance score based on the number and importance of other web pages that link to it. Let $r_x$ denote the PageRank of a web page $x$. Then the PageRank of a web page A is defined by the formula:

$$r_A = 0.15 + 0.85 \sum_{\{x \rightarrow A\}} \frac{r_x}{n_x}$$

where the $n_x$ is the number of outgoing links from web page $x$ and the set $\{x \rightarrow A\}$ denotes the set of all web pages $x$ that have a link to $A$.

An iterative solution to the PageRank equation is easy to formulate. We initialize all ranks to some starting value. In each iteration, we update the ranks of all the pages using the defining equation and repeat until the ranks have converged.

Now lets see how we can express this in terms of the GAS abstraction. For each vertex $v$ in the graph, we associate two numbers, the rank $r_v$ and the number of outgoing links $n_v$. We define the Gather function $g$ by $$g(v, u, e) = \frac{r_u}{n_u}$$ We use the ordinary arithmetic sum for the operator $\oplus$: $$s_1 \oplus s_2 = s_1 + s_2$$ so that $$s_v = \sum_{u \rightarrow v} g(u, v, e) = \sum_{u \rightarrow v} \frac{r_u}{n_u}$$ and we let the apply function be $f(v, s_v) = 0.15 + 0.85 s_v$, i.e. the updated rank of $v$. Thus in the Apply phase, we are essentially computing: $$r_v' = f(v, s_v) = 0.15 + 0.85\sum_{u \rightarrow v} \frac{r_u}{n_u}$$ Note that for PageRank, the Scatter phase is a no-op.

(NOTE: show results for PageRank here)

Before continuing with further examples, lets take a moment to explore a couple of important concepts.

## Activation

In the version of PageRank described above, every vertex runs all three phases in every iteration until convergence. Depending on the initial values and local graph topology, different vertices reach their final value at different rates. Often after the first dozen or so iterations (this number depends on the graph), most of the vertices have converged and we spend a lot of time repeatedly recalculating the same rank for these vertices until the last unconverged vertex reaches its correct rank. This aspect is observed in a variety of algorithms and can be significantly more pronounced in some cases.

To improve the computational efficiency, we introduce the concept of activation. Every vertex is flagged active or inactive. The Gather, Apply and Scatter operations are only run for active vertices. To decide the activity of a vertex, in the Apply operation, we compare the new and old states. If there is a significant change, we activate all the outgoing neighbors of this vertex (note: not the vertex itself!) so that they may in turn update their state in the next iteration. Thus a vertex is active if and only if any one or more of its incoming neighbors decide it should be active. In the case of PageRank, all vertices are initially marked active and the algorithm has converged when there are no active vertices remaining.

## Synchronous Execution

The version of GAS described so far is called bulk synchronous. What this means is that in each iteration, the Gather phase must be completed for every single vertex before the apply phase is run for any vertex. That is, each vertex will only see the state of its neighborhood as it was in the previous iteration. Similarly, the Apply operation must be completed for every vertex before the Scatter phase begins. Activation also is similarly synchronized - in fact changes in a vertex's activity only affect it in the next iteration.

There are clearly other possibilities - for example, we could have asynchronous execution where each vertex state is immediately updated. Further, in a parallel execution environment, asynchronous execution can be done in a non-deterministic manner. In many cases asynchronous execution can reach convergence more quickly because partial changes are utilized within the same iteration. VertexAPI2 presently only allows bulk synchronous execution, but different asynchronous modes are being planned.

Now that we understand the concept of activation, we can easily write a breadth-first-search (BFS) through the entire graph, using the GAS abstraction. The state of a vertex $v$ is simply $d_v$, the BFS depth. At initialization we flag all vertices as inactive except the root vertex and set all their depths $d = -1$. The gather function $g$, the gather reduction operator $\oplus$ and the scatter function $h$ are all set to no-ops. In the Apply function, for each vertex $v$ we check if the depth $d_v$ is negative. If the depth $d_v$ is non-negative, we do nothing. Otherwise, we set the depth $d_v$ to the current iteration number and activate $v$'s outgoing neighbors. And that's it!

Let's trace out how this works. In the first iteration, only the root vertex is active. So the apply function is run only for the root vertex. In the Apply function, the root vertex's depth gets set to the iteration count of 0 and all of the root's outgoing neighbors are activated for the next iteration. The root itself is now inactive, unless one of the outgoing neighbors has an edge back to the root. In the next iteration the Apply function is run for these neighbors and they get to set their depth to the iteration count of 1 and activate their outgoing neighbors. Thus each iteration activates the next frontier of vertices. Any reactivation of a previously active vertex is acceptable, since we never modify the depth of a vertex more than once. At the end of the iterations, each vertex has its BFS depth (or a depth of -1 if there is no path from the root).

## Example: Single-Source Shortest Paths

The Single-Source Shortest Paths (SSSP) problem is closely related to BFS. In BFS, we found the minimum number of edges to traverse between the root and any given vertex. In the case of SSSP, each edge $e$ has an associated weight $w_e$. The distance of a path is the sum of weights of edges in that path. In SSSP, we want to find the minimum distance amongst all paths from the root to a given vertex $v$ and we want to do this for all vertices $v$ in the graph. Note that if the weights on each edge are identical, then solving SSSP is the same as the BFS traversal described earlier. In the following discussion we use the word "distance" to mean "distance from root".

So, how do we solve the SSSP problem? We define the state at each vertex to be the minimum currently known distance $d$. We initialize $d_v = \infty$ for all vertices except the root, for which we set $d_{\mathrm{root}} = 0$. We define the gather function $g$ by $$g(v, u, e) = w_e + d_u$$ $g$ represents the shortest currently known distance to $v$ on any path that goes through incoming neighbor $u$. Next we define the gather reduction operator by $$s = d_1 \oplus d_2 = \min(d_1, d_2)$$ Thus the Gather operation on a vertex $v$ picks out the minimum currently known distance $s_v$ to $v$ from root. Finally the Apply function is defined by $$d_v' = h(v, s_v) = \min(d_v, s_v)$$ i.e. as we iterate, at each vertex we keep track of the shortest distance we have discovered to that vertex so far. If $d_v' < d_v$, then we activate the outgoing neighbors of $v$ to let them know that a new shorter path has been discovered to $v$. At initialization we flag the outgoing neighbors of the root as active and the rest of the vertices as inactive and interate till there are no active vertices. Tracing through a simple example should convince the reader that this GAS formulation will yield the correct distance at each vertex and a formal proof is not particularly difficult.

## A Look Under The Hood

Now that we have some idea of how algorithms are expressed in the GAS abstraction, we take a quick look at how VertexAPI2 is implemented. This section is mainly of interest to GPU developers or those interested in parallel algorithms.

We focus first on the single GPU case. The vertex states and edge states of the graph are stored in contiguous arrays of whatever variable types the user specifies. For the rest of this description, we use the word "edge" to refer to the pair of source and destination vertex indices, as opposed to "edge state" which refers to whatever data we may have associated with the edge. Now lets walk through each of the three phases.

### Gather Phase

A serial implementation of the gather phase is very simple, expressed here in pseudo-code. Note: loop ranges are inclusive on the left and exclusive on the right.

0 s = [] 1 for i in 0..nVertices: 2 s[i] = 0 3 4 for e in edges: 5 if e.dst is active: 6 s[e.dst] = s[e.dst] $\oplus$ $g$(e.dst, e.src, e.state)

Writing the simple piece of code from line 4 through line 6 to run efficiently on a GPU is however no simple task! There are several reasons a naive approach does not work:

only a small fraction of edges may be active and unless we take care, most threads will do no work because of the condition on line 5.

multiple threads need to write to s[edge.dst] on line 6. A simple approach would need to use atomics, which are too slow. (how slow?)

To do the gather efficiently on the GPU, we use a number of tricks:

we sort the edges by edge.dst and store them in compressed sparse column format.

we maintain an explicit list of active vertices.

we use the Load Balanced Search primitive from the moderngpu library to map threads to active edges.

we use block segmented reduction to cut down a lot of traffic to main memory.

### Apply Phase

The serial pseudo-code for the Apply Phase is

0 for v in vertices: 1 if v is active: 2 v.state = $f$(v.state, s[v])

The apply phase is an $O(|V|)$ operation rather than an $O(|E|)$ operation, so there is much less work to do in this phase. Besides, it is trivial to parallelize on the GPU: one thread is mapped to each active vertex. There is nothing noteworthy about the implementation of the Apply Phase on the GPU.

### Scatter Phase

The scatter phase in serial pseudo-code is

0 for e in edges: 1 if e.src is active: 2 e.state = $h$(e.src, e.state)

Scatter shares one challenge with gather, which is that when there are few active edges, a naive implementation will have most threads doing no work. This is again circumvented by using the active vertex list, rather than working on all the edges in the graph. Unlike the gather phase, there is no contention between threads on line 2, which makes scatter somewhat simpler to implement than gather.

### Activation

The active vertex list is maintained using one of two algorithms depending on the current level of activity. When activity is very low, atomic instructions are used to directly build an active vertex list. When activity is moderate, we first build a $|V|$-long vector of flags indicating whether a vertex is active or inactive, followed by a compaction to build the active vertex list.

This completes the high level overview of VertexAPI2. To start working with VertexAPI2 for your own algorithms, please look at the User Guide.

## Acknowledgements

The VertexAPI2 project is funded by the DARPA XData program. The work is done in collaboration with the group of Dr. John D. Owens at UC Davis. Many thanks are owed to Joseph Gonzalez of the Graphlab project for his discussions about the GAS abstraction.