Bryan Thompson

I will be giving a keynote presentation on Monday, March 31st at DESWeb 2014 in Chicago.  This is the 5th annual workshop on Data Engineering Meets the Semantic Web (DESWeb).  I will be talking about graphs as they are used in generic “Big Data” platforms, in RDF/SPARQL databases, and in graph mining and machine learning platforms.  One of my themes will be that these are really very different problems that require very different computational systems for efficient and scalable operation, and that some commonly used approaches are inherently not scalable.  I will also try to outline how these different classes of systems can be related to one another and what can be done to help integrate the RDF community with these other application areas.  Finally, I will touch on some recent advances in accelerated graph processing on GPUs.

Thanks,

Bryan

 

Today, NVIDIA announced that they are finally going to eliminate the major bottleneck to scaling for GPUs – the relatively low bandwidth from the GPU to large memory.

The GPU bandwidth to its own memory is very high at 288GB/s.  The CPU bandwidth to its own memory is only around 60 GB/s.  The problem is that the GPU has only 8-12 GB of fast local memory (DRAM).  If the GPU needs to access a larger working set, it has to fall back on the CPU memory over the PCIe bus, but the PCIe bandwidth is only 16GB/s.  This creates a huge problem for scaling data intensive algorithms.
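
To make the bandwidth gap concrete, here is a back-of-the-envelope sketch using the nominal numbers above (the figures are the ones quoted in this post, not measured values):

#include <cstdio>

int main() {
    // Nominal bandwidths quoted above, in GB/s.
    const double gpu_dram_bw = 288.0;  // GPU to its own device memory
    const double cpu_dram_bw = 60.0;   // CPU to host memory
    const double pcie_bw     = 16.0;   // GPU reaching host memory over PCIe

    const double data_gb = 64.0;       // a working set too big for 8-12 GB of device memory

    std::printf("time to scan %.0f GB once:\n", data_gb);
    std::printf("  from GPU device memory: %5.2f s\n", data_gb / gpu_dram_bw);
    std::printf("  from CPU host memory:   %5.2f s\n", data_gb / cpu_dram_bw);
    std::printf("  from GPU over PCIe:     %5.2f s\n", data_gb / pcie_bw);
    return 0;
}

Once the working set spills out of device memory, the GPU touches it roughly 4x slower than the CPU does, which is exactly why the two announcements below matter.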

NVIDIA made two announcements today that will completely change this situation by 2016.  These are:

  • NVLINK: will deliver a 5x – 12x increase in bandwidth for machines with multiple GPUs.  Even at 5x, that 16GB/s turns into 80 GB/s across the GPUs in the same host.  If they can also give us access to the CPU memory at that bandwidth, then the GPU will finally be on a level playing field with the CPU for data intensive applications.
  • 3D stacked memory: this is the huge win.  The capacity and memory bandwidth are going to jump through the roof.  Pascal has 24GB of device local RAM with up to 1000GB/s of memory bandwidth.  This will completely change the playing field for data intensive applications.

Pascal will be released in 2016.  It will include NVLINK and 3D stacked memory and will be only one third the size of a standard PCIe board!

 

We are pleased to announce the v3 release of MPGraph. The MPGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MPGraph’s CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.
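
To give a feel for the programming model, here is a minimal CPU-only sketch of a vertex program in the GAS style. The functor names and signatures are hypothetical – this is not the MPGraph API, just the shape of the abstraction, shown for page rank:

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// A hypothetical Gather-Apply-Scatter (GAS) vertex program, in the spirit of
// GraphLab and MPGraph. Illustrative only; not MPGraph's actual interface.
struct PageRank {
    // Gather: accumulate rank/out-degree over a vertex's in-edges.
    static double gather(double src_rank, int src_out_degree) {
        return src_rank / src_out_degree;
    }
    // Apply: update the vertex state from the gathered sum.
    static double apply(double gathered_sum) {
        return 0.15 + 0.85 * gathered_sum;
    }
};

int main() {
    // Tiny directed graph as an edge list: {src, dst}.
    std::vector<std::pair<int,int>> edges = {{0,1},{0,2},{1,2},{2,0},{3,2}};
    const int n = 4;
    std::vector<double> rank(n, 1.0), next(n);
    std::vector<int> out_deg(n, 0);
    for (auto& e : edges) out_deg[e.first]++;

    for (int iter = 0; iter < 20; ++iter) {
        std::fill(next.begin(), next.end(), 0.0);
        for (auto& e : edges)                    // gather phase over in-edges
            next[e.second] += PageRank::gather(rank[e.first], out_deg[e.first]);
        for (int v = 0; v < n; ++v)              // apply phase per vertex
            rank[v] = PageRank::apply(next[v]);
    }
    for (int v = 0; v < n; ++v) std::printf("rank[%d] = %.4f\n", v, rank[v]);
    return 0;
}

The point of the abstraction is that the user writes only the gather/apply (and scatter) logic; the engine owns the parallel iteration, which is where the CUDA kernel strategies come in.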

The v3 release includes a 5x – 10x performance gain in algorithms that have large frontiers (Connected Components, Page Rank, etc.). This performance gain is obtained by using a different strategy to load balance the GPU when the frontier is large. This strategy has more overhead for small frontiers, but outperforms the existing kernels when the frontier becomes large.  MPGraph automatically chooses the best strategy for each iteration of the computation.
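
The per-iteration dispatch itself is simple. The sketch below is a hypothetical illustration of the idea (the threshold and function names are ours, not MPGraph internals):

#include <cstddef>
#include <cstdio>

// Hypothetical per-iteration kernel selection based on frontier size.
static void expand_small_frontier(std::size_t n) { std::printf("small-frontier kernel, |F|=%zu\n", n); }
static void expand_large_frontier(std::size_t n) { std::printf("large-frontier kernel, |F|=%zu\n", n); }

static void expand(std::size_t frontier_size, std::size_t num_vertices) {
    // Use the load-balanced kernel once the frontier covers a large
    // fraction of the graph; its fixed overhead is then amortized.
    if (frontier_size > num_vertices / 20)
        expand_large_frontier(frontier_size);
    else
        expand_small_frontier(frontier_size);
}

int main() {
    expand(1000, 1000000);    // early BFS iteration: small frontier
    expand(400000, 1000000);  // peak iteration: large frontier
    return 0;
}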

Download MPGraph v3 from SourceForge now. Or you can get the latest development version from SVN:

svn checkout svn://svn.code.sf.net/p/mpgraph/code/trunk

Our near term goals are to increase the data density on the GPU and support multi-GPU computations.  Topology compression will stretch the resources of a single GPU, providing support for graphs with up to 1 billion edges. Increased data density will also work in our favor as we move into multi-GPU support.
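
Topology compression schemes are typically built from a standard trick: store each adjacency list sorted, keep only the gaps between neighbor identifiers, and varint-encode the gaps. The sketch below illustrates that building block (it is illustrative only, not MPGraph's actual on-GPU encoding):

#include <cstdint>
#include <cstdio>
#include <vector>

// Varint-encode a value: 7 bits per byte, high bit marks continuation.
static void varint_encode(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) { out.push_back(uint8_t(v) | 0x80); v >>= 7; }
    out.push_back(uint8_t(v));
}

int main() {
    // One sorted adjacency list (a CSR row) for some vertex.
    std::vector<uint64_t> neighbors = {5, 9, 12, 40, 1000, 1003};
    std::vector<uint8_t> bytes;
    uint64_t prev = 0;
    for (uint64_t v : neighbors) {       // gaps are small, so each encodes
        varint_encode(v - prev, bytes);  // in 1-2 bytes instead of 8
        prev = v;
    }
    std::printf("%zu neighbors: %zu bytes compressed vs %zu uncompressed\n",
                neighbors.size(), bytes.size(), neighbors.size() * sizeof(uint64_t));
    return 0;
}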

You can learn more about MPGraph at GTC next week.  We will be presenting on Monday the 24th in San Jose.

The goal of this session is to demonstrate how our high level abstraction enables developers to quickly develop high performance graph analytics programs on GPUs with up to 3 billion edges traversed per second on a Tesla or Kepler GPU. High performance graph analytics are critical for a large range of application domains. The SIMT architecture of the GPU and the irregular nature of graphs make it difficult to develop efficient graph analytics programs. In this session, we present an open source library that provides a high level abstraction for efficient graph analytics with minimal coding effort. We use several specific examples to show how to use our abstraction to implement efficient graph analytics in a matter of hours.

We will be presenting new results for MPGraph v3.  These results include significant speedups for problems with very large frontiers.

For more information about the GPU Technology Conference, see http://www.gputechconf.com/page/home.html.  For more information about the MPGraph presentation, see http://registration.gputechconf.com/quicklink/b1cyGlI.  For more information about MPGraph, see http://sourceforge.net/projects/mpgraph/ and http://www.systap.com/mpgraph/api/html/index.html.

 

In this post, I will go into why there is a “big graph anti-pattern”, the fundamentally different kinds of graph processing, how to match the technology to the problem, and some successful patterns for scalable graph processing.

The big graph anti-pattern is “Throw everything into a big graph and then use the same tools that gave us horizontal scaling for other problems: map/reduce and key-value stores.”

There are several fallacies here. First, there are many types of graph processing and you cannot use the same architecture to scale all of them.  This is the main focus of this posting and I will go into detail on why this does not work below. Second, the advantage of throwing the data together is that you can move on to finding information immediately. However, throwing the data together does not eliminate the schema alignment problem.  It just lets you choose when you are going to deal with it and how much effort you will put into it. You still need to understand the data and the analytics to interpret either one. Lastly, if you are using statistical models, then you can run into problems if there are too many variables and your model winds up lacking predictive power.  You are better off focusing on the information that you need to make a specific decision rather than allowing statistical algorithms to go off on a fishing expedition.

There are some fundamental architectural differences between systems for high performance graph traversal and graph analytics, systems for high performance graph pattern matching, map/reduce platforms, and key-value stores. If you only test at a small data scale, the scaling properties of these different architectures are not as evident and your benchmarking will fail to predict the actual performance characteristics of these technologies on larger graphs. As the data scale increases, the differences become significant and determine what does and does not scale.

The only way to get scaling and high throughput for graph traversal and graph mining is to get the architecture, the software, and the hardware right. If you make the wrong choices, the communications costs change from O(N) (using a 2D partitioning) to O(N^2) (the best case using any other approach). These problems are bandwidth limited, so you can’t just throw CPU cores and main memory at them. Efficient parallel graph algorithms are hard, and good implementations will bottleneck at the CPU memory bus – beyond that you get negative scaling: your system just slows down as you add more CPU cores. Another rarely appreciated bottleneck is the high cost of sequential code. Any sequential code in your algorithm, including between iterations, has a huge negative impact on throughput since all the other cores are sitting idle. This is especially true for graphs with a high diameter, such as bitcoin transaction data or road maps.
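
The cost of sequential code follows directly from Amdahl's law: with sequential fraction s and p cores, the best possible speedup is 1 / (s + (1 - s) / p). A quick calculation shows how brutal this is:

#include <cstdio>
#include <initializer_list>

// Amdahl's law: even a small sequential fraction caps the achievable speedup.
static double amdahl(double s, double p) { return 1.0 / (s + (1.0 - s) / p); }

int main() {
    for (double s : {0.0, 0.01, 0.05, 0.10})
        std::printf("sequential fraction %4.2f -> speedup on 64 cores = %5.1fx\n",
                    s, amdahl(s, 64));
    return 0;
}

With just 5% sequential code, 64 cores deliver roughly a 15x speedup, not 64x, and every extra iteration boundary adds to that sequential fraction.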

Kinds of graph processing. There are several very different types of “graph” operations:

  • Gathered reads for property and link set retrieval. This is a parallel workload for the random retrieval of attribute sets from known vertices and edges of interest. This is a good fit for a key-value store.
  • Graph traversal algorithms, such as BFS (breadth first search) and SSSP (single source shortest paths). The workload for these algorithms requires visiting each vertex in the graph at least once through a series of iterative 1-hop expansions over the graph (see the BFS sketch after this list). The size of the frontier (the vertices to be visited in the next round) often grows exponentially before eventually decaying, so the solution must be able to handle both small and large frontiers efficiently. The number of iterations depends strongly on the diameter of the graph. Social networks may have a diameter of 6. Road networks or bitcoin transaction data can have a diameter of 10,000. This class of algorithms often does very little work per edge and vertex visited, so it places an extreme burden on the memory bus.  This extreme workload is why BFS is used for the Graph 500. Disk is not a good fit here.
  • Graph analytic algorithms, such as page rank, k-means, etc. These algorithms all have a workload that requires multiple full visitations of the entire graph. The frontier starts out with all vertices and then slowly drops towards zero as the algorithm converges. You have to read all the data, multiple times. Again, disk is not a good fit here. You need memory bandwidth. Lots of memory bandwidth.  More than you can get from a CPU.
  • Graph query (aka graph pattern matching). This involves matching specific patterns among the vertices, edges, and their labels in the graph using a high level query language such as SPARQL or Cypher (neo4j’s graph query language). Performance here depends on (a) a good query optimizer to reorder the joins in order to minimize the required effort; (b) propagating constraints from one join to the next in order to read less data and solve the query more rapidly; and (c) a good choice of indices (good graph databases do not make you guess at what indices you need; they handle this automatically). This is one case where disk is a good fit. Index data structures can be used to rapidly identify a subset of the data to be read, but memory bandwidth is still a bottleneck for databases – this is the whole reason behind the emergence of column-wise storage (something that we will introduce into bigdata this year).
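
Here is a minimal single-threaded sketch of the frontier-based traversal pattern referenced above (a level-synchronous BFS over a toy CSR graph):

#include <cstdio>
#include <vector>

// Level-synchronous BFS: one iteration per 1-hop frontier expansion.
int main() {
    // CSR adjacency: vertex v's out-neighbors are col[off[v] .. off[v+1]).
    std::vector<int> off = {0, 2, 4, 5, 6, 6};
    std::vector<int> col = {1, 2, 3, 4, 4, 4};
    const int n = 5, source = 0;

    std::vector<int> depth(n, -1);
    std::vector<int> frontier = {source}, next;
    depth[source] = 0;

    for (int level = 1; !frontier.empty(); ++level) {
        next.clear();
        for (int u : frontier)                        // expand frontier by one hop
            for (int i = off[u]; i < off[u + 1]; ++i) {
                int v = col[i];
                if (depth[v] < 0) {                   // first visit wins
                    depth[v] = level;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                          // frontier grows, then decays
    }
    for (int v = 0; v < n; ++v) std::printf("depth[%d] = %d\n", v, depth[v]);
    return 0;
}

The loop runs once per level, which is why the graph's diameter sets the iteration count: 6 or so for a social network, 10,000 for a road network or bitcoin transaction data.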

Graph databases (or at least high level query against a graph database) and graph mining systems have fundamentally different workloads and require different techniques. When it comes to gathered property set retrieval, Accumulo, Cassandra, and related key-value stores are all very similar technologies.  Property set lookups can be parallelized against any of them once the desired vertex set is known, so choose whatever works for you. However, key-value stores are not able to provide efficient graph query or efficient graph mining/traversal. I will try to explain why below. First, I will focus on why they cannot be used to create scalable graph traversal and graph analytic solutions.

Graph traversal. For graph traversal, you need forward and reverse indices to follow links. A graph database normally carries at least a forward and reverse index and can therefore be used for graph traversal, but this is not efficient and it is not scalable.

There are a few problems. First, graph traversal algorithms typically need to visit large parts of the graph in multiple iterations. This is not efficient against disk if there is any random access. (GraphChi is an example of an IO efficient solution against disk. However, it must read the entire graph in each iteration. While it scales well on a single machine, it cannot be used for low latency operations, is a poor choice for algorithms such as BFS or SSSP, and it does not have the throughput of a main memory or GPU based solution.) Second, the approach to horizontal scaling for graph traversal must be based on what is variously called vertex cuts (GraphLab uses this nomenclature) or a 2D decomposition (this is the terminology in the HPC space, which has been doing this for years for sparse matrix-vector operations, which are very similar to graph operations).

In a 2D decomposition, access to the links of the graph is decomposed against a 2-dimensional compute grid over virtual nodes. The vertices are organized into the rows and columns of the compute topology by using the same partitions for both dimensions. The rows provide access to the out-edges of the graph. The columns provide access to the in-edges of the graph. The diagonal consists of those edges whose source and target vertices fall into the same partition. Using a naïve partitioning strategy, the vertices are assigned to partitions by dividing the vertex identifier by the number of rows/columns in the compute grid. Graph aware partitioning can be used to increase the local density and interconnectedness of those partitions by what amounts to relabeling the vertex identifiers.
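
As a concrete illustration of the naïve strategy, the sketch below (hypothetical grid size and vertex count) places an edge (u, v) on the p x p compute grid at cell (partition(u), partition(v)):

#include <cstdio>

// Naive 2D (vertex-cut) placement: split the vertex ids into p contiguous
// partitions and place edge (u, v) on grid cell (partition(u), partition(v)).
// A gather over the in-edges of partition j only touches the p nodes in
// column j; a scatter over the out-edges of partition i only touches row i.
int main() {
    const int p = 4;               // a p x p compute grid (16 nodes total)
    const long n = 1000000;        // number of vertices (illustrative)
    const long per_part = (n + p - 1) / p;

    auto partition = [&](long v) { return int(v / per_part); };

    long u = 123456, v = 987654;   // an example edge (u, v)
    std::printf("edge (%ld, %ld) -> grid cell (row %d, col %d)\n",
                u, v, partition(u), partition(v));
    std::printf("gather for partition %d reads column %d: %d of %d nodes\n",
                partition(v), partition(v), p, p * p);
    return 0;
}

Each gather or scatter involves only one row or column of the grid, which is where the O(N) rather than O(N^2) communication pattern comes from.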

To gather the in edges for all vertices in the frontier whose vertex identifier falls into a given partition, the operation is decomposed into a local operation on each compute node in the corresponding column of the 2D compute topology.  The intermediate per-compute node results are then aggregated back to the compute node processing that partition. This parallelizes the effort across the column.  The scatter over the out edges is pretty much the same, but the operation is parallelized over a row of the 2D compute topology.  GraphLab and HPC sparse matrix vector multiplication systems all use this approach. So does the system that took first place in the Graph 500.

A 2D decomposition addresses two basic problems.  The first is co-location of the link weights in the forward and reverse traversal directions.  This is vital for algorithms that modify the link weights during traversal.  Without a 2D decomposition, link updates are always random since they can only be 1:1 with one of the indices.  On the other index they have a random access pattern, which causes a severe bottleneck.  The second is communication volume: a 2D decomposition minimizes the communication volume by decomposing the gather and scatter phases over a row or column of the 2D compute grid. Compute nodes outside of the row (scatter) or column (gather) do not participate in that operation.  This means that the communication pattern is both efficiently parallelized and regular.

Graph databases and Blueprints cannot scale well for graph traversal or graph analytics.  First, graph databases need to minimize the data read for efficient high level query, so they use a very different data partitioning scheme (not 2D).  While they can be used for graph traversal algorithms on small data sets, the approach is simply not scalable for the reasons outlined above, e.g., O(N^2) communications plus the high latency associated with the disk.  Client-based graph traversal APIs, such as Blueprints, face another problem as well – the client is doing round trips with the server/cluster.  This is not only a throughput bottleneck, but it also limits the size of the frontier and computation state to the memory of the client.

You might ask, can’t I get away with using a graph database and Blueprints?  Not if you want your solution to scale.  The promise of simple scaling that we have for key-value stores and map/reduce simply does not hold for graphs.  You must be using the right technology to get beyond toy problems.  How can you tell if you are using the right technology?  Look at the data layout – if it is a key-value store, it will not scale for graph traversal or graph analytics.  Look at the graph computation – if it is guided by a client API such as Blueprints, it will not scale.  Look at the platform – if it has huge latencies for each iteration, such as Hadoop, it will not scale to graphs with large diameters (with one minute of overhead per map/reduce job and 8,000 iterations for BFS on bitcoin, you would have to wait 8,000 minutes just for the job scheduling overhead on Hadoop – MPGraph does the entire computation in 300ms).

Graph query. There are a few reasons why the graph traversal and graph analytics architectures do not perform well for graph query.  The main reason is that each join needs to be distributed across a row or column of the 2D compute topology.  While this is efficient for graph traversal, it is not efficient for low latency graph pattern matching queries.

The way to make graph pattern matching queries fast is to optimize the join ordering (the most selective access path is run first), read as little data as possible, and then feed constraints from that join into the remaining joins. This can be done either by passing along intermediate solutions containing variable bindings discovered in the data and using nested indexed joins (bigdata does this) or through sideways information passing that lets you skip parts of an access path that are provably unable to join (RDF3X does this).  Either way, you execute the joins in order of their selectivity and pass along constraints that allow you to avoid reading most of the data.  This is how to achieve low latency for high level declarative graph query languages.  This is also why people have such difficulties building scalable high level query solutions over existing key-value stores.  These architectures do not provide ways to constrain the access paths based on the data already read in previous joins.  As a result, they wind up sending all the data from each access path back to the client, which is then forced to do the join locally in memory on the client.  This approach reads way too much data, slams the network, and slams the client. For example, this is why Rya cannot scale for graph query.  Accumulo does not let Rya flow the query over the data.
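
The sketch below illustrates the binding-propagation idea with a toy in-memory "index" (the data, the query, and the index layout are hypothetical; this shows the strategy, not bigdata's query engine). The query is: ?x knows ?y . ?y worksAt acme.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Index nested-loop join with binding propagation. The most selective
// pattern (?y worksAt acme) runs first; its bindings for ?y constrain the
// index lookup for the second pattern, so the full 'knows' edge set is
// never scanned or shipped anywhere.
int main() {
    // A POS-style index: predicate -> object -> subjects.
    std::map<std::string, std::map<std::string, std::vector<std::string>>> pos = {
        {"worksAt", {{"acme", {"bob"}}}},
        {"knows",   {{"bob", {"alice", "carol"}}, {"dave", {"eve"}}}},
    };

    // Join 1 (most selective): ?y worksAt acme -> bindings for ?y.
    for (const auto& y : pos["worksAt"]["acme"]) {
        // Join 2: ?x knows ?y, constrained to the ?y bindings already found.
        for (const auto& x : pos["knows"][y]) {
            std::printf("?x = %s, ?y = %s\n", x.c_str(), y.c_str());
        }
    }
    return 0;
}

Because the selective pattern runs first, the second index lookup touches only the subjects that can actually join, instead of shipping the entire 'knows' access path back to the client for an in-memory join.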

The 2D approach is not well suited to graph query because you have to read on all compute nodes in a row/column of the 2D compute topology to access the link set for a vertex.  In contrast, that data is co-located in the forward and reverse indices of a graph database.  This allows less inter-node communication for graph query access paths.

What if we had a hybrid system that maintained both the forward and reverse indices and the 2D layout so it could answer both low latency graph queries and provide efficient and scalable graph traversal? So far I have not seen any architectures that maintain both kinds of indices, but this could be interesting. There might be high data volume queries (queries where we intrinsically need to read a lot of data, such as rollups over the entire graph) where we could accelerate the query using the 2D partitions.

When should you scale-out a graph database? A graph database with a decent query optimizer should be able to handle upwards of 10-50 billion edges on a single machine and provide low latency query.  The main enabling points are a good query optimizer and fast disk (SSD or PCIe flash), since graph query will result in random read IO patterns on the disk. The random IO pattern occurs because the index pages are not laid out in key order on the disk.  Bigdata also provides a high-9s open source deployment with linear scaling in query throughput as a function of the size of the replication cluster.

Horizontally scaled graph databases have inter-node communication overhead and are slower for most low-latency queries.  The main reason to scale-out a graph database is because you need throughput for data load (billions of edges per hour) and you need to run queries that do rollups over all that data. If you can partition your graph along lines that make sense for your business such that most queries run inside of a single partition, you can often get much higher performance from a pool of graph databases, each servicing a different partition of the data. When necessary, you can use federated query to read across those partitions (bigdata builds in support for federated query).

We develop two complementary kinds of open source graph technologies that target fundamentally different kinds of graph problems. One project (MPGraph) provides graph traversal and graph analytics on GPUs.  The other (bigdata) is a graph database that supports graph pattern matching using a high level query language.  The GPU approach represents the best known technique for high performance graph traversal and graph analytics and outperforms main-memory CPU solutions on machines with up to 24 cores by 5x to 500x. The GPU currently requires a compile-time schema for the property set and link set.  Our road map for that technology includes adding topology compression (up to 1 billion edges on a single card), column-wise compression of schema flexible property and link sets, and 2D decomposition onto multiple GPUs for graphs with more than 1B edges.

The bigdata graph database uses multiple indices to avoid bias in the access paths, an efficient representation of property sets, link sets, and link attributes, and a high-level query language paired with a query optimizer for efficient low-latency high level query.  There is also a vertex-centric API for the bigdata graph database.  Throughput of the vertex-centric API against the graph database over SSD is less than 1M traversed edges per second, and it has the same in-memory limits on the problem size identified above (compare this to 3 billion traversed edges per second on a GPU and you can see why we have two different approaches to graphs!).  We are working to improve graph traversal throughput on the graph database using column-wise indexing to reduce IO and the CPU overhead associated with materializing edges from the index, but there is still a performance gap of several orders of magnitude between a disk/index based graph mining API and a graph mining API running on a GPU.   This performance gap is intrinsic.  It reflects the difference in the bandwidth of disk, main memory, and the memory bandwidth of the GPU, which is 10x greater than the memory bandwidth of the CPU.

Match the technology to the problem:

  • Key-value stores are good at property and link set retrieval since they can perform the gathered reads efficiently and in parallel.
  • Memory bandwidth is the bottleneck for graph traversal and graph analytic algorithms.  MPGraph excels here since the DRAM on the GPU is 10x faster than the CPU RAM.  Touching the disk is a huge penalty for graph traversal, and even IO efficient approaches are bandwidth limited by the sequential transfer rate of the disk rather than the bandwidth of main memory or GPU device memory.
  • High performance for graph query requires good query optimizers and flowing the query across the cluster (for scale-out).  Bigdata has two different query optimizers: one which emphasizes low latency queries, where the overhead of the query optimizer itself could otherwise limit throughput, and one which uses deep sampling of the query against the data to find the minimum cost query plan for long running, data intensive queries.  Constraints are propagated from join to join by flowing the intermediate solutions to each join in turn. Bigdata reads less data from the disk to answer a query because variable bindings discovered in earlier joins restrict the access paths for later joins. Due to non-locality, graph query typically turns into random IOs against the disk, so you always want to deploy the graph database over SSD or PCIe flash for fast high level query. Don’t scale-out a graph database unless you need it.  Single machine or replication cluster deployments handle large graphs (50B edges) and can deliver low-latency query. Scale-out deployments build in more coordination overhead and should be undertaken only after careful examination of your requirements.  Graphs do not enjoy the same simple scaling model as key-value stores and map/reduce.  You have to use the right technology for your problem.

What about YarcData? The Cray XMT and XMT2 architecture marketed and sold by YarcData as the “Urika” appliance is worth discussing in some depth (Cray and YarcData are trademarks of Cray, Inc.).  The basis of the XMT architecture is a set of slow “stream” processors with zero cost latency switching. The XMT architecture scales by keeping a lot of memory transactions in flight and then switching between those memory transactions with zero latency as they arrive in the queue for a stream processor. The concept for the XMT is to not worry about where the data is stored, but just make sure that you can keep those slow stream processors busy by having enough parallel work and moving the data around over the fast interconnect.  For those who want to lay down the cash on a YarcData appliance (rumored to be well north of $1M), this is a good and scalable solution.  However, it is instructive to compare the YarcData solution with a GPU.  MPGraph running on a GPU at 3 billion edges per second has much more throughput than the YarcData appliance. In fact, the challenge with GPUs is that they are so fast that it is difficult to keep them busy – this is the opposite of the YarcData problem. There are two ways to make this work out in favor of the GPU, and we are pursuing both of them.  First, if we apply topology compression to the data, we can get nearly ten times the number of edges into a single GPU.  That means instead of 100M edges at the speed of light, we will have nearly 1B edges at the speed of light on a single card.  This is more than enough for most problems for under $7k in hardware (if you buy the expensive K20 cards rather than the gamer cards), or you can rent it on EC2 for ~$400/month.  Second, if we keep the edges of the graph in DRAM, then we can put nearly 60 billion edges (with topology compression) into a GPU compute cluster on EC2. Rather than sending patches of the graph to the GPU, we can keep it all in DRAM and just send vertex state and frontier updates over the PCIe bus and the cluster interconnect.  To my mind, the ORNL Titan supercomputer (also built by Cray) with 18,000+ NVIDIA Kepler GPUs is the ultimate graph processing machine, and it uses commodity hardware and GPUs.

Conclusion: Always keep in mind the bottleneck, which is either disk (graph query) or RAM (graph traversal and graph analytics). While these graph problems may look similar from the outside, they have completely different computational workloads and scaling requirements and must be addressed using different kinds of technologies.  Avoid the “one big graph” anti-pattern and deploy a mixture of technologies that address the different workloads and computations that you need for your application.  You may want to put everything into a key-value store for fast gathered property set retrieval, but you can’t query or traverse it efficiently there.  Fast query requires a graph database with a high performance query optimizer and query engine.  Fast graph traversal and graph analytics require fast memory and efficient parallelism. Think about how to partition the workload and the data to provide fast and scalable solutions.  For example, could you put a social network topology entirely onto a GPU, run your analytics there, and then do gathered reads against a key-value store or graph database?  This could give you full depth graph analytics over your social network in a fraction of a second, after which you can materialize the data you need from that key-value store you already have.

 

SYSTAP has been awarded a new contract to extend the MPGraph platform from a single GPU to GPU compute clusters in partnership with Dr. Martin Berzins and the Scientific Computing and Imaging Institute of the University of Utah.  Our initial target is a cluster of 64 Kepler GPUs. This cluster can hold between 6B and 60B edges in fast DRAM, depending on whether or not we are applying topology compression. Our goal is to maintain the ultra high performance of the single GPU solution on large compute clusters.

EC2 makes GPUs trivial to access. To test out MPGraph, you can launch an EC2 g2.2xlarge instance type. The G2 instances provide access to NVIDIA GRID GPUs (“Kepler” GK104) each with 1,536 CUDA cores and 4GB of video memory. Last we checked, this instance type runs ~ $468/month.

Once your instance is running, you can do any of:

- Install SVN and checkout MPGraph: svn checkout svn://svn.code.sf.net/p/mpgraph/code/ mpgraph-code
- download an MPGraph release
- download an MPGraph snapshot

Change into the mpgraph directory and type:

make all

This will build the code. The README includes instructions for getting started, including a simple graph analytic that you can use to verify your install.

Enjoy!

I will be speaking about our open source graph database and GPU graph mining platforms. Highlights will include the recently released support for highly available replication clusters as well as our recent work with accelerated graph processing on GPUs at 3 billion traversed edges per second.

6:30 PM, Tuesday, March 18, 2014
Xcelerate Solutions
8405 Greensboro Dr., Suite 930
McLean, VA 22102

It has been gratifying to see the level of interest in MPGraph, judging by downloads alone, since our v2 release. The v3 release should be out next month and will include 5x – 10x performance gains in algorithms that have large frontiers (Connected Components, Page Rank, etc.). This is done using a different kernel that solves a thread scheduling problem and turns the computation into an embarrassingly parallel problem. This technique has more overhead for small frontiers, but outperforms the existing kernels when the frontier becomes large.

Our short term goal beyond that is to increase the data density on the GPU. This will stretch the resources of a single GPU, providing support for graphs with up to 1 billion edges. Increased data density will also work in our favor as we move into multi-GPU support.

We will be talking about MPGraph at CSHALS in Boston next week (2/26/2014) and at GTC in San Jose on March 24th (10am, in the DARPA/XDATA track).

We will also be talking about the new HA replication cluster at CSHALS.

We are undergoing a mandatory migration to the SourceForge Allura platform. As part of this migration, the links for the issue tracker, MediaWiki, and SVN will all change. SourceForge will continue to host the bigdata (and MPGraph) projects, but starting today we will be hosting the issue tracker and MediaWiki ourselves. This decision was made to preserve the existing features and data in the Trac and MediaWiki instances.

The new URLs are:

We have received several questions about the roadmap for MPGraph and bigdata. We are working on a simple integration pattern now. This integration will be based on the export of a topology projection onto the GPU and will support ultra fast graph mining as a SPARQL SERVICE against the exported topology. We are also developing a similar SERVICE for graph mining on typed graphs over the database indices. Eventually these will converge.

The road map for MPGraph includes topology compression to host large graphs on the GPU (up to 2B edges on a single K40 GPU), column-wise compression to capture schema flexible vertex and link property sets, and a multi-GPU implementation that will target GPU clusters (including EC2).

Schema flexible property and link sets will allow us to capture the RDF data model on the GPU. At that point we can also implement the world’s fastest SPARQL end point at 3 billion edges per second. The GPU is the world’s fastest single node graph processing platform – much faster than the Cray XMT or Urika appliance.

In the scale-out architecture, we plan to support GPU accelerated SPARQL and GPU accelerated graph mining using a dedicated 2D decomposition of the graph (vertex cuts). The 2D index decomposition will exist alongside the existing (1D) scale-out indices for SPARQL. This is because you need different index partitioning strategies for high performance on selective SPARQL queries (1D partitioning) and graph mining operations (2D partitioning). The additional index will provide a 2D partitioning of the edges of the graph.

So, look for an initial integration between MPGraph and bigdata in Q1. We will continue to do MPGraph releases with new capabilities on a regular basis. The next MPGraph release will include a very substantial speedup for algorithms with large frontiers, such as page rank and connected components.