Pregel-A System for Large-Scale Graph Processing.pdf
《Pregel-A System for Large-Scale Graph Processing.pdf》由会员分享,可在线阅读,更多相关《Pregel-A System for Large-Scale Graph Processing.pdf(11页珍藏版)》请在得力文库 - 分享文档赚钱的网站上搜索。
1、Pregel: A System for Large-Scale Graph ProcessingGrzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,Naty Leiser, and Grzegorz CzajkowskiGoogle, Inc.malewicz,austern,ajcbik,dehnert,ilan,naty,ABSTRACTMany practical computing problems concern large graphs.Standard exampl
2、es include the Web graph and various so-cial networks. The scale of these graphsin some cases bil-lions of vertices, trillions of edgesposes challenges to theirefficient processing.In this paper we present a computa-tional model suitable for this task. Programs are expressedas a sequence of iteratio
3、ns, in each of which a vertex canreceive messages sent in the previous iteration, send mes-sages to other vertices, and modify its own state and that ofits outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set ofalgorithms. The model has been
4、 designed for efficient, scal-able and fault-tolerant implementation on clusters of thou-sands of commodity computers, and its implied synchronic-ity makes reasoning about programs easier.Distribution-related details are hidden behind an abstract API. The resultis a framework for processing large gr
5、aphs that is expressiveand easy to program.Categories and Subject DescriptorsD.1.3 Programming Techniques: Concurrent Program-mingDistributed programming; D.2.13 Software Engi-neering: Reusable SoftwareReusable librariesGeneral TermsDesign, AlgorithmsKeywordsDistributed computing, graph algorithms1.
6、INTRODUCTIONThe Internet made the Web graph a popular object ofanalysis and research. Web 2.0 fueled interest in social net-works. Other large graphsfor example induced by trans-portation routes, similarity of newspaper articles, paths ofPermission to make digital or hard copies of all or part of th
7、is work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requ
8、ires prior specificpermission and/or a fee.SIGMOD10, June 611, 2010, Indianapolis, Indiana, USA.Copyright 2010 ACM 978-1-4503-0032-2/10/06 .$10.00.disease outbreaks, or citation relationships among publishedscientific workhave been processed for decades. Frequentlyapplied algorithms include shortest
9、 paths computations, dif-ferent flavors of clustering, and variations on the page ranktheme. There are many other graph computing problemsof practical value, e.g., minimum cut and connected compo-nents.Efficient processing of large graphs is challenging. Graphalgorithms often exhibit poor locality o
10、f memory access, verylittle work per vertex, and a changing degree of parallelismover the course of execution 31, 39. Distribution over manymachines exacerbates the locality issue, and increases theprobability that a machine will fail during computation. De-spite the ubiquity of large graphs and the
11、ir commercial im-portance, we know of no scalable general-purpose systemfor implementing arbitrary graph algorithms over arbitrarygraph representations in a large-scale distributed environ-ment.Implementing an algorithm to process a large graph typ-ically means choosing among the following options:1
12、. Crafting a custom distributed infrastructure, typicallyrequiring a substantial implementation effort that mustbe repeated for each new algorithm or graph represen-tation.2. Relying on an existing distributed computing platform,often ill-suited for graph processing. MapReduce 14,for example, is a v
13、ery good fit for a wide array of large-scale computing problems.It is sometimes used tomine large graphs 11, 30, but this can lead to sub-optimal performance and usability issues. The basicmodels for processing data have been extended to fa-cilitate aggregation 41 and SQL-like queries 40, 47,but the
14、se extensions are usually not ideal for graph al-gorithms that often better fit a message passing model.3. Using a single-computer graph algorithm library, suchas BGL 43, LEDA 35, NetworkX 25, JDSL 20,Stanford GraphBase 29, or FGL 16, limiting thescale of problems that can be addressed.4. Using an e
15、xisting parallel graph system. The ParallelBGL 22 and CGMgraph 8 libraries address parallelgraph algorithms, but do not address fault toleranceor other issues that are important for very large scaledistributed systems.None of these alternatives fit our purposes. To address dis-tributed processing of
16、 large scale graphs, we built a scalable135and fault-tolerant platform with an API that is sufficientlyflexible to express arbitrary graph algorithms. This paperdescribes the resulting system, called Pregel1, and reportsour experience with it.The high-level organization of Pregel programs is inspire
17、dby Valiants Bulk Synchronous Parallel model 45. Pregelcomputations consist of a sequence of iterations, called su-persteps. During a superstep the framework invokes a user-defined function for each vertex, conceptually in parallel.The function specifies behavior at a single vertex V and asingle sup
18、erstep S. It can read messages sent to V in su-perstep S 1, send messages to other vertices that will bereceived at superstep S + 1, and modify the state of V andits outgoing edges. Messages are typically sent along outgo-ing edges, but a message may be sent to any vertex whoseidentifier is known.Th
19、e vertex-centric approach is reminiscent of MapReducein that users focus on a local action, processing each itemindependently, and the system composes these actions to liftcomputation to a large dataset. By design the model is wellsuited for distributed implementations: it doesnt exposeany mechanism
20、 for detecting order of execution within asuperstep, and all communication is from superstep S tosuperstep S + 1.The synchronicity of this model makes it easier to reasonabout program semantics when implementing algorithms,and ensures that Pregel programs are inherently free of dead-locks and data r
21、aces common in asynchronous systems. Inprinciple the performance of Pregel programs should be com-petitive with that of asynchronous systems given enoughparallel slack 28, 34. Because typical graph computationshave many more vertices than machines, one should be ableto balance the machine loads so t
22、hat the synchronizationbetween supersteps does not add excessive latency.The rest of the paper is structured as follows. Section 2describes the model. Section 3 describes its expression asa C+ API. Section 4 discusses implementation issues, in-cluding performance and fault tolerance. In Section 5 we
23、present several applications of this model to graph algorithmproblems, and in Section 6 we present performance results.Finally, we discuss related work and future directions.2.MODEL OF COMPUTATIONThe input to a Pregel computation is a directed graph inwhich each vertex is uniquely identified by a st
24、ring vertexidentifier. Each vertex is associated with a modifiable, userdefined value. The directed edges are associated with theirsource vertices, and each edge consists of a modifiable, userdefined value and a target vertex identifier.A typical Pregel computation consists of input, when thegraph i
25、s initialized, followed by a sequence of supersteps sep-arated by global synchronization points until the algorithmterminates, and finishing with output.Within each superstep the vertices compute in parallel,each executing the same user-defined function that expressesthe logic of a given algorithm.
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- Pregel System for Large-Scale Graph Processing Large Scale
链接地址:https://www.deliwenku.com/p-19246481.html
限制150内