Pregel: A System for Large-Scale Graph Processing

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski
Google, Inc.
malewicz,austern,ajcbik,dehnert,ilan,naty,

ABSTRACT

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs (in some cases billions of vertices, trillions of edges) poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming: Distributed programming; D.2.13 [Software Engineering]: Reusable Software: Reusable libraries

General Terms
Design, Algorithms

Keywords
Distributed computing, graph algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA.
Copyright 2010 ACM 978-1-4503-0032-2/10/06 ...$10.00.

1. INTRODUCTION

The Internet made the Web graph a popular object of analysis and research. Web 2.0 fueled interest in social networks. Other large graphs (for example induced by transportation routes, similarity of newspaper articles, paths of disease outbreaks, or citation relationships among published scientific work) have been processed for decades. Frequently applied algorithms include shortest
paths computations, different flavors of clustering, and variations on the page rank theme. There are many other graph computing problems of practical value, e.g., minimum cut and connected components.

Efficient processing of large graphs is challenging. Graph algorithms often exhibit poor locality of memory access, very little work per vertex, and a changing degree of parallelism over the course of execution [31, 39]. Distribution over many machines exacerbates the locality issue, and increases the probability that a machine will fail during computation. Despite the ubiquity of large graphs and their commercial importance, we know of no scalable general-purpose system for implementing arbitrary graph algorithms over arbitrary graph representations in a large-scale distributed environment.

Implementing an algorithm to process a large graph typically means choosing among the following options:

1. Crafting a custom distributed infrastructure, typically requiring a substantial implementation effort that must be repeated for each new algorithm or graph representation.

2. Relying on an existing distributed computing platform, often ill-suited for graph processing. MapReduce [14], for example, is a very good fit for a wide array of large-scale computing problems. It is sometimes used to mine large graphs [11, 30], but this can lead to sub-optimal performance and usability issues. The basic models for processing data have been extended to facilitate aggregation [41] and SQL-like queries [40, 47], but these extensions are usually not ideal for graph algorithms that often better fit a message passing model.

3. Using a single-computer graph algorithm library, such as BGL [43], LEDA [35], NetworkX [25], JDSL [20], Stanford GraphBase [29], or FGL [16], limiting the scale of problems that can be addressed.

4. Using an existing parallel graph system. The Parallel BGL [22] and CGMgraph [8] libraries address parallel graph algorithms, but do not address fault tolerance or other issues that are important for very large scale distributed systems.

None of these alternatives fit our purposes. To address distributed processing of
large scale graphs, we built a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms. This paper describes the resulting system, called Pregel¹, and reports our experience with it.

The high-level organization of Pregel programs is inspired by Valiant's Bulk Synchronous Parallel model [45]. Pregel computations consist of a sequence of iterations, called supersteps. During a superstep the framework invokes a user-defined function for each vertex, conceptually in parallel. The function specifies behavior at a single vertex V and a single superstep S. It can read messages sent to V in superstep S - 1, send messages to other vertices that will be received at superstep S + 1, and modify the state of V and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known.

The vertex-centric approach is reminiscent of MapReduce in that users focus on a local action, processing each item independently, and the system composes these actions to lift computation to a large dataset. By design the model is well suited for distributed implementations: it doesn't expose any mechanism for detecting order of execution within a superstep, and all communication is from superstep S to superstep S + 1.

The synchronicity of this model makes it easier to reason about program semantics when implementing algorithms, and ensures that Pregel programs are inherently free of deadlocks and data races common in asynchronous systems. In principle the performance of Pregel programs should be competitive with that of asynchronous systems given enough parallel slack [28, 34]. Because typical graph computations have many more vertices than machines, one should be able to balance the machine loads so that the synchronization between supersteps does not add excessive latency.

The rest of the paper is structured as follows. Section 2 describes the model. Section 3 describes its expression as a C++ API. Section 4 discusses implementation issues, including performance and fault tolerance. In Section 5 we
present several applications of this model to graph algorithm problems, and in Section 6 we present performance results. Finally, we discuss related work and future directions.

2. MODEL OF COMPUTATION

The input to a Pregel computation is a directed graph in which each vertex is uniquely identified by a string vertex identifier. Each vertex is associated with a modifiable, user defined value. The directed edges are associated with their source vertices, and each edge consists of a modifiable, user defined value and a target vertex identifier.

A typical Pregel computation consists of input, when the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm terminates, and finishing with output.

Within each superstep the vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm.
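
The superstep structure described above can be sketched as a small single-process simulation. The C++ below is only an illustration, not Pregel's implementation or API: the function name PropagateMax, the adjacency-list representation with integer vertex indices, and the rule that every vertex announces its value in superstep 0 are assumptions made for this sketch. It applies the largest-value propagation that the paper uses as its running example, makes messages sent in superstep S visible only in superstep S + 1, and stops once every vertex has voted to halt and no messages are in transit.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical single-process sketch of the superstep loop (not Pregel's API).
// out_edges[v] lists the target vertex indices of v's outgoing edges.
// Returns the vertex values recorded at the end of every superstep.
std::vector<std::vector<int>> PropagateMax(
    std::vector<int> values, const std::vector<std::vector<int>>& out_edges) {
  const std::size_t n = values.size();
  std::vector<bool> active(n, true);       // superstep 0: every vertex is active
  std::vector<std::vector<int>> inbox(n);  // messages sent in the previous superstep
  std::vector<std::vector<int>> history;
  for (int superstep = 0;; ++superstep) {
    std::vector<std::vector<int>> outbox(n);  // visible only in the next superstep
    bool sent_any = false;
    for (std::size_t v = 0; v < n; ++v) {
      if (!inbox[v].empty()) active[v] = true;  // a message reactivates a vertex
      if (!active[v]) continue;
      // Per-vertex function: reads only v's own value and v's inbox.
      bool changed = (superstep == 0);  // assumption: announce the value once
      for (int m : inbox[v]) {
        if (m > values[v]) {
          values[v] = m;
          changed = true;
        }
      }
      if (changed) {
        for (int target : out_edges[v]) outbox[target].push_back(values[v]);
        sent_any = sent_any || !out_edges[v].empty();
      } else {
        active[v] = false;  // vote to halt
      }
    }
    history.push_back(values);
    inbox = std::move(outbox);  // global synchronization point
    bool any_active = false;
    for (std::size_t v = 0; v < n; ++v) any_active = any_active || active[v];
    if (!any_active && !sent_any) break;  // all halted, nothing in transit
  }
  return history;
}
```

Because each vertex reads only its own inbox and writes only to the next superstep's outbox, the order in which the inner loop visits vertices does not affect the result, which is what lets a real system run the per-vertex function in parallel. As a usage example, calling PropagateMax with values {3, 6, 2, 1} and the hypothetical edges 0->1, 1->0, 1->3, 2->3, 2->1, 3->2 runs four supersteps and records 3 6 2 1, then 6 6 2 6, then 6 6 6 6 twice.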

A vertex can modify its state or that of its outgoing edges, receive messages sent to it in the previous superstep, send messages to other vertices (to be received in the next superstep), or even mutate the topology of the graph. Edges are not first-class citizens in this model, having no associated computation.

¹ The name honors Leonhard Euler. The Bridges of Königsberg, which inspired his famous theorem, spanned the Pregel river.

Algorithm termination is based on every vertex voting to halt. In superstep 0, every vertex is in the active state; all active vertices participate in the computation of any given superstep. A vertex deactivates itself by voting to halt. This means that the vertex has no further work to do unless triggered externally, and the Pregel framework will not execute that vertex in subsequent supersteps unless it receives a message. If reactivated by a message, a vertex must explicitly deactivate itself again. The algorithm as a whole terminates when all vertices are simultaneously inactive and there are no messages in transit. This simple state machine is illustrated in Figure 1.

[Figure 1: Vertex State Machine. "Vote to halt" moves a vertex from Active to Inactive; "Message received" moves it from Inactive to Active.]

The output of a Pregel
program is the set of values explicitly output by the vertices. It is often a directed graph isomorphic to the input, but this is not a necessary property of the system because vertices and edges can be added and removed during computation. A clustering algorithm, for example, might generate a small set of disconnected vertices selected from a large graph. A graph mining algorithm might simply output aggregated statistics mined from the graph.

Figure 2 illustrates these concepts using a simple example: given a strongly connected graph where each vertex contains a value, it propagates the largest value to every vertex. In each superstep, any vertex that has learned a larger value from its messages sends it to all its neighbors. When no further vertices change in a superstep, the algorithm terminates.

We chose a pure message passing model, omitting remote reads and other ways of emulating shared memory, for two reasons. First, message passing is sufficiently expressive that there is no need for remote reads. We have not found any graph algorithms for which message passing is insufficient. Second, this choice is better for performance. In a cluster environment, reading a value from a remote machine incurs high latency that can't easily be hidden. Our message passing model allows us to amortize latency by delivering messages asynchronously in batches.

Graph algorithms can be written as a series of chained MapReduce invocations [11, 30]. We chose a different model for reasons of usability and performance. Pregel keeps vertices and edges on the machine that performs computation, and uses network transfers only for messages. MapReduce, however, is essentially functional, so expressing a graph algorithm as a chained MapReduce requires passing the entire state of the graph from one stage to the next, in
general requiring much more communication and associated serialization overhead. In addition, the need to coordinate the steps of a chained MapReduce adds programming complexity that is avoided by Pregel's iteration over supersteps.

[Figure 2: Maximum Value Example. Dotted lines are messages. Shaded vertices have voted to halt. Vertex values by superstep:
  Superstep 0: 3 6 2 1
  Superstep 1: 6 6 2 6
  Superstep 2: 6 6 6 6
  Superstep 3: 6 6 6 6]

3. THE C++ API

This section discusses the most important aspects of Pregel's C++ API, omitting relatively mechanical issues.

Writing a Pregel program involves subclassing the predefined Vertex class (see Figure 3). Its template arguments define three value types, associated with vertices, edges, and messages. Each vertex has an associated value of the specified type. This uniformity may seem restrictive, but users can manage it by using flexible types like protocol buffers [42]. The edge and message types behave similarly.

The user overrides the virtual Compute() method, which will be executed at each active vertex in every superstep. Predefined Vertex methods allow Compute() to query information about the current vertex and its edges, and to send messages to other vertices. Compute() can inspect the value associated with its vertex via GetValue() or modify it via MutableValue(). It can inspect and modify the values of out-edges using methods supplied by the out-edge iterator. These state updates are visible immediately. Since their visibility is confined to the modified vertex, there are no data races on concurrent value access from different vertices.

The values associated with the vertex and its edges are the only per-vertex state that persists across supersteps. Limiting the graph state managed by the framework to a single value per vertex or edge simplifies the main computation cycle, graph distribution,
and failure recovery.

3.1 Message Passing

Vertices communicate directly with one another by sending messages, each of which consists of a message value and the name of the destination vertex. The type of the message value is specified by the user as a template parameter of the Vertex class.

A vertex can send any number of messages in a superstep. All messages sent to vertex V in superstep S are available, via an iterator, when V's Compute() method is called in superstep S + 1. There is no guaranteed order of messages in the iterator, but it is guaranteed that messages will be delivered and that they will not be duplicated.

    template <typename VertexValue,
              typename EdgeValue,
              typename MessageValue>
    class Vertex {
     public:
      virtual void Compute(MessageIterator* msgs) = 0;

      const string& vertex_id() const;
      int64 superstep() const;

      const VertexValue& GetValue();
      VertexValue* MutableValue();
      OutEdgeIterator GetOutEdgeIterator();

      void SendMessageTo(const string& dest_vertex,
                         const MessageValue& message);
      void VoteToHalt();
    };

    Figure 3: The Vertex API foundations.

A common usage pattern is for a vertex V to iterate over its outgoing edges, sending a message to the destination vertex of each edge, as shown in the PageRank algorithm in Figure 4 (Section 5.1 below). However, dest_vertex need not be a neighbor of V. A vertex could learn the identifier of a non-neighbor from a message received earlier, or vertex identifiers could be known implicitly. For example, the graph could be a clique, with well-known vertex identifiers V_1 through V_n, in which case there may be no need to even keep
explicit edges in the graph.

When the destination vertex of any message does not exist, we execute user-defined handlers. A handler could, for example, create the missing vertex or remove the dangling edge from its source vertex.

3.2 Combiners

Sending a message, especially to a vertex on another machine, incurs some overhead. This can be reduced in some cases with help from the user. For example, suppose that Compute() receives integer messages and that only the sum matters, as opposed to the individual values. In that case the system can combine several messages intended for a vertex V into a single message containing their sum, reducing the number of messages that must be transmitted and buffered.

Combiners are not enabled by default, because there is no mechanical way to find a useful combining function that is consistent with the semantics of the user's Compute() method. To enable this optimization the user subclasses the Combiner class, overriding a virtual Combine() method. There are no guarantees about which (if any) messages are combined, the groupings presented to the combiner, or the order of combining, so combiners should only be enabled for commutative and associative operations.

For some al
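
The sum-combining case described in Section 3.2 can be illustrated with a short sketch. The text names a Combiner class with a virtual Combine() method but does not give its signature here, so the interface below is a hypothetical stand-in: the Combine(int*, int) signature, the Message alias, and the CombineOutgoing helper are all assumptions made for illustration, and the real framework decides on its own when and how messages are grouped.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (destination vertex id, integer message value); a hypothetical representation.
using Message = std::pair<std::string, int>;

// Hypothetical combiner interface; the signature is an assumption.
class Combiner {
 public:
  virtual ~Combiner() = default;
  // Folds one incoming message value into the value accumulated so far.
  virtual void Combine(int* accumulated, int incoming) const = 0;
};

// Sum is commutative and associative, so grouping and order do not matter.
class SumCombiner : public Combiner {
 public:
  void Combine(int* accumulated, int incoming) const override {
    *accumulated += incoming;
  }
};

// Collapses buffered outgoing messages so that each destination vertex
// receives a single combined message.
std::map<std::string, int> CombineOutgoing(const std::vector<Message>& outgoing,
                                           const Combiner& combiner) {
  std::map<std::string, int> buffered;
  for (const Message& msg : outgoing) {
    auto it = buffered.find(msg.first);
    if (it == buffered.end()) {
      buffered.emplace(msg.first, msg.second);  // first message for this target
    } else {
      combiner.Combine(&it->second, msg.second);
    }
  }
  return buffered;
}
```

With the messages ("V", 3), ("W", 10), ("V", 4), ("V", 5) and a SumCombiner, the helper leaves one message for V carrying 12 and one for W carrying 10, so two messages are transmitted instead of four. A Compute() method that only needs the sum sees the same total either way, which is why the optimization is safe for this kind of operation.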
