SPRAB27B
August 2012

Please be aware that an important notice concerning availability, standard warranty, and use in critical applications of Texas Instruments semiconductor products and disclaimers thereto appears at the end of this document.

Application Report

Multicore Programming Guide

Multicore Programming and Applications/DSP Systems

Abstract

As application complexity continues to grow, we have reached a limit on increasing performance by merely scaling clock speed. To meet the ever-increasing processing demand, modern System-on-Chip solutions contain multiple processing cores. The dilemma is how to map applications to multicore devices. In this paper, we present a programming methodology for converting applications to run on multicore devices. We also describe the features of Texas Instruments DSPs that enable
efficient implementation, execution, synchronization, and analysis of multicore applications.

Contents

1 Introduction
2 Mapping an Application to a Multicore Processor
  2.1 Parallel Processing Models
  2.2 Identifying a Parallel Task Implementation
3 Inter-Processor Communication
  3.1 Data Movement
  3.2 Multicore Navigator Data Movement
  3.3 Notification and Synchronization
  3.4 Multicore Navigator Notification Methods
4 Data Transfer Engines
  4.1 Packet DMA
  4.2 EDMA
  4.3 Ethernet
  4.4 RapidIO
  4.5 Antenna Interface
  4.6 PCI Express
  4.7 HyperLink
5 Shared Resource Management
  5.1 Global Flags
  5.2 OS Semaphores
  5.3 Hardware Semaphores
  5.4 Direct Signaling
6 Memory Management
  6.1 CPU View of the Device
  6.2 Cache and Prefetch Considerations
  6.3 Shared Code Program Memory Placement
  6.4 Peripheral Drivers
  6.5 Data Memory Placement and Access
7 DSP Code and Data Images
  7.1 Single Image
  7.2 Multiple Images
  7.3 Multiple Images with Shared Code and Data
  7.4 Device Boot
  7.5 Multicore Application Deployment (MAD) Utilities
8 System Debug
  8.1 Debug and Tooling Categories
  8.2 Trace Logs
  8.3 System Trace
9 Summary
10 References

1 Introduction

For the past 50 years, Moore's law accurately predicted that the number of transistors on an integrated circuit would double every two years. To translate these transistors into equivalent levels of system performance, chip designers increased clock frequencies (requiring deeper instruction pipelines), increased instruction-level parallelism (requiring concurrent threads and branch prediction), increased memory performance (requiring larger caches), and increased power consumption (requiring active power management). Each of these four areas is hitting a wall that impedes further growth:

- Increased processing frequency is slowing due to diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink.
- Instruction-level parallelism is limited by the inherent lack of parallelism in the applications.
- Memory performance is limited by the increasing gap between processor and memory speeds.
- Power consumption scales with clock frequency; so, at some point, extraordinary means are needed to cool the device.

Using multiple processor cores on a single chip allows designers to meet performance goals without using the maximum operating frequency. They can select a frequency in the sweet spot of a process technology that results in lower power consumption. Overall performance is achieved with cores having
simplified pipeline architectures relative to an equivalent single-core solution. Multiple instances of the core in the device result in dramatic increases in MIPS-per-watt performance.

2 Mapping an Application to a Multicore Processor

Until recently, advances in computing hardware provided significant increases in the execution speed of software with little effort from software developers. The introduction of multicore processors presents a new challenge for software developers, who must now master the programming techniques necessary to fully exploit multicore processing potential. Task parallelism is the concurrent execution of independent tasks in software. On a single-core processor, separate tasks must share the same processor. On a multicore processor, tasks run essentially independently of one another, resulting in more efficient execution.

2.1 Parallel Processing Models

One of the first steps in mapping an application to a multicore processor is to identify the task parallelism and select the processing model that fits best. The two dominant models are the Master/Slave model, in which one core controls the work assignments on all cores, and the Data Flow model, in which work flows through processing stages as in a pipeline.

2.1.1 Master/Slave Model

The Master/Slave model represents centralized control with distributed execution. A master core is responsible for scheduling various threads of execution
that can be allocated to any available core for processing. It must also deliver any data required by the thread to the slave core. Applications that fit this model inherently consist of many small independent threads that fit easily within the processing resources of a single core. This software often contains a significant amount of control code and often accesses memory in random order with multiple levels of indirection. There is relatively little computation per memory access, and the code base is usually very large. Applications that fit the Master/Slave model often run on a high-level OS such as Linux and potentially already have multiple threads of execution defined. In this scenario, the high-level OS is the master in charge of the scheduling.

The challenge for applications using this model is real-time load balancing, because thread activation can be random. Individual threads of execution can have very different throughput requirements. The master must maintain a list of cores with free resources and be able to optimize the balance of work across the cores so that optimal parallelism is achieved. An example of a Master/Slave task allocation model is shown in Figure 1.

Figure 1. Master/Slave Processing Model (a Task Master dispatches Task A; Task B; Tasks C, D, E; and Tasks F, G to slave cores)

One application that lends itself to the Master/Slave model is the multi-user data link layer of a communication protocol stack. It is responsible for media access control and logical link control of a physical layer, including complex, dynamic scheduling and data movement through transport channels. The software often accesses multi-dimensional arrays, resulting in very disjointed memory access.

One or more execution threads are mapped to each core.
Task assignment is achieved using message passing between cores. The messages provide the control triggers to begin execution and pointers to the required data. Each core has at least one task whose job is to receive messages containing job assignments. The task is suspended until a message arrives, triggering the thread of execution.

2.1.2 Data Flow Model

The Data Flow model represents distributed control and execution. Each core processes a block of data using various algorithms, and then the data is passed to another core for further processing. The initial core is often connected to an input interface.