EECS 419: TAKE HOME TEST 3
3 December 2019. Due: December 11, 2019.

Please do all required problems 1-9. You must also do one of problems 10 or 11. Deliver your answers in hardcopy form (source code, results and a README note delivered by email in a zip file). Work on your responses to the test problems individually.

Problem 1. – The object of this problem is to explore numerical calculations for the continuum computation model on mesh- or grid-type multiple processors. You should model your problem in C/C++. Assume that you have a square array of points, 16×16, and that the value of the electric potential function on the boundary is zero on the top row, 20 on the bottom row, and 10 along all other boundary points.

1.1 Initialize the potential function to 0 on all interior points. Calculate the Poisson solution for the values of all interior points by replacing each interior point with the average of the values of its neighboring points. Compute the new values for all interior points before updating any interior point. Run this simulation for five iterations and show the answers you obtain at the end. Note: the values on the boundary are fixed and do not change during the computation.

1.2 Repeat the process of problem part 1.1, except update a point as soon as you have computed its new value, and use the new value when you reach a neighboring point. You should scan the interior points row by row from top to bottom, and from left to right within rows.

1.3 The second process seems to converge faster. Give an intuitive explanation of why this might be the case.

1.4 How do your findings relate to the interconnection structure of a processor for solving this problem?

Problem 2. – The purpose of this problem is to show the effect of information propagation within a calculation. Use the Poisson problem discussed in Problem 1 and write a computer program in C/C++ that iterates until no point value changes by more than 0.1 percent. Let this be the initial state of the problem for problem parts 2.1, 2.2 and 2.3 below.

2.1 Increase the boundary point at the upper left corner to a new value of 20. Perform five iterations of the Poisson solver and observe the values obtained.

2.2 Now restore the mesh to the initial state of problem 2.1. Change the problem so that, in effect, the upper left corner is rotated to the bottom right corner. To do this, scan the rows from right to left instead of left to right, and scan from bottom to top instead of top to bottom. Perform five iterations of the Poisson solver and observe the values obtained.

2.3 Both problem parts 2.1 and 2.2 eventually converge to the same solution because the initial data are the same and the physical process modeled is the same. However, the results obtained from problem parts 2.1 and 2.2 are different after five iterations of the Poisson solver. Explain why they are different. Which of the two problems has faster convergence? Why?
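The two update disciplines in Problems 1 and 2 are Jacobi-style (compute all new values before updating) and Gauss-Seidel-style (update in place) relaxation. Below is a minimal serial C++ sketch of both on the 16×16 grid with the stated boundary values and the 0.1 percent stopping test; the grid size and boundary constants come from the problem statement, while the function names and the choice to print only the center value are illustrative.

```cpp
// Sketch of the two relaxation schemes from Problems 1 and 2 (serial reference).
// Boundary: top row = 0, bottom row = 20, left/right columns = 10 (all fixed).
#include <cstdio>
#include <cmath>

const int N = 16;

void init(double g[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) g[i][j] = 0.0;                    // interior starts at 0
    for (int j = 0; j < N; ++j) { g[0][j] = 0.0; g[N-1][j] = 20.0; }  // top, bottom
    for (int i = 1; i < N-1; ++i) { g[i][0] = 10.0; g[i][N-1] = 10.0; } // sides
}

// Problem 1.1: compute all new interior values before updating any of them.
void jacobi_step(double g[N][N]) {
    double nw[N][N];
    for (int i = 1; i < N-1; ++i)
        for (int j = 1; j < N-1; ++j)
            nw[i][j] = 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
    for (int i = 1; i < N-1; ++i)
        for (int j = 1; j < N-1; ++j) g[i][j] = nw[i][j];
}

// Problem 1.2: update in place, scanning top to bottom, left to right.
void gauss_seidel_step(double g[N][N]) {
    for (int i = 1; i < N-1; ++i)
        for (int j = 1; j < N-1; ++j)
            g[i][j] = 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
}

int main() {
    double g[N][N];
    init(g);
    for (int it = 0; it < 5; ++it) jacobi_step(g);   // five iterations, as in 1.1
    printf("center value after 5 Jacobi iterations: %f\n", g[N/2][N/2]);

    // Problem 2: iterate until no point changes by more than 0.1 percent.
    init(g);
    double maxrel;
    do {
        double old[N][N];
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) old[i][j] = g[i][j];
        gauss_seidel_step(g);
        maxrel = 0.0;
        for (int i = 1; i < N-1; ++i)
            for (int j = 1; j < N-1; ++j)
                if (g[i][j] != 0.0)
                    maxrel = fmax(maxrel, fabs(g[i][j] - old[i][j]) / fabs(g[i][j]));
    } while (maxrel > 0.001);
    printf("converged center value: %f\n", g[N/2][N/2]);
    return 0;
}
```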
Problem 3. – Domain decomposition is a well-known technique for solving parallel numerical problems on large computational grids, i.e. over 10^6 grid points. The idea is illustrated in Figure 3.1. The original grid is partitioned into small domains (subgrids) and the numerical problem is then solved in parallel in each domain using a uniprocessor architecture. Care must be taken, though, to communicate the results of processors located in neighboring domains as the computation goes on. As the number of processors, and hence domains, is usually rather small, say on the order of 100 or fewer, this technique provides modest speedup because it suffers from communication bottlenecks at the processor domain interfaces. However, the cost of resources is reasonable provided the number of processors is kept relatively small.

[Figure 3.1: Domain decomposition (processors P1-P4, one per domain). Figure 3.2: Mesh-of-meshes architecture.]

It is proposed to modify the domain decomposition into a parallel mesh-of-meshes architecture style, as shown in Figure 3.2 for a 2×2 domain decomposition. Every grid point of each domain will contain an embedded virtual processor; however, only the top left corner domain, Domain(0,0), will contain a corresponding mesh of physical processors. The computational model is simple. The processor mesh computes Domain(0,0) in parallel, then switches onto Domain(0,1), Domain(0,2) and so on, sweeping through the domains horizontally and vertically, step by step, until it reaches the bottom right corner domain. In general, let N×N be the original grid matrix, partitioned into Q×Q domains such that every domain has M×M grid nodes, where N = M×Q. Thus every full iteration through the entire grid consists of Q^2 domain iterations, one through each of the Q^2 domains. To achieve this, every physical processor node must have sufficient local storage to hold the intermediate results of the corresponding nodes when going from domain to domain. However, to allow for communication between near-neighbor nodes in adjacent domains, we will use a wraparound or torus structure for the physical processor mesh, shown in Figure 3.3.

[Figure 3.3: Mesh domain architecture with wraparound (torus) connections, processors P0-P15.]

In what follows we assume that the above Q×Q architecture (Figures 3.2, 3.3) is used to solve the Poisson potential equation on an N×N grid. The context-switching time from domain to domain may be a few cycles, say 5 (instruction) cycles. The grid has been initialized and the initial values of every node are stored in the corresponding processors. Specifically, the initial potential value of the top grid nodes is 100, whereas all others are initialized to zero.

3.1 Formalize the computational model in the form of an algorithmic procedure for the Poisson potential equation, assuming the standard iterative solver is used.

3.2 Estimate the computation time for your solution assuming that the problem converges after 1000 iterations.

3.3 Estimate the speedup of the proposed approach (if any) in comparison to the standard domain decomposition technique of Figure 3.1. To be fair, you should use the same number of processors Q^2 and the same grid size N×N for the domain decomposition in Figure 3.1.

3.4 Write a C/C++ model for your Poisson solver and simulate your solution to find the convergence values of all nodes in the mesh, assuming that the top boundary initial values do not change. Assume that N = 10^4 and Q = 10^3.

3.5 Get some additional results for different grid sizes N×N. However, only a fraction of your results will be needed in your answers.
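As a starting point for 3.1 and 3.4, here is a serial control-flow sketch of the mesh-of-meshes sweep: one relaxation pass per domain, sweeping the Q×Q domains row by row, with a context-switch charge between domains. The physical M×M mesh is emulated by an ordinary loop, small values of N and Q are used so the sketch runs quickly, and the treatment of the non-top boundaries is an assumption; this is not the formal procedure the problem asks for.

```cpp
// Serial emulation of the mesh-of-meshes sweep (illustrative; small sizes for a quick run).
#include <vector>
#include <cstdio>

int main() {
    const int Q = 4, M = 4, N = Q * M;             // problem 3.4 uses N = 10^4, Q = 10^3
    const int SWITCH_CYCLES = 5;                   // context-switch cost from the statement
    std::vector<std::vector<double>> g(N, std::vector<double>(N, 0.0));
    for (int j = 0; j < N; ++j) g[0][j] = 100.0;   // top boundary initialized to 100

    long long switches = 0;
    for (int iter = 0; iter < 1000; ++iter) {      // 3.2 assumes convergence after 1000 iterations
        for (int qi = 0; qi < Q; ++qi)             // sweep domains row by row ...
            for (int qj = 0; qj < Q; ++qj) {       // ... and left to right within a row
                // the M x M physical mesh relaxes the current domain "in parallel";
                // here every node of the domain is visited sequentially as an emulation
                for (int i = qi * M; i < (qi + 1) * M; ++i)
                    for (int j = qj * M; j < (qj + 1) * M; ++j) {
                        // edge rows/columns are not relaxed here; the statement only pins the
                        // top row, so treatment of the other edges is an assumption
                        if (i == 0 || j == 0 || i == N - 1 || j == N - 1) continue;
                        g[i][j] = 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
                    }
                ++switches;                         // context switch to the next domain
            }
    }
    printf("domain switches: %lld (cost %lld cycles)\n", switches, switches * SWITCH_CYCLES);
    printf("sample interior value g[1][1] = %f\n", g[1][1]);
    return 0;
}
```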
Problem 4. – Consider a 5-dimensional hypercube system, Hyper-5. Assume that you are to calculate all partial sums of the first i items, for i = 1, ..., 32, up to the sum of all 32 items.

4.1 Construct a program for Hyper-5 that performs this operation in a time that grows as O(log N), where N is the number of processors. Assume that every processor node in the computer executes the same program, although the program can be slightly different from node to node since the processors in the Hyper-5 are independent.

4.2 Show explicitly the instructions that send and receive data between processors. Invent instructions as you need them and describe what the instructions do. Include some type of synchronization instruction that forces a processor to be idle until a neighboring processor sends a message or datum that enables computation to continue.

4.3 Which communication steps, if any, in your answers require communication with processors that are not among the five processors directly connected to a given processor? How do you propose to implement such communication in software (assume that the hardware itself does not provide remote communication as a basic instruction)?
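A standard way to meet the O(log N) bound of 4.1 is recursive doubling: at step d every node exchanges its running subtotal with its neighbor across dimension d and adds the received subtotal into its own prefix only if that neighbor's index is lower. The serial C++ simulation below illustrates the data movement for 32 items; the item values and variable names are illustrative, and it is a shared-memory emulation rather than the per-node Hyper-5 program with explicit send/receive instructions that 4.1 and 4.2 require.

```cpp
// Serial simulation of recursive-doubling prefix sums on a 5-dimensional hypercube.
#include <cstdio>

int main() {
    const int DIM = 5, N = 1 << DIM;       // 32 nodes, one item per node
    double item[N], prefix[N], total[N];
    for (int i = 0; i < N; ++i) {          // illustrative input: node i holds value i + 1
        item[i] = i + 1;
        prefix[i] = total[i] = item[i];
    }
    for (int d = 0; d < DIM; ++d) {
        double recv[N];
        for (int i = 0; i < N; ++i)        // every node "sends" its subtotal across dimension d
            recv[i] = total[i ^ (1 << d)];
        for (int i = 0; i < N; ++i) {
            if ((i ^ (1 << d)) < i)        // neighbor has a lower index: its subtotal precedes ours
                prefix[i] += recv[i];
            total[i] += recv[i];           // both sides now hold the sum of the merged group
        }
    }
    for (int i = 0; i < N; ++i)
        printf("node %2d: partial sum of first %2d items = %g\n", i, i + 1, prefix[i]);
    return 0;
}
```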
Problem 5. – Consider a homogeneous multicore machine containing two single-thread processor cores and four shared memory modules, M1, M2, M3, M4. The cores and memories are connected via a 2×4 crossbar switch. A program to be executed on the multicore system is represented by the data dependency graph of the figure below, involving the instructions I1, I2, ..., I12. We assume an instruction execution time of 1 cycle, a crossbar time of 2 cycles and a memory reference time of 2 cycles. We also assume that instruction execution and memory references are consolidated, i.e. not pipelined. The memory access requirements for each instruction are given in the table. Memory conflicts and crossbar conflicts are taken into account. There is a host processor which initially holds the instruction program. The host is responsible for loading this program onto the multicore system. Based on the data dependencies, the host will load instruction parts to each core for parallel execution. Because of this loading, the cores do not go through the instruction fetch phase; assume the decoding time is subsumed by the loading. Then, for every instruction, each core needs to access the memory through the crossbar, following the access patterns, and get the data back to the core, again through the crossbar. At the end of the computation, the results are in the registers of each core.

[Figure: Multicore architecture (host system, Core 1, Core 2, 2×4 crossbar, memories M1-M4), the data dependency graph over instructions I1-I12, and the table of memory access patterns indicating which of M1-M4 each instruction references.]

5.1 Make a suitable allocation mapping of the instructions onto the processors of the multicore system that requires as short a computation time as possible.

5.2 Design an algorithm that will systematically solve this problem in this environment, i.e. the 2-processor, 2×4 crossbar, 4-memory homogeneous multicore system, assuming an acyclic program dependency graph.

5.3 Suppose that the single-threaded 2-core system is replaced by a 2-threaded single processor core, keeping the four memory modules and replacing the crossbar by a bus. However, the instruction execution time is now 2 cycles, the bus transport time averages 1 cycle and the memory reference time remains 2 cycles. Consider that the data flow program graph consists of two program threads, i.e. T1: I1, I3, I4, I5, I6, I7, I8, I10 and T2: I2, I9, I11, I12. Redo the mapping problem 5.1 for the 2-threaded single-core system assuming an additional 1-cycle context-switching penalty. Compare the results to the single-threaded 2-core system of problem 5.1.

Problem 6. – A multiprocessor node must sometimes send a message to more than one other processor, a task referred to as "broadcasting". Suppose that a node P0 in an n-dimensional hypercube system has to broadcast a message to all 2^n - 1 other processors. The broadcasting is subject to the constraints that the message can be forwarded (retransmitted) by a node only to a neighboring node, and each node can transmit only one message at a time. Assume that each message transmission between adjacent nodes requires only one time unit. In a two-dimensional system, for example, P0 could broadcast a message MESS as follows. At time t = 0, P0 sends MESS to P1. At t = 1, P0 sends MESS to P2 and P1 sends MESS to P3, thus completing the broadcast in 2 time units.

6.1 Construct a general broadcasting algorithm for the n-dimensional case that allows a message to reach all nodes in n time units. Specify clearly the algorithm used by each node to determine the neighboring nodes to which it should forward an incoming message. (One such scheme is sketched below, after 6.2.)

6.2 Apply your algorithm to the design of the 8×8 heat-flow program (Laplace equation) to run on a 64-processor hypercube computer that simulates a mesh computer. Specify completely the function you use to map the problem's gridpoints onto the nodes of the hypercube.
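One scheme that reaches all nodes in n time units, generalizing the two-dimensional example above, is dimension-order (binomial-tree) broadcast: at step t, every node that already holds MESS forwards it to its neighbor across dimension t. The sketch below simulates this rule serially and reports the completion time; n = 6 is chosen to match the 64-processor cube of 6.2, and the serial simulation style is an illustrative assumption.

```cpp
// Serial simulation of dimension-order broadcast from node 0 on an n-dimensional hypercube.
#include <cstdio>

int main() {
    const int n = 6, N = 1 << n;                   // n = 6 gives the 64-node cube of 6.2
    int arrival[N];
    for (int i = 0; i < N; ++i) arrival[i] = -1;   // -1 means "message not yet received"
    arrival[0] = 0;                                // P0 holds MESS at t = 0

    for (int t = 0; t < n; ++t)                    // step t: forward across dimension t
        for (int i = 0; i < N; ++i)
            if (arrival[i] >= 0 && arrival[i] <= t) {
                int partner = i ^ (1 << t);        // neighbor across dimension t
                if (arrival[partner] < 0) arrival[partner] = t + 1;
            }

    int latest = 0;
    for (int i = 0; i < N; ++i) if (arrival[i] > latest) latest = arrival[i];
    printf("all %d nodes reached after %d time units\n", N, latest);  // expect n = 6
    return 0;
}
```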
6.3 Simulate the 16×16 heat-flow problem, i.e. build a C/C++ program to simulate the operation of a net of 2×2 processor nodes, assuming a reasonable communication cost. Tabulate your results. Comment on how the problem would scale to a 256×256 heat-flow problem.

Problem 7. – Consider the following arithmetic expression: F = (A0 + A1 + ... + A7) × (B0 + B1 + ... + B7).

7.1 Show a minimal-height tree implementation of F assuming unlimited processors. Compute the speedup S1 and the efficiency E1.

7.2 Find a minimal-height tree implementation of F assuming 4 (four) processors are available, each able to execute an addition or multiplication operation in one time step. Compute the speedup and efficiency, S2 and E2, respectively.

7.3 Suppose you want to implement F using a 4×4 mesh-connected network. a) Describe a procedure that implements F in minimal time on the mesh network; also find the speedup and efficiency of your parallel mesh implementation. b) If your boundary processors are connected by end-around or wrap-around connections (toroidal connections), show how your solution improves in terms of speedup and efficiency.

7.4 Suppose you want to implement F using 8 (eight) processors connected into an 8-node shuffle network using 4 switching elements and feedback paths (similar to the network shown in class). Describe an algorithm that implements F in minimal time using this shuffle network. Assume that the switches can shuffle one data set in a time step while, during the same time, the processors perform an operation on another data set. Compute the speedup and efficiency, S3 and E3, respectively.

7.5 Suppose that we want to implement F using a 4-dimensional hypercube network. Assume that the 16 variables Ai and Bi, i = 0, 1, ..., 7, have already been loaded into the 16 nodes of the cube. Describe an algorithm that will implement F in minimal time using this hypercube network. Compute S4 and E4.

7.6 Discuss the pros and cons of the above five implementations.

Problem 8. – The purpose of this problem is to explore some of the properties of the perfect shuffle scheme.

8.1 Consider a processor that has the perfect shuffle and pair-wise exchange connections shown in the figure below. For an eight-processor system, show that the permutation that cyclically shifts the input vector by three positions is realized by some setting of the exchange modules. Draw the network unrolled in time to show the setting that realizes this permutation.

8.2 Repeat the above question to show that a cyclic shift of two positions is realizable.

8.3 Prove that a shuffle-exchange network can realize any cyclic shift in log2 N iterations for an N-processor system when N is a power of 2.

[Figure: Perfect shuffle network. Tall boxes depict modules that can exchange their inputs. Shaded modules are reached from Processor 3 via some sequence of shuffle-exchange steps.]
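For 8.1 and 8.2 it can help to check candidate exchange settings mechanically. The sketch below simulates log2 N = 3 passes of an 8-way shuffle-exchange stage (perfect shuffle followed by optional exchanges of the pairs (2k, 2k+1)) and brute-forces all exchange settings for one that realizes a given cyclic shift. The shuffle-then-exchange stage order and the shift direction are assumptions about the figure's conventions; the unrolled drawing requested in 8.1 still has to be produced by hand.

```cpp
// Brute-force search over exchange settings in a 3-stage shuffle-exchange network (N = 8).
#include <cstdio>

const int N = 8;

// Perfect shuffle: the item at position i moves to the position obtained by
// left-rotating the 3-bit address of i (0->0, 1->2, 2->4, ..., 4->1, ...).
void shuffle(int v[N]) {
    int out[N];
    for (int i = 0; i < N; ++i) {
        int dest = ((i << 1) | (i >> 2)) & (N - 1);
        out[dest] = v[i];
    }
    for (int i = 0; i < N; ++i) v[i] = out[i];
}

// One stage: shuffle, then exchange pair (2k, 2k+1) whenever bit k of 'mask' is set.
void stage(int v[N], int mask) {
    shuffle(v);
    for (int k = 0; k < N / 2; ++k)
        if (mask & (1 << k)) { int t = v[2*k]; v[2*k] = v[2*k + 1]; v[2*k + 1] = t; }
}

int main() {
    const int shift = 3;                            // use shift = 2 for problem 8.2
    for (int m0 = 0; m0 < 16; ++m0)
        for (int m1 = 0; m1 < 16; ++m1)
            for (int m2 = 0; m2 < 16; ++m2) {
                int v[N];
                for (int i = 0; i < N; ++i) v[i] = i;   // identity input
                stage(v, m0); stage(v, m1); stage(v, m2);
                bool ok = true;
                for (int i = 0; i < N; ++i)
                    if (v[(i + shift) % N] != i) { ok = false; break; }
                if (ok) {
                    printf("cyclic shift by %d realized with exchange masks %d, %d, %d\n",
                           shift, m0, m1, m2);
                    return 0;
                }
            }
    printf("no setting found under these conventions\n");
    return 0;
}
```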
Problem 9. – Consider the use of a four-PE array processor (PE means processing element or processor) to multiply two 3×3 matrices. The interconnection structure of the four PEs is shown in the figure below. Wraparound connections appear in all rows and columns of the array. You need to map the matrix elements initially onto the processors. All three multiplications needed for each output element c_ij must be performed in the same PE in order to accumulate the sum of products. Of course, you are allowed to shift the matrix elements around if needed.

[Figure: 2×2 array of processing elements PE1-PE4 with wraparound connections in every row and column, computing A × B = C.]

9.1 Show the initial mapping of the A and B matrix elements to the processors before the first multiplication is carried out. (You may have to wrap the matrix around.)

9.2 What are the initial multiplications to be carried out in each processor (there may be more than one multiplication in each processor) without any data shifting?

9.3 Parallel shifts are carried out in the horizontal and vertical directions. Show the mapping of the A and B matrix elements before the second group of multiplications can be carried out.

9.4 What are the multiplications to be carried out in each processor without any further shifting? (Don't bother with summing with the previous terms. Summation operations in dot product operations are embedded in the multiply hardware automatically.)

9.5 Suppose the two matrices have already been allocated as in problem part 9.1. Assume each processor can perform one multiplication per unit time, or one shift in a single direction per unit time for each number. (If you shift two numbers, it takes two units of time.) Determine the time units needed to complete parts 9.1, 9.2, 9.3, 9.4, respectively. Minimizing the total time delay is the design goal.
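The shift/multiply-accumulate pattern behind Problem 9 can be prototyped serially. The sketch below runs a Cannon-style schedule on a 2×2 torus of PEs after zero-padding the 3×3 matrices to 4×4 and assigning each PE a 2×2 block; this padding and block mapping are assumptions made for illustration and need not coincide with the element-wise mapping the problem intends, but the wraparound shifts and per-PE accumulation have the same structure.

```cpp
// Cannon-style multiply of two 3x3 matrices on a 2x2 torus of PEs, using 2x2 blocks
// after zero-padding to 4x4 (illustrative mapping, not necessarily the intended one).
#include <cstdio>

const int NB = 2, BS = 2;                   // 2x2 grid of PEs, 2x2 block per PE
typedef double Block[BS][BS];

void multiply_acc(const Block a, const Block b, Block c) {   // c += a * b
    for (int i = 0; i < BS; ++i)
        for (int j = 0; j < BS; ++j)
            for (int k = 0; k < BS; ++k) c[i][j] += a[i][k] * b[k][j];
}

int main() {
    double A[4][4] = {{1,2,3,0},{4,5,6,0},{7,8,9,0},{0,0,0,0}};   // 3x3 matrices, zero-padded
    double B[4][4] = {{9,8,7,0},{6,5,4,0},{3,2,1,0},{0,0,0,0}};

    Block a[NB][NB], b[NB][NB], c[NB][NB] = {};
    // Initial alignment (Cannon): PE(I,J) holds A block (I, (J+I) mod 2) and B block ((I+J) mod 2, J).
    for (int I = 0; I < NB; ++I)
        for (int J = 0; J < NB; ++J)
            for (int i = 0; i < BS; ++i)
                for (int j = 0; j < BS; ++j) {
                    a[I][J][i][j] = A[I*BS + i][((J + I) % NB)*BS + j];
                    b[I][J][i][j] = B[((I + J) % NB)*BS + i][J*BS + j];
                }

    for (int step = 0; step < NB; ++step) {
        for (int I = 0; I < NB; ++I)
            for (int J = 0; J < NB; ++J) multiply_acc(a[I][J], b[I][J], c[I][J]);
        // shift A blocks one PE to the left and B blocks one PE upward (wraparound)
        Block na[NB][NB], nb[NB][NB];
        for (int I = 0; I < NB; ++I)
            for (int J = 0; J < NB; ++J)
                for (int i = 0; i < BS; ++i)
                    for (int j = 0; j < BS; ++j) {
                        na[I][J][i][j] = a[I][(J + 1) % NB][i][j];
                        nb[I][J][i][j] = b[(I + 1) % NB][J][i][j];
                    }
        for (int I = 0; I < NB; ++I)
            for (int J = 0; J < NB; ++J)
                for (int i = 0; i < BS; ++i)
                    for (int j = 0; j < BS; ++j) {
                        a[I][J][i][j] = na[I][J][i][j];
                        b[I][J][i][j] = nb[I][J][i][j];
                    }
    }

    for (int r = 0; r < 3; ++r) {             // print the 3x3 part of C = A*B
        for (int col = 0; col < 3; ++col)
            printf("%6.0f ", c[r/BS][col/BS][r%BS][col%BS]);
        printf("\n");
    }
    return 0;
}
```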
Problem 10. – Construct a shuffle network to compute a second-degree polynomial in one variable: Q(x) = Ax^2 + Bx + C. Now consider a second-degree polynomial in two variables:

P(x,y) = A2 x^2 y^2 + A1 x^2 y + A0 x^2 + B2 x y^2 + B1 x y + B0 x + C2 y^2 + C1 y + C0    (1)

It is desired to evaluate this polynomial on the basis of the shuffle construction for polynomials in one variable. To do this, you may apply the following transformation:

P(x,y) = Q2(x) y^2 + Q1(x) y + Q0(x)    (2)

You may assume that you have four processors available, so that each processor can execute three additions and multiplications concurrently. Masking of these operations is possible, and the masks should be shuffled through the computation network. Broadcasting of x and y powers is possible from the last processor. The four processors are interconnected by a) shuffle connections; b) linear shift connections (to perform inter-processor additions).

10.1 Construct a shuffle/exchange network to evaluate the second-degree polynomial P(x,y).

10.2 Summarize all the steps taken to solve the above question.

Problem 11. – Consider the following polynomial in two variables:

P(x,y) = A x^3 + B x^2 y + C x y^2 + D y^3

It is desired to evaluate this polynomial on the basis of the shuffle construction for polynomials in one variable. You may assume that you have up to eight processors available and that each processor can execute additions and multiplications. The variables x and y are initially shifted into the processors together with the coefficients A, B, C, D. The processors are interconnected by a) shuffle connections; b) linear shift connections. Appropriate values are to be shuffled through the network as arguments of the addition or multiplication operations performed by the processors.

11.1 Construct a shuffle/exchange network to evaluate the polynomial P(x,y).

11.2 Summarize all the steps taken to solve the above question.

11.3 Calculate the number of computation time units required to evaluate the polynomial P(x,y). Assume that it takes one time unit for an addition or multiplication within a processor, and one time unit for communication or broadcasting between two processors.

11.4 What is the minimum number of processors, and hence the simplest shuffle network, needed to solve this problem?
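A scalar reference is handy for checking any shuffle-network design for Problems 10 and 11. The sketch below evaluates P(x,y) of equation (1) both directly and through the decomposition of equation (2), with each Qi(x) and the outer polynomial in y evaluated by Horner's rule, and does the same for the cubic of Problem 11; the coefficient and argument values are illustrative.

```cpp
// Scalar reference for Problems 10 and 11: evaluate P(x,y) directly and via the
// decomposition P(x,y) = Q2(x) y^2 + Q1(x) y + Q0(x), each Qi evaluated by Horner's rule.
#include <cstdio>

double horner2(double a2, double a1, double a0, double x) {   // a2*x^2 + a1*x + a0
    return (a2 * x + a1) * x + a0;
}

int main() {
    // illustrative coefficients for equation (1)
    double A2 = 1, A1 = 2, A0 = 3, B2 = 4, B1 = 5, B0 = 6, C2 = 7, C1 = 8, C0 = 9;
    double x = 1.5, y = -2.0;

    double direct = A2*x*x*y*y + A1*x*x*y + A0*x*x
                  + B2*x*y*y   + B1*x*y   + B0*x
                  + C2*y*y     + C1*y     + C0;

    double Q2v = horner2(A2, B2, C2, x);       // coefficients of y^2, grouped by powers of x
    double Q1v = horner2(A1, B1, C1, x);       // coefficients of y^1
    double Q0v = horner2(A0, B0, C0, x);       // coefficients of y^0
    double viaQ = horner2(Q2v, Q1v, Q0v, y);   // Horner in y, as in equation (2)

    printf("P(x,y) direct = %f, via Q decomposition = %f\n", direct, viaQ);

    // Problem 11 cubic: A x^3 + B x^2 y + C x y^2 + D y^3
    double A = 1, B = -2, C = 3, D = -4;
    double cubic = ((A * x + B * y) * x + C * y * y) * x + D * y * y * y;
    printf("cubic P(x,y) = %f (direct: %f)\n",
           cubic, A*x*x*x + B*x*x*y + C*x*y*y + D*y*y*y);
    return 0;
}
```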
