publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2011
- [CNCC] Maximizing Synchronization Coverage Via Controlling Thread Schedule. HyoYoung Kim, DaeHyun Cho, and Sungdo Moon. In 2011 IEEE Consumer Communications and Networking Conference (CNCC’11), Jan 2011.
Coverage analysis has long played an important role in testing traditional single-threaded applications. As multicore hardware becomes more widespread, the number of multithreaded applications grows as well. In a multithreaded environment, traditional code coverage models cannot capture the nondeterministic behavior of simultaneously running threads. The concept of synchronization coverage was introduced to solve this problem. In this paper, we present a new coverage testing tool that maximizes synchronization coverage in less time and with fewer repetitions. After locating synchronization primitives, our tool instruments special code segments at these points and drives the application toward maximum synchronization coverage by scheduling individual threads. This achieves maximum synchronization coverage without modifying source code or writing unit test code. We implemented this method and showed that it reaches maximum synchronization coverage 1.5 to over 144 times faster than a random method.
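The tool itself is not reproduced here; the following is only a minimal sketch of the idea for a pthreads program, assuming a hypothetical wrapper `covered_mutex_lock` that records, per lock site, whether the acquiring thread had to block, and occasionally yields to perturb the schedule toward uncovered interleavings.

```c
/* Minimal sketch of synchronization-coverage instrumentation
 * (hypothetical; not the tool described in the paper).
 * Each lock site gets an integer id.  For every site we record two
 * coverage events: the acquiring thread blocked, or it did not.
 * A short random yield before the acquire perturbs the thread
 * schedule so that uncovered events are reached sooner. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_SITES 128

static int covered_blocked[MAX_SITES];    /* site observed with contention    */
static int covered_unblocked[MAX_SITES];  /* site observed without contention */

void covered_mutex_lock(pthread_mutex_t *m, int site_id)
{
    /* Schedule perturbation: occasionally give up the CPU so other
     * threads reach the same lock and create contention. */
    if (rand() % 4 == 0)
        sched_yield();

    if (pthread_mutex_trylock(m) == 0) {
        covered_unblocked[site_id] = 1;   /* acquired without blocking */
    } else {
        pthread_mutex_lock(m);            /* had to wait: contention covered */
        covered_blocked[site_id] = 1;
    }
}

void report_coverage(int nsites)
{
    int hit = 0;
    for (int i = 0; i < nsites; i++)
        hit += covered_blocked[i] + covered_unblocked[i];
    printf("synchronization coverage: %d / %d events\n", hit, 2 * nsites);
}
```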
2004
- [CGO] SYZYGY - A Framework for Scalable Cross-Module IPO. Sungdo Moon, Xinliang D. Li, Robert Hundt, Dhruva R. Chakrabarti, Luis A. Lozano, Uma Srinivasan, and Shin-Ming Liu. In 2004 International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization (CGO’04), Mar 2004.
Performing analysis across module boundaries for an entire program is important for exploiting several runtime performance opportunities. However, due to scalability problems in existing full-program analysis frameworks, such performance opportunities are only realized by paying tremendous compile-time costs. Alternative solutions, such as partial compilations or user assertions, are complicated or unsafe and, as a result, not many commercial applications are compiled today with cross-module optimizations. We present SYZYGY, a practical framework for performing efficient, scalable, interprocedural optimizations. The framework is implemented in the HP-UX Itanium® compilers and we have successfully compiled many very large applications consisting of millions of lines of code. We achieved performance improvements of up to 40% over optimization level two and compilation time improvements on the order of 100% and more compared to a previous approach.
2002
- [Dissertation] Combining Compile-Time and Run-Time Parallelization. Sungdo Moon. University of Southern California, May 2002.
In this dissertation, we investigate how good today’s parallelizing compilers are, what opportunities remain to improve them, and what technology is needed to exploit the remaining opportunities. To answer these questions, we perform experiments that measure the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. The experimental results demonstrate that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis. In order to exploit these remaining opportunities, we have developed a new compile-time analysis technique that can be used to parallelize most of these remaining parallelizable loops. This technique is designed not only to improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating with each array data-flow value a predicate. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization. We demonstrate the effectiveness of predicated data-flow analysis by implementing it in the Stanford SUIF compiler and performing experiments on three benchmark suites and one additional program. Experimental results with a prototype implementation show that predicated array data-flow analysis is promising at finding additional parallel loops, as it parallelizes more than 40% of the remaining inherently parallel loops left unparallelized by the SUIF compiler. We demonstrate improved speedups with negligible run-time overhead for 5 of the 6 programs in our benchmark suite where significant speedup improvement is possible.
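As an illustration only (the loop below is hypothetical, not taken from the dissertation), the sketch shows the kind of predicate-guarded code such an approach could emit: a temporary array t is privatizable only when every read of it is covered by a write earlier in the same iteration, a fact that here reduces to the cheap run-time test k <= m.

```c
/* Hypothetical example.  The temporary t is reused across iterations:
 * each iteration writes t[0..m-1] and then reads t[0..k-1].  The reads
 * are covered by the same iteration's writes only if k <= m, which may
 * be unknown at compile time.  A predicated analysis can attach (k <= m)
 * to the "t is privatizable" data-flow value and emit it as a run-time
 * test guarding a parallel version, instead of giving up. */
#include <stdlib.h>

void kernel(double *a, const double *b, double *t, int n, int m, int k)
{
    if (k <= m) {                    /* run-time test derived from the predicate */
        #pragma omp parallel
        {
            double *tp = malloc((size_t)m * sizeof *tp);  /* private copy of t */
            #pragma omp for
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < m; j++) tp[j] = b[i] * j;
                for (int j = 0; j < k; j++) a[i] += tp[j];
            }
            free(tp);
        }
    } else {                         /* original sequential loop, shared t */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) t[j] = b[i] * j;
            for (int j = 0; j < k; j++) a[i] += t[j];
        }
    }
}
```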
2000
- [TPDS] Evaluating Automatic Parallelization in SUIF. Sungdo Moon, Byoungro So, and Mary W. Hall. IEEE Transactions on Parallel and Distributed Systems, Jan 2000.
This paper presents the results of an experiment to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by the Stanford SUIF compiler, a state-of-the-art automatic parallelization system targeting shared-memory multiprocessor architectures. For the purposes of this experiment, we have developed a run-time parallelization test called the Extended Lazy Privatizing Doall (ELPD) test, which is able to simultaneously test multiple loops in a loop nest. The ELPD test identifies a specific type of parallelism where each iteration of the loop being tested accesses independent data, possibly by making some of the data private to each processor. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input. The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results.
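A rough sketch of the kind of shadow-array marking such a run-time test performs is given below. This is hypothetical code, not the actual ELPD implementation; it handles a single array, ignores last-value copy-out, and simply reports whether the loop could run in parallel with the array privatized.

```c
/* Illustrative LPD/ELPD-style marking (not the paper's implementation).
 * The candidate loop is run once with every access to the array under
 * test routed through mark_read()/mark_write().  A read is "exposed"
 * if it is not preceded by a write to the same element within the same
 * iteration.  The loop is reported parallelizable (with the array
 * privatized) unless some element is written in one iteration and
 * exposed-read in a different one. */
#include <string.h>

enum { NELEMS = 1 << 16 };

static int write_iter[NELEMS];     /* most recent writing iteration, or -1   */
static int exposed_iter[NELEMS];   /* iteration of first exposed read, or -1 */
static int cur_iter;
static int independent = 1;

void lpd_begin(void)
{
    memset(write_iter,   -1, sizeof write_iter);    /* all-ones bytes == -1 */
    memset(exposed_iter, -1, sizeof exposed_iter);
    independent = 1;
}

void lpd_begin_iteration(int i) { cur_iter = i; }

void mark_write(int e)
{
    if (exposed_iter[e] != -1 && exposed_iter[e] != cur_iter)
        independent = 0;           /* exposed-read by another iteration */
    write_iter[e] = cur_iter;
}

void mark_read(int e)
{
    if (write_iter[e] == cur_iter)
        return;                    /* covered by a write in this iteration */
    if (exposed_iter[e] == -1)
        exposed_iter[e] = cur_iter;
    if (write_iter[e] != -1)       /* written earlier by another iteration */
        independent = 0;
}

int lpd_loop_is_parallel(void) { return independent; }
```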
1999
- [ICS] A Tile Selection Algorithm for Data Locality and Cache Interference. Jacqueline Chame and Sungdo Moon. In The 13th International Conference on Supercomputing (ICS’99), Jun 1999.
Loop tiling is a well-known compiler transformation that increases data locality, exposes parallelism and reduces synchronization costs. Tiling increases the amount of data reuse that can be exploited by reordering the loop iterations so that accesses to the same data are closer together in time. However, tiled loops often suffer from cache interference in the direct-mapped or low-associativity caches typically found in state-of-the-art microprocessors. A solution to this problem is to choose a tile size that does not exhibit self interference. In this paper, we propose a new tile selection algorithm for eliminating self interference and simultaneously minimizing capacity and cross-interference misses. We have automated the algorithm in the SUIF compiler and used it to generate tiles for a range of problem sizes for three scientific computations. Our experimental results show that the algorithm consistently finds tiles that yield lower miss rates than existing tile selection algorithms.
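The published algorithm is analytical; the brute-force sketch below only illustrates the underlying constraint, assuming a direct-mapped cache: a square tile of a row-major array is free of self interference when the cache intervals occupied by its rows are pairwise disjoint.

```c
/* Illustrative brute-force search (not the paper's algorithm) for the
 * largest square tile free of self interference in a direct-mapped
 * cache.  The array is row major with leading dimension ld (elements)
 * and the cache holds csize elements.  Row r of a tile of width w
 * starts at cache set (r * ld) % csize and occupies w consecutive
 * sets; the tile is conflict-free if these intervals are disjoint. */
#include <stdio.h>

static int rows_conflict(int start_a, int start_b, int w, int csize)
{
    /* Do the length-w intervals starting at start_a and start_b
     * overlap modulo the cache size? */
    int d = (start_b - start_a) % csize;
    if (d < 0) d += csize;
    return d < w || csize - d < w;
}

static int tile_is_conflict_free(int t, int ld, int csize)
{
    for (int i = 0; i < t; i++)
        for (int j = i + 1; j < t; j++)
            if (rows_conflict((i * ld) % csize, (j * ld) % csize, t, csize))
                return 0;
    return 1;
}

int largest_conflict_free_tile(int ld, int csize)
{
    int best = 1;
    for (int t = 2; t * t <= csize; t++)     /* footprint must also fit */
        if (tile_is_conflict_free(t, ld, csize))
            best = t;
    return best;
}

int main(void)
{
    /* Example: 8K-element direct-mapped cache, leading dimension 1024
     * (a classic pathological case); prints 8. */
    printf("largest conflict-free square tile: %d\n",
           largest_conflict_free_tile(1024, 8192));
    return 0;
}
```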
- [PPoPP] Evaluation of Predicated Array Data-Flow Analysis for Automatic Parallelization. Sungdo Moon and Mary W. Hall. In The 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’99), May 1999.
This paper presents an evaluation of a new analysis for parallelizing compilers called predicated array data-flow analysis. This analysis extends array data-flow analysis for parallelization and privatization to associate predicates with data-flow values. These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation. As compared to previous work that combines predicates with array data-flow analysis, our approach is distinguished by two features: (1) it derives low-cost, run-time parallelization tests; and, (2) it incorporates predicate embedding and predicate extraction, which translate between the domain of predicates and data-flow values to derive more precise analysis results. We present extensive experimental results across three benchmark suites and one additional program, demonstrating that predicated array data-flow analysis parallelizes more than 40% of the remaining inherently parallel loops left unparallelized by the SUIF compiler and that it yields improved speedups for 5 programs.
- [SP] Combining Compile-Time and Run-Time Parallelization. Sungdo Moon, Byoungro So, and Mary W. Hall. Scientific Programming, May 1999.
This paper demonstrates that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. We present results of measurements on programs from two benchmark suites – SPECFP95 and NAS sample benchmarks – which identify inherently parallel loops in these programs that are missed by the compiler. We characterize remaining parallelization opportunities, and find that most of the loops require run-time testing, analysis of control flow, or some combination of the two. We present a new compile-time analysis technique that can be used to parallelize most of these remaining loops. This technique is designed to not only improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run-time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating predicates with array data-flow values. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization.
1998
- [ICS] Predicated Array Data-Flow Analysis for Run-Time Parallelization. Sungdo Moon, Mary W. Hall, and Brian R. Murphy. In The 12th International Conference on Supercomputing (ICS’98), Jul 1998.
This paper presents a new analysis for parallelizing compilers called predicated array data-flow analysis, whereby array data-flow analysis for parallelization and privatization is extended to associate predicates with data-flow values. These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions, which can consist of arbitrary program statements, can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation. We have implemented predicated array data-flow analysis in the Stanford SUIF compiler. We describe features of the implementation and present experimental results that demonstrate this analysis improves the performance of three programs from the SPEC95FP benchmark suite.
- [ICS] Measuring the Effectiveness of Automatic Parallelization in SUIF. Byoungro So, Sungdo Moon, and Mary W. Hall. In The 12th International Conference on Supercomputing (ICS’98), Jul 1998.
This paper presents both an experiment and a system for inserting run-time dependence and privatization testing. The goal of the experiment is to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by state-of-the-art parallelizing compiler technology. We perform run-time testing of data accessed within all candidate loops not parallelized by the compiler to identify which of these loops could safely execute in parallel for the given program input. This system extends the Lazy Privatizing Doall (LPD) test to simultaneously instrument multiple loops in a nest. Using the results of interprocedural array dataflow analysis, we avoid unnecessary instrumentation of arrays with compile-time provable dependences or loops nested inside outer parallelized loops. We have implemented the system in the Stanford SUIF compiler and have measured programs in three benchmark suites.
- [USC TR] Hyperblocking: A Data Reorganization Method to Eliminate Cache Conflicts in Tiled Loop Nests. Sungdo Moon and Rafael H. Saavedra. Technical report, University of Southern California, Feb 1998.
This paper presents hyperblocking, or hypertiling, a novel optimization technique that makes it possible to virtually eliminate the self and cross interference misses of tiled loop nests, while significantly reducing the overhead involved in copying data blocks. In hyperblocking, all arrays used in a tiled loop nest are reorganized at the beginning of the computation such that all data blocks are guaranteed to map into disjoint and contiguous cache regions. Doing this effectively eliminates the occurrence of interference misses in many dense-matrix computations. When data prefetching is used in combination with hyperblocking, it is possible to completely hide the latency of capacity misses, and because no cache conflicts can occur, the prefetch distance can be increased to cover larger latencies without causing any cache pollution. Furthermore, the performance of hyperblocked loop nests no longer depends on input sizes and starting addresses, so it is possible to predict the execution time of tiled loop nests for any input size. In simulations on a couple of machines, hyperblocking significantly improved the performance of several tiled loop nests and outperformed previous approaches such as copy optimization and tile size selection.
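A simplified sketch of the copy step is shown below (hypothetical; the scheme described above also controls where each block lands in the cache, which is not shown): the array is copied into block-major order before the tiled computation, so every tile occupies a contiguous chunk of memory and cannot self-interfere.

```c
/* Simplified illustration of the data-reorganization step: copy an
 * n x n row-major matrix into "block-major" layout, where each T x T
 * tile occupies a contiguous chunk.  Contiguous tiles cannot
 * self-interfere in the cache; choosing where each chunk is placed
 * (omitted here) is what controls cross interference.
 * Assumes T divides n. */
#include <string.h>

void copy_to_blocked(double *dst, const double *src, int n, int T)
{
    int nb = n / T;                          /* tiles per dimension */
    for (int bi = 0; bi < nb; bi++)
        for (int bj = 0; bj < nb; bj++) {
            /* destination chunk for tile (bi, bj) */
            double *tile = dst + (size_t)(bi * nb + bj) * T * T;
            for (int i = 0; i < T; i++)
                memcpy(tile + (size_t)i * T,
                       src + (size_t)(bi * T + i) * n + bj * T,
                       (size_t)T * sizeof *src);
        }
}

/* Element (i, j) of the original matrix is then found at
 *   dst[((i/T) * (n/T) + (j/T)) * T*T + (i%T) * T + (j%T)]. */
```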
- [LCR] A Case for Combining Compile-Time and Run-Time Parallelization. Sungdo Moon, Byoungro So, Mary W. Hall, and Brian R. Murphy. In The 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers (LCR’98), May 1998.
This paper demonstrates that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and, (2) they must take control flow into account during analysis. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. We present results of measurements on programs from two benchmark suites — SPECfp95 and NAS sample benchmarks — which identify inherently parallel loops in these programs that are missed by the compiler. We characterize remaining parallelization opportunities, and find that most of the loops require run-time testing, analysis of control flow, or some combination of the two. We present a new compile-time analysis technique that can be used to parallelize most of these remaining parallel loops. This technique is designed to not only improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run-time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating with each array data-flow value a predicate. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization.
1997
- [IJPP] Adaptive Granularity: Transparent Integration of Fine- and Coarse-Grain Communication. Daeyeon Park, Rafael H. Saavedra, and Sungdo Moon. International Journal of Parallel Programming, Oct 1997.
The granularity of shared data is one of the key factors affecting the performance of distributed shared memory machines (DSM). Given that programs exhibit quite different sharing patterns, providing only one or two fixed granularities cannot result in an efficient use of resources. On the other hand, supporting arbitrary granularity sizes significantly increases not only hardware complexity but software overhead as well. Furthermore, the efficient use of arbitrary granularities puts the burden on users to provide information about program behavior to compilers and/or runtime systems. These kinds of requirements tend to restrict the programmability of the shared memory model. In this paper, we present a new communication scheme, called Adaptive Granularity (AG). Adaptive Granularity makes it possible to transparently integrate bulk transfer into the shared memory model by supporting variable-size granularity and memory replication. It consists of two protocols: one for small data and another for large data. For small data, the standard hardware DSM protocol is used and the granularity is fixed to the size of a cache line. For large array data, the protocol for bulk data is used instead, and the granularity varies depending on the runtime sharing behavior of the applications. Simulation results show that AG improves performance up to 43% over the hardware implementation of DSM (e.g., DASH, Alewife). Compared with an equivalent architecture that supports fine-grain memory replication at the fixed granularity of a cache line (e.g., Typhoon), AG reduces execution time up to 35%.
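The protocol itself lives in the memory system, so no faithful C rendering exists; the fragment below is only a loose software caricature of the two-protocol dispatch, with fetch_remote as a placeholder rather than a real API and a made-up run-length heuristic standing in for the sharing-based adaptation described above.

```c
/* Very loose sketch (hypothetical): small shared data is fetched at
 * fixed cache-line granularity, while misses inside registered
 * large-array regions fetch a variable-size block whose size grows
 * while the access pattern stays sequential and falls back to a line
 * otherwise.  Invalidation handling is omitted. */
#include <stddef.h>

#define LINE     64
#define MAX_BULK 4096

struct region {               /* one registered large array            */
    char  *base;
    size_t len;
    size_t grain;             /* current transfer size for this region */
    char  *last_miss;
};

/* Placeholder for whatever mechanism moves bytes from the home node. */
extern void fetch_remote(void *local, size_t nbytes);

void handle_miss(struct region *r, char *addr)
{
    if (r == NULL || addr < r->base || addr >= r->base + r->len) {
        fetch_remote(addr, LINE);            /* ordinary data: one line */
        return;
    }
    /* Large-array data: if this miss continues the previous run,
     * keep growing the transfer size; otherwise fall back to a line. */
    if (r->last_miss != NULL && addr == r->last_miss + r->grain) {
        if (r->grain * 2 <= MAX_BULK)
            r->grain *= 2;
    } else {
        r->grain = LINE;
    }
    fetch_remote(addr, r->grain);
    r->last_miss = addr;
}
```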
1996
- [ICPP] The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching. Rafael H. Saavedra, Weihua Mao, Daeyeon Park, Jacqueline Chame, and Sungdo Moon. In The 1996 International Conference on Parallel Processing (ICPP’96), Aug 1996.
Unimodular transformations, tiling, and software prefetching are loop optimizations known to be effective in increasing parallelism, reducing cache miss rates, and eliminating processor stall time. Although these optimizations are individually quite effective, there is the expectation that even better improvements can be obtained by combining them. In this paper we show that this is indeed the case when unimodular transformations are combined with either tiling or software prefetching. However, our results also show that although combining tiling with prefetching tends to improve the performance of tiling alone, in some situations tiling can degrade the cache performance of software prefetching. The reasons for this unexpected behavior are threefold: 1) tiling introduces interference misses inside the localized space which are difficult to characterize with current techniques based on locality analysis; 2) prefetch predicates are computed using only estimates of the amount of capacity misses, so the latency induced by cache interference is not completely covered; and 3) tiling limits the maximum amount of latency that can be masked with prefetching.
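For a concrete reference point (a generic kernel, not one of the loops studied in the paper), the sketch below combines tiling with GCC/Clang-style software prefetching; the prefetch distance PF can only mask latency within the window that the tile size leaves available, which is the interaction discussed above.

```c
/* Generic example: a tiled matrix-multiply kernel with software
 * prefetching via the GCC/Clang built-in.  Tiling keeps a T x T block
 * of b resident in cache across the i loop, while __builtin_prefetch()
 * tries to hide the latency of the streaming accesses to a.  The
 * prefetch distance PF is bounded by the tile extent of the k loop.
 * c is assumed to be zero-initialized by the caller. */
#define T  64          /* tile size (elements)         */
#define PF 8           /* prefetch distance (elements) */

void matmul_tiled(int n, const double *a, const double *b, double *c)
{
    for (int kk = 0; kk < n; kk += T)
        for (int jj = 0; jj < n; jj += T)
            for (int i = 0; i < n; i++)
                for (int k = kk; k < kk + T && k < n; k++) {
                    double aik = a[i * n + k];
                    /* read prefetch, low temporal locality; prefetches
                     * past the array end are harmless no-ops */
                    __builtin_prefetch(&a[i * n + k + PF], 0, 1);
                    for (int j = jj; j < jj + T && j < n; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
}
```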