publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2011
- [CNCC] Maximizing Synchronization Coverage Via Controlling Thread Schedule. HyoYoung Kim, DaeHyun Cho, and Sungdo Moon. In 2011 IEEE Consumer Communications and Networking Conference (CNCC’11), Jan 2011.
Coverage analysis has long played an important role in testing traditional single-threaded applications. As multicore hardware becomes more widespread, the number of multithreaded applications grows as well. In a multithreaded environment, traditional code coverage models cannot capture the nondeterministic behavior of simultaneously running threads. The concept of synchronization coverage was introduced to solve this problem. In this paper, we present a new coverage testing tool that maximizes synchronization coverage in less time and with fewer repetitions. After locating synchronization primitives, our tool instruments special code segments at these points and drives the application toward maximum synchronization coverage by scheduling individual threads. This achieves maximum synchronization coverage without modifying source code or writing unit test code. We implemented this method and showed that it reaches maximum synchronization coverage 1.5 to over 144 times faster than a random method.
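The tool itself is not reproduced here; the following is only a minimal sketch of the idea for a pthreads program, assuming a hypothetical wrapper `covered_mutex_lock` that records, per lock site, whether the acquiring thread had to block, and occasionally yields to perturb the schedule toward uncovered interleavings.

```c
/* Minimal sketch of synchronization-coverage instrumentation
 * (hypothetical; not the tool described in the paper).
 * Each lock site gets an integer id.  For every site we record two
 * coverage events: the acquiring thread blocked, or it did not.
 * A short random yield before the acquire perturbs the thread
 * schedule so that uncovered events are reached sooner. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_SITES 128

static int covered_blocked[MAX_SITES];    /* site observed with contention    */
static int covered_unblocked[MAX_SITES];  /* site observed without contention */

void covered_mutex_lock(pthread_mutex_t *m, int site_id)
{
    /* Schedule perturbation: occasionally give up the CPU so other
     * threads reach the same lock and create contention. */
    if (rand() % 4 == 0)
        sched_yield();

    if (pthread_mutex_trylock(m) == 0) {
        covered_unblocked[site_id] = 1;   /* acquired without blocking */
    } else {
        pthread_mutex_lock(m);            /* had to wait: contention covered */
        covered_blocked[site_id] = 1;
    }
}

void report_coverage(int nsites)
{
    int hit = 0;
    for (int i = 0; i < nsites; i++)
        hit += covered_blocked[i] + covered_unblocked[i];
    printf("synchronization coverage: %d / %d events\n", hit, 2 * nsites);
}
```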
2004
- [CGO] SYZYGY - A Framework for Scalable Cross-Module IPO. Sungdo Moon, Xinliang D. Li, Robert Hundt, Dhruva R. Chakrabarti, Luis A. Lozano, Uma Srinivasan, and Shin-Ming Liu. In 2004 International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization (CGO’04), Mar 2004.
Performing analysis across module boundaries for an entire program is important for exploiting several runtime performance opportunities. However, due to scalability problems in existing full-program analysis frameworks, such performance opportunities are only realized by paying tremendous compile-time costs. Alternative solutions, such as partial compilations or user assertions, are complicated or unsafe and, as a result, not many commercial applications are compiled today with cross-module optimizations. We present SYZYGY, a practical framework for performing efficient, scalable, interprocedural optimizations. The framework is implemented in the HP-UX Itanium® compilers and we have successfully compiled many very large applications consisting of millions of lines of code. We achieved performance improvements of up to 40% over optimization level two and compilation time improvements on the order of 100% and more compared to a previous approach.
2002
- [Dissertation] Combining Compile-Time and Run-Time Parallelization. Sungdo Moon. University of Southern California, May 2002.
In this dissertation, we investigate how good today’s parallelizing compilers are, what opportunities remain to improve them, and what technology is needed to exploit the remaining opportunities. To answer these questions, we perform experiments that measure the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. The experimental results demonstrate that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis. In order to exploit these remaining opportunities, we have developed a new compile-time analysis technique that can be used to parallelize most of these remaining parallelizable loops. This technique is designed not only to improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating with each array data-flow value a predicate. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization. We demonstrate the effectiveness of predicated data-flow analysis by implementing it in the Stanford SUIF compiler and performing experiments on three benchmark suites and one additional program. Experimental results with a prototype implementation show that predicated array data-flow analysis is promising at finding additional parallel loops, as it parallelizes more than 40% of the remaining inherently parallel loops left unparallelized by the SUIF compiler. We demonstrate improved speedups with negligible run-time overhead for 5 of the 6 programs in our benchmark suite where significant speedup improvement is possible.
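As an illustration only (the loop below is hypothetical, not taken from the dissertation), the sketch shows the kind of predicate-guarded code such an approach could emit: a temporary array t is privatizable only when every read of it is covered by a write earlier in the same iteration, a fact that here reduces to the cheap run-time test k <= m.

```c
/* Hypothetical example.  The temporary t is reused across iterations:
 * each iteration writes t[0..m-1] and then reads t[0..k-1].  The reads
 * are covered by the same iteration's writes only if k <= m, which may
 * be unknown at compile time.  A predicated analysis can attach (k <= m)
 * to the "t is privatizable" data-flow value and emit it as a run-time
 * test guarding a parallel version, instead of giving up. */
#include <stdlib.h>

void kernel(double *a, const double *b, double *t, int n, int m, int k)
{
    if (k <= m) {                    /* run-time test derived from the predicate */
        #pragma omp parallel
        {
            double *tp = malloc((size_t)m * sizeof *tp);  /* private copy of t */
            #pragma omp for
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < m; j++) tp[j] = b[i] * j;
                for (int j = 0; j < k; j++) a[i] += tp[j];
            }
            free(tp);
        }
    } else {                         /* original sequential loop, shared t */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) t[j] = b[i] * j;
            for (int j = 0; j < k; j++) a[i] += t[j];
        }
    }
}
```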
2000
- [TPDS] Evaluating Automatic Parallelization in SUIF. Sungdo Moon, Byoungro So, and Mary W. Hall. IEEE Transactions on Parallel and Distributed Systems, Jan 2000.
This paper presents the results of an experiment to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by the Stanford SUIF compiler, a state-of-the-art automatic parallelization system targeting shared-memory multiprocessor architectures. For the purposes of this experiment, we have developed a run-time parallelization test called the Extended Lazy Privatizing Doall (ELPD) test, which is able to simultaneously test multiple loops in a loop nest. The ELPD test identifies a specific type of parallelism where each iteration of the loop being tested accesses independent data, possibly by making some of the data private to each processor. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input. The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results.
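A rough sketch of the kind of shadow-array marking such a run-time test performs is given below. This is hypothetical code, not the actual ELPD implementation; it handles a single array, ignores last-value copy-out, and simply reports whether the loop could run in parallel with the array privatized.

```c
/* Illustrative LPD/ELPD-style marking (not the paper's implementation).
 * The candidate loop is run once with every access to the array under
 * test routed through mark_read()/mark_write().  A read is "exposed"
 * if it is not preceded by a write to the same element within the same
 * iteration.  The loop is reported parallelizable (with the array
 * privatized) unless some element is written in one iteration and
 * exposed-read in a different one. */
#include <string.h>

enum { NELEMS = 1 << 16 };

static int write_iter[NELEMS];     /* most recent writing iteration, or -1   */
static int exposed_iter[NELEMS];   /* iteration of first exposed read, or -1 */
static int cur_iter;
static int independent = 1;

void lpd_begin(void)
{
    memset(write_iter,   -1, sizeof write_iter);    /* all-ones bytes == -1 */
    memset(exposed_iter, -1, sizeof exposed_iter);
    independent = 1;
}

void lpd_begin_iteration(int i) { cur_iter = i; }

void mark_write(int e)
{
    if (exposed_iter[e] != -1 && exposed_iter[e] != cur_iter)
        independent = 0;           /* exposed-read by another iteration */
    write_iter[e] = cur_iter;
}

void mark_read(int e)
{
    if (write_iter[e] == cur_iter)
        return;                    /* covered by a write in this iteration */
    if (exposed_iter[e] == -1)
        exposed_iter[e] = cur_iter;
    if (write_iter[e] != -1)       /* written earlier by another iteration */
        independent = 0;
}

int lpd_loop_is_parallel(void) { return independent; }
```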
1999
- [ICS] A Tile Selection Algorithm for Data Locality and Cache Interference. Jacqueline Chame and Sungdo Moon. In The 13th International Conference on Supercomputing (ICS’99), Jun 1999.
Loop tiling is a well-known compiler transformation that increases data locality, exposes parallelism and reduces synchronization costs. Tiling increases the amount of data reuse that can be exploited by reordering the loop iterations so that accesses to the same data are closer together in time. However, tiled loops often suffer from cache interference in the direct-mapped or low-associativity caches typically found in state-of-the-art microprocessors. A solution to this problem is to choose a tile size that does not exhibit self interference. In this paper, we propose a new tile selection algorithm for eliminating self interference and simultaneously minimizing capacity and cross-interference misses. We have automated the algorithm in the SUIF compiler and used it to generate tiles for a range of problem sizes for three scientific computations. Our experimental results show that the algorithm consistently finds tiles that yield lower miss rates than existing tile selection algorithms.
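The published algorithm is analytical; the brute-force sketch below only illustrates the underlying constraint, assuming a direct-mapped cache: a square tile of a row-major array is free of self interference when the cache intervals occupied by its rows are pairwise disjoint.

```c
/* Illustrative brute-force search (not the paper's algorithm) for the
 * largest square tile free of self interference in a direct-mapped
 * cache.  The array is row major with leading dimension ld (elements)
 * and the cache holds csize elements.  Row r of a tile of width w
 * starts at cache set (r * ld) % csize and occupies w consecutive
 * sets; the tile is conflict-free if these intervals are disjoint. */
#include <stdio.h>

static int rows_conflict(int start_a, int start_b, int w, int csize)
{
    /* Do the length-w intervals starting at start_a and start_b
     * overlap modulo the cache size? */
    int d = (start_b - start_a) % csize;
    if (d < 0) d += csize;
    return d < w || csize - d < w;
}

static int tile_is_conflict_free(int t, int ld, int csize)
{
    for (int i = 0; i < t; i++)
        for (int j = i + 1; j < t; j++)
            if (rows_conflict((i * ld) % csize, (j * ld) % csize, t, csize))
                return 0;
    return 1;
}

int largest_conflict_free_tile(int ld, int csize)
{
    int best = 1;
    for (int t = 2; t * t <= csize; t++)     /* footprint must also fit */
        if (tile_is_conflict_free(t, ld, csize))
            best = t;
    return best;
}

int main(void)
{
    /* Example: 8K-element direct-mapped cache, leading dimension 1024
     * (a classic pathological case); prints 8. */
    printf("largest conflict-free square tile: %d\n",
           largest_conflict_free_tile(1024, 8192));
    return 0;
}
```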
- [PPoPP] Evaluation of Predicated Array Data-Flow Analysis for Automatic Parallelization. Sungdo Moon and Mary W. Hall. In The 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’99), May 1999.
This paper presents an evaluation of a new analysis for parallelizing compilers called predicated array data-flow analysis. This analysis extends array data-flow analysis for parallelization and privatization to associate predicates with data-flow values. These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation. As compared to previous work that combines predicates with array data-flow analysis, our approach is distinguished by two features: (1) it derives low-cost, run-time parallelization tests; and, (2) it incorporates predicate embedding and predicate extraction, which translate between the domain of predicates and data-flow values to derive more precise analysis results. We present extensive experimental results across three benchmark suites and one additional program, demonstrating that predicated array data-flow analysis parallelizes more than 40% of the remaining inherently parallel loops left unparallelized by the SUIF compiler and that it yields improved speedups for 5 programs.
- [SP] Combining Compile-Time and Run-Time Parallelization. Sungdo Moon, Byoungro So, and Mary W. Hall. Scientific Programming, May 1999.
This paper demonstrates that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. We present results of measurements on programs from two benchmark suites – SPECFP95 and NAS sample benchmarks – which identify inherently parallel loops in these programs that are missed by the compiler. We characterize remaining parallelization opportunities, and find that most of the loops require run-time testing, analysis of control flow, or some combination of the two. We present a new compile-time analysis technique that can be used to parallelize most of these remaining loops. This technique is designed to not only improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run-time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating predicates with array data-flow values. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization.
1998
- [ICS] Predicated Array Data-Flow Analysis for Run-Time Parallelization. Sungdo Moon, Mary W. Hall, and Brian R. Murphy. In The 12th International Conference on Supercomputing (ICS’98), Jul 1998.
This paper presents a new analysis for parallelizing compilers called predicated array data-flow analysis, whereby array data-flow analysis for parallelization and privatization is extended to associate predicates with data-flow values. These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions, which can consist of arbitrary program statements, can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation. We have implemented predicated array data-flow analysis in the Stanford SUIF compiler. We describe features of the implementation and present experimental results that demonstrate this analysis improves the performance of three programs from the SPEC95FP benchmark suite.
- [ICS] Measuring the Effectiveness of Automatic Parallelization in SUIF. Byoungro So, Sungdo Moon, and Mary W. Hall. In The 12th International Conference on Supercomputing (ICS’98), Jul 1998.
This paper presents both an experiment and a system for inserting run-time dependence and privatization testing. The goal of the experiment is to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by state-of-the-art parallelizing compiler technology. We perform run-time testing of data accessed within all candidate loops not parallelized by the compiler to identify which of these loops could safely execute in parallel for the given program input. This system extends the Lazy Privatizing Doall (LPD) test to simultaneously instrument multiple loops in a nest. Using the results of interprocedural array dataflow analysis, we avoid unnecessary instrumentation of arrays with compile-time provable dependences or loops nested inside outer parallelized loops. We have implemented the system in the Stanford SUIF compiler and have measured programs in three benchmark suites.
- [USC TR] Hyperblocking: A Data Reorganization Method to Eliminate Cache Conflicts in Tiled Loop Nests. Sungdo Moon and Rafael H. Saavedra. Technical report, University of Southern California, Feb 1998.
This paper presents hyperblocking, or hypertiling, a novel optimization technique that makes it possible to virtually eliminate the self and cross interference misses of tiled loop nests, while significantly reducing the overhead involved in copying data blocks. In hyperblocking, all arrays used in a tiled loop nest are reorganized at the beginning of the computation such that all data blocks are guaranteed to map into disjoint and contiguous cache regions. Doing this effectively eliminates the occurrence of interference misses in many dense-matrix computations. When data prefetching is used in combination with hyperblocking, it is possible to completely hide the latency of capacity misses, and because no cache conflicts can occur, the prefetch distance can be increased to cover larger latencies without causing any cache pollution. Furthermore, the performance of hyperblocked loop nests no longer depends on input sizes and starting addresses, so it is possible to predict the execution time of tiled loop nests for any input size. In simulations on a couple of machines, hyperblocking significantly improved the performance of several tiled loop nests and outperformed previous approaches such as copy optimization and tile size selection.
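A simplified sketch of the copy step is shown below (hypothetical; the scheme described above also controls where each block lands in the cache, which is not shown): the array is copied into block-major order before the tiled computation, so every tile occupies a contiguous chunk of memory and cannot self-interfere.

```c
/* Simplified illustration of the data-reorganization step: copy an
 * n x n row-major matrix into "block-major" layout, where each T x T
 * tile occupies a contiguous chunk.  Contiguous tiles cannot
 * self-interfere in the cache; choosing where each chunk is placed
 * (omitted here) is what controls cross interference.
 * Assumes T divides n. */
#include <string.h>

void copy_to_blocked(double *dst, const double *src, int n, int T)
{
    int nb = n / T;                          /* tiles per dimension */
    for (int bi = 0; bi < nb; bi++)
        for (int bj = 0; bj < nb; bj++) {
            /* destination chunk for tile (bi, bj) */
            double *tile = dst + (size_t)(bi * nb + bj) * T * T;
            for (int i = 0; i < T; i++)
                memcpy(tile + (size_t)i * T,
                       src + (size_t)(bi * T + i) * n + bj * T,
                       (size_t)T * sizeof *src);
        }
}

/* Element (i, j) of the original matrix is then found at
 *   dst[((i/T) * (n/T) + (j/T)) * T*T + (i%T) * T + (j%T)]. */
```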
- [LCR] A Case for Combining Compile-Time and Run-Time Parallelization. Sungdo Moon, Byoungro So, Mary W. Hall, and Brian R. Murphy. In The 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers (LCR’98), May 1998.
This paper demonstrates that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and, (2) they must take control flow into account during analysis. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. We present results of measurements on programs from two benchmark suites — SPECfp95 and NAS sample benchmarks — which identify inherently parallel loops in these programs that are missed by the compiler. We characterize remaining parallelization opportunities, and find that most of the loops require run-time testing, analysis of control flow, or some combination of the two. We present a new compile-time analysis technique that can be used to parallelize most of these remaining parallel loops. This technique is designed to not only improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run-time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating with each array data-flow value a predicate. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization.
1997
- [IJPP] Adaptive Granularity: Transparent Integration of Fine- and Coarse-Grain Communication. Daeyeon Park, Rafael H. Saavedra, and Sungdo Moon. International Journal of Parallel Programming, Oct 1997.
The granularity of shared data is one of the key factors affecting the performance of distributed shared memory machines (DSM). Given that programs exhibit quite different sharing patterns, providing only one or two fixed granularities cannot result in an efficient use of resources. On the other hand, supporting arbitrary granularity sizes significantly increases not only hardware complexity but software overhead as well. Furthermore, the efficient use of arbitrary granularities puts the burden on users to provide information about program behavior to compilers and/or runtime systems. These kinds of requirements tend to restrict the programmability of the shared memory model. In this paper, we present a new communication scheme, called Adaptive Granularity (AG). Adaptive Granularity makes it possible to transparently integrate bulk transfer into the shared memory model by supporting variable-size granularity and memory replication. It consists of two protocols: one for small data and another for large data. For small data, the standard hardware DSM protocol is used and the granularity is fixed to the size of a cache line. For large array data, the protocol for bulk data is used instead, and the granularity varies depending on the runtime sharing behavior of the applications. Simulation results show that AG improves performance up to 43% over the hardware implementation of DSM (e.g., DASH, Alewife). Compared with an equivalent architecture that supports fine-grain memory replication at the fixed granularity of a cache line (e.g., Typhoon), AG reduces execution time up to 35%.
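The protocol itself lives in the memory system, so no faithful C rendering exists; the fragment below is only a loose software caricature of the two-protocol dispatch, with fetch_remote as a placeholder rather than a real API and a made-up run-length heuristic standing in for the sharing-based adaptation described above.

```c
/* Very loose sketch (hypothetical): small shared data is fetched at
 * fixed cache-line granularity, while misses inside registered
 * large-array regions fetch a variable-size block whose size grows
 * while the access pattern stays sequential and falls back to a line
 * otherwise.  Invalidation handling is omitted. */
#include <stddef.h>

#define LINE     64
#define MAX_BULK 4096

struct region {               /* one registered large array            */
    char  *base;
    size_t len;
    size_t grain;             /* current transfer size for this region */
    char  *last_miss;
};

/* Placeholder for whatever mechanism moves bytes from the home node. */
extern void fetch_remote(void *local, size_t nbytes);

void handle_miss(struct region *r, char *addr)
{
    if (r == NULL || addr < r->base || addr >= r->base + r->len) {
        fetch_remote(addr, LINE);            /* ordinary data: one line */
        return;
    }
    /* Large-array data: if this miss continues the previous run,
     * keep growing the transfer size; otherwise fall back to a line. */
    if (r->last_miss != NULL && addr == r->last_miss + r->grain) {
        if (r->grain * 2 <= MAX_BULK)
            r->grain *= 2;
    } else {
        r->grain = LINE;
    }
    fetch_remote(addr, r->grain);
    r->last_miss = addr;
}
```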
1996
- [ICPP] The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching. Rafael H. Saavedra, Weihua Mao, Daeyeon Park, Jacqueline Chame, and Sungdo Moon. In The 1996 International Conference on Parallel Processing (ICPP’96), Aug 1996.
Unimodular transformations, tiling, and software prefetching are loop optimizations known to be effective in increasing parallelism, reducing cache miss rates, and eliminating processor stall time. Although these optimizations are individually quite effective, there is the expectation that even better improvements can be obtained by combining them. In this paper we show that this is indeed the case when unimodular transformations are combined with either tiling or software prefetching. However, our results also show that although combining tiling with prefetching tends to improve the performance of tiling alone, in some situations tiling can degrade the cache performance of software prefetching. The reasons for this unexpected behavior are threefold: 1) tiling introduces interference misses inside the localized space which are difficult to characterize with current techniques based on locality analysis; 2) prefetch predicates are computed using only estimates of the amount of capacity misses, so the latency induced by cache interference is not completely covered; and 3) tiling limits the maximum amount of latency that can be masked with prefetching.
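For a concrete reference point (a generic kernel, not one of the loops studied in the paper), the sketch below combines tiling with GCC/Clang-style software prefetching; the prefetch distance PF can only mask latency within the window that the tile size leaves available, which is the interaction discussed above.

```c
/* Generic example: a tiled matrix-multiply kernel with software
 * prefetching via the GCC/Clang built-in.  Tiling keeps a T x T block
 * of b resident in cache across the i loop, while __builtin_prefetch()
 * tries to hide the latency of the streaming accesses to a.  The
 * prefetch distance PF is bounded by the tile extent of the k loop.
 * c is assumed to be zero-initialized by the caller. */
#define T  64          /* tile size (elements)         */
#define PF 8           /* prefetch distance (elements) */

void matmul_tiled(int n, const double *a, const double *b, double *c)
{
    for (int kk = 0; kk < n; kk += T)
        for (int jj = 0; jj < n; jj += T)
            for (int i = 0; i < n; i++)
                for (int k = kk; k < kk + T && k < n; k++) {
                    double aik = a[i * n + k];
                    /* read prefetch, low temporal locality; prefetches
                     * past the array end are harmless no-ops */
                    __builtin_prefetch(&a[i * n + k + PF], 0, 1);
                    for (int j = jj; j < jj + T && j < n; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
}
```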