Last Modified On: May 8, 2007 4:12 PM PDT
The ability of software used in research and clinical healthcare settings to scale along with advances in hardware requires attention to threading issues. The representative set of factors discussed here can assist developers in identifying the challenges to address within specific software solutions in order to achieve long-term scalability.
By Matt Gillespie
The advancement of computing solutions in the healthcare industry is a core requirement for making today's dramatic rate of medical advances sustainable. Two core factors drive this necessity: the need to support research and patient care with complex modeling and analysis capabilities, and the ability to control the cost of healthcare delivery at hospitals and other point-of-care facilities.
The former family of requirements includes implementations such as research simulations in areas like pharmaceuticals and genetics, as well as automated manipulation and analysis of medical images, blood and tissue samples, and various forms of test data. The latter category includes streamlining business processes, managing supply chains, and reducing errors in an increasingly complex care environment.
Scaling in this context includes both accommodating larger data sets and taking good advantage of advances in hardware. Both types of solutions inherently benefit from the ability to accommodate larger bodies of data, which allows them to provide more comprehensive, high-quality results. At the same time, they must be able to leverage the increasingly parallel processor architectures that Intel and the computing industry at large are advancing. Those architectures consist of multi-core processing, symmetric multi-processor (SMP) systems, high-performance computing (HPC) architectures, and combinations thereof.
An important consideration for supporting any or all of these types of parallelism is to increase the degree of multi-threading in digital health software applications. An inherently complex undertaking, multi-threading software becomes increasingly demanding as the level of parallelism increases. That is, while multi-threading for two cores carries with it some challenges as well as performance penalties for poor threading practice, the allowable margin of error is far smaller when threading for 4, 8, 16, or more cores.
The demands of synchronization and the potential for conflicts in data access grow rapidly as the number of cores that a piece of software supports increases. Given that Intel has already demonstrated an 80-core proof-of-concept processor, it is clear that developers of digital health software must prepare for supporting increasing levels of parallelism in their applications in order to compete effectively in the future. This paper helps to lay the foundation for that effort by outlining some central factors that limit the ability of digital health application software to scale gracefully when larger numbers of processor cores are available.
A key consideration in assessing healthcare software optimization for multi-core architectures is the degree to which the application workload is inherently serial. When any large problem is broken down into smaller tasks, it is likely that the interdependencies between some of those tasks will prevent them from being performed strictly in parallel. For example, if the input data for task 'A' includes output data from task 'B', then task 'B' must be completed before task 'A' can be started. The simplest solution for the overall problem (1+2)*3 is to add 1+2 and then to multiply the sum by 3. While parallelizing the problem by multiplying 1*3 and 2*3 in parallel and then adding the results will yield the correct answer, that approach takes more work, which would correspond to a performance penalty.
Large problems, then, typically involve some serial portion, and the greater that portion is as a share of the overall application code, the greater the limitation it places on overall application scalability. Analysis of this type can quantify the theoretical maximum speedup possible from a given application using a specific number of processing cores by means of Amdahl's Law. Given that s is the fraction of the application that is inherently serial (expressed as a decimal in the range 0 to 1) and n is the number of available processing cores, the theoretical maximum speedup for perfect threading without any overhead is expressed in Figure 1:
Figure 1. Amdahl's Law
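Using the symbols defined above (s for the serial fraction, n for the number of cores), the standard statement of Amdahl's Law can be written as:

\[
\text{Speedup}(n) = \frac{1}{s + \dfrac{1 - s}{n}}
\]

The term s is the portion of the run time that no number of cores can shorten, while (1 - s)/n is the parallel portion divided evenly across n cores.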
To illuminate the significance of this formula, consider first the case where s—the amount of inherently serial code—is equal to zero. This condition comprises what is typically referred to as an embarrassingly parallel problem, and real-world approximations of it might include rendering or other interpretation of medical data, where each pixel (in the case of image-based studies such as CT or MRI) or each frame or time slice (in the case of time-based studies such as fMRI or EEG) can be rendered independently.
As the amount of serial code tends toward zero, the maximum theoretical speedup tends toward n. Thus, for example, an application with no serial code executing on an eight-core system could theoretically reach an 8x speedup. As the degree of inherently serial code in the application increases, the maximum theoretical speedup decreases, as depicted in Figure 2:
Figure 2. Maximum theoretical speedup from Amdahl's Law
Also consider the case where an infinite number of cores is available to the parallel application; that circumstance is useful when considering an open-ended future where the number of cores per processor continues to grow. As n tends toward infinity, then, the theoretical maximum speedup tends toward 1/s. This relationship is useful in the sense that it provides a rough means of gauging the long-term value of multi-threading to an application. For example, if an application is 25% serial, the theoretical maximum speedup of that application by multi-threading is 4x, regardless of the number of processing cores that are applied to it.
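The arithmetic in the preceding paragraphs can be checked with a few lines of Python. This is a minimal sketch of the standard Amdahl's Law formula; the function names `amdahl_speedup` and `speedup_ceiling` are illustrative and do not come from any Intel tool.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Theoretical maximum speedup, ignoring all threading overhead."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def speedup_ceiling(serial_fraction: float) -> float:
    """Limit of the speedup as the core count grows without bound (1/s)."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # An application that is 25% serial approaches, but never reaches, 4x.
    for cores in (2, 4, 8, 16, 80):
        print(cores, "cores:", round(amdahl_speedup(0.25, cores), 2))
    print("ceiling:", speedup_ceiling(0.25))
```

Running the loop shows how quickly the curve flattens: most of the available gain for a 25% serial application is already captured at modest core counts.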
As mentioned above, the theoretical maximum speedup calculated by Amdahl's Law does not account for threading overhead, which can be significant. Examples of this overhead include the following factors associated with managing threads:
Thread-creation overhead: The creation of threads takes system resources that cannot be used to perform the primary work being addressed by an application. The impact of this factor on scalability can be addressed by thread pooling (the creation of reusable threads), as well as threading larger (outer) loops where possible.
Lock-management (synchronization) overhead: Critical sections of code are those that require locking of data locations to prevent contention that could otherwise cause one thread to overwrite data being used by another thread in an uncontrolled manner. Since locking data locations restricts their use by other threads, this factor can cause one thread to wait for another to finish executing, often acting as the main source of threading overhead that slows down overall execution.
Lock-management overhead is discussed further in the "Synchronization Issues" section below. In some cases, the overhead associated with managing threads can outweigh the benefits associated with them. This issue is related to the concept of granularity, which is loosely defined as the ratio of computation to overhead; conceptually, dividing a problem into larger 'grains' of work (which implies fewer sections of overhead required between them) leads to higher efficiency. Intel® Thread Profiler can assist developers in identifying overhead, as well as the impact that overhead has on execution.
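Thread pooling and coarse granularity can be combined, as in the following sketch. It uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the helper names (`split_into_grains`, `sum_of_squares`, `process`) and the sum-of-squares workload are placeholders chosen for illustration. Note that in standard CPython the global interpreter lock limits the speedup threads can give CPU-bound code, so the example shows the structure of the technique rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_grains(samples, grain_count):
    """Divide the samples into at most grain_count contiguous chunks."""
    grain_size = max(1, -(-len(samples) // grain_count))  # ceiling division
    return [samples[i:i + grain_size] for i in range(0, len(samples), grain_size)]


def sum_of_squares(grain):
    """Placeholder for the real per-grain computation."""
    return sum(value * value for value in grain)


def process(samples, workers=4):
    """Run one large grain per worker instead of one task per sample."""
    grains = split_into_grains(samples, workers)
    # The pool creates its threads once and reuses them for every grain,
    # so thread-creation cost is not paid per work item.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, grains))


if __name__ == "__main__":
    print(process(list(range(1000))))
```

Submitting a handful of large grains, rather than one task per sample, keeps the ratio of computation to scheduling overhead high.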
Another limitation of Amdahl's Law is that it does not take into account the fact that as more compute resources become available, the natural tendency is that the problem size will also increase. For example, as more powerful computers become available, customers will expect more business needs to be addressed by the applications running on them, in addition to wanting more efficient results. Increase in the problem size often leads to a decrease in the proportion of serial computation, leading to an increase in scalability.
In practical terms, this might lead to using larger problem sets with more data points in the case of research and analysis applications, using higher resolution images or larger simulations. In the case of operational software for the hospital or clinical setting, more sophisticated functionality may be incorporated, such as comparing patient vital statistics against running averages or providing real-time data analysis as a diagnostic aid.
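The effect of a growing problem size on the serial fraction can be illustrated numerically. The sketch below assumes, purely for illustration, that the serial portion of the run time stays fixed while the parallel portion grows in proportion to the problem size; real applications may behave differently, and all names and timings here are hypothetical.

```python
def amdahl_speedup(serial_fraction, cores):
    """Theoretical maximum speedup, ignoring threading overhead."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def scaling_table(cores=8, serial_time=10.0, base_parallel_time=90.0,
                  growth_factors=(1, 2, 4, 8)):
    """Return (growth factor, serial fraction, speedup) for each problem size.

    Assumption: serial_time is constant; only the parallel work scales.
    """
    rows = []
    for factor in growth_factors:
        parallel_time = base_parallel_time * factor
        serial_fraction = serial_time / (serial_time + parallel_time)
        rows.append((factor, serial_fraction, amdahl_speedup(serial_fraction, cores)))
    return rows


if __name__ == "__main__":
    for factor, fraction, speedup in scaling_table():
        print(f"{factor}x problem: serial fraction {fraction:.3f}, "
              f"max speedup {speedup:.2f}")
```

Under this assumption, each increase in problem size lowers the serial fraction and moves the theoretical speedup closer to the core count.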
Another key issue that impacts the scalability of applications is imbalance of work between threads. To see why this is true, consider an extreme case involving an application that compares an individual patient's data against a large database to estimate hypothetical outcomes of specific courses of treatment. Consider a two-threaded application, where one thread queries a large database and performs comparative calculations on the data retrieved, while the other thread updates the text-based user interface. Here, the interface thread performs far less work than the other thread, resulting in idle processor resources while it waits for data to write to the interface.
While this example is far clearer than most real-world problems, it illustrates the concept of load imbalance. In fact, load imbalance is easier to spot in systems with larger numbers of execution cores and active threads, since in the case of a dual-core, two-threaded system, idle time can easily be mistaken for serial sections of code. When more cores are available, a drop in the CPU utilization of several cores at once suggests that multiple threads are waiting for another thread to complete its assigned work, providing a clearer indication of load imbalance.
Therefore, one means of detecting load imbalance is to use perfmon under Windows* or mpstat under Linux* and to watch for unexpected drops in utilization of individual cores during execution. Another is to use Profile View in Intel Thread Profiler to watch for significant differences in the time spent in Active thread state among threads, as shown in Figure 3.
Figure 3. Load imbalance indicated by differences in Active thread state in Intel® Thread Profiler
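The cost of load imbalance can also be estimated from per-thread active times, such as those a profiler reports. The following sketch assumes that every thread starts at the same moment and runs on its own core, so the elapsed time equals the active time of the busiest thread; the function name and the sample timings are hypothetical.

```python
def load_balance_report(busy_seconds):
    """Summarize imbalance from each thread's active time.

    Returns (elapsed time, overall core utilization, idle time per thread).
    Assumes one thread per core and a simultaneous start.
    """
    if not busy_seconds or max(busy_seconds) <= 0:
        raise ValueError("need at least one thread with positive active time")
    elapsed = max(busy_seconds)
    utilization = sum(busy_seconds) / (len(busy_seconds) * elapsed)
    idle = [elapsed - seconds for seconds in busy_seconds]
    return elapsed, utilization, idle


if __name__ == "__main__":
    # A database thread doing most of the work next to a light interface thread.
    print(load_balance_report([9.0, 1.0]))
    # The same total work split evenly.
    print(load_balance_report([5.0, 5.0]))
```

With the same ten seconds of total work, the even split finishes in five seconds at full utilization, while the uneven split takes nine seconds and leaves one core idle for most of the run.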
As mentioned above, synchronization associated with locks on critical sections of code can be the most significant source of threading overhead in an application. These locks are used to manage access to shared resources so that conflicts do not arise, by allowing only one thread to enter a critical section at once. Effectively, then, critical sections serialize execution in certain regions of code. If the length of time a thread spends within a critical section is long, other threads may need to wait for a long time in order to acquire the lock to that section.
Short critical sections can minimize the effect on performance. At the same time, however, large numbers of small critical sections can introduce significant overhead associated with acquiring and releasing locks. Thus, a balance is needed between the extremes of a single very large critical section for an entire body of data on one hand, and a multitude of small critical sections on the other. Intel Thread Profiler can aid in arriving at the correct balance. Highly contended synchronization objects in Intel Thread Profiler are associated with large amounts of Locks or Impact time during execution.
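One common way to strike this balance is to let each thread do its bulk work on private data and enter a critical section only once, to merge its result. The sketch below uses Python's standard `threading.Lock`; the function `count_matches` and the record-counting workload are hypothetical, chosen only to show where the lock is held.

```python
import threading


def count_matches(records, predicate, workers=4):
    """Count records satisfying predicate using a short critical section."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        # Bulk work happens on thread-private data, outside the lock.
        local_count = sum(1 for record in chunk if predicate(record))
        # The critical section is a single addition, entered once per thread.
        with lock:
            total += local_count

    # Interleaved slices give each worker a similar share of the records.
    chunks = [records[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(count_matches(list(range(1000)), lambda value: value % 3 == 0))
```

Taking the lock inside the loop for every matching record would protect the same data, but it would pay the acquire and release cost once per record instead of once per thread.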
In many cases, it is good practice to use synchronization routines provided by a threading API, such as OpenMP*, Pthreads*, or the Win32 API, rather than hand-coding. This advice can help to avoid scalability issues as the number of threads increases, particularly complex ones that may arise in association with porting across processor architectures. Moreover, since the synchronization routines associated with these APIs are well-tuned, their use avoids introducing complexity to the general application-tuning process.
As healthcare costs continue to escalate and the industry becomes more competitive, medical researchers and care facilities will continue to look for productivity enhancements and new functionality from computing technology. The ability to perform more sophisticated research activities and increase the efficiency of operations will help to drive the development of the industry, which represents a significant opportunity to providers of hardware and software.
Increasing hardware parallelization, as exemplified by the proliferation of execution cores in processors, is the primary means by which the computing industry will deliver higher levels of performance for the foreseeable future. In order to take advantage of these advances, it is necessary for providers of Digital Health software to increase the degree of multi-threading in their products, as well as the quality of their multi-threading practices.
Developers should take note of the factors that impact the scalability of their applications on hardware with increasingly large numbers of execution cores. By doing so, they can increase performance, allowing them to add features and functionality to their products to become more competitive in their market segments.
The following materials provide a point of departure for further research on this topic:
Healthcare Research and Solutions from Intel are helping to accelerate improvements in healthcare quality by delivering technology solutions to enhance health and wellness.
Health Research & Innovation at Intel explores the ways in which ubiquitous computing can support the daily health and wellness needs of people in their homes and everyday lives.
Intel® Multi-Core Developer Community brings together a wide range of developer resources related to creating software that takes optimal advantage of multi-core processing.
Intel® Multi-Core Technology and Research Portal provides access to a variety of resources about current multi-core technology at Intel, as well as ongoing innovation and research.
The Mobile Clinical Assistant is a mobile point-of-care solution for clinicians that Intel helped to develop, which can help reduce medication dispensing errors and ease staff workloads.
Matt Gillespie is an independent technical author and editor working out of the Chicago area and specializing in emerging hardware and software technologies. Before going into business for himself, Matt developed training for software developers at Intel Corporation and worked in Internet Technical Services at California Federal Bank. He spent his early years as a writer and editor in the fields of financial publishing and neuroscience.