This article is intended for programmers (especially CFD developers and engineers) who want to test the scalability or performance of their applications on modern processors using either MPI (Message Passing Interface) or OpenMP (Open Multi-Processing).
Modern computer hardware is loaded with technologies such as Turbo Boost or Turbo Core, vector registers, and features that allow more threads to run than there are physical execution units, all of which accelerate code and increase parallel execution of tasks. However, the effect of some of these technologies on performance varies with load, memory requirements, and other parameters; it is neither uniform across processors nor always what a developer wants, since it can paint a misleading picture of speedup.
Parallelization aims at faster execution of a program using multiple processes or threads on a desired number of processing units, i.e. processors. This article covers the basic concepts and other information that every programmer or tester should know before testing an application for speedup and efficiency, along with best practices for measuring them. It does not intend to teach or guide ways of doing parallelization. Before going further, it is worth brushing up a few terms used in the context of parallel computation.
Basic Terminology
Speedup
Speedup is the ratio of the time taken on a single processor to the time taken on multiple processors: S(n) = T(1) / T(n). The higher the value, the better the speedup. Typically an application's speedup curve flattens as the number of processors increases, as commonly seen in CFD (Computational Fluid Dynamics) solvers using MPI.
Linear Speedup or Ideal Speedup:
Linear speedup is achieved when doubling the number of processors halves the execution time. It is an ideal case and not always achieved.
Super-linear Speedup:
A special case in which the speedup achieved is greater than the number of processors or processing units on which the case is run. One of the reasons is the cache effect in modern processors: as the problem is divided among more processors, each partition fits better into cache.
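As an illustration of the definition above, speedup can be computed directly from measured wall-clock times. The timings below are made-up values for illustration only, not benchmark results:

```python
# Hypothetical wall-clock times (seconds): times[n] is the run time on n processors.
times = {1: 480.0, 2: 250.0, 4: 135.0, 8: 80.0}

def speedup(t_serial, t_parallel):
    """Speedup S(n) = T(1) / T(n)."""
    return t_serial / t_parallel

for n in sorted(times):
    s = speedup(times[1], times[n])
    # Super-linear speedup would show up as s > n (often a cache effect).
    label = "super-linear" if s > n else "linear or below"
    print(f"n={n}: speedup={s:.2f} ({label})")
```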
Efficiency
Efficiency measures processor/resource utilization. It is the ratio of the time taken on a single processor to the total processor time consumed by the parallel run (i.e. parallel run time x number of processors): E(n) = T(1) / (n x T(n)). Its value varies from 0 to 1; the higher the efficiency, the better the utilization of resources. A program running on a single processor, or one achieving linear speedup, has an efficiency of one. Lower efficiency usually points to a program that is hard to parallelize, or to communication and synchronization overheads.
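Efficiency follows the same pattern as speedup; a minimal sketch with made-up timings:

```python
def efficiency(t_serial, t_parallel, n):
    """Efficiency E(n) = T(1) / (n * T(n)); 1.0 means perfect resource utilization."""
    return t_serial / (n * t_parallel)

# Hypothetical numbers: a run with linear speedup has efficiency 1.0.
print(efficiency(100.0, 50.0, 2))   # 2 processors, half the time: perfect
print(efficiency(100.0, 30.0, 4))   # overheads pull efficiency below 1.0
```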
Amdahl’s law
Amdahl’s law lets one determine the maximum speedup that can be expected from parallelizing a program or algorithm. Every parallel program consists of two sections, a parallel section and a sequential section. The parallel section is the portion whose work can be distributed among processors, while the sequential section cannot be divided and must be executed serially no matter how many processors are used. A program's speedup is therefore limited by the fraction of time spent in the sequential section.
Speedup(n) = 1 / ((1 - f) + f / n)

where f is the fraction of the code that can be parallelized and n is the number of processors used. So if a code spends 80% of its single-processor execution time in the section that can be parallelized (f = 0.8), the maximum speedup one can expect, as n grows large, is 1 / (1 - 0.8) = 5.
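The law is easy to evaluate numerically. The sketch below reproduces the 80%-parallel example: with f = 0.8 the speedup approaches, but never exceeds, 5:

```python
def amdahl_speedup(f, n):
    """Maximum expected speedup for parallel fraction f on n processors:
    S(n) = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - f) + f / n)

for n in (1, 2, 4, 16, 1024):
    print(f"n={n:5d}: max speedup = {amdahl_speedup(0.8, n):.3f}")
# The bound 1 / (1 - f) = 5 is never exceeded, no matter how many processors.
```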
What should a programmer know
Once the program is parallelized, the next task is to benchmark it, i.e. to measure the speedup (scale-up) and efficiency of the code. Using Amdahl's law, one can estimate the maximum speedup obtainable from parallelizing the code. However, such analysis assumes the hardware performs identically throughout, which is not the case with modern processors. With technologies introduced to utilize resources efficiently and maximize usage, both the operating frequency and the cache behaviour depend on the processes being run and their memory usage. The point is that when using n processors it is natural to expect the same performance from each of them, but in practice this is not the case. Below we briefly discuss Hyper-Threading and Turbo Boost/Turbo Core technology before moving on to the guidelines.
Hyper-Threading
Hyper-Threading is an Intel-patented technology that makes one physical core appear as two logical cores: physically there is one core, but to the user it looks like two. It relies on efficient use of processor resources to let two threads execute concurrently on the same core, usually delivering more throughput. However, for memory-intensive calculations Hyper-Threading can also hurt performance: two processes share one physical core's resources (cache and others), which can make execution slower than if two physical cores were used. So while measuring speedup, Hyper-Threading should be switched off; otherwise you may see lower speedup or, in some cases, even higher execution time as the processor count increases.
Suggestion: Switch off Hyper-Threading. It can be disabled from the BIOS; search online for instructions specific to your machine.
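On Linux, one rough way to check whether Hyper-Threading (or any simultaneous multithreading) is active is to compare the logical CPU count with the number of unique physical cores reported in /proc/cpuinfo. This is a Linux-only sketch, and the file's layout varies across architectures, so the function returns None for the physical count when it cannot tell:

```python
import os

def logical_and_physical_cpus():
    """Return (logical CPU count, unique physical core count or None).
    If logical > physical, simultaneous multithreading is enabled."""
    logical = os.cpu_count()
    cores = set()
    phys_id = core_id = None
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys_id = line.split(":", 1)[1].strip()
                elif line.startswith("core id"):
                    core_id = line.split(":", 1)[1].strip()
                elif not line.strip():  # blank line ends one processor entry
                    if phys_id is not None and core_id is not None:
                        cores.add((phys_id, core_id))
                    phys_id = core_id = None
    except OSError:
        return logical, None  # not Linux, or /proc unavailable
    if phys_id is not None and core_id is not None:  # flush last entry
        cores.add((phys_id, core_id))
    return logical, len(cores) or None

logical, physical = logical_and_physical_cpus()
print(f"logical CPUs: {logical}, physical cores: {physical}")
```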
Turbo Boost Technology
Turbo Boost (Intel) or Turbo Core (AMD) technology enables cores to operate at a frequency higher than the base operating frequency, increasing processor speed. Frequency is essentially the speed of the processor: the higher the frequency, the faster the processor, and the greater the heat generation. This heat generation is what limits the frequency at which processors can operate.
Background:
Multicore processors are designed to run at a base operating frequency that keeps them within safe limits, i.e. below the TDP (Thermal Design Power: the maximum amount of generated heat that the cooling system is designed to dissipate). However, some cores may sit idle under a light workload, so less heat is generated than the TDP allows. In such a scenario it is useful for the busy cores to operate above the base frequency, which increases execution speed.
Effect: A run on a small number of processors may execute at a higher clock frequency than a run on a large number of processors, giving confusing speedup data.
Suggestion: Switch off Turbo Boost (Intel) or Turbo Core (AMD) technology. It can be disabled from the BIOS; search online for instructions, or to find out whether your hardware has this feature.
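On Linux the effect can be observed directly: the cpufreq subsystem exposes each core's current clock through sysfs. A minimal sketch, assuming the standard sysfs path (the function returns None where cpufreq is unavailable, e.g. in many virtual machines):

```python
def cpu_freq_khz(cpu=0):
    """Current clock of one core in kHz via Linux cpufreq sysfs, or None.
    With Turbo Boost/Turbo Core enabled, a loaded core can report a value
    above its base frequency."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq"
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None  # no cpufreq (e.g. VM) or unexpected file contents

print(cpu_freq_khz(0))
```

Polling this value while a run is in progress shows whether cores are boosting above the base frequency during the benchmark.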
One last thing
What happens when a user launches parallelized code on a multi-core processor? Does a thread or process always run on one specific processor? The answer is no, unless you have specified the affinity, i.e. asked the OS (Operating System) to use a particular processor or processing unit for the process. In general, the scheduler switches processes among the different processors depending on load. Even when parallelized code is run on a multiprocessor machine, by default the scheduler decides which processor each process uses. In simple terms, the processor with id 1 may be running the process with rank 1 at one moment and the process with rank 2 at another.
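On Linux, affinity can be set from Python with os.sched_setaffinity (MPI launchers and command-line tools such as taskset offer the same control from outside the code). A minimal sketch that pins the current process to one CPU and then restores the original mask:

```python
import os

pid = 0  # 0 refers to the calling process
allowed = os.sched_getaffinity(pid)       # CPUs the scheduler may currently use
one_cpu = min(allowed)
os.sched_setaffinity(pid, {one_cpu})      # pin the process to a single CPU
print(f"pinned to CPU {one_cpu}: {os.sched_getaffinity(pid)}")
os.sched_setaffinity(pid, allowed)        # restore the original affinity mask
```

In MPI codes, affinity is usually controlled by the launcher rather than inside the program, so that each rank stays bound to its own core for the whole run.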
Benchmarking the performance of parallelized code
Hardware Suggestion
1. Disable Hyper-Threading.
2. Disable Turbo Boost/Turbo Core technology.
3. The machines used for testing application performance should all have the same configuration. If testing is done on distributed systems, ensure high-speed network connectivity such as InfiniBand.
Software Suggestion
1. Make sure the executable to be tested is not a debug build.
2. Use a profiler to note the timing, or use wall-clock time. Using a stopwatch will not always give correct results: manual start/stop errors are easily larger than the run-to-run variation you are trying to measure.
3. Make sure no processes other than normal OS processes are running.
4. Also check that the computer is not running any scheduled tasks at the time of testing. On Linux, cron is used to run scheduled tasks at specified times or regular intervals. It is good not to run such tasks during testing, or to spare one core if possible for system processes and other scheduled jobs.
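For point 2, time.perf_counter provides a monotonic wall clock well suited to benchmarking (in MPI codes, MPI_Wtime plays the same role). A minimal sketch with a stand-in workload:

```python
import time

def busy_work(n):
    """Stand-in for a chunk of real computation."""
    return sum(i * i for i in range(n))

start = time.perf_counter()   # monotonic, highest-resolution wall clock
busy_work(200_000)
elapsed = time.perf_counter() - start
print(f"elapsed wall-clock time: {elapsed:.6f} s")
```

Note that time.process_time measures CPU time of the current process only, which under-counts a multi-process run; speedup is defined on wall-clock time.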
Case Selection and Running Suggestion (Written considering CFD Solvers)
1. Run the case for enough iterations that there is a conspicuous difference in timing even when it is run on a large number of processors, especially on distributed-memory systems with MPI.
2. As a CFD solver has a one-time computational cost for reading the case and other pre-processing, this time should be removed from the actual analysis.
3. Do not write or save any data from the application while it is being tested for speedup or performance.
4. The memory requirement of the test case should be large enough to minimize the effect of the caches (L1, L2, and L3) on performance. A detailed analysis of the effect of cache on speed can be found at this link.
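Point 2 above, excluding the one-time pre-processing cost, amounts to starting the timer only around the solver loop. The functions below are hypothetical stand-ins, not a real solver:

```python
import time

def read_case():
    """Stand-in for one-time pre-processing (mesh reading, setup)."""
    time.sleep(0.02)

def solve(n_iter):
    """Stand-in for the iteration loop that should be timed."""
    acc = 0.0
    for _ in range(n_iter):
        acc += sum(i * 0.5 for i in range(1000))
    return acc

read_case()                       # one-time cost, deliberately left untimed
t0 = time.perf_counter()
solve(200)
solver_time = time.perf_counter() - t0
print(f"solver-only time: {solver_time:.4f} s")
```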
Conclusion
The above guidelines will enable programmers to better judge the performance of their applications. Note that they are intended for performance testing and for helping programmers judge their parallelization of code; they are based on my experience with parallel computing, especially CFD software. A few of them, such as switching off Hyper-Threading, also apply to production runs of CFD solvers for memory-intensive calculations, whereas others, such as Turbo Boost, are usually best left ON in production.
(If you have any suggestions to add to or improve the article, please comment on it or mail me at pawan24ghildiyal@gmail.com.)