Parallel processing

Added by Julia Neelmeijer almost 10 years ago

You can control the number of threads for all functions that use OpenMP by setting the OMP_NUM_THREADS environment variable.

Example: using 16 threads in parallel

E.g. C-shell:

setenv OMP_NUM_THREADS 16

E.g. bash shell:

export OMP_NUM_THREADS=16 

You can see in the output how many threads are used:

offset_pwr <...>
*** Offsets between SLC images using intensity cross-correlation ***
*** Copyright 2013, Gamma Remote Sensing, v3.6 13-Oct-2013 clw/uw ***
OPENMP: maximum number of physically available processors: 24
OPENMP: maximum number of available threads: 16
Maximum number of threads available: 16
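A minimal sketch, assuming the program output has been captured in a log file: the hypothetical helper reported_threads extracts the number from the "OPENMP: maximum number of available threads" line shown above, so that a script can compare it with OMP_NUM_THREADS. The helper name and the log file name are illustrative only, and the offset_pwr arguments stay elided as in the example. Other program versions may word this line differently, in which case the pattern needs adjusting.

```shell
#!/bin/sh
# Hypothetical helper (not part of GAMMA): print the thread count that a
# program reported in the log file given as $1.
reported_threads() {
    sed -n 's/^OPENMP: maximum number of available threads: *\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

export OMP_NUM_THREADS=16

# Intended use (arguments elided, log file name is an assumption):
#   offset_pwr <...> > offset_pwr.log
#   if [ "$(reported_threads offset_pwr.log)" != "$OMP_NUM_THREADS" ]; then
#       echo "warning: program reports a different thread count" >&2
#   fi
```

Comparing the reported number with the requested one makes a silent fallback to fewer threads visible in batch runs where nobody reads the console output.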

Replies (17)

RE: Parallel processing - Added by Peter Friedl over 9 years ago

I am currently working on an HPC cluster with 16 available threads. So I put the command export OMP_NUM_THREADS=16 at the top of my bash script, hoping that this would set the number of threads for all subsequent functions that use OpenMP. Unfortunately this did not work (e.g. when running offset_pwr): the default of 4 threads is still used. Does anybody have any suggestions?
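One quick sanity check for this situation, sketched under the assumption that the programs are started directly from the same script: confirm that a child process really sees the exported value. The helper name child_view is hypothetical. The check only verifies inheritance. It cannot show a limit applied inside the program itself, nor a job scheduler that starts the job in a fresh environment.

```shell
#!/bin/sh
export OMP_NUM_THREADS=16

# Hypothetical helper: print the value that a child process started from
# this script would see, or "unset" if the variable is not inherited.
child_view() {
    sh -c 'echo "${OMP_NUM_THREADS:-unset}"'
}

echo "child processes see OMP_NUM_THREADS=$(child_view)"
```

If this prints 16 but a program still reports fewer threads, the cause is unlikely to be the export statement.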

RE: Parallel processing - Added by Julia Neelmeijer over 9 years ago

Peter, check out Jacky's answer in RE: Running GAMMA on a HPC-Cluster.

RE: Parallel processing - Added by Peter Friedl over 9 years ago

First of all: thank you all for your help.

I have some interesting new information on multithreading with GAMMA. Overall we were successful in enabling multithreading with the approach Julia presented above. It works fine with most of the algorithms that use OpenMP.

As I wrote 6 days ago, we were testing GAMMA's multithreading mostly with the algorithm “offset_pwr”. In our case (in contrast to the example Julia showed us) only 4 of the 16 available threads were used for this function. That is why I thought that multithreading doesn't work for us in general. But after a short conversation with the people from GAMMA it turned out that in newer versions of GAMMA (we use version 2014-12-11) a maximum of 4 threads is now hard-coded in the C code for some algorithms (offset_pwr is one of them). The reason (according to what the GAMMA people told me) is that multithreading with more than 4 threads doesn’t noticeably improve the performance of these algorithms. From my point of view this statement is a little bit questionable, especially for offset_pwr, as this function is very CPU-intensive…

RE: Parallel processing - Added by Julia Neelmeijer over 9 years ago

Hi Peter, thanks for your input.

I actually also stumbled over the 4 threads in offset_pwr lately, but didn't really think about it any further, since it obviously (see my post above) had worked before. Since I can also hardly believe that 4 vs. 16 threads makes no difference, I think we should ask GAMMA to reconsider that setting. Maybe it would be an option to offer a performance test with one data pair but different settings?

RE: Parallel processing - Added by Peter Friedl over 9 years ago

I already wrote to GAMMA and asked them whether they are really sure that increasing the number of threads for offset_pwr doesn't improve its performance. I will let you know what their answer is.

RE: Parallel processing - Added by Jacqueline Tema Salzer over 9 years ago

It does not seem to be a problem with offset_pwr in the GAMMA version installed on our cluster.

OPENMP: maximum number of physically available processors: 16
OPENMP: maximum number of available threads: 16
Maximum number of threads available: 16

Overall, producing an interferogram runs 2-3 times faster on a single one of those nodes than on my machine. I haven't tried any unwrapping yet, though.
Jackie

RE: Parallel processing - Added by Peter Friedl over 9 years ago

Jackie: is this the output of offset_pwr? Which version of the algorithm are you using?

RE: Parallel processing - Added by Jacqueline Tema Salzer over 9 years ago

Yes, it is from offset_pwr. The GAMMA version on the cluster appears to be from July 2014 (note that it is not the Ubuntu version!). There is also a newer version installed; I will check whether the problem appears when I use that.

RE: Parallel processing - Added by Peter Friedl over 9 years ago

That's interesting... we are using the GAMMA version for Ubuntu and the offset_pwr algorithm in version v3.6 clw 3-Dec-2014. Which version of the algorithm (not the software) are you using?

RE: Parallel processing - Added by Jacqueline Tema Salzer over 9 years ago

OK, slowly identifying the issue.
The version I was running reports v3.6 13-Oct-2013 for offset_pwr, and that one was using 16 threads.
The newer version we have is v3.6 3-Dec-2014. In this one, offset_pwr only takes 4 threads, as Peter and Julia say. I haven't yet been able to directly compare how this affects the processing time.

RE: Parallel processing - Added by Jacqueline Tema Salzer over 9 years ago

In fact, the total processing time does not appear to be affected much by it (doing 200 offset measurements in range and azimuth at an oversampled window size of 256). Interesting...
Jackie

RE: Parallel processing - Added by Julia Neelmeijer over 9 years ago

But then I am wondering whether GAMMA can tweak the code so that we do benefit from having more than 4 threads available?

RE: Parallel processing - Added by Peter Friedl over 9 years ago

That's what I'm asking myself, too. Hence I sent exactly this question to GAMMA. I will tell you their answer. :-)

RE: Parallel processing - Added by Thorsten Seehaus over 9 years ago

One reason why using more than 4 threads is not more effective for tracking could be that the SAR images are not loaded completely into RAM. Only the small tracking frames are loaded into RAM. Consequently, the data traffic between RAM and the hard disk, or the speed of the hard disk, could limit the processing speed. Has anyone tried feature tracking on a regular SATA hard disk and on an SSD? Maybe this could increase the processing speed.

RE: Parallel processing - Added by Peter Friedl over 9 years ago

I have gained some new experience with multithreading in GAMMA, which I’d like to share with you:

1. The new offset_pwr and offset_tracking programs are now designed to allow multithreading with more than 4 threads. So the processing with these programs is not hard coded to 4 threads anymore like in the previous versions. This is what GAMMA told me.

2. We still struggle a lot with enabling multithreading on our machine in general. At first glance everything seems to be fine: our 16 available threads are detected by OpenMP and the environment variable is correctly set to 16. The console output confirms this for every GAMMA program, like this:

OPENMP: number of physically available processors: 16
OPENMP: max. number of threads (program defined): 4
OPENMP: max. number of available threads, (environment variable OMP_NUM_THREADS): 16
OPENMP: number of threads that will be used: 16

As Thorsten mentioned, though, the benefit of multithreading with more than 4 cores is quite small or even non-existent. In our case this concerns not only tracking but all other operations, too. Because of the correct console output I assumed that our processing was running with 16 threads, but I always wondered why this gave no real improvement in processing speed. Then I started to monitor the CPU load and found that just 25% of our CPU capacity (i.e. 4 of 16 cores) is used during the entire processing. This explains the missing speed-up. So in our case the GAMMA programs claim to use 16 threads, but in fact they do not. What about you? Have you already checked your CPU usage? Have you had the same experience? Does anybody have any suggestions?
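To cross-check the console output against what the operating system sees, one option on Linux is to read the thread count of the running process from /proc. The following is a sketch under that assumption (a Linux system with /proc mounted). The helper names threads_in_status and thread_count are hypothetical. Note that this counts threads that exist, not threads that are busy, so it complements a CPU load monitor rather than replacing it.

```shell
#!/bin/sh
# Print the thread count recorded in a Linux /proc/<pid>/status file ($1).
threads_in_status() {
    awk '$1 == "Threads:" { print $2 }' "$1"
}

# Print the current number of threads of the process with PID $1.
thread_count() {
    threads_in_status "/proc/$1/status"
}

# Example: inspect this shell itself. A real check would pass the PID of
# the running GAMMA program instead.
if [ -r "/proc/$$/status" ]; then
    echo "threads in this shell: $(thread_count $$)"
fi
```

A process that reports 16 threads here while the overall CPU load stays low is probably waiting on something other than computation.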

RE: Parallel processing - Added by Jacqueline Tema Salzer over 9 years ago

Running the new programs on our cluster, the given number of threads is used.
I checked the processing speed for a list of interferograms. The same stack took 110 min, 70 min and 45 min with 4, 8 and 16 threads respectively, giving a factor of ~0.64 (instead of 0.5) for doubling the number of processors. Bear in mind that there may be other bottlenecks on a cluster that limit the processing speed (such as read/write speeds and the transfer of data between the master and the nodes). I'm planning to run some more exhaustive pixel offset time series soon and will see what the differences are then.
Jackie
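The arithmetic behind these numbers can be reproduced with a short sketch, using only the runtimes quoted above (110, 70 and 45 minutes for 4, 8 and 16 threads). The helper names quotient and efficiency are hypothetical. "Efficiency" here means the achieved speed-up divided by the ideal speed-up, taking the 4-thread run as the baseline.

```shell
#!/bin/sh
# quotient A B: print A / B rounded to two decimals.
quotient() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# efficiency T1 N1 T2 N2: achieved speed-up (T1 / T2) divided by the
# ideal speed-up (N2 / N1), for runtimes T1, T2 with N1, N2 threads.
efficiency() {
    LC_ALL=C awk -v t1="$1" -v n1="$2" -v t2="$3" -v n2="$4" \
        'BEGIN { printf "%.2f\n", (t1 / t2) / (n2 / n1) }'
}

quotient 70 110          # runtime ratio, 4 -> 8 threads:    0.64
quotient 45 70           # runtime ratio, 8 -> 16 threads:   0.64
quotient 110 45          # speed-up, 4 -> 16 threads:        2.44
efficiency 110 4 45 16   # efficiency relative to 4 threads: 0.61
```

So four times as many threads give roughly a 2.4-fold speed-up, which is consistent with a share of the work (or of the I/O) that does not scale with the thread count.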

RE: Parallel processing - Added by Peter Friedl over 9 years ago

Ok, after using a different program for resource monitoring I figured out that indeed 16 cores are used for processing. So multithreading is working, although just about 25% of the available CPU-capacity is used and the processing is comparatively slow. But the reason for the low performance must be something else, maybe one of the other bottlenecks you mentioned. We will see...:-)
