non-reproducible machine-specific convergence failure

Extension of Elmer in computational glaciology
rgladstone
Posts: 64
Joined: 15 Apr 2013, 16:23

Post by rgladstone »

Hi all, not sure whether to put this in the Elmer/Ice section or in the main Elmer section as it is probably not ice-specific, though I am not yet certain!

I am running flowline marine ice sheet simulations. I am running on Taito (CSC machine in Finland) and on Raijin (NCI machine in Australia). Some of my simulations run fine on both machines and give results that look very similar.

Some simulations run fine on Taito, but on Raijin they fail. The failure varies somewhat. It occurs after some decades or centuries (1 year timestep), and usually takes the form of a failure to converge in the Stokes solver (this is during the Picard iterations, after using the MUMPS solver to get the linear solution). I have tried lowering the non-linear relaxation factor by an order of magnitude and playing around with the convergence criteria and the number of iterations. Unfortunately the failure is robust, if not reproducible. When I say it is not reproducible, let me clarify: I can run the exact same simulation several times on Raijin and it will fail at a different timestep each time, whereas it will successfully run for 20000 years on Taito.
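
One possible contributor to run-to-run differences of this kind (an assumption, not a diagnosis of this case) is that floating-point addition is not associative, so a parallel reduction that combines partial sums in a different order can return different values from identical inputs. The sketch below is plain Python and has nothing to do with Elmer or MUMPS internals; the helper names `naive_sum` and `chunked_sum` are made up for illustration.

```python
import math

def naive_sum(values):
    """Left-to-right summation with plain floating-point adds."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, n_chunks):
    """Mimic a parallel reduction: one partial sum per chunk, then combine."""
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)

# Same four numbers, different grouping, different answers.
values = [1e16, 1.0, -1e16, 1.0]

print(naive_sum(values))       # 1.0  (one process, left to right)
print(chunked_sum(values, 2))  # 0.0  (two "processes", partials combined)
print(math.fsum(values))       # 2.0  (exactly rounded reference)
```

If the order in which partial results arrive differs between runs, tiny differences like these could in principle be amplified over many non-linear iterations and timesteps, which would be consistent with a failure that shows up at a different timestep each time.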

I am running on 16 processors.

The Elmer version has a (very) slight difference: 6804 on Taito, 6798 on Raijin.

The Taito installation uses the Intel MPI library. The Raijin installation uses OpenMPI.

On Taito the MUMPS version is 4.10.0. On Raijin MUMPS is accessed via PETSc 3.4.3 (I am not sure which version of MUMPS is used, but I could find out).

I have not posted my full setup yet; I thought I'd first ask whether anyone else has encountered this kind of problem. And if the Elmer installation or some aspect of the parallel implementation is at fault, then posting my setup will probably not be useful.

Things I can try:
Re-install Elmer on Raijin, making sure to use the exact same version as is in use on Taito.
Install Elmer on Raijin using the same MPI library as on Taito (I'm not sure how many dependencies will need to be re-compiled in order to do this).
Check the MUMPS versions and compile a new MUMPS on Raijin?
Switch from MUMPS to a different solver?

If you've seen something like this before or have any ideas about things I could try, all suggestions are welcome!

Regards,
Rupert
tzwinger
Site Admin
Posts: 99
Joined: 24 Aug 2009, 12:20

Re: non-reproducible machine-specific convergence failure

Post by tzwinger »

Hi Rupert,
Checking the commit history of the SVN repository, the most likely changes after revision 6798 that could influence your simulation were actually committed by you (Grid2DInterpolator.f90 and FlowDepth.f90). I don't know if those influence something set for your Stokes solver.
Otherwise, my strongest suspicion would be that there is something fishy with MUMPS. For MUMPS it also makes a difference which library it uses internally for ordering (PORD, SCOTCH or METIS). So it is not just the version that matters, but also the libraries that are linked into it. You could try to build your own MUMPS and link against that library on your box down under.
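
A quick way to see which MUMPS and ordering libraries a given installation picks up at run time is to filter the output of `ldd`. The following is only a rough sketch: the binary name `ElmerSolver_mpi` is an assumption about the install, the helper names are made up, and if MUMPS is statically linked or lives inside another Elmer shared library, nothing will show up for the executable itself.

```python
import re
import shutil
import subprocess

# Library name fragments of interest: MUMPS itself and the ordering packages.
PATTERN = re.compile(r"mumps|metis|scotch|pord", re.IGNORECASE)

def solver_libs(ldd_output):
    """Return the ldd lines that mention MUMPS or an ordering library."""
    return [line.strip() for line in ldd_output.splitlines() if PATTERN.search(line)]

def ldd_output_for(binary_name):
    """Run ldd on a binary found on PATH; return '' if it cannot be inspected."""
    path = shutil.which(binary_name)
    if path is None:
        return ""
    try:
        result = subprocess.run(["ldd", path], capture_output=True, text=True, check=False)
    except FileNotFoundError:  # ldd itself is not available on this system
        return ""
    return result.stdout

if __name__ == "__main__":
    # "ElmerSolver_mpi" is assumed here; adjust to the actual executable name.
    for line in solver_libs(ldd_output_for("ElmerSolver_mpi")):
        print(line)
```

Running this on both machines and comparing the printed lines would show whether the two builds resolve to different MUMPS or ordering libraries.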

Cheers,

Thomas
rgladstone
Posts: 64
Joined: 15 Apr 2013, 16:23

Re: non-reproducible machine-specific convergence failure

Post by rgladstone »

Thanks Thomas.

I am guessing it is something to do with either the MUMPS or the MPI libraries, but I'll first install the same Elmer version and check, as that is the easiest thing to do.

I guess I'll try installing MUMPS next rather than using the default MUMPS installation we have here. Which ordering library does MUMPS use on Taito: PORD, SCOTCH or METIS? I'll try to make sure I use the same one when I try my own install of MUMPS in down-under-land.

Cheers,
Rupert
tzwinger
Site Admin
Posts: 99
Joined: 24 Aug 2009, 12:20

Re: non-reproducible machine-specific convergence failure

Post by tzwinger »

Hi Rupert,
I have sent you the setup scripts for MUMPS on Taito, although they might not be of particular help, as the systems differ. If I interpret them right, we use ParMETIS there.

Cheers,

Thomas