non-reproducible machine-specific convergence failure
Posted: 07 Aug 2014, 04:48
Hi all, not sure whether to put this in the Elmer/Ice section or in the main Elmer section as it is probably not ice-specific, though I am not yet certain!
I am running flowline marine ice sheet simulations. I am running on Taito (CSC machine in Finland) and on Raijin (NCI machine in Australia). Some of my simulations run fine on both machines and give results that look very similar.
Some simulations run fine on Taito, but on Raijin they fail. The failure varies somewhat. It occurs after some decades or centuries (1 year timestep), and usually takes the form of a failure to converge in the Stokes solver (this is during the Picard iterations, with MUMPS used as the linear solver). I have tried lowering the non-linear relaxation factor by an order of magnitude and playing around with the convergence criteria and the number of iterations. Unfortunately the failure is robust, if not reproducible. To clarify what I mean by not reproducible: I can run the exact same simulation several times on Raijin and it will fail at a different timestep each time, whereas it will successfully run for 20000 years on Taito.
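In case the terminology is unclear, here is a minimal sketch of what I mean by Picard iteration with a relaxation factor. It is a toy scalar problem, not Elmer's implementation: the function `picard`, the coefficient `1 + u*u` and all parameter values are made up for illustration. Each pass solves a "linear" problem with the coefficient frozen at the previous iterate, then blends the result with the old value.

```python
def picard(coefficient, rhs, u0, relaxation=1.0, tol=1e-8, max_iter=200):
    """Solve coefficient(u) * u = rhs by relaxed fixed-point (Picard) iteration.

    Hypothetical toy solver for illustration only.
    Returns (u, iterations_used, converged).
    """
    u = u0
    for iteration in range(1, max_iter + 1):
        # "Linear solve": the coefficient is frozen at the previous iterate.
        u_linear = rhs / coefficient(u)
        # Under-relaxation: blend the new linear solution with the old iterate.
        u_next = (1.0 - relaxation) * u + relaxation * u_linear
        # Relative change between successive iterates, used as the stopping test.
        change = abs(u_next - u) / max(abs(u_next), 1e-30)
        u = u_next
        if change < tol:
            return u, iteration, True
    return u, max_iter, False


# Toy nonlinear problem: (1 + u^2) * u = 1, with a relaxation factor of 0.5.
u, n_iter, converged = picard(lambda u: 1.0 + u * u, 1.0, u0=0.0, relaxation=0.5)
print(u, n_iter, converged)
```

A smaller relaxation factor damps each update, which usually trades speed for stability; in my case lowering it did not make the failure go away.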
I am running on 16 processors.
The Elmer version has a (very) slight difference: 6804 on Taito, 6798 on Raijin.
The Taito installation uses the Intel MPI library. The Raijin installation uses OpenMPI.
On Taito the MUMPS version is 4.10.0. On Raijin MUMPS is accessed via PETSc 3.4.3 (I am not sure which version of MUMPS is used, but I could find out).
I have not posted my full setup yet; I thought I'd first ask whether anyone else has encountered this kind of problem. If the Elmer installation or some aspect of the parallel implementation is at fault, then posting my setup will probably not be useful.
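One general mechanism that could make identical parallel runs differ from each other is floating-point summation order: addition is not associative, so if partial results are combined in a different order from run to run, the last bits can change. This is speculation as far as my case goes (I have not verified that it happens on either machine), and the snippet below is only a generic illustration, unrelated to Elmer or MUMPS internals.

```python
import math

# The same three numbers, grouped in two different ways.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results agree to within rounding error but are not bit-identical.
print(left_to_right == right_to_left)
print(math.isclose(left_to_right, right_to_left))
```

In a long transient run with a non-linear solver, differences of this size could in principle grow over many timesteps, which might explain a failure that appears at a different timestep each time.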
Things I can try:
Re-install Elmer on Raijin, making sure to use the exact same version as is in use on Taito.
Try to install Elmer on Raijin using the same MPI library as on Taito (I'm not sure how many dependencies will need to be re-compiled in order to do this).
Check MUMPS versions and compile new MUMPS on Raijin?
Try switching from MUMPS to a different solver?
If you've seen something like this before or have any ideas about things I could try, all suggestions are welcome!
Regards,
Rupert