Linear Solver - Different behaviours on different OS
Posted: 03 Jun 2021, 12:00
Hello everyone,
I am currently experiencing a strange behaviour of Elmer, and I really do not understand its origin.
In a nutshell: though using the same Elmer version (9.0), same sif file and same mesh, I am getting very different results on different computers.
My workflow is simply to solve the strain-imposed compression of a piece of porous material, in steady-state and using an isotropic non-linear viscoplastic rheology for the constitutive material. The porous media meshes are produced with the CGAL library and converted to the Elmer format with a personal program.
I'm running the simulations on two different HPCs, one based on CentOS 7 on which I did not personally do the installation, and another based on Ubuntu 20 where I manually compiled Elmer (linked against MMG, Hypre, and Mumps). Both machines use Elmer 9.0. After the installation on the Ubuntu machine, 85% of the ctest tests pass.
The linear system is not solved properly on the Ubuntu machine. I've put a Dropbox link below to an archive with a test case. When I run the simulation with the mesh mesh_100 (sif file Compression_100.sif), the residual right after the pre-conditioner is already quite low (1e-6), which is below my standard convergence criterion (1e-4). However, the returned solution is clearly faulty: it simply corresponds to the initial solution provided in the sif. When I run the same case on the CentOS HPC, everything looks fine.
From there I've tested a few things on the Ubuntu machine:
- Modifying the initial condition from imposing the average strain everywhere to zero everywhere. The residual after the pre-conditioner remains below the convergence criterion, and Elmer simply outputs the initial condition.
- Tightening the convergence criterion to 1e-10. The residual after the pre-conditioner is then above the criterion, and the linear solver diverges.
- Using a smaller mesh (mesh_50 in the attached archive). Everything then works fine (the residual is not as low just after the pre-conditioner, and the linear solver converges afterwards). Here the CentOS and Ubuntu machines behave similarly.
- Using a large mesh with a simple geometry produced with Gmsh (not with CGAL) and converted with ElmerGrid. Everything works fine.
- Testing on another Ubuntu 20 machine. I get the same behaviour.
- The problem occurs whether I am running the problem sequentially or in parallel.
- If I increase the "Critical Shear Rate" of the material, things start to look normal on Ubuntu.
- I also realised that simulations on the Ubuntu machine are prone to producing the warning "WARNING:: RealBiCGStab(l): kappal^2 is non-positive, iteration halted" during the linear solve. It is something I seldom encounter on CentOS.
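To make the first two observations concrete, here is a minimal sketch of how I understand a generic iterative solver to behave. It is not Elmer's actual code: the Jacobi iteration, the relative-residual test and the names `jacobi_solve` and `relative_residual` are all assumptions chosen for illustration. If the residual of the initial guess is already below the tolerance, the solver returns the initial guess untouched.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def relative_residual(A, x, b):
    # ||b - A x|| / ||b||  (assumed stopping measure, not necessarily Elmer's)
    r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    return norm(r) / norm(b)

def jacobi_solve(A, b, x0, tol, max_iter=200):
    # Hypothetical toy solver: checks convergence BEFORE each update,
    # so a good-enough initial guess is returned after 0 iterations.
    n = len(b)
    x = list(x0)
    for iteration in range(max_iter):
        if relative_residual(A, x, b) <= tol:
            return x, iteration
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x, max_iter

# Small, well-conditioned toy system; x0 is slightly off the exact solution,
# giving an initial relative residual of roughly 1e-6.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x0 = [1.0 / 11.0 + 1e-6, 7.0 / 11.0]

x_loose, its_loose = jacobi_solve(A, b, x0, tol=1e-4)   # returns x0 as-is
x_tight, its_tight = jacobi_solve(A, b, x0, tol=1e-10)  # actually iterates
print(its_loose, its_tight)
```

In this toy case the tight tolerance converges because the system is well conditioned, so it does not reproduce the divergence I see; it only shows why a criterion of 1e-4 can hand back the initial condition.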
I am quite lost, and have no idea why the two machines behave differently. Apparently it could be related to the mesh (as it only occurs with sufficiently large CGAL meshes), to the libraries/OS (as the behaviour differs between Ubuntu and CentOS), and/or to the way the effective viscosity is computed in the material law.
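On the viscosity hypothesis, the following sketch shows why I suspect the "Critical Shear Rate" matters. It assumes a generic shear-thinning power law with a floor on the shear rate; the formula, the exponent `n`, the rates and the function names are illustrative assumptions, not the law from my sif file nor Elmer's implementation. A lower floor allows a larger viscosity contrast between nearly undeformed and deforming regions, which could plausibly make the linear system harder to solve.

```python
def effective_viscosity(shear_rate, critical_shear_rate, eta0=1.0, n=3.0):
    # Assumed power law: eta = eta0 * gamma^((1 - n) / n),
    # with gamma clamped from below by the critical shear rate.
    gamma = max(shear_rate, critical_shear_rate)
    return eta0 * gamma ** ((1.0 - n) / n)

def viscosity_contrast(critical_shear_rate, typical_rate=1e-3):
    # Ratio between the viscosity of a region at rest (clamped to the floor)
    # and that of a region deforming at a hypothetical typical rate.
    at_rest = effective_viscosity(0.0, critical_shear_rate)
    deforming = effective_viscosity(typical_rate, critical_shear_rate)
    return at_rest / deforming

for crit in (1e-10, 1e-7, 1e-5):
    print(f"critical shear rate {crit:.0e}: contrast {viscosity_contrast(crit):.3g}")
```

Under these assumptions the contrast shrinks as the critical shear rate grows, which would be consistent with things looking normal on Ubuntu once I increase it, though it does not explain the difference between the two machines.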
Does anyone have any idea about the origin of this problem? Let me know if you need more information (specific versions of the libraries, etc.).
Thanks a lot!
Kévin
DROPBOX LINK: https://www.dropbox.com/s/pnd6on1bcdl2s ... n.zip?dl=0