Random crashes while repeatedly running the same script

Clearly defined bug reports and their fixes
robtovey
Posts: 7
Joined: 22 Jan 2024, 14:35
Antispam: Yes

Random crashes while repeatedly running the same script

Post by robtovey »

I've recently been experimenting with a particular mechanical simulation problem and found a surprisingly broad range of behaviours. I ran the attached example 8 times this morning and got:
- 5 successful executions
- 2 overflow/underflow errors with the message "Norm of solution appears to be NaN"
- 1 degenerate mesh error, which prints the line "ERROR:: ElementMetric: Degenerate 1D element: ..." a huge number of times: the script prints over 2.9e6 lines to stdout

The command line printouts are attached as log files.
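
For anyone who wants to reproduce the tally above, here is a minimal Python sketch of a driver that runs the solver repeatedly and counts the outcomes. The two error strings are taken from my logs. The command `ElmerSolver case.sif` and the label names are assumptions for illustration; `case.sif` is a placeholder, not the name of the attached file.

```python
import subprocess
from collections import Counter

# Error strings as they appear in the attached logs.
# The dictionary keys are hypothetical labels chosen for this sketch.
ERROR_PATTERNS = {
    "nan_norm": "Norm of solution appears to be NaN",
    "degenerate_mesh": "ElementMetric: Degenerate",
}


def classify_output(text, returncode=0):
    """Label one run by the first known error string found in its output."""
    for label, pattern in ERROR_PATTERNS.items():
        if pattern in text:
            return label
    # No known error string: fall back to the process exit code.
    return "success" if returncode == 0 else "other_failure"


def run_repeatedly(command, runs):
    """Run `command` `runs` times and count how each run ended."""
    outcomes = Counter()
    for _ in range(runs):
        # Output is held in memory; the degenerate-mesh case prints
        # millions of lines, so this can be large.
        result = subprocess.run(command, capture_output=True, text=True)
        label = classify_output(result.stdout + result.stderr, result.returncode)
        outcomes[label] += 1
    return outcomes
```

Calling `run_repeatedly(["ElmerSolver", "case.sif"], runs=8)` from the case directory would then return a counter of the three outcome types, assuming `ElmerSolver` is on the PATH.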
The strangest part to me is that the numerical tracking data is completely deterministic. If I compare the logs of two successful runs, the convergence metrics are identical, with no rounding differences. But whether the solver crashes, and which error message I get, is completely random.
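
The log comparison can be done along these lines. This is only a sketch: it assumes the convergence lines can be picked out by a marker substring, and the default `"ComputeChange"` is a guess based on the prefix of the NaN error message, so adjust it to whatever your logs actually contain.

```python
from itertools import zip_longest


def convergence_lines(log_text, marker="ComputeChange"):
    """Keep only the log lines containing `marker` (an assumed filter)."""
    return [line.strip() for line in log_text.splitlines() if marker in line]


def first_difference(log_a, log_b, marker="ComputeChange"):
    """Return (index, line_a, line_b) for the first differing convergence
    line, or None if the filtered logs are identical.

    A missing line (one log shorter than the other) shows up as None
    in the returned tuple.
    """
    lines_a = convergence_lines(log_a, marker)
    lines_b = convergence_lines(log_b, marker)
    for index, (a, b) in enumerate(zip_longest(lines_a, lines_b)):
        if a != b:
            return index, a, b
    return None
```

Filtering first means that lines which legitimately differ between runs, such as timings, do not hide whether the convergence metrics themselves match.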

This doesn't occur on all installations of Elmer:
- on 2 Ubuntu PPA installs with MUMPS, HYPRE, and MPI linked in, I see a roughly 30%-40% crash rate
- on 1 Ubuntu PPA install with only MUMPS and MPI linked, there are no crashes running the script
- on 1 self-built install (version a399e88) without MUMPS, HYPRE, or MPI linked, there are no crashes running the script

I haven't done much digging, but judging from the header printed at the start of every ElmerSolver call, these seem to be the only differences in setup. HYPRE looks like the distinguishing factor, but I don't think the script actually uses it: I've used the default UMFPACK solver without any preconditioners.

Happy to do some debugging on my end if there are any suggestions for how to narrow down the source of the problem, but I don't know how to go much further beyond pointing the finger at HYPRE.

I've attached the script, mesh, and log files to the corresponding GitHub issue: https://github.com/ElmerCSC/elmerfem/issues/465
kevinarden
Posts: 2327
Joined: 25 Jan 2019, 01:28
Antispam: Yes

Re: Random crashes while repeatedly running the same script

Post by kevinarden »

I would suspect mesh quality.

I downloaded the files and ran them as is, and I got the numerous "ERROR:: ElementMetric: Degenerate" errors.

I refined the mesh by splitting once in Gmsh, and that error went away.

I then got a non-convergence error: "ERROR:: ComputeChange: Numerical Error: Norm of solution appears to be NaN".

The StressSolver is very dependent on mesh quality. Triangles are always a problem and should be avoided if possible. If they are necessary, use 2nd order elements and a fine mesh, without any severe change in mesh density across the model.
kevinarden
Posts: 2327
Joined: 25 Jan 2019, 01:28
Antispam: Yes

Re: Random crashes while repeatedly running the same script

Post by kevinarden »

After refining by splitting, I recombined the triangles into quads and went to 2nd order. All of the errors went away and a solution was obtained.
robtovey
Posts: 7
Joined: 22 Jan 2024, 14:35
Antispam: Yes

Re: Random crashes while repeatedly running the same script

Post by robtovey »

Hi kevinarden,
Thank you for having a look.

My reason for calling this a bug is that other builds of Elmer do not crash at all. I have a friend with an older version of the PPA installation (https://launchpad.net/~elmer-csc-ubuntu ... er-csc-ppa) which has not crashed once in 100 runs of this script. I also have a build of a much newer devel version, https://github.com/ElmerCSC/elmerfem/co ... f1d6c0dce3, and that also doesn't crash when I run the script 100 times.

It looks like you are getting the same spectrum of errors, at a similarly high frequency to my local install.
My understanding is that the Elmer solver should be deterministic (given a consistent random seed), but this crashing behaviour is not.
Do you see the same behaviour when you repeat the script with your quad mesh?
On my local installation I cannot repeat this script 5 times without at least one run crashing (even with higher-resolution 2nd order meshes).
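
As a rough check on how much the 100 clean runs on the other builds tell us, here is a small Python sketch. It assumes each run crashes independently with a fixed probability, which is a simplification I have not verified, and it uses the 30%-40% crash rate I see locally as the input.

```python
def prob_all_clean(crash_rate, runs):
    """Probability that `runs` independent runs all finish without crashing,
    assuming a fixed per-run crash probability `crash_rate`."""
    return (1.0 - crash_rate) ** runs


# Chance of 5 and of 100 consecutive clean runs at the observed crash rates.
for rate in (0.3, 0.4):
    print(rate, prob_all_clean(rate, 5), prob_all_clean(rate, 100))
```

Under that assumption, 100 clean runs in a row would be vanishingly unlikely on a build that had the same crash rate as mine, so the difference between builds is very unlikely to be luck.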
robtovey
Posts: 7
Joined: 22 Jan 2024, 14:35
Antispam: Yes

Re: Random crashes while repeatedly running the same script

Post by robtovey »

A fix committed by juharu has resolved the bug. An uninitialised parameter was being used in part of the Mortar Boundary code.
To be precise, this line: https://github.com/ElmerCSC/elmerfem/co ... 2aaadf0f84

Now that I have updated to the latest version (latest git or latest PPA), I can run the test script 100 times without any crashes.

Thanks again to kevinarden and juharu for their help on this.