Weird error

hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Weird error

Post by hgm »

I am plagued by an elusive error, and I understand less and less how it could be my fault. In the search routine of my engine I have a piece of code like

Code:

if(...) {
    ...
} else {
    ...
    // if(FATAL_NODE) printf("OK\n"), fflush(stdout), exit(0);
}
// if(FATAL_NODE) printf("OK2\n"), fflush(stdout), exit(0);

Searching a certain position the engine reproducibly crashes. When I try to print lots of diagnostics to figure out where it crashes, the crash doesn't occur. With great difficulty I nevertheless managed to figure out in which node exactly it crashes ("FATAL_NODE" above represents a test for being in that node). Anything I print in that node seems to make the crash disappear, or at least move elsewhere (to a later iteration, and presumably another node).

So I have been moving around a print statement + exit() combination to figure out where in this node the crash happens. When I only uncomment the printing of "OK" above, the program prints "OK", and then of course exits. When I only uncomment the printing of OK2, nothing is printed. So one can deduce that the crash is caused by what it is doing between the two print statements. Except of course that it isn't doing anything there...

When I uncomment both (clipping the exit() call off the first) it prints both OK and OK2. When I then comment out the OK2 printing again, it crashes only in iteration 8 rather than iteration 5 (presumably in another node).

To test the hypothesis that this is because I use some uninitialized variable altered by printf, I made routines Save() and Restore() that copy their stack frame (of 4KB) to a static memory area, or copy it back from there, and put those around the printf/fflush. That still shifts the crash to another place. If I only call Save/Restore in that place (i.e. remove the printf for "OK", and leave the OK2 stuff commented out), it shifts the crash to yet another iteration.

What on earth could I be doing wrong to get such sick behavior? This is with gcc under Linux. Any ideas? Is it possible that the code is modified during execution? I thought that under Linux the code segment was write-protected?

Oh, and when I compile with the -m32 flag, the crash seems to have gone away completely...
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Weird error

Post by abulmo »

hgm wrote:I am plagued by an elusive error, and I understand less and less how it could be my fault. [...]
Looks like a Heisenbug.
Did you try a debugger (gdb)?
Another fantastic tool for hard to find bugs (not necessarily causing a crash) is valgrind.
Also try using gcc with all warnings on, i.e. -std=c99 -W -Wall -Wextra -pedantic. It may help to find some stupid bugs or the presence of useless code.
Richard
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Weird error

Post by syzygy »

Most likely an uninitialised variable or an array out of bounds error or something to that effect.

If you are compiling with optimisation, then anything can happen if you change a little thing. Of course disabling optimisations will likely hide the error.

One option is compiling with debugging information, catching the error in gdb and then trying to figure out what went wrong (with the limited information that you can extract, because optimisation will obscure things).

You could also try compiling with one of the -fsanitize options (for example -fsanitize=address or -fsanitize=undefined) and see what happens:
https://gcc.gnu.org/onlinedocs/gcc/Inst ... tions.html
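
To make the sanitizer suggestion concrete, here is a minimal sketch of the kind of indexing mistake it is meant to expose. The names store_killer and MAX_PLY are hypothetical and not taken from hgm's engine. Building with something like gcc -g -O1 -fsanitize=address,undefined (both are documented gcc options) should make an out-of-range write to a heap, stack, or global array stop with a report at the faulting line, instead of silently corrupting a neighbour. The checked version below is what the code should look like once the offending index has been found.

```c
#include <stddef.h>

#define MAX_PLY 64   /* hypothetical depth limit, for illustration only */

/* Store a move in a per-ply table of `len` entries.
 * Returns 1 on success, 0 if `ply` is outside the table.
 * A buggy variant would test `ply > len` (off by one) or not test at all,
 * and the write below would then land in whatever follows the array. */
static int store_killer(int *killers, size_t len, int ply, int move)
{
    if (ply < 0 || (size_t)ply >= len)
        return 0;            /* reject instead of writing out of bounds */
    killers[ply] = move;
    return 1;
}
```

With the sanitizer enabled it is usually best to keep the rest of the build (optimization level, 64-bit target) as close as possible to the crashing one, since the bug has already shown that it moves when the build changes.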
ymatioun
Posts: 64
Joined: Fri Oct 18, 2013 11:40 pm
Location: New York

Re: Weird error

Post by ymatioun »

The only way I could fix problems like this was to catch the crash, save a crash dump, and open it in a debugger. Then you can see the exact instruction that causes the crash, and all the register values.

As was mentioned above, this is probably an array going out of bounds; but without catching it "in the act", you may not be able to figure it out.
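
One way to make sure a dump is actually written, sketched here under the assumption of a Linux/POSIX system: raise the core-file size limit from inside the program before the search starts. The helper name enable_core_dumps is hypothetical; getrlimit, setrlimit and RLIMIT_CORE are the standard POSIX interfaces.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit, so that a crash
 * can leave a core dump behind. Returns 0 on success, -1 on failure. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;   /* a process may always raise soft up to hard */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Call it once at the top of main(). If the hard limit is 0, or the system redirects dumps elsewhere (see /proc/sys/kernel/core_pattern), no core file will appear in the working directory. When one does, loading it with gdb together with the executable and using bt and info registers shows the faulting instruction and register contents without changing the program's stack layout.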
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Weird error

Post by mar »

abulmo wrote:Another fantastic tool for hard to find bugs (not necessarily causing a crash) is valgrind.
+1 for valgrind and another for using a debugger
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Weird error

Post by bob »

hgm wrote:I am plagued by an elusive error, and I understand less and less how it could be my fault. [...]
This is almost certainly a case of an undefined/uninitialized local variable. These are allocated on the stack, and ANYTHING you do that uses the stack can unintentionally zero the value so that everything is OK. For example, a call to printf() immediately makes a direct call to the shared C library, which pushes stuff on the stack, and then eventually returns.

If you compile with -O, gcc will generally catch the obvious cases where scalar variables are used uninitialized, but it won't catch arrays and such where some values are assigned and some are not.
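
A minimal sketch of the array case, with hypothetical names (best_score, MAX_MOVES) that are not from the engine in question. Only the first n entries of the local array are assigned. Without the = {0} initializer, the read of sorted[0] when n is 0 would pick up whatever an earlier call left at that stack address, which is exactly the kind of value an unrelated printf() can change.

```c
#define MAX_MOVES 256   /* hypothetical move-list capacity */

/* Return the best of the first n scores, or 0 for an empty list. */
static int best_score(const int *scores, int n)
{
    int sorted[MAX_MOVES] = {0};   /* zero every entry, not only the first n */
    int i, best;

    if (n > MAX_MOVES)
        n = MAX_MOVES;             /* never copy past the local array */
    for (i = 0; i < n; i++)
        sorted[i] = scores[i];

    best = sorted[0];              /* well defined even when n <= 0 */
    for (i = 1; i < n; i++)
        if (sorted[i] > best)
            best = sorted[i];
    return best;
}
```

Zeroing a large local on every call costs time in a search routine, so this is best seen as a diagnostic step: if the crash disappears for good once every local array is initialized, the remaining work is to find which partially assigned entry was being read.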
flok

Re: Weird error

Post by flok »

Try running it under valgrind.
Valgrind shows you uninitialized variables, buffer over- and underruns, use after free, etc.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Weird error

Post by brtzsnr »

Most likely a buffer overflow or use after free. Did you try with address sanitizer? https://github.com/google/sanitizers/wi ... sSanitizer
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Weird error

Post by stegemma »

hgm wrote:[...]So one can deduce that the crash is caused by what it is doing between the two print statements.[...]
The closing } means that the compiler will destroy local variables. If it were C++ code I would look for class destructors, but in C I don't know if this idea could help.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: Weird error

Post by Antonio Torrecillas »

Each time you modify the code, you are testing a different program: things that were previously optimized may no longer be. Keep your changes as small as possible.
When you compile for 32 bits, alignment can be different, so the variable being overwritten may be surrounded by some spare bytes that can be "safely" hit without harm.
So the problem is to find out which variable is in trouble, and then to find the place in the code where the bug is.
Assuming that no debugger helps you, there are still things that can be done to resolve this.
For the first question, you can put a filler variable at some position in the stack frame.
See what effect that has on the bug, and do a binary (dichotomic) search to locate the variable that is being hit.
For the test condition, try to minimize its impact on the stack: if your FATAL_NODE test involves function calls, you are changing the stack and possibly displacing the effect to another place.
To ensure the compiler doesn't optimize out your variable, you can declare it volatile and do something like this:

Code:

#include <stdio.h>
#include <stdlib.h>

int function(int params)
{
    int a;
    char b[100];
    volatile unsigned int filler = 0xCCCCCCCCu; // volatile to prevent optimization
    int c;
    /* ... */
    if(filler != 0xCCCCCCCCu) goto ko_label; // the test makes no calls, so it has no impact on the stack
    /* ... */
    return 0;
ko_label:
    printf("KO\n");
    exit(1);
}

Be patient.
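
A variation on the filler idea, offered as a sketch rather than as part of the method above: the compiler is free to reorder separate local variables, but the members of a struct are laid out in declaration order. Wrapping a suspect buffer between two guard words in a struct therefore keeps the guards next to it. The names Guarded, guard_init, guard_ok and CANARY are hypothetical.

```c
#define CANARY 0xCCCCCCCCu   /* same fill pattern as in the example above */

/* A suspect buffer sandwiched between two guard words. Struct members are
 * laid out in declaration order, so the guards stay adjacent to buf. */
struct Guarded {
    volatile unsigned int before;
    char buf[100];
    volatile unsigned int after;
};

/* Arm both guards; call once before the buffer is used. */
static void guard_init(struct Guarded *g)
{
    g->before = CANARY;
    g->after  = CANARY;
}

/* Returns 1 if both guards are intact, 0 if either one was overwritten.
 * It only does two loads and compares, so the check itself stays cheap. */
static int guard_ok(const struct Guarded *g)
{
    return g->before == CANARY && g->after == CANARY;
}
```

Checking guard_ok() at a few points in the node and moving the checks closer together narrows down where the overwrite happens, in the same spirit as the dichotomic search over variables.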