strcpy() revisited

bob · Post by **bob** » Mon Dec 09, 2013 2:25 am

wgarvin wrote:bob, I humbly suggest you take an hour and read each of these links with an open mind. I know it seems like a waste of time because you think you already know how it works. But just for an hour, forget what you know and read what these two experts have to say about the topic of undefined behavior:

(1) Chris Lattner, primary author of LLVM and the clang optimizing compiler. Chris is a bona-fide expert on compiler optimization and knows more about undefined behavior than any ten average programmers combined.
* LLVM Project Blog: What Every C Programmer Should Know About Undefined Behavior, Part 1 / 3
* LLVM Project Blog: What Every C Programmer Should Know About Undefined Behavior, Part 2 / 3
* LLVM Project Blog: What Every C Programmer Should Know About Undefined Behavior, Part 3 / 3

(2) John Regehr, associate professor of CS at University of Utah. He does research on software correctness, including things like compiler fuzzing, static analysis of source programs to find undefined behavior, etc. Using fuzzing and test-case-reducing tools he and his students wrote, Regehr has reported many hundreds of bugs in most of the widely-used optimizing compilers (GCC, clang, etc.) and helped popularize that style of compiler testing.
* A Guide to Undefined Behavior in C and C++, Part 1 of 3
* A Guide to Undefined Behavior in C and C++, Part 2 of 3
* A Guide to Undefined Behavior in C and C++, Part 3 of 3
* Integer Overflow Paper
* Winners of a contest to find the most surprising code snippet affected by undefined behavior

If you read these pages with an open mind, you may come away with a new respect for (and perhaps a healthy fear of) undefined behavior.

I think most programmers who use the C and C++ languages don't really understand what they are messing with when they stray into 'undefined' territory. They think the C language closely maps to what the underlying hardware does, so anything you write in C will somehow be mapped in a sensible way. This is only true for the parts of the language that are NOT declared as 'off limits' by labelling them as 'undefined behavior'. If you stray into UB territory, nine times out of ten you'll get away with it and the tenth time you'll get ripped apart by a rabid bear. Or something equally unpleasant.

Most programmers just don't know about this. If something happens to work on their compiler, they assume it is OK. They are blissfully unaware of how close they came to disaster (and how their code is a time-bomb that might someday fail in a surprising way). Chris even explains how LLVM takes pains to try and do the 'sensible' thing for some cases of undefined behavior, because in those cases if the compiler actually exploited it to the fullest, too many existing programs would not run correctly.

I read it all. Here's something that might surprise you.

Not one new thing did I see. I lived thru the 70's and 80's when optimizers were just starting to get good. I attended ACM conferences where these topics were discussed. I listened to similar arguments for Fortran, C, you-name-it. I listened to the optimizer guys explain why they could optimize better if they could take advantage of certain assumptions such as no signed overflow. The general take on all of that, by those present, was a simple "bullshit." There are reasonable optimizations, from constant folding to common subexpression elimination, to moving loop-invariant code out of the loop. The list goes on and on. And then there were those unsafe optimizations that could break a program.

The arguments raged just as much back then as they do now. I am not supposed to take advantage of undefined behavior, yet the compiler optimizer guys are allowed to do so. Even to the extent of breaking a working program. And the ultimate bad news is most of those unsafe optimizations are not very significant in most programs. Who REALLY does "if (a + 1 > a)"???

I didn't see anything there that sways my opinion of any of this, sorry. I STILL believe the compiler's job is to give me a higher-level way of expressing ideas so that I can write code quicker and debug it faster. Not to try to mangle code because the standard writers could not agree on a specific behavior, and therefore, they chose to leave it open. This is not a war of cleverness. It is a war of productivity. And this Apple change was anything but pro-productivity.

syzygy · Post by **syzygy** » Mon Dec 09, 2013 2:25 am

wgarvin wrote:
Nevertheless, there is something very troubling here. Your program's behavior is undefined -- you have no way of knowing what will happen...That means compilers may generate code to do whatever they like: reformat your disk, send suggestive email to your boss, fax source code to your competitors, whatever.
-- Scott Meyers, "Effective C++"

He left out "impersonate you on internet fora posting hilarious complaints".

bob · Post by **bob** » Mon Dec 09, 2013 2:27 am

mcostalba wrote:
bob wrote: I'm going to repeat my original story, with some early blanks filled in.
I don't want to get into this discussion because it is useless. I just want to give you a piece of practical advice: run a valgrind session on your engine.

This tool can find (and warn on) overlapping addresses and many other dubious usages. I had an overlap with memcpy() and valgrind was able to spot it. And of course I quickly fixed it in the proper way, without posting on talkchess...but this is another story.

Funny, but I have done this many times. However, how often do you rebuild a book while running valgrind, as opposed to testing the engine in various positions? This strcpy() problem was in the code that parses PGN; it never even occurred to me to do any significant testing on it.

rbarreira · Post by **rbarreira** » Mon Dec 09, 2013 2:38 am

Code like "a + 1 > a" does happen, typically due to expansion of macros etc.

bob · Post by **bob** » Mon Dec 09, 2013 2:53 am

wgarvin wrote:
hgm wrote:All I can say is that I would not want to use compilers that use the very liberal definition of 'undefined behavior' in a standard as an excuse to maliciously sabotage my program when such behavior might occur, any more than I would hire personnel that thinks an unclear order I give them can mean I want them to shoot me in the back.

A crime remains a crime, even when you commit it because your host told you "make yourself at home"...
Sure, of course. But all optimizing compilers take advantage of undefined behavior to generate better code. If you read the links I gave from Lattner and Regehr, they describe several examples of how the compiler is able to generate better code for common, valid programs because it knows that the undefined behaviors "aren't allowed" and it is free to ignore them.

I don't buy any part of that statement, "generate better code" and "free to ignore undefined behavior". Think about that for a minute. First, I'd bet "better" turns into "very tiny speed improvement." And that "deleting code with undefined behavior" turns into "gross programming errors."

How is that tiny speed improvement "better code" if the underlying assumptions broke the code completely and caused it to crash?

The vast majority of C and C++ programmers don't understand this stuff very well, and yet they write lots of working code anyways. Most programmers know that some things are not defined and avoid using them. Trouble occurs though when they forget about signed overflow, unsafe pointer arithmetic, etc. or just accidentally rely on them in their programs. There are many many examples of this in real programs in the wild -- for example, Regehr recently surveyed open-source crypto libraries and found that most of them contain instances of undefined behavior. Serious bugs and exploitable security vulnerabilities have been traced back to undefined behavior.

So that's why I think every programmer ought to know enough about this stuff to avoid getting screwed by it. Which basically means "avoid undefined behavior like the plague". Even if your program works today, if it relies on UB somehow then it might fail unexpectedly 2 years from now or 5 years from now. As a professional programmer, I have an obligation to write robust, future-proof code that won't suddenly fail one day after I have moved on to something else.

I'd only add that the concept of an "idiot-proof programming language" has NEVER been developed. I think it far better to make every attempt to do what the programmer specified, as opposed to what a very loose standard allows in some cases that are labeled as "undefined" when they are really anything but on any available hardware today. Signed overflow as one example.

Here's a peculiar one using gcc 4.7.3.

Here's a simple source with an obvious problem:

#include <stdio.h>
#include <stdint.h>
int main()
{
  int32_t i=0x40000000;
  int32_t j=0x40000000;
  int32_t k;

  k=i+j;    /* signed overflow: 0x40000000 + 0x40000000 does not fit in int32_t */
  printf("k=%x\n", k);
}

First, do we agree the k assignment is an example of undefined behavior, and, since it uses constants, an example where the compiler could eliminate it just as it does for the loop example posted yesterday?

However:

scrappy% gcc -O3 -o tst1 tst1.c
scrappy% ./tst1
k=80000000

Exactly what I want to see. It "did the right thing." So in addition to being a dangerous optimization, it is an inconsistent one. Let's do it here, let's remove it there. I would think the addition would be tossed, and since there is no value for the printf to print, that would be tossed, leaving a null program body. Makes you wonder. I'm looking to compilers for consistency, I look to the various ran lib stuff for randomness...

bob · Post by **bob** » Mon Dec 09, 2013 2:54 am

syzygy wrote:
hgm wrote:I would NEVER want to use a compiler that thinks the undefinedness of integer overflow would be any other than how the hardware it compiles for defines it.
Then you should use a compiler that makes a promise that it will deal with integer overflow in a particular way. For gcc, just get used to invoking it with -fwrapv. I'm sure you won't mind the loss in efficiency, especially on loops using an int as loop variable.

I'll bet this won't affect one loop in one thousand.

In fact, here is a run with crafty using -O3 with and without the signed wrap...

log.001: time=23.11 n=117008633 afhm=1.19 predicted=0 50move=0 nps=5.1M
log.002: time=23.13 n=117008633 afhm=1.19 predicted=0 50move=0 nps=5.1M

I can run it 10 times, each varies by a couple of hundredths of a second, one winning one round, the other winning the next. How is that for what you want to imply as a "crippling optimization limitation"??

This is NOT a big deal in terms of optimizations. Never was.

bob · Post by **bob** » Mon Dec 09, 2013 3:09 am

Rein Halbersma wrote:
hgm wrote:
I would NEVER want to use a compiler that thinks the undefinedness of integer overflow would be any other than how the hardware it compiles for defines it. That different hardware might define it in a different way is one thing. To use that as an excuse to do something completely different, which no hardware would ever do and which is certainly not what the programmer intends is quite another. These are not at the same level. Like filing something in a wrong drawer because I wasn't clear about where to file is not at the same legal level as shooting me in the back because I wasn't clear where to file.
No sane compiler will emit completely bogus machine instructions, but it can optimize away your instructions because it will think that no sane programmer will intend to rely on undefined behavior. For integer overflow, gcc has both a warning to let you know it is making these optimizations and a flag to force the behavior you intend to use on your machine. But because there are so many (~200 in the C Standard) different constructs that lead to undefined behavior, most compilers won't warn about them all.

Another minor problem. They are inconsistent about when code is removed. I posted a clear example of overflow that worked perfectly. Yet the loop posted yesterday confuses some compilers. There is a report online where someone did the if (a + 1 > a) test and ran it against a bunch of versions of gcc and some optimized it away, some did not. Some optimized it away even with the -fwrapv compiler flag.

That's not exactly consistent, nor will it lead to good software development when a single compiler can't get it right. I compiled crafty with the -fwrapv option and it didn't slow it down at all. It is an uncommon optimization that commonly causes problems when done the other way...

bob · Post by **bob** » Mon Dec 09, 2013 3:10 am

rbarreira wrote:Code like "a + 1 > a" does happen, typically due to expansion of macros etc.

My term was RARELY. In the case of crafty, disabling the optimization makes absolutely no difference in speed... not surprising.

wgarvin · Post by **wgarvin** » Mon Dec 09, 2013 3:45 am

Okay, well, I've tried to convince you about what the situation currently is with undefined behavior, and if you don't buy it then there's not much more that I can do. Funny thing is, people have been asking questions about undefined behavior and receiving the correct-but-unsatisfying answers in places like comp.std.c for decades. The modern answer is the same as it was in 1992, with the added bonus that modern compilers are actually clever enough that they are more likely to be able to mangle code with undefined behavior in it.

The funny thing is I completely agree with Dr. Hyatt about how the world of undefined behavior SHOULD be. The current situation is quite unsatisfactory--the programmer can easily overlook a little thing, and get code that appears to be correct, and as best as he can tell DOES work on the compiler(s) he has in front of him. And then one day it will stop working because he overlooked a subtle piece of undefined behavior and some compiler's optimizer treated that the same as something that genuinely couldn't happen. And that code might be in a widely-deployed piece of software, it might even be a mission-critical piece of software for some business, or part of the trusted base of some security-conscious system, and the nasal demons of UB can be pretty frightening in that context.

I think there's probably a market for compilers with a "secure" mode, where they generate slightly less performant code but never perform these optimizations that rely on undefined behavior never happening. You can already get some of that using compiler options, such as -fwrapv (or -ftrapv) or -fno-strict-aliasing. They should take that to the logical conclusion and build us a mode that treats ALL undefined behavior as having some implementation-specific behavior (and bonus points if they spell out clearly what it is and commit to not changing it in the future). Maybe the standards committee or some research group will produce a spec that the compiler vendors could target for it. A lot of developers who make safety-critical embedded devices, avionics, medical equipment, OS kernels, etc. would be better served by a "safer" compiler rather than fastest possible generated code.

bob · Post by **bob** » Mon Dec 09, 2013 3:53 am

wgarvin wrote:Okay, well, I've tried to convince you about what the situation currently is with undefined behavior, and if you don't buy it then there's not much more that I can do. Funny thing is, people have been asking questions about undefined behavior and receiving the correct-but-unsatisfying answers in places like comp.std.c for decades. The modern answer is the same as it was in 1992, with the added bonus that modern compilers are actually clever enough that they are more likely to be able to mangle code with undefined behavior in it.

The funny thing is I completely agree with Dr. Hyatt about how the world of undefined behavior SHOULD be. The current situation is quite unsatisfactory--the programmer can easily overlook a little thing, and get code that appears to be correct, and as best as he can tell DOES work on the compiler(s) he has in front of him. And then one day it will stop working because he overlooked a subtle piece of undefined behavior and some compiler's optimizer treated that the same as something that genuinely couldn't happen. And that code might be in a widely-deployed piece of software, it might even be a mission-critical piece of software for some business, or part of the trusted base of some security-conscious system, and the nasal demons of UB can be pretty frightening in that context.

I think there's probably a market for compilers with a "secure" mode, where they generate slightly less performant code but never perform these optimizations that rely on undefined behavior never happening. You can already get some of that using compiler options, such as -fwrapv (or -ftrapv) or -fno-strict-aliasing. They should take that to the logical conclusion and build us a mode that treats ALL undefined behavior as having some implementation-specific behavior (and bonus points if they spell out clearly what it is and commit to not changing it in the future). Maybe the standards committee or some research group will produce a spec that the compiler vendors could target for it. A lot of developers who make safety-critical embedded devices, avionics, medical equipment, OS kernels, etc. would be better served by a "safer" compiler rather than fastest possible generated code.

I've grown up in an era of trusting the compiler. You got a bug, it is almost certainly a programming bug. That era seems to be coming to an end, because a change to a compiler really should not change program behavior, if the original compiler was worth a crap...

Yet we are seeing exactly that...
