strcpy() revisited
Posted: Sun Dec 08, 2013 7:02 pm
Since the other thread has gone so far off the deep end, with accusations like "you advocate undefined behavior" or "you don't understand undefined behavior," I'm going to repeat my original story, with some of the early blanks filled in.
Back in the '70s Bert Gower and I were thinking about opening books, and what to do with ours. Most of us back then organized a book as a sort of tree/linked list, which failed to detect transpositions unless they were hard-coded into the book input. One day the "use the hash signature" idea popped into my head; I wrote the code and was quite happy. One thing it did was eliminate any sort of hand-coding of links, meaning the book input (and the resulting file) could be as large as we wanted.
Bert and I immediately began to hand-type the most recent MCO book (MCO 10) line by line, page by page. We needed some way to keep up with where moves came from so we started to use a "pseudo-PGN" format that looked like this:
[MCO10 Page xx Col yy Footnote zz]
1. P-K4 P-K4 2. N-KB3 N-QB3 ... etc. (yes, that book was in English Descriptive notation).
Cray Blitz was in Fortran, with no strings or supporting functions of any kind, so I wrote a quick parser that included a "shift()" procedure that did what the overlapped strcpy() call did in Crafty, namely shift the input string left N elements (it was an array, of course) to eliminate the just-parsed move. All worked well.
In 1994, when I started the Crafty project, I copied some of the CB stuff and simply converted it to C as quickly as practical. That "shift" was apparently turned into the equivalent strcpy(). I don't remember any of it specifically, just that I did a quick conversion of a few pieces of Cray Blitz to get started (the current setboard command, for example, since we had "Forsythe notation" to encode a position long before the Forsyth-Edwards specification was written).
So, an overlapping strcpy() was used; whether I knew at the time that it invoked undefined behavior, I have absolutely no idea. We are talking almost 20 years ago for this conversion, late 1994.
Flash forward to today. It had worked since late 1994, not one single problem, with Crafty running on everything from a Cray, to early 64-bit processors (Alpha, SPARC, MIPS, even Itanium), all the way through my MacBook with a third-generation i7. It worked with every compiler ever tried, with only a few incompatibilities found (the IBM AIX C compiler defaulted char to unsigned, while most everyone else defaults to signed, as an example). The code that was eventually modified into the 1998 edition of ReadPGN() had never failed. Not once. Not on any machine. Not on any O/S. Not on any compiler. That was the state of affairs until 2-3 weeks ago.
I released a new version of Crafty to fix a couple of quirks I had found dealing with EGTB hits (which I don't use). Within a week, I received a bug report that said "Can't build a book with crafty, it crashes..."
First thought was the obvious one... I broke something. I diff'ed everything and looked at each change. Nothing even remotely related to book creation, PGN parsing, move translation, or file I/O (I do my own file I/O to avoid endian issues). Nothing.
I asked for more information and for the PGN that caused the problem. I received it (a fairly long file) and tried it. It worked perfectly. Unfortunately, most of this communication and testing happened while I was at my office, so I was doing the testing on my office Linux box. I could not make it fail.
I asked "what system?" He responded "Apple MacBook." A lead. Since I had canned clang and installed gcc, I found a co-worker who still had an unchanged MacBook. Unfortunately it was running Mountain Lion, but not knowing that was the problem, I tested. It worked. Another request: exactly which OS X version and which compiler? The reply was "Mavericks, clang."
Now I was at an impasse, because I have Mavericks, but I had dumped clang for gcc 4.7.3... My co-worker mentioned he was about to upgrade; if I wanted, I could do it for him and then test. I did, and it crashed. Aha, I thought: clang again, since the clang on my MacBook (with 10.7) was broken and Crafty would not work correctly at all. I compiled on my Mac, and it crashed. Just a mysterious "abort" printed, no explanation or anything. Since the POSIX standard says diagnostic output should go to stderr, and since I was not seeing anything other than "abort", I set about debugging. The debugger was helpless here; it could not tell me anything at all, not even which source file was crashing. Many printf()s later, I tracked it down to the strcpy().
Now, "why does that suddenly fail?" A quick Google search, after wasting more than a week, revealed that many others had encountered the same thing, and they were just as unhappy as I was.
This is where the thread I started came from.
My complaint: something that worked for a LONG time suddenly failed with no explanation. I dug up the library change, looked at it, and thought "why did they slow this down to check for overlap?" Then "if they spent the time to check for overlap, why did they not just call memmove() rather than aborting?"
And that's where we are. To be clear:
I am NOT a proponent of using undefined behavior.
I DO understand what it means (which, by the way, does not match much of what I have read here). This "demons" nonsense is just that, unless you REALLY believe that strcpy(a, a+1) is allowed to format your hard disk or commit some other equally damaging act. I don't buy that, and would call that an outright stupid opinion. The generally accepted definition of undefined behavior is that anything can happen, but ONLY within the scope of the operation being performed. That is, an overlapping copy might overwrite a buffer, or it might not, but it won't do anything INTENTIONALLY worse.
The Apple example is an example of poor behavior, IMHO. Why? Because (1) there is no reason to intentionally break something that has worked for a long time (precedent), and (2) changing something to make it faster, smaller, more accurate, whatever, is perfectly acceptable, and if that changes the behavior, fine. But don't change it just because you can.
That is ALL I have said. Repeatedly. I come from the perspective of actually having written multiple compilers over the years. The writer's goal is to do what the programmer's code says, no more, no less. If you add two positive integers and get a negative integer, fine. That is not "random undefined behavior." Ditto for memory race conditions.
Some here seem to believe that by using "undefined" in the standard, compiler writers have carte blanche to do whatever they want in such cases. I, on the other hand, believe compiler writers have the responsibility to do what I say, to the best of their ability. Period.
If you want to argue "Apple did this to prevent future problems," I would argue "nonsense."
1. They COULD have fixed the problem. Check for overlap. If detected, directly call memmove() instead. Problem fixed. No code broken. Future problems eliminated as well.
2. They COULD have left it alone. Legacy code is just that. There's no sense in breaking something just because you can.
3. Instead, they capriciously broke a BUNCH of software products, with no warning. Not even an error message in the place where POSIX says it should be displayed.
So please get off of the "Bob believes undefined behavior is acceptable and teaches his students to write code like that." Pure nonsense. I've not defended my code in any way other than to say "it worked for a long time, it was unintentional, the existing library source files show that the overlap I use will work correctly, and Apple should not have changed it without any warning."
That's all I have said. If there are any reasonable comments, feel free to leave 'em here. I don't see any point in the endless debate about what undefined means. Doesn't matter at all to the original point of this discussion. Still doesn't.