strcpy() revisited

syzygy
Posts: 5721
Joined: Tue Feb 28, 2012 11:56 pm

Re: strcpy() revisited

Post by syzygy »

bob wrote: No idea what they are talking about there. There is no size argument to strcpy(). strcpy(dest, src) is all there is.
Yes, that is a strange mistake. I suppose it should have read "string" (or "source string") argument.

This is interesting:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
  char *a = argv[1];
  char b[256];
  strcpy(b, a);
  strcpy(b+1, b);
  printf("strlen(\"%s\") = %zu\n", b, strlen(b));
  return 0;
}

$ gcc -O3 bla.c
$ ./a.out 12345
strlen("112345") = 5
How can that be?

Answer: the program learns the original length of the string b from the first strcpy(), which it implements using stpcpy(). The second strcpy() is implemented using memcpy(). Since overlapping arguments to strcpy() are undefined behavior, the compiler may assume that strcpy(b+1, b) cannot possibly involve overlapping regions, so this memcpy() cannot possibly change the length of b. So the program does not have to recalculate strlen(b) but can output the result it found earlier.
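
For comparison, here is a minimal sketch of a well-defined way to get the shift that strcpy(b+1, b) appears to be aiming for. It uses memmove(), which is specified to handle overlapping regions, so the compiler cannot assume the length is unchanged. The helper name shift_right_by_one and its capacity parameter are illustrative assumptions, not part of the original program.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: moves the string in buf (including its terminator)
 * one byte to the right. buf[0] is left unchanged, so the first character
 * ends up appearing twice. Returns 1 on success, or 0 if the result would
 * not fit in cap bytes (buf is then left untouched). */
int shift_right_by_one(char *buf, size_t cap)
{
    size_t len = strlen(buf);

    if (len + 2 > cap)          /* len chars + 1 extra char + '\0' */
        return 0;

    /* memmove, unlike strcpy and memcpy, is defined for overlapping regions */
    memmove(buf + 1, buf, len + 1);
    return 1;
}
```

Called on a buffer holding "12345", this leaves "112345" in the buffer, and a subsequent strlen() has to report 6.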
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: strcpy() revisited

Post by wgarvin »

syzygy wrote: strlen("112345") = 5
How can that be?

Answer: the program learns the original length of the string b from the first strcpy() which it implements using stpcpy(). The second strcpy() is implemented using memcpy(). Since strcpy(b+1, b) cannot possibly involve overlapping regions, this memcpy() cannot possibly change the length of b. So the program does not have to recalculate strlen(b) but can output the result it found earlier.
Nice! Anyone want to bet whether or not this find will silence those critics who have complained that there are no possible optimization benefits from the non-overlapping API specification of strcpy?

[more explicitly: In order to implement bob's suggestion that they redirect any overlapping strcpy calls to memmove, the implementation provider would also have to disable this optimization. That might also be complicated by the fact that different groups maintain the library and the compiler. But fortunately, they have (drumroll) STANDARDS to follow, that allow them to independently perform clever optimizations like this without breaking any legal programs.]
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: strcpy() revisited

Post by hgm »

mvk wrote: Once you say "#include <string.h>" the compiler can know the semantics of strcpy and strlen in the code that follows, because <string.h> is standardised. There is no obligation to implement <string.h> with a file for example. (if you want that, you have to say #include "string.h").
Semantics? Do you mean that the library file string.h actually contains the complete definition of strcpy, rather than just a prototype? When I do that for my own header files, it usually leads to "multiply defined symbol" error messages from the linker when such a header is #included in multiple .c files. Does the new standard now allow multiple definitions of the same routine, provided the definitions are identical? Would that also work for definitions I write myself, or just from header files?
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: strcpy() revisited

Post by Rein Halbersma »

hgm wrote:
mvk wrote: Once you say "#include <string.h>" the compiler can know the semantics of strcpy and strlen in the code that follows, because <string.h> is standardised. There is no obligation to implement <string.h> with a file for example. (if you want that, you have to say #include "string.h").
Semantics? Do you mean that the library file string.h actually contains the complete definition of strcpy, rather than just a prototype? When I do that for my own header files, it usually leads to "multiply defined symbol" error messages from the linker when such a header is #included in multiple .c files. Does the new standard now allow multiple definitions of the same routine, provided the definitions are identical? Would that also work for definitions I write myself, or just from header files?
You can put function definitions in headers if you prefix them with the inline keyword.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: strcpy() revisited

Post by wgarvin »

hgm wrote:
mvk wrote: Once you say "#include <string.h>" the compiler can know the semantics of strcpy and strlen in the code that follows, because <string.h> is standardised. There is no obligation to implement <string.h> with a file for example. (if you want that, you have to say #include "string.h").
Semantics? Do you mean that the library file string.h actually contains the complete definition of strcpy, rather than just a prototype? When I do that for my own header files, it usually leads to "multiply defined symbol" error messages from the linker when such a header is #included in multiple .c files. Does the new standard now allow multiple definitions of the same routine, provided the definitions are identical? Would that also work for definitions I write myself, or just from header files?
I think what he meant is that the standard specifies fully what #include <string.h> means. It means that the standard library functions like strcpy, memcpy, etc. are now defined with the behaviors specified in the spec (but how exactly this is implemented, is up to the implementation). It means the compiler is allowed to be "smart". After it has seen #include <string.h>, it knows that "strcpy" means "the C standard library function strcpy" and not just some random function named "strcpy" that you may have written yourself. So it is then allowed to do clever things that take advantage of its intrinsic knowledge of the semantics of strcpy.

That's how we get this kind of optimization, where it rewrites a strcpy into something else (stpcpy or memcpy, or the hard-coded code to copy 16 bytes for the string literal). It's also how we get "intrinsic memcpy" optimizations of the same flavor (i.e. where you write a call to memcpy of size sizeof(MyStruct) and it generates some inline code to copy those 24 bytes instead of actually emitting a call to the library function). I expect, though I am not 100% sure, that the compiler isn't allowed to do those optimizations unless it has seen the proper #include <string.h>.
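
A minimal sketch of the fixed-size memcpy case: MyStruct is given a hypothetical layout that comes to 24 bytes on typical platforms (8-byte double, 4-byte int), and copy_struct is an illustrative name. Whether a given compiler expands the call inline rather than calling the library depends on the compiler and its optimization flags.

```c
#include <string.h>

/* Hypothetical struct for illustration; 24 bytes on typical targets. */
typedef struct {
    double score;
    int from, to;
    int depth, flags;
} MyStruct;

void copy_struct(MyStruct *dst, const MyStruct *src)
{
    /* The size is a compile-time constant, so an optimizing compiler that
     * knows memcpy's semantics may emit a few move instructions here
     * instead of a call to the library function. */
    memcpy(dst, src, sizeof(MyStruct));
}
```

Comparing the assembly from gcc -O0 -S and gcc -O2 -S on a file like this is one way to see whether the call survives.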
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: strcpy() revisited

Post by Rein Halbersma »

wgarvin wrote:
hgm wrote:
mvk wrote: Once you say "#include <string.h>" the compiler can know the semantics of strcpy and strlen in the code that follows, because <string.h> is standardised. There is no obligation to implement <string.h> with a file for example. (if you want that, you have to say #include "string.h").
Semantics? Do you mean that the library file string.h actually contains the complete definition of strcpy, rather than just a prototype? When I do that for my own header files, it usually leads to "multiply defined symbol" error messages from the linker when such a header is #included in multiple .c files. Does the new standard now allow multiple definitions of the same routine, provided the definitions are identical? Would that also work for definitions I write myself, or just from header files?
I think what he meant is that the standard specifies fully what #include <string.h> means. It means that the standard library functions like strcpy, memcpy, etc. are now defined with the behaviors specified in the spec (but how exactly this is implemented, is up to the implementation). It means the compiler is allowed to be "smart". After it has seen #include <string.h>, it knows that "strcpy" means "the C standard library function strcpy" and not just some random function named "strcpy" that you may have written yourself. So it is then allowed to do clever things that take advantage of its intrinsic knowledge of the semantics of strcpy.
The compiler treats standard library headers just like anybody else's code. The only thing that distinguishes stdlib code is that its functions make heavy use of __builtin_XXX type of stuff that can be more easily optimized. But you can write such stuff yourself at the cost of losing portability (e.g. I use __builtin_ctz and __builtin_popcount to do semi-portable bit-twiddling).
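
A minimal sketch of that kind of semi-portable bit-twiddling, assuming GCC or Clang, where the 64-bit variants of these builtins are spelled __builtin_popcountll and __builtin_ctzll. The wrapper names popcount64 and lowest_bit_index are illustrative.

```c
#include <stdint.h>

/* Number of set bits in a 64-bit word (GCC/Clang builtin, not standard C). */
int popcount64(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Index of the least significant set bit, or -1 for an empty word.
 * __builtin_ctzll is undefined for 0, so that case is handled explicitly. */
int lowest_bit_index(uint64_t x)
{
    return x ? __builtin_ctzll(x) : -1;
}
```

Wrapping the builtins in small functions like these keeps the non-portable part in one place, so another compiler's intrinsics can be swapped in later.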
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: strcpy() revisited

Post by hgm »

Rein Halbersma wrote:You can put function definitions in headers if you prefix them with the inline keyword.
Indeed you can. (More about 'inline' later.) But it seems that this would have the side effect of automatically inlining the code, which might not always be what you want. (Although for strcpy it might be a good idea.)

However, when I run a gcc -E (but not on an Apple machine, of course) to see what the #include <string.h> expands to, the only occurrence of strcpy in the output is:

char *__attribute__((__cdecl__)) strcpy (char *, const char *);
Doesn't seem to actually define the semantics, nor contain an inline directive. And indeed, the following program

#include <string.h>

char *strcpy (char *a, const char *b)
{
 static char buf[] = "fooled!";
 char *p = buf;
 while((*a++ = *p++));
 return NULL;
}


int main()
{
  char a[10], b[10];
  strcpy(a, "Hello");
  strcpy(b, a);
  printf("%s\n", b);
  return 0;
}
causes no errors or warnings on the re-definition of strcpy:

Makro@Makro-PC ~
$ gcc -Wall test.c
test.c: In function `main':
test.c:17: warning: implicit declaration of function `printf'

Makro@Makro-PC ~
$ ./a.exe
fooled!

Makro@Makro-PC ~
$
If the Apple string.h would define strcpy as an inlined routine, it should generate error messages for conflicting declarations, however. Does the C standard allow such different behavior between compilers? Perhaps using a routine named strcpy in itself is specified as 'undefined behavior'?

Now about Apple and inlining. We received a bug report from someone trying to compile XBoard with an Apple compiler. Now the XBoard PGN parser defines two inlined routines:

inline int
Match (char *pattern, char **ptr)
{
    char *p = pattern, *s = *ptr;
    while(*p && (*p == *s++ || s[-1] == '\r' && *p--)) p++;
    if(*p == 0) {
	*ptr = s;
	return 1;
    }
    return 0; // no match, no ptr update
}

inline int
Word (char *pattern, char **p)
{
    if(Match(pattern, p)) return 1;
    if(*pattern >= 'a' && *pattern <= 'z' && *pattern - **p == 'a' - 'A') { // capitalized
	(*p)++;
	if(Match(pattern + 1, p)) return 1;
	(*p)--;
    }
    return 0;
}
Of course these routines are only called from parser.c, where they are defined. But XBoard refuses to build because of linker errors, in particular complaints that parser.c makes function calls to undefined symbols _Match and _Word (or something like that; I would have to look up the actual bug report). I am not sure what exactly causes this problem, so I haven't made any attempt to fix it so far.
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: strcpy() revisited

Post by Rein Halbersma »

hgm wrote:
Rein Halbersma wrote:You can put function definitions in headers if you prefix them with the inline keyword.
Indeed you can. (More about 'inline' later.) But it seems that this would have the side effect of automatically inlining the code, which might not always be what you want. (Although for strcpy might be a good idea.)
Let's see what the Standard says
http://www.open-std.org/jtc1/sc22/wg14/ ... /n1548.pdf
6.7.4 Function specifiers

6 A function declared with an inline function specifier is an inline function. Making a function an inline function suggests that calls to the function be as fast as possible.138) The extent to which such suggestions are effective is implementation-defined.139)

Footnotes:
138) By using, for example, an alternative to the usual function call mechanism, such as ‘‘inline substitution’’. Inline substitution is not textual substitution, nor does it create a new function. Therefore, for example, the expansion of a macro used within the body of the function uses the definition it had at the point the function body appears, and not where the function is called; and identifiers refer to the declarations in scope where the body occurs. Likewise, the function has a single address, regardless of the number of inline definitions that occur in addition to the external definition.
139) For example, an implementation might never perform inline substitution, or might only perform inline substitutions to calls in the scope of an inline declaration.
So inlining is only a suggestion to elide the function call.
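
Because the call may still be emitted as an ordinary call, a definition that the linker can resolve has to exist somewhere. A minimal sketch, assuming C99 inline semantics (where a plain inline definition does not by itself provide an external definition): marking a header-style helper static inline gives every translation unit its own private copy. The functions starts_with and is_move_number are hypothetical and are not taken from XBoard.

```c
/* Hypothetical helper as it might appear in a header.
 * 'static' gives it internal linkage, so no external symbol is needed
 * even when the compiler chooses not to inline a call. */
static inline int starts_with(const char *text, const char *prefix)
{
    while (*prefix) {
        if (*text++ != *prefix++)
            return 0;           /* mismatch, or text ended first */
    }
    return 1;                   /* every character of prefix matched */
}

/* Ordinary caller; works whether or not the call above is inlined. */
int is_move_number(const char *token)
{
    return starts_with(token, "1.");
}
```

The cost of this approach is possible code duplication: each .c file that includes the header and keeps an out-of-line copy carries its own.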
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: strcpy() revisited

Post by Rein Halbersma »

hgm wrote: Perhaps using a routine named strcpy in itself is specified as 'undefined behavior'?
Yes indeed:
If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: strcpy() revisited

Post by hgm »

But strcpy is not a reserved keyword, is it? It is just the name of a routine. Declared in a header file, and defined in the library source.