Diving into Gcc: OpenBSD and m88k
Pages: 1, 2
Two Wrongs Don't Make a Right
Even with the function calls fixed, gcc 2.95 was unable to recompile itself:
the build would die when running the helper program genrecog,
which would segfault and dump core.
The genrecog tool is used to generate insn-recog.c
from the machine-dependent backend script, in our case
config/m88k/m88k.md. It shares a lot of code with other
genfoo tools extracting the various information from the
m88k.md machine description file. In fact, I very quickly tracked
the breakage to genrecog.c. Compiling every other part of
genrecog except genrecog.c with gcc 2.95 and
compiling genrecog.c with gcc 2.8 produced a working binary.
While I could continue the build (and did several times!) either by
cross-generating insn-recog.c on a different machine or simply by
reverting to gcc 2.8 for genrecog.c build only, I had to dive in
this problem.
genrecog provides a lot of information (in fact, C source code)
on standard output. When compiled with gcc 2.95, it would die very quickly with
this output:
$ cd obj
$ ./genrecog /usr/src/gnu/egcs/gcc/config/m88k/m88k.md
/* Generated automatically by the program `genrecog'
from the machine description file `md'. */
#include "config.h"
#include "system.h"
#include "rtl.h"
#include "insn-config.h"
#include "recog.h"
#include "real.h"
#include "output.h"
#include "flags.h"
extern rtx gen_split_1 ();
extern rtx gen_split_2 ();
Memory fault (core dumped)
$
By searching for this kind of output (such as extern rtx
gen_split) in the source, I could quickly trace the segfault to this
snippet from main():
while (1)
{
c = read_skip_spaces (infile);
if (c == EOF)
break;
ungetc (c, infile);
desc = read_rtx (infile);
if (GET_CODE (desc) == DEFINE_INSN)
recog_tree = merge_trees (recog_tree,
make_insn_sequence (desc, RECOG));
else if (GET_CODE (desc) == DEFINE_SPLIT)
split_tree = merge_trees (split_tree,
make_insn_sequence (desc, SPLIT));
if (GET_CODE (desc) == DEFINE_PEEPHOLE
|| GET_CODE (desc) == DEFINE_EXPAND)
next_insn_code++;
next_index++;
}
The function responsible for the extern rtx gen_split
lines is make_insn_sequence. The program would then die
between the second and the third merge_trees invocation.
make_insn_sequence might produce an insn
sequence or whatever. To continue diving into genrecog, all
we need to know is what kind of return value it provides. A quick glance
at the source shows:
static struct decision_head make_insn_sequence PROTO((rtx, enum routine_type));
...
static struct decision_head merge_trees PROTO((struct decision_head,
struct decision_head));
These functions work on decision_head structures, which are
defined as:
/* Data structure for a listhead of decision trees. The alternatives
to a node are kept in a doubly-linked list so we can easily add nodes
to the proper place when merging. */
struct decision_head { struct decision *first, *last; };
The best way to know at which point invalid data would be generated was to
add a trace of the decision_head structures.
Adding traces to genrecog is very easy: since it outputs C
code, you just have to output your traces as C comments. Adding simple
traces to merge_trees proved very quickly that it was given
an invalid argument. But then traces in make_insn_sequence
would prove that it would produce valid data. The output would end like
this:
...
extern rtx gen_split_1 ();
/* make_insn_sequence -> 2e000.2e000 */
/* merge_trees <- 0.0 8.4008 */
/* merge_trees -> 8.4008 */
extern rtx gen_split_2 ();
/* make_insn_sequence -> 2e380.2e380 */
/* merge_trees <- 8.4008 18.4018 */
Memory fault (core dumped)
$
This is proof that the problem lies in the way short structures are passed to and returned from functions. Consider the following example:
$ cat maze.c
#include <stdio.h>
struct maze { const char *item1, *item2; };
struct maze
builder(void)
{
struct maze m;
m.item1 = "you are lost";
m.item2 = "in the maze.";
printf("builder: m.item1 = %p, item2 = %p\n", m.item1, m.item2);
return m;
}
void
checker(struct maze m)
{
printf("checker: m.item1 = %p, item2 = %p\n", m.item1, m.item2);
}
main()
{
checker(builder());
}
$
When compiled by gcc 2.8, it would output, for example:
$ gcc28 -O0 -o maze maze.c
$ ./maze
builder: m.item1 = 0x10f8, item2 = 0x1108
checker: m.item1 = 0x10f8, item2 = 0x1108
$
However, gcc 2.95 produces:
$ gcc295 -O0 -o maze maze.c
$ ./maze
builder: m.item1 = 0x10f8, item2 = 0x1108
checker: m.item1 = 0x0, item2 = 0x0
$
Trial and error showed that this problem only affected structures from 8 to 32 bytes, inclusive.
If you look at the generated code for main(), aside from the
prologue and epilogue code, you'll see that gcc 2.8 generates the following
correct code:
or r12,r0,r31
bsr _builder
bsr _checker
gcc 2.95 adds two extra instructions:
ld.d r24,r0,r31
or r12,r0,r31
bsr _builder
st.d r24,r0,r31
bsr _checker
Before we go further, we need to know more about the m88k calling
convention. The standard is never to pass the structures in registers,
always on the stack. If the function returns a struct itself, the address
of the returned struct should be set in r12 by the caller
which will have allocated the appropriate space.
In the gcc 2.8 code, the compiler has already optimized the call flow
so that the structure is immediately on the stack frame and is thus
immediately usable in checker(). After r12 is
set to point to the temporary location, builder() and
checker() are invoked.
gcc 2.95, on the other hand, adds two extra statements. The first one
saves the stack area which is about to be used by builder()
in registers r24 and r25. The second statement
restores this area immediately before invoking checker(),
effectively making it check uninitialized memory.
If we use a temporary variable to store the builder() result,
such as:
#if 0
checker(builder());
#else
struct maze m = builder();
checker(m);
#endif
then gcc 2.8 and gcc 2.95 will produce the same, correct, code:
addu r12,r30,8
bsr _builder
or r13,r0,r31
addu r11,r30,8
ld r12,r11,0
ld r11,r11,4
st r12,r13,0
st r11,r13,4
bsr _checker
In this case, the variable m is at r30(8).
After builder() returns, the structure is copied to
r31(0), which will be the frame for checker().
(Note that the structure could be copied with one ld.d and
one st.d instruction, rather than two ld and two
st...)
So we have a problem which affects structures only when they are not
forced to be in memory. It makes then sense to hunt for a bug in the
RETURN_IN_MEMORY macro, which will decide whether a
particular object can be returned in a register:
$ cd config/m88k
$ head -1001 m88k.h | tail -8
/* Disable the promotion of some structures and unions to registers. */
#define RETURN_IN_MEMORY(TYPE) \
(TYPE_MODE (TYPE) == BLKmode \
|| ((TREE_CODE (TYPE) == RECORD_TYPE || TREE_CODE(TYPE) == UNION_TYPE) \
&& !(TYPE_MODE (TYPE) == SImode \
|| (TYPE_MODE (TYPE) == BLKmode \
&& TYPE_ALIGN (TYPE) == BITS_PER_WORD \
&& int_size_in_bytes (TYPE) == UNITS_PER_WORD))))
$
But if we look back at the code for FUNCTION_ARG, the comments
say:
Structures and unions which are not 4 byte, and word
aligned are passed in memory rather than registers, even if they
would fit completely in the registers under OCS rules.
and the code indeed does:
if (type != 0 /* undo putting struct in register */
&& (TREE_CODE (type) == RECORD_TYPE || TREE_CODE (type) == UNION_TYPE))
mode = BLKmode;
...
if (mode == BLKmode
&& (TYPE_ALIGN (type) != BITS_PER_WORD
|| bytes != UNITS_PER_WORD))
return (rtx) 0;
which really prevents any structure larger than 4 bytes from being passed in a register.
A more realistic version of RETURN_IN_MEMORY becomes then:
/* Disable the promotion of some structures and unions to registers.
Note that this matches FUNCTION_ARG behavior. */
#define RETURN_IN_MEMORY(TYPE) \
(TYPE_MODE (TYPE) == BLKmode \
|| ((TREE_CODE (TYPE) == RECORD_TYPE || TREE_CODE(TYPE) == UNION_TYPE) \
&& (TYPE_ALIGN (TYPE) != BITS_PER_WORD || \
GET_MODE_SIZE (TYPE_MODE (TYPE)) != UNITS_PER_WORD)))
Unfortunately, this fix, albeit an improvement, did not solve the issue. However, this problem was specific to the m88k compiler and could not be triggered on various other architectures, from Alpha to Vax. Investigating the differences between m88k and the other CPUs was the next logical step.
Depending on the various way function calls work across the different cpus, with various calling conventions and stack layouts, not even counting register windows, the machine-independent part of gcc tries to offer as much flexibility as possible, letting architecture-dependent configuration files define the different behaviors they provide or expect, using a bunch of macros.
Some of these macros must be defined for every architecture, such as
LIBCALL_VALUE which defines where a function returns its
result (on m88k, in register r2), while other macros are only
defined if the architecture requires it. The m88k is unique in defining
REG_PARM_STACK_SPACE and
OUTGOING_REG_PARM_STACK_SPACE (actually, a few other
architectures do this as well), as well as
STACK_PARMS_IN_REG_PARM_AREA.
What do these macros tell the compiler? Let's quote from the manuals:
REG_PARM_STACK_SPACE (FNDECL)Define this macro if functions should assume that stack space has been allocated for arguments even when their values are passed in registers.
The value of this macro is the size, in bytes, of the area reserved for arguments passed in registers for the function represented by
FNDECL, which can be zero if GNU CC is calling a library function.This space can be allocated by the caller, or be a part of the machine-dependent stack frame:
OUTGOING_REG_PARM_STACK_SPACEsays which.OUTGOING_REG_PARM_STACK_SPACE- Define this if it is the responsibility of the caller to allocate the area reserved for arguments passed in registers.
Nothing fancy here. The m88k-specific backend indeed will automagically allocate 32 bytes on the stack during the function prologue. Hey, wait, 32 bytes, exactly like the structure size threshold, where bigger structures don't get clobbered.
A simple experiment is to comment out
OUTGOING_REG_PARM_STACK_SPACE. Compiling gcc with these settings
will produce a stack-wasting compiler, because every function prologue will now
automagically allocate an extra 64 bytes on the stack: 32 from the
m88k-specific prologue and 32 from the architecture-independent prologue code,
since it has been now instructed to do so. However, when using this compiler,
the problem disappears completely, whatever the size of the structure.
Another path worth trying would be to remove this automatic stack
allocation, and undefine REG_PARM_STACK_SPACE. However, I am
afraid this could break some 88Open rule, or, even worse, some implicit
assumption hidden in the m88k-specific backend code.
Now it's time to look at STACK_PARMS_IN_REG_PARM_AREA. Quoting
the manuals again:
STACK_PARMS_IN_REG_PARM_AREADefine this macro if
REG_PARM_STACK_SPACEis defined, but the stack parameters don't skip the area specified by it.Normally, when a parameter is not passed in registers, it is placed on the stack beyond the
REG_PARM_STACK_SPACEarea.Defining this macro suppresses this behavior and causes the parameter to be passed on the stack in its natural location.
This is very interesting, and also very obscure (I'll know what to blame for
my next headache). No other architecture defines this. If I understand this
correctly, it means that the REG_PARM_STACK_SPACE area can be
shared by both function call parameters, and local variables.
While gcc 2.8 would not care about this area being shared, and assumes we
know what we are doing, gcc 2.95 is more strict, and will explicitly save any
variable of this shared area, when it might be clobbered. This is exactly what
we witnessed. The area on the stack which has been saved before invoking
builder() and restored after it returned is the implicit location
for a temporary struct maze.
A real fix would be to teach gcc to handle the shared area as gcc 2.8 did,
but then this behaviour is objectionable, and probably more a bug than a
feature. My workaround was to stop defining
STACK_PARMS_IN_REG_PARM_AREA. The generated code becomes:
addu r12,r30,8
bsr _builder
addu r13,r31,32
addu r11,r30,8
ld r12,r11,0
ld r11,r11,4
st r12,r13,0
st r11,r13,4
bsr _checker
which is the same code as when the result of builder()
was stored in an explicit temporary variable, except for the stack
location: instead of being at the beginning of the
REG_PARM_STACK_SPACE area, it is now past this area. So there
is still some stack waste, but as told before I do not want to change
REG_PARM_STACK_SPACE at the moment.
With these bugs fixed, I reached a point where -O0 would
apparently always produce correct code. It was time to make optimization
reliable as well!
References
- The GCC Internals. Some parts are not applicable to gcc 2.95, but most of this still applies and will provide more details than I can do here.
- The OpenBSD/mvme88k port page
- C Inline functions with GCC
Miod Vallat makes all sort of strange telephony hardware work reliably for a living.
Return to the BSD DevCenter.