Next: Stack Reduction Results Up: Experiences with modularizing the Previous: Comparison of core sizes

Reduction of Stack Usage

The second task was to reduce the load placed on the C stack by the interpreter to enable the interpreter to run in environments with limited stack space. In the case of our customer this would be a router.

To this end three different techniques were used:

Identification of variable sized data types with a large default size, instances of which are placed on the stack.

For such types we reduced their default size either through an existing preprocessor macro determining their size, or by introducing such a macro and proceeding as before.

An example of such a type is Tcl_DString. Its base size is on typical 32bit systems, i.e. 12 bytes. However it also allocates a static character buffer as part of the structure to handle small strings without having to allocate space on the heap. The macro controlling the size of this buffer is
TCL_DSTRING_STATIC_SIZE. Its default value is 200, driving the size of the whole structure to 212 bytes.

Table 3 shows the macros introduced in this way.

**Table 3:** Adaptable default sizes
Macro	Default size	Meaning, Location of usage
TCL_DSTRING_STATIC_SIZE	200	Default static buffer in Tcl_DString
TCL_FMT_STATIC_FLOATBUFFER_SZ	320	generic/tclCmdAH.c, line 1978
TCL_FMT_STATIC_VALIDATE_LIST	16	generic/tclScan.c, line 266
TCL_FOREACH_STATIC_ARGS	9	generic/tclCmdAH.c, line 1735
TCL_FOREACH_STATIC_LIST_SZ	4	generic/tclCmdAH.c, line 1739
TCL_FOREACH_STATIC_VARLIST_SZ	5	generic/tclCompCmds.c, line 621
TCL_RESULT_APPEND_STATIC_LIST_SZ	16	generic/tclResult.c, line 458
		generic/tclStringObj.c, line 1199
TCL_MERGE_STATIC_LIST_SZ	20	generic/tclListObj.c, line 1009
		generic/tclUtil.c, line 831
TCL_PROC_STATIC_CLOCALS	20	generic/tclExecute.c, line 6272
TCL_PROC_STATIC_ARGS	20	generic/tclProc.c, line 751
TCL_INVOKE_STATIC_ARGS	20	generic/tclBasic.c, line 1782
		generic/tclBasic.c, line 1857
		generic/tclBasic.c, line 2951
		generic/tclParse.c, line 1332
TCL_EVAL_STATIC_VARCHARS	30	generic/tclParse.c, line 1166
TCL_STATS_COUNTERS	10	generic/tclHash.c, line 306
		generic/tclLiteral.c, line 898
TCL_LSORT_STATIC_MERGE_BUCKETS	30	generic/tclCmdIL.c, line 2680

Identification of large variables on the stack which cannot be shrunk. For these we devised a set of macros for the declaration and use of such variables whose implementation can be switched between the allocation of such variables on the stack and their allocation on the heap.

An example of the usage of these macros can be found in the file generic/tclCompCmds.c, see the function TclCompileIncrCmd:

int
TclCompileIncrCmd(interp, parsePtr, envPtr)
    Tcl_Interp *interp;
    Tcl_Parse *parsePtr;
    CompileEnv *envPtr;
{
    Tcl_Token *varTokenPtr, *incrTokenPtr;
    TEMP (Tcl_Parse) elemParse;
    ...
    STRING (160, buffer);

    NEWTEMP (Tcl_Parse,elemParse);
    NEWSTR (160, buffer);

    envPtr->maxStackDepth = 0;
    if ((parsePtr->numWords != 2) &&
	(parsePtr->numWords != 3)) {
        ...
        RELTEMP(elemParse);
        return TCL_ERROR;
    }
    ...
}

Figure 4 shows side by side the two implementations for allocating temporary variables on either the stack or the heap. If the macro TCL_STRUCT_ON_HEAP is defined during compilation temporary variables will be allocated on the heap.

**Table 4:** Switchable temporary variables
Macro	Heap	Stack
TEMP(t)	t *	t
ITEM(var,item)	var $\rightarrow$ item	var . item
REF(var)	(var)	&(var)
NEWTEMP(t,var)	(var) = (t *) ckalloc(sizeof(t))
RELTEMP(var)	ckfree((void*)(var))
STRING(n,var)	char* var	char var [n]
NEWSTR(n,var)	(var) = (char *) ckalloc(n)

Miguel Sofer provided us with a modified engine for the execution of bytecode. This engine does not invoke itself recursively when calling a byte-compiled procedure from other byte-compiled code. It rather jumps directly back into its own main loop, reusing the state variables on the C stack. The actual state of the engine is saved to and restored from the Tcl evaluation stack, which is managed on the heap. This means that calling Tcl procedures inside of Tcl procedures, and especially recursion of Tcl procedures does not consume any C stack anymore.
This new executor was named NRE, for ``non-recursive engine''.

To identify the hot spots of stack usage, and from there the variables and data structures causing them, we semi-automatically instrumented a copy of the base sources with code monitoring the entrance to and returns from all functions in the interpreter and also recording the the location of the stack pointer for each call.

This instrumentation was only semi-automatic because the script used to insert the code was very simple and did not recognize all possible C syntax for return statements and function calls. About 25 % of the instrumented code had to be reworked manually to get the instrumentation code to work properly in them. There were two big offenders in this regard.

The regular expression engine. It hid lots of return statements in C preprocessor macros and also used a number of unbraced if statements, i.e.
```
if (...)
   return X;
```
causing the instrumentation script to change the semantics of the instrumented code.
Tail-calls scattered throughout the code, i.e.
```
foo ()
{
    ...
    return bar ();
}
```
These were rewritten to
```
foo ()
{
    ...
    X = bar ();
    return X;
}
```
allowing the insertion of the instrumentation code between the called function and the return of the caller itself.

The inserted code manages a logical duplicate of the C call stack on the heap and uses that to record both the names of the active functions and the locations of the cpu stack pointer for them. By comparing the location of the stack pointer before the call of a function F and its location inside, it was possible for us to deduce the amount of C stack space consumed by F.

This method also records the amount of stack required for function linkage, i.e. stack management done by the C compiler. This is no true disadvantage as this amount is usually small, and more important, constant. Because of the latter this does not disturb the ranking of functions when sorting them by the amount of stack they require.

As the execution of the inserted code causes a noticeable slowdown of the interpreter additional processing beyond the comparison of the stack pointer is not done. Instead we write the information about the association between functions and required stack directly to a log file. After each run a number of external scripts is used to postprocess the logged data. The result is a sorted list where the functions consuming the most stack are shown at the top. Thus identifying the hot spots to look at.

Instead of trying to devise a Tcl application which exercises all parts of the interpreter we simply executed the testsuite that comes with the Tcl core to generate the stack traces.

Next: Stack Reduction Results Up: Experiences with modularizing the Previous: Comparison of core sizes

[email protected]