CXXR (C++ R): Refactoring History

This page describes the phases so far completed within the CXXR project to refactor the R engine into C++. Each phase is placed within the Subversion tags directory, with a name of the form 0.00-2.5.0, where 0.00 indicates the phase, and 2.5.0 indicates the R release to which that phase is intended to correspond.

Phase 0: `0.00-2.5.0`

In this phase all .c files within src/main are renamed to .cpp, with the following exceptions:

complex.c: This file uses the C99 complex types, which are not (under the current C++ standard) understood by a C++ compiler;
gram.c: This file is automatically generated by yacc/bison;
regex.c: The source of this file is very insistent that it is C, not C++: it gives a #warning if you attempt to compile it with a C++ compiler.

(Subsequently, RNG.c was also reverted to C, to respect Knuth's copyright statement.)

The result of this phase does not build correctly; however, it is useful as a baseline for seeing the subsequent changes.

Phase 1: `0.01-2.5.0`

Make such changes to the result of Phase 0 to enable the .cpp files to compile without warning using -Wall with gcc-4.1.3, retaining C linkage conventions for everything defined in .h files. Ensure that the whole of R will build correctly and pass make check.

A desirable side effect of enforcing C linkage was that the linkage editor picked up several instances where the source file implementing a function failed to #include the appropriate header file, and consequently generated a function with C++ linkage: see below.

This needed to address the following issues:

Rboolean is different from C++ bool. Rboolean is an enumeration with elements FALSE=0 and TRUE=1; bool is a primitive type, with values false and true. (Also, there are #defines of FALSE to 0 and TRUE to 1 lurking around in the R code, just to confuse matters.) In particular an Rboolean is a different size from a bool. It was necessary to introduce many explicit conversions from bool (resulting in C++ from evaluating Boolean expressions) or integer types to Rboolean.
In connection with this, a macro RBOOL(x) was defined within Rinlinedfuns.h; it expands to x in C and to Rboolean(x) in C++.
The C++ keywords class, new, private and this were used as identifiers; these had to be renamed, e.g. class changed to connclass.
In various places, particularly connections.cpp, a void* was implicitly converted to another type of pointer. These conversions were made explicit, and flagged /*CCAST*/.
datetime.cpp and memory.cpp used statements of the form i -= d; where i is of integer type and d is an expression evaluating to a floating point type. This was converted to the form i = int(i - (d)); to avoid a compiler warning. This interpretation complies with sec. 6.5.12.2 of the C99 standard ISO/IEC 9899:1999.
The structure type NewDevDesc defined in GraphicsDevice.h contains a number of pointers to functions as members, and the types of these functions were specified without giving the number and types of the function arguments. This was rectified. It was also necessary to give this structure a tag (_NewDevDesc) because most of these functions included a pointer to a NewDevDesc among their arguments.
It was necessary to shift some of the material in R_ext/GraphicsEngine.h, in particular the definition of R_GE_context, into a new header file R_ext/GraphicsContext.h, to avoid reciprocal dependencies between GraphicsEngine.h and GraphicsDevice.h.
The pointer to function type CCODE, defined in Defn.h, was redefined to make the number and type of its arguments explicit, as follows:
```
typedef SEXP (*CCODE)(SEXP, SEXP, SEXP, SEXP);
```
If __MAIN__ is defined, libextern.h #defined extern to the empty string, which could play havoc with the extern "C" used in C++ to enforce C-style linkage. This #define was commented out, and instead a new macro extern1 was #defined within Defn.h.
In numerous places it was necessary to make conversions from floating-point types to integer types explicit. In other places it was clear that the same effect could be achieved without deleterious side effect by changing the type of a variable.
It was necessary to introduce reinterpret_casts in various places in memory.cpp, scan.cpp, serialize.cpp and vfonts.cpp. (In future it is the intention to get rid of as many of these as possible, as well as getting rid of all C-style casts.)
In Defn.h, the whole declaration extern FUNTAB R_FunTab[]; was made #ifndef __R_Names__, not just the word extern.
In some places I couldn't resist changing the type of a function argument from a plain pointer to a const pointer. We can expect much more of this later, but this may have been premature.
sysutils.cpp (conditionally) contained an extern declaration of environ; the compiler considered this to have C++ linkage, conflicting with the C-linkage definition in unistd.h (subsequently #included into sysutils.cpp). This extern declaration has been itself replaced by a (conditional) #include of unistd.h.
Sorted out problems where a file implementing a function failed to include the relevant header file. In some cases this was because the prototype didn't appear in any header file, and clients of the function were instead relying on a prototype within the client source file itself! Such misplaced prototypes were found in eval.cpp, format.cpp, memory.cpp, platform.cpp, printutils.cpp, and library/methods/src/methods_list_dispatch.c; they were commented out, and flagged with the comment "Use header files!". Needed prototypes that didn't appear in any header file were generally placed at the end of Defn.h.
A particularly obscure example of this kind concerns R_CHAR. This is declared as a pointer to a function in Rinternals.h, and implemented in memory.cpp. Now memory.cpp does #include Rinternals.h, but it does so with USE_RINTERNALS defined, as a result of which the R_CHAR declaration in the header file isn't seen by the compiler, and so the implemented function got C++ linkage. I modified the header file by moving the R_CHAR declaration outside the #ifndef USE_RINTERNALS.
The definitions in print.cpp of functions intended to be called from FORTRAN needed to be surrounded by extern "C"{ ... }.
deparse.cpp:1191 used & where && was surely intended; character.cpp:738 similarly used | instead of ||.
-Wall complains about attempts to compare signed with unsigned. This required explicit conversions in numerous places. Generally (but not always) I did this by converting unsigned to signed. In other places it was clear that the same effect could be achieved without deleterious side effect by changing the type of a variable.
In connection with this, the macro AGE_NODE in memory.cpp had to be changed to make an__g__ unsigned.
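Two of the conversions described above can be sketched as follows. This is a minimal illustration only: the enumeration is a simplified stand-in for R's own Rboolean, the exact text of RBOOL in Rinlinedfuns.h is assumed from the description given, and the function names isPositive and subtractTruncating are hypothetical.

```cpp
#include <cassert>

// Simplified stand-in for R's enumeration (assumed layout: FALSE=0, TRUE=1).
typedef enum { FALSE = 0, TRUE = 1 } Rboolean;

// Assumed form of RBOOL: plain x in C, an explicit conversion in C++.
#ifdef __cplusplus
#define RBOOL(x) Rboolean(x)
#else
#define RBOOL(x) (x)
#endif

// In C++ the comparison n > 0 has type bool, which does not convert
// implicitly to an enumeration, so the conversion must be spelled out.
// (Hypothetical function, for illustration only.)
Rboolean isPositive(int n)
{
    return RBOOL(n > 0);
}

// The pattern 'i -= d' with an integer i and a floating-point d,
// rewritten so that the truncation back to int is explicit.
// (Hypothetical function, for illustration only.)
int subtractTruncating(int i, double d)
{
    i = int(i - (d));
    return i;
}
```

The subtraction is still carried out in floating point and only then truncated, which is what the original compound assignment did implicitly.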

Phase 2: `0.02-2.5.0`

In subsequent phases (possibly starting in Phase 3) it is our objective to replace the SEXPREC union by a hierarchy of C++ classes. This phase prepares for that by reorganising the material in the header files in src/include. This involves creating a new subdirectory src/include/CXXR, and within that creating a new header file RObject.h (ultimately to include a base class RObject for the new hierarchy), and further header files RClosure.h, REnvironment.h, RInternalFunction.h, RPairList.h, RPromise.h, RSymbol.h and RVector.h, corresponding respectively to closxp_struct, envsxp_struct, primsxp_struct, listsxp_struct, promsxp_struct, symsxp_struct and vecsxp_struct, which will eventually be derived classes. The material in these new headers comes predominantly from Rinternals.h, but to some extent (in the case of RInternalFunction.h) from Defn.h. All of the new header files, with the exception of RInternalFunction.h, are also installed in $(rincludedir)/CXXR.

Function prototypes moved into the new header files are documented using doxygen. Where it was clearly consistent with the semantics, some of the argument types of the functions were changed, either by adding const, or by converting int into Rboolean (however, see the issues below regarding the latter).

The following are implementational details and issues that arose:

The implementation of SEXPREC (though still the unchanged C code) was made visible only to C++ programs. This is to get advance warning of potential problems when the implementation is changed to C++.
In many places CR defined a name as a macro when USE_RINTERNALS was defined, and otherwise as a function. It has been the intention in this phase to replace the macros with C++ inline functions: these would automatically also generate a non-inlined form, so the separate definition (usually in memory.cpp) could be dispensed with.
This was all very well where the function form was implemented in CR simply by invoking the macro; however in some cases the function form carried out some error checking before invoking the macro. Trying to convert the macro to an inline function would then result in two distinct functions with the same name, which the compiler and/or linker would certainly reject.

In the end it was decided to leave the macros in place for the time being: they'll have to be changed when the C++ implementation rolls out anyway.
I considered getting rid of the USE_RINTERNALS compilation conditions, but decided to retain them to mark out material (usually currently in the form of macro definitions) that will in the future need privileged access to a C++ class. Only memory.cpp now #defines USE_RINTERNALS.
Rinternals.h contained many #defines of function names to the same name prefixed by Rf_: this appears to correspond in C++ terms to putting these functions in a namespace. I split these #defines out into a separate header file Rf_namespace.h, which is #included by RObject.h (which is in turn included by the other new headers). There are various similar #defines scattered around other CR header files, which may need to be moved into Rf_namespace.h in due course.
I dithered about whether to name the file in question RInternalFunction.h or RPrimitiveFunction.h. Usage in the CR code (e.g. primsxp_struct) suggests the latter, and the R Internals document speaks of internal and primitive functions as being mutually exclusive, but fails to give a more general name covering any function handled via R_FunTab. But it seems to be reasonable to regard primitive functions as a special case of an internal function, hence the eventual choice of RInternalFunction.h.
It is noted that Rdynload.cpp and dotcode.cpp each give compiler warnings under -pedantic because they attempt to cast function pointers to void*. The source code of the former already contains a comment saying that it's illegal even in C. Not easy to fix, so leave for now.
It seems logical (!) that a logical vector (LGLSXP) should contain items of type Rboolean rather than of type int, and consequently that the macro/function LOGICAL(SEXP) should return Rboolean* rather than int*. I made some attempt to do this, but backed out of it for the following reasons:
- The .C interface expects these vectors to contain ints;
- ISO14882:1998 says that in C++, subject to certain constraints, it is implementation-defined which integral type is used as the underlying type for an enumeration (though gcc happens to use int for Rboolean).
- ISO9899:1999 says much the same for C, but with differently worded constraints.
- In any case, despite the commented-out MAYBE value in the enumeration, perhaps Rboolean is best thought of as 'bool for C', rather than having any capability to handle NAs.
Possible new policy: within functions visible from C, use Rboolean as a substitute for C++ bool, possibly constrained to be 32 bits long to avoid the enum implementation dependencies noted above. However, R logical vectors will continue to be represented using ints. (One day we might define an Rlogical class - a wrapper round an int - to handle logical vectors within C++, while C programs simply see typedef int Rlogical;.)
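The Rlogical idea floated above might look something like the following sketch. No such class exists at this phase; the member names isNA and isTrue are hypothetical, and the use of INT_MIN to represent NA is an assumption made here to mirror the way R represents NA_LOGICAL.

```cpp
#include <cassert>
#include <climits>

#ifdef __cplusplus
// Hypothetical wrapper round an int, as seen from C++.
class Rlogical {
public:
    Rlogical(int v = 0) : m_value(v) {}

    // Assumption: NA is represented by INT_MIN, as for R's NA_LOGICAL.
    bool isNA() const { return m_value == INT_MIN; }

    // True only for a non-NA, non-zero value.
    bool isTrue() const { return !isNA() && m_value != 0; }

    // Allow the object to be read back as the underlying int.
    operator int() const { return m_value; }

private:
    int m_value;
};
#else
/* As seen from C, the same storage is just an int. */
typedef int Rlogical;
#endif

// The wrapper must occupy exactly the space of an int, so that a vector
// of Rlogical can be handed to the .C interface unchanged.
static_assert(sizeof(Rlogical) == sizeof(int),
              "Rlogical must be layout-compatible with int");
```

The point of such a design would be that the vector data keep their int representation, while C++ code gains a type that can distinguish NA from TRUE and FALSE.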

Phase 3: `0.03-2.5.0`

The primary objective of this phase was to redefine R_NilValue as a null (i.e. zero) pointer of type SEXP. R_NilValue is widely used within CR as a stub, i.e. to signify that something that might be present is absent, in much the same way that a null pointer is used within C or C++. However, in CR it is actually implemented in effect as an element of a pairlist (i.e. struct listsxp), whose CAR, CDR, TAG and attributes all point to itself. This would cause difficulties in CXXR when we reimplement the SEXPREC union as a type hierarchy, because pairlist elements will need to be of a specific type within the hierarchy. If R_NilValue were given this type, it would preclude its use as a general-purpose stub. But zero is a possible value for a pointer of any type, so if we equate R_NilValue to zero this will sidestep the problem.

Another disadvantage of the CR definition of R_NilValue is that it needlessly introduces a cyclic data structure.

The following are implementational details and issues that arose in carrying out this change:

The existing code in many places invokes functions/macros CAR, CDR, TAG and ATTRIB on a SEXP that may in fact be R_NilValue, expecting in this case for each of these functions to return R_NilValue. These functions were reimplemented to preserve this behaviour: i.e. each of them returns a null pointer if passed a null pointer. At the same time the macro forms were abolished: they are now implemented as inline functions for C++, and ordinary functions if called from C.
In the same spirit, OBJECT and IS_S4_OBJECT have been reimplemented to return FALSE if passed a zero pointer. They too are now implemented as inline functions for C++, and ordinary functions if called from C.
No such modification was made to NAMED: the policy here is that the calling code should be modified as necessary to prevent it being invoked for a null pointer. Invocations of SET_NAMED, PRINTNAME, NODE_IS_MARKED, SET_ATTRIB, SET_OBJECT, and LENGTH were dealt with similarly. (This last case is interesting because LENGTH is meant to be applied to vector objects, i.e. components of the SEXPREC union different from struct listsxp.) The calling sites concerned were determined by running make check at top-level: doubtless many have slipped through the net!
Incidental to the above changes, some of the macros in memory.cpp were replaced by inline functions.
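The null-tolerant accessors described above can be sketched as follows. The Node structure is a hypothetical, much simplified stand-in for a pairlist element, and listLength is an illustrative helper, not a CXXR function.

```cpp
#include <cassert>

// Hypothetical stand-in for a pairlist element.
struct Node {
    Node* car;
    Node* cdr;
    Node* tag;
};

typedef Node* SEXP;

// R_NilValue as a null pointer rather than a self-referential node.
static const SEXP R_NilValue = 0;

// Each accessor returns a null pointer if passed a null pointer, which
// preserves the behaviour that existing callers rely on.
inline SEXP CAR(SEXP e) { return e ? e->car : 0; }
inline SEXP CDR(SEXP e) { return e ? e->cdr : 0; }
inline SEXP TAG(SEXP e) { return e ? e->tag : 0; }

// Walking a list now terminates at the null pointer.
// (Illustrative helper only.)
inline int listLength(SEXP s)
{
    int n = 0;
    for (; s != R_NilValue; s = CDR(s))
        ++n;
    return n;
}
```

Because CDR of a null pointer is again a null pointer, code that steps one element past the end of a list behaves as it did when R_NilValue pointed to itself.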

A secondary objective of this phase was to get rid of C-style casts within the C++ code, wherever the appropriate remedy was reasonably obvious and straightforward. The following kinds of C-style casts were left in place pending further work:

Casts from one function pointer type to another (often involving DL_FUNC);
Casts from one struct pointer type to another (often involving DevDesc and GEDevDesc);
Use of the construct (void*)(-1);
Casts to/from R_varloc_t;
Other puzzling casts.

Addendum 2007/08/06: although make check works with this release, make check-devel doesn't.

Phase 4: `0.04-2.5.1`

The primary objective of this phase was to update the program to parallel release 2.5.1 of R. This proved to be straightforward, except that it was necessary to install a later version of svn_load_dirs.pl to cope with filenames containing @ signs. (However, I was surprised to discover that svn merge doesn't track renames.)

Other changes were as follows:

Bugs revealed by make check-devel were fixed. In general this was done by modifying certain functions to behave reasonably if passed a null pointer, namely LENGTH (returns 0), NAMED (returns 0) and SET_NAMED (does nothing). These changes obviated some of the changes made leading up to svn revision 49 (see Phase 3 above), and these changes were accordingly reversed. make check-all also now works, but it was time-consuming to run and revealed no bugs.
I managed to get autoconf working properly, and accordingly backed out of some configuration kludges I had made previously.

Phase 5: `0.05-2.5.1`

The aim of this phase was to create a branch entitled const, to explore to what extent the R code is amenable to 'constifying': i.e. converting pointers and C++ references wherever possible to const pointers. Two preliminary steps, carried out in the trunk, were as follows:

In the C++ source files in main, macros were replaced by inline functions wherever it was reasonably straightforward to do so. (The reason for doing this now was that during the constification process, it was usually extremely difficult to see what the compiler was complaining about if a multiline macro was involved.)
Similar changes were made to the header files under src/include: however, the pattern here was to convert a macro to an inline function if the header file was #included into a C++ file, and to an out-of-line call to the same function if the header file was #included into a C file.

This macro conversion was counterindicated in the following circumstances:
- The body of the macro was not syntactically equivalent to a function call;
- The macro used the token-pasting operator ##;
- The macro modified its arguments, e.g. something like
```
#define INC(x) ++(x)
```
  (Using C++ reference arguments to get round this is not as straightforward as it might seem.)
- The macro referred to local variables at the point of call (although in some cases such macros were converted to inline functions with additional arguments);
- In some cases macros were left in place if they expanded to a single C/C++ expression or to a single macro invocation: it is the multiline macros that are particularly opaque.
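One way in which a reference argument can fail to stand in for a macro that modifies its argument is sketched below. This is an illustration of a general C++ restriction (a non-const reference cannot bind to a bit-field); whether it is the particular difficulty met in the R sources is an assumption. The names inc, Flags and bump are hypothetical.

```cpp
#include <cassert>

// The macro form: pure text substitution, so it works on any lvalue.
#define INC(x) ++(x)

// A candidate replacement using a C++ reference argument.
template <typename T>
inline T& inc(T& x)
{
    return ++x;
}

// Hypothetical structure containing a bit-field.
struct Flags {
    unsigned int gen : 2;   // bit-field member
    int count;              // ordinary member
};

inline Flags bump()
{
    Flags f = {0, 0};
    INC(f.gen);        // fine: expands to ++(f.gen)
    inc(f.count);      // fine: an ordinary int binds to int&
    // inc(f.gen);     // would not compile: a non-const reference
    //                 // cannot bind to a bit-field
    return f;
}
```

A similar mismatch arises when the macro is applied to operands of several different types, which a single non-template function signature cannot accommodate.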
An incidental change was this: Until now, the type SEXPREC was defined along the following lines:
```
typedef struct SEXPREC { ... } SEXPREC;
```
with the first occurrence of SEXPREC being what in C would have been a structure tag. This has now been changed to:
```
typedef struct RObject { ... } SEXPREC;
```
exploiting the fact that in C++ RObject is a fully-fledged class name. The header files in src/include/CXXR now generally refer to RObject rather than SEXPREC.

Having established the const branch, constification was set in train by the brute force measure of redefining SEXP to mean const RObject* rather than simply RObject*; a new typedef mapped vSEXP onto plain RObject*. In the same spirit 'v' variants of many of the accessor functions were introduced: for example now CAR takes a SEXP argument and returns a SEXP, while vCAR takes and returns a vSEXP. (Since these accessor functions are required to be callable from C, we can't simply overload CAR.)
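The typedefs and the paired accessors can be sketched as follows. The RObject structure here is a hypothetical two-member stand-in, and markNamed is an illustrative function invented to show where a 'v' becomes necessary.

```cpp
#include <cassert>

// Hypothetical, much reduced stand-in for RObject.
struct RObject {
    RObject* car;
    int named;
};

// In the const branch SEXP points to a const object; vSEXP does not.
typedef const RObject* SEXP;
typedef RObject* vSEXP;

// Functions with C linkage cannot be overloaded, hence two names.
extern "C" {
inline SEXP CAR(SEXP e) { return e->car; }
inline vSEXP vCAR(vSEXP e) { return e->car; }
}

// Any function that modifies its argument must take a vSEXP.
// (Illustrative function only.)
inline void markNamed(vSEXP e) { e->named = 2; }

// Given 'SEXP s', the call markNamed(CAR(s)) would be rejected by the
// compiler, so the caller must itself hold a vSEXP and use vCAR: this
// is how the 'v's spread from one function to the next.
```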

I then attempted to recompile various files, inserting 'v's wherever the compiler demanded it. It quickly became apparent that these 'v's were highly contagious: for example, both NA_STRING and R_EmptyEnv had to be declared as vSEXPs rather than SEXPs. This led me to the conclusion that it was premature to attempt constification until I understand the evaluation process better.

At the time of tagging this release, the following files compile without warnings in the const branch: memory.cpp, envir.cpp and names.cpp. eval.cpp gives one compilation error, when do_function attempts a non-const operation on its op argument: fixing this would mean changing the signature of all the do_ functions.

Phase 6: `0.06-2.5.1`

In CR, each SEXPREC has a node class in the range 0 to 7. Nodes of non-vector SEXPTYPE (i.e. not of types CHARSXP, LGLSXP, INTSXP, REALSXP, CPLXSXP, STRSXP, VECSXP, EXPRSXP, WEAKREFSXP or RAWSXP) are all in class 0, and are 28 bytes long. Class 7 is used for vector nodes whose vector data amount to more than 128 bytes; the remaining classes are used for smaller vectors, classified according to their size. Nodes of class 7 are allocated directly using malloc; nodes of the remaining classes are allocated from 'pages' about 2 kB in size, with each node class having its own pages. In CXXR it is intended to replace SEXPRECs with an extensible class hierarchy (rooted at RObject), so it will not be feasible to put a tight upper bound on the size of non-vector nodes.

Another feature of CR is that in vector nodes, a single block of memory contains the data of the vector preceded by a SEXPREC and information about the length of the header. This is quite incompatible with the design philosophy of C++, which is that the size of an object must be deducible from its (C++) type: in particular ::operator delete relies on this.

The purpose of Phase 6 was to circumvent these problems, and at the same time to endeavour to decouple the code for allocating memory from the code managing garbage collection. This comprised the following changes:

A new class CXXR::Heap was created to handle allocation and deallocation of blocks of memory. This parallels CR to the extent that requests for large blocks are passed on directly to ::operator new, while requests for small blocks are satisfied by allocating fixed-sized cells carved out of 'superblocks'. However, this is an implementational detail and is not visible to the remainder of CXXR: only the total number of bytes and the total number of blocks allocated via CXXR::Heap are visible (using static member functions).
It is intended that CXXR::Heap will serve as a back-end to implementations of operator new and to an STL-compatible Allocator class. Note in particular that the blocks allocated from CXXR::Heap are not exclusively used to create RObjects, but may be used for any purpose where rapid allocation/deallocation of small blocks is required.
Node classes have been abolished, and the garbage collector now treats all nodes in the same way. In particular, following garbage collection, all unused nodes are deallocated back to CXXR::Heap. (CR deallocates only large vector nodes.)
The data of a vector object now resides in a separate block allocated from CXXR::Heap; a data member m_data of RObject (in due course to be factored out into a derived class) points to this block. For non-vector objects, and vectors of size zero, m_data is a null pointer. (CR appears to allocate at least 8 bytes of vector data even when the nominal size of the vector is zero.)
In CR decisions about when to garbage collect, and how many generations to collect are based (a) on the total number of nodes of classes 0-6, and (b) the total size of the vector data in nodes of class 7 (reckoned in units of 8 bytes). In CXXR the same logic is used, but based (a) on the total number of nodes, and (b) the total number of bytes currently allocated from CXXR::Heap, divided by 8.
I was strongly tempted to base GC exclusively on (b), and to ignore the number of nodes - after all, we're talking about a single resource here: memory. I'd welcome opinions about this.
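The division of labour described above for CXXR::Heap can be sketched roughly as follows. This is not the CXXR implementation: the class name SketchHeap, the single 32-byte cell size, the superblock size and the member names are all assumptions chosen to keep the example short, and superblocks are never returned to the system.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical sketch of a small-block allocator with a large-block fallback.
class SketchHeap {
public:
    static void* allocate(std::size_t bytes)
    {
        void* p;
        if (bytes > sizeof(Cell)) {
            // Large request: pass straight on to ::operator new.
            p = ::operator new(bytes);
        } else {
            // Small request: take a fixed-size cell from the free list,
            // carving up a new superblock if the list is empty.
            if (!s_free) refill();
            p = s_free;
            s_free = s_free->next;
        }
        ++s_blocks;
        s_bytes += bytes;
        return p;
    }

    static void deallocate(void* p, std::size_t bytes)
    {
        if (bytes > sizeof(Cell)) {
            ::operator delete(p);
        } else {
            // Return the cell to the front of the free list.
            Cell* c = static_cast<Cell*>(p);
            c->next = s_free;
            s_free = c;
        }
        --s_blocks;
        s_bytes -= bytes;
    }

    // The only state visible to clients: totals currently allocated.
    static std::size_t blocksAllocated() { return s_blocks; }
    static std::size_t bytesAllocated() { return s_bytes; }

private:
    union Cell {
        Cell* next;        // link used while the cell is free
        char payload[32];  // space used while the cell is allocated
    };
    enum { CELLS_PER_SUPERBLOCK = 64 };

    static void refill()
    {
        Cell* block = static_cast<Cell*>(
            ::operator new(CELLS_PER_SUPERBLOCK * sizeof(Cell)));
        for (int i = 0; i < CELLS_PER_SUPERBLOCK; ++i) {
            block[i].next = s_free;
            s_free = &block[i];
        }
    }

    static Cell* s_free;
    static std::size_t s_blocks;
    static std::size_t s_bytes;
};

SketchHeap::Cell* SketchHeap::s_free = 0;
std::size_t SketchHeap::s_blocks = 0;
std::size_t SketchHeap::s_bytes = 0;
```

A garbage-collection trigger of the kind described would then compare bytesAllocated() / 8 with a threshold, in place of CR's count of 8-byte units of large vector data.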

Phase 7: `0.07-2.5.1`

The purpose of this phase was to encapsulate all the garbage-collection logic within C++ classes. Five such classes were introduced, namely GCManager, GCNode, GCEdge, GCRoot and WeakRef, as now described.

Class GCManager, as the name implies, carries out high-level management of garbage collection. It has no non-static data or methods. When CXXR::Heap indicates (via a callback) that it is on the point of requesting additional memory from the operating system, method GCManager::gc() decides whether to carry out a garbage collection, and if so how many generations to collect. As contemplated at tag 0.06-2.5.1, this decision is now based only on the total memory allocated via CXXR::Heap, and not on the number of nodes allocated. If GCManager decides to carry out a garbage collection, this is carried out by calling GCNode::gc(), specifying the number of generations to be collected.
Class GCNode is intended to be the base class for all objects subject to garbage collection; RObject is now derived from GCNode. All GCNodes are threaded on circular doubly-linked lists according to their generation, managed via the static private vector s_genpeg. Element 0 of this vector represents the 'new' generation of nodes that have not yet been exposed to the garbage collector; nodes that survive garbage collection are moved into successively higher generations.
Templated class GCEdge<T>, where T (defaulting to RObject*) is a pointer to a class type derived from GCNode, represents a directed edge within the directed graph whose nodes are the GCNodes. Whenever an object of a type derived from GCNode wishes to refer to another such object, it should do so by incorporating a GCEdge encapsulating an appropriate pointer, rather than by incorporating the pointer directly. The class provides for GCEdge<T> to be implicitly converted to T in contexts which require this.
GCEdge contains the logic for ensuring that a node in a higher generation never includes a reference to an object in a younger generation. If any attempt is made to direct a GCEdge from an older node to a younger node, that younger node is immediately promoted to the generation of the older node, and this change is propagated through the outgoing GCEdges of the younger node, and so on recursively. (In other words, it implements the EXPEL_OLD_TO_NEW logic that can be configured into CR (but is not the default for CR).)
Templated class GCRoot<T>, where T (defaulting to RObject*) is a pointer to a class type derived from GCNode, is intended to protect GCNodes from the garbage-collector. A GCNode pointed to by a GCRoot will not be garbage collected for as long as the GCRoot object exists. The constructor and destructor of this class therefore perform similar functions to the PROTECT/UNPROTECT macros of CR, but within a C++ idiom, in which the programmer is spared the need to check that PROTECTs are balanced by UNPROTECTs. (However, PROTECT and UNPROTECT continue, and will continue, to be available within CXXR.) The class provides for GCRoot<T> to be implicitly converted to T in contexts which require this.
The implementation of GCRoot uses an internal stack, and consequently requires (and checks) that GCRoots are destroyed in the reverse order of their creation. This should cause no problem as long as only variables with automatic or static storage duration are declared as GCRoots.

Despite successful experiments, the deployment of this class has been deferred, pending the replacement of setjmp/longjmp within CXXR by C++ exceptions. This is because destructors of C++ automatic variables are not called when the stack is unwound by longjmp (see ISO14882:2003 sec. 18.7); they are when the stack is unwound by a C++ exception.
Class WeakRef implements weak references (SEXPTYPE WEAKREFSXP) in a way intended to be functionally identical to CR. Each weak reference has a key and, optionally, a value and/or a finalizer. The finalizer may either be a C/C++ function or an R object.
The garbage collector will consider the value and finalizer to be reachable provided the key is reachable. If, during a garbage collection, the key is found not to be reachable then the finalizer (if any) will be run, and the weak reference object will be 'tombstoned', so that subsequent calls to key() and value() will return null pointers. A weak reference object with a reachable key will not be garbage collected even if the weak reference object is not itself reachable.

Note that, in CXXR, weak references are not implemented as four-element vectors, and the class has separate, appropriately typed fields for R and C/C++ finalizers (though at most one of these fields may be used in any particular WeakRef object).
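The stack discipline of GCRoot described above can be sketched as follows. This is a simplified illustration and not the CXXR class: GCNode is reduced to an empty stand-in, and rootStack and isProtected are hypothetical helpers standing in for whatever the garbage collector would consult.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for the base class of collectable objects.
struct GCNode {
    virtual ~GCNode() {}
};

// Hypothetical internal stack of protected nodes.
inline std::vector<const GCNode*>& rootStack()
{
    static std::vector<const GCNode*> s_roots;
    return s_roots;
}

// Hypothetical query a collector might use when marking.
inline bool isProtected(const GCNode* node)
{
    const std::vector<const GCNode*>& roots = rootStack();
    for (std::size_t i = 0; i < roots.size(); ++i)
        if (roots[i] == node) return true;
    return false;
}

template <class T = GCNode*>
class GCRoot {
public:
    // The constructor protects the node by pushing it on the stack.
    explicit GCRoot(T node) : m_node(node), m_index(rootStack().size())
    {
        rootStack().push_back(node);
    }

    // The destructor unprotects it, checking that roots are destroyed
    // in the reverse order of their creation.
    ~GCRoot()
    {
        assert(m_index + 1 == rootStack().size());
        rootStack().pop_back();
    }

    // Implicit conversion back to the encapsulated pointer.
    operator T() const { return m_node; }

    GCRoot(const GCRoot&) = delete;
    GCRoot& operator=(const GCRoot&) = delete;

private:
    T m_node;
    std::size_t m_index;
};
```

Automatic variables are destroyed in the reverse order of their construction, so roots declared as local variables satisfy the ordering check without any effort on the programmer's part.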

Phase 8: `0.08-2.5.1`

All uses of setjmp and longjmp (and sigsetjmp and siglongjmp) within directory main have been removed, and replaced by using JMPException, a C++ exception class designed as far as possible to be a drop-in replacement for setjmp/longjmp. This is to ensure that the destructors of C++ objects are invoked as the stack is unwound following an exceptional condition.
Use of JMPException should be regarded as an interim measure. Normal C++ coding practice is for throw simply to report the exceptional condition that has arisen, rather than - as with JMPException - in effect requesting a specific subsequent flow of control.
The preferred way for C++ code to protect GCNodes from the garbage collector is now to use the templated class GCRoot. GCRoot's constructor will protect the GCNode in question, and its destructor will unprotect it; there is therefore no need for the programmer to remember to balance out the use of PROTECT and UNPROTECT as in CR.
The facilities of CR's pointer protection stack (using e.g. PROTECT and UNPROTECT) remain available, but the underlying implementation has been rewritten in C++ as part of the GCRootBase class. CXXR makes the additional requirement that when UNPROTECT or REPROTECT are applied to a pointer, this is carried out in the same context (RCNTXT) as that in which the pointer was PROTECTed. This is to help pick up mispairing between PROTECT and UNPROTECT.
Various CR header files, particularly Rinternals.h and Defn.h, contain macro definitions of the form
```
#define func Rf_func
```
These serve to avoid name clashes (at least at the linker level) with third-party packages; a similar purpose would be achieved in C++ by placing the function func in a namespace Rf. (In Phase 2 these macros were generally shifted into a separate header file Rf_namespace.h, but this change has now been reversed.) Using the preprocessor to modify program tokens in this way is something that many C++ programs will shun, especially since some of the tokens concerned (e.g. length) are likely to be widely used. However abolishing these macros altogether would break much existing code. Nevertheless, reliance on them is now deprecated within CXXR, and in particular all header files within src/include have been modified as necessary to include the Rf_ prefix explicitly where it is needed.
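The reason for replacing longjmp by an exception, as described at the start of this phase, can be illustrated by the sketch below. The class JumpSketch is a hypothetical stand-in and is not the real JMPException; Context is a hypothetical stand-in for RCNTXT, and the other names are invented for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for an evaluation context (RCNTXT in CR).
struct Context {
    int id;
};

// Hypothetical exception carrying the intended destination of the jump
// and the value that longjmp would have delivered.
class JumpSketch {
public:
    JumpSketch(Context* target, int value)
        : m_target(target), m_value(value) {}
    Context* target() const { return m_target; }
    int value() const { return m_value; }

private:
    Context* m_target;
    int m_value;
};

// Counts destructor calls so that the unwinding can be observed.
int g_cleanups = 0;

struct Guard {
    ~Guard() { ++g_cleanups; }
};

// Plays the role of code that would formerly have called longjmp.
void inner(Context* target)
{
    Guard g;                        // destroyed as the stack unwinds
    throw JumpSketch(target, 42);
}

// Plays the role of code that would formerly have called setjmp.
int runWithContext()
{
    Context here = {1};
    try {
        Guard g;                    // also destroyed during unwinding
        inner(&here);
    } catch (const JumpSketch& e) {
        if (e.target() != &here)
            throw;                  // not addressed to us: keep unwinding
        return e.value();
    }
    return 0;
}
```

Both Guard destructors run before the handler is entered, which is exactly what a longjmp over the same frames would not guarantee.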

Phase 9: `0.09-2.6.1`

The primary objective of this phase was to update the program to parallel release 2.6.1 of R.

Other changes were as follows:

In previous work, the tendency has been progressively to move function prototypes from CR's header files into the relevant class-oriented (or at least data-type-oriented) header files within include/CXXR, and at the same time to add doxygen documentation. This has now been modified into a policy of copying the prototypes into the relevant CXXR header file, and adding documentation there, but leaving the prototype also in the CR header file. This will make it easier to track changes in function signatures when we upgrade to future releases of R. To this end a script allincludes.pl has been produced. This generates an (otherwise trivial) C++ source file that #includes all the header files under src/main and src/include; compiling this file checks that the prototypes in the CXXR header files are consistent with those in the CR headers.
In the light of this change, the policy regarding the Rf_ prefix described under Phase 8 has been modified. Whilst all header files in the CXXR directory should use the Rf_ prefix explicitly, header files derived from CR (e.g. Rinternals.h and Defn.h) should normally omit the prefix if the corresponding CR file does so.
All macros with arguments have been removed from the header files in the CXXR directory.
All C-style casts have been removed from the C++ code. (Unfortunately, under some Linuxen at least, standard signal dispositions such as SIG_DFL are defined as macros in terms of C-style casts, so main.cpp still gives warnings if compiled using gcc with -Wold-style-cast.)

Phase 10: `0.10-2.6.1`

The primary objective of this phase was to reimplement all vector data types as C++ classes derived (directly or indirectly) from RObject, rather than using vecsxp_struct within the RObject::u union. vecsxp_struct has not yet been eliminated entirely, however, because of some straggling uses of truelength.

Other changes were as follows:

The memory blocks allocated by R_alloc and kindred functions are no longer implemented as objects inheriting from RObject. Instead these blocks are managed separately via a new class RAllocStack. When the stack size is reduced using vmaxset, the memory blocks are released immediately, rather than being left to the garbage collector.
The levels of valgrind instrumentation have been modified somewhat, as explained in the porting page.
A concept of 'infant immunity' was introduced into garbage collection: see the GCNode documentation. Roughly speaking, this means that an object of a class derived from GCNode is immune from garbage collection while it is being constructed, leading to considerable simplification.
The templated class GCEdge was abolished: it was felt that the advantage of encapsulating the write barrier within a single class was outweighed by various knock-on obscurities.
Functions HASHASH, SET_HASHASH and SET_HASHVALUE have been abolished: the new class CXXR::String will compute and cache hash values automatically on demand.
CXXR generally prepared for its first public release, particularly by improving documentation.
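
The RAllocStack behaviour described in the list above can be sketched roughly as follows. This is only an illustration of the idea (a stack of blocks that are freed as soon as the stack is shrunk), not the CXXR implementation: the class name `AllocStackSketch` and all of its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for RAllocStack; names and layout are assumptions.
class AllocStackSketch {
public:
    // Allocate a block and push it on the stack (the role played by R_alloc).
    void* allocate(std::size_t bytes) {
        m_blocks.push_back(std::make_unique<char[]>(bytes));
        return m_blocks.back().get();
    }

    // Current stack size (the role played by vmaxget).
    std::size_t size() const { return m_blocks.size(); }

    // Shrink the stack to an earlier size (the role played by vmaxset).
    // Blocks above the new size are freed immediately, not left for
    // a garbage collector.
    void restoreSize(std::size_t new_size) {
        assert(new_size <= m_blocks.size());
        m_blocks.resize(new_size);
    }

private:
    std::vector<std::unique_ptr<char[]>> m_blocks;
};
```

A caller would record `size()` before a series of allocations and pass that value to `restoreSize()` afterwards, much as CR code brackets calls to R_alloc with vmaxget and vmaxset.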

Phase 11: `0.11-2.6.2`

The primary objective of this phase was to update the program to parallel release 2.6.2 of R. Errors and warnings given by make check-devel were also corrected.

Phase 12: `0.12-2.6.2`

The primary objective of this phase was to eliminate the RObject::u union completely, replacing its remaining elements with classes derived from RObject. This entailed the creation of the following classes: BuiltInFunction, ByteCode, Closure, DottedArgs, Environment, Expression, ExternalPointer, PairList, Promise, SpecialSymbol and Symbol. Several loose ends remain to be tied up, however; in particular, the remaining data members of RObject ought all to be private.

Other changes were as follows:

Class CXXR::Heap has been renamed CXXR::MemoryBank to avoid confusion with standard data structures called heaps.
Classes GCNode and GCRootBase are now initialized using a Schwarz counter, thus enabling certain standard objects (e.g. the 'not available' string, and the global environment) to be declared as static class members: it is no longer necessary to wait until InitMemory() has been called before creating them. This in turn simplifies the implementation of the garbage collection algorithm, which no longer has to treat these objects specially. Concomitant with this change, the R interpreter now terminates by throwing an exception of class ExitException, which ensures that all GCRoot objects are destroyed in the reverse order of their creation.
String objects now belong to one of two subclasses, CachedString and UncachedString, with the former being the preferred implementation. At any time, at most one CachedString with given text and encoding will exist; to enforce this, the class constructor is private, and instead clients use the static method obtain() (accessible from C via the function mkChar()) to get a pointer to a CachedString object with specified text and encoding. The implementation of the cache is different from that used in CR, and is based on the C++ standard library; it has the advantage that cached strings do not need any special handling by the garbage collector. There are no facilities for modifying the text or encoding of a CachedString once it has been created; in particular the function CHAR_RW() can be used only on UncachedString objects.
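
The 'private constructor plus static obtain()' arrangement described for CachedString can be sketched as below. This is a simplified illustration under stated assumptions: the class name `InternedString`, the use of `std::map`, and the integer encoding are all hypothetical, and unlike CXXR this sketch never removes entries from the cache.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sketch of string interning; not the CXXR CachedString class.
class InternedString {
public:
    // The only way to get an instance: returns the unique object for this
    // (text, encoding) pair, creating it on first request.
    static const InternedString* obtain(const std::string& text,
                                        int encoding = 0) {
        std::unique_ptr<InternedString>& slot =
            cache()[std::make_pair(text, encoding)];
        if (!slot)
            slot.reset(new InternedString(text, encoding));
        return slot.get();
    }

    const std::string& text() const { return m_text; }
    int encoding() const { return m_encoding; }

private:
    // Private constructor: clients cannot create duplicates directly.
    InternedString(const std::string& text, int encoding)
        : m_text(text), m_encoding(encoding) {}

    using Key = std::pair<std::string, int>;
    static std::map<Key, std::unique_ptr<InternedString>>& cache() {
        static std::map<Key, std::unique_ptr<InternedString>> the_cache;
        return the_cache;
    }

    const std::string m_text;  // no mutators: fixed once created
    const int m_encoding;
};
```

Because two requests with the same text and encoding yield the same pointer, equality of such strings can be tested by comparing addresses.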

Phase 13: `0.13-2.6.2`

This phase was an attempt - less successful than was hoped! - to close the gap in speed between CR and CXXR. Principal changes were:

Small blocks of memory are now allocated from preallocated pools controlled by a new class CellHeap. CellHeap differs from CellPool (used previously for this purpose) in that whenever a memory block is requested from a CellHeap, the allocated block will always be the one with the lowest address among the available blocks. This is achieved using a skew heap data structure, and is intended to increase the spatial localisation of successively allocated blocks. Where the underlying OS provides posix_memalign(), the superblocks from which memory blocks are allocated are aligned with memory pages.
MemoryBank now uses CellHeaps with more closely spaced block sizes than were used previously, to avoid wasting space in cache lines.
When a GCNode object has its generation changed as a result of write barrier enforcement or by being exposed to the garbage collector, it is no longer immediately shifted to the list appropriate to its new generation. Instead this is deferred until the sweep phase of a garbage collection visits the node. This avoids pulling nodes into the processor cache unnecessarily, and paves the way for the following change.
The lists via which GCNode manages garbage collection are now singly-linked rather than doubly-linked. This and other changes mean that the size of a PairList node (cons cell) has been reduced (on a 32-bit architecture) from 40 bytes to 32 bytes. DumbVector nodes have been reduced in size by 12 bytes.
The garbage collection algorithm now endeavours as far as possible to deallocate nodes in the reverse order of allocation. This is because class CellHeap works particularly efficiently if memory blocks are released in decreasing address order.
The protocol by which newly created GCNode objects are exposed to the garbage collector has been simplified and streamlined to avoid pulling nodes into the cache unnecessarily. First, GCNode::expose() exposes only the node for which it is invoked; it does not look for unexposed descendants of this node. Secondly, protecting a node from the garbage collector (e.g. using GCRoot<T> or PROTECT()) no longer automatically exposes the node. (However, write barrier enforcement will continue to expose nodes if an exposed node is modified to refer to an unexposed node, and this exposure will propagate to descendants: this falls out automatically from the write barrier enforcement algorithm.)
A bug whereby certain nodes were never exposed to GC has been corrected.
GCNode::operator new no longer zeroes the memory it allocates.
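
The 'always hand out the lowest-addressed free block' policy of CellHeap can be illustrated with the sketch below. Note the substitution: it uses `std::priority_queue` (a binary heap) from the standard library in place of the skew heap that CXXR uses, and it manages a single fixed superblock. The class name `LowestAddressPool` and its interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch: one superblock carved into equal-sized cells.
class LowestAddressPool {
public:
    LowestAddressPool(std::size_t cell_size, std::size_t cell_count)
        : m_storage(cell_size * cell_count) {
        for (std::size_t i = 0; i < cell_count; ++i)
            m_free.push(m_storage.data() + i * cell_size);
    }
    LowestAddressPool(const LowestAddressPool&) = delete;
    LowestAddressPool& operator=(const LowestAddressPool&) = delete;

    // Returns the free cell with the lowest address, or nullptr if the
    // pool is exhausted.
    void* allocate() {
        if (m_free.empty()) return nullptr;
        char* cell = m_free.top();
        m_free.pop();
        return cell;
    }

    // Returns a cell to the pool.
    void deallocate(void* p) { m_free.push(static_cast<char*>(p)); }

private:
    std::vector<char> m_storage;
    // std::greater makes this a min-heap ordered by address.
    std::priority_queue<char*, std::vector<char*>, std::greater<char*>> m_free;
};
```

Whatever order cells are released in, the next allocation reuses the lowest free address, which is the property intended to keep successively allocated blocks close together in memory.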

Phase 14: `0.14-2.7.1`

The objective of this phase was to update CXXR to parallel release 2.7.1 of R. Other changes were as follows:

We have eliminated several uses of dynamic_cast from the 'glue layer' between code inherited from CR and new CXXR code. (dynamic_cast can be surprisingly slow.)
SET_TYPEOF() has been abolished.
R_NilValue is now defined as a macro expanding to NULL (which will in turn typically expand to (void*)0 in C and simply to 0 in C++). Previously it was defined as
```
SEXP R_NilValue = 0;
```
which necessitated unnecessary memory fetches.

Phase 15: `0.15-2.7.1`

The objective of this phase was to tidy up the class hierarchy rooted at RObject, and in particular to give RObject itself a more distinctive class identity, i.e. for it to be less of a ragbag for things that hadn't yet been accommodated elsewhere. Principal changes were:

Class RObject now controls attributes more closely. The attributes (if present) must now be a PairList, each of whose elements must have a distinct symbol as its tag. No attribute may have a null value. The m_has_class field is automatically set according to whether or not there is a class attribute; consequently SET_OBJECT() has been abolished. However, the class interface does not yet enforce all necessary consistency conditions on attributes; these are still applied by the code in attrib.cpp.
The m_debug field of RObject has been abolished. Instead the Closure and Environment classes each contain a field controlling debugging.
The m_trace field of RObject has been moved to a new class FunctionBase, from which the Closure and BuiltInFunction classes are now derived.
The m_flags field of RObject, which replaced the gp ('general purpose') field within sxpinfo_struct, has been abolished. It has been replaced by various special-purpose fields, placed as far down the class hierarchy as is practical at present. A virtual function packGPBits() is used to reconstitute the old gp ('levels') word for the sole purpose of serialization; virtual function unpackGPBits() is correspondingly used during deserialization. (However, not all of the fields that have replaced m_flags need to be serialized/deserialized.)
A new class HandlerEntry, defined locally within errors.cpp, is used to handle error handler entries, rather than using a ListVector for this purpose. This avoids the former use of the m_flags field here.
Code inherited from CR is apt to hand out non-const pointers to objects that really ought to be immutable, R_UnboundValue for example. To counter this, RObject now has a Boolean field m_frozen: non-const member functions in the RObject hierarchy can now apply a run-time check that their object has not been frozen. In particular, attempting to change the attributes of a frozen object gives rise to an error.
String is now an abstract class. CachedString objects are now frozen by the constructor. R_NaString is also frozen.
Class SpecialSymbol has now been merged into Symbol. Entities such as R_UnboundValue, which were formerly implemented as SpecialSymbol objects, are now implemented as frozen Symbols.
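
The run-time 'frozen' check mentioned above might look something like the following sketch. The class `FreezableObject`, its `setLabel()` mutator and the choice of `std::logic_error` are hypothetical; the source says only that non-const member functions can check a Boolean field and raise an error.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical sketch; class and member names are illustrative only.
class FreezableObject {
public:
    void freeze() { m_frozen = true; }
    bool isFrozen() const { return m_frozen; }

    // A non-const member function checks at run time before mutating.
    void setLabel(const std::string& label) {
        errorIfFrozen();
        m_label = label;
    }

    const std::string& label() const { return m_label; }

private:
    void errorIfFrozen() const {
        if (m_frozen)
            throw std::logic_error("attempt to modify a frozen object");
    }

    bool m_frozen = false;
    std::string m_label;
};
```

The check costs one test per mutating call, but it catches modifications made through non-const pointers that the compiler's const checking cannot see.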

Phase 16: `0.16-2.7.2`

The objective of this phase was to update CXXR to parallel release 2.7.2 of R.

An innovation in carrying out this phase was the introduction of a Perl script uncxxr.pl. Where a source file inherited from CR - foo.c, say - has been adapted for CXXR (and changed into a C++ file foo.cpp in the process), this script endeavours as far as possible to reverse systematic changes (e.g. the conversion of C-style casts into C++ casts) to generate a quasi-C file foo.bakc. (We say 'quasi-C' file because the resulting file may not be syntactically correct C: it is intended for human eyes only.) Updating to a new release of R is facilitated by using a 3-way visual diff between the release of foo.c currently shadowed by CXXR, the new release of foo.c, and foo.bakc. This helps to highlight where the significant changes are in the new release of foo.c, and where they might conflict with changes made in CXXR. (A similar 3-way comparison using foo.cpp instead of foo.bakc throws up too much 'noise'.)
Some changes have been made to the CXXR files, particularly in the use of whitespace, to improve the effectiveness of uncxxr.pl. However, this has so far only been done for C++ source files that needed to be changed in any case as part of the upgrade to 2.7.2.

Phase 17: `0.17-2.7.2`

The primary purpose of this phase was to reimplement the functionality of duplicate1() in duplicate.cpp using class copy constructors and a virtual function RObject::clone(), reimplemented as necessary in derived classes. The following changes were associated with this:

GCNode::expose() is once again recursive in effect, thus reversing a change made in Phase 13. Cloning a node often requires cloning an entire subgraph of the node graph, via recursive calls of clone() to copy subobjects. The approach taken is that while the copy subgraph is under construction, none of its constituent nodes is exposed to the garbage collector: in particular clone() itself does not expose the objects it creates to the collector. Only when the copy subgraph is complete is the whole subgraph exposed, and to do this the code that called the 'topmost' clone() must then apply the newly-recursive expose() function to the pointer that clone() returned. (Trying to expose nodes individually as the construction proceeded meant that they were at risk of being snatched away by the garbage collector before the subgraph was complete: it is difficult to work around this in a way that sits easily with C++ programming idioms.)
GCNode::devolveAge(), used in enforcing the write barrier, has been renamed propagateAge(), and this function remains recursive in effect. However, at the time of call, propagateAge(const GCNode* node) changes the generation number only of node (if necessary); the recursive propagation of this change is deferred until the start of the next garbage collection. (Unfortunately the same technique cannot be applied to expose() for a reason explained in its documentation.)
Not all classes derived from RObject are clonable, and for unclonable types, clone() returns a null pointer. When a copy constructor copies a pattern object containing a subobject of an unclonable type, the object constructed will at the appropriate point simply contain a pointer to the subobject of the pattern object, rather than to a clone of that subobject. This copying logic is encapsulated in a templated 'smart pointer' type RObject::Handle<T>, and for example the 'car' pointer of a PairList object is now a Handle<RObject>. Similarly, the former templated class EdgeVector<T> has been replaced by HandleVector<T> which - as the name suggests - is implemented using a std::vector<CXXR::RObject::Handle<T> >.
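
The 'clone if possible, otherwise share' copying logic can be sketched as follows. This is a reduced illustration: `Node`, `Clonable`, `Unclonable` and this `Handle<T>` are hypothetical stand-ins, and object ownership (handled in CXXR by the garbage collector) is ignored here.

```cpp
#include <cassert>

// Hypothetical base class standing in for RObject.
struct Node {
    virtual ~Node() = default;
    // Default: the type is not clonable, signalled by a null return.
    virtual Node* clone() const { return nullptr; }
};

// Copying a Handle clones the target where possible, else shares it.
template <typename T>
class Handle {
public:
    explicit Handle(T* target = nullptr) : m_target(target) {}

    Handle(const Handle& pattern) : m_target(pattern.m_target) {
        if (m_target) {
            // clone() returns null for unclonable types, in which case
            // the copy keeps pointing at the pattern's own subobject.
            if (T* copy = static_cast<T*>(m_target->clone()))
                m_target = copy;
        }
    }

    T* get() const { return m_target; }

private:
    T* m_target;
};

struct Clonable : Node {
    int value = 0;
    Node* clone() const override { return new Clonable(*this); }
};

struct Unclonable : Node {};
```

A class holding `Handle` members then gets the described behaviour from its ordinary memberwise copy constructor, without any per-class logic for deciding what to clone.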

Phase 18: `0.18-2.8.1`

The objective of this phase was to update CXXR to parallel release 2.8.1 of R.

The uncxxr.pl script (see Phase 16) has been somewhat further developed, and a larger number of C++ files derived directly from CR have been tweaked so that uncxxr.pl can back-convert them more accurately to their CR form.
Within C++ files derived directly from CR, reinterpret_cast has been replaced by static_cast wherever this is possible without artifice. This has been facilitated by the introduction of a function CXXR_alloc, which does the same job as R_alloc, but - like malloc but unlike R_alloc - returns void* rather than char*. (uncxxr.pl converts CXXR_alloc back to R_alloc.)

Phase 19: `0.19-2.8.1`

The primary purpose of this phase was to refactor environments, to pave the way for introducing provenance-tracking features into R. The following changes were associated with this:

The C++ Symbol class now enforces the requirement that (except for certain special Symbols), there is at most one Symbol with a given name. (CR enforces a similar requirement, but less comprehensively, using the install() function.) To facilitate this, it is now a requirement that a Symbol's name be a CachedString object, rather than any String object.
In CR (and formerly in CXXR), SYMSXP objects contained a pointer to an arbitrary object, which was considered to be the Symbol's value within R's base environment and base namespace. Objects of the C++ Symbol class no longer contain such a pointer, and the base environment and base namespace are implemented in exactly the same way as other Environment objects.
Similarly, in CR (and formerly in CXXR), SYMSXP objects contained a pointer to an R object of a function type, which was used when the Symbol was used as the name of a function invoked via R's .Internal() interface. Objects of the C++ Symbol class no longer contain such a pointer; instead the relevant mapping is defined by the C++ class DotInternalTable.
The 'global cache' of Environments on the search path has been abolished, at least for the time being.
A new C++ class Frame has been introduced, inheriting from GCNode but not from RObject. A Frame defines a mapping from Symbol objects to arbitrary RObjects.
Each Environment object now contains a pointer to a Frame object, which defines its 'local frame'. The base environment and the base namespace have the same Frame.
Frame itself is an abstract class, allowing different implementations along the lines provided by the RObjectTables package to be achieved simply by class inheritance. In most cases, however, the concrete class StdFrame is used, in which the mapping from Symbols to RObjects is provided by a hash table, implemented using class unordered_map from the TR1 extensions to the C++ standard library. This implementational detail is not made visible to R code.
The interface to MemoryBank::allocate() has been changed to allow the caller to specify that the call shall not result in a garbage collection. Class CXXR::Allocator uses this to ensure that manipulations of standard containers using CXXR::Allocator do not result in reentrant calls to the standard library code, which might otherwise happen if the garbage collector attempted to delete objects handled by the container.
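
The relationship between environments and frames described above can be sketched as below. All names here (`FrameSketch`, `HashFrame`, `EnvSketch`, `Object`) are hypothetical, symbols are represented by plain strings for brevity, and the sketch uses C++11 `std::unordered_map` rather than the TR1 version mentioned in the text.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct Object {};                // hypothetical stand-in for RObject
using SymbolName = std::string;  // stand-in for a pointer to a Symbol

// Abstract mapping from symbols to objects.
class FrameSketch {
public:
    virtual ~FrameSketch() = default;
    virtual Object* lookup(const SymbolName& symbol) const = 0;
    virtual void bind(const SymbolName& symbol, Object* value) = 0;
};

// Concrete frame backed by a hash table, in the spirit of StdFrame.
class HashFrame : public FrameSketch {
public:
    Object* lookup(const SymbolName& symbol) const override {
        auto it = m_map.find(symbol);
        return it == m_map.end() ? nullptr : it->second;
    }
    void bind(const SymbolName& symbol, Object* value) override {
        m_map[symbol] = value;
    }

private:
    std::unordered_map<SymbolName, Object*> m_map;
};

// An environment points to its local frame; two environments may
// share one frame, as the base environment and base namespace do.
struct EnvSketch {
    FrameSketch* frame;
    EnvSketch* enclosing;
};
```

Because an environment sees its frame only through the abstract interface, an alternative storage scheme can be substituted by deriving another class from the frame base class.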

Phase 20: `0.20-2.8.1`

The purpose of this phase was extensively to reengineer garbage collection. This was to pave the way to experimentation with reference-counting approaches to garbage collection; however, release 0.20-2.8.1 itself still uses generational mark-sweep. A major change has been in the way of implementing 'infant immunity', whereby nodes that are under construction are not liable to garbage collection; the following is a summary of the way in which this has evolved. The phrase 'infant nodes' means nodes that are either under construction, or whose construction is complete but which have not yet been exposed to garbage collection by calling GCNode::expose().

In previous releases, infant nodes were simply ignored during the sweep phase of a mark-sweep collection, and so left in place. This had the disadvantage that the infant immunity did not automatically extend to subobjects of an infant node. In the PairList copy constructor, for example, the copied list was created working forwards along the pattern list, but the whole structure of the copied list then needed to be traversed again to expose its nodes to garbage collection. (This was achieved by having GCNode::expose() automatically recurse to subobjects.)
An alternative approach explored in the development of release 0.20-2.8.1 was to regard infant nodes as reachable during mark-sweep. So, during a mark-sweep garbage collection, all the infant nodes and their descendants would automatically be marked. So the PairList copy constructor can expose the second and subsequent nodes of the copied list immediately it has created them, leaving only the head of the list unexposed, and thus conferring immunity from garbage collection on the whole structure. There is no longer any need for expose() to recurse to subobjects. The snag with this approach was that during the mark phase, the Marker visitor could invoke the visitReferents() method of objects whose construction is not yet complete, and which may therefore contain junk pointers. Obviously, if a visitor was directed to a junk address, that would probably crash the interpreter. The workaround for this was to have GCNode::operator new zero out the memory it allocated for new GCNode objects, so that instead of junk pointers, an object under construction would contain null pointers, which visitReferents() could readily detect. However, this zeroing of memory was time consuming (and wouldn't immediately be portable to some strange hardware architectures in which null pointers are not represented by binary zero).
The approach finally adopted is simply for class GCNode to keep a count of the number of infant nodes, and not to initiate a mark-sweep garbage collection while any infant nodes exist. This has the advantages of the second approach, but without the disadvantage: visitReferents() will never be called for a node whose construction is incomplete, and there is consequently no need for zeroing memory. It also simplifies the handling of the case where an exception is thrown within the constructor of an object derived from GCNode.
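
The approach finally adopted (counting infant nodes and deferring collection while the count is non-zero) can be sketched as follows. This is a minimal illustration with hypothetical names; `maybeCollect()` merely reports whether a collection would be allowed to start.

```cpp
#include <cassert>

// Hypothetical sketch of infant counting; not the CXXR GCNode class.
class NodeSketch {
public:
    NodeSketch() { ++s_infants; }  // every node starts life as an infant
    NodeSketch(const NodeSketch&) = delete;
    NodeSketch& operator=(const NodeSketch&) = delete;

    // If a derived-class constructor throws, this destructor still runs
    // for the base subobject, so the count stays correct.
    virtual ~NodeSketch() {
        if (!m_exposed) --s_infants;
    }

    // Declare the node fully constructed and visible to the collector.
    void expose() {
        if (!m_exposed) {
            m_exposed = true;
            --s_infants;
        }
    }

    // Called at a point where a mark-sweep collection might be started.
    // Returns false if collection is deferred because infants exist.
    static bool maybeCollect() { return s_infants == 0; }

    static unsigned infantCount() { return s_infants; }

private:
    bool m_exposed = false;
    static unsigned s_infants;
};

unsigned NodeSketch::s_infants = 0;
```

Since no collection starts while any infant exists, the marker never visits a half-built object, which is why zeroing newly allocated memory becomes unnecessary.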

Other changes are as follows:

Templated class GCEdge<T> (which was abolished at Phase 10) has been reinstated, and encapsulates the write barrier. RObject::Handle<T> now inherits from GCEdge<T>.
The templated class GCRoot<T> has been renamed GCStackRoot<T>, and its implementation simplified. These objects remain subject to the restriction that they must be destroyed in the reverse order of their creation, and are therefore best suited to declaration as automatic variables (i.e. variables on the processor stack). A new templated class GCRoot<T> has been introduced: this does a similar job to GCStackRoot (i.e. it is a smart pointer providing protection from garbage collection), but is not subject to creation/destruction order restrictions. However, construction and destruction of GCRoots is more time consuming than for GCStackRoots, so the latter should be preferred where possible. CR's 'precious list' has been reimplemented as part of the base class of GCRoot. The ExitException class has been abolished, since the new GCRoots make it unnecessary.
Class MemoryBank no longer contains any logic related to garbage collection, and in particular there are no callbacks from MemoryBank into the garbage-collection code. The decision about whether to initiate a mark-sweep collection is now taken in GCNode::operator new.

Phase 21: `0.21-2.8.1`

This phase changes the approach used for garbage collection. Previous phases used a generational mark-sweep collector, like CR itself. As of Phase 21, the principal method of garbage collection is reference counting. The principal motivation for this is to make better use of the processor caches: with reference counting, the memory occupied by objects that become garbage is quickly recycled into productive use, very likely while this memory is still mapped in cache.

To implement reference counting, each GCNode object contains a one-byte reference count, which is automatically adjusted by the GCEdge<T>, GCRoot<T> and GCStackRoot<T> smart pointers, and by the traditional CR PROTECT/UNPROTECT mechanism. (If a node's reference count ever reaches 255, it sticks at that value, and that node can only be garbage-collected by the mark-sweep mechanism.) When a GCNode's reference count falls to zero, it is declared 'moribund'. When GCNode::operator new is called upon to allocate memory for a new GCNode object, it first looks through class GCNode's internal list of moribund nodes. Any nodes on the list which still have a reference count of zero are deleted; nodes whose reference count has risen back above zero - accounting for about one in four of the nodes on the moribund list - are returned to the 'live' list.
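
The one-byte count that 'sticks' at 255 can be sketched as below. The class name `RefCount` and its interface are hypothetical; in particular, how CXXR records that a node has become moribund is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a one-byte, saturating reference count.
class RefCount {
public:
    // Add a reference; once the count reaches 255 it stays there.
    void increment() {
        if (m_count != kStuck) ++m_count;
    }

    // Drop a reference. Returns true if the count has just fallen to
    // zero, i.e. the owning node should be treated as 'moribund'.
    bool decrement() {
        if (m_count == kStuck) return false;  // only mark-sweep can free it
        assert(m_count != 0);
        return --m_count == 0;
    }

    unsigned value() const { return m_count; }

private:
    static constexpr std::uint8_t kStuck = 255;
    std::uint8_t m_count = 0;
};
```

Saturation trades exactness for space: a heavily shared node may never be reclaimed by counting alone, but it can never be freed prematurely because the count wrapped round.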

To cope with cycles in the node graph (i.e. the directed graph whose nodes are GCNodes and whose edges are GCEdges), this reference counting scheme is backed up by a simple (i.e. non-generational) mark-sweep scheme. However, this runs much more rarely than CR's garbage collections, and uses a simpler logic to manipulate the threshold at which mark-sweep collection takes place. Not having node generations means that there is no longer a need to implement the 'write barrier'; this in turn means that the GCEdge<T> templated class can have a C++ assignment operator defined, which enables it to be more freely used in connection with the container types in the C++ standard library.

Weak reference (WeakRef) objects need special handling during garbage collection, and consequently each WeakRef object now includes a pointer to itself, to stop it being deleted by the reference counting mechanism.

Phase 22: `0.22-2.9.1`

The purpose of this phase was to update CXXR to parallel release 2.9.1 of CR. (Unfortunately, it was overtaken by release 2.9.2 of CR.)

uncxxr.h now defines a macro CXXRconvert(type, expr), which expands to type(expr), but which uncxxr.pl replaces simply by expr. This macro is now widely used in code inherited from CR in cases where C++ requires an explicit type conversion but C does not.
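
As an illustration of where such a macro helps, the sketch below defines the macro as described and applies it to a bool-to-enum conversion, which C performs implicitly but C++ does not. The `Rboolean` definition here is a local stand-in for the one in the R headers, and `isPositive()` is a hypothetical function.

```cpp
#include <cassert>

// As described in the text: expands to a function-style conversion.
#define CXXRconvert(type, expr) type(expr)

// Local stand-in for R's enumeration of the same name.
enum Rboolean { FALSE = 0, TRUE = 1 };

inline Rboolean isPositive(int x) {
    // In C, 'return x > 0;' would compile as it stands; C++ requires an
    // explicit conversion to the enumeration type.
    return CXXRconvert(Rboolean, x > 0);
}
```

Because the macro wraps the whole conversion, a script can strip it mechanically to recover the expression as it would appear in the C original.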

Phase 23: `0.23-2.9.2`

The purpose of this phase was to update CXXR to parallel release 2.9.2 of CR. This proved straightforward.

Phase 24: `0.24-2.9.2`

This phase represented the first stage of refactoring the interpreter's evaluation logic into C++, and included the following principal changes:

A class CXXR::Evaluator has been introduced to carry out general services and housekeeping in support of evaluation. Rf_eval() is now simply a wrapper round Evaluator::evaluate().
Class RObject now defines a virtual function evaluate(), which Evaluator::evaluate() uses to evaluate a particular object. By default this simply returns a pointer to the RObject for which it was invoked, but this behaviour is overridden in various classes (e.g. Expression, Symbol and Promise) to provide substantive functionality.
The abstract class FunctionBase now defines an abstract virtual function apply(), which is invoked by Expression::evaluate() to apply a function to a specific set of actual arguments.
Class BuiltInFunction now has subclasses OrdinaryBuiltInFunction (corresponding to SEXPTYPE BUILTINSXP) and SpecialBuiltInFunction (SPECIALSXP). (It is possible that these classes will be abolished in the future, with their respective functionalities - which differ only slightly - being moved into BuiltInFunction.)
The functionality of BuiltInFunction::apply(), through to the invocation of the appropriate do_ function, is now fully handled within the CXXR core. do_internal() has also been absorbed into the CXXR core. For the time being, however, Closure::apply() is simply a wrapper round CR's Rf_applyClosure().
The function table, R_FunTab in CR, is now a private static data member of class BuiltInFunction. This class now uses a Schwarz counter, which automatically initialises the function table on program start-up.
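
The division of labour between a wrapper function and a virtual evaluate() can be sketched as follows. The types `Obj`, `Constant`, `Delayed` and `Env`, and the free function `evaluate()`, are hypothetical simplifications; real evaluation of expressions, symbols and promises is of course far more involved.

```cpp
#include <cassert>

struct Env;  // hypothetical placeholder for an environment type

struct Obj {
    virtual ~Obj() = default;
    // Default behaviour: an object evaluates to itself.
    virtual Obj* evaluate(Env*) { return this; }
};

// A self-evaluating object: inherits the default.
struct Constant : Obj {
    int value;
    explicit Constant(int v) : value(v) {}
};

// An object that overrides evaluate(), here by evaluating something
// else on its behalf.
struct Delayed : Obj {
    Obj* target;
    explicit Delayed(Obj* t) : target(t) {}
    Obj* evaluate(Env* env) override { return target->evaluate(env); }
};

// A thin wrapper in the spirit of Rf_eval(): housekeeping would go
// here, and the object itself decides what evaluation means.
inline Obj* evaluate(Obj* object, Env* env) {
    return object ? object->evaluate(env) : nullptr;
}
```

New object types can then supply their own evaluation rule by overriding one function, with no change to the wrapper.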

Phase 25: `0.25-2.9.2`

This phase continued with refactoring the interpreter's evaluation logic into C++, and comprised the following principal changes:

Closure::apply() has now been reimplemented within the CXXR core, making use of a new class ArgMatcher to carry out argument matching. For the time being the function Rf_applyClosure() remains in existence, but it is now used only in connection with method dispatch.
As presaged in the description of the preceding phase, classes OrdinaryBuiltInFunction and SpecialBuiltInFunction have been abolished, and their functionalities absorbed into BuiltInFunction.
A policy, described in the documentation of class RObject, has been defined and put into practice regarding the use of const T*, where T is RObject or a class inheriting from it. This policy aims to resolve as far as possible an inherent tension between the way CR is implemented and the 'const-correctness' that forms part of C++ programming style.
The code relating to weak reference (WeakRef) objects has been improved and tidied up in various ways. In particular, when the key object of a WeakRef is found to be unreachable, it is now guaranteed that the weak reference's finalizer (if any) will be run as part of the same mark-sweep garbage collection that collects the key.

Phase 26: `0.26-2.10.1`

The purpose of this phase was to update CXXR to parallel release 2.10.1 of CR.

Phase 27: `0.27-2.10.1`

This phase comprised the following principal changes:

SET_ENCLOS() has been superseded by new mechanisms for manipulating the enclosing relationships of Environments, which ensure that acyclicity is preserved.
A 'global cache' for Symbol bindings found along the search list has been introduced, similar to that used in CR.
R_isMissing() has been reimplemented as CXXR::isMissingArgument(); unlike the previous CXXR implementation, it no longer requires any memory allocations.
The GCNode class can now optionally include diagnostic code to identify cycles within the GCNode/GCEdge graph.

Phase 28: `0.28-2.10.1`

This phase was concerned with refactoring contexts (CR's RCNTXT), and involved teasing apart the numerous distinct functions that this struct plays in CR:

Maintaining an 'Ariadne's thread' recording information about the stack of R function calls currently in progress. This function is now encapsulated in the CXXR class Evaluator::Context.
Conveying information about possible longjmp targets from the destination to the point where longjmp is called. C's setjmp and longjmp are incompatible with C++ exception handling, and were removed from CXXR at Phase 8. At that stage, however, they were simply replaced by an exception class JMPException, which was designed simply to ape the behaviour previously achieved with longjmp. JMPException has now itself been abolished, and replaced with three exception classes LoopException (servicing R functions break and next), ReturnException (which services the R function return and various other indirect flows of control) and CommandTerminated (raised in response to unhandled errors or user interrupts). These new exception classes are used in a way consistent as far as possible with C++ programming idioms; in particular, the class Evaluator::Context plays no direct role in controlling their propagation, and the CR function findcontext() no longer exists.
Saving information about the state of evaluation prior to an R function call, and then restoring the state as the function exits (whether via the normal flow of control or via longjmp). For the time being, this save/restore functionality has been retained within the Evaluator::Context class, though in some cases the functionality is achieved by incorporating an object of some other class, such as ProtectStack::Scope or RAllocStack::Scope, within an Evaluator::Context object. In all cases this save/restore functionality is now achieved, following a standard C++ idiom, by the constructor of a stack-based object saving state, and then its destructor restoring it. This automatically copes both with the normal flow of control and with exceptions, so there is now no need for CR's R_restore_globals() function. In the future, it is likely that some of the save/restore functions now carried out by the Evaluator::Context class will be factored out into new classes with more specific responsibilities.
Saving information about R on.exit expressions. This function is now also encapsulated within the Evaluator::Context class. Any on.exit expressions attached to a Context object are evaluated automatically by the object's destructor. This automatically copes both with the normal flow of control and with exceptions, so there is now no need for CR's R_run_onexits() function.
Verifying that R functions effecting an indirect flow of control (e.g. break, next and return) are used only in circumstances where there is an appropriate destination. In CXXR this is now accomplished using the classes Environment::LoopScope and Environment::ReturnScope.
Determining whether execution is currently within an R browser, and if so what the browsing depth is. In CXXR this is now accomplished using the class Browser.
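
The constructor-saves, destructor-restores idiom referred to in the list above can be sketched as follows. The state being saved (`evalDepth()`), the class `DepthScope` and the function `callThatFails()` are hypothetical; they stand in for whatever interpreter state a real scope object would manage.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical piece of global interpreter state.
inline int& evalDepth() {
    static int depth = 0;
    return depth;
}

// Saves the state on construction and restores it on destruction.
class DepthScope {
public:
    DepthScope() : m_saved(evalDepth()) {}
    ~DepthScope() { evalDepth() = m_saved; }
    DepthScope(const DepthScope&) = delete;
    DepthScope& operator=(const DepthScope&) = delete;

private:
    int m_saved;
};

// Changes the state and then leaves by throwing an exception.
inline void callThatFails() {
    DepthScope scope;
    evalDepth() += 5;
    throw std::runtime_error("error during evaluation");
}
```

Stack unwinding runs the destructor whether the function returns normally or throws, so no separate restore routine has to be called on the error path.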

Other changes in this phase were:

The pending Promise stack has been abolished, the necessary functionality now being achieved with C++ try-catch logic.
Several CR global variables have been abolished: R_RestartToken, R_ReturnedValue and R_Toplevel. (CR's TOPLEVEL contexts have been replaced by Evaluator objects.)

Phase 29: `0.29-2.10.1`

The primary purpose of this release was to define the baseline for the results on add-on packages reported at useR! 2010. The changes are mainly bugfixes, but with the following more substantive changes:

The code now allows for the possibility that the destructor of a class in the RObject hierarchy may evaluate R expressions. This has entailed a change to the implementation of PairList::construct(), which was previously not reentrant; in the new implementation, this function never gives rise to garbage collection.
Methods of class RObject concerned with setting and examining attributes are all now either virtual or implemented via calls to virtual functions. This means that classes within the RObject hierarchy can apply their own consistency checks to attribute settings, and also override or augment the way in which attribute values are stored within the C++ object.

Phase 30: `0.30-2.11.1`

The primary purpose of this phase was to update CXXR to parallel release 2.11.1 of CR. This included the following corrections to significant preexisting bugs:

Each of the functions COMPLEX(), INTEGER(), LOGICAL(), RAW(), REAL(), R_CHAR(), STRING_ELT(), SET_STRING_ELT(), VECTOR_ELT(), SET_VECTOR_ELT(), XVECTOR_ELT() and SET_XVECTOR_ELT() now verifies not only that its vector argument is a pointer to an RObject of the correct type, but also that this argument is not a null pointer. SET_STRING_ELT() also now verifies that the pointer to the new String value is not null. These changes bring the behaviour of these functions back into line with CR. These non-null checks are applied even if CXXR is built with the preprocessor variable UNCHECKED_SEXP_DOWNCAST defined (which causes the type checks to be elided).
Changes have been made to ensure that do_browser() correctly saves and restores the restart handler stack, and to ensure that the browser can be invoked at top-level. (There is however still a problem that typing Q into the browser does not work as described in the manual page: it simply returns to the browser prompt.)

Phase 31: `0.31-2.11.1`

This phase included extensive changes:

The process, started in Phase 28, of unbundling the various functions of CR contexts continues. The Evaluator::Context class is now the root of a hierarchy of classes. A Context object of some kind is now created for every R function invocation (this no longer depends on whether profiling is in progress), but the intention is that these Context objects are lightweight, and contain only information relevant to the particular function invocation.
In CR, indirect flows of control such as arise from the R return and break functions are handled by C setjmp/longjmp. Since these are incompatible with the orderly stack unwinding that C++ requires, at Phase 8 CXXR everywhere replaced invocations of longjmp by throwing C++ exceptions. Unfortunately the propagation of C++ exceptions incurs a considerable overhead.
An R function such as return is now implemented so that it creates an object of a class inheriting from Bailout. The basic idea is that this object is then passed as a return value up the chain from called function to caller, until it reaches the intended destination of the indirect flow of control. However, this passing up the call chain happens only if the caller has indicated, by wrapping its call in a BailoutContext, that it is able to propagate the Bailout object correctly. If that is not the case, then the called function will invoke the throwException() method of the Bailout object, which - as the name suggests - will complete the indirect flow of control by throwing a C++ exception.

This change has greatly reduced the number of C++ exceptions that are thrown, with corresponding benefits for performance.
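The mechanism just described can be illustrated with a minimal sketch. Everything in it is a hypothetical simplification: `Result`, `ReturnException`, `doReturn()`, the two caller functions and the counter `g_exceptionsThrown` are invented for illustration, and the sketched `Bailout` and `BailoutContext` borrow only the names of the CXXR classes, not their actual interfaces.

```cpp
#include <cassert>
#include <stdexcept>

// Exception used on the slow path (hypothetical name).
struct ReturnException : std::runtime_error {
    int value;
    explicit ReturnException(int v) : std::runtime_error("return"), value(v) {}
};

// RAII marker: while one is alive, the caller has promised to inspect
// the result of its callee for a pending Bailout.
class BailoutContext {
public:
    BailoutContext() : m_previous(flag()) { flag() = true; }
    ~BailoutContext() { flag() = m_previous; }
    BailoutContext(const BailoutContext&) = delete;
    BailoutContext& operator=(const BailoutContext&) = delete;
    static bool active() { return flag(); }
private:
    static bool& flag() { static bool f = false; return f; }
    bool m_previous;
};

// Records a pending indirect flow of control (here: a value to return).
struct Bailout {
    int value;
    [[noreturn]] void throwException() const { throw ReturnException(value); }
};

// What a callee hands back: an ordinary value or a pending bailout.
struct Result { bool bailout; int value; };

unsigned g_exceptionsThrown = 0;  // counts uses of the slow path

// Stand-in for the R 'return' function.
Result doReturn(int v) {
    Bailout b{v};
    if (!BailoutContext::active()) {
        ++g_exceptionsThrown;
        b.throwException();      // caller made no promise: fall back to throwing
    }
    return Result{true, v};      // cheap path: pass the bailout up as a value
}

// A caller that can propagate a Bailout, so no exception is needed.
int callWithContext(int v) {
    BailoutContext ctx;
    Result r = doReturn(v);
    assert(r.bailout);
    return r.value;              // destination of the indirect flow of control
}

// A caller that has not wrapped its call in a BailoutContext.
int callWithoutContext(int v) {
    try {
        doReturn(v);
    } catch (const ReturnException& e) {
        return e.value;
    }
    return -1;  // reached only if some enclosing BailoutContext is active
}
```

In this sketch only `callWithoutContext()` ever pays for exception propagation, which is the saving the text describes.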
There has been continued refactoring of the central evaluation logic, mainly with a view to making it clearer. This includes particularly the dispatching of S3 methods. There has been some progress towards concentrating all manipulations of argument lists in a new class ArgList. Rf_applyClosure() and R_execClosure() have been abolished, their functionality now being incorporated into the Closure class. However, much remains to be done.
The approach to running CXXR under Valgrind (with the memcheck tool) has changed. Previously, CXXR optionally instrumented its own internal memory allocation scheme (based on classes MemoryBank and CellPool) using Valgrind client requests. This instrumentation was controlled by the preprocessor variable VALGRIND_LEVEL. Unfortunately the instrumented CXXR ran under Valgrind with glacial slowness, making it useless for practical purposes. Under the new approach, VALGRIND_LEVEL has been abolished. Instead, when Valgrind (+memcheck) is to be used, the file MemoryBank.cpp should be recompiled with the preprocessor variable NO_CELLPOOLS defined, and CXXR rebuilt. (Only this one file needs to be recompiled.) When NO_CELLPOOLS is defined, class MemoryBank routes all requests for memory blocks directly to ::operator new (which no doubt in turn calls malloc()). This means that Valgrind's internal malloc() substitute comes into play, and the result runs at an entirely usable speed.
CXXR has also been changed to carry out a more thorough clean-up at program exit; in particular all objects of a class derived from GCNode are deleted, and the tables of Symbols and CachedStrings are deleted. This suppresses many of the 'possibly lost' reports that Valgrind's leak check would otherwise produce.
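The effect of NO_CELLPOOLS on memory allocation can be pictured with the following sketch. It is not the CXXR code: the classes below are drastically simplified stand-ins that reuse the names MemoryBank and CellPool, and the cell size, the superblock size and the helper `usesCellPools()` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <new>

// Toy pool of fixed-size cells kept on a free list (assumed design).
class CellPool {
public:
    static const std::size_t kCellSize = 64;            // assumed
    static const std::size_t kCellsPerSuperblock = 32;  // assumed

    void* allocate() {
        if (!m_free) grow();
        Cell* c = m_free;
        m_free = c->next;
        return c;
    }
    void deallocate(void* p) {
        Cell* c = static_cast<Cell*>(p);
        c->next = m_free;
        m_free = c;
    }
private:
    union Cell {
        Cell* next;
        alignas(std::max_align_t) unsigned char bytes[kCellSize];
    };
    // Obtain a superblock and thread its cells onto the free list.
    // Superblocks are never returned to the system in this sketch.
    void grow() {
        Cell* block = static_cast<Cell*>(
            ::operator new(sizeof(Cell) * kCellsPerSuperblock));
        for (std::size_t i = 0; i < kCellsPerSuperblock; ++i) {
            block[i].next = m_free;
            m_free = &block[i];
        }
    }
    Cell* m_free = nullptr;
};

class MemoryBank {
public:
    static bool usesCellPools() {
#ifdef NO_CELLPOOLS
        return false;
#else
        return true;
#endif
    }
    static void* allocate(std::size_t bytes) {
#ifndef NO_CELLPOOLS
        if (bytes <= CellPool::kCellSize) return pool().allocate();
#endif
        // With NO_CELLPOOLS defined every request lands here, so a
        // tool that intercepts the system allocator sees each block.
        return ::operator new(bytes);
    }
    static void deallocate(void* p, std::size_t bytes) {
        (void)bytes;  // unused when NO_CELLPOOLS is defined
#ifndef NO_CELLPOOLS
        if (bytes <= CellPool::kCellSize) { pool().deallocate(p); return; }
#endif
        ::operator delete(p);
    }
private:
    static CellPool& pool() { static CellPool p; return p; }
};
```

Compiling this one translation unit with `-DNO_CELLPOOLS` switches the behaviour without touching any caller, which mirrors the point that only one file needs to be recompiled.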

Phase 32: `0.32-2.11.1`

This phase consisted of changes to improve the speed of CXXR. The principal changes were as follows:

When the reference count of a GCNode falls to zero, it is designated as 'moribund'. Previously moribund nodes were moved onto a separate doubly-linked list of nodes (and moved back again if the reference count was found subsequently to have risen). Now instead the GCNode class maintains a vector of pointers to moribund nodes. Also, the moribund flag within a GCNode object is now incorporated into the same byte as the saturating reference count.
PairList objects have now been squeezed into 32 bytes (on 32-bit architecture) - with some resulting inelegances in encapsulation - and Frame::Binding objects have been reduced to 16 bytes (again on 32-bit architecture). Class CellPool now allocates its 'superblocks' on 4096-byte boundaries. These changes make for better utilisation of the processor caches.
A new class VectorFrame has been introduced, and used to implement the local Environments of Closure calls instead of the StdFrames used previously. As the name suggests, VectorFrame is an implementation of the Frame abstract type which holds its constituent Frame::Bindings as a vector. Although look-up time is asymptotically linear in the number of Bindings, as compared with the logarithmic performance of StdFrame, VectorFrame has a shorter construction and destruction time than StdFrame, and is better localised in memory. These factors make VectorFrame more efficient in implementing small Frames with a short lifetime.
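The trade-off behind VectorFrame can be shown with a small sketch. The class below is a hypothetical reduction, not the CXXR class: symbols are plain strings, bound values are plain ints, and the member functions `bind()`, `binding()` and `size()` are names chosen for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A frame that stores its bindings contiguously and searches linearly.
class VectorFrame {
public:
    struct Binding {
        std::string symbol;
        int value;
    };

    // Linear scan: O(n), but over contiguous memory with no tree nodes.
    Binding* binding(const std::string& symbol) {
        for (Binding& b : m_bindings)
            if (b.symbol == symbol) return &b;
        return nullptr;
    }

    // Rebind an existing symbol, or append a new binding.
    void bind(const std::string& symbol, int value) {
        if (Binding* b = binding(symbol))
            b->value = value;
        else
            m_bindings.push_back(Binding{symbol, value});
    }

    std::size_t size() const { return m_bindings.size(); }

private:
    std::vector<Binding> m_bindings;  // a single allocation grows as needed
};
```

For the handful of arguments and locals in a typical closure call, constructing and destroying one vector is cheaper than building and tearing down a node-based map, even though each look-up is a scan.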

Phase 33: `0.33-2.12.1`

The purpose of this phase was to update CXXR to parallel release 2.12.1 of CR. In the course of this, the use of UncachedString objects was largely replaced by the use of CachedString objects, a change that has lagged behind the corresponding change in CR.

Phase 34: `0.34-2.12.1`

This phase was marked by a wider use of C++ generic programming techniques, both to simplify the internal code, and to make this code available in a flexible form to add-on packages. In particular:

All the built-in vector types are now specialisations of the class template FixedVector.
Subscripting operations (subsetting and subassignment) are now carried out by algorithms implemented as C++ templates, so that they are applicable to generalised vectors of arbitrary element types, not just the R built-in vector types. (Class Subscripting and associated functions.)
Similarly unary functions and binary functions are now handled generically, using algorithms within the namespace VectorOps.
To support the generic algorithms, various function objects were introduced in the new ElementTraits namespace.
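As an illustration of the generic style described in this phase, here is a minimal sketch of a subsetting algorithm written as a template. It is not the Subscripting class: the names `ElementTraitsSketch` and `subset()` are invented, the vector is any container with `value_type`, `size()` and `operator[]`, and the treatment of out-of-range indices (yielding an NA-like value) is a simplifying assumption.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Per-element-type policy, loosely in the spirit of a traits namespace.
template <typename T>
struct ElementTraitsSketch {
    static T na() { return T(); }  // placeholder NA for arbitrary types
};

template <>
struct ElementTraitsSketch<int> {
    // R represents NA_integer_ as INT_MIN.
    static int na() { return std::numeric_limits<int>::min(); }
};

// Select elements by 1-based index, as R does; indices outside the
// vector produce the element type's NA value.
template <class V>
V subset(const V& v, const std::vector<std::size_t>& indices) {
    typedef typename V::value_type T;
    V ans(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i) {
        const std::size_t idx = indices[i];
        if (idx >= 1 && idx <= v.size())
            ans[i] = v[idx - 1];
        else
            ans[i] = ElementTraitsSketch<T>::na();
    }
    return ans;
}
```

Because the algorithm depends only on the container interface and the traits, the same code serves `std::vector<int>`, `std::vector<std::string>`, or a user-defined vector type supplied by an add-on package.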

Phase 35: `0.35-2.12.1`

This release is intended to clear the decks prior to an upgrade to R 2.13.1, and includes only small changes in the development trunk:

The class Subscripting has now been extended to cover subassignment to matrices and arrays.
The implementation of class GCNode has been modified, reducing its administrative data to a single byte.
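To give a feel for how a node's administrative data might fit in one byte, the sketch below packs a saturating reference count and the moribund flag (which Phase 32 placed in the same byte as the count) into a single `std::uint8_t`. The bit layout, the class name `RefCountByte` and its member functions are assumptions for illustration; the actual GCNode layout is not documented here and may hold other flags as well.

```cpp
#include <cstdint>

// One byte: bit 7 is the moribund flag, bits 0-6 the reference count.
class RefCountByte {
public:
    unsigned count() const { return m_bits & kCountMask; }
    bool moribund() const { return (m_bits & kMoribundBit) != 0; }

    // Saturating increment: once the count hits the maximum it sticks.
    void increment() {
        if (count() < kCountMask) ++m_bits;
    }

    // Returns true if the count has just fallen to zero. A saturated
    // count is never decremented, because the true count is unknown.
    bool decrement() {
        const unsigned c = count();
        if (c == 0 || c == kCountMask) return false;
        --m_bits;
        return c == 1;
    }

    void setMoribund(bool on) {
        m_bits = static_cast<std::uint8_t>(
            on ? (m_bits | kMoribundBit) : (m_bits & kCountMask));
    }

private:
    enum { kCountMask = 0x7F, kMoribundBit = 0x80 };
    std::uint8_t m_bits = 0;
};
```

A node whose count saturates can no longer be reclaimed by reference counting alone in this scheme, so it would have to be left to a different collection mechanism.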

(The main activity in the period leading up to this release has been the introduction of the lazycopy branch, which is exploring methods for managing object duplication automatically via the RHandle smart pointer, and eliminating the need for NAMED() and SET_NAMED(). Verdict so far is mixed: it basically works, but has performance issues, and breaks somewhat more existing code than I'd like. A plus point is that it better achieves C++ 'const correctness' than the development trunk.)

Phase 36: `0.36-2.13.1`

The purpose of this phase was to upgrade CXXR to parallel release 2.13.1 of CR. This includes making bytecode interpretation available in CXXR for the first time, though not yet in the 'threaded code' implementation (which is the CR default when using gcc).

The code also now builds correctly when configured with --enable-memory-profiling. (Thanks to Doug Bates for pointing out that previously it didn't.) However, the functionality of tracemem and kindred R functions (untracemem and retracemem) is currently unavailable in CXXR even when it is configured with memory profiling enabled.

Phase 37: `0.37-2.13.1`

This release contains only minor changes:

The functionality of tracemem and kindred functions has been reinstated.
The 'threaded code' implementation of the bytecode interpreter is available, and is the default under gcc (as in CR).
Various efficiency improvements, particularly regarding bytecode, though much remains to be done here.

Phase 38: `0.38-2.13.1`

This release clears the decks prior to an upgrade of CXXR to R 2.14.1.

The principal change concerns garbage collection. The reference-counted approach to garbage collection primarily used by CXXR can bring speed advantages when dealing with large datasets, but the housekeeping involved in diddling reference counts up and down as required is surprisingly time-consuming. This is a major contributor to the speed penalty of CXXR compared with CR when dealing with small datasets, a penalty that has grown greater with the advent of the bytecode interpreter. This release incorporates the following changes:

Formerly CXXR would initiate a reference-count garbage collection (GCNode::gclite()) on every call to GCNode::operator new. This is still the case if CXXR is built with the preprocessor variable AGGRESSIVE_GC defined (as is the case in the default configuration), but otherwise gclite() is invoked only when the number of bytes allocated has risen by a certain margin (currently 10,000) since the previous call of gclite().
Smart pointers from the GCStackRoot class template are now in either a non-protecting or protecting state, with newly created GCStackRoots being non-protecting. Only if a GCStackRoot is in the protecting state does it increment the reference count of its target. GCNode::gclite() switches all GCStackRoots into the protecting state before starting garbage collection. Taken in conjunction with the first change, this means that many GCStackRoot pointers will complete their lifecycle without ever being switched into the protecting state.
Changes in a similar spirit have been made to the CR-style 'pointer protection stack' (class ProtectStack) and the bytecode interpreter's node stack, both of which are now implemented using the new class NodeStack.
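The interplay between deferred collection and lazily protecting stack roots can be sketched as follows. This is a toy model, not the CXXR implementation: `Node`, `StackRoot`, `noteAllocation()` and the global counters are hypothetical names, `gclite()` here does no actual collecting, and only the 10,000-byte margin is taken from the text above.

```cpp
#include <cstddef>
#include <vector>

struct Node { unsigned refcount = 0; };

// A stack-based root that touches its target's reference count only
// once it has been switched into the protecting state.
class StackRoot {
public:
    explicit StackRoot(Node* target) : m_target(target), m_protecting(false) {
        roots().push_back(this);
    }
    ~StackRoot() {
        if (m_protecting && m_target) --m_target->refcount;
        roots().pop_back();  // roots die in reverse order of creation
    }
    StackRoot(const StackRoot&) = delete;
    StackRoot& operator=(const StackRoot&) = delete;

    bool protecting() const { return m_protecting; }

    // Switch every live root into the protecting state.
    static void protectAll() {
        for (StackRoot* r : roots()) {
            if (!r->m_protecting) {
                r->m_protecting = true;
                if (r->m_target) ++r->m_target->refcount;
            }
        }
    }
private:
    static std::vector<StackRoot*>& roots() {
        static std::vector<StackRoot*> v;
        return v;
    }
    Node* m_target;
    bool m_protecting;
};

const std::size_t kGCThreshold = 10000;  // margin quoted in the text
std::size_t g_bytesSinceGC = 0;
unsigned g_collections = 0;

// Stand-in for GCNode::gclite(): protect the roots, then collect.
void gclite() {
    StackRoot::protectAll();
    g_bytesSinceGC = 0;
    ++g_collections;  // a real collector would reclaim nodes here
}

// Called on each allocation; collects only after enough bytes accrue.
void noteAllocation(std::size_t bytes) {
    g_bytesSinceGC += bytes;
    if (g_bytesSinceGC >= kGCThreshold) gclite();
}
```

A root created and destroyed between two collections never changes a reference count at all, which is where the saving for short-lived pointers comes from.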

A side effect of the above changes is that when AGGRESSIVE_GC is defined, CXXR's garbage collection is even more aggressive than it was in previous releases, and this has revealed a number of GC-protection gaps (e.g. in code inherited from CR) that had previously 'slipped through the net'.

Another significant change is that the CXXR distribution no longer holds the 'Recommended' packages in compressed tar form (.tar.gz), but instead contains the untarred package directories themselves. This will make it easier to carry forward any CXXR-specific tweaks to these packages from one R release to the next. (Such tweaks are rare, and often due to a latent GC-protection bug in the CR package code.)

Phase 39: `0.39-2.14.1`

The purpose of this phase was to upgrade CXXR to parallel release 2.14.1 of CR. This entailed substantial changes to the bytecode interpreter, both to track changes in CR and to correct errors in the previous CXXR implementation. In the course of preparing this release, numerous GC-protection gaps were discovered in the CR code (including the Recommended packages) and corrected within CXXR.

CXXR's bytecode interpreter does not yet implement the cache of symbol bindings used in CR.

Phase 40: `0.40-2.15.1`

The purpose of this phase was to upgrade CXXR to parallel release 2.15.1 of CR. In the course of this upgrade, the class UncachedString was abolished, and the functionality of class CachedString was merged into its parent class CXXR::String.

Phase 41: `0.41-2.15.1`

In this phase, the experimental provenance-tracking facilities and the experimental XML-based serialization facilities, both formerly in the provenance branch, have been merged into the development trunk. Beware that the documentation and, in particular, the testing of these features are still not up to standard, and there are known gaps in the serialization capability. Moreover, the interfaces of both are likely to change. To enable provenance-tracking it is necessary to define PROVENANCE_TRACKING within src/include/CXXR/config.hpp before building the program, as the documentation of this file explains.

Phase 42: `0.42-2.15.1`

This phase saw various extensions and corrections to the XML-based serialization facilities, including the introduction of automated tests, but beware that these are still subject to change. The release incorporates work by Chris Silles on adapting the autoconf-based configuration facilities to CXXR: this addresses particularly locating a suitable installation of Boost, and enabling or disabling provenance tracking. Previously there were some difficulties in building CXXR otherwise than in its source directory: these have now, it is hoped, been removed.

$Id: refactoring.html 1409 2013-10-01 13:57:22Z arr $
