# Decompiling C++
Like most (all?) Xbox titles and most sixth-generation games more generally,
JSRF is not written in assembly or C as those before it were, but rather C++.
C++ introduces new features that both complicate the final machine code and
weaken the correspondence between said machine code and the original C++
source.

This guide will cover various C++ features appearing in JSRF, explaining how
they manifest in the game's executable and how to properly decompile them, to
the extent possible.  Basic familiarity with C features (e.g. functions,
structs) and how to decompile them is assumed.


## Name Mangling
Whenever you encounter symbol names actually produced by a C++ compiler, like
when recompiling decompiled code, they'll probably look garbled like
`??_GGameObj@@UAEPAXI@Z` or `_ZN7GameObjD1Ev` depending on the compiler.  These
are mangled names, used by compilers to prevent conflicts from overloaded
functions, communicate additional information about symbols, and so on.

Many tools can print these in human-readable form to produce e.g.
`` public: virtual void * __thiscall GameObj::`scalar deleting destructor'(unsigned int) ``,
and objdiff will do so by default.  When using the Ghidra delinking tool
specifically, it's important to keep in mind that the delinked symbol names do
_not_ get mangled, so they won't have the exact same names as in the recompiled
code, and corresponding symbols in the delinked and recompiled object files
will need to be associated by hand.
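
To see why mangling is needed in the first place, here is a minimal sketch with
two overloads sharing one source-level name.  The function names are made up
for illustration, and the symbol names in the comments are what the Microsoft
and Itanium (GCC/Clang) schemes should produce for these signatures.
```c++
// Two overloads share the name "describe" in source, so the compiler must
// give them distinct symbol names in the object file.
int describe(int)   { return 1; } // MSVC: ?describe@@YAHH@Z  Itanium: _Z8describei
int describe(float) { return 2; } // MSVC: ?describe@@YAHM@Z  Itanium: _Z8describef

// extern "C" turns mangling off, which is also why such a function cannot be
// overloaded.  32-bit MSVC only adds the usual cdecl underscore.
extern "C" int describeUnmangled(int) { return 3; } // MSVC: _describeUnmangled
```

Calling `describe(1)` and `describe(1.0f)` reaches two different symbols even
though the source spells both calls the same way.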


## Classes
C++ classes evolve the C struct to associate the data structure with functions,
which are called methods in this context.  Classes can also inherit from one or
more other classes, sharing their data members and access to their methods.
Certain special methods called constructors and destructors can also be added
to a class, and these can be called implicitly when an instance of a class goes
in or out of scope.  Classes can also have fields and methods marked as
private, but these permissions are usually completely erased during
compilation and don't need to be respected by a decompilation.

### `class` vs. `struct`
The `struct` keyword can still be used in C++ and is equivalent to `class`,
except that the former makes all members public by default and the latter makes
all private by default.  Since there's not much reason to make anything private
in a decompilation, one will usually use `struct` declarations in
decompilations rather than `class`.

```c++
// These two declarations are equivalent
class SomeClass {
public: // Makes everything after public
    float    someMemberVariable;
    unsigned anotherMemberVariable;
};

struct SomeStruct {
    float    someMemberVariable;
    unsigned anotherMemberVariable;
};
```

A reasonable way to implement an inherited struct in Ghidra is to define the
base class normally, and then define the child with a first member called
`super` of the parent class type.  Members specific to the child class can then
be inserted afterwards.
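
The `super` approach works because, for simple single inheritance without
virtual methods, mainstream compilers place the parent's members at the start
of the child.  The sketch below compares the two layouts; the type names are
hypothetical and the equivalence is an assumption about typical compilers, not
something the language guarantees.
```c++
#include <cstddef>

struct Base {
    float    someMemberVariable;
    unsigned anotherMemberVariable;
};

// How the child would be written in C++
struct Child : Base {
    char * additionalMemberVariable;
};

// How the child might be modelled in Ghidra's structure editor
struct ChildAsGhidraStruct {
    Base   super; // Parent embedded as the first member
    char * additionalMemberVariable;
};

// Byte offset of the child-specific member, measured on a real object
inline std::ptrdiff_t childMemberOffset() {
    Child c = Child();
    return reinterpret_cast<char *>(&c.additionalMemberVariable)
         - reinterpret_cast<char *>(&c);
}
```

If the sizes and member offsets of `Child` and `ChildAsGhidraStruct` agree,
accesses through either type will land on the same bytes.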

### Class Methods
Methods are functions declared within a class's namespace, like so:
```c++
struct SomeClass {
    // Regular data members
    float    someMemberVariable;
    unsigned anotherMemberVariable;

    // Methods declared in class definition
    SomeClass(int anArgument); // Constructor
    ~SomeClass();              // Destructor

            void regularMethod(unsigned anArgument);
    virtual void virtualMethod(char *   anArgument);
    static  void staticMethod (char *   anArgument);

    // Can also provide entire definition in class
    float anotherMethod(float x) {
        this->someMemberVariable += x;
        return this->someMemberVariable;
    }
};

// Definition of a method declared in class
void SomeClass::regularMethod(unsigned anArgument) {
    this->anotherMemberVariable -= anArgument;
}
```

Methods can then be accessed and called with member access syntax, like
`classInstance.regularMethod(3)` and `instancePtr->anotherMethod(1.2)`.

Static methods are indistinguishable from regular functions in compiled code,
so they probably won't see much use in decompilations.  They don't have access
to the `this` pointer that other types of methods can use.

Regular methods are similar to regular functions, but have an implicit first
argument called `this` representing a pointer to the object that the method
was called on.  Some C++ implementations use a different calling convention
for method calls, such as Microsoft's implementation for the Xbox using the
`__thiscall` convention where the `this` pointer is passed in the ECX register
while all other arguments are passed on the stack.
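
As a rough illustration of the implicit `this` argument, the sketch below
pairs a method with a hand-written free function that takes the object pointer
explicitly, which is close to how a decompiler will present the method.  The
`Counter` type and both function names are invented for this example, and the
free function uses the default calling convention rather than `__thiscall`.
```c++
struct Counter {
    unsigned count;

    // Regular method: `this` is passed implicitly
    unsigned add(unsigned amount) {
        this->count += amount;
        return this->count;
    }
};

// Roughly what the compiled method looks like to a decompiler: an ordinary
// function whose first parameter is the object pointer.
unsigned Counter_add(Counter * self, unsigned amount) {
    self->count += amount;
    return self->count;
}
```

Calling `counter.add(3)` and `Counter_add(&counter, 3)` has the same effect on
the object; only the way the pointer is passed differs.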

Constructors and destructors function largely like regular methods, but
implicitly return the `this` pointer.  C++ makes certain guarantees about
objects that have constructors and destructors that obligate the compiler to
insert additional code in certain circumstances: be aware, for instance, that
constructor calls will often be wrapped with stack unwinding code in case an
exception is thrown from within the constructor (see the exception handling
section).  An object's destructor is also automatically called at the end of
its lifetime (e.g. it goes out of scope), which can lead to inclusion in
exception handling code or just being called at the end of a code block even if
the source code doesn't invoke it explicitly.  This automatic resource
management is often called part of C++'s RAII (resource acquisition is
initialization) design.
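
The implicit destructor calls can be made visible with a small sketch that
logs each construction and destruction.  The `Tracer` type and the log string
are hypothetical helpers for this example only.
```c++
#include <string>

// Records the order of events so the implicit calls are visible
static std::string g_log;

struct Tracer {
    char id;
    explicit Tracer(char id) : id(id) { g_log += id; g_log += '+'; }
    ~Tracer()                         { g_log += id; g_log += '-'; }
};

void scopedExample() {
    Tracer a('a');
    {
        Tracer b('b');
    } // b's destructor runs here without any explicit call
    g_log += '|';
} // a's destructor runs here
```

After `scopedExample()` returns, the log reads `a+b+b-|a-`: neither destructor
appears in the source, yet both show up as calls in the compiled function.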

Virtual methods are methods that can be overridden on child classes.  They're
not called directly, but instead called through a hidden first member that
points to an array of method function pointers, usually called a vtable (Visual
C++ 7 calls it `` ClassName::`vftable' ``).  If a destructor specifically is
made virtual, additional "deleting destructors" may be generated as well, which
are methods taking one `unsigned` argument that call the destructor and then,
depending on the argument, free the object's memory.
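
A deleting destructor can be approximated by hand as below.  This is a sketch
that assumes the low bit of the argument is the "free the memory" flag, which
is how Visual C++ is generally described as using it; `Widget`, the counters,
and the function name are all hypothetical.
```c++
#include <cstddef>
#include <cstdlib>
#include <new>

static int g_destroyed = 0; // Number of destructor calls
static int g_freed     = 0; // Number of operator delete calls

struct Widget {
    int value;
    Widget() : value(42) {}
    ~Widget() { ++g_destroyed; }

    // Class-specific allocator so the sketch can observe frees
    static void * operator new(std::size_t size) {
        void * ptr = std::malloc(size);
        if (ptr == 0) throw std::bad_alloc();
        return ptr;
    }
    static void operator delete(void * ptr) { ++g_freed; std::free(ptr); }
};

// Hand-written stand-in for the compiler-generated helper
void * Widget_scalarDeletingDestructor(Widget * self, unsigned flags) {
    self->~Widget();                   // Always run the destructor
    if (flags & 1) {
        Widget::operator delete(self); // Free only when asked to
    }
    return self;                       // The real helper returns `this` too
}
```

Passing `1` behaves like `delete ptr`, while passing `0` only destroys the
object, which suits objects whose storage is owned by something else.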

Ghidra has somewhat obscure support for classes and regular methods, and
virtual methods can be made to work with some admittedly tedious effort.

A class can be defined by right-clicking on "Classes" in the symbol tree window
and selecting "Create Class."  Symbols (e.g. methods) can then be added to this
class by putting them in the class's namespace, i.e. opening the Add/Edit Label
or Rename Function window (usually from right-clicking something or its name)
and adding the class name as a prefix, e.g. `ClassName::someSymbol`.  Be aware
that certain windows like the Edit Function window have no awareness of
namespaces, and trying to add the namespace prefix will just modify the symbol
name directly without actually adding it to the namespace.  For Microsoft code
(e.g. Xbox), applying the appropriate `__thiscall` calling convention enables a
special behaviour where the first argument passed in ECX is forcibly named
`this` and has a fixed pointer type (by default `void *`).  If the method is
placed in a class's namespace, however, and a struct of the same name exists,
the `this` pointer's type will be set to that struct.

Since virtual method calls go through pointers rather than calling a function
at a fixed address, they show up in Ghidra as unsightly member accesses like
`(**(code **)(*g_graphics + 0x160))(g_graphics,0)`.  One can however
simulate vtables by hand to get these calls resolving to something somewhat
more manageable like `(*g_graphics->vtable->setFogEnable)(g_graphics,0)` with
the correct number and types of arguments.  The class's first member (which is
a link to its vtable) can be set to a pointer to a new struct type whose
members are pointers to functions defined in the data type manager (right click
and then `New > Function Definition...`).  To actually access the methods'
definitions (keeping in mind there are likely multiple for different classes
inheriting from the same base class), it will be necessary to either find
where the vtable is assigned (the class constructor is a good choice) or
potentially examine the first member of an instance of the class at runtime
with the help of Cheat Engine or an emulator's memory viewer.
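
The hand-built vtable described above can be sketched in plain C++ as follows.
Only `g_graphics` and `setFogEnable` come from the example above; the
`Graphics` type, its `fogEnabled` member, and `getFogEnable` are hypothetical,
and the function pointers use the default calling convention rather than
`__thiscall`.
```c++
struct Graphics; // Forward declaration for the function pointer types

// One function pointer per virtual method, in vtable order
struct GraphicsVtable {
    void (*setFogEnable)(Graphics * self, int enable);
    int  (*getFogEnable)(Graphics * self);
};

struct Graphics {
    GraphicsVtable const * vtable; // Hidden first member made explicit
    int fogEnabled;
};

// Implementations that this particular class's vtable points at
static void Graphics_setFogEnable(Graphics * self, int enable) {
    self->fogEnabled = enable;
}
static int Graphics_getFogEnable(Graphics * self) {
    return self->fogEnabled;
}

static GraphicsVtable const g_graphicsVtable = {
    Graphics_setFogEnable,
    Graphics_getFogEnable,
};

// Constructors install the vtable, which is why they are a good place to
// look when trying to find which functions a vtable contains
void Graphics_construct(Graphics * self) {
    self->vtable     = &g_graphicsVtable;
    self->fogEnabled = 1;
}
```

With these definitions, `(*g_graphics->vtable->setFogEnable)(g_graphics, 0)`
compiles and dispatches through the table just as the cleaned-up Ghidra output
does.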

### Inheritance
Child classes can be used in most places that their parent class can be used:
```c++
// Class inheriting from SomeStruct
struct SomeStructChild : SomeStruct {
    // Inherits these from SomeStruct:
    //     float    someMemberVariable;
    //     unsigned anotherMemberVariable;
    char * additionalMemberVariable;
};

// Could call this with either a SomeStruct* or SomeStructChild* argument
float getSomeMemberVariable(SomeStruct const * const ss) {
    return ss->someMemberVariable;
}
```


## The `new` and `delete` Operators
One way to allocate and free an object in C++ is with `new` and `delete`.  The
former both allocates and constructs the object, while the latter destroys it
and then frees its memory, much like calling `free()`.  Each has a
corresponding `operator new()` or `operator delete()` function called
implicitly.

The generated code for a use of `new` with a constructor (like
`SomeStruct *ss = new SomeStruct(7)`) performs the allocator and constructor
calls separately, roughly as follows (as it would appear in Ghidra; note that
Ghidra shows explicitly the passing of the `this` pointer):
```c++
SomeStruct *ss;
ss = (SomeStruct *)operator_new(0xc);
if (ss == NULL) {
    ss = NULL; // No, I'm not sure what the point of reassigning NULL is
}
else {
    SomeStruct::SomeStruct(ss,7);
}
```


## Exception Handling
C++ offers the ability to throw and catch exceptions, which have highly
platform-specific implementations that require some sophistication to uphold
the language's guarantees about object initialization and destruction.  In
particular, some hidden bookkeeping needs to be done to implement `try` and
`catch` blocks, as well as keep track of what cleanup needs to be done if an
exception is thrown (part of a process known as stack unwinding, i.e. walking
back up the call stack until the exception is caught or the top is reached).

We'll focus here on the Microsoft implementation found in Xbox games.  The
memory addressed through the FS register begins with a pointer to the most
recently added item of a linked list of structures with exception handling
information, defined as follows:
```c++
struct EXCEPTION_REGISTRATION_RECORD {
    EXCEPTION_REGISTRATION_RECORD * Next;    // Next item in linked list
    EXCEPTION_ROUTINE             * Handler; // Function pointer
};
```

Functions with any exception handling or stack unwinding will have a prologue
like the following in Ghidra:
```c++
undefined4 *unaff_FS_OFFSET;
undefined4 local_c;
undefined *puStack_8;
undefined4 local_4;

local_4 = 0xffffffff;
puStack_8 = &LAB_00186c4b;
local_c = *unaff_FS_OFFSET;
*unaff_FS_OFFSET = &local_c;
```

One might clean this up a bit, revealing that the code is adding a new entry to
the list (here from the JSRF `Game::Game()` constructor):
```c++
  EXCEPTION_REGISTRATION_RECORD *_tib; // "thread information block"
  EXCEPTION_REGISTRATION_RECORD _err;
  int _trylevel;

  _err.Next = _tib->Next;
  _trylevel = -1;
  _err.Handler = Game_handler;
  _tib->Next = &_err;
```

As the name suggests, `_trylevel` will be incremented when a new block of code
requiring exception handling or stack unwinding is encountered, e.g. around
constructors whose memory must be freed if they throw.  The function will end
by dropping the item that was added to the exception handling list
(`_tib->Next = _err.Next`).
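
The push and pop of the registration record can be modelled in portable C++ as
below.  This is only a simulation: a global pointer stands in for the list
head reached through FS, the handler signature is simplified, and every name
here is hypothetical.
```c++
struct Record {
    Record * Next;       // Previously registered record
    void  (* Handler)(); // Simplified handler signature
};

// Stand-in for the list head that the real code reaches through FS
static Record * g_listHead  = 0;
static int      g_depthSeen = 0;

static void dummyHandler() {}

// Counts how many records are currently registered
int registeredCount() {
    int n = 0;
    for (Record * r = g_listHead; r != 0; r = r->Next) ++n;
    return n;
}

void guardedFunction() {
    Record err;
    int    trylevel;

    // Prologue: push a record that lives on this function's stack
    err.Next    = g_listHead;
    trylevel    = -1;
    err.Handler = dummyHandler;
    g_listHead  = &err;

    trylevel    = 0;                 // Entering a guarded region
    g_depthSeen = registeredCount(); // Body of the function
    trylevel    = -1;                // Leaving the guarded region
    (void)trylevel;

    // Epilogue: pop the record again
    g_listHead = err.Next;
}
```

While `guardedFunction()` runs, exactly one extra record is on the list, and
the list is back to its previous state once the function returns.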

To actually see what the exception handling or stack unwinding code will do, we
need to look at the `.Handler` function that was assigned.  It usually looks
something like this:
```c++
void Game_handler(EHExceptionRecord *param_1,EHRegistrationNode *param_2,void *param_3,
                 DispatcherContext *param_4) {
  ___CxxFrameHandler(param_1,param_2,param_3,param_4,&Game_funcinfo);
  return;
}
```

What we care about here is the last argument passed to `__CxxFrameHandler()`,
which is a pointer to a `FuncInfo` structure defined as follows:
```c++
struct FuncInfo {
    DWORD              magicNumber;
    int                maxState;
    UnwindMapEntry   * pUnwindMap;
    DWORD              nTryBlocks;
    TryBlockMapEntry * pTryBlockMap;
    DWORD              nIMapEntries;
    void             * pIPtoStateMap;
};
```

Here we can finally distinguish between unwinding code (called as an exception
propagates up the call stack) and catching code (also called if an exception is
thrown, but it can stop the exception from propagating any further): the former
gets entries in `pUnwindMap` (the number of entries being given by `maxState`),
while the latter gets entries in `pTryBlockMap` (the number of entries being
given by `nTryBlocks`).

The unwind map is the simpler of the two, with each entry being as follows:
```c++
struct UnwindMapEntry {
    int  toState;
    void (*action)();
};
```

The `toState` member describes which value `_trylevel` will assume after the
function in the second member is called.  The second member points to the
actual unwinding code, which will tend to decompile to something simple but
unpleasant like this:
```c++
void Game_handler_unwind1(void) {
  int unaff_EBP;

  operator_delete(*(void **)(unaff_EBP + 8));
  return;
}
```

Clearly this is freeing memory (in fact, it frees a particular object's memory
if its constructor throws), but what is the argument?  EBP here holds the stack
pointer for the function that this code applies to, so you'll have to look at
the stack layout when this handler is active.  While it's easy to guess much of
the time based on what code is being wrapped, one could look to confirm in this
case that `ESP + 8` in the function holds a pointer to memory that was just
allocated and is being passed to a constructor that's being guarded (shown by a
`CALL operator_new` followed by `MOV dword ptr [ESP + 8],EAX` in the disassembly;
make sure you know your registers and calling conventions!).
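
The way `toState` chains unwind actions together can be sketched as a simple
loop.  This is a simplified model rather than the real runtime code: the map
contents, the logging actions, and `unwindTo()` are all hypothetical.
```c++
#include <string>

struct UnwindMapEntry {
    int  toState;
    void (*action)();
};

static std::string g_unwindLog;

static void destroyFirst()  { g_unwindLog += "first;"; }
static void destroySecond() { g_unwindLog += "second;"; }

// Hypothetical map for a function that constructs two objects in order.
// State 0: first object alive.  State 1: both objects alive.
static UnwindMapEntry const g_unwindMap[] = {
    { -1, destroyFirst  }, // From state 0, go to -1 (nothing left)
    {  0, destroySecond }, // From state 1, go to state 0
};

// Run cleanup actions from `state` until `targetState` is reached
void unwindTo(int state, int targetState) {
    while (state != -1 && state != targetState) {
        UnwindMapEntry const & entry = g_unwindMap[state];
        if (entry.action != 0) {
            entry.action();
        }
        state = entry.toState;
    }
}
```

Unwinding from state 1 all the way to -1 runs `destroySecond` and then
`destroyFirst`, mirroring how objects are destroyed in reverse order of
construction.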

Try blocks aren't too much different in reality, with entries defined like
this:
```c++
struct TryBlockMapEntry {
    int           tryLow;
    int           tryHigh;
    int           catchHigh;
    int           nCatches;
    HandlerType * pHandlerArray;
};
```

The `tryLow` and `tryHigh` members specify the `_trylevel` values that this handler
applies to, and `nCatches` indicates how many `catch` blocks there are (which
are in an array pointed to by `pHandlerArray`).  `HandlerType` is our final
structure to define:
```c++
struct HandlerType {
    DWORD            adjectives;
    TypeDescriptor * pType;
    int              dispCatchObj;
    void           * addressOfHandler;
};
```

The first three members specify what kinds of exceptions are being caught
(either by type in the first two members' case or a stack offset to an
exception object in the third's), and the final member is the actual exception
handling code, which again uses EBP to reference data on the original
function's stack.
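
Matching a `_trylevel` value against the try block map amounts to a range
check, sketched below.  The map contents and `findTryBlock()` are hypothetical,
and the real runtime does more work than this (e.g. matching exception types
against each `HandlerType`).
```c++
struct HandlerType; // Layout not needed for this sketch

struct TryBlockMapEntry {
    int           tryLow;
    int           tryHigh;
    int           catchHigh;
    int           nCatches;
    HandlerType * pHandlerArray;
};

// Hypothetical map: an inner try covering states 1 to 2, nested in an outer
// try covering states 0 to 3
static TryBlockMapEntry const g_tryBlockMap[] = {
    { 1, 2, 3, 1, 0 }, // Inner try block
    { 0, 3, 5, 1, 0 }, // Outer try block
};

// Return the index of the first entry whose range contains `state`, or -1
int findTryBlock(TryBlockMapEntry const * map, int count, int state) {
    for (int i = 0; i < count; ++i) {
        if (map[i].tryLow <= state && state <= map[i].tryHigh) {
            return i;
        }
    }
    return -1;
}
```

Doing this lookup by hand with the `_trylevel` assignments visible in a
decompiled function shows which `catch` blocks guard which stretch of code.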

If you'd like another, more thorough treatment of reverse engineering
exceptions, also take a look at
[this article](https://www.openrce.org/articles/full_view/21), or if you'd
really like the whole implementation spelled out in excruciating detail,
[this one](https://web.archive.org/web/20101007110629/http://www.microsoft.com/msj/0197/exception/exception.aspx)
is unparalleled.