diff --git a/documentation/gettingstarted.md b/documentation/gettingstarted.md index a198419..626f25e 100644 --- a/documentation/gettingstarted.md +++ b/documentation/gettingstarted.md @@ -83,38 +83,54 @@ executable where objdiff doesn't expect them to be, which will mess up our diffs. To correct this, open the memory map (`Window > Memory Map`) and uncheck the "X" column for `.rdata`, `.data`, and `DOLBY`. -Now we'll import data types from the decompilation. Open a shell in the -`ghidra/` directory of your copy of the repository and run `make_header.sh`, -which will produce a `jsrf.h` in the same directory with the combined contents -of every header in a format suitable for Ghidra. Then, in Ghidra, select -`File > Parse C Source...` to open a window for importing C headers. Remove -everything from the "Source files to parse" and "Parse options" boxes, and add -`jsrf.h` to the former (click the green + symbol on the right and select the -`jsrf.h` file). Click the "..." on the "Program Architecture:" box and select -the row with the values "x86," "default," "32," "little," and "Visual Studio." -Finally, click the "Parse to Program" button, "Continue" to confirm, and -"Don't Use Open Archives" (the header is completely self-contained and doesn't -need any information from any other data type archives). You should then see a -window reporting successful import, and you'll be able to find `jsrf.h` with -all of its definitions under `default.xbe` in the Data Type Manager window in -the bottom left. +Now we'll import data types from the decompilation. Open a Unix-style shell +(e.g. Git Bash if on Windows) in the `ghidra/` directory of your copy of the +repository and run `make_header.sh`, which will produce a `jsrf.h` in the same +directory with the combined contents of every header in a format suitable for +Ghidra. Then, in Ghidra, select `File > Parse C Source...` to open a window +for importing C headers. 
Remove everything from the "Source files to parse" +and "Parse options" boxes, and add `jsrf.h` to the former (click the green + +symbol on the right and select the `jsrf.h` file). Click the "..." on the +"Program Architecture:" box and select the row with the values "x86," +"default," "32," "little," and "Visual Studio." Finally, click the "Parse to +Program" button, "Continue" to confirm, and "Don't Use Open Archives" (the +header is completely self-contained and doesn't need any information from any +other data type archives). You should then see a window reporting successful +import, and you'll be able to find `jsrf.h` with all of its definitions under +`default.xbe` in the Data Type Manager window in the bottom left. -Lastly, we'll import symbols from the JSRF decompilation repository. Open the -script manager (`Window > Script Manager`) and select the "Data" folder in the -left pane. Double click the script titled `ImportSymbolsScript.py`, and a file -picker will open after a moment. Select `symboltable.tsv` from the `ghidra/` -directory of your cloned JSRF decompilation repository, and you should see a -bunch of `Created function...` and `Created label...` printed to the scripting -console window. Save your changes (save icon in the top left of the -CodeBrowser window), and your Ghidra project should be all ready for creating -object files for objdiff. +Much of our work with Ghidra will make use of some custom scripts we've +written, so we'll have to tell it where to find them. Open up the Script +Manager (`Window > Script Manager`) and then open the Bundle Manager by +clicking the "manage script directories" button (it looks sort of like a +bulleted list). Click the green + in the top right to add a new directory and +select the `ghidra/ghidra_scripts` directory in this repository. + +The first script we'll want to run is the symbol importer to get known data and +functions into your Ghidra project. 
In the Script Manager window, select the +"Import" category in the left pane and double click the `EnhancedImport.java` +script in the right pane to run it. You'll then be asked for an input file; +select `ghidra/symboltable.tsv` from this repository. Afterwards, you'll see a +bunch of "Importing ..." messages in a console in the main CodeBrowser window, +some of which may have "can't find data type X" added on if something's marked +with a type that hasn't made its way into our decompiled code yet, and there'll +be a bunch of new functions and labels defined. + +While we imported a bunch of data types earlier, Ghidra's C parser leaves out +some important information that we'll have to fill in with another script. In +the Script Manager, run `ClassFixup.java` from the "Data Types" category, and +you should see some "Converting X to class" and "Fixing calling convention of +X" messages in the console. + +Now you've got a Ghidra project containing everything we know about JSRF's +code! Make sure you save your Ghidra project now that everything's set up. ### Producing Object Files -Close all of your Ghidra windows and open a Unix-style shell (e.g. Git Bash if -on Windows) in the decompilation repository's `ghidra/` directory. The -`delink.sh` script is our automated tool for extracting all the object files -that have been identified so far. Invoke it with three arguments: +Close all of your Ghidra windows and open a Unix-style shell in the +decompilation repository's `ghidra/` directory. The `delink.sh` script is our +automated tool for extracting all the object files that have been identified so +far. The easiest way to run it is to invoke it with three arguments: - The path to your Ghidra installation (the directory with files like `ghidraRun` and `ghidraRun.bat`, and directories like `docs/` and @@ -128,27 +144,30 @@ Unix-style paths. Make sure the paths are surrounded by quotes, too (e.g. 
`'C:\path\to\whatever'`), else the shell won't understand the backslashes correctly. +If you find typing out these arguments to be too much of a pain, you can also +define the environment variables `$GHIDRA_HOME`, `$JSRFDECOMP_PROJECTPATH`, and +`$JSRFDECOMP_PROJECTNAME` and invoke the script without arguments. + There are a couple errors you might get here: - `Unable to lock project!`: This means that Ghidra isn't fully closed. Make sure you've completely closed every Ghidra window before running `delink.sh`. -- `Script not found: DelinkProgram.java` and - `Invalid script: DelinkProgram.java`: This means that the either the Ghidra - delinker extension isn't properly installed, or you've somehow invoked the - script in a way that can't see the extension (e.g. installing Ghidra on - Windows and then invoking the script from WSL). Ensure it's installed and - enabled first, and that you're not running in some kind of environment - different from where you installed Ghidra. +- `Script not found` and `Invalid script`: This means that you haven't added + the repository's `ghidra_scripts` directory to the script search path as + described in the previous section (particularly if it mentions + `MSVC7Mangle.java`), the Ghidra delinker extension isn't properly installed + (particularly if it mentions `DelinkProgram.java`), or you've somehow invoked + the script in a way that can't see the scripts (e.g. installing Ghidra on + Windows and then invoking the script from WSL). - `java.lang.RuntimeException: Failed to export ...`: This means that the delinker extension doesn't like something about what it was told to delink. One known cause is duplicate symbol names. If you haven't modified `objects.csv` or `symboltable.tsv`, let other people on the project know so that they can look into fixing it. 
-If all goes well, you'll see the message `Delinking complete!` at the end of -the script's output, and the extracted object files will be in the -`decompile/target/` directory of the repository. Now we're ready to start -recompiling and diffing code with objdiff. +If all goes well, the extracted object files will be in the `decompile/target/` +directory of the repository. Now we're ready to start recompiling and diffing +code with objdiff. ### Setting Up objdiff @@ -167,9 +186,11 @@ correctly set up on your `PATH`. One important piece of information, to make sure you get the correct match percentages: set `Diff Options > Function relocation diffs` to "None." -Otherwise, approximately all references to functions and non-local variables -will be marked as nonmatching (this has to do with the delinking process not -applying name mangling, which isn't expected to be fixed). +Otherwise, some references to non-local variables will be marked as nonmatching +(this is because it's sometimes not possible to make certain things named +variables in Ghidra, particularly thread-local storage, and other times it's +not possible to assign a fixed name to certain implicitly generated output in +the recompiled code). ### Using objdiff @@ -180,14 +201,13 @@ them. In the best case, corresponding functions in each file will have the same name and be in the same section, at which point objdiff can link them automatically. Otherwise, one has to click on one of the corresponding functions in one pane and the other function in the other pane to tell objdiff -to link them. Common cases of this are class methods (the names won't match) -and implicitly generated functions, such as exception handling code placed in -`.text$x` in the recompiled object file. Keep in mind that objdiff's matching -does not appear fully reliable in some cases, particularly when diffing data -with external pointers (which appear as `?? ?? ?? 
??`) that aren't explicitly -marked as non-matching but still somehow reduce the match percentage, so you'll -have to use a tiny amount of judgement to determine when you actually have a -match. +to link them. The most common cases of this are implicitly generated functions +and data, such as exception handling code placed in `.text$x` in the recompiled +object file. Be aware that objdiff's matching does not appear fully reliable +in some cases, particularly when diffing data with external pointers (which +appear as `?? ?? ?? ??`) that aren't explicitly marked as non-matching but +still somehow reduce the match percentage, so you'll have to use a tiny amount +of judgement to determine when you actually have a match. Clicking on a function that's been linked across both object files shows a diff of the disassembly of both versions of the function, with any differences @@ -197,8 +217,20 @@ reaches 100%. Depending on how you configure objdiff, it will rebuild automatically whenever you save a change to a source file, or you can manually rebuild with the "Build" button at the top of the right pane. -There are no concrete instructions to give for writing decompiled code. Try -importing headers from `decompile/src/` into Ghidra +When viewing and editing decompiled source files, be mindful of the +`// Status:` annotation above each function, which has the following meanings: +- `unimplemented`: The decompiled function does not yet reproduce the behaviour + of the original +- `nonmatching`: The decompiled function is believed to behave the same as the + original, but it does not fully match in objdiff +- `matching`: The decompiled function perfectly matches the original in objdiff +Be sure to update them as you decompile if appropriate. Some functions may +also have other annotations describing nontrivial effects of link-time code +generation (LTCG), such as a nonstandard calling convention or multiple +functions being merged into one. 
+ +Otherwise, there are no concrete instructions to give for writing decompiled +code. Try importing headers from `decompile/src/` into Ghidra (`File > Parse C Source...`) to get access to JSRF classes, and use Ghidra's decompilation of the function in the CodeBrowser as a starting point for writing your matching function, exercising whatever C++ and x86 assembly @@ -223,46 +255,11 @@ whole executable in Ghidra. ### Updating `symboltable.tsv` -If you have got a bunch of symbols you'd like to add to `symboltable.tsv`, a -workflow has been devised to generate it from your Ghidra project. Before -regenerating the table, however, make sure that you have all of it symbols -already in your project so that you don't end up deleting any. One option is -to import `symboltable.tsv` into your project with the `ImportSymbolsScript.py` -script as mentioned under "Creating a JSRF Ghidra Project," but be aware that -this will overwrite any names you've assigned to the same symbols. You will -also have to ensure that no two symbols share the same name. This can be -avoided by using namespaces if need be (i.e. `X::symbol` and `Y::symbol` may -coexist), but function overloading must be avoided (you may not have one -function with the signature `void X::f(int)` and another with the signature -`void X::f(float)`), else errors can arise when delinking, as the delinker -extension does not mangle symbol names. Thunked functions can also cause -problems because Ghidra does not include them alongside other functions in the -symbol table, so convert them to regular functions (right click on the thunked -function in the symbol tree and unset it as a thunk in the `Function` submenu). - -Once you're ready to export your symbols, open the symbol table -(`Window > Symbol Table`). 
Open the symbol filter window (cog button near the -top right), and uncheck everything but "User Defined" under "Symbol Source," -"Data Labels" and "Function Labels" under "Symbol Types," "Use Advanced -Filters," and "Non-Externals" under "Non-Externals." This ensures that you -only export symbols that you've defined and that are useful for delinking. - -Now we need to configure the columns that we want to export. Right-click on -one of the colum headers, click "Add/Remove Columns..." to open the "Select -Columns" window, and in it check only "Location," "Name," "Namespace," and -"Type." Click "OK" to close the window and ensure that the column order is -"Location," "Namespace," "Name," "Type" (you can drag the column headers to -reorder them if needed). - -Now, to actually export the table, right-click on one of the table cells, click -"Select All," and then right-click again on a cell to select "Export > Export -to CSV..." before selecting where to save your exported symbol table. - -The final step is converting this CSV file to the format expected by -`ImportSymbolsScript.py`. Open a shell in the repository's `ghidra/` directory -and run `make_symboltable.sh` with the path of your exported CSV as an -argument, and `symboltable.tsv` will be overwritten with a new table containing -your exported symbols. +If you have got a bunch of symbols you'd like to add to `symboltable.tsv`, you +can generate a new copy from your Ghidra project by running the +`EnhancedExport.java` script from the "Export" category. If you want to merge +the new table into the repository, make sure to take a look at the diff first +to ensure you're not inadvertently deleting anything. ### Updating `make_header.sh` @@ -314,12 +311,16 @@ correctly (exception-handling code might be appended onto another function, for example). 
Because `symboltable.tsv` should only be populated with symbols that have been manually defined as per the previous section, this means that you need to define variable names and labels in Ghidra for everything therein (and -ideally everything referenced externally, as well). Do try to maintain basic +ideally everything referenced externally, as well). Strive to maintain basic consistency with the rest of the codebase: functions and methods begin with lowercase letters, for instance, while class/struct/enum names begin with capital letters, and special methods like constructors and destructors should have the names they would have in real C++ code (i.e. `Class::Class` and -`Class::~Class`, respectively). +`Class::~Class`, respectively). Special class methods and members like +constructors and vtables must follow their established naming conventions for +our tooling to work properly. Also note that you can (mostly) disable name +mangling for a symbol by making it a member of the `extern_"C"` namespace, +which applies C-style name mangling as used by some symbols. Once an object is ready for extracting, its `Delink?` column should be set to `true` and the `objdiff.json` file in the `decompile/` directory should be