darkpanda
This thread is PART 2 of a tutorial about reverse-engineering CIV.EXE with the help of IDA. You can directly access other parts using the links below:
- PART 1
- PART 2
Note that this tutorial is split across several posts/threads because of this forum's limit of 30,000 characters per post.
STEP 3: First bits of assembly
3.1. First assembly instruction: move!
In the previous step, we ended up creating our first cross-reference, between a well-known game string "...Break treaty...", and seemingly the only portion of the code that uses this string.
Now we're going to look at this portion of the code in more detail:

The cross-reference we created is as follows (in Text view):
Rich (BB code):
seg011:0CD9 004 mov ax, offset aCancelAction_B ; "!\n Cancel action.\n Break treaty.\n"
This line is an atomic assembly instruction:
- We already know that seg011:0CD9 is the segmented address of the instruction
- Let's forget about 004 for the moment (but if you're really curious, it is the current stack height)
- Then there is mov which is the instruction's mnemonic, meaning a user-friendly way to remember what the instruction does
- Afterwards, ax, offset aCancelAction_B are the instruction's operands
- Finally, the line has an automatically generated comment which previews the string contents
Whichever way you start taking on assembly, the mov instruction will be one of the first you'll see: it's basically an assignment operation, meaning it assigns a value to... something.
In IDA, the direction of data between an instruction's operands is from right to left.
So it must be read: mov <destination>, <source>, which means "copy the value of <source> into <destination>".
Note: this convention is often called the Intel syntax, and follows most programming language conventions where variable initializations look like "a = 1".
Next, the <source> of our instruction is offset aCancelAction_B. Obviously, this means the offset of the value named aCancelAction_B - this is how IDA directly exposes the cross-reference to you. But in the actual code, as we saw before creating the cross-reference, this was just the plain hex offset value 2ADAh.
Such operands, which are plain hex values, are called immediate values, meaning that the operand value is directly hard-coded in the instruction.
Finally, the <destination> of this instruction is ax... So, what is that?
All you really need to know is that ax is a register, which is a tiny memory area located inside the processor.
Also, don't get confused by the mnemonic 'mov': what's really happening is a copy of the value from <source> to <destination>, not a move.
3.2. Second assembly instruction: push and the stack
3.2.1. The Hex view
Let's now look at the next instruction:
Rich (BB code):
seg011:0CDC 004 push ax
We see that:
- its segmented address is seg011:0CDC
- the stack height is the same: 004
- its mnemonic is push
- it has a single operand: ax
First, we see that this instruction's segmented address is 3 bytes after the previous instruction's address:
Rich (BB code):
seg011:0CD9 + 3 = seg011:0CDC
This is a good opportunity to discover the Hex view of the code, by clicking on the tab Hex view A:

This brings us to the hex view of the current code, with the currently selected instruction's bytes highlighted:

We can do the same with the previous instruction, mov ax, offset aCancelAction_B, by selecting it in the text view:

... and then showing the hex view:

Here we can see 2 things:
- Instructions can have a varying number of bytes: mov ax, offset aCancelAction_B has 3 bytes (B8 DA 2A), while push ax has just 1 byte (50)
- The currently selected instruction has all its bytes highlighted; this helps make out groups of bytes belonging to individual instructions when looking at the hex view
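To make the link between bytes and text more concrete, here is a minimal Python sketch that decodes the 4 bytes seen above. It is only an illustration, not how IDA works internally: it assumes the standard x86 encodings B8+r (mov r16, imm16) and 50+r (push r16), handles nothing else, and the function names decode and disassemble are made up for this example.

```python
# Toy decoder for the two encodings met so far (everything else is rejected).
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]  # x86 register numbering


def decode(code: bytes, pos: int = 0):
    """Return (text, length) for the instruction starting at code[pos]."""
    opcode = code[pos]
    if 0xB8 <= opcode <= 0xBF:
        # mov r16, imm16: the 16-bit immediate is stored low byte first
        imm = int.from_bytes(code[pos + 1:pos + 3], "little")
        return f"mov {REG16[opcode - 0xB8]}, {imm:04X}h", 3
    if 0x50 <= opcode <= 0x57:
        # push r16: a single byte, the register is part of the opcode
        return f"push {REG16[opcode - 0x50]}", 1
    raise ValueError(f"opcode {opcode:02X} not handled in this sketch")


def disassemble(code: bytes, offset: int):
    """Decode instructions one after the other, tracking their offsets."""
    lines = []
    pos = 0
    while pos < len(code):
        text, length = decode(code, pos)
        lines.append(f"{offset + pos:04X} {text}")
        pos += length
    return lines


listing = disassemble(bytes.fromhex("B8 DA 2A 50"), 0x0CD9)
for line in listing:
    print(line)
```

Note how the bytes DA 2A come out as 2ADAh: 16-bit values are stored with their low byte first, which is why the hex view looks "reversed" compared to the text view.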
Now let's look a little closer at the actual instruction: push ax
What does 'push <something>' do? The blunt answer is: it puts <something> at the top of the stack.
Quite clear isn't it? What is the 'stack', you ask? Well...
3.2.2. darkpanda's invaluable, public domain, yet very awkward, 'room metaphor' of processor architecture
Ok, I don't think there's any easy way around this one, so let's be creative... Consider this (very awkward) metaphor:
- the computer's processor is a very small room where at most 1 person can fit at any time.
- it's the only room that has light, thus the only room where people can work, and the only room that has air (I said 'very awkward'), thus the only room where people can speak.
- outside the room everything is black and empty, except for a light sign on the door which summons a person to come into the room
Now, all workers have a dedicated task, that they know how to perform considering given task requirements. For example:
- One worker is an 'artist', he knows how to draw: just tell him what he should draw, and where he should draw it (on a given piece of paper)
- Another worker is the 'art director': he doesn't know how to draw, but he decides all the things that should be drawn, and where they should be drawn
The art director is in the room: he needs something to be drawn, and for this he will need to call the artist... But how can he tell the artist what to draw and where when he cannot be in the same room at the same time or talk to him outside the room?
Well, here's what he does:
- He writes down what must be drawn on a piece of paper, and puts the piece of paper on top of a paper pile
- He also writes down where it should be drawn on a piece of paper, and puts the piece of paper on top of that very same paper pile
- He then lights up the door sign with the 'artist' name, for the artist to come in
- Just before leaving, he puts a last piece of paper on the pile with his own name
- Finally he leaves the room
Now the artist comes in the room:
- He takes the first paper at the top of the pile, and puts it aside for later
- He takes the next paper at the top of the pile, and by convention he knows that this is the position where he should start drawing
- He then takes the next paper on the pile, and knows this is what he should draw
- He draws...
- When he's finished, he looks at the first paper he put aside before and, still by convention, knows he should summon into the room the person whose name is on the paper: he lights up his name on the room door
- He leaves the room
The art director comes back into the room, and writes new pieces of paper for further things to draw, etc.
If you didn't understand the example above, feel free to comment or contact me privately.
Otherwise, let me clarify what the metaphor elements are:
- Obviously, the room is the processor (somehow)
- The workers, as you may have guessed, are program functions, also called sub-routines in assembly (and in IDA)
- The pile of paper is, you've guessed it, the stack
- Finally, the pieces of paper are, respectively, the function's arguments, and the caller ID
3.2.3. The stack
More formally, the stack is a memory area whose main purpose is to implement the procedural programming paradigm. In this paradigm, different parts of the program - the functions - interact with each other by exposing their signature (input arguments and output type) while hiding their internal mechanism from the other functions. In short, the stack is essentially used for function calls (though not only for that, but never mind...)
Getting back to our assembly instruction, we saw that push <something> is used to put something at the top of the stack.
Conversely, getting what's currently on top of the stack is done using the instruction pop <somewhere>, which takes the top-of-the-stack value, removes it from the stack, and puts it into <somewhere>.
To make more sense out of those mnemonics, you may as well see the stack as a spring-loaded plastic toy-gun: you push ammo through the barrel, arming the spring, and then you pop it with the trigger, the last ammo pushed in being the first one to pop out...
Note that the name 'stack' was chosen to illustrate this 'add on top'/'take from top' access mechanism, which is more formally known as 'Last In, First Out' or LIFO.
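If you know a bit of Python, the LIFO behaviour can be tried out with a plain list, taking the end of the list as the top of the stack. This is just a sketch of the access order, with the two values borrowed from the code we are studying; it says nothing yet about how the processor stores the stack.

```python
stack = []                 # an empty stack; the end of the list is the "top"

stack.append(0x2ADA)       # push 2ADAh
stack.append(0xC926)       # push 0C926h

first_out = stack.pop()    # pop: the last value pushed comes out first
second_out = stack.pop()   # pop again: now the older value comes out

print(f"{first_out:04X}h then {second_out:04X}h")
```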
3.2.4. The CPU view of the stack
The x86 processor has an explicit notion of where the stack is in memory, through the values stored in 2 other registers:
- ss, which stands for stack segment, is the segment base of the stack in memory
- sp, which stands for stack pointer, is the offset of the top of the stack relative to the stack segment base (ss)
Note that the stack grows downward: adding values on the stack (with push) makes sp decrease, while removing values from the stack (with pop) makes sp increase.
This may look very strange, until you discover that the stack is located at the end of the program's memory, as shown below:

The white area between the program's data and the stack is unused memory, in which the stack can expand more or less freely as functions are being called, and as those calls add arguments on the stack. You may notice that if too many function calls are nested, the stack may use up all the unused memory and run into the program's own memory... This typically happens with programming bugs where functions keep calling themselves forever. This results in an error called, you guessed it, a stack overflow.
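Here is a small Python sketch of how push and pop move sp around, for those who prefer code to diagrams. It is a toy model under simple assumptions: the stack segment is a single 64 KB block, values are 16-bit words stored low byte first, and the starting value of sp (FFFEh) is hypothetical. The class name Stack16 is made up for this example.

```python
class Stack16:
    """Toy model of a 16-bit stack: one 64 KB segment, sp is an offset into it."""

    def __init__(self, sp: int = 0xFFFE):
        self.mem = bytearray(0x10000)
        self.sp = sp  # hypothetical starting value, near the end of the segment

    def push(self, value: int) -> None:
        self.sp = (self.sp - 2) & 0xFFFF                  # sp decreases first...
        self.mem[self.sp:self.sp + 2] = value.to_bytes(2, "little")  # ...then the word is stored

    def pop(self) -> int:
        value = int.from_bytes(self.mem[self.sp:self.sp + 2], "little")
        self.sp = (self.sp + 2) & 0xFFFF                  # sp increases after the read
        return value


s = Stack16()
s.push(0x2ADA)
after_one = s.sp           # sp went down by 2
s.push(0xC926)
after_two = s.sp           # and down by 2 again
top = s.pop()              # the last word pushed comes back, sp goes up by 2
print(f"{after_one:04X} {after_two:04X} {top:04X} {s.sp:04X}")
```

The 2-byte steps are the same ones IDA tracks in its stack height column, only counted in the opposite direction.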
3.3. Gearing up with the 3 next assembly instructions: call a sub-routine!
Now we're going to look at the next 3 instructions:
Rich (BB code):
Bytes: | Text view:
B8 DA 2A | seg011:0CD9 004 mov ax, offset aCancelAction_B
50 | seg011:0CDC 004 push ax
B8 26 C9 | seg011:0CDD 006 mov ax, 0C926h
50 | seg011:0CE0 006 push ax
9A 26 1E 58 20 | seg011:0CE1 008 call sub_223A6
We already know the first 2 of these new instructions, since they work exactly like the previous 2: mov a value into ax, then push the value of ax on top of the stack.
You may notice that the current stack height has increased by 2, from 4 to 6, after the previous push: since ax is a 16-bit register (did I mention that?), it contains 2 bytes, and when pushing it onto the stack, the stack height thus increases by 2.
Similarly, after the second push, the stack height increases from 6 to 8.
Now what we're REALLY interested in is the final instruction:
seg011:0CE1 008 call sub_223A6
Since you have already mastered the 'room metaphor' above, you already guessed that the call instruction is performing a sub-routine call, otherwise known as a function call.
You can also see that IDA has a named cross-reference for the called sub-routine, sub_223A6, derived from the sub-routine's absolute address. Let's double-click the sub-routine name to get a glance at where it takes us:

Let's not bother right now with the detailed contents of this sub-routine. Instead, we're going to look at the calls to this sub-routine, by selecting the sub-routine name and pressing CTRL+X:

You see that IDA already detected 249 different calls to this sub-routine - which is quite a lot! This gives us a hint that this sub-routine must be quite a generic low-level function...
For the sake of it, we'll just double-click randomly on one of the other calls to this sub-routine in the calls list:

Quite interestingly, we stumble upon the same assembly block as the one we were just analyzing... The only difference is the value of the immediate operand of the first mov instruction. Even the second mov has the same operand as before.
We're quite safe to deduce that every such assembly block is a call to a function that takes a string and 'something else' as its arguments.
Because the stack access mechanism is LIFO, the first argument of the called sub-routine is actually the last value to be pushed onto the stack. So in our situation, we have:
- function sub_223A6
- argument 1: 'something' (always 0C926h)
- argument 2: offset to a string
From wherever you are, press ESC to get back to sub_223A6 (or double-click on the cross-reference), to have a second look at the sub-routine code:

You see that IDA did identify that the sub-routine has 2 arguments, represented at the beginning of the sub-routine. Note that this part of the disassembly is not part of the original code, but rather meta-code information generated by IDA for better readability.
From what we discovered in the last step, let's cross-reference other strings used as arguments to this function:
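- at address seg003:0706
- at address seg003:0C86
- at address seg003:139E
- at address seg007:4CB9
- at address seg007:4CDF
Manually creating such cross-references is actually a large part of the analysis process.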
STEP 4: A complete sub-routine
But what is this 0C926h, and more importantly, what is this function?
In this step, we're going to completely explain the sub-routine we identified previously, as an exercise to learn a handful of assembly instructions.
First, let's have a look at the entire sub-routine again:

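It is quite small compared to other gigantic functions, but it still contains interesting material.
We see that its address is seg019:1E26. As a matter of fact, IDA determined that this is where the sub-routine started precisely because it found call instructions referring to this address.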
We also see that its ending address is seg019:1E64: IDA determined it by following the control flow of the sub-routine, until it reached the typical final instruction of a sub-routine: retf.
We will see later that IDA is not always very efficient in finding the proper boundaries of sub-routines, and we often have to manually indicate where the end of a sub-routine is...
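As a side note, the start address seg019:1E26 and the name sub_223A6 can be tied together with the bytes of the call instruction seen in step 3.3 (9A 26 1E 58 20). The Python sketch below assumes the usual real-mode rule (absolute address = segment * 16 + offset) and the standard far call encoding (opcode 9A, then offset, then segment, each stored low byte first). The function names are made up for this example, and the segment value 2058h is simply read from the bytes shown in the listing.

```python
def far_call_target(code: bytes):
    """Decode a 5-byte far call (opcode 9A) into (segment, offset)."""
    if len(code) < 5 or code[0] != 0x9A:
        raise ValueError("not a 5-byte far call")
    offset = int.from_bytes(code[1:3], "little")
    segment = int.from_bytes(code[3:5], "little")
    return segment, offset


def linear_address(segment: int, offset: int) -> int:
    """Real-mode rule: segment * 16 + offset, kept on 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


segment, offset = far_call_target(bytes.fromhex("9A 26 1E 58 20"))
name = f"sub_{linear_address(segment, offset):X}"
print(f"{segment:04X}:{offset:04X} -> {name}")
```

The offset part, 1E26h, is the same one shown in seg019:1E26, and the computed absolute address gives back the name IDA generated.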
4.1. Stack frame management (part 1)
Let's look at the first two instructions, as well as the last two instructions:
Rich (BB code):
Bytes:| Text view:
55 | seg019:1E26 000 push bp
8B EC | seg019:1E27 002 mov bp, sp
...
5D | seg019:1E63 002 pop bp
CB | seg019:1E64 000 retf
In plain English it means:
- push bp: put the value of 'bp' onto the stack
- mov bp, sp: copy the value of 'sp' into 'bp'
- ...
- pop bp: put the value at the top of the stack into 'bp'
- retf: terminate function, i.e. go back to the calling function
First, you've guessed it: bp is another register featured in the x86 architecture. Its name stands for base pointer.
You will find those 4 instructions, 2 at the start and 2 at the end, in nearly every sub-routine. Quite often, there will be one additional instruction at the beginning of a sub-routine (sub sp, <xx>) but let's keep it for later.
The basic principle of the base pointer is that it is used by a sub-routine as its memory reference point: it is its 'base' to access the sub-routine's local memory, which contains its arguments, and possibly its local variables. We will see this in further instructions below.
This memory area is called the function's stack frame.
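To give an idea of what such a stack frame looks like, here is a Python sketch that replays the caller's pushes, the far call and the first two instructions of the sub-routine, then reads the stack relative to bp. It is a toy model under stated assumptions: the starting values of sp and bp and the return segment 1234h are hypothetical, and the far call is modelled as two pushes (return segment, then return offset), which is what the processor does for this kind of call.

```python
mem = bytearray(0x10000)              # one 64 KB stack segment
regs = {"sp": 0x0F00, "bp": 0x0F10}   # hypothetical starting values


def push(value: int) -> None:
    """Decrease sp by 2, then store a 16-bit word at the new top of the stack."""
    regs["sp"] = (regs["sp"] - 2) & 0xFFFF
    mem[regs["sp"]:regs["sp"] + 2] = value.to_bytes(2, "little")


def word_at(offset: int) -> int:
    """Read a 16-bit word from the stack segment."""
    return int.from_bytes(mem[offset:offset + 2], "little")


# Caller side (seg011)
push(0x2ADA)              # push ax  (offset aCancelAction_B)
push(0xC926)              # push ax  (0C926h)
push(0x1234)              # far call, part 1: return segment (hypothetical value)
push(0x0CE6)              # far call, part 2: return offset = 0CE1h + 5 bytes

# Callee side (seg019:1E26)
push(regs["bp"])          # push bp
regs["bp"] = regs["sp"]   # mov bp, sp

frame = {d: word_at(regs["bp"] + d) for d in (0, 2, 4, 6, 8)}
for delta, value in frame.items():
    print(f"[bp+{delta}] = {value:04X}h")
```

In this model, the saved bp sits at [bp+0], the return address at [bp+2] and [bp+4], and the two arguments at [bp+6] (the last one pushed) and [bp+8] (the first one pushed).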
4.2. More registers
If we look at the 2 subsequent instructions in the sub-routine (as well as at the 2 instructions before the last 2), we discover a handful of new registers:
Rich (BB code):
Bytes:| Text view:
55 | seg019:1E26 000 push bp
8B EC | seg019:1E27 002 mov bp, sp
8B D7 | seg019:1E29 002 mov dx, di
8B DE | seg019:1E2B 002 mov bx, si
...
8B F3 | seg019:1E5F 002 mov si, bx
8B FA | seg019:1E61 002 mov di, dx
5D | seg019:1E63 002 pop bp
CB | seg019:1E64 000 retf
First up, we have 2 mov instructions that copy values between registers:
- di gets copied into dx,
- and si copied into bx.
Those 4 registers are very similar to ax by nature: they contain 16 bits (2 bytes) and can be manipulated freely by many instructions.
Also, di and si have the added privilege of being used in more specific ways by some instructions. Typically, string manipulation instructions will use si as the character position in the source string, and di as the character position in the destination string. This is what their names actually stand for:
- si = source index
- di = destination index
Then, looking at the last 4 instructions, we see that right before the end of the sub-routine, si and di recover the values they had before the sub-routine started:
- dx gets copied back into di,
- and bx copied back into si.
We're only missing one last register, cx, to complete the happy family of the eight 16-bit general purpose registers of the x86 architecture:
- ax
- bx
- cx
- dx
- bp = "base pointer", pointer to sub-routine's stack frame
- si = "source index", source pointer for string operations
- di = "destination index", destination pointer for string operations
- sp = "stack pointer", pointer to the current top of the stack
Those are called "general purpose" in the sense that most instructions can modify their value or use them as they see fit, which is not the case of other registers, that we will see below.
4.3. Segment registers
Let's continue with the next 2 instructions:
Rich (BB code):
Bytes:| Text view:
55 | seg019:1E26 000 push bp
8B EC | seg019:1E27 002 mov bp, sp
8B D7 | seg019:1E29 002 mov dx, di
8B DE | seg019:1E2B 002 mov bx, si
8C D8 | seg019:1E2D 002 mov ax, ds
8E C0 | seg019:1E2F 002 mov es, ax
| seg019:1E31 assume es:dseg
...
8B F3 | seg019:1E5F 002 mov si, bx
8B FA | seg019:1E61 002 mov di, dx
5D | seg019:1E63 002 pop bp
CB | seg019:1E64 000 retf
Ha! Here are an additional 2 registers we didn't see before: ds and es... But there is something strange here: the first instruction copies ds into ax, and then the second instruction copies ax into es...
Why not directly copy ds into es?
Well, this is what I stated above: ds and es are not general purpose registers, they are segment registers.
As such, it is impossible to copy one directly into the other using mov. However, it is still possible to copy the value of one into the other by going through an intermediate general purpose register such as ax.
They are called segment registers because their values are used by the processor as segment bases:
- ds = data segment
- es = extra segment
While we're at it, let's bring over the other 2 members of the good old 16-bit segment registers family, cs and ss:
- cs = code segment
- ss = stack segment
After previous explanations about the stack, you already know that ss contains the stack's segment base.
But what about the 3 others?
Well, their names certainly give a clue:
- the code segment register cs contains the segment base of the code being currently executed
- the data segment register ds contains the segment base of the default data memory area to use for data operations
- the extra segment register es contains the segment base of... well, an extra data memory area to use together with ds for multi-data operations
Tutorial continues in next post