Lua, RIFF, Source Code and Bytecode

It was around 1996, after the launch of Windows 95, that many of us first saw the new .ani format for animated cursors, be it the walking dinosaur, the drum, the metronome or the hourglass. This was quite new and interesting territory at the time. The .ani file was nothing more than a series of .ico files put together into the ANI file format, a rather specialized form of the RIFF format. Many of you might not have heard of RIFF; in fact the RIFF format might be older than many of today's developers. My interactions with this file format were via applications on Windows 3.1; I am not 100% sure whether that was Painter, or BitEdit and PalEdit, which also offered saving in RIFF. RIFF simply stands for Resource Interchange File Format. It is a generic format that can encapsulate a lot of other files or data in what the format calls chunks and sub-chunks.

It is in fact a very simple file format, almost like a file index on a filesystem. Each chunk has an identifier, followed by the size and then the data, which in turn can contain further chunks or sub-chunks. Even in the days of 16-bit systems these were WORD aligned (a byte was 8 bits and a WORD 16 bits), which simply means that every chunk starts on an even offset: when the data has an odd length, a trailing 00 (an extra byte) is added for padding.

Common Usages

ANI - Animated Cursors
AVI - Video (Audio Video Interleave)
PAL - Palette information
MID/RMI - MIDI sound/tracks (the RIFF-wrapped RMID variant)
WAV - Wav, raw digital audio file

Format

ID     4 bytes    a four character identifier generally padded with space
SIZE   4 bytes    size of the data
DATA   size_bytes The data
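
To make the ID / SIZE / DATA layout concrete, here is a minimal Python sketch that builds and walks a tiny RIFF-style buffer. The form type "DEMO" and the chunk names "one " and "two " are made up for illustration, and the helper names iter_chunks and make_chunk are not part of any library; sizes are assumed to be little-endian 32-bit values, as they are in RIFF.

```python
import struct

def iter_chunks(data, offset=0, end=None):
    """Yield (chunk_id, payload) for each chunk found in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        # ID is 4 ASCII characters, SIZE is a little-endian 32-bit integer.
        chunk_id, size = struct.unpack_from("<4sI", data, offset)
        payload = data[offset + 8 : offset + 8 + size]
        yield chunk_id.decode("ascii"), payload
        # Chunks are WORD aligned: skip one pad byte after odd-sized data.
        offset += 8 + size + (size & 1)

def make_chunk(chunk_id, payload):
    """Build one chunk, adding the pad byte when the payload length is odd."""
    pad = b"\x00" if len(payload) % 2 else b""
    return struct.pack("<4sI", chunk_id, len(payload)) + payload + pad

# Hypothetical sample: a RIFF container of form type "DEMO" with two sub-chunks.
body = b"DEMO" + make_chunk(b"one ", b"abc") + make_chunk(b"two ", b"wxyz")
riff = make_chunk(b"RIFF", body)

for chunk_id, payload in iter_chunks(riff):
    form_type, children = payload[:4], payload[4:]
    print(chunk_id, form_type.decode("ascii"))
    for sub_id, sub_payload in iter_chunks(children):
        print("  ", repr(sub_id), sub_payload)
```

Note that the 3-byte payload "abc" occupies 4 bytes on disk because of the pad byte, while its SIZE field still says 3.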

Here (http://www.johnloomis.org/cpe102/asgn/asgn1/riff.html) and here (http://www.daubnet.com/en/file-format-riff) are some wonderful explanations about RIFF. This article is not about RIFF and animated cursors, but about how it is used (although adapted or modified) in practical applications.

If you are working with Lua, you may well be aware of two wonderful Lua utilities, luac and luadec. luac is a compiler that converts Lua code into bytecode, much like Java, Python and even .NET, and it is equally easy to decompile that bytecode with luadec and get output close to the original source. In fact, if you are a bit more adventurous and/or used to working with assembly, the Lua disassembler (also available in luadec using the -dis option) will feel close to home. The issue this brings is that the source code is not secure: it can be decompiled, and to a large extent the source file can be rebuilt.

Building the bytecode

When we compile a simple line like
 local a = 1
it is converted into a series of instructions that the Lua VM can understand. Disassembled, the same line looks like
; x86 standard (32-bit, little endian, doubles)

; function [0] definition (level 1)
; 0 upvalues, 0 params, 2 stacks
.function  0 0 2 2
.local  "a"  ; 0
.const  1  ; 0
[1] loadk      0   0        ; 1
[2] return     0   1      
; end of function

This function contains no nested function definitions and has 0 upvalues, 0 params and a stack size of 2. There is one local variable, called "a", at index 0, and one constant, at index 0, with the value 1. The instruction loadk loads the constant with index Bx into register A, so the instruction
loadk 0 0

just says: load the constant at index 0 into register 0 (register 0 is the local "a", and constant 0 has the value 1), so this is our equivalent of local a=1

and in the compiled file, it will look like

[Image: hex dump of the compiled bytecode file]

This is the hex dump of the bytecode file. You can see that all compiled Lua files start with a signature that identifies how the file was compiled, along with other attributes.

HEADER       4 bytes    ESC + Lua or 0x1B4C7561
VERSION      1 byte     Q or 0x51 (version 5.1)
FORMAT       1 byte     0 = Official Version
ENDIANNESS   1 byte     0 = Big Endian, 1 = Little Endian
SIZEOFINT    1 byte     Default 4
SIZE_T       1 byte     Default 4
SIZE_INSTR   1 byte     Default 4
SIZE_NUMBER  1 byte     Default 8
INTEGRALFLAG 1 byte     0=Floating Point, 1=Integral number type

so on an x86 platform, the default header will look like
1B4C7561 51000104 04040800

This chunk header is always 12 bytes long. It is checked to determine whether the chunk can be loaded: if all 12 bytes do not match the header expected on the platform, the chunk cannot run.
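
As a rough illustration of that check, the Python sketch below splits the 12 header bytes quoted above into the fields from the table and compares a chunk against an expected platform header. The function names parse_header and is_loadable are invented for this example; the real check lives inside the Lua loader.

```python
import struct

# The 12 header bytes quoted in the text for an x86 build of Lua 5.1.
HEADER = bytes.fromhex("1B4C7561" "51000104" "04040800")

FIELDS = ("version", "format", "endianness", "sizeof_int", "sizeof_size_t",
          "sizeof_instruction", "sizeof_number", "integral_flag")

def parse_header(data):
    """Split the 12-byte chunk header into named one-byte fields."""
    if len(data) < 12 or data[:4] != b"\x1bLua":
        raise ValueError("not a Lua bytecode chunk")
    return dict(zip(FIELDS, struct.unpack_from("8B", data, 4)))

def is_loadable(data, expected=HEADER):
    """Mimic the check described above: all 12 bytes must match the platform."""
    return data[:12] == expected

info = parse_header(HEADER)
for name in FIELDS:
    print("%-20s %d" % (name, info[name]))

# A chunk compiled for a big-endian platform differs in the endianness byte.
big_endian = HEADER[:6] + b"\x00" + HEADER[7:]
print(is_loadable(HEADER), is_loadable(big_endian))
```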

This is followed by a top level function chunk, which has the structure as follows
Source Name              STRING
Line defined             INTEGER
Last line defined        INTEGER
No. of Upvalues          1 BYTE
No. of Params            1 BYTE
is_vararg Flag           1 BYTE
Max Stack Size           1 BYTE
List of Instructions     LIST
List of Constants        LIST
List of Functions Proto  LIST
Source Line Positions    LIST
List of Locals           LIST
List of Upvalues         LIST

Where STRING is actually a structure that has two elements
SIZE_T    String Data Size
BYTES     The string data, terminated with a NUL (ASCII 0)

The source name is generally populated only in the top-level function; in the rest, the source name has a size_t of value 0.
If you refer to the hex dump above, you can see that the bytes after the first 12 (the header bytes) are
0A000000                This is 10 (decimal), the string size including the trailing NUL
406D6169 6E2E6C75 6100  This equates to @main.lua\0
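
The STRING structure can be read with a few lines of Python. This is only a sketch: read_string is a made-up helper, and it assumes the little-endian, 4-byte size_t of the x86 header shown earlier.

```python
import struct

def read_string(data, offset, size_t_bytes=4):
    """Return (text_or_None, next_offset) for a STRING stored at offset."""
    fmt = "<I" if size_t_bytes == 4 else "<Q"
    (size,) = struct.unpack_from(fmt, data, offset)
    offset += size_t_bytes
    if size == 0:
        # A size of 0 means there is no string at all (e.g. nested functions).
        return None, offset
    raw = data[offset : offset + size]
    # The stored size counts the trailing NUL, so drop the last byte.
    return raw[:-1].decode("latin-1"), offset + size

# The 14 bytes that follow the header in the dump discussed above.
blob = bytes.fromhex("0A000000" "406D6169" "6E2E6C75" "6100")
name, next_offset = read_string(blob, 0)
print(name, next_offset)
```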

The Instructions List is defined as follows
INTEGER      Size of code
INSTRUCTION  VM Instructions
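
Each instruction is one 32-bit word. The Python sketch below decodes the two instruction words of the local a = 1 example (01000000 and 1E008000, as they appear in the annotated dump further down). It assumes the Lua 5.1 field layout, which this article does not spell out: a 6-bit opcode in the lowest bits, then an 8-bit A, a 9-bit C and a 9-bit B, with Bx covering the B and C bits together. Only the opcodes used in this article are named; decode and OPNAMES are names invented for the example.

```python
import struct

# Opcode numbers for the instructions used in this article (Lua 5.1 numbering).
OPNAMES = {1: "loadk", 7: "setglobal", 30: "return"}

def decode(word_bytes):
    """Decode one 4-byte little-endian instruction into its fields."""
    (word,) = struct.unpack("<I", word_bytes)
    opcode = word & 0x3F
    return {
        "op": OPNAMES.get(opcode, "op%d" % opcode),
        "A": (word >> 6) & 0xFF,
        "C": (word >> 14) & 0x1FF,
        "B": (word >> 23) & 0x1FF,
        "Bx": (word >> 14) & 0x3FFFF,   # B and C read as one 18-bit field
    }

for raw in ("01000000", "1E008000"):
    print(raw, decode(bytes.fromhex(raw)))
```

loadk uses the A and Bx fields, while return uses A and B, which is why the disassembler prints "loadk 0 0" and "return 0 1".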

The Constant List is defined as follows
INTEGER    Size of Constant List
[
1 byte     Type of constant
Constant   The constant itself
]
Where the types of constants are
0 = LUA_TNIL
1 = LUA_TBOOLEAN
3 = LUA_TNUMBER
4 = LUA_TSTRING
The constant field does not exist if the constant type is 0; it is a single byte of 0 or 1 if the type is 1, a number if the type is 3 and a string if it is 4. Numbers are in IEEE 754 64-bit double format and are endian-sensitive.
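
A reader for the constant list might look like the following Python sketch. It assumes the x86 layout (little-endian, 4-byte int and size_t, 8-byte double) and uses type 3 for numbers, as the dump in this article does. The sample bytes are hand-assembled for illustration from the two constants of a = 1 (the string "a" and the number 1), and read_constants is a made-up helper.

```python
import struct

def read_constants(data, offset=0):
    """Return (constants, next_offset) for a constant list starting at offset."""
    (count,) = struct.unpack_from("<i", data, offset)
    offset += 4
    constants = []
    for _ in range(count):
        ctype = data[offset]
        offset += 1
        if ctype == 0:                      # LUA_TNIL: no payload
            constants.append(None)
        elif ctype == 1:                    # LUA_TBOOLEAN: one byte, 0 or 1
            constants.append(data[offset] != 0)
            offset += 1
        elif ctype == 3:                    # LUA_TNUMBER: 8-byte double
            (number,) = struct.unpack_from("<d", data, offset)
            constants.append(number)
            offset += 8
        elif ctype == 4:                    # LUA_TSTRING: size_t, bytes, NUL
            (size,) = struct.unpack_from("<I", data, offset)
            offset += 4
            constants.append(data[offset : offset + size - 1])
            offset += size
        else:
            raise ValueError("unexpected constant type %d" % ctype)
    return constants, offset

# Hand-assembled sample: two constants, the string "a" and the number 1.
sample = bytes.fromhex(
    "02000000"
    "04" "02000000" "6100"
    "03" "000000000000F03F"
)
constants, end = read_constants(sample)
print(constants, end)
```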

Function prototype List is defined as follows
INTEGER      Number of function prototypes
[Functions]  The function prototype bytecode data

This is followed by a source line position list, which gives the source line number for each instruction in a function. This information is used by error handlers or debuggers.
INTEGER     Size of source line position list
[INTEGER]   list index corresponds to the instruction position

Local list: each local variable has three fields, a string and two integers.
INTEGER     size of local list
[
 STRING     Name of local variable
 INTEGER    Start of local variable scope
 INTEGER    End of local variable scope
]

and lastly the upvalue list
INTEGER     Size of upvalue list
[
 STRING    Name of upvalue
]

Now if we have a look at the hex dump above,
0000                     ** global header start **
0000  1B4C7561           header signature: "\27Lua"
0004  51                 version (major:minor hex digits)
0005  00                 format (0=official)
0006  01                 endianness (1=little endian)
0007  04                 size of int (bytes)
0008  04                 size of size_t (bytes)
0009  04                 size of Instruction (bytes)
000A  08                 size of number (bytes)
000B  00                 integral (1=integral)
                         * number type: double
                         * x86 standard (32-bit, little endian, doubles)
                         ** global header end **
                         
000C                     ** function [0] definition (level 1)
                         ** start of function **
000C  0A000000           string size (10)
0010  406D61696E2E6C75+  "@main.lu"
0018  6100               "a\0"
                         source name: @main.lua
001A  00000000           line defined (0)
001E  00000000           last line defined (0)
0022  00                 nups (0)
0023  00                 numparams (0)
0024  02                 is_vararg (2)
0025  02                 maxstacksize (2)
                         * code:
0026  02000000           sizecode (2)
002A  01000000           [1] loadk      0   0        ; 1
002E  1E008000           [2] return     0   1      
                         * constants:
0032  01000000           sizek (1)
0036  03                 const type 3
0037  000000000000F03F   const [0]: (1)
                         * functions:
003F  00000000           sizep (0)
                         * lines:
0043  02000000           sizelineinfo (2)
                         [pc] (line)
0047  01000000           [1] (1)
004B  01000000           [2] (1)
                         * locals:
004F  01000000           sizelocvars (1)
0053  02000000           string size (2)
0057  6100               "a\0"
                         local [0]: a
0059  01000000             startpc (1)
005D  01000000             endpc   (1)
                         * upvalues:
0061  00000000           sizeupvalues (0)
                         ** end of function **

0065                     ** end of chunk **


Sidenote: according to the IEEE 754 number format, the following numbers are represented as (in 64 bits)
Number  Sign[1]   Exponent[11]       Significand[52]
 1      0 (+)     01111111111 (0)    1.0000000000000000000000000000000000000000000000000000 (1.00)  0x3FF0000000000000 
 2      0 (+)     10000000000 (+1)   1.0000000000000000000000000000000000000000000000000000 (1.00)  0x4000000000000000 
 3      0 (+)     10000000000 (+1)   1.1000000000000000000000000000000000000000000000000000 (1.50)  0x4008000000000000 
 4      0 (+)     10000000001 (+2)   1.0000000000000000000000000000000000000000000000000000 (1.00)  0x4010000000000000 
 5      0 (+)     10000000001 (+2)   1.0100000000000000000000000000000000000000000000000000 (1.25)  0x4014000000000000 
 6      0 (+)     10000000001 (+2)   1.1000000000000000000000000000000000000000000000000000 (1.50)  0x4018000000000000 
 7      0 (+)     10000000001 (+2)   1.1100000000000000000000000000000000000000000000000000 (1.75)  0x401C000000000000 
 8      0 (+)     10000000010 (+3)   1.0000000000000000000000000000000000000000000000000000 (1.00)  0x4020000000000000 
 9      0 (+)     10000000010 (+3)   1.0010000000000000000000000000000000000000000000000000 (1.125) 0x4022000000000000
10      0 (+)     10000000010 (+3)   1.0100000000000000000000000000000000000000000000000000 (1.25)  0x4024000000000000 
11      0 (+)     10000000010 (+3)   1.0110000000000000000000000000000000000000000000000000 (1.375) 0x4026000000000000
-1      1 (-)     01111111111 (0)    1.0000000000000000000000000000000000000000000000000000 (1.00)  0xBFF0000000000000
So, you can see that the number 1 is represented as 0x3FF0000000000000, and in the dump above it is displayed as 000000000000F03F because the file stores the bytes in little-endian order.
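
The table can be reproduced with Python's struct module. This is a small sketch that only handles normal (non-zero, non-denormal) numbers like the ones listed; double_fields is a made-up helper name.

```python
import struct

def double_fields(value):
    """Return (sign, unbiased exponent, significand, big-endian hex)."""
    big = struct.pack(">d", value)
    bits = int.from_bytes(big, "big")
    sign = bits >> 63
    exponent = ((bits >> 52) & 0x7FF) - 1023     # remove the bias of 1023
    fraction = bits & ((1 << 52) - 1)
    significand = 1 + fraction / 2 ** 52         # implicit leading 1
    return sign, exponent, significand, big.hex().upper()

for n in (1, 2, 3, 8, 11, -1):
    sign, exponent, significand, big_hex = double_fields(float(n))
    # The same value as it is stored in a little-endian bytecode file.
    little_hex = struct.pack("<d", float(n)).hex().upper()
    print(n, sign, exponent, significand, "0x" + big_hex, little_hex)
```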

As you can see, it is not very difficult to understand how Lua bytecode works, and in many cases a simple compiled bytecode file can be converted back into source code. On Windows this is easily achieved. On the Mac, however, there are issues between the 32-bit and 64-bit versions, luadec is not available in compiled/binary form, and compiling luadec yourself is a bit of a pain and in many cases frustrating.

If we were not to use the local and instead just used
 a = 1

it would compile to

; x86 standard (32-bit, little endian, doubles)

; function [0] definition (level 1)
; 0 upvalues, 0 params, 2 stacks
.function  0 0 2 2
.const  "a"  ; 0
.const  1  ; 1
[1] loadk      0   1        ; 1
[2] setglobal  0   0        ; a
[3] return     0   1      
; end of function

As you can see, an extra instruction, setglobal, is added to the mix: the value is first loaded into register 0 and then stored in the global whose name is constant 0 ("a").

and if we were to have two variables, one number and one string,
a,b = 1,"ball"
it would look like
; x86 standard (32-bit, little endian, doubles)

; function [0] definition (level 1)
; 0 upvalues, 0 params, 2 stacks
.function  0 0 2 2
.const  "a"  ; 0
.const  "b"  ; 1
.const  1  ; 2
.const  "ball"  ; 3
[1] loadk      0   2        ; 1
[2] loadk      1   3        ; "ball"
[3] setglobal  1   1        ; b
[4] setglobal  0   0        ; a
[5] return     0   1      
; end of function

and a combination of local and global would look like
local a= 1
b = "ball"

and then

; x86 standard (32-bit, little endian, doubles)

; function [0] definition (level 1)
; 0 upvalues, 0 params, 2 stacks
.function  0 0 2 2
.local  "a"  ; 0
.const  1  ; 0
.const  "b"  ; 1
.const  "ball"  ; 2
[1] loadk      0   0        ; 1
[2] loadk      1   2        ; "ball"
[3] setglobal  1   1        ; b
[4] return     0   1      
; end of function

Nevertheless, the source code can be decompiled or disassembled; those who want to, will. Still, compiling keeps your source code away from casual prying eyes. The second problem is that when you have several Lua files, they can all be compiled into bytecode, but managing so many of them can get a bit difficult. Some documentation suggests that you can use a command like

luac -o myapplua.out main.lua lib1.lua lib2.lua lib3.lua

(luac writes to luac.out unless -o names the output file), whereas other documentation suggests that if you have dependencies, you compile them first, so the same would look like
luac -o myapplua.out lib1.lua lib2.lua lib3.lua main.lua

To manage that, one suggested way is to place all of the compiled Lua files into a single file, as in the RIFF format or the ZIP format, where each file's local header looks like
Header                 4 bytes   (0x04034B50)
version required       2 bytes
general purpose flag   2 bytes
compression method     2 bytes
last mod file time     2 bytes
last mod file date     2 bytes
crc-32                 4 bytes
compressed size        4 bytes
uncompressed size      4 bytes
filename length        2 bytes
extra field length     2 bytes

file name (variable size)
extra field (variable size)
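
To see this header in a real archive, the Python sketch below writes a one-file ZIP into memory with the standard zipfile module and then reads the fixed part of the first local file header back with struct. The file name and contents are placeholders, and the archive is stored uncompressed so that the data can be compared directly.

```python
import io
import struct
import zipfile
import zlib

# Build a tiny archive in memory; ZIP_STORED keeps the data uncompressed.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("main.lua", "local a = 1\n")
data = buf.getvalue()

# Fixed part of the local file header: the 11 fields listed above, 30 bytes.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
(signature, version, flags, method, mod_time, mod_date,
 crc32, compressed_size, uncompressed_size,
 name_length, extra_length) = LOCAL_HEADER.unpack_from(data, 0)

name_start = LOCAL_HEADER.size
name = data[name_start : name_start + name_length].decode("ascii")
data_start = name_start + name_length + extra_length
payload = data[data_start : data_start + compressed_size]

print(hex(signature), name, compressed_size, uncompressed_size)
print(payload)
```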

This header is followed by the file data, and the pair is repeated for each file in the archive. We could have something similar. Let's say we have a Lua project with the following files: main.lua, lib1.lua, lib2.lua and lib3.lua

we could have something like
HEADER               4 bytes   (CLUA)
BLOCK IDENTIFIER     4 bytes   (FILE_BLOCK)
SIZE OF BLOCK        4 bytes   (SIZE OF BLOCK)
NO. OF RECORDS       4 bytes   (ENTRIES)

followed by the filename data, one record per file, as in the ZIP format

BLOCK IDENTIFIER     4 bytes   (FILE_NAME)
START_POSITION       4 bytes   (Position in the file where the data starts)
FILENAME_LENGTH      4 bytes   (length of the filename)
FILENAME             STRING    name followed by a \0

which are then followed by the file data as

BLOCK_IDENTIFIER    4 bytes   (FILE_DATA)
END_OF_BLOCK        4 bytes   (Position where the block ends)
BLOCK_LENGTH        4 bytes   (Size of the compiled file)
DATA                Variable Length
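
One possible reading of this layout, sketched in Python. Everything beyond the field order above is an assumption made for the example: the 4-character tags FBLK, FNAM and FDAT, little-endian 32-bit integers, a filename length that counts the trailing \0, SIZE OF BLOCK meaning the total size of the filename records, and the helper names pack and unpack. The "compiled" bytes are placeholders, not real luac output.

```python
import struct

# Hypothetical 4-character tags; only "CLUA" comes from the layout above.
MAGIC, FILE_BLOCK, FILE_NAME, FILE_DATA = b"CLUA", b"FBLK", b"FNAM", b"FDAT"

def pack(files):
    """files maps a name to its compiled bytes; returns the container bytes."""
    names = [(name, name.encode("utf-8") + b"\x00") for name in files]
    index_size = sum(12 + len(raw) for _, raw in names)
    pos = 16 + index_size               # offset of the first FILE_DATA block
    index, blobs = b"", b""
    for name, raw in names:
        data = files[name]
        index += FILE_NAME + struct.pack("<II", pos, len(raw)) + raw
        end = pos + 12 + len(data)      # position where this data block ends
        blobs += FILE_DATA + struct.pack("<II", end, len(data)) + data
        pos = end
    header = MAGIC + FILE_BLOCK + struct.pack("<II", index_size, len(names))
    return header + index + blobs

def unpack(blob):
    """Return a dict of name -> bytes read back from the container."""
    magic, tag, _index_size, count = struct.unpack_from("<4s4sII", blob, 0)
    if magic != MAGIC or tag != FILE_BLOCK:
        raise ValueError("not a CLUA container")
    files, offset = {}, 16
    for _ in range(count):
        _tag, start, name_len = struct.unpack_from("<4sII", blob, offset)
        name = blob[offset + 12 : offset + 12 + name_len - 1].decode("utf-8")
        offset += 12 + name_len
        # START_POSITION lets us seek straight to the matching data block.
        _dtag, _end, length = struct.unpack_from("<4sII", blob, start)
        files[name] = blob[start + 12 : start + 12 + length]
    return files

sample = {"main.lua": b"\x1bLuaQ main", "lib1.lua": b"\x1bLuaQ lib1"}
container = pack(sample)
print(len(container), sorted(unpack(container)))
```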

Now, I have tried to see if Lua would run such a wrapped file, and it would not, as the Lua interpreter does not understand a custom RIFF-type file wrapping. This is why there are some special fields like the position in the file: they help to seek quickly to the right position and pick up the block of code. The idea is that a compiled file can be decompiled, but if it is placed in a file wrapper like this, it cannot be decompiled as easily, which adds a layer of protection to the code. If you are using this from within a C/C++ or an Objective-C app, these files can be extracted to a /tmp location and then executed. To add more security they can be encrypted, so that the extraction works only with a particular key that you set in your app. The only thing left to try is whether a file extracted at runtime into /tmp can be executed, and whether it will still refer to the resources in the resources directory.

Stay tuned if this interests you. I will try to answer the questions posed: will it work if executed from the /tmp directory with the resources in a /resource directory, and is there a better protection methodology than just this wrapping or encryption?


Sources

Zip format
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Lua VM Instructions - Kein-Hong Man

RIFF Format
http://www.johnloomis.org/cpe102/asgn/asgn1/riff.html
http://www.daubnet.com/en/file-format-riff

IEEE 754 64-Bit double numbers
http://babbage.cs.qc.cuny.edu/IEEE-754.old/References.xhtml
http://speleotrove.com/decimal/
http://en.wikipedia.org/wiki/IEEE_754-2008




