Or at least, to us.
The kernel recognises a dynamic ELF executable by the fact it has a
PT_INTERP phdr which points to the dynamic linker. See Process creation
for more info. However, it turns out the dynamic linker is just a regular,
statically linked ELF executable, and you can simply give it the program to run
as an argument. When using a shell dropper, this technique spares out a few
When the kernel loads our program and the dynamic linker (ld.so), it makes
the process jump to the dynamic linker. On glibc, the linker's entrypoint is
It then loads the main program's
PT_DYNAMIC phdr, which
contains a key-value table which describes which libraries we depend on
(DT_NEEDED), where the symbol and string tables are (DT_SYMTAB, DT_STRTAB),
etc. The dynamic linker then loads these libraries, resolves the needed
symbols, and performs the required relocations to glue all code together.
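
As a rough illustration of that key-value table, the following C sketch walks the running program's own dynamic table and tallies a few tags. The names summarize_dynamic and dyn_summary are made up for this example; _DYNAMIC is the linker-provided symbol marking the start of the table, and the code assumes a dynamically linked build.

```c
#include <elf.h>
#include <link.h>

/* Start of this executable's dynamic table (provided by the static linker). */
extern ElfW(Dyn) _DYNAMIC[];

/* Hypothetical helper type for this sketch: a tally of selected tags. */
struct dyn_summary {
    int needed;     /* number of DT_NEEDED entries (library dependencies) */
    int has_symtab; /* nonzero if a DT_SYMTAB entry was seen */
    int has_strtab; /* nonzero if a DT_STRTAB entry was seen */
};

/* Each entry is a (d_tag, d_un) pair; the table ends at DT_NULL. */
static struct dyn_summary summarize_dynamic(void) {
    struct dyn_summary s = {0, 0, 0};
    for (const ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; ++d) {
        switch (d->d_tag) {
        case DT_NEEDED: s.needed++;       break;
        case DT_SYMTAB: s.has_symtab = 1; break;
        case DT_STRTAB: s.has_strtab = 1; break;
        default: break;
        }
    }
    return s;
}
```

A normally linked program should report at least one DT_NEEDED entry (its libc) and both table tags.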
However, there's one entry, DT_DEBUG, which isn't documented (the docs say
"for debugging purposes"). What the dynamic linker actually does is place a
pointer to its r_debug struct in the value field. This behavior is mostly
portable (as in, it works on at least glibc, musl and FreeBSD). If you look at
your system's link.h file (e.g. in /usr/include), you can see the contents of
this struct. The second field is a pointer to the root link_map. More about
this one later.
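
In C, the DT_DEBUG lookup might look like the sketch below. The function name root_link_map is invented for this example, and the code assumes a dynamically linked executable whose dynamic table actually contains a DT_DEBUG entry (the usual case for executables).

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>

/* Start of this executable's dynamic table (provided by the static linker). */
extern ElfW(Dyn) _DYNAMIC[];

/* Follow DT_DEBUG -> r_debug -> r_map to reach the root link_map.
 * Returns NULL if there is no DT_DEBUG entry or it was never filled in. */
static struct link_map *root_link_map(void) {
    for (const ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; ++d) {
        if (d->d_tag == DT_DEBUG) {
            /* The dynamic linker stored the address of its r_debug here. */
            const struct r_debug *dbg = (const struct r_debug *)d->d_un.d_ptr;
            return dbg ? dbg->r_map : NULL;
        }
    }
    return NULL;
}
```

The returned entry describes the main program, so its l_prev pointer is NULL.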
After all the dependencies are loaded etc., the runtime linker will jump to our
entry point (_start), without clearing the registers or cleaning up the stack.
Normal dynamic linking needs a lot of (large) ELF header stuff. The trick is to bypass all this, and do the necessary minimal amount of linking ourselves.
Now let's have a look at the link_map. (It's in the same link.h file.)
It's a linked list containing information of every ELF file loaded by the rtld.
(The first entry is our binary, the second is the vDSO
. Usually, the third one is
link_map contains the path, base address and a pointer to its PT_DYNAMIC
phdr. We can use the latter to traverse all the symbol tables and figure out
what the address of every symbol is. This is what bold and dnload do,
except they save a hash of every symbol name, instead of the symbol names
themselves.
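
To make the list structure concrete, here is a small C sketch that walks the link_map chain and prints the public fields of each entry. It only enumerates loaded objects and does not resolve symbols. The helper names are invented for this example, and the root entry is found through DT_DEBUG under the same assumptions as before (dynamically linked executable, DT_DEBUG present).

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <stdio.h>

extern ElfW(Dyn) _DYNAMIC[];

/* Find the root link_map through DT_DEBUG -> r_debug -> r_map. */
static struct link_map *root_link_map(void) {
    for (const ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; ++d)
        if (d->d_tag == DT_DEBUG && d->d_un.d_ptr)
            return ((const struct r_debug *)d->d_un.d_ptr)->r_map;
    return NULL;
}

/* Print every loaded object and return how many were found. */
static int dump_link_maps(void) {
    int n = 0;
    for (struct link_map *m = root_link_map(); m; m = m->l_next, ++n) {
        printf("%d: base=%#lx dynamic=%p name=\"%s\"\n",
               n,
               (unsigned long)m->l_addr,       /* load base address */
               (void *)m->l_ld,                /* this object's dynamic table */
               m->l_name ? m->l_name : "");    /* path; often empty for the main binary */
    }
    return n;
}
```

A symbol resolver would go on from each l_ld pointer to that object's symbol and string tables.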
However, there are a few ways to save even more bytes:
First of all, instead of using the
DT_DEBUG trick, we can read some of the
internal linker state that's leaked to our entry point. On
link_map ends up in
_dl_start_user loaded it in there, and it
is retained after a function call because of the calling convention.
On x86_64, it's a bit more convoluted: the register the link_map is stored
in doesn't get preserved, but the return address to some place in
_dl_start_user is placed on the stack when calling some internal rtld function.
We can read this address, add an offset to it to reach the instruction that
loads the link_map (which is placed in a global variable), decode the offset
to the link_map from that instruction, and voila.
The code that fetches the pointer to the
link_map is then as follows:

    mov r12, [rsp - 8]        ; return address of _dl_init
    mov ebx, dword [r12 - 20] ; decode part of 'mov rdi, [rel _rtld_global]' ('movq _rtld_global(%rip), %rdi')
    mov r12, [r12 + rbx - 16] ; ???
    ; r12 is now the link_map pointer

Secondly, instead of iterating over the symbol tables (and having to compute
the hashes of every symbol), we can use the internal link_map data in glibc,
which includes the calculated symbol hash tables, so we only have to do a
hashtable lookup to get the address of a symbol.
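
For reference, here is the hashing step of such a lookup, sketched in C. This assumes the tables in question are in the DT_GNU_HASH format, which the text above doesn't state explicitly. The function names are invented for this example, and the Bloom filter and chain walk of a full lookup are left out.

```c
#include <stdint.h>

/* Hash function of the DT_GNU_HASH format: h = h * 33 + c, seeded with 5381,
 * computed modulo 2^32. */
static uint32_t gnu_hash(const char *name) {
    uint32_t h = 5381;
    for (const unsigned char *p = (const unsigned char *)name; *p; ++p)
        h = h * 33 + *p;
    return h;
}

/* Index of the bucket a name falls into, for a table with nbuckets buckets.
 * nbuckets must be nonzero. */
static uint32_t gnu_bucket(const char *name, uint32_t nbuckets) {
    return gnu_hash(name) % nbuckets;
}
```

A size-optimised binary can store the 32-bit hash of each symbol it needs instead of the name, and only has to compute the bucket index at run time.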
However, the offset to the hashtable in the
link_map tends to vary between
glibc versions. But, note the presence of the
l_entry field somewhere down
there. This value is known at (static) link time and we control it, so we can
simply scan the struct for the value, and use the difference between the
address of the link_map and the address of the l_entry field to compute the
offset between the "near" and "far" fields of the link_map, and use that to
read the hashtables.
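
The scan itself is short. Below is a hedged C sketch of it, not Smol's actual implementation. It reads the entry point from the auxiliary vector with getauxval(AT_ENTRY) rather than hardcoding a link-time constant, finds the root link_map through DT_DEBUG, and uses an arbitrary scan limit (SCAN_WORDS) chosen to stay within the struct on common glibc builds. All helper names are invented for this example.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <sys/auxv.h>

extern ElfW(Dyn) _DYNAMIC[];

/* Arbitrary limit for this sketch: how many words of the struct to inspect. */
#define SCAN_WORDS 128

/* Find the root link_map through DT_DEBUG -> r_debug -> r_map. */
static struct link_map *root_link_map(void) {
    for (const ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; ++d)
        if (d->d_tag == DT_DEBUG && d->d_un.d_ptr)
            return ((const struct r_debug *)d->d_un.d_ptr)->r_map;
    return NULL;
}

/* Scan the words following the public part of the link_map for `entry`.
 * Returns the byte offset of the first match, or -1 if none was found. */
static long find_entry_offset(const struct link_map *map, ElfW(Addr) entry) {
    const ElfW(Addr) *words = (const ElfW(Addr) *)map;
    size_t first = sizeof(struct link_map) / sizeof(ElfW(Addr));
    for (size_t i = first; i < SCAN_WORDS; ++i)
        if (words[i] == entry)
            return (long)(i * sizeof(ElfW(Addr)));
    return -1;
}

/* Convenience wrapper: locate our own entry point in our own link_map. */
static long own_entry_offset(void) {
    const struct link_map *map = root_link_map();
    if (!map)
        return -1;
    return find_entry_offset(map, (ElfW(Addr))getauxval(AT_ENTRY));
}
```

Once that offset is known for the running glibc, fields at a fixed distance from l_entry can be reached without hardcoding their absolute offsets.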
Smol uses these two tricks to achieve an even smaller binary size.
Most of the tricks presented here don't depend on the processor architecture.
Getting hold of the
link_map, however, needs yet another hack to get it to work.

    @ call internal init stuff w/ link_map pointer
    ldr r0, .L_LOADED
    ldr r0, [sl, r0]
    bl _dl_init(PLT)
    @ load _dl_fini, jump to entrypoint
    ldr r0, .L_FINI_PROC
    add r0, sl, r0
    mov pc, r6

This has a few problems:
- r0 (the pointer to the link_map struct) always gets overwritten
- The return address is placed in lr instead of being written to the stack. This means we can't use the stack trick as in x86_64
The only useful thing that gets passed to our entrypoint is the
sl register. The address to
.L_LOADED would still be needed, though.