Dynamic linking

Dynamic linking is the process of loading code and data from shared libraries. For an overview, see this and this page. For a series of very in-depth articles, see this page etc.

Here's a page on how to debug ld.so, and here's one on the glibc startup process (GNU Hurd, but mostly applicable to Linux as well).

The interesting bits

Or at least, to us.

The kernel recognises a dynamic ELF executable by the fact it has a PT_INTERP phdr which points to the dynamic linker. See Process creation for more info. However, it turns out the dynamic linker is just a regular, statically linked ELF executable, and you can simply give it the program to run as an argument. When using a shell dropper, this technique spares out a few bytes.

When the kernel loads our program and the dynamic linker (ld.so), it makes the process jump to the dynamic linker. On glibc, the linker's entrypoint is _dl_start_user (i386 , x86_64 ). It then loads the main program's PT_DYNAMIC phdr, which contains a key-value table which describes which libraries we depend on (DT_NEEDED), where the symbol and string tables are (DT_SYMTAB, DT_STRTAB ), etc. The dynamic linker then loads these libraries, resolves the needed symbols, and performs the required relocations to glue all code together.

However, there's one entry, DT_DEBUG, which isn't documented (the docs say "for debugging purposes"). What it actually does, is that the dynamic linker places a pointer to its r_debug struct in the value field. This behavior is mostly portable (as in, it works on at least glibc, musl and FreeBSD). If you look at your system's link.h file (eg. in /usr/include), you can see the contents of this struct. The second field is a pointer to the root link_map. More about this one later.

After all the dependencies are loaded etc., the runtime linker will jump to our entrypoint (_start), without clearing the registers or cleaning up the stack.

The attack plan

Normal dynamic linking needs a lot of (large) ELF header stuff. The trick is to bypass all this, and do the necessary minimal amount of linking ourselves.

Now let's have a look at the link_map. (It's in the same link.h header.) It's a linked list containing information of every ELF file loaded by the rtld. (The first entry is our binary, the second is the vDSO . Usually, the third one is libc.) The link_map contains the path, base address and a pointer to its PT_DYNAMIC phdr. We can use the latter to traverse all the symbol tables and figure out what the address of every symbol is. This is what bold and dnload do, except they save a hash of every symbol name, instead of the symbol names themselves.

However, there are a few ways to save even more bytes:

First of all, instead of using the DT_DEBUG trick, we can read some of the internal linker state that's leaked to our entry point. On i386, the link_map ends up in eax because _dl_start_user loaded it in there, and it is retained after a function call because of the calling convention.

On x86_64, it's a bit more convolved: the register the link_map is stored in doesn't get preserved, but the return address to some place in _dl_start_user is placed on the stack when calling some internal rtld function. We can read this address, to which we add an offset (to the instruction that fetches the link_map, which is placed in a global variable) to decode the offset to the link_map, and voila.

_dl_start_user excerpt (_dl_start_user excerpt)

The code that fetches the pointer to the link_map is then as follows:

mov r12, [rsp -  8]        ; return address of _dl_init
mov ebx, dword [r12 - 20]  ; decode part of 'mov rdi, [rel _rtld_global]' ('movq _rtld_global(%rip), %rdi')
mov r12, [r12 + rbx - 16]  ; ???
; r12 is now the link_map pointer

Secondly, instead of iterating over the symbol tables (and having to compute the hashes of every symbol), we can use the internal link_map data in glibc to access the calculated symbol hash tables, so we only have to do a hashtable lookup to get the address of a symbol.

However, the offset to the hashtable in the link_map tends to vary between glibc versions. But, note the presence of the l_entry field somewhere down there. This value is known at (static) link time, we control it, so we can simply scan the struct for the value, and use the difference between the address of the link_map and the l_entry fields to compute the offset between the "near" and "far" fields of the link_map, and use that to read the hashtables.

Smol uses these two tricks to achieve an even smaller binary size.

ARM

Most of the tricks presented here don't depend on the processor architecture. Getting hold of the link_map, however, needs yet another hack to get it to work.

glibc does the following:

@ call internal init stuff w/ link_map pointer
ldr r0, .L_LOADED
ldr r0, [sl, r0]
bl _dl_init(PLT)

@ load _dl_fini, jump to entrypoint
ldr r0, .L_FINI_PROC
add r0, sl, r0
mov pc, r6

This has a few problems:

  • r0 (containing the link_map struct) always gets overwritten
  • The return address is saved to lr instead of being written to the stack. This means we can't use the stack trick as in x86_64

The only useful thing that gets passed to our entrypoint is the sl register. The address to .L_LOADED would still be needed, though.

For now, only the dnload-style loading might be possible. This should work with glibc, musl and probably bionic as well.