How do hackers get proprietary source code?

I just listened to a podcast that was a couple of years old about a gray hat* hacker who would work as a developer during the day and go through source code looking for zero days at night. He was reading iOS source code and found a zero day (which was not described in depth but sounded like the kind of vulnerability you get in C by using a pointer to index an array).

So where did this guy, who never worked for Apple, get the source code?

(The rest of the story was about how he started out trying to be a good guy by notifying the companies of the flaws and the companies mostly ignored him, so he started selling them on the underground market. He made the mistake of telling someone he knew about this particular flaw and another group analyzed it, packaged software to exploit it, and sold the whole thing to a Chinese group for hundreds of thousands of dollars.)


*I don’t know if that’s a thing but he was not breaking the law even though it’s a breach of professional ethics.

Without any specifics it’s hard to say.

Much of Apple’s platform is based on open source software, so it depends what you mean by iOS source code. Open Source - Apple Developer

A lot of recent bugs, such as Heartbleed, were found in open source software that commercial companies use.

Others, such as the very recent Cloudflare bug, were revealed by Google’s Project Zero team, which is (I think; I can’t find a description anywhere) given access to proprietary source as well as open source code, in exchange for agreeing not to disclose zero-day bugs until the vendor has had a chance to fix them.

Was this the bug you were referring to?

Thinking about the situation described in the OP a bit more, I am sure it must be an open source project.

To get access to closed source code via Project Zero or something similar (code that wasn’t part of a published bug description), he would have to work there. If an employee of Project Zero took another company’s proprietary source home, and then sold an exploit to hackers (even if the company involved was being tardy about fixing it), he’d be guilty of a whole host of hacking-related charges.

Still, it’s interesting to me that the details of how Project Zero gets access to other companies’ closed source projects are not discussed anywhere, given how high profile some of the bugs they’ve revealed have been.

And of course there’s the possibility of plain old misreporting. The guy may not have had source code at all, but instead assembly code derived from machine code on the device. No special access needed there. It may be tedious, but it’s not particularly difficult to identify certain classes of errors, like out-of-bounds array accesses, from assembly.

A surprising amount of code is available in the wild. Universities especially can get access to source code for closed source operating systems. So can other research groups and, no doubt, defence and security related groups.

DEC used to publish an internals manual for VMS. And a very interesting and comprehensive volume it was. Nowadays the quality of documentation is dreadful in comparison. Microsoft don’t even publish a manual at all, but leave it to others to reverse engineer the workings. This matters if you are doing very low level coding. (I once wrote a very fast low level coroutine system for x86. When run on Windows I needed to know how to make it work with the existing mechanisms. Finding out what these were, and how they worked was ridiculous.)

A zillion years ago I had access to DEC’s VMS and to Sun’s Solaris. Microsoft have programmes to provide access to Windows source. (At least they have done so in the past.) It was the compilers you would never see source code for. They were considered the crown jewels. Typically any individual who had access had to sign various assurances, and the number of people afforded access was limited and controlled.

The most relevant open source aspect that might be part of the OP’s situation is Darwin, the open source core of Apple’s operating systems, which includes the kernel (XNU).

The UI and other stuff built on top of Darwin is mostly closed source. If you’re trying to find a hole into an Apple OS, you go after the kernel first. Not the UI.

And of course, even for closed source code, there are still plenty of methods. You break into the office building, carrying a thumb drive. You bribe the building janitor (or become the janitor yourself) to stick a thumb drive in the computer while the legitimate user is in the bathroom. You take a guy with access out for a few beers after work every day, and talk him into it when he’s drunk. You seduce someone with access, and talk him into it when he’s horny. You call up someone with access, and say that you’re from company IT, and convince them to give you their password. You compromise a close family member of someone with access, and get the person with access to give the relevant information to them. You use some other hack to break into the office’s internal security, and use that to gain access to the source code. You wait for someone else to do one of these things and then either freely release or sell the information they got.

Here is the story. The hacker went by the handle of Johnny Mnemonic. The piece was not just a story on him but includes interviews. The part I mentioned starts at about 6:05. After listening again, I notice the reporter calls it “source code” but Johnny’s description refers to registers, suggesting assembler or disassembled machine code. But they included him reading some code:

cd3_softc *softc = getsoftc();

so that looks like dereferencing a pointer in C. (I’m not aware of anyone talking in terms of registers when writing C code.) I’m a little rusty at C, but it looks to me like the idea is to load the result of a function into an array by using a pointer to index the array, though the syntax looks a little off for that. I think the exploit is to do something to force the pointer outside of the intended range of the array. They refer only to “an indexing error.”
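To make the “indexing error” idea concrete, here is a minimal sketch of that class of bug. Everything in it is made up for illustration (struct device_state, TABLE_SIZE, store_unchecked, store_checked); it is not the code from the podcast, just the general pattern of trusting an index that the caller controls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8  /* hypothetical size, chosen only for this sketch */

/* Hypothetical state: a small table with something sensitive right after it. */
struct device_state {
    uint8_t table[TABLE_SIZE];
    uint8_t privileged;
};

/* Buggy pattern: the index is trusted. Any index >= TABLE_SIZE writes past
 * the end of the table, which is undefined behavior and in practice can
 * overwrite neighbouring data such as 'privileged'. */
void store_unchecked(struct device_state *st, size_t index, uint8_t value) {
    assert(st != NULL);
    uint8_t *p = st->table;   /* pointer used to index the array */
    p[index] = value;         /* no range check */
}

/* Fixed pattern: reject any index outside the table.
 * Returns 0 on success, -1 if the index is out of range. */
int store_checked(struct device_state *st, size_t index, uint8_t value) {
    assert(st != NULL);
    if (index >= TABLE_SIZE) {
        return -1;
    }
    st->table[index] = value;
    return 0;
}
```

If an attacker can choose index, the unchecked version hands them a write to memory the author never meant to expose. The checked version costs one comparison.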

No, as can be seen and heard as described above.

How hard is reverse-engineering? Can you take an executable (and its dlls) and un-compile them?

You can decompile code, but the “source code” you get that way probably won’t be the same as the actual original source code, and will usually be nearly as hard to read as the assembly code you started with (or the machine code you started with, but machine code and assembly are trivial to convert between).

On the other hand, if you get ahold of some source code from an unknown source and need to verify its authenticity, it’s fairly simple to recompile it and check it against the known machine code.
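The comparison step itself is just a byte-for-byte check of two files. Here is a rough sketch, with the function name files_identical being my own invention. One caveat: two builds only come out byte-identical if you match the compiler version, flags, and anything else the build embeds (timestamps, paths), so a mismatch doesn’t by itself prove the source is fake.

```c
#include <assert.h>
#include <stdio.h>

/* Compare two files byte for byte.
 * Returns 1 if identical, 0 if they differ, -1 if either cannot be opened. */
int files_identical(const char *path_a, const char *path_b) {
    assert(path_a != NULL && path_b != NULL);

    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    if (a == NULL || b == NULL) {
        if (a != NULL) fclose(a);
        if (b != NULL) fclose(b);
        return -1;
    }

    int result = 1;
    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);
        if (ca != cb) {        /* differing byte, or one file ended early */
            result = 0;
            break;
        }
        if (ca == EOF) {       /* both files ended at the same point */
            break;
        }
    }

    fclose(a);
    fclose(b);
    return result;
}
```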

You can disassemble them and get back the instructions/operations in assembler, which looks like this (for amd64):



sub rsp, 0x48
mov rax, [rsp+0x78]
mov byte [rsp+0x30], 0x0

mov [rsp+0x10], rbx
mov [rsp+0x18], rbp
mov [rsp+0x20], rsi

push rsi
push r14
push r15
sub rsp, 0x480

mov rax, rsp
mov [rax+0x8], rbx
mov [rax+0x10], rsi
mov [rax+0x18], r12

sub rsp, 0x38
mov [rsp+0x20], r8
mov r9d, edx
mov r8, rcx


but you can’t get back to the source code written in a high level language like C.

There is a transcript of the program here (though even with the details there I can’t find any more information, or match to a public bug anywhere).

I think jz78817 might be right. Based on this comment, he may actually have been looking at the disassembly, not the C code (even though they refer to the “source code” multiple times). What they describe here sounds like disassembly (particularly due to the mention of a register, which you would not have in higher level C code):

That is pretty impressive to me. I’ve looked at the disassembly for a snippet that I knew had a bug to try and work out what was going on. The idea of trawling through hundreds of thousands of lines of disassembled C code does not sound like a fun way to spend the evening.

EDIT - Though just after that I notice they refer to C code, but maybe that was only revealed later to be the source of the bad memory access?

Doesn’t look like dereferencing to me. That looks like a pointer being declared and assigned.
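Here is a small sketch of the difference. The type example_softc and the function get_example_softc are hypothetical stand-ins (the real structure’s fields aren’t given anywhere in this thread); the point is only where the * is part of a type and where it actually reads through the pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a typedef'd driver state structure. */
typedef struct {
    int unit;
    int flags;
} example_softc;

static example_softc the_softc = { 0, 0 };

/* Hypothetical accessor that returns a pointer to the driver state. */
example_softc *get_example_softc(void) {
    return &the_softc;
}

int read_unit(void) {
    /* Declaration plus initialization: the '*' here belongs to the type
     * ("pointer to example_softc"). Nothing is dereferenced on this line. */
    example_softc *softc = get_example_softc();
    assert(softc != NULL);

    /* The dereference happens here, when a field is read through the pointer. */
    return softc->unit;
}

void set_flags(int flags) {
    example_softc *softc = get_example_softc();
    assert(softc != NULL);

    /* Explicit dereference; equivalent to softc->flags = flags. */
    (*softc).flags = flags;
}
```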

Not exactly. When you compile a program, you lose most of the semantic information embedded in the source.
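As a rough illustration of what gets lost, here are two functions that do the same job. The names and the scenario are made up. Function and variable names, comments, and the choice of loop style are not needed to run the program, so in a stripped binary there is nothing left to tell you which version the author wrote, and an optimizing compiler may well emit very similar instructions for both.

```c
#include <assert.h>
#include <stddef.h>

/* Version 1: descriptive names, index-based loop. */
int total_price_in_cents(const int *prices, size_t item_count) {
    assert(prices != NULL || item_count == 0);
    int total = 0;
    for (size_t i = 0; i < item_count; i++) {
        total += prices[i];
    }
    return total;
}

/* Version 2: terse names, pointer-walking loop. Same result for the same
 * input, but none of the intent from version 1 is visible. */
int f(const int *p, size_t n) {
    assert(p != NULL || n == 0);
    int t = 0;
    for (size_t k = n; k > 0; k--) {
        t += *p++;
    }
    return t;
}
```

A decompiler working from the machine code has to pick some shape for the loop and invent names itself, which is why its output reads more like version 2 than version 1.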

But reading the disassembly, an experienced person can figure out what’s going on. You can see that a particular function call is to a known library function and that it returns a pointer (say).

I’ve never done any large-scale reverse engineering of other people’s code, but if I was, I’d take notes as I went along. The notes would probably be in a C-like pseudocode. That could be what we’re seeing here.

Additionally, this sounds like a well known bug in an older library. Presumably he was able to infer that C code from the disassembly because of that.

So my take on this is that he DID NOT have the source code, he was analyzing the disassembly (despite the program saying otherwise, they just didn’t understand, or didn’t think the listeners would understand the distinction between source code and disassembly).

If he did have the C source code it must have been an open source library. It clearly says what he was doing was legal, which would definitely not be the case if it was from a proprietary closed source project he had access to because of his day job.

It is possible to decompile machine code into C. However, it is rather unreadable. The variable names won’t mean anything. (Compare to Obfuscated Code contest.) Example below.

That was my first thought but I couldn’t figure out how cd43_softc was a type. Then I realized it must be defined with a typedef. It has been almost 20 years since I’ve written C or C++ code.

Example of decompiled C code:



struct s0 {
    int32_t f0;
    signed char[4] pad8;
    struct s0* f8;
    struct s0* f16;
};

void insert(struct s0** rdi, struct s0* rsi) {
    struct s0** v3;
    struct s0* v4;

    v3 = rdi;
    v4 = rsi;
    if (*v3 != (struct s0*)0) {
        if ((*v3)->f0 <= v4->f0) {
            if (v4->f0 > (*v3)->f0) {
                insert(&(*v3)->f8, v4);
            }
        } else {
            insert(&(*v3)->f16, v4);
        }
    } else {
        *v3 = v4;
    }
    return;
}


It should also be pointed out (and the NPR program should have made this clearer, if you ask me) that this wasn’t a “hack” in the sense of Heartbleed or Cloudbleed, that was used to steal data. It was used to jailbreak phones, as in giving you root access to a computer you own. Not something anyone should have ethical qualms about IMO, regardless of what the law says.

This could be the incident described in the OP (or a similar one involving the same group).

And as an example of what I’m talking about–the code is clearly a binary tree inserter, and some mild restructuring makes the intent even more clear (though it wasn’t too bad already):


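#include <stddef.h>  /* NULL */
#include <stdint.h>  /* int32_t */

struct s0 {
    int32_t data;
    signed char pad8[4];   /* padding the decompiler made explicit */
    struct s0* right;
    struct s0* left;
};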

void insert_binary_tree(struct s0 **currentElementRef, struct s0 *newElement) {
    struct s0* currentElement = *currentElementRef;

    if (currentElement != NULL) {
        if (currentElement->data < newElement->data) {
            insert_binary_tree(&currentElement->right, newElement);
        } else if (currentElement->data > newElement->data) {
            insert_binary_tree(&currentElement->left, newElement);
        } else {
            // do nothing; the new element already exists
        }
    } else {
        *currentElementRef = newElement;
    }
}

As said, it’s time-consuming, but usually not particularly difficult to reverse-engineer the intent of the code from assembly. It’s rare that you have to be really clever, partly because the compilers generally aren’t that clever.
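One way to sanity-check a reading like “this is a binary tree inserter” is to run the cleaned-up logic and see whether an in-order walk comes out sorted. Below is a self-contained sketch of that check. The names (struct node, insert_node, in_order) are mine, and the insert logic is a restatement of the restructured code above, not the original author’s source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    int32_t data;
    struct node *right;
    struct node *left;
};

/* Same decision structure as the restructured decompiler output:
 * larger values go right, smaller go left, duplicates are ignored. */
void insert_node(struct node **ref, struct node *new_node) {
    assert(ref != NULL && new_node != NULL);
    struct node *cur = *ref;

    if (cur == NULL) {
        *ref = new_node;
    } else if (cur->data < new_node->data) {
        insert_node(&cur->right, new_node);
    } else if (cur->data > new_node->data) {
        insert_node(&cur->left, new_node);
    }
    /* equal: the value already exists, so do nothing */
}

/* In-order walk (left, node, right). Appends values to 'out' starting at
 * position 'count' and returns the new count. The caller must make sure
 * 'out' is large enough for every node in the tree. */
size_t in_order(const struct node *n, int32_t *out, size_t count) {
    if (n == NULL) {
        return count;
    }
    count = in_order(n->left, out, count);
    out[count++] = n->data;
    return in_order(n->right, out, count);
}
```

If the guess is right, inserting values in any order and then calling in_order should give them back in ascending order with duplicates dropped.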

Well, if they think you’re crude, go technical; if they think you’re technical, go crude.