This is a difficult project. The blog post seems to hint at reasonable feasibility, this stuff is hard! We build a less ambitious tool in the university lab: "ASTANA: Practical String Deobfuscation for Android Applications Using Program Slicing" [0].
Would advise to first read the reverse engineering related work. Genetic programming is just a technique best used when everything else has failed :-)
I've read the abstract of your article. I am not much in the field of decompilers. Let alone deobfuscation. It's even hard for me to type it :))
I think it is probably a safe assumption that the kernel binary found on Android devices is not obfuscated. Tho I probably need more research to confirm this.
Thanks for the hints. Of course, it's very very difficult. But one thing I think you missed, is that I'm proposing a "byte equivalent decompilation". And after that, we should go into reading the code readable and understandable.
If we could create a program doing all this, automatically or semi-automatically, it will be great-great because then not releasing the kernel code doesn't matter. I believe if enough effort and time is put into it, there is a good chance we could see such a thing in like 5-7 years.
After that, we might be able to target the binary blobs, the propriety firmwares. Those might have some legal issue, of course. But as long as it is used only to write a FOSS alternative, that probably won't be an issue, I think.
I wanted to append [idea] but it became too long. I think there should be balance "musings" and actual work. I learned very valuable lessons in my wakegp research. Even though I haven't completed it. If the one doing this research would be me, I would go for no actual work as long as I can. And when I start to do the work, I should do as little as I can and move very slowly. Like the saying "When you move slowly, you can see the path ahead clear". After all, we should know if the next step we go forward, the ground under us is solid.
That sounds like a load of pseudo-profound bullshit to me.
And this little idea you've described? It's exactly the kind of thing where both the fundamental issues and the specific implementation details would conspire to fuck you over hard at every single turn.
If you actually tried to implement that idea of yours, you'd fail a lot, and maybe you'd learn something, and then it would be worth writing a post about that. About what you tried, what didn't work, what worked a little bit but not really.
As is? You picked a domain where ideas are easy but implementation is hilariously hard, and then you had an idea but implemented nothing. You never tried anything and you never learned anything. The value is nil.
> Theoretically, we could also go for finding the semantically equivalent C code. However, last time I researched, checking semantic equivalency is a very complex problem. I think it was NP hard.
Already deciding whether two finite automata decide the same (regular) language is PSPACE-complete; it's undecidable for anything that can decide arbitrary context-free languages (which C programs can clearly do).
While you might be able to produce code that would re-compile into a binary resembling the original binary, how would you be able to generate the same code that was compiled to produce the original binary? Seems like there would be a significant difference between the actual original code and the decompiled version - enough that it could not be used to prove license violation.
This is a difficult project. The blog post seems to hint at reasonable feasibility, this stuff is hard! We build a less ambitious tool in the university lab: "ASTANA: Practical String Deobfuscation for Android Applications Using Program Slicing" [0].
Would advise to first read the reverse engineering related work. Genetic programming is just a technique best used when everything else has failed :-)
[0] https://arxiv.org/pdf/2104.02612
I've read the abstract of your article. I am not much in the field of decompilers. Let alone deobfuscation. It's even hard for me to type it :))
I think it is probably a safe assumption that the kernel binary found on Android devices is not obfuscated. Tho I probably need more research to confirm this.
Thanks for the hints. Of course, it's very very difficult. But one thing I think you missed, is that I'm proposing a "byte equivalent decompilation". And after that, we should go into reading the code readable and understandable.
If we could create a program doing all this, automatically or semi-automatically, it will be great-great because then not releasing the kernel code doesn't matter. I believe if enough effort and time is put into it, there is a good chance we could see such a thing in like 5-7 years.
After that, we might be able to target the binary blobs, the propriety firmwares. Those might have some legal issue, of course. But as long as it is used only to write a FOSS alternative, that probably won't be an issue, I think.
Not exactly the same goal but in the ballpark, reminded me of the linux-libre patches (removes binary blobs and non-gpl code from the kernel)-
https://www.fsfla.org/ikiwiki/selibre/linux-libre/
I thought this would be an actual process, not a blog post with idle musings.
I wanted to append [idea] but it became too long. I think there should be balance "musings" and actual work. I learned very valuable lessons in my wakegp research. Even though I haven't completed it. If the one doing this research would be me, I would go for no actual work as long as I can. And when I start to do the work, I should do as little as I can and move very slowly. Like the saying "When you move slowly, you can see the path ahead clear". After all, we should know if the next step we go forward, the ground under us is solid.
That sounds like a load of pseudo-profound bullshit to me.
And this little idea you've described? It's exactly the kind of thing where both the fundamental issues and the specific implementation details would conspire to fuck you over hard at every single turn.
If you actually tried to implement that idea of yours, you'd fail a lot, and maybe you'd learn something, and then it would be worth writing a post about that. About what you tried, what didn't work, what worked a little bit but not really.
As is? You picked a domain where ideas are easy but implementation is hilariously hard, and then you had an idea but implemented nothing. You never tried anything and you never learned anything. The value is nil.
I don't welcome rudeness towards people. Not me nor anyone else.
> Theoretically, we could also go for finding the semantically equivalent C code. However, last time I researched, checking semantic equivalency is a very complex problem. I think it was NP hard.
Already deciding whether two finite automata decide the same (regular) language is PSPACE-complete; it's undecidable for anything that can decide arbitrary context-free languages (which C programs can clearly do).
Well I knew that checking if two binary circuits are equivalent is NP hard. Checking semantic equivalency of C code, of course, should be harder.
I'm having a really hard time parsing this title.
A dash between GPL and violated would probably help.
Thanks for the feedback. I don't think I can edit the HN entry, but I'm updating me blog post.
While you might be able to produce code that would re-compile into a binary resembling the original binary, how would you be able to generate the same code that was compiled to produce the original binary? Seems like there would be a significant difference between the actual original code and the decompiled version - enough that it could not be used to prove license violation.