

Impossible Java | Hacker News

Original source (news.ycombinator.com)
Tags: java jvm bytecode news.ycombinator.com
Clipped on: 2017-04-22

Fun fact: in Java versions 5 and 6, it was actually possible to write valid Java source code with overloaded return types!

The trick is that generic types in Java are subject to type erasure at runtime. Due to an oversight, it was possible to declare a class with methods like:

    String foo(List<X> l);
    double foo(List<Y> l);
which would be erased to:

    String foo(List l);
    double foo(List l);
At runtime, even though the type information for your List was no longer available, the compiler would be able to locate the correct method using the method signature stored in the caller's bytecode, giving the appearance of return-type dispatch. Technically this violates the Java language spec, and javac 7 was updated to be stricter and prevent this sort of code: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6182950
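The erasure half of the trick is easy to observe directly; a minimal sketch (class and variable names are mine):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates that generic type parameters are erased at runtime:
// a List<String> and a List<Integer> share the same runtime class.
public class ErasureDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        // Both erase to the raw type ArrayList, so the classes are identical.
        System.out.println(strings.getClass() == ints.getClass()); // prints "true"
    }
}
```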

I guess it's ultimately not that mysterious, but I first encountered it "in the wild", and ended up scratching my head for a while before I figured out why our code suddenly stopped compiling when we upgraded the JDK.

> giving the appearance of return-type dispatch

I don't know that it was just an appearance; my understanding is it is return-type dispatch, which is supported by the JVM (but not by Java).

Interesting. Then, why isn't the bytecode always stored with the caller, that way you can have return-type dispatch?

The Java bytecode stores it; however, the Java language normally does not expose it. Having the compiler select the correct overload based on return type would most likely add a lot of complexity to the language itself (have you seen the current overload resolution rules?) without much benefit.

I think the compiler actually has to work around that when you narrow the return type of an overridden function: a method Object Base::get() overridden with Integer Child::get() will result in an additional compiler-generated Object Child::get() in the bytecode.

Yep. They're called "bridge" methods. Though, interestingly, there's only one point that they're mentioned in the language spec, specifically in the case of erasure. Even though they were needed before erasure for exactly this reason.
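A minimal sketch of this (class names are illustrative): reflecting on a class with a covariant override shows both the declared method and the synthetic bridge, distinguishable only by return type:

```java
import java.lang.reflect.Method;
import java.util.function.Supplier;

// A covariant override forces javac to emit a synthetic "bridge" method:
// StringSupplier ends up with both get()Ljava/lang/String; (our code) and
// get()Ljava/lang/Object; (the bridge that delegates to it).
public class BridgeDemo {
    static class StringSupplier implements Supplier<String> {
        @Override
        public String get() { return "hello"; }
    }

    public static void main(String[] args) {
        for (Method m : StringSupplier.class.getDeclaredMethods()) {
            System.out.println(m.getName() + " returns "
                + m.getReturnType().getSimpleName()
                + (m.isBridge() ? " (bridge)" : ""));
        }
    }
}
```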


The bytecode for method invocation does store the type of the method (which includes the return type), and it is quite valid for a class file to contain multiple methods that are only distinguishable by their return type. You can't normally do this visibly in Java, but I think it's used for bridge methods and a few other things by the compiler.

tl;dr Java needs to be able to decide which overloaded method implementation to use at compile time, meaning you can't differentiate between method implementations based on return type alone. However, when compiled to bytecode, each method and method call includes the return type as part of the method identification, so Java bytecode is actually capable of differentiating between implementations with the same name based on return type alone. This fact is used by bytecode obfuscators, and can lead to bugs in decompiled code if the decompiler doesn't account for it.

> Java needs to be able to decide which overloaded method implementation to use at compile time, meaning you can't differentiate between method implementations based on return type alone.

That's not correct. Rust has no issue statically dispatching based on the return type.

Java-the-language does not allow it, so it does not have to deal with calls which don't use the return value, e.g.

    int getSomething()
    String getSomething()
If the return value is not used, you have to explicitly disambiguate this call somehow, and Java provides no way to do so.

What's not correct? I said Java needs to decide the return type, not Rust, or some other language.

> Java needs X, meaning you can't Y

X and Y are certainly both possible, even if the Java language chooses not to have Y.

Parse error: implicitly the author intended "Java needs X, meaning you can't Y in Java".

Please read charitably and give the poster the benefit of the doubt.

> Parse error: implicitly the author intended "Java needs X, meaning you can't Y in Java".

Except the second clause does not generally follow from the first yet is presented thus. You can't Y in Java because Java very specifically decided not to support Y.

The correct statement is

1. Java needs X.

2. Unrelatedly, Java chose not to Y.

You're stating "you can't differentiate between method implementations based on return type alone" which is nonsensical both on its face (people have no issue doing it) and technically (since several languages do exactly that).

And thus "Java needs to be able to decide which overloaded method implementation to use at compile time" is not an issue relevant to return-type overloading.

There was no issue with Java implementing return-type overloading, the creators of Java decided not to do it for non-technical reasons.

Oh, I see. My language may have been imperfect, but I think the meaning is clear. Obviously it's possible to differentiate based on return type, by very definition of it having a different return type, and also that the mentioned bytecode does just that. I see no reason why you would assume that the second half of the sentence was supposed to apply to languages generally when the first half mentions Java specifically. "Java needs to decide if you want a method to be public at compile time, so you can't leave out the public keyword and expect it to be public". Not true, says you, they could have made public methods the default.

Anyway, I think we understand each other now, and I shall endeavor to be clearer in the future.

It could be allowed. Consider the case of a varargs method called with a null value. Say you call the method ambiguousNull(String ...args) like so:

    ambiguousNull(null);

Should args be null? Or should args be new String[]{null}? In the absence of any more information the compiler chooses the first, but may still issue a warning. But you can clearly disambiguate by casting:

    ambiguousNull((String)null); // args is new String[]{null}
    ambiguousNull((String[])null); // args is null
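A runnable sketch of these varargs binding rules (the method name is illustrative):

```java
// Demonstrates how null binds in a varargs call, and how casts disambiguate.
public class VarargsNullDemo {
    static String describe(String... args) {
        if (args == null) return "null array";
        return "array of length " + args.length;
    }

    public static void main(String[] args) {
        // A bare null binds to the String[] parameter itself (with a warning).
        System.out.println(describe(null));            // "null array"
        // Casting to the array type makes that binding explicit.
        System.out.println(describe((String[]) null)); // "null array"
        // Casting to the element type wraps it as a one-element array.
        System.out.println(describe((String) null));   // "array of length 1"
    }
}
```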

So given

  int something() { }
  String something() { }
A hypothetical calling syntax could be

    (int) something(); // explicitly selects the int overload

Or the compiler could even choose one for you with a warning. Seems like a good suggestion for Java 10.

It seems like in the case of varargs, args should always be an array, since the meat of most varargs-based functions is iterating over that array.

Even if you have a language that forces you to assign the return value, what does the compiler do in a situation like:

  Integer getSomething()
  String getSomething()
when the calling site is:

  Object something = obj.getSomething();

The compiler would tell you that it can't disambiguate that and fail. It's a type error.

I guess that makes sense. It is similar to:

  void putSomething(Integer i)
  void putSomething(String i)

where an ambiguous call like putSomething(null) throws a compiler error in Java.

The Haskell solution is to require a type annotation if the type is ambiguous.

Of course, Java provides no way to do that.

Well, I presume `String something = getSomething(); Object s = (Object)something;` would be okay.

Note that in Rust we don't have inheritance (not exactly the same kind, at least), so it's rare that a statically typed value can be referred to with a different type. So there are still annoyances in Java with this approach that Rust wouldn't have.

This and other questions are discussed at http://stackoverflow.com/a/442291

I was surprised to learn that in Kotlin, it is possible to disambiguate overloaded functions based only on their return type. I had no idea the JVM even supports such semantics. [1]

[1]: http://stackoverflow.com/q/42916801/1772342

According to your link, the JVM does not support this feature. It's implemented by Kotlin itself. I guess it's probably some kind of name-mangling scheme.

According to the link Java/Javac does not support the feature.

The JVM does. `bar(foo: List<String>): String` can be compiled to `bar(Ljava/util/List;)Ljava/lang/String;` and `bar(foo: List<Int>): Int` to `bar(Ljava/util/List;)Ljava/lang/Integer;` or somesuch; there is no ambiguity at the bytecode level.

In fact, the documentation for Class#getMethod specifically outlines this issue[0]:

> Note that there may be more than one matching method in a class because while the Java language forbids a class to declare multiple methods with the same signature but different return types, the Java virtual machine does not.

[0] http://docs.oracle.com/javase/8/docs/api/java/lang/Class.htm...

The article describes Dalvik which is different from JVM, so it's not really a proof.

But you are actually right. It looks like JVM bytecode includes the full function signature in function invocations: https://www.ibm.com/developerworks/library/it-haggar_bytecod... (if I'm reading the examples correctly).

> The article describes Dalvik which is different from JVM, so it's not really a proof.

My comment describes the actual JVM, hence having linked to official Java/JVM documentation.

Sorry, I misread it.

It could hardly be otherwise, or the JVM would need to do overload resolution at runtime, which would not be a pleasant experience for anyone.

Well, it could just use the function name and leave it to javac/Kotlin/other JVM front ends to handle overloads and generate unique names. Although that wouldn't work nicely with reflection and other run-time features, I guess.

It would also screw up interop. You'd basically have the issue everyone has when trying to call C++.

Note that the first example compiles without any further changes, producing valid JVM bytecode.

Not only is this possible, as stated by the article, but this level of overloading is incredibly useful in a number of areas and has been used for a long time.

One use case is API backwards compatibility. If your API wants to change the return type of a function, say from int to double, but also wants to maintain binary backwards compatibility, you can do that. See OverMapped [ https://github.com/Wolvereness/OverMapped ].

Obfuscation is another area, and ProGuard employs this to make decompiling more difficult, iirc.

This is about the Dalvik bytecode format, but the same applies to standard Java bytecode files. Practically any obfuscated Java code will have this, which makes reverse engineering much more difficult without tools to handle it.

Why is it not as simple as mapping the method call instructions to returntype_methodname format? Am I missing something?

What if the call site does not use the returned value?

The bytecode encodes that information, the method identifier includes name, parameter types and return type.

It's still present in the bytecode (since the VM has to know which method to call).

Yes. Because of overloading there may be 2 functions with the same name and the same return type but different arguments. So you need to include them too.

No. Java itself handles the differing arguments.

Well one obfuscation technique is to change the argument types. So for instance, you might have foo(string). But after obfuscation, you move the code of foo(string) into foo(object). So a decompiler ends up showing:

  String s = "123";
  foo(s);
Java will call foo(string) if this is re-compiled, which is wrong. The decompiler would need to show foo((Object)s).

On the CLR you can go even further and erase a lot of the type info and just pass objects around. That is, you can just change all your method signatures to foo(object, object, object) no matter which classes (not structs) you pass. Or have even more fun and randomize the classes. So now foo(string) becomes foo(DBConnectionInfo).

For that specific example that would actually be fine in Java. Everything is an Object, so even if you have foo(Object) you can just pass it a String and that is then fine both for the JVM and the Java compiler.

Since the JVM identifies methods by their name + argument types this would most likely break overloading unless done very carefully. I've decompiled quite a few Java applications and haven't seen anything like this either, but perhaps I've just had the luck to not yet run into it.

It wouldn't be fine. It'd call the wrong method. The bytecode invokes the foo(object) one, but a naive decompiling would invoke foo(string). A cast would be required to get it to resolve correctly. Or am I misunderstanding?
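A minimal sketch of the hazard (names are illustrative): with both overloads present, an un-cast call re-binds to the more specific one, so decompiled source behaves differently from the bytecode that invoked foo(Object):

```java
// Overload resolution picks the most specific applicable method at compile
// time, which is why naively decompiled code can re-bind to the wrong overload.
public class OverloadDemo {
    static String foo(Object o) { return "foo(Object)"; }
    static String foo(String s) { return "foo(String)"; }

    public static void main(String[] args) {
        String s = "123";
        System.out.println(foo(s));          // most specific wins: "foo(String)"
        System.out.println(foo((Object) s)); // the cast forces "foo(Object)"
    }
}
```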

I'm talking about decompiled code, which would turn out looking something like http://ideone.com/HCOLU4 which compiles and runs just fine. Maybe I am misunderstanding what you mean?

Very interesting. Technically this should work in Dalvik opcodes too. But I do not know if there are any runtime checks done by the VM. I have never seen such a technique used in Android.

Do you have some examples of tools that handle the deobfuscation stuff?

The same thing exists in .NET IL where you can overload methods based only on return values (among other interesting things like modopt/modreq [0] etc.).

[0]: http://stackoverflow.com/a/5294456

> Any compiler of sound mind and memory will issue an error

GHC would beg to differ.

Also rustc, though it doesn't quite have overloading in the Java sense: https://is.gd/s0oF79

GHC doesn't either. Rust traits are heavily inspired by Haskell typeclasses… so it's like in your example.

Swift has overloading in the Java sense, and does allow overloading by return type.

That was my first thought: this is being presented as a fundamental tenet of compilers when really it's a design choice that a whole family of languages, some of them that people even use, didn't make.

Ada also allows return type overloading.

A complaint against Ada is that it is hard for the compiler to figure out the right overloading in a complex statement with many nested function calls. My compilers prof took the time to show how to do it, and why it doesn't take much time.

Too bad C's automatic conversions prevent this from being used in C++.

It seems like this is simply a disassembler error (albeit an understandable one). Am I missing something?

Edit: Based on the responses below, I guess the point is that the disassembler can't generate Java code that will "naively" (wrong word, but I can't think of a better one) generate the same output. Notable (I assume) in that name munging would be problematic outside the current compilation unit.

No, Java class files usually include the names of variables and functions, so this isn't the disassembler's fault. The class file actually had two functions with the same name. You could certainly implement an anti-obfuscation layer to detect stuff like this, but I wouldn't call it an "error" as is.

I think it's a common expectation that a disassembler should provide output that is valid to be compiled, and that therefore this is an error.

Sure, but it's also expected that it should provide output that could be compiled to produce the input, and in this case it's impossible to satisfy both those constraints. The best thing would probably be to leave a comment in the generated source code explaining the problem, and provide an option to rename overlapping functions.

It's not necessarily always possible to output valid-to-compile Java sources. If the bytecode came from a different JVM language, there are bytecode patterns that javac simply can't emit.

I would argue that decompilers are primarily reference tools (i.e., a more readable disassembly). It is wrong to see them as source code recovery tools because they will never be able to capture every aspect of the original program. So it doesn't make sense to have as a primary goal the ability to provide output that can be compiled again. It is more important that they more faithfully represent the disassembly.

Well, it's not a JVM class file but a Dalvik DEX file. A disassembler which can't generate compilable code from this valid bytecode is incomplete, or in other words: buggy.

Not if you are talking about compilable java code, because as the article explains, there are things you can express in byte code which you can simply not express in java source code.

Indeed. The lower-level language is more expressive almost by definition. It is more difficult to write but allows fine-grained control. This is why some optimization and obfuscation tricks are done in assembly (in the native world, not Java), and hence a disassembler simply cannot translate them back.

There are two valid ways for a disassembler to mitigate this: a) decompile to a language in which the bytecode can be expressed (in a concise/expressive manner; Java would always be a "possible" target because of Turing completeness) or b) accommodate the fact that there could be signature collisions in Java, e.g. by prefixing/suffixing the method name

If you change the method name you end up with code that acts differently, just imagine something that does something like this pseudocode:

    if (!new Exception().getStackTrace().getSha1sum().startsWith("0000")) alert("hello decompiler")

Your comment about java and turing completeness doesn't make sense unless you want the decompiler to basically output a java implementation of a JVM?

Dare [0] emulates the Dalvik VM's runtime behavior to generate verifiable (for the vast majority of cases) Java bytecode from dex bytecode.

[0]: http://siis.cse.psu.edu/dare/

What did the disassembler do wrong? It just happens that there is no valid Java code which could produce that (valid) bytecode. What should it have outputted instead?

Output the method names as a_void and a_string instead of just a?

Then it wouldn't be true to the real structure of the program (although it would be true to the real behaviour of the program). Why is that less wrong?

> Why is that less wrong?

It compiles.

But may not run.

It's obviously not impossible to create legit Java code during disassembly. This is like an arms race; go ahead and disassemble my code, but I've made sure that you now need a better disassembler to produce valid code.

Eventually that disassembler will come along, and then more tricks will be used by the obfuscators...

This is not a proper arms race; it terminates with a victory for the disassemblers in a finite and feasible amount of work.

There is an arms race in whether the resulting output is at all human-comprehensible. The only thing preventing that from terminating with victory for the obfuscators is that they have a finite technical budget with which to obfuscate. Changing identifiers is nearly free, but as they go beyond that and start rewriting the source code itself they incur performance penalties and increasing odds that the rewrite will fail as they get more aggressive.

It would fail at runtime if this method is called by reflection.

Yes. That reflection code would also have failed the moment the obfuscator did its thing, so that is not really a concern. The two are not really compatible.

In basic blocks (no conditionals or loops), a disassembler mostly does a mechanical job, translating opcodes to the appropriate Java code and aggregating expressions. But it doesn't rename functions or classes. They are left as they were in the bytecode.

No, it's a mismatch between what the compiler accepts and what the JVM can execute.

The interesting part to me is that it seems that Java would be perfectly capable of differentiating between methods by return type if the compiler was tweaked slightly. Is there a reason why this isn't a formal language feature?

As explained in the post, a function call is a full expression for which the appropriate function must be found. If there are 2 functions with the same name and parameter types, it would be impossible to compile such an expression on its own.

Not necessarily, you'd just need a syntactic mechanism to disambiguate.

And it has one, since Java either throws away the return value or assigns it to a variable whose type is known.

Edit: here's the edge case though. You can call a function and use the return value directly as a parameter: foo(bar()). It's possible to have two foo that take both possible bar return types, at which point the compiler is stuck.

It could require a cast in this instance, however. The more I think about this the more I wonder why this isn't possible.

> here's the edge case though.

That's no more an edge case than the cases where you throw away the return value or you're binding to an ambiguous type e.g. `A getFoo()`, `B getFoo()`, `Object foo = getFoo()`.

> The more I think about this the more I wonder why this isn't possible.

The Java spec does not say. For C++, Stroustrup states it's

> to keep resolution for an individual operator or function call context-independent.

the Java reason is likely also some sort of Principle of Least Surprise claim.

It should be noted that the JVM allows for things that javac does not. This is one of those things, and the article describes it accurately, albeit in the Android Java environment; however, it applies to the standard JVM too.

One of the more fun things I've done is use the ASM and BCEL libraries to make (and unmake) these kinds of manipulations (manipulations that javac won't let you do.)

I believe the technical term for the correspondence between what it is possible to compile and what is valid bytecode/machine code is fully abstract compilation. It's an interesting concept with many interesting implications (e.g. for security). In the past at least there were various examples of Java programs that were illegal in the language but nonetheless could be created directly as bytecode and would be loaded by the JVM. This obviously becomes a security problem if your program loads bytecode dynamically and makes assumptions about its capabilities at the language level as opposed to the bytecode level.

If you load untrusted code dynamically, it strikes me as wrong to assume anything about its capabilities. Even more so "at the language level". Untrusted code can do anything unless you sandbox it.

You're right but I guess it's perhaps easy to overlook, e.g. that if you decompile/disassemble a valid bytecode program it might give you a program that is not valid at the source level. Some interesting examples and discussion here: http://lambda-the-ultimate.org/node/5364

This is a well-known problem in the field of decompilers and disassemblers. The pseudo-code they output for compiler-generated code is pretty good, but when confronted with hand-crafted assembly or bytecode, they go places.

Even the Java compiler uses that trick, for example with bridge methods.

  public static void main(String[] args) {
    class Fun implements Supplier<String> {
      public String get() { return null; }
    }
    Arrays.stream(Fun.class.getMethods())
        .filter(m -> m.getDeclaringClass() == Fun.class)
        .forEach(System.out::println);
  }

why not use .getDeclaredMethods()?

no real reason :)

Another oddity to think about in java source code: In regular Java, all objects extend java.lang.Object, including java.lang.Class. So how do you bootstrap building java.lang.Object from source?

You don't.

This is part of the bootstrapping process of a programming language.

Usually such special types are built manually in the compiler's data structures, or make use of special primitives, like native methods in Java's case.

java.lang.Object is just part of the VM. Same thing with native methods.

While many of the methods in java.lang.Object have a native implementation, there are still other methods that sport a pure Java implementation: http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/...
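A quick reflective check of this split (a sketch, not exhaustive):

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Shows that java.lang.Object mixes native methods (e.g. hashCode) with
// ordinary pure-Java implementations (e.g. toString).
public class ObjectNativeDemo {
    public static void main(String[] args) throws NoSuchMethodException {
        Method hashCode = Object.class.getDeclaredMethod("hashCode");
        Method toString = Object.class.getDeclaredMethod("toString");
        System.out.println("hashCode native? " + Modifier.isNative(hashCode.getModifiers())); // true
        System.out.println("toString native? " + Modifier.isNative(toString.getModifiers())); // false
    }
}
```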

Sure, but...if you look at the JVMS, there are a lot of special cases concerning java.lang.Object to help with bootstrapping.

I call clickbait:

The author admits this is not valid Java. It is not even compilable.

If I read correctly, it is just artifacts from a partially successful decompile.

Interesting, and this discussion is interesting, but this is not and has never been valid Java.

never mind.

We detached this subthread from https://news.ycombinator.com/item?id=13978371 and marked it off-topic.

I could see his comment just fine. What's the evidence of an HN shadowban?

I, for one, can see what he posts and they aren't hidden in any way.
