Impossible Java | Hacker News - Jay Taylor's notes

Jay Taylor's notes

back to listing index

Impossible Java | Hacker News

[web search]

Original source (news.ycombinator.com)

Tags: java jvm bytecode news.ycombinator.com

Clipped on: 2017-04-22

Fun fact: in Java versions 5 and 6, it was actually possible to write valid Java source code with overloaded return types!

The trick is that generic types in Java are subject to type erasure at runtime. Due to an oversight, it was possible to declare a class with methods like:

    String foo(List<X> l);
    double foo(List<Y> l);

which would be erased to:

    String foo(List l);
    double foo(List l);

At runtime, even though the type information for your List was no longer available, the compiler would be able to locate the correct method using the method signature stored in the caller's bytecode, giving the appearance of return-type dispatch. Technically this violates the Java language spec, and javac 7 was updated to be stricter and prevent this sort of code: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6182950

I guess it's ultimately not that mysterious, but I first encountered it "in the wild", and ended up scratching my head for a while before I figured out why our code suddenly stopped compiling when we upgraded the JDK.

masklinn 25 days ago [-]

> giving the appearance of return-type dispatch

It don't know that it was an appearance, my understanding is it is return-type dispatch which is supported by the JVM (but not java).

jxi 25 days ago [-]

Interesting. Then, why isn't the bytecode always stored with the caller, that way you can have return-type dispatch?

josefx 25 days ago [-]

The Java bytecode stores it, however the Java language normally does not expose it. Having the compiler select the correct overload based on return type would most likely add a lot of complexity to the language itself ( have you seen the current overload resolution rules? ) without much benefit.

I think the compiler actually has to work around that when you narrow the return type of an overriden function: A method Object Base::get() overriden with Integer Child::get() will result in an additional compiler generated Object Child::get() in the bytecode.

jdmichal 24 days ago [-]

Yep. They're called "bridge" methods. Though, interestingly, there's only one point that they're mentioned in the language spec, specifically in the case of erasure. Even though they were needed before erasure for exactly this reason.

https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.htm...

aardvark179 25 days ago [-]

The bytecode for method invocation does store the type of the method (which includes the return type), and it is quite valid for a class file to contain multiple methods that are only distinguishable by their return type. You can't normally do this visibly in Java, but I think it's used for bridge methods and a few other things by the compiler.

nebulous1 25 days ago [-]

tl;dr Java needs to be able to decide which overloaded method implementation to use at compile time, meaning you can't differentiate between method implementations based on return type alone. However, when compiled to bytecode, each method and method call includes the return type as part of the method identification, so Java bytecode is actually capable of differentiating between implementations with the same name based on return type alone. This fact is used by bytecode obfuscators, and can lead to bugs in decompiled code if the decompiler doesn't account for it.

masklinn 25 days ago [-]

> Java needs to be able to decide which overloaded method implementation to use at compile time, meaning you can't differentiate between method implementations based on return type alone.

That's not correct. Rust has no issue statically dispatching based on the return type.

Java-the-language does not allow it so it does not have to deal with calls which don't use the return value e.g.

    int getSomething()
    String getSomething()

If the return value is not used, you have to explicitly disambiguate this call somehow, and Java provides no way to do so.

nebulous1 25 days ago [-]

What's not correct? I said Java needs to decide the return type, not Rust, or some other language.

paulddraper 25 days ago [-]

> Java needs X, meaning you can't Y

X and Y are certainly both possible, even if the Java languages chooses not to have Y.

kelnos 24 days ago [-]

Parse error: implicitly the author intended "Java needs X, meaning you can't Y in Java".

Please read charitably and give the poster the benefit of the doubt.

masklinn 24 days ago [-]

> Parse error: implicitly the author intended "Java needs X, meaning you can't Y in Java".

Except the second clause does not generally follow from the first yet is presented thus. You can't Y in Java because Java very specifically decided not to support Y.

paulddraper 24 days ago [-]

The correct statement is

1. Java needs X.

2. Unreleatedly, Java chose not to Y.

masklinn 25 days ago [-]

You're stating "you can't differentiate between method implementations based on return type alone" which is nonsensical both on its face (people have no issue doing it) and technically (since several languages do exactly that).

And thus "Java needs to be able to decide which overloaded method implementation to use at compile time" is not an issue relevant to return-type overloading.

There was no issue with Java implementing return-type overloading, the creators of Java decided not to do it for non-technical reasons.

nebulous1 25 days ago [-]

Oh, I see. My language may have been imperfect, but I think the meaning is clear. Obviously it's possible to differentiate based on return type, by very definition of it having a different return type, and also that the mentioned bytecode does just that. I see no reason why you would assume that the second half of the sentence was supposed to apply to languages generally when the first half mentions Java specifically. "Java needs to decide if you want a method to be public at compile time, so you can't leave out the public keyword and expect it to be public". Not true, says you, they could have made public methods the default.

Anyway, I think we understand each other now, and I shall endeavor to be clearer in the future.

justin_oaks 25 days ago [-]

It could be allowed. Consider the case of var-args method called with a null value. Say you call the method ambiguousNull(String ...args) like so:

    ambiguousNull(null);

Should args be null? Or should args be new String[]{null}? In the absence of any more information the compiler chooses the first, but may still issue a warning. But you can clearly disambiguate by casting:

    ambiguousNull((String)null); // args is null
    ambiguousNull((String[])null); // args is a 1-element array

guelo 25 days ago [-]

So given

  int something() { }
  String something() { }

A hypothetical calling syntax could be

  (int)something();

Or the compiler could even choose one for you with a warning. Seems like a good suggestion for Java 10.

lgas 24 days ago [-]

It seems like a in the case of var args, the args should always be an array, since the meat of most var args based functions is iterating over that array.

bradleyjg 25 days ago [-]

Even if you have a language that forces you to assign the return value, what does the compiler do in a situation like:

  Integer getSomething()
  String getSomething()

when the calling site is:

  Object something = obj.getSomething();

tazjin 25 days ago [-]

The compiler would tell you that it can't disambiguate that and fail. It's a type error.

bradleyjg 25 days ago [-]

I guess that makes sense. It is similar to:

  void putSomething(Integer i)
  void putSomething(String i)

and

  obj.putSomething(null)

which I think throws a compiler error in java.

Filligree 25 days ago [-]

The Haskell solution is to require a type annotation if the type is ambiguous.

Of course, Java provides no way to do that.

Manishearth 25 days ago [-]

Well, I presume `String something = getSomething(); Object s = (Object)something;` would be okay.

Note that in Rust we don't have inheritance (not exactly the same kind, at least), so it's rare where a statically typed value can be referred to with a different type. So there are still annoyances in Java with this approach that Rust wouldn't have.

areyousure 25 days ago [-]

This and other questions are discussed at http://stackoverflow.com/a/442291

bmc7505 25 days ago [-]

I was surprised to learn that in Kotlin, it is possible to disambiguate overloaded functions based only on their return type. I had no idea the JVM even supports such semantics. [1]

[1]: http://stackoverflow.com/q/42916801/1772342

mbel 25 days ago [-]

According you your link JVM does not support this feature. It's implemented by Kotlin itself. I guess it's probably some kind of name-mangling scheme.

masklinn 25 days ago [-]

According to the link Java/Javac does not support the feature.

The JVM does, `bar(foo: List<String>): String` can be compiled to `bar(Ljava/util/List;)Ljava/lang/String;` and `bar(foo: List<Int>): Int` to `bar(Ljava/util/list;)Ljava/lang/Integer;` or somesuch, there is no ambiguity at the bytecode level.

In fact, the documentation for Class#getMethod specifically outlines this issue[0]:

> Note that there may be more than one matching method in a class because while the Java language forbids a class to declare multiple methods with the same signature but different return types, the Java virtual machine does not.

[0] http://docs.oracle.com/javase/8/docs/api/java/lang/Class.htm...

mbel 25 days ago [-]

The article describes Dalvik which is different from JVM, so it's not really a proof.

But you are actually right. It looks that JVM bytecode includes full function signature in function invocation: https://www.ibm.com/developerworks/library/it-haggar_bytecod... (if I'm reading examples correctly).

masklinn 25 days ago [-]

> The article describes Dalvik which is different from JVM, so it's not really a proof.

My comment describes the actual JVM, hence having linked to official Java/JVM documentation.

mbel 25 days ago [-]

Sorry, I misread it.

barrkel 25 days ago [-]

It could hardly be otherwise, or the JVM would need to do overload resolution at runtime, which would not be a pleasant experience for anyone.

mbel 25 days ago [-]

Well, it could just use function name and leave it to javac/kotlin/other JVM front end to handle overloads and generate unique names. Although that wouldn't work nice with refection and other run-time features I guess.

masklinn 25 days ago [-]

It would also screw up interop. You'd basically have the issue everyone has when trying to call C++.

bmc7505 25 days ago [-]

Note that the first example compiles without any further changes, producing valid JVM bytecode.

mooman219 25 days ago [-]

Not only is this possible, as stated by the article, this level of overloading is incredibly useful for in a number of areas and has been used for a long time.

One use case is API backwards compatibility. If your API wants to change the return type of a function, say from int to double, but also wants to maintain binary backwards compatibility, you can do that. See OverMapped [ https://github.com/Wolvereness/OverMapped ].

Obfuscation is another area and ProGuard employs this to make decompliling more difficult iirc.

matharmin 25 days ago [-]

This is about Dalvik bytecode format, but the same applies to standard Java bytebode files. Practically any obfuscated Java code will have this, which makes reverse engineering much more difficult without the tools to handle it.

HighlandSpring 25 days ago [-]

Why is it not as simple as mapping the method call instructions to returntype_methodname format? Am I missing something?

Flowdalic 25 days ago [-]

What if the call site does not use the returned value?

masklinn 25 days ago [-]

The bytecode encodes that information, the method identifier includes name, parameter types and return type.

tokenizerrr 25 days ago [-]

It's still present in the bytecode (since the VM has to know which method to call).

barahilia 25 days ago [-]

Yes. Because of overloading there may be 2 functions with the same name and the same return type but different arguments. So you need to include them too.

tokenizerrr 25 days ago [-]

No. Java itself handles the differing arguments.

MichaelGG 25 days ago [-]

Well one obfuscation technique is to change the argument types. So for instance, you might have foo(string). But after obfuscation, you move the code of foo(string) into foo(object). So a decompiler ends up showing:

  String s = "123";
  foo(s);

Java will call foo(string) if this is re-compiled, which is wrong. The decompiler would need to show foo((Object)s).

On the CLR you can go even further and erase a lot of the type info and just pass objects around. That is, you can just change all your method signatures to foo(object, object, object) no matter which classes (not structs) you pass. Or have even more fun and randomize the classes. So now foo(string) becomes foo(DBConnectionInfo).

tokenizerrr 25 days ago [-]

For that specific example that would actually be fine in Java. Everything is an Object, so even if you have foo(Object) you can just pass it a String and that is then fine both for the JVM and the Java compiler.

Since the JVM identifies methods by their name + argument types this would most likely break overloading unless done very carefully. I've decompiled quite a few Java applications and haven't seen anything like this either, but perhaps I've just had the luck to not yet run into it.

MichaelGG 25 days ago [-]

It wouldn't be fine. It'd call the wrong method. The bytecode invokes the foo(object) one, but a naive decompiling would invoke foo(string). A cast would be required to get it to resolve correctly. Or am I misunderstanding?

tokenizerrr 25 days ago [-]

I'm talking about decompiled code, which would turn out looking something like http://ideone.com/HCOLU4 which compiles and runs just fine. Maybe I am misunderstanding what you mean?

barahilia 25 days ago [-]

Very interesting. Technically this should work in Dalvik opcodes too. But I do not know if there any runtime checks done by VM. I have never seen such technique used in Android.

Animux 25 days ago [-]

Do you have some examples for tools to handle the deobfuscation stuff?

DorothySim 25 days ago [-]

The same thing exists in .NET IL where you can overload methods based only on return values (among other interesting things like modopt/modreq [0] etc.).

[0]: http://stackoverflow.com/a/5294456

Gaelan 25 days ago [-]

> Any compiler of sound mind and memory will issue an error

GHC would beg to differ.

masklinn 25 days ago [-]

Also rustc, though it doesn't quite have overloading in the Java sense: https://is.gd/s0oF79

floatboth 25 days ago [-]

GHC doesn't either. Rust traits are heavily inspired by Haskell typeclasses… so it's like in your example.

mayoff 24 days ago [-]

Swift has overloading in the Java sense, and does allow overloading by return type.

mattnewton 25 days ago [-]

That was my first thought, that this is being presented as a fundamental tenant of compilers and really it's a design choice that a whole family of languages, some of them that people even use, didnt make.

cjensen 24 days ago [-]

Ada also allows return type overloading.

A complaint against Ada is that it is hard for the compiler to figure out the right overloading in a complex statement with many nested function calls. My compilers prof took the time to show how to do it, and why it doesn't take much time.

Too bad C's automatic conversions prevent this from being used in C++.

mnarayan01 25 days ago [-]

It seems like this is simply a disassembler error (albeit an understandable one). Am I missing something?

Edit: Based on the responses below, I guess the point is that the disassembler can't generate Java code that will "naively" (wrong word, but I can't think of a better one) generate the same output. Notable (I assume) in that name munging would be problematic outside the current compilation unit.

burkaman 25 days ago [-]

No, Java class files usually include the names of variables and functions, so this isn't the disassembler's fault. The class file actually had two functions with the same name. You could certainly implement an anti-obfuscation layer to detect stuff like this, but I wouldn't call it an "error" as is.

ciniglio 25 days ago [-]

I think it's a common expectation that a disassembler should provide output that is valid to be compiled, and that therefore this is an error.

burkaman 25 days ago [-]

Sure, but it's also expected that it should provide output that could be compiled to produce the input, and in this case it's impossible to satisfy both those constraints. The best thing would probably be to leave a comment in the generated source code explaining the problem, and provide an option to rename overlapping functions.

wtetzner 24 days ago [-]

It's not necessary always possible to output valid-to-compile Java sources. If the bytecode came from a different JVM language, then there are times where javac can't emit certain bytecode patterns.

shawnz 25 days ago [-]

I would argue that decompilers are primarily reference tools (i.e., a more readable disassembly). It is wrong to see them as source code recovery tools because they will never be able to capture every aspect of the original program. So it doesn't make sense to have as a primary goal the ability to provide output that can be compiled again. It is more important that they more faithfully represent the disassembly.

unsoundInput 25 days ago [-]

Well, it's not a jvm class file but a dalvik Dex file. A disassembler which can't generate compilable code from this valid Bytecode is incomplete or in other words: buggy.

0x0 25 days ago [-]

Not if you are talking about compilable java code, because as the article explains, there are things you can express in byte code which you can simply not express in java source code.

barahilia 25 days ago [-]

Indeed. The lower level language must be more expressive by the "definition". It is more difficult to write but allows fine grain control. This is the reason some optimization and obfuscation tricks are done in Assembler (native world, not Java). And hence disassembler simply can not re-translate it back.

unsoundInput 25 days ago [-]

There are two valid ways for a disassembler mitigate this: a) decompile to a language in which the bytecode can be expressed (in a concise / expresive manner, Java would always be a "possible" target because of turing completeness) or b) accommodate for the fact that there could be signature collisions in java, e.g. by prefixing/suffixing the method name

0x0 25 days ago [-]

If you change the method name you end up with code that acts differently, just imagine something that does something like this pseudocode:

if (!new Exception().getStackTrace().getSha1sum().startsWith("0000")) alert("hello decompiler")

Your comment about java and turing completeness doesn't make sense unless you want the decompiler to basically output a java implementation of a JVM?

jevinskie 25 days ago [-]

Dare [0] emulates the Dalvik VM's runtime behavior to generate verifiable (for the vast majority of cases) Java bytecode from dex bytecode.

[0]: http://siis.cse.psu.edu/dare/

shawnz 25 days ago [-]

What did the disassembler do wrong? It just happens that there is no valid Java code which could produce that (valid) bytecode. What should it have outputted instead?

tokenizerrr 25 days ago [-]

Output the method names as a_void and a_string instead of just a?

shawnz 25 days ago [-]

Then it wouldn't be true to the real structure of the program (although it would be true to the real behaviour of the program). Why is that less wrong?

tokenizerrr 25 days ago [-]

> Why is that less wrong?

It compiles.

drdrey 24 days ago [-]

But may not run.

bluejekyll 25 days ago [-]

It's obviously not impossible to create legit Java code during disassembly. This is like an arms race; go ahead and dissemble my code, but I've made sure that you now need a better disassembler to produce valid code.

Eventually that disassembled will come along, and then more tricks will be used by the obfuscators...

jerf 25 days ago [-]

This is not a proper arms race; it terminates with a victory for the disassemblers in a finite and feasible amount of work.

There is an arms race in whether the resulting output is at all human-comprehensible. The only thing preventing that from terminating with victory for the obfuscaters is that the obfuscators have a finite technical budget with which to obfuscate. Changing identifiers is nearly free, but as you go beyond that and start rewriting the source code itself they start incurring performance penalties and increasing odds that the rewrite will fail as they get more aggressive.

drdrey 24 days ago [-]

It would fail at runtime if this method is called by reflection.

tokenizerrr 24 days ago [-]

Yes. That reflection code would have also failed the moment the obfuscator did it's thing, so that is not really a concern. The two are not really compatible.

barahilia 25 days ago [-]

In basic blocks (no conditionals or loops), a disassembler mostly does a mechanical job translating opcodes to the appropriate Java code and aggregating expressions. But it doesn't rename functions and or classes. They are left as were in the bytecode.

jandrese 25 days ago [-]

No, it's a mismatch between what the compiler accepts and what the JVM can execute.

The interesting part to me is that it seems that Java would be perfectly capable of differentiating between methods by return type if the compiler was tweaked slightly. Is there a reason why this isn't a formal language feature?

barahilia 25 days ago [-]

As explained in the post, function call is a full expression for which appropriate function should be found. If there 2 functions with the same name and param types it would be impossible to compile such an expression in its own.

masklinn 25 days ago [-]

Not necessarily, you'd just need a syntactic mechanism to disambiguate.

samatman 24 days ago [-]

And have one, since Java either throws away the return value, or assigns it to a variable the type of which is known.

Edit: here's the edge case though. You can call a function and use the return value directly as a parameter: foo(bar()). It's possible to have two foo that take both possible bar return types, at which point the compiler is stuck.

It could require a cast in this instance, however. The more I think about this the more I wonder why this isn't possible.

masklinn 24 days ago [-]

> here's the edge case though.

That's no more an edge case than the cases where you throw away the return value or you're binding to an ambiguous type e.g. `A getFoo()`, `B getFoo()`, `Object foo = getFoo()`.

> The more I think about this the more I wonder why this isn't possible.

The Java spec does not say, for C++ Stroustrup states it's

> to keep resolution for an individual operator or function call context-independent.

the Java reason is likely also some sort of Principle of Least Surprise claim.

cremp 24 days ago [-]

It should be noted that the JVM allows for things that javac does not. This is one of those things, and describes it accurately, albeit in the android java environment, however it applies to the standard JVM too.

One of the more fun things I've done is use the ASM and BCEL libraries to make (and unmake) these kinds of manipulations (manipulations that javac won't let you do.)

anonymousDan 25 days ago [-]

I believe the technical term for the correspondence between what it is possible to compile and what is valid bytecode/machine code is fully abstract compilation. It's an interesting concept with many interesting implications (e.g. for security). In the past at least there were various examples of Java programs that were illegal in the language but nonetheless could be created directly as bytecode and would be loaded by the JVM. This obviously becomes a security problem if your program loads bytecode dynamically and makes assumptions about its capabilities at the language level as opposed to the bytecode level.

drdrey 24 days ago [-]

If you load untrusted code dynamically, it strikes me as wrong to assume anything about its capabilities. Even more so "at the language level". Untrusted code can do anything unless you sandbox it.

anonymousDan 24 days ago [-]

You're right but I guess it's perhaps easy to overlook, e.g. that if you decompile/disassemble a valid bytecode program it might give you a program that is not valid at the source level. Some interesting examples and discussion here: http://lambda-the-ultimate.org/node/5364

tenkeyless 25 days ago [-]

This is a well known problem in the field of decompilers and disassemblers. It figures that the pseudo-code it outputs for generic compilers is pretty good, but when encountered with a man-made assembly or byte-code, they go places.

_old_dude_ 25 days ago [-]

Even the Java compiler uses that trick, by example with a bridge method.

  public static void main(String[] args) {
    class Fun implements Supplier<String> {
      public String get() { return null; }
    }
    
    Arrays.stream(Fun.class.getMethods())
      .filter(m -> m.getDeclaringClass() == Fun.class)
      .forEach(System.out::println);
  }

kuschku 25 days ago [-]

why not use .getDeclaredMethods()?

_old_dude_ 25 days ago [-]

no real reason :)

0x0 25 days ago [-]

Another oddity to think about in java source code: In regular Java, all objects extend java.lang.Object, including java.lang.Class. So how do you bootstrap building java.lang.Object from source?

pjmlp 25 days ago [-]

You don't.

This is part of the bootstraping process of a programming language.

Usually such special types are built manually in the compiler data structures, or make use of special primitives, like native methods on Java's case.

seanmcdirmid 25 days ago [-]

java.lang.Object is just part of the VM. Same thing with native methods.

0x0 25 days ago [-]

While many of the methods in java.lang.Object have a native implementation, there are still other methods that do sport a pure java implementation: http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/...

seanmcdirmid 25 days ago [-]

Sure, but...if you look at the JVMS, there are a lot of special cases concerning java.lang.Object to help with bootstrapping.

reitanqild 25 days ago [-]

I call clickbait:

Author admits this is not valid Java. It is not even compilable.

If I read correctly it is just artifacts from a partially sucessful decompile.

Interesting and this discussion is interesting but this is not and have never been valid Java.

ipsum2 25 days ago [-]

never mind.

sctb 25 days ago [-]

We detached this subthread from https://news.ycombinator.com/item?id=13978371 and marked it off-topic.

usmannk 25 days ago [-]

I could see his comment just fine. What's the evidence of an HN shadowban?

Tideflat 25 days ago [-]

I, for one, can see what he posts and they aren't hidden in any way.

Search: