Jay Taylor's notes


Jocko – Kafka implemented in Golang | Hacker News

Original source (news.ycombinator.com)
Tags: golang go kafka jocko news.ycombinator.com
Clipped on: 2017-04-22

What are the usual benefits of all those "x reimplemented in Go" projects? Do people see performance improvements in practice? Or is it simpler to deploy and monitor?

I know Go has all the hype at the moment, but I'm curious to know what the benefits are in practice. Those real-world reimplementation projects make for a great benchmark.


Getting rid of the zookeeper requirement is a huge plus. Adds complication to the stack when trying to automate the deployment process. Maybe you could achieve a similar thing with the Scala version however.

For the Kafka deployment itself, you wouldn't need to install Scala or Java. Just download the Go executable compiled for your architecture and away you go.


I think this is poor marketing; if you're familiar with Go, saying "written in Go" is shorthand for "low memory requirements, decent performance, no runtime dependencies / single binary deployment, straightforward compilation from source, easy to contribute to". But to everyone else it sounds like writing in Go is an end unto itself.


> decent performance

Also in Java (if not better there).

> single binary deployment

Also possible in Java

> straightforward compilation from source

ditto

> easy to contribute to

ditto

Edit: a much more expanded reply here: https://news.ycombinator.com/reply?id=13451098&goto=item%3Fi...


Yeah, Go and Java have comparable deployment. In practice, Java doesn't have single binary deployments; it has fat jars, which is similar, but the client must still have a Java runtime installed to run them. More importantly, you can't bank on every Java project supporting this out of the box. Java also doesn't have straightforward compilation; with Go, it's `go build` for nearly every project; no metadata files or build scripts. For Java, you must learn Maven, Ant, Gradle, etc just to get off the ground. I actually like Java, but I wouldn't say it's as easy or convenient as Go on the counts mentioned above. YMMV.


Go doesn't have single binary deployments either. You need a separate binary for every target platform. Java's use of a VM was intended to get around this restriction so that a single jar file would work on any platform, as long as the JVM and libraries were present. Though 20 years ago people would call it "write once, debug everywhere" but perhaps that's not much of a problem anymore.


> Go doesn't have single binary deployments either. You need a separate binary for every target platform.

You're confusing static compilation and portability. Static compilation produces a single binary artifact that can run on the platform for which it was compiled. This doesn't imply that it will run on every (or even any) other platform; in fact, it's not possible for a single binary to run on every platform.


>Also possible in Java

Really? When working with Java system-level tools I always need a shell script to manage JAVA_HOME and CLASSPATH, or an OS package where the maintainer has done that for me.

I've seen these generated, but in a way that places an artificial upper bound on the number of command line arguments. Some programs like to have tons of these, so only going up to $10 doesn't cut it.

With Go I can literally just copy the binary over and execute it.


There are quite a few AOT native compilers available for Java, actually all commercial JDK ones.


> single binary deployment

> Also possible in Java

Does not mean that's the default. Default matters quite a bit.

Not to mention the verbosity of Java, though I guess you can use other languages on top of the JVM to deal with that. But Go makes a lot of sense for these sorts of projects, which in the past would have been written in C or C++.


Funnily enough, there are complaints about Go being verbose. Just look at the error handling discussion that was on here a few days ago.


From the README: "Learn a lot and have fun"


Here the point is apparently more that no ZooKeeper is needed.


First, I'm a big fan of yours. I was recently interviewed by Go Time and I talked about how awesome Redis is, what a good example it/you've been for open source projects, and how funny your v4 RC blog post was.

My goals for building Jocko to start:

1. Learn more about distributed systems to the point I could build one. A "What I cannot build, I do not understand" sort of thing.
2. Simplify running Kafka: remove ZK, only need to install a binary, and so on.
3. Maintain compatibility, so people can just drop Jocko in and Kafka clients and anything else that uses Kafka's protocol work.

After that, I have some ideas on things to do but am waiting to get feedback from people. For instance, I think the configuration around setting how much disk space topics use could be better.


Hello, thanks for the kind words! And sorry for guessing what your goal was; I probably used the wrong words. What I meant is that, for me, the highest value of Jocko at first glance is the tremendous operational simplification that comes from not needing ZK while implementing the same protocol. Of course I guess the project is still in progress, but once it is a drop-in replacement for the Java implementation that needs ZK, this could be very interesting for many users. This project could also serve as an inspiration for Kafka developers to remove ZK as well. I guess they have already considered this idea many times, though, so there must be some rationale (based on their design choices) for keeping it.


Having distributed consensus as a distinct service (as opposed to directly building it into each application) is beneficial to a lot of companies because they can invest resources to make it operate well (in addition to being correct) and reap the benefits across all of their services. By well, I both mean having competent operations around the service as well as making it more resilient and faster to respond to various networking failures.

The downside is that in order to use something like Kafka locally (afaik?) you have to run ZooKeeper locally as well.


Also, speaking with my sysadmin hat here, Zookeeper is an insanely powerful introspection point for running distributed services. The number of services with internal cluster coordination protocols that offer similar capabilities is abysmal.


There are other service discovery tools than Zookeeper such as Consul.io, Etcd, etc.


The biggest benefit IMHO would be from making this a pluggable service within the Kafka client and server. Then, you're not tied into ZooKeeper explicitly.


Though the author has not mentioned it, I think a Go program would use an order of magnitude less memory than the Java alternative. Besides that, of course, there is the ease of deployment you have mentioned. Also, the standard built-in HTTP (1/2) client/server and JSON packages can help a lot for setting up communication between various components.


Go memory usage will definitely not be an order of magnitude less. This is something that cannot be emphasized enough: the idea that a Go program of any substantial size would use an order of magnitude less memory than a Java one is completely wrong.

First, the "proof": https://days2011.scala-lang.org/sites/days2011/files/ws3-1-H.... That's a paper from Google. The memory footprint of the Go program was 501MB, the Java version 617MB, and the Scala version 293MB. The fact that the Scala and Java versions were so far apart in memory just tells you that a lot of it depends on the specific things used. Most importantly, it should dispel the myth that Java uses more memory than Go. The JVM isn't perfect, but arguing that Java uses an order of magnitude more memory is simply wrong.

Java has a bad reputation for memory usage among people who have never run substantial programs because a Java hello-world (or other small programs) will often allocate itself a decently sized heap. Java programs also take a while to start up because of the JVM, but that doesn't make Java slow.

Go certainly has some good things. Structs can help with memory locality and avoiding some unnecessary pointers. But that's not going to help one with attaining an order of magnitude better memory usage.

But Go's garbage collector isn't made for memory efficiency, but rather short pause times. That can be great, but Go seems to have made the choice that shorter pause times are preferable to throughput (actually accomplishing work) or using less memory (might as well throw more RAM at the problem).

Now let's talk about Kafka specifically. It's been a while since I've looked at it, but Kafka actually doesn't keep the data on the JVM heap. It writes to the filesystem using the kernel's built-in async flushing and uses sendfile to write data directly from the OS page cache to sockets. In a lot of ways, Kafka bypasses its own runtime and just leans on the OS for the heavy lifting.
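For comparison, a file-to-socket transfer in Go can be sketched with plain `io.Copy`. On Linux the standard library can hand this off to sendfile(2) when copying from an `*os.File` to a TCP connection, though whether it does depends on the Go version and platform. The helper names and the temp-file "segment" are made up for this example; this is not how Jocko or Kafka is actually written.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

// sendFile streams the file at path over conn without staging it in an
// application-level buffer of our own.
func sendFile(conn net.Conn, path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(conn, f)
}

// roundTrip writes data to a temp file, sends it over a loopback TCP
// connection, and returns what the receiving side read.
func roundTrip(data []byte) ([]byte, error) {
	tmp, err := os.CreateTemp("", "segment-*.log")
	if err != nil {
		return nil, err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return nil, err
	}
	if err := tmp.Close(); err != nil {
		return nil, err
	}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	type result struct {
		b   []byte
		err error
	}
	done := make(chan result, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			done <- result{nil, err}
			return
		}
		defer c.Close()
		b, err := io.ReadAll(c)
		done <- result{b, err}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	if _, err := sendFile(conn, tmp.Name()); err != nil {
		conn.Close()
		return nil, err
	}
	conn.Close() // closing the connection signals EOF to the reader
	r := <-done
	return r.b, r.err
}

func main() {
	got, err := roundTrip([]byte("hello from the page cache"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("received %d bytes\n", len(got))
}
```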

I like Go, but it isn't magical. It certainly doesn't offer an order of magnitude memory improvement over Java. Go has some great bits, but if you're thinking that you'll get an order of magnitude memory improvement, it's not there. It might even use more memory than Java.


The benchmark in the link you gave was analysed, and the Go code fixed, by Russ. See this:

https://blog.golang.org/profiling-go-programs

A better and more up-to-date benchmark link is the following:

http://benchmarksgame.alioth.debian.org/u64q/compare.php?lan...

Here are web related benchmarks:

https://www.techempower.com/benchmarks/#section=data-r13&hw=...

The following link gives details on Java memory bloat:

https://www.cs.virginia.edu/kim/publicity/pldi09tutorials/me...

In short, Java memory bloat is not a myth but a well-known fact.


And yet it does. Try creating a simple API in Java and in Go and look at the memory of both programs; there is no need for a white paper to see that the JVM will allocate 100MB or more, where Go will start below 10MB.
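One way to check the Go side of that claim yourself is to ask the runtime how much memory it has obtained from the OS. A minimal sketch follows (the helper name `sysMiB` is made up); the number it prints varies by Go version, platform, and what the program does, so no figure is asserted here.

```go
package main

import (
	"fmt"
	"runtime"
)

// sysMiB returns the total memory the Go runtime has obtained from the OS,
// in MiB. This covers heap, stacks, and runtime-internal structures.
func sysMiB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.Sys) / (1 << 20)
}

func main() {
	fmt.Printf("runtime memory obtained from OS: %.1f MiB\n", sysMiB())
}
```

Note that `MemStats.Sys` is the runtime's own accounting; the resident set size reported by the OS may differ.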


He addressed this:

> Java has a bad reputation for memory usage among people who have never run substantial programs because a Java hello-world (or other small programs) will often allocate itself a decently sized heap. Java programs also take a while to start up because of the JVM, but that doesn't make Java slow.


In a world where we break down monolithic applications into smaller services, it's clear that the JVM's memory footprint at startup is way bigger than that of any other language; I'm not sure how one can argue about that.

Since those "micro services" have just a few features, their memory won't grow that much over time, so the startup memory is an important factor.


I'm not a Java expert (or even a semi-frequent user of it) but I think it would be pretty crazy if you could not override the default initial heap size and, indeed, it seems that you can: http://stackoverflow.com/questions/1951347/xms-initial-heap-...


I agree that the focus on memory size is misguided, but I do think that startup times are critical even (especially?) in production. Compounding the startup time of the JVM with classpath searching and reflection-heavy libraries and frameworks really impacts dynamic scaling and deployment. When I can set up monitoring that dynamically scales my application fleet in seconds, having to eat the cost of both a large binary (JDK + all of my various libraries) and then the startup cost of that application hurts.

I'm not saying that Java applications can't be built to be small and fast, but it's not how most of the industry has thought about building them. Golang has the benefit of not having that legacy to begin with.


Dynamically scaling a Kafka cluster in seconds is generally not something you're going to want to do anyways, regardless of what it's implemented in.

Rebalancing partitions isn't fast.


I wasn't commenting to Kafka specifically (I've never operated a Kafka cluster in a production environment), but Java applications in general.


Kafka isn't particularly memory intensive. Confluent suggests a small (5GB) heap and to let the OS filesystem caching handle the rest [0]. It is still early days in my deployment, but I believe IOPS are going to be the bottleneck.

[0] http://docs.confluent.io/1.0/kafka/deployment.html


Yeah same, the only problems we've had with our cluster have been disks; RAID 5 is a bad idea with Kafka. Other than that we've been running our cluster for almost 2 years now with the same 32GB servers, never once had a problem with memory or CPU, and we've been pushing more and more data into it.


It is in general Java itself that's memory intensive compared to Go.


NATS (and NATS Streaming) is a good implementation of fast messaging with streaming/persistence built in Go without the usual Kafka issues: http://nats.io/

Also AMPS for the commercial software version: http://www.crankuptheamps.com/


Biggest problem that I have with NATS and why I can't use it is that configuration is all done in config files. So say you want to add a new user during operation, too bad. There is an issue about trying to add this sort of support and they are resistant to it the last time I checked. They still want to use configuration and maybe provide a hot-reload. But seriously, you need a run-time API that you can use to modify access to make it useful to me.


I do remember that a straightforward C++ rewrite of Cassandra performed ~10 times better.

I imagine that a Go implementation would need orders of magnitude fewer lines of code and resources, and be three to four times more performant.


Go doesn't use less resources than Java (Source: Google [1]).

Go is only marginally (~10%) faster than Java [2].

[1] https://days2011.scala-lang.org/sites/days2011/files/ws3-1-H...

[2] https://benchmarksgame.alioth.debian.org/u64q/go.html


> Go doesn't use less resources than Java

It obviously does. Java's representation of strings and other objects, JNI coercions, necessary copying of buffers, common aliasing bugs in code which affect GC, bloatware of dependencies, etc, etc.

Knowledge of some principles protects one from being distracted by noise.


> common aliasing bugs in code which affects GC

Care to elaborate?


Your second link shows Java used about 1.7 to 30 times more memory than Go.


> Go doesn't use less resources than Java

You point to measurements [2] that show Go programs (for tasks that need memory to be allocated -- reverse-complement, regex-dna, k-nucleotide, binary-trees) using 2-3x less memory!


If at the same time we could move some responsibility from the client to the server, it would be awesome.

Kafka gives so much responsibility to the client that most implementations are incomplete and don't cope well with state changes (e.g. adding or removing a node, migrating partitions to another node, etc).

So, unless you are using Java and the official client, you don't benefit from all of Kafka's goodness, fault tolerance and scaling abilities.


I hope there will be a static "discovery" backend that doesn't use Serf, preferably just config file or similar to set/seed the Raft peers and perhaps commands to add/remove peers from Raft.

Good to see more users of hashicorp/raft though, it's definitely the implementation that is looking to have the better long term future.


Since reading through the source of the Kafka server (and several client implementations), I've been intrigued by the idea of reimplementation using cooperative multitasking.

Kafka clients tend to be very resource intensive. It would be interesting to support a smaller/scalable footprint and a lower latency mode of operation.


I'm guessing zookeeper was killed because Raft was implemented here. Have you experimented with Jepsen to see how this handles network partitions?


What are the downsides to Zookeeper, or the motivations for keeping it out of the stack?


Yet another open source project where the front page doesn't tell me what it does.


Is it wire compatible with Kafka?


And, which version(s) of the Kafka protocol?


fantastic project -- I've been waiting for someone to tackle this!



