Jay Taylor's notes

Peloton – a relational database designed for autonomous operation | Hacker News

Original source (news.ycombinator.com)
Tags: database peloton autonomous news.ycombinator.com
Clipped on: 2017-04-22

OP here. Peloton has been posted here before, but didn't get any attention.

I think this database is very interesting even if you don't care about the time-saving part of it, since it claims to be a hybrid (OLAP and OLTP), implements the Postgres wire protocol, and claims to compile queries to machine code using LLVM [1].

[1]: https://www.youtube.com/watch?v=mzMnyYdO8jk (slideshow: http://www.cs.cmu.edu/~pavlo/slides/selfdriving-nov2016.pdf)


Andy Pavlo always tries to spice things up and his lectures and presentations are a treat.

He is on a list of mine that fits on ten fingers. James Mickens is on it too.

His work on H-Store was great. I spent 6 years working on VoltDB, which is a commercial spinoff of H-Store, and it was a formative experience for me.


Strange talk and slides, I am not sure if he is serious or not.


He's joking about the rehab, I'm sure... that's his style (I like it personally).

BTW, here are the video lectures to the graduate database course he mentioned in the presentation, where students were developing features for Peloton as part of the course (they're great IMO):

https://www.youtube.com/watch?v=MyQzjba1beA&list=PLSE8ODhjZX...


In person, in lectures, and in videos, he has a very different style from many others.


The lectures are great.


Thanks for the shout-out!


Side note, but I really dislike the current trend (in in-memory databases, to be clear) of not bothering to include any real provisions for durability and justifying it by saying "NVRAM exists." It effectively doesn't for anyone who needs to deploy to off-the-shelf environments, and it's super expensive (and if you're going for performance, as most of the research projects are, compensating by running the database in a clustered configuration would be counterproductive). Are there any cloud providers that offer NVRAM in any configuration?


But it provides a dead-easy way to publish research, claim insane speedups, and not worry about disk journals, caches, in-flight data corner cases when a VM is snapshotted, etc.


Flash storage is NVRAM, so yes, hosting companies offer it.


There are many types of NVM, and not all of them are available from most hosting providers. One of the big players offering SSDs in the cloud is DigitalOcean.


Not in the sense that people mean in these papers, and you know it. It doesn't have even close to the same performance characteristics.


The idea of write-behind logging is slick.

http://www.cs.cmu.edu/~pavlo/papers/p337-arulraj.pdf


Thanks :) We believe that non-volatile memory (NVM) will be a game-changer for database management systems [1].

[1] https://www-ssl.intel.com/content/www/us/en/architecture-and...


Does anyone know what happens after the query plan is generated in most databases? I'm assuming the individual steps, like index scan and hash join, are already coded, and the plan steps are iterated over and the respective methods called? So the execution steps are already compiled, but the step traversal is kind of interpreted. With Peloton's LLVM engine, is everything merged together into a single sequence of machine code?

How much advantage does this give you? Are there really that many steps in the execution plan (the visible steps are usually < 50)? And what about the internal, actually compiled steps? Unless this allows merging and further simplification that identifies redundant operations and trims them, I'm not sure where a 100x performance improvement comes from.

Though I remember seeing a Scala-based in-memory query engine that did a similar simplification of the actual steps and did very well in benchmarks; maybe this is similar.
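
The interpreted-versus-compiled distinction raised in this question can be illustrated with a toy sketch. It is written in Python under stated assumptions and is not Peloton's LLVM engine: the operators `scan`, `select`, and `project`, the two `run_*` functions, and the sample query are all hypothetical. In the first style each plan step is a generic, already-written operator and every row crosses several operator boundaries. In the second the same plan is collapsed into one loop specialized for this query, which is roughly what a query compiler aims to emit as machine code. The sketch shows only the shape of the two approaches and says nothing about how large the speedup is in practice.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]

# Interpreted style: generic operators written once, wired together per query.
# Each row is pulled through every operator in the tree, one call at a time.

def scan(table: List[Row]) -> Iterator[Row]:
    for row in table:
        yield row

def select(child: Iterator[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    for row in child:
        if pred(row):
            yield row

def project(child: Iterator[Row], column: str) -> Iterator[int]:
    for row in child:
        yield row[column]

def run_interpreted(table: List[Row]) -> int:
    # Hypothetical query: SELECT SUM(price) FROM t WHERE qty > 10
    plan = project(select(scan(table), lambda r: r["qty"] > 10), "price")
    return sum(plan)

# Compiled style: the same plan fused into a single query-specific loop,
# with no operator boundaries left between scan, filter, and aggregate.

def run_fused(table: List[Row]) -> int:
    total = 0
    for row in table:
        if row["qty"] > 10:
            total += row["price"]
    return total

if __name__ == "__main__":
    sample = [
        {"qty": 5, "price": 100},
        {"qty": 20, "price": 7},
        {"qty": 11, "price": 3},
    ]
    # Both styles must agree on the answer; only the execution shape differs.
    assert run_interpreted(sample) == run_fused(sample)
    print(run_fused(sample))
```

Note that the plan here has only three visible steps, yet the interpreted version still pays a per-row cost at every operator boundary. The overhead scales with the number of rows, not with the number of steps in the plan.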


I wonder why they try to support both OLTP and OLAP workloads. Supporting both of these workloads requires too much work (both row and columnar storage types, different algorithms for both storage and querying, etc.), and they haven't even proven that autonomous systems (which is the main point of the project) can replace the existing databases.


Great question! There happens to be an autonomous mechanism for supporting hybrid workloads (OLTP & OLAP). Peloton supports hybrid storage layouts that are automatically and dynamically adapted over time based on the workload patterns. Row and columnar storage types are special cases of hybrid storage layouts.

This is a promising area of ongoing research. If you are curious about this kind of autonomous tuning of storage layout, you might want to check this out [1].

[1] https://www.cs.cmu.edu/~jarulraj/papers/2016.tile.sigmod.pdf
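
The statement that row and columnar storage are special cases of a hybrid layout can be made concrete with a small sketch. It is a simplified illustration in Python, not Peloton's actual tile implementation: the `Layout` representation (a list of column groups) and the helpers `store` and `read_column` are hypothetical names chosen for this example. A layout with one group holding every column behaves like a row store, a layout with one group per column behaves like a column store, and anything in between is a hybrid.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, int]
# A layout is a list of column groups (hypothetical representation).
Layout = Sequence[Tuple[str, ...]]
Tiles = List[List[Tuple[int, ...]]]

def store(rows: List[Row], layout: Layout) -> Tiles:
    """Physically store the rows as one list of tuples per column group."""
    return [[tuple(row[c] for c in group) for row in rows] for group in layout]

def read_column(tiles: Tiles, layout: Layout, column: str) -> List[int]:
    """Read one logical column, touching only the group that contains it."""
    for tile, group in zip(tiles, layout):
        if column in group:
            idx = group.index(column)
            return [values[idx] for values in tile]
    raise KeyError(column)

ROW_LAYOUT: Layout = [("id", "qty", "price")]        # one group: row store
COLUMN_LAYOUT: Layout = [("id",), ("qty",), ("price",)]  # one group per column
HYBRID_LAYOUT: Layout = [("id",), ("qty", "price")]  # something in between

if __name__ == "__main__":
    rows = [
        {"id": 1, "qty": 5, "price": 100},
        {"id": 2, "qty": 20, "price": 7},
    ]
    for layout in (ROW_LAYOUT, COLUMN_LAYOUT, HYBRID_LAYOUT):
        tiles = store(rows, layout)
        # The logical contents are identical under every physical layout.
        assert read_column(tiles, layout, "price") == [100, 7]
```

Because the logical contents do not depend on the layout, a system is free to change the grouping as the workload changes. How Peloton decides when and how to regroup is described in the linked paper and is not modeled here.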


I guess it is a current trend for modern MMDBs (MemSQL, HyPer, etc.) to support both OLTP and OLAP workloads. You can check out the git repo and give it a spin to see if it holds up to the claims.


http://www.memsql.com/ does this today. Fast, distributed, rowstore + columnstore, relational database with the MySQL protocol.


However, Peloton also aims to be an autonomous system. That's a lot for undergrad and grad students, so I'm not sure he wants Peloton to be stable in the near future.


It also fits the niche of people who want both possibilities, though the onus is on the authors to show that it actually is just as good.


This sure has a lot to live up to: trying to do two things and do them well isn't very Unix-y. There's a reason relational databases are set up with OLTP schemas (highly normalized tables for supporting transactions, etc.) and OLAP schemas (star schemas, for example, with large, sometimes flat fact and dimension tables, etc.). Also, I'm not sure about the learning part: any decent database these days will cache frequently used data, and tables can be built as in-memory ones.


> addressing your caching point

So, from my understanding, the learning part isn't about caching frequently used data; it's (attempting to be) generalized workload learning, the kind of understanding that every DBA should develop but usually doesn't.

If that is successful and is even marginally able to predict workload skews, then the scheduling of operations can be significantly more efficient -- you're essentially reducing entropy in your database massively.


Any team of database admins/engineers worth their salary plans for capacity, fixes inefficient queries, and works with development on future goals for what they want out of the database layer.


And you don't think it would be valuable to be able to automate many of those tasks?


I agree it would but my premise is that I doubt it can be.


It's very rare to have a DB that doesn't need both OLTP and OLAP workloads.

All DB-based apps quickly exhaust their requirements for transactional code and move on to "infinite reporting requests".

For a certain ERP I worked on in the past, there were at least 300 reports in the base package. Most requests were for more reports specialized for each customer. And additions to the transactional code were driven in part by the need to add more data for the reports!

So I think having both styles is exactly what "everyone" wants. Even folks who get stuck with NoSQL databases.

---

I have thought a lot about this, and I consider the ideal architecture to be a relational DB with decoupled modules that work like this:

Write:

Commands -> WAL -> WaLProcessorAndRejector -> EventLog -> EventLogDispatchToOneOrMoreOf:

- Nothing. EventLog is just history
- Caches
- Relational Tables for an up-to-date view of the data
- Columnar/Index to speed up part of the reports

Read:

ReadRequest -> ReadDispatchToOneOf:

- EventLog
- Caches
- Relational Tables
- Columnar/Index

The reason for being modular is that what is needed can change over time.
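
The write and read paths outlined in this comment can be sketched as a small in-memory model. This is one illustrative reading in Python, not a design from Peloton or from the commenter: the class `Store`, its method names, and the validation rule (rejecting negative values) are all assumptions made for the example. Every command is first appended to a WAL, accepted commands become events in the event log, and each event is dispatched to pluggable sinks that maintain a relational view and a columnar copy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass(frozen=True)
class Event:
    key: str
    value: int

class Store:
    """Toy model of: Commands -> WAL -> validate -> EventLog -> sinks."""

    def __init__(self) -> None:
        self.wal: List[Event] = []        # every command, accepted or not
        self.event_log: List[Event] = []  # accepted commands only (history)
        self.table: Dict[str, int] = {}   # up-to-date relational view
        self.column: List[int] = []       # columnar copy used by reports
        # Sinks are pluggable: add or remove them as needs change.
        self.sinks: List[Callable[[Event], None]] = [
            self._to_table,
            self._to_column,
        ]

    def _to_table(self, event: Event) -> None:
        self.table[event.key] = event.value

    def _to_column(self, event: Event) -> None:
        self.column.append(event.value)

    def write(self, key: str, value: int) -> bool:
        command = Event(key, value)
        self.wal.append(command)
        if value < 0:  # hypothetical rejection rule, for illustration only
            return False
        self.event_log.append(command)
        for sink in self.sinks:
            sink(command)
        return True

    # Read path: each request is served by whichever structure suits it.
    def read_current(self, key: str) -> Optional[int]:
        return self.table.get(key)

    def read_history(self, key: str) -> List[int]:
        return [e.value for e in self.event_log if e.key == key]

    def report_sum_of_writes(self) -> int:
        return sum(self.column)

if __name__ == "__main__":
    s = Store()
    s.write("a", 1)
    s.write("a", 5)
    s.write("b", -2)  # rejected: stays in the WAL, never reaches the log
    print(s.read_current("a"), s.read_history("a"), s.report_sum_of_writes())
```

Keeping the sinks in a plain list is what makes the design modular in this sketch: a cache or another index can be added without touching the write path.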


That's correct! This is the reason why we support both OLTP and OLAP workloads in Peloton.


We do just fine with a data warehouse and a bunch of traditional OLTP databases.


We certainly do :) There happens to be an autonomous mechanism for supporting hybrid workloads (OLTP & OLAP). Peloton supports hybrid storage layouts that are automatically and dynamically adapted over time based on the workload patterns. Row and columnar storage types are special cases of hybrid storage layouts. This is a promising area of ongoing research. If you are curious about this kind of autonomous tuning of storage layout, you might want to check this out [1].

[1] https://www.cs.cmu.edu/~jarulraj/papers/2016.tile.sigmod.pdf


How old is this project? I wouldn't be surprised to see a cease and desist from the maker of the exercise bike.


Peloton is in fact the French word for platoon. I'd be highly surprised if the bike maker had the legal standing to issue infringement claims. Just as you can't copyright the word "bicycle", "peloton" is used widely enough that they should be fine. Then again, Jade the preprocessor was forced to rebrand as Pug.


Peloton is also a Finnish word, and it means 'fearless'.


Copyright has nothing to do with this. You certainly can register "bicycle" as a trademark: Bicycle brand playing cards, for instance.


Peloton is also a Finnish word and means fearless.


Peloton is a word referring to the main group in a bicycle endurance race, so it's not the same as calling your software "Wal-Mart." In this case it seems like the fact that they are in completely different markets would be sufficient.


The first commit was:

  commit 35823950d500314811212282bd68c101e34b9a06
  Author: jarulraj <jarulraj@cs.cmu.edu>
  Date:   Thu Dec 18 16:41:48 2014 -0500

Take a look at the different graphs on GitHub, like code frequency, to get a better idea: https://github.com/cmu-db/peloton/graphs/code-frequency


Peloton also means fearless in Finnish, so the name could be based on that.


That's not how trademarks work.



