Jay Taylor's notes

back to listing index

Show HN: Distributed SQLite on FoundationDB | Hacker News

[web search]
Original source (news.ycombinator.com)
Tags: database distributed-systems foundationdb sqlite news.ycombinator.com
Clipped on: 2022-07-31

Image (Asset 1/2) alt=


Image (Asset 2/2) alt=
I love the idea of distributed SQLite but I’m having a hard time understanding which parts of FoundationDB and which parts of SQLite are available in this implementation.

I’m guessing virtual table extensions work with this since you’re just replacing the storage engine? So we could in theory use FTS5 and even OSQuery and other extensions right?

However since this is using FoundationDB I’m also guessing we can’t use this as a serverless embedded DB since since you’ll probably need a foundation db cluster to use this. Is that right?

So if I understand correctly this is a SQLite query engine on top of FoundationDB with distributed transactions and we can theoretically use SQLite ecosystem stuff like FTS5 and datasette on top of it.


Yes it's correct! mvsqlite integrates as a custom VFS underlying SQLite's query engine, and SQLite ecosystem stuff can be used on top of it.

"The SSD storage engine stores the data in a B-tree based on SQLite." XD https://apple.github.io/foundationdb/architecture.html#stora...

Per your link they are leaving sqlite.

In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new Redwood engine.


(I'm on the osquery steering committee)

In theory osquery is "just" virtual tables, but in practice there's quite a bit more that would probably make attaching it to mvsqlite. If you have a use case in mind I would love to know!


One thing I've been curious about with FDB (need to find time to try this myself) is using FDB as a way to easily implement replication with consistency.

For example: You have 5 Postgres instances. You send the query "SELECT * FROM TABLE" to FDB, you want the result of this from any of the 5 Postgres's (first to return wins). When you insert, you want to insert into all 5 and make sure that all 5 have actually finished the transaction before telling the client.

Seems simple enough to implement via FDB?


I think Kafka is enough for this use case? FDB should work too but sounds like overkill.

Right, my question is more around replicating any stateful system. Isn’t Kafka not partitionable?

Good work! I don't understand the innards of this at all, but I love the design.

Iirc when I first read through the book designing data centric applications, the author talked about a lot of trade-offs for data storage and replication and network connectivity issues.

the impression I left with was for my particular application that foundation DB was the best option I had for my wild dreams of web scale popularity.

the current data persistence layer I use is sqlite, which means if I use mvsqlite, that only makes it easier for me to try to use foundation DB for my someday no doubt irresistible web application.


I think foundationdb uses sqlite as its tablet kv engine? It’s sqlite all the way down

They are leaving sqlite.

In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new Redwood engine.

https://apple.github.io/foundationdb/architecture.html#stora...


Interesting… doesn’t seem to have a lot of docs on this. Surprised they didn’t just go along with RocksDB

It supports RocksDB also, although I think that feature is a bit more experimental.

> Surprised they didn’t just go along with RocksDB

Given how often they're butting heads publicly, maybe Apple didn't want to use something developed by Facebook? It probably wouldn't normally be a concern due to the licenses of each, but it's possible that corporate politics might still be relevant.


Does this work similar to rqlite or dqlite from a usage standpoint, or does it solve a different use case?

This really, IMHO (as someone implements things on top of SQLite too https://dflat.io) pushes SQLite too far as the implementation of cross-db transactions have some big issues: https://www.sqlite.org/limits.html (the number of attached databases cannot exceed 10 or 125 (if you compile your own)) https://www.sqlite.org/wal.html (in WAL mode, there is no transactional guarantee for cross database transactions (atomic per database, but not cross database))

Anyone interested this topic would also be interested in the discussion at https://news.ycombinator.com/item?id=32285435. mvsqlite might be able to handle cross-database transactions in the storage layer(=mvsqlite's mvstore backed by FoundationDB's fully serializable transactions) rather than sqlite's native one so I think there should be no theoretical limit due to SQLITE_MAX_ATTACHED.

Update:

But yeah, mvsqlite as of today seems to be tied to one sqlite per mvstore so it's attaching databases to the sqlite instance is the only way to deal with multi databases. So, it will naturally be affected by SQLITE_MAX_ATTACHED.

Perhaps a potential big idea would be to have a sqlite instance per database and a "proxy" layer above sqlite instances to (1)obtain/set a global transaction ID per multi-db transaction and (2)redirect queries to each involved sqlite databse, while the serializability is guaranteed in the FoundationDB.


mvsqlite doesn't rely on the SQLite WAL or rollback journal for safety. Actually, it enforces journal_mode=memory! The atomic commit semantics is guaranteed by FDB instead.

The limit on max number of open DBs looks low, but maybe short-lived attaches can be done? Like attach just before the transaction and detach after commit. This should be enough for common transactions involving just a few DBs.


I am a bit out of depth here for obvious reasons (I work above SQLite, not under SQLite). But from my understanding, journal_mode=memory simply means it is in rollback mode, but rollback pages are not maintained on disk. Therefore, the modifications from a transaction will be applied in place in these pages. If the process that runs SQLite crashes in the middle of a transaction, these details will be leaked unless you know how to rollback these pages (through rollback journal, you can alternatively found these from FDB, but you should have no idea which one corresponding to which transaction?).

More over, without WAL, SQLite to the best of my knowledge would require read-lock on every read as well, practically becomes not only single-writer, but also single-reader (you probably can break the global lock (that through VFS layer)? But it doesn't make this thing safe (i.e. you will have things in a transaction leaked?)).


> Therefore, the modifications from a transaction will be applied in place in these pages.

With mvsqlite it is applied to the transaction's own snapshot of the database. Changes are not visible globally until transaction commit. That's what we get from page-level MVCC.

> More over, without WAL, SQLite to the best of my knowledge would require read-lock on every read as well

Reads are also MVCC. Data is fetched from a consistent and clean snapshot of the database, without uncommitted data.


This approach provides SERIALIZABLE transaction isolation right?

What happen if there are page conflicts during the write? The transaction fails and you retry? Does SQLite have a good way of of signalling that this is the case (vs for example a network failure or other write failures).


Yes it's the serializable isolation level.

For applications targeting upstream SQLite, mvsqlite enables pessimistic locking by default - when a transaction is promoted to EXCLUSIVE, it acquires a one-minute lock lease from mvstore. At this point we have the chance to fail gracefully and return a "database is locked" error if multiple clients want to acquire lock on the same DB. This is a best-effort mechanism to prevent conflict on commit (which causes the process to abort).

A future feature is "MVCC-aware clients". Compatible clients can opt-in to full, optimistic MVCC, and avoid pessimistic locking. After COMMIT, the client should call a SQLite custom function provided by mvsqlite to check whether the commit actually succeeds, and retry if not.


So the idea is that a small business could start on SQLite, and then switch over to this when it's time to scale, without re-writing it in the Postgres dialect?

Regardless, it's very very cool. Would love to see it get turned into a product.


FDB is a nice piece of technology if you know how to go around its constraints. Congrats on the project.

If you've written your own multi-versioning, what does FDB bring to the table that you couldn't have gotten out of other distributed but non-transactional KV stores? E.g. Cassandra, etc. Isn't there an overhead to the MVCC aspect of FDB? And it sounds like you've had to jump hoops around things to get past its duration and size limits, as well...

Cassandra is a column store, surely that would be a very un-natural mapping for a SQL Engine backend to use? An orthogonal mapping, to be precise!

Cassandra is not column oriented. It coined the phrase “wide column” (which has caused untold confusion), but this just meant row-oriented without a schema and unlimited numbers of dynamic “columns” in each row. This is no longer true anyway, it is simply row-oriented now in the normal sense.

> But a group of N sqlite databases is an N-writer database. And mvsqlite provides the necessary mechanisms to do serializable cross-database transactions without additional overhead.

I'm confused, are these databases planned to be replicated? Or is it expected for the databases to have separate schemas?


Replication is handled by FDB so you don't need to care about it on the application level. These databases can contain partitioned data of your application, like one DB per user, so that a transaction on only user A and another one on only user B won't conflict.

Did you intend "shard", or s/shard/partition/?

I guess "partition" is the better wording. Updated my comment.

This sounds really cool, do you have any source code in a workable state yet or is this project still in the formulative ideation phase?

If there's anything concrete so far, I'd love to take a look and/or try it out!


It already works! There are steps to try it in readme.

Nice, I dug a bit and found the git repository:

https://github.com/losfair/mvsqlite

Would be great to add this link to the body of your story to make it easy for HNers to get to the thing :)

If no longer editable, consider emailing moderator Dang (hn@ycombinator.com).


Comdb2: distributed sqlite: http://comdb2.org/


Any chance this could get Jepsen tested? I’d donate to make that happen.

Haven't you heard of bedrock by expensify?

blockchain? different solution.


Discussions on similar submissions:

  1. Distributed, MVCC SQLite that runs on FoundationDB (July 30, 2022 — 6 points, 0 comments)
  2. Distributed, MVCC SQLite that runs on FoundationDB (July 29, 2022 — 1 points, 0 comments)
  3. Show HN: Mvsqlite – Distributed, MVCC SQLite That Runs on FoundationDB (July 27, 2022 — 2 points, 3 comments)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: