> The bidding model is elegant, but it’s insufficient to route network requests. To allow an HTTP request in Tokyo to find the nearest instance in Sydney, we really do need some kind of global map of every app we host.
So is this a case of wanting to deliver a differentiating feature before the technical maturity is there and validated? It's an acceptable strategy if you are building a lesser product, but if you are selling Public Cloud, maybe having a better strategy than waiting for problems to crop up makes more sense? Consul, missing watchdogs, certificate expiry, CRDT backfilling nullable columns - sure, in a normal case these are not very unexpected or to-be-ashamed-of problems, but for a product that claims to be Public Cloud you want to think of these things and address them before day 1. Cert expiry, for example - you should be giving your users tools to never have a cert expire - not fixing it for your own stuff after the fact! (Most CAs offer an API to automate all this - no excuse for it.)
I don't mean to be dismissive or disrespectful; the problem is challenging and the work is great. I'm merely thinking of the loss of customer trust - people are never going to trust a newcomer that has issues like this, and for that reason "move fast, break things, and fix when you find" isn't a good fit for this kind of product.
It's not a "differentiating feature"; it eliminated a scaling bottleneck. It's also a decision that long predates Corrosion.
I was referring to the "HTTP request in Tokyo to find the nearest instance in Sydney" part, which felt to me like a differentiating feature: no other cloud provider seems to have bidding or HTTP-request-level cross-regional lookup or whatever.
The "decision that long predates Corrosion" is precisely the point I was trying to make - was it made too soon, before understanding the ramifications and/or having a validated technical solution ready? IOW, maybe the feature that requires solving this problem could have come later? (I don't know much about fly.io and its features, so apologies if some of this is unclear or wrongly assumes things.)
That's literally the premise of the service and always has been.
> an if let expression over an RWLock assumed (reasonably, but incorrectly) in its else branch that the lock had been released. Instant and virulently contagious deadlock.
I believe this behavior is changing in the 2024 edition: https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
> I believe this behavior is changing
Past tense, the 2024 edition stabilized in (and has been the default edition for `cargo new` since) Rust 1.85.
Yes, I've already performed the upgrade for my projects, but since they hit this bug, I'm guessing they haven't.
They may have upgraded by now; their source links to a thread from a year ago, prior to the 2024 edition, which may be when they encountered that particular bug.
I see now that this incident happened in September 2024 as well.
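For anyone who hasn't hit this: a minimal sketch of the pattern being discussed, using `std::sync::RwLock`. The function names (`else_branch_can_write`, `get_or_insert`) are made up for illustration and are not from the article's code. In edition 2021 and earlier, the read guard created in the `if let` scrutinee lives until the end of the whole `if let`/`else`; in edition 2024 it is dropped before the `else` branch runs. The sketch uses `try_write` instead of `write` so that it reports the problem rather than deadlocking.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Probe: can the `else` branch take the write lock?
/// Edition 2021 and earlier: no, the read guard from the scrutinee is still alive.
/// Edition 2024: yes, the guard is dropped before `else` is entered.
fn else_branch_can_write(map: &RwLock<HashMap<String, u32>>, key: &str) -> bool {
    let ok = if let Some(_v) = map.read().unwrap().get(key) {
        true // key found, no write needed
    } else {
        // A blocking `map.write()` here is the deadlock on older editions.
        map.try_write().is_ok()
    };
    ok
}

/// Edition-independent fix: finish the read in its own `let` statement,
/// so the read guard is dropped at the semicolon before any write happens.
fn get_or_insert(map: &RwLock<HashMap<String, u32>>, key: &str, default: u32) -> u32 {
    let existing = map.read().unwrap().get(key).copied();
    match existing {
        Some(v) => v,
        None => {
            let mut guard = map.write().unwrap();
            // `entry` re-checks under the write lock, since another thread
            // may have inserted between our read and our write.
            *guard.entry(key.to_string()).or_insert(default)
        }
    }
}
```

Calling `else_branch_can_write` with a missing key returns `false` when built with edition 2021 and `true` with edition 2024; `get_or_insert` behaves the same on both.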
> Like an unattended turkey deep frying on the patio, truly global distributed consensus promises deliciousness while yielding only immolation
Their writing is so good, always a fun and enlightening read.
> New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table
Is this a typo? Why does it backfill values for a nullable column?
It seems to be a quirk of cr-sqlite: it wants to keep track of clock values for the new column. It's not backfilling the field values, as far as I understand. There is a comment mentioning it could be optimized away:
https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...
I assume it would backfill values for any column, as a side effect of propagating values for any column. But nullable columns are the only kind you can add to a table that already contains rows, and that means every row immediately has an update that needs to be sent.
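A toy model of why that hurts, assuming (as the comments above suggest) that a clock entry is tracked per row and per column. `ClockTable` and its methods are hypothetical and are not cr-sqlite's API or schema; the only point is that adding one column to a table with N rows produces N new entries to replicate.

```rust
use std::collections::BTreeMap;

/// Toy per-cell clock store: (row id, column name) -> version.
/// This is an illustration only, not how cr-sqlite stores its metadata.
#[derive(Default)]
struct ClockTable {
    rows: Vec<u64>,
    columns: Vec<String>,
    clocks: BTreeMap<(u64, String), u64>,
    db_version: u64,
}

impl ClockTable {
    /// Insert a row: one clock entry per existing column.
    /// Returns the number of entries created.
    fn insert_row(&mut self, row_id: u64) -> usize {
        self.rows.push(row_id);
        self.db_version += 1;
        for col in &self.columns {
            self.clocks.insert((row_id, col.clone()), self.db_version);
        }
        self.columns.len()
    }

    /// Add a column: one clock entry per existing row, even though every
    /// value in the new column is NULL. Returns the number of entries
    /// created, i.e. the changes that would need to be sent to peers.
    fn add_column(&mut self, name: &str) -> usize {
        self.columns.push(name.to_string());
        self.db_version += 1;
        for &row_id in &self.rows {
            self.clocks.insert((row_id, name.to_string()), self.db_version);
        }
        self.rows.len()
    }
}
```

In this model, a table with a million rows turns a single `add_column` call into a million pending changes, which matches the "every row immediately has an update" description.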
For the TL;DR folks: https://github.com/superfly/corrosion
In case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal-method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.
I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
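To make the "fast distribution without consensus" distinction concrete, here is a minimal last-writer-wins sketch: each proxy keeps its own table and folds in updates in whatever order they arrive, with no coordination between proxies. `InstanceState`, `RoutingTable`, and the per-instance `version` counter are assumptions for illustration, not Corrosion's actual data model.

```rust
use std::collections::HashMap;

/// Hypothetical record for one app instance as seen by a proxy.
#[derive(Clone, Debug, PartialEq)]
struct InstanceState {
    region: String,
    healthy: bool,
    /// Assumed to be incremented by the node that owns the instance.
    version: u64,
}

/// One proxy's local view of the world.
#[derive(Default)]
struct RoutingTable {
    instances: HashMap<String, InstanceState>,
}

impl RoutingTable {
    /// Fold in one update. Keep it only if it is strictly newer than what
    /// we already hold, so late or duplicated messages are harmless.
    /// Returns true if the table changed.
    fn merge(&mut self, id: &str, update: InstanceState) -> bool {
        let is_newer = self
            .instances
            .get(id)
            .map_or(true, |current| update.version > current.version);
        if is_newer {
            self.instances.insert(id.to_string(), update);
        }
        is_newer
    }

    /// Instance ids that are currently routable in a region, sorted.
    fn healthy_in(&self, region: &str) -> Vec<String> {
        let mut ids: Vec<String> = self
            .instances
            .iter()
            .filter(|(_, s)| s.healthy && s.region == region)
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

If the "instance went down" update (version 2) reaches a proxy before the "instance came up" update (version 1), the stale one is dropped on arrival and the proxy never routes to the dead instance. No proxy has to agree with any other proxy for that to hold.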
> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.
Did you ever consider Envoy xDS?
There are a lot of really cool things in Envoy, like outlier detection, circuit breakers, load shedding, etc…
I guess all the designers at Fly were replaced by AI, because this article is using a gray bold font for the whole text. I remember these guys had a good blog some time ago.
The design hasn't changed in years. If someone has a screenshot and a browser version we can try to figure out why it's coming out fucky for you.
Looking at the CSS, there's a .text-gray-600 style that would cause this, and it's overridden by some other style in order to achieve the actual desired appearance. Maybe the override style isn't loading - perhaps the GP has JavaScript disabled?
Thanks! Relayed.
Latest macOS Firefox and Safari both show grey on white: legible, with contrast somewhat lacking, but rendered properly for grey on white.
Not sure if that was changed since then, but it's not bold for me and also readable. Maybe browser rendering?
Also not bold for me (Safari). Variable font rendering issue?
Stock Safari on iOS 26 for me. Is it another of the 37366153 regressions in iOS 26?
Looks normal to me on iOS 26.0.1
Stock Safari on iOS.
And I think the intended webfont is loaded, because the font is clearly weird-ish and non-standard, and the text is invisible for a good 2 seconds at first while it loads :)
Please try the article mode in your web browser. Firefox has a pretty good one but I understand all major browsers have this now.
I only use article mode in exceptional cases. I hold Fly to a higher standard than that.
D'awwwwww.
It's totally unreadable.
Looks like it always has, to me.