The format of the article comes across as AI-sloppy. Each section is filled with numbered lists and there are several AI-isms, such as the omnipresent "not only X but Y".
Thanks for the feedback on the formatting.
While I do use tools to help structure thoughts and edit for clarity (which might explain the lists and phrasing you noticed), the core technical analysis regarding the challenges of optical flow vs. spatiotemporal AI stems directly from our actual engineering work in building video restoration models.
The goal was to make complex concepts digestible, but I appreciate the note on style. I hope the substance of the technical argument still comes through.
If you were talking about erasure in classic film, and not the constraints of non-linear editable data streams (I and P blocks, you name it), how much of this would remain true? Yes, it's a temporal-spatial space. But, in the case of film, it consists of a sequence of static images, so erasure could be a two-phase process: 1) find the mask per image and apply it, and 2) construct an infill which respects the rest of the images.
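To make that concrete, here is a minimal sketch of the two-phase idea in Python. The callables `detect_mask` and `inpaint_frame` are hypothetical stand-ins for a segmentation model and an infill model (they are not a real library); the point is just the shape of the loop, not a working pipeline.

```python
def erase_object(frames, detect_mask, inpaint_frame):
    """Two-phase erasure over a sequence of static film frames.

    frames: list of per-frame image arrays
    detect_mask: hypothetical segmenter, frame -> boolean mask
    inpaint_frame: hypothetical infill model that accepts
        neighbouring frames as context
    """
    # Phase 1: find the mask per image
    masks = [detect_mask(frame) for frame in frames]

    # Phase 2: construct an infill that respects the rest of the
    # images, by handing each call a window of temporal neighbours
    restored = []
    for t, (frame, mask) in enumerate(zip(frames, masks)):
        neighbours = frames[max(0, t - 2):t] + frames[t + 1:t + 3]
        restored.append(inpaint_frame(frame, mask, context=neighbours))
    return restored
```

Of course, all the difficulty is hidden inside how `inpaint_frame` actually uses that context.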
The choice of a dog running down a beach is quite smart: the background has a plane of movement which is mechanistically unrelated to the dog. That's part 2): reconstructing waves lapping on the seashore. Hard. You can't do this per-image; you have to do it across the entire sequence.
I would think that even in a film model this is a genuinely complicated problem: for each static image an infill is plausible, but to maintain consistency across the image series, it has to avoid the uncanny valley in the specifics of wave motion up a beach.
This is a fantastic insight. You absolutely nailed why the 'dog on a beach' scenario is the ultimate stress test for temporal consistency.
You are right that the fundamental problem exists even in a film model composed of static images. The challenge isn't just filling the hole; it's dealing with the background's non-rigid, stochastic motion (like waves lapping).
A generative model can easily hallucinate a plausible static wave infill for a single frame. But ensuring those hallucinations transition smoothly across t-1, t, and t+1 without jittering or warping is exactly the 'uncanny valley' of motion we are trying to solve. It has to understand the physics of the wave motion, not just the texture.
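For what it's worth, one crude way to see that jitter is to warp the infilled frame t-1 onto frame t with estimated optical flow and measure the disagreement inside the hole. The sketch below uses OpenCV's Farneback flow purely as a diagnostic; it is not our production method, and the mask convention and variable names are assumptions.

```python
import cv2
import numpy as np

def temporal_jitter(prev_fill, cur_fill, prev_gray, cur_gray, hole_mask):
    """Warp the infilled frame t-1 onto frame t using optical flow
    estimated from the surrounding (real) image, then score the
    disagreement inside the inpainted hole. High score = visible jitter.

    prev_fill, cur_fill: infilled frames at t-1 and t (HxWx3, uint8)
    prev_gray, cur_gray: grayscale originals, used only for flow
    hole_mask: HxW boolean mask of the erased region
    """
    # Flow from frame t back to frame t-1, so that sampling the
    # previous frame at (grid + flow) aligns it with the current frame.
    flow = cv2.calcOpticalFlowFarneback(
        cur_gray, prev_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = hole_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(prev_fill, map_x, map_y, cv2.INTER_LINEAR)

    diff = (warped_prev.astype(np.float32) - cur_fill.astype(np.float32)) ** 2
    return float(diff[hole_mask].mean())
```

A temporally consistent infill keeps this number small even when each frame looks individually plausible, which is exactly what per-frame hallucination fails to do.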
Thanks for this thoughtful analysis.