Event Capture in (Collaborative) Systems: Patches and Storyboards
Last week, I had an interesting conversation about how we capture events. One of the things we discussed was the different ways events are recorded in software development. For instance, in Git, we use diffs (patches) to capture the difference between two versions of a state. These "states" or discrete "moments" in a process are defined on a commit-by-commit basis.
That is, whenever I publish (usually upon deciding I’ve completed a meaningful chunk of work), I’m effectively identifying a single unit of work. This information is then "stored." There are different ways to do this. One option is to store the output generated after publishing. Another is to capture the underlying work that led to each published output. But if you're working cumulatively, building on previous work, it may be more efficient to capture just the incremental changes over time. This is essentially the model Git exposes through its diffs and patches. It’s also the principle behind CRDTs (Conflict-Free Replicated Data Types) and operational transformation, the techniques that underpin collaborative editors such as Google Docs — the one I used to write this post.
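To make the contrast concrete, here is a minimal Python sketch of the two options. It is my own illustration, not Git's object model or a CRDT, and the helper names are made up.

```python
# Minimal sketch (not Git's object model, not a CRDT) of two ways to record
# each published "unit of work": full snapshots vs. incremental patches.
import difflib

def make_patch(old_lines, new_lines):
    """Capture only what changed between two published states."""
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    # Keep the opcodes plus the newly inserted text; enough to replay later.
    return [(tag, i1, i2, new_lines[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def apply_patch(old_lines, patch):
    """Replay one patch on top of the previous state."""
    out = []
    for tag, i1, i2, new_chunk in patch:
        out.extend(old_lines[i1:i2] if tag == "equal" else new_chunk)
    return out

v1 = ["Event capture in collaborative systems"]
v2 = ["Event capture in collaborative systems", "Patches and storyboards"]

snapshots = [v1, v2]                                 # option 1: store each output in full
patches = [make_patch([], v1), make_patch(v1, v2)]   # option 2: store only the deltas

# Replaying the patches in order reconstructs the latest published state.
state = []
for p in patches:
    state = apply_patch(state, p)
assert state == v2
```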
Now, imagine we took a different route entirely. Let’s call this the "storyboard" approach to state encoding. In this model, every “published” moment captures a full frame. That frame could represent either the output (what’s produced at the moment of publishing) or the input (what was written to produce that output). Storyboards have certain structural properties. If you gave me a plot, sketched each frame onto its own card, and then shuffled the cards into a random order, the plot would no longer make sense. For my storyboard to retain its integrity — to be uniquely mine — its order must also be preserved. And that’s where event ordering becomes essential.
Timestamp representations and total ordering
Returning to "timestamp" representations: we take the same situation but assume a numerical representation where, instead of abstract positions, we have a counter of event placement. This makes the assumption that, at a fine enough observational resolution, one can always discern which event was placed first. In this view, simultaneity is not possible, and every pair of events is totally ordered.
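As a toy sketch of that counter idea (my own illustration, not taken from any particular system), a single monotonically increasing counter stamps each event, so any two events are strictly ordered:

```python
# Toy sketch of the "counter of event placement" idea: one monotonically
# increasing counter stamps every event, so any two events are strictly
# ordered and simultaneity cannot be expressed.
import itertools

class EventLog:
    def __init__(self):
        self._counter = itertools.count(1)  # 1, 2, 3, ...; ties are impossible
        self.events = []                    # list of (stamp, payload)

    def record(self, payload):
        stamp = next(self._counter)
        self.events.append((stamp, payload))
        return stamp

log = EventLog()
log.record("open notebook")
log.record("edit cell")
log.record("publish")

# Sorting by stamp always yields a single, unambiguous (total) order.
assert [p for _, p in sorted(log.events)] == ["open notebook", "edit cell", "publish"]
```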
Anyway, something else we discussed was Cursor’s new Memories feature, which captures development history. Memories can be thought of as (user-provided) facts from earlier conversations within a project that may be referenced later. Intuitively, the methodology seems to center on capturing lightweight, conversational artifacts — such as design rationales, code decisions, or debugging context — as memory units that can be retrieved when relevant to the task at hand.
PRISM Paper
Lastly, it made me think of a paper I’ve been reading and referencing in my own writing, “Learning Procedural Abstractions and Evaluating Discrete Latent Temporal Structure.” Using terminology from their work, what they dub “high-dimensional observation sequences, such as video demonstrations” can be analogized to the completed film from the storyboarding described previously.
Their work describes an approach for parsing a referenceable procedure from a high-dimensional observation sequence. Put simply, given a recording of an activity, how do you extract the steps needed to replicate it? Their approach, called PRISM, is a hierarchical Bayesian model. Here, referenceability refers to the ability to identify a sequence of discrete, interpretable steps that a human can understand and follow. In contrast, many models produce representations in a continuous latent space. These take the form of smooth manifolds of numerical values, which are useful for machine learning but are not directly human-readable. As a result, such continuous outputs are considered unreferenceable within this framework.
These discrete procedural units are called “segments,” and the process of extracting them is called “temporal clustering.” Temporal clustering here is a labelling problem: an unsupervised algorithm assigns each timestep of a recording to a segment, and similar recordings whose steps have already been extracted by hand serve as the reference dataset against which the discovered labels are judged.
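To show the shape of that output, here is a small sketch with made-up labels; it is not PRISM’s inference, only an illustration of how a per-timestep labelling collapses into segments.

```python
# Illustration (made-up labels, not PRISM's inference) of what temporal
# clustering produces: a cluster label per timestep, where contiguous runs
# of the same label form the discrete, referenceable "segments."
from itertools import groupby

frame_labels = ["crack eggs"] * 3 + ["whisk"] * 2 + ["cook"] * 4 + ["plate"]

def to_segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) segments."""
    segments, t = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, t, t + length))
        t += length
    return segments

print(to_segments(frame_labels))
# [('crack eggs', 0, 3), ('whisk', 3, 5), ('cook', 5, 9), ('plate', 9, 10)]
```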
They distinguish temporal clustering from “time-series segmentation (Chung et al., 2004) and changepoint detection (Killick et al., 2012)” based on the clustering’s awareness of the events themselves and not only their boundaries.
To evaluate their approach, they consider two properties: 1) completeness and 2) homogeneity. Both come from the V-measure introduced by Rosenberg and Hirschberg (2007). In that formulation, maximal completeness means that all items sharing a ground-truth label land in the same cluster; no ground-truth class is split between clusters.
Completeness measures whether all data points that belong to the same true class or category are assigned to the same cluster. For example, if you have a full procedure for making breakfast and one step is "making eggs," then completeness requires that all egg-related actions — taking the eggs out of the fridge, cracking them, whisking them, putting them in the pan, cooking them until they're ready, and all other sub-actions in between — are grouped together in the same cluster.
Homogeneity can be considered the litmus test for contamination and purity. Ideally, for each discovered cluster, only the "correct" actions are attributed to it. That is, frothing milk for a latte shouldn't show up in the "making eggs" cluster.
Completeness and homogeneity represent two axes for evaluating procedural clustering quality:
- High completeness, poor homogeneity: All the egg-making actions are kept together, but the "making eggs" cluster also incorrectly includes actions like buttering toast, brewing coffee, and setting the table. The complete procedure is captured but contaminated with irrelevant actions.
- High homogeneity, poor completeness: Each cluster contains only semantically related actions, but the egg-making procedure is fragmented across multiple clusters. For instance, "cracking eggs" might be in a "preparation" cluster with "slicing bread," while "cooking eggs" might be in a "heating" cluster with "toasting bread." The actions are uncontaminated, but the procedure isn't coherent or orderly.
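As a hypothetical illustration of how these two axes are quantified, here is a sketch using scikit-learn’s completeness_score and homogeneity_score on made-up breakfast labels.

```python
# Hypothetical breakfast labels run through scikit-learn's metrics to show
# the two failure modes quantitatively.
from sklearn.metrics import completeness_score, homogeneity_score

truth = ["eggs", "eggs", "eggs", "toast", "toast", "coffee"]

# High completeness, poor homogeneity: everything lumped into one cluster.
one_big_cluster = [0, 0, 0, 0, 0, 0]
print(completeness_score(truth, one_big_cluster))  # 1.0
print(homogeneity_score(truth, one_big_cluster))   # 0.0

# High homogeneity, poor completeness: each cluster is pure, but the egg
# and toast steps are fragmented across clusters.
fragmented = [0, 0, 1, 2, 3, 4]
print(homogeneity_score(truth, fragmented))    # 1.0
print(completeness_score(truth, fragmented))   # < 1.0, penalized for the splits
```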
Conclusion
Much of this conversation and blog post came from my work on capturing procedural knowledge in collaborative research environments, particularly Jupyter notebooks. If this is interesting to you, I’d be happy to discuss it further! I’m also interested in building on the PRISM work. I haven’t gone into the details of the algorithm itself here; instead, I’ve focused on how procedures are represented using the methodology they introduce, which I find really interesting. I’m especially curious about reproducibility: in what ways, and to what extent, do different procedural extraction methods produce varying results on the same dataset? I’m also wondering how the process could become more supervised. For example, what kinds of demonstration videos or "high-dimensional observations" are considered out-of-distribution, and what features make them so?