

Re: RepositoryBasedCode

August 07, 2008 16:37:22 +0300 (EEST)

Martin Fowler wrote recently about what he called RepositoryBasedCode, which he sees as "the idea that the core definition of a system should be held in a model and edited through projections". He contrasts it with code in source files, where the editable representation is essentially the same as the storage representation: a two-dimensional array of characters flattened into a one-dimensional array, with line break characters as separators. Such simple arrays don't embody any real understanding of the structure of the language or code, so it's left to the programmer to maintain that structure in his head as he writes. To help the programmer a little, many simple editors can automatically maintain the indent level -- useful for understanding and readability, even if it doesn't affect the behavior of the program (except in rare cases like Python). The next step is things like Emacs modes, which use simple text processing on the fly to identify block structures etc. If the results of that processing are kept around in memory, we can do the things today's IDEs do: on-the-fly syntax error display, code completion, refactoring etc.
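To make that concrete, here is a minimal Python sketch (mine, not Martin's, with made-up example text) of just how little the stored form knows: the code is only a flat character sequence, and both indentation maintenance and block recognition have to be re-derived from it by text processing.

    source = "if (x > 0) {\n    y = x;\n}\n"   # one-dimensional array of characters

    # Recover the two-dimensional view: split on the line-break separators.
    lines = source.split("\n")

    # What a simple editor does: reuse the indent of an existing line.
    def indent_of(line):
        return line[:len(line) - len(line.lstrip(" "))]

    # What an Emacs-style mode does on the fly: rough block structure from braces.
    def block_depths(text):
        depth, depths = 0, []
        for ch in text:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            depths.append(depth)
        return depths

    print(lines)                          # ['if (x > 0) {', '    y = x;', '}', '']
    print(repr(indent_of("    y = x;")))  # '    '
    print(max(block_depths(source)))      # 1: one nesting level, found only by scanning text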

Having what amounts to the abstract syntax tree of the code allows us to have a much richer editing experience. However, as long as the majority of the editing happens at the character level, where you can insert or delete any single character anywhere in the whole file, it's going to be a hard task to keep the code legal and the abstract syntax tree up to date. One thing that a RepositoryBasedCode system does is to switch things around: the abstract syntax tree becomes the primary form, and the display of code in lines or the serialization of it into a file are created from that. Since pretty printing and serialization are much faster than parsing, this is more efficient for the IDE. Because the editing operations now operate on the abstract syntax tree, it is possible to offer only operations that maintain the legality of the code. (If that seems impossibly restrictive, consider this: Word document files are always maintained as legal Word document files: you can't do things like add a "BeginSection" mark without the corresponding "EndSection". The UI guarantees it, and a good choice of concepts ("SectionBreak") makes it feel natural.)
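As an illustration of that switch, here is a hypothetical Python sketch (invented for this post, not any particular tool's data model): the tree is the primary form, the text is only pretty-printed from it, and because editing is offered only as operations on the tree, an unbalanced or illegal program simply cannot be produced.

    from dataclasses import dataclass, field

    @dataclass
    class Block:                      # a block node always owns both its delimiters
        statements: list = field(default_factory=list)

    @dataclass
    class If:
        condition: str
        body: Block = field(default_factory=Block)

    def pretty(node, indent=0):
        pad = "    " * indent
        if isinstance(node, If):
            return (pad + f"if ({node.condition}) {{\n"
                    + pretty(node.body, indent + 1)
                    + pad + "}\n")
        if isinstance(node, Block):
            return "".join(pretty(s, indent) for s in node.statements)
        return pad + str(node) + "\n"

    # The only way to edit is through tree operations, so braces stay balanced:
    prog = Block()
    check = If("x > 0")
    check.body.statements.append("y = x;")
    prog.statements.append(check)

    print(pretty(prog))   # the line/file form is generated from the tree, not stored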

Up to here, although I am emphasizing different aspects from Martin, I think we are in agreement, and the arguments apply equally to systems described as source code or as graphical models. My experience starts to differ from his claims here:

A tool manipulates the abstract representation and projects multiple editable representations for the programmer to change the definition of the system. The tool persists the abstract representation in a storage representation, but this is entirely separated from any of the editable representations that it projects.

That sounds like it repeats the mistake of Intentional Software's approach (and several before them): the editable representations are created automatically on the fly as projections from the abstract form. That works fine in a demo, and looks very cool as you enter information in one representation and the tool automatically displays a completely different one. The problem, however, is precisely the same thing that makes it look cool: the representation of the information changes a lot at the push of a button. That looks cool because the button appears very powerful, but it's a nightmare if you're actually trying to work with the model through the representation. The brain remembers spatial information well, and uses landmarks in known positions to navigate. Make all the landmarks look different and move them to different places, and you're lost.

To stop the user getting lost like that, the tool must also store information about the visual layout, in addition to the abstract data. It doesn't really project the editable representations, creating them from the abstract data on the fly; it displays them as specified in the stored layout. Note that this doesn't preclude multiple editable representations: each abstract concept can have 0..N representations. Nor does it preclude tool support for layout: autolayout simply places things or moves them around, and those changes are saved in the same way as if the user had made them. And of course, since no autolayout algorithm ever produces quite the right results, you can finish it off by hand without fear of the tool overwriting your changes.
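A hypothetical sketch of that separation (the class and property names are invented for illustration): the abstract concept is stored once, each of its 0..N representations stores only layout, and autolayout writes through the same path as a manual move, so nothing regenerated on the fly can overwrite hand-made adjustments.

    from dataclasses import dataclass, field

    @dataclass
    class Concept:                         # the abstract data, stored once
        name: str
        properties: dict = field(default_factory=dict)

    @dataclass
    class Representation:                  # layout only; refers to the concept
        concept: Concept
        x: int = 0
        y: int = 0

        def move_to(self, x, y):           # used by both the user and autolayout
            self.x, self.y = x, y

    engine = Concept("Engine", {"cylinders": 4})
    in_overview = Representation(engine, x=100, y=50)
    in_detail   = Representation(engine, x=20,  y=300)   # same concept, two views

    engine.properties["cylinders"] = 6     # change the data once...
    in_overview.move_to(120, 50)           # ...adjust layout per view, and it stays put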

My experience and Martin's don't match at all on the following point:

many repository based environments suffer greatly because they don't have a decent configuration control system, which makes it much harder for multiple people to collaborate on the same system definition. This is a big contrast to source based environments that have a plethora of source code control systems to do this task.

Most of the uses of the word "repository" that I can think of imply two things: reusable assets and multiple users. In contrast to what Martin says, multiple people can collaborate on building a system much more easily with a repository than with source code version control systems. Rather than the overly coarse granularity of locking whole source files, only the meaningful units that you are working on are locked. Thus two people can work on areas that equate to (generate to) different parts of a single source file, without their work interfering with each other. This is particularly important in the common case where each source code file is built from information from several models (e.g. different aspects).

Rather than assuming optimistic locking and trying to merge two (or more) sets of changes, the changes can be made in parallel without conflict, and both developers see the full result after saving. If you try to change an area of the system that another user has changed in parallel, you are denied the lock and can see which user has it. Of course no system can prevent semantic conflicts, but for the syntactic and logical ones a multi-user repository is better than today's source control systems. Editing as part of a team is much more fluid and transparent: you can see the changes as soon as the other user has saved them (no "check out" needed), and with the fine granularity fewer than 3 operations in 1000 conflict with another user, so you're not being refused locks all the time (as was the case in file-based multi-user modeling tools).
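To show the kind of behaviour I mean (a toy Python sketch, not MetaEdit+'s actual implementation), locking happens per model element rather than per file: edits to different elements proceed in parallel, and a genuinely conflicting edit is refused up front with the name of the lock holder, instead of being merged after the fact.

    class ElementLockedError(Exception):
        pass

    class Repository:
        def __init__(self):
            self._locks = {}              # element id -> user holding the lock

        def edit(self, element_id, user):
            holder = self._locks.get(element_id)
            if holder and holder != user:
                raise ElementLockedError(f"{element_id} is being edited by {holder}")
            self._locks[element_id] = user

        def save(self, element_id, user):
            if self._locks.get(element_id) == user:
                del self._locks[element_id]    # changes become visible on save

    repo = Repository()
    repo.edit("StateMachine:Idle", "alice")
    repo.edit("StateMachine:Running", "bob")   # a different element: no conflict
    try:
        repo.edit("StateMachine:Idle", "bob")  # the same element: refused, not merged
    except ElementLockedError as e:
        print(e)                               # StateMachine:Idle is being edited by alice
    repo.save("StateMachine:Idle", "alice")    # alice saves; the lock is released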

Since Martin goes on to claim that "almost all MDD tools are repository based", I think he's just missing the important multi-user aspect of repositories. Almost all of the tools are in fact file based -- i.e. they save a single diagram to a single file -- along with all the problems of reference-by-name which that brings for inter-diagram references. Of course, being file based means they can try to use source code version control systems to enable multiple users to work in parallel on the same file, but as Martin says that really doesn't work. Models are by their nature strongly interlinked, with direct references between elements -- at least within a single model -- rather than reference-by-name. Serializing them to XML, which natively supports only trees, not graphs, is bad enough, but trying to do an automated merge on two such files is never going to be reliable.
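The underlying mismatch is easy to see in a small sketch (again illustrative, not any particular tool's format): in memory the model is a graph of direct object references, possibly cyclic, but a per-diagram XML file can only nest elements and has to fall back on reference-by-name for everything else -- and that is the level at which a text merge would then operate.

    class State:
        def __init__(self, name):
            self.name = name
            self.transitions = []      # direct object references to other States

    idle, running = State("Idle"), State("Running")
    idle.transitions.append(running)
    running.transitions.append(idle)   # a cycle: fine in memory, impossible as pure nesting

    def to_xmlish(states):
        # The graph has to be broken into a flat list plus name-based cross-references.
        parts = []
        for s in states:
            refs = "".join(f'<transition target="{t.name}"/>' for t in s.transitions)
            parts.append(f'<state name="{s.name}">{refs}</state>')
        return "<model>" + "".join(parts) + "</model>"

    print(to_xmlish([idle, running]))
    # Any text-level merge of two such files now works on the names alone, with no idea
    # whether "Running" still exists, was renamed, or was added twice in parallel.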

Any merge or integration of two users' work should not take place at the level of the serialized storage representation, as in the file-based tools, but at the level of the abstract representation, as in a repository. In moving from source code files to models, not all steps can be taken separately and incrementally: some need to be taken together. Since we change the primary representation from the storage representation to the in-memory abstract representation, we also change the approach to integrating multiple users' work from the storage representation to the abstract representation. Since the integration now happens in real time rather than after the fact, we can better protect against conflicts, and better inform the modelers by updating the editable representation on the fly.
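A hypothetical sketch of what integration on the abstract representation means in practice: both users' saves are applied directly to the same repository objects, so there is no after-the-fact text merge, and refreshing an editable representation simply re-projects from those objects.

    class SharedModel:
        def __init__(self):
            self.elements = {}                     # the one abstract representation

        def apply(self, user, element_id, prop, value):
            # A save is applied straight to the shared objects, not merged as text.
            self.elements.setdefault(element_id, {})[prop] = value
            print(f"{user} saved {element_id}.{prop} = {value!r}")

        def project(self, element_id):
            # Editable representations are refreshed from the abstract data.
            return dict(self.elements.get(element_id, {}))

    model = SharedModel()
    model.apply("alice", "Pump", "maxPressure", 12)   # alice saves her change
    model.apply("bob",   "Pump", "vendor", "Acme")    # bob saves a different property
    print(model.project("Pump"))   # both now see {'maxPressure': 12, 'vendor': 'Acme'}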

Obviously, what I'm describing here is different from many MDD tools -- but it's nothing new or amazing. Back in the 1990s everybody involved in modeling tools knew this was the way things had to be, and so of course MetaEdit+ was designed to bring these benefits. After a dozen years of industrial use of the multi-user version, I'm happy we made the right decision. Of course a multi-user approach doesn't suit every project, and some projects use a separate single user repository for each MetaEdit+ user. As DSM reduces the work to create a product by a factor of 5-10, applications that previously would have required several developers can now be built by a single developer, so there sometimes isn't even a need for linking to another developer's work. The only links necessary are to code that is the same for all such applications, and those are made automatically by the generator or build process. As always, the best kind of conflict resolution is to avoid conflict in the first place!

Comments

collaboration in modeling environments

[Andriy Levytskyy] September 26, 2008 15:40:00 +0300 (EEST)

Hi Steven,

I can relate very well to the subject of configuration control systems for modeling environments. I've been involved in an MDE introduction in a large company. This introduction was done in a bottom-up manner. Naturally, versioning and collaboration in modeling were approached as in familiar source-based environments: that is, outside of the (metamodel-constrained) modeling environments, in a serialized storage representation (XML/XMI), and with the help of a third-party source code control system (one that understands XML/XMI, but not the DSL of the models). It is not too hard to guess that merge and integration of models by the code control system did not work well, mostly due to these problems:

  1. In the case of DSLs (even if they are called UML profiles), a third-party merging tool would not know how to correctly merge non-standard domain-specific models. The output was usually an unusable model (in a valid XML/XMI representation).
  2. Writing a custom merger was too much work, due to the many DSLs and the deep forest of incidental details introduced by the storage representation.

A modeling environment with "real time rather than after the fact" DSL-driven integration (and hence with no need for merging) would allow us to avoid the above problems. However, it would require a different collaborative way of working (no more "clone-own-merge"), which can be quite a challenge in itself.