A diff/patch solution for Flash: a proof of concept Monte Carlo algorithm for comparing text versions in as3

EDIT 2:

Patch.as was fixed again, another issue with RangeErrors from ByteArray.writeUTF (). Well, forget certainty levels, but this time all should really be working just fine with strings of any length.

EDIT 1:

Patch.as underwent some serious debugging and unit testing these days (thanks to Rich, see the comments below), so I’ve 95% certainty that the sources below (updated) are free from bugs. Anyway, the bugs we fixed were silly copy/paste havoc, so if you DO find something wrong about the Patch class, I doubt it’ll be difficult to fix.

Also check out the .reversed flag on the Patch object, it’s helpful when you need to ‘unpatch’ a string.

/EDIT.

diff and patch

diff is a file comparison tool in Unix-like environments, born in the 1970s, which takes two plain text files and outputs the spots where they differ. Its counterpart, patch, is a tool that applies diff output to files in order to keep them updated; this technique was used to synchronise content, often source code, on multiple workstations (today Subversion uses the same trick), and later became the paradigm for revision control systems in collaborative content applications.

Generally speaking, diff works by solving the longest common subsequence problem; when dealing with texts, that’s finding the longest common substring. The operation could be repeated on the left and right of this biggest common chunk, and so on, until there’s nothing else to match. The optimal solution is the minimal set of edits (inserts and deletes) that would suffice to construct the new text from the old, or the most compact possible correct diff output.

google-diff-match-patch

I came across google-diff-match-patch when reading about the O(ND) diff. The author of the library, Neil Fraser, has done an amazing job in bringing what’s probably the most sophisticated diff algorithm to the world of web development, by creating a common API for javascript, java and python (hello App Engine!), and also c++. The library is already ported to as3; you need to ask Neil for the code though, it’s still not in the repository.

Neil’s google-diff-match-patch provides you with a diff tool and patch tool for strings. The latter is extremely sweet and clever, because it works on a best effort basis: in case the patch tool is ran on a string which is not identical to the ‘original’ text used by diff to make the patch, it looks up the areas to be patched by approximate string matching of the patch context, and is thus pretty much capable of patching texts which have been even heavily modified in the meantime by another user or action.

There’s something that worries me about the O(ND) diff when applied on a character-by-character basis on long texts however, especially in the context of js virtual machines or the avm. I do not fully understand how the solver works, but I am afraid that it may be a little too combersome a computation in settings which are not completely theoretical. In other words, if you run the Google diff on two big enough texts with lots of common and differing chunks alternated throughout, the algorithm will choke, even for minutes.

Why so serious?

In my view, looking for an optimal diff solution is overkill. In fact, if one allows for some degree of suboptimal results, the computational effort needed to pull the trick can be radically decreased.

Well, I though that a problem such as that of finding common substrings in two strings would be a great way to get some experience with Monte Carlo problem-solving, and decided to put together a quick and dirty proof of concept implementation of a plain text diff based on random sampling. Starting from Neil’s articles and the google-diff-match-patch sources, I aimed to build a diff which can relatively quickly find a near-to optimal solution, so to prevent the app from freezing, while avoiding the need for marsalling.

The following benchmark contrasts the performance and result quality (with no post-processing nor semantic clean-up) of my naive Monte Carlo diff and the Google O(ND) diff. The two texts are the first and last revision of my ‘Hello world!’ post:

There are a few possible explainations for why the google diff slows down this much. First of all, the O(ND) diff’s name suggests its time-complexity (N = A+B, the sum of the lengths of the texts being compared, and D being the shortest edit path). Another is in the nature of the port I have at my disposal, which shoots out a ton of warnings, and is nowhere near optimised for the avm [EDIT: Indeed, running the same texts in the diff demo of google-diff-match-patch (in Chrome) yields timings that are about three times lower]. Most importantly, the algorithm takes so much more time because of the much higher quality of its solution: still, most of this effort will be for nothing after an efficiency and semantic clean-up, yet there is no real way to request less detail in the first place.

My point is that you don’t need an optimal solution to communicate text changes; you need a reasonably compact one, but asap. A quick patch can always be reduced server-side if needed, by recursively running diff on all insertion/deletion pairs of edits.

What’s behind this ‘Monte Carlo’ diff

Small substrings from the shorter text are taken, and if they exist in the longer text, a binary search begins on their left and right expanding them in the two directions. When an entire common chunk is found, its location and length are recorded and excluded from the search, hence reducing the problem size. The first two substrings are taken from the begining and the end of shorter string, so to quickly find and rule-out the common prefix or suffix (if any), after which the search goes on at random locations.

Arbitrary decision rules determine when the search will stop, and then a simple solver kicks in to resolve conflicts among the discovered common substring. Such conflicts usually involve chunk overlapping, but can also consist in N-to-1 relationships in between chunks in two texts, usually when those two have repetitive common parts (copy-pasting is good reason for such a thing to happen). Because some conflicts will require that common chunks get discarded, before running the solver, the commonalities are sorted (and thus prioritised) by length: hence naively solving the longest common subsequence problem. Conflicts are cheap to resolve however, and this step adds very little overhead to the whole algorithm.

Finally, as an optional post-processing step, semantic adjustment takes place, aligning all chunks to whitespace and punctuation, hence yielding human readable output.

What you can do with this

The demo below gives you a basic interface to play around with the API. The scenario mimics the google-diff-match-patch Shakespeare demo. The first two fields contain the old and new versions of the text to be diffed. The following row displays the computed patch in the google-diff-match-patch ‘emulated’ unified diff format, and a human-readable version of the solution. Finally, the final row provides an input field for a text to be patched, and a result field for the patch operation.

The approximate string matching functionality of the patch tool is built directly over Neil’s implementation of the Bitap algorithm; I have ported Neil’s code to as3, doing my best to optimise it for the avm.

Worth noting is that the API is fully compatible with the google-diff-match-patch library through this pseudo-unified serialisation format; hence you can be running the Google diff on the server-side, and this quick diff in the client with no complications.

Grounds for improvement

In terms of performance, there is a lot that could be done with the algorithm’s search logic; currently it abuses the substring and indexOf methods, and they are indeed slow. An inconclusive charAt () binary search with a final substring comparison will probably render the algorithm much faster in most situations. Another thing to look into is to direct the random sampling process towards areas that are more likely to contain commonalities and preventing it from sampling the same substring more than once.

Source

Patch.as, Bitap.as

These are the current sources. The code is still in the works though, so let me know if you spot any trouble.

Bookmark and Share

19 Responses to “A diff/patch solution for Flash: a proof of concept Monte Carlo algorithm for comparing text versions in as3”


  • This is beautiful. I plan on using your diff/patch algo to implement an undo mechanism in my XML-based tool. Again, great work!

  • Thanks!
    Let me know how it works, I’ve been wondering whether it’ll be good for structured text

  • Hi hristo!

    Rich again. I’m just now getting around to experimenting with your Patch class. I’m confused, though. When I run this:

    var p:Patch = new Patch();
    p.diff(‘hi there. How you doing?’, ‘I said, hi there! How are you doing?’);
    trace(p.string);

    the only diff I get is for ‘I said, ‘. In other words, this is all that trace(p.string) outputs:

    @@ -1,17 +1,29 @@
    +I said,
    hi the

    There are two more differences – the ‘.’ becoming a ‘!’ and the insertion of ‘ are’. Am I doing something wrong?

    Thanks! :)
    -Rich

  • Update: Swapping the parameters seems to help:

    p.diff(‘I said, hi there! How are you doing?’, ‘hi there. How you doing?’);

    yields:

    @@ -1,29 +1,17 @@
    -I said,
    hi there
    -!
    +.
    How
    - are
    you
    @@ -33,4 +21,4 @@
    ing?

    BUT, shouldn’t the diff work both ways? (e.g. in one case, we’re inserting the string ‘ are’, while in the other case, we’re removing the string ‘ are’.)

    • We’re working with Rich to clear the classes from bugs and to make them more useful for memo management. Rich is doing the unit testing, something I’ve always underestimated and stupidly avoided.

      It all looks very good right now, but I’ll postpone updates for a few more days so that we are certain that everything is bug-free.

  • any word on the new release? This is perfect for my app, and I’m anxious to try it out :]

    what license will you release it as?

    • Hey Jack, it’s all fixed and all, I was just pondering on some features.
      I’ll update the files here in a minute and I’ll also mail you the right sources.

      Neil’s code, which is the Bitap class, is released under Apache 2. The other stuff just comes without warranty :)

  • Hey hristo!

    Great work on the class. Everything’s running quite smoothly nowadays. I’ll be in touch as I do more experimentation, especially with using third party diff/patch suites with the diffs produced by Patch.as.

    Have a wonderful day! :)
    -Rich

  • I don’t understand counting of index in “@@”. It would be nice to see your “human-readable version of the solution” code.

    Thanks,
    Liudas

  • There are small bug( maybe not bug) in your code – when two texts are equal, diff returns both text joined. Texts check before going to diff, solves this.

  • I can not download the source. Could you please send me a copy?
    Thanks!

  • Hi Hristo
    Links to your source files actually lead to 404′s
    Could you please drop a valid links ?

  • Thanks for adding source code !
    I have a question – i have problems parsing Diff command output and would like to ask you how do you generate human readable representation of diff command ?

    Parsing diff result chunk by chunk doesn’t seem such a good idea for me.

  • Hey Alex,

    The demos above use the .show() method of the Patch class, which I think is commented out in the sources above, but it works once you’ve parsed the diff data into the internal representation of the Patch object, and uses this state to render the output in a text field.

    .show is really a debug method, but you can see how you could build something of your own -essentially it iterates over the chuncks and renders inserts and deletes from the diff data and renders the “kept” chuncks from the original string. It’s quite simple really.

    What is your worry?

  • It’s me again, now my question could seem interesting to you.
    As you can see by my posts i did some studying of your fast-diff port :)

    I am trying to create an as3 based program for comparing texts. For that i need to get positions of letters in text which should be highlighted (removed letters from one document and added letters in another document)

    The problem is that it’s not possible to get which symbols were added or removed with used Diff output format. And i think that the format used is not correct and doesn’t correspond to unified diff format.

    For example, consider comparing 2 texts (i’ll write newline character as “\n”)

    text 1:
    test1 test2 test3

    text 2:
    test1 test2 test3xxx
\n xxx

    Raw output of Diff command would be:
    @@ -14,4 +14,12 @@\n est3\n+xxx\n xxx

    which is:
    @@ -14,4 +14,12 @@
    est3
    +xxx
    xxx

    In fact we know that both ‘xxx’ were added in second text. But according to the format used we should select only the first one ‘xxx’ because it’s preceded by a “+” sign. There is no way we can tell that second ‘xxx’ should be marked as ‘added’. So it seems like there is a bug in a format or in the implementation of algorithm.

    Hope to read your reply
    Thanks ! :)

    • The correct pseudo-unified-diff output would be:

      @@ -14,4 +14,14 @@
      est3
      +xxx
\n xxx

      Which happens above just as you would expect it to
      but i’m seeing another issue above that should be related to the parser

      i need to put this on github and make a js port of it as well, that would make it SO MUCH EASIER to manage

  • Thanks again Hristo,
    Indeed i commented out string encoding and that led to the problem i mentioned earlier (for some reason i thought that encoding is only used for safely transporting diff results via GET/POST requests and i can turn it off safely)
    Well everything went ok after i uncommented it out.

    Regarding a Js port of this algorithm – isn’t it already ported ?
    http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html

Leave a Reply