In my attempt (and hope) to follow through with my talks idea, I've come to the point in development where I need to analyze differences in text.
Specifically, I need to perform a diff, periodically, on many different blobs of text. I'm looking for what has been removed, what has been added, and what has been changed, ignoring markup changes.
For example, I'm interested in the title of a blog post changing, but not the attribute value of a link on the page.
In looking into all the different software out there, it seems that I'll be able to get 2 out of those 3 requests: namely, the ability to figure out what's been removed and what's been added (line by line).
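To make the "2 out of 3" point concrete, here is a minimal sketch of a line-by-line diff. It assumes Python's standard `difflib` module rather than any of the tools listed below, and the function name `added_and_removed` is made up for illustration. A changed line simply shows up as one removal plus one addition.

```python
import difflib


def added_and_removed(old_text, new_text):
    """Return (added, removed) lines between two blobs of text."""
    added, removed = [], []
    # ndiff prefixes each line with "+ ", "- ", "  " or "? " (hint lines).
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


if __name__ == "__main__":
    added, removed = added_and_removed("a\nb\nc", "a\nc\nd")
    print("added:", added)
    print("removed:", removed)
```

Telling a genuine "change" apart from an unrelated removal and addition would need extra pairing logic on top of this, which is the part most tools leave out.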
The following is a breakdown of the resources I stumbled on during my research process.
Web-based detection service which emails you when there are changes. It's a good proof-of-concept of what I'm looking for.
In actuality, I'm looking for granular control of the diff process, and need it integrated directly into my application, so I'm not able to use this, but it's good if you're looking to be emailed the differences in a web page as they happen.
Another web-based service which attempts to find the diff between two provided HTML documents. There were a lot of false positives, but I tested on more complicated examples (e.g. http://www.amazon.com/Restful-Web-Services-Leonard-Richardson/dp/0596529260), so maybe it's good on smaller documents.
These seem to work pretty well, but unfortunately don't fit what I'm looking for. I'm looking for the ability to detect changes within markup (X/HTML), while disregarding the unimportant stuff (unimportant, for me, being structural changes in the markup).
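One way to approximate "changes within markup, minus the structure" is to pull out only the text nodes and diff those. The sketch below assumes Python's standard `html.parser` and `difflib` modules; the names `TextExtractor`, `visible_text` and `text_changes` are hypothetical, and this is not how any of the listed services work.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags, attributes and script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the document's text nodes as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def text_changes(old_html, new_html):
    """List (operation, old_chunks, new_chunks) for every non-equal span."""
    old, new = visible_text(old_html), visible_text(new_html)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

With this approach a changed link attribute or an extra wrapper element produces no entries, while a changed title comes back as a `replace` operation. It compares whole text nodes, so a one-word edit in a long paragraph is reported as the entire paragraph changing.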
Although I'm using PHP, I would consider this as it seems designed to handle what I'm looking for.
A Python library that I haven't been able to test, so I'm not sure if its capabilities include detecting changes in markup. Seems to be a port of another library, which I couldn't find :(
A cool website that quickly colour-coordinates a diff found online (I believe this could be useful for highlighting git/svn changes). Not what I'm looking for at the moment, though.
The library I mentioned above. A Ruby diff library, which seems to do a solid job detecting insertions and deletions within a text source.
A Lisp HTML Diff tool, which doesn't seem to contain much information or examples, and was last worked on 3+ years ago. Not sure how effective it is.
The source library for the Visual Diff tool. It seems to be a Java library which ought to work the same as the Visual Diff tool below, which is a PHP port of it.
A PHP library developed by a MediaWiki member, which MediaWiki no longer seems to use. It was published online at http://gitorious.org/htmldiff.
A Python diff library which appears to be exceptionally well documented, with many more advanced features (e.g. parsing email, testing doctypes, etc.).
I'm adding this one in now (December 3rd) as I stumbled on it, and it may be worth a try. While it doesn't have a website or documentation, it allows you to download the C files and comes with an installation guide.
I assume it'll be pretty speedy since it's a C library.
Originally, before I had discovered the myriad of resources available, I contemplated, perhaps foolishly (we'll never know), building my own engine. These are some resources which could be helpful in that kind of pursuit.
A jQuery library which allows you to compare nodes in a document and see if they are equal to others. Additionally, it marks whether a node is missing from the document entirely, or comes before or after the node you're comparing it to.
This link is more or less for myself, as it contains an example of how to run a shell script (e.g. Node, C++, Python) from within PHP (since that's the environment I'm working in anyhow).
A traversing library which allows you to use CSS3 selectors to traverse a document, instead of XPath. Seems pretty powerful. Kind of wish I'd discovered it during my MetaParser library development :)
A post that provides helpful information on how to search for text nodes in a document in PHP.
A PHP library which presents jQuery-inspired selectors to search through a document and access nodes.
A quick walkthrough on how to get the text for a node in PHP.
Quick note on how to load an HTML document into PHP's DOMDocument library.
A page dedicated to covering some of the different HTML Diff software available online.
No conclusion just yet. It's possible that I'll use a non-PHP library and access it through shell_exec, and while I think the Node/JS one is sweet, I'm not sure it would get me to MVP fastest.
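If I do go the shell_exec route, the non-PHP side could be as small as the sketch below: a hypothetical Python script (I'm calling it `diff_report.py`, and the JSON shape is my own assumption) that takes two file paths and prints the added and removed lines as JSON, which PHP can then decode.

```python
import difflib
import json
import sys


def diff_report(old_text, new_text):
    """Return a dict of added and removed lines between two texts."""
    report = {"added": [], "removed": []}
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            report["added"].append(line[2:])
        elif line.startswith("- "):
            report["removed"].append(line[2:])
    return report


def main(argv):
    """Read OLD_FILE and NEW_FILE from argv and print the report as JSON."""
    if len(argv) != 3:
        print("usage: diff_report.py OLD_FILE NEW_FILE", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as old_file:
        old_text = old_file.read()
    with open(argv[2], encoding="utf-8") as new_file:
        new_text = new_file.read()
    print(json.dumps(diff_report(old_text, new_text)))
    return 0


# Only act as a command-line tool when arguments are supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

On the PHP side, the output of something like `shell_exec('python3 diff_report.py old.txt new.txt')` could be passed to `json_decode`. Any user-supplied paths would need escaping with `escapeshellarg` first.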
Hopefully this helps someone out there looking for software, and for me, when I get around to writing the code :)
Resources I haven't yet looked into fully
Python Diff Script
Perl Diff Script
Python Diff Script
Python Diff Script
Docs regarding lxml diff library
Example lxml Script