Jay Taylor's notes

back to listing index

Building an HTML Diff/Patch Algorithm

[web search]
Original source (stackoverflow.com)
Tags: html patch diff stackoverflow.com
Clipped on: 2016-11-21

A description of what I'm going to accomplish:

  • Input 2 (N is not essential) HTML documents.
  • Standardize the HTML format
  • Diff the two documents -- external styles are not important but anything inline to the document will be included.
  • Determine delta at the HTML Block Element level.

Expanding the last point:

Imagine two pages of the same site that both share a sidebar with what was probably a common ancestor that has been copy/pasted. Each page has some minor changes to the sidebar. The diff will reveal these changes, then I can "walk up" the DOM to find the first common block element shared by them, or just default to <body>. In this case, I'd like to walk it up and find that, oh, they share a common <div id="sidebar">.

I'm familiar with DaisyDiff and the application is similar -- in the CMS world.

I've also begun playing with the google diff-patch library.

I wanted to give ask this kind of non-specific question to hopefully solicit any advise or guidance that anybody thinks could be helpful. Currently if you put a gun to my head and said "CODE IT" I'd rewrite DaisyDiff in Python and add-in this block-level logic. But I thought maybe there's a better way and the answers to Anyone have a diff algorithm for rendered HTML? made me feel warm and fuzzy.

asked Sep 29 '12 at 3:44
Image (Asset 2/5) alt=
Shane H
1,49821425
1 upvote
  flag
   upvote
  flag
I'm not sure of your exact application but a DOM ranking algorithm is used by projects like readability.com to extract relevant content. If you want to diff only on the core of the page, something like that might make sense – Pratik Mandrekar Oct 6 '12 at 17:46
   upvote
  flag
Would love to hear an update about this project; if you managed to find what you were looking for and if you plan on open sourcing any of it :) – onassar Jan 17 '13 at 4:41
   upvote
  flag
up vote 9 down vote accepted
+25

If you were going to start from scratch, a useful search term would be "tree diff".

There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point — maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.

There's also html-tree-diff on pypi, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm

There's some theoretical stuff about tree diffing at efficient diff algorithm for trees and Levenshtein distance on cstheory.stackexchange.

BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)

answered Oct 4 '12 at 17:24
Image (Asset 3/5) alt=
Quuxplusone
7,85212667
   upvote
  flag
That's correct -- There's a ton of noise in this area about people who want to render a diff in HTML the way you described. I don't care about that, I won't be rendering the diff at all but instead use the output block-element deltas to drive more powerful visualizations of the differences between different pages and versions of the same page. Appreciate your input, this isn't like anything I've built before and I wanted to try to make sure I'm not over thinking it or missing anything obvious. – Shane H Oct 4 '12 at 18:42

You could start by using beautifulsoup to parse both documents.

Then you have a choice:

  • use prettify to render both documents as more or less standardized HTML and diff those.
  • compare the parse trees.

The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.

answered Oct 7 '12 at 13:15
Image (Asset 4/5) alt=
Roland Smith
17.4k11836

I know this questions is related to python but you could take a look 3DM - XML 3-way Merging and Differencing Tool (default implementation in java) but here is the actual paper describing the algorithm used http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site.

Drawback to this is that you do have to cleanup the document and be able to pars it as XML.

answered Oct 5 '12 at 19:52
Image (Asset 5/5) alt=
Greg
1,1291818

Your Answer

asked

4 years ago

viewed

2145 times

active

24 days ago

Upcoming Events

Featured on Meta

Hot Network Questions

Technology Life / Arts Culture / Recreation Science Other
  1. Stack Overflow
  2. Server Fault
  3. Super User
  4. Web Applications
  5. Ask Ubuntu
  6. Webmasters
  7. Game Development
  8. TeX - LaTeX
  1. Software Engineering
  2. Unix & Linux
  3. Ask Different (Apple)
  4. WordPress Development
  5. Geographic Information Systems
  6. Electrical Engineering
  7. Android Enthusiasts
  8. Information Security
  1. Database Administrators
  2. Drupal Answers
  3. SharePoint
  4. User Experience
  5. Mathematica
  6. Salesforce
  7. ExpressionEngine® Answers
  8. Cryptography
  1. Code Review
  2. Magento
  3. Signal Processing
  4. Raspberry Pi
  5. Programming Puzzles & Code Golf
  6. more (7)
  1. Photography
  2. Science Fiction & Fantasy
  3. Graphic Design
  4. Movies & TV
  5. Music: Practice & Theory
  6. Seasoned Advice (cooking)
  7. Home Improvement
  8. Personal Finance & Money
  1. Academia
  2. more (8)
  1. English Language & Usage
  2. Skeptics
  3. Mi Yodeya (Judaism)
  4. Travel
  5. Christianity
  6. English Language Learners
  7. Japanese Language
  8. Arqade (gaming)
  1. Bicycles
  2. Role-playing Games
  3. Anime & Manga
  4. Motor Vehicle Maintenance & Repair
  5. more (17)
  1. MathOverflow
  2. Mathematics
  3. Cross Validated (stats)
  4. Theoretical Computer Science
  5. Physics
  6. Chemistry
  7. Biology
  8. Computer Science
  1. Philosophy
  2. more (3)
  1. Meta Stack Exchange
  2. Stack Apps
  3. Area 51
  4. Stack Overflow Talent
site design / logo © 2016 Stack Exchange Inc; user contributions licensed under cc by-sa 3.0 with attribution required
rev 2016.11.18.4219