Jay Taylor's notes

Building an HTML Diff/Patch Algorithm

Original source (stackoverflow.com)

Tags: html patch diff stackoverflow.com

Clipped on: 2016-11-21

A description of what I'm going to accomplish:

Input 2 (N is not essential) HTML documents.
Standardize the HTML format
Diff the two documents -- external styles are not important but anything inline to the document will be included.
Determine delta at the HTML Block Element level.

Expanding the last point:

Imagine two pages of the same site that both share a sidebar with what was probably a common ancestor that has been copy/pasted. Each page has some minor changes to the sidebar. The diff will reveal these changes, then I can "walk up" the DOM to find the first common block element shared by them, or just default to <body>. In this case, I'd like to walk it up and find that, oh, they share a common <div id="sidebar">.
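The "walk up" step described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the asker's implementation: a block counts as "shared" when an element with the same tag and `id` exists in both documents, the `BLOCK_TAGS` set is a hand-picked subset, and the names `TreeBuilder`, `nearest_shared_block`, and `changed_blocks` are hypothetical.

```python
import difflib
from html.parser import HTMLParser

# Assumption: a hand-picked subset of block-level tags, not the full HTML list.
BLOCK_TAGS = {"body", "div", "section", "article", "aside", "nav",
              "header", "footer", "main", "ul", "ol", "table", "form", "p"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent

    def label(self):
        ident = self.attrs.get("id")
        return f"{self.tag}#{ident}" if ident else self.tag


class TreeBuilder(HTMLParser):
    """Record each text run together with the element that contains it."""

    def __init__(self):
        super().__init__()
        self.current = Node("body", [], None)  # fallback root
        self.texts = []        # list of (text, containing Node)
        self.block_ids = set() # labels of block elements that carry an id

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        if tag in BLOCK_TAGS and node.attrs.get("id"):
            self.block_ids.add(node.label())
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray close tags.
        node = self.current
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.texts.append((text, self.current))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder


def nearest_shared_block(node, shared):
    """Walk up from a node to the first block whose label is in both documents."""
    while node is not None:
        if node.tag in BLOCK_TAGS and node.label() in shared:
            return node.label()
        node = node.parent
    return "body"  # default when nothing closer is shared


def changed_blocks(old_html, new_html):
    old, new = parse(old_html), parse(new_html)
    shared = old.block_ids & new.block_ids
    matcher = difflib.SequenceMatcher(
        a=[text for text, _ in old.texts],
        b=[text for text, _ in new.texts],
        autojunk=False,
    )
    found = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        for _, node in old.texts[i1:i2] + new.texts[j1:j2]:
            label = nearest_shared_block(node, shared)
            if label not in found:
                found.append(label)
    return found
```

For two pages whose only difference is one list item inside `<div id="sidebar">`, `changed_blocks` reports that sidebar block; when no enclosing block with a common `id` exists, it falls back to `body`. Only text changes are detected here; attribute-only changes would need a richer comparison.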

I'm familiar with DaisyDiff and the application is similar -- in the CMS world.

I've also begun playing with the google diff-patch library.

I wanted to ask this kind of non-specific question in the hope of soliciting any advice or guidance that anybody thinks could be helpful. Currently, if you put a gun to my head and said "CODE IT", I'd rewrite DaisyDiff in Python and add in this block-level logic. But I thought maybe there's a better way, and the answers to "Anyone have a diff algorithm for rendered HTML?" made me feel warm and fuzzy.

edited Sep 29 '12 at 17:50

asked Sep 29 '12 at 3:44

Shane H

Related: stackoverflow.com/questions/1576459/… . – Juho Vepsäläinen Sep 29 '12 at 17:56

I'm not sure of your exact application but a DOM ranking algorithm is used by projects like readability.com to extract relevant content. If you want to diff only on the core of the page, something like that might make sense – Pratik Mandrekar Oct 6 '12 at 17:46

Would love to hear an update about this project; if you managed to find what you were looking for and if you plan on open sourcing any of it :) – onassar Jan 17 '13 at 4:41

Profiled a lot of the libraries here: web.onassar.com/blog/2012/11/21/htmldiff-software-discoveries – onassar Feb 23 '13 at 6:09

3 Answers

Accepted answer (9 votes)

If you were going to start from scratch, a useful search term would be "tree diff".

There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point — maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.

There's also html-tree-diff on PyPI, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm

There's some theoretical stuff about tree diffing at efficient diff algorithm for trees and Levenshtein distance on cstheory.stackexchange.
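To make the "tree diff" idea concrete, here is a deliberately naive top-down sketch. It is not the algorithm from any of the linked resources and it does not compute a minimal tree edit distance or detect moved subtrees; it only aligns the children of matching nodes by label using `difflib.SequenceMatcher`. The `(label, children)` tuple representation and the name `tree_diff` are assumptions made for the example.

```python
import difflib


def tree_diff(old, new, path=""):
    """Return a list of (operation, path) pairs for two labelled trees.

    Each tree is a (label, [children]) tuple. The comparison is top-down:
    children are only compared when their parents have the same label.
    """
    old_label, old_children = old
    new_label, new_children = new
    if old_label != new_label:
        # Different roots: treat the whole subtree as replaced.
        return [("delete", f"{path}/{old_label}"),
                ("insert", f"{path}/{new_label}")]

    here = f"{path}/{old_label}"
    edits = []
    matcher = difflib.SequenceMatcher(
        a=[label for label, _ in old_children],
        b=[label for label, _ in new_children],
        autojunk=False,
    )
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Same labels in the same order: descend into each pair.
            for old_child, new_child in zip(old_children[i1:i2],
                                            new_children[j1:j2]):
                edits.extend(tree_diff(old_child, new_child, here))
        else:
            edits.extend(("delete", f"{here}/{child[0]}")
                         for child in old_children[i1:i2])
            edits.extend(("insert", f"{here}/{child[0]}")
                         for child in new_children[j1:j2])
    return edits
```

Given two trees that differ in a single leaf, the result is one `delete` and one `insert` whose paths point at that leaf, which is the kind of delta a block-level report could then be built on.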

BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)

edited Oct 5 '12 at 19:40

answered Oct 4 '12 at 17:24

Quuxplusone

That's correct. There's a ton of noise in this area from people who want to render a diff in HTML the way you described. I don't care about that; I won't be rendering the diff at all, but will instead use the output block-element deltas to drive more powerful visualizations of the differences between different pages and versions of the same page. I appreciate your input; this isn't like anything I've built before, and I wanted to make sure I'm not overthinking it or missing anything obvious. – Shane H Oct 4 '12 at 18:42

Answer (1 vote)

You could start by using beautifulsoup to parse both documents.

Then you have a choice:

use prettify to render both documents as more or less standardized HTML and diff those.
compare the parse trees.

The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.
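The first option (normalize, then diff the text) might look roughly like the sketch below. To keep it dependency-free it swaps in the standard library's `html.parser` for Beautiful Soup's `prettify`, so the output format is this example's own, not Beautiful Soup's. The names `Normalizer`, `normalize`, and `diff_html` are hypothetical.

```python
import difflib
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def format_attr(name, value):
    # Boolean attributes such as "disabled" arrive with value None.
    return f" {name}" if value is None else f' {name}="{value}"'


class Normalizer(HTMLParser):
    """Emit one line per tag or text run, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        # Sort attributes so their order never shows up as a difference.
        ordered = sorted(attrs, key=lambda pair: pair[0])
        attr_text = "".join(format_attr(k, v) for k, v in ordered)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)
            self.lines.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.lines.append("  " * self.depth + text)


def normalize(html):
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return parser.lines


def diff_html(old_html, new_html):
    """Line-based unified diff of the two normalized documents."""
    return list(difflib.unified_diff(
        normalize(old_html), normalize(new_html),
        fromfile="old", tofile="new", lineterm="",
    ))
```

Because whitespace and attribute order are normalized away first, two documents that differ only in formatting produce an empty diff, while a changed text run shows up as a `-`/`+` pair. Comparing parse trees instead (the second option) gives more control over what to ignore, at the cost of more code.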

edited Oct 7 '12 at 14:08

answered Oct 7 '12 at 13:15

I know this question is related to Python, but you could take a look at 3DM, the XML 3-way Merging and Differencing Tool (the default implementation is in Java). Here is the paper describing the algorithm used: http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site.

The drawback is that you have to clean up the document so that it can be parsed as XML.

answered Oct 5 '12 at 19:52

Greg
