Jay Taylor's notes
Building an HTML Diff/Patch Algorithm
A description of what I'm going to accomplish:
Expanding the last point: imagine two pages of the same site that share a sidebar, probably copied and pasted from a common ancestor. Each page has some minor changes to the sidebar. The diff will reveal these changes; then I can "walk up" the DOM to find the first common block element shared by them, or just default to

I'm familiar with DaisyDiff, and the application is similar (in the CMS world). I've also begun playing with the Google diff-patch library.

I wanted to ask this kind of non-specific question to solicit any advice or guidance that anybody thinks could be helpful. Currently, if you put a gun to my head and said "CODE IT", I'd rewrite DaisyDiff in Python and add in this block-level logic. But I thought maybe there's a better way, and the answers to "Anyone have a diff algorithm for rendered HTML?" made me feel warm and fuzzy.
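The "walk up" step above could look roughly like the sketch below. It is only an illustration under several assumptions: both pages are already well-formed enough for `xml.etree.ElementTree`, the changed nodes sit at the same depth in both pages, and "block element" means membership in a hand-picked `BLOCK_TAGS` set. The names `BLOCK_TAGS`, `ancestors` and `common_block` are hypothetical, and falling back to the document roots is my own choice, since the default mentioned in the question is cut off.

```python
import xml.etree.ElementTree as ET

# Assumed set of "block" containers; adjust to taste.
BLOCK_TAGS = {"div", "section", "aside", "nav", "article", "table", "body"}

def ancestors(root, node):
    """Return [node, parent, grandparent, ..., root]."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    chain = []
    while node is not None:
        chain.append(node)
        node = parents.get(node)
    return chain

def common_block(root_a, node_a, root_b, node_b):
    """Walk up from two changed nodes in lockstep until both sit in a
    block element with the same tag and attributes in both pages."""
    for a, b in zip(ancestors(root_a, node_a), ancestors(root_b, node_b)):
        if a.tag == b.tag and a.tag in BLOCK_TAGS and a.attrib == b.attrib:
            return a, b
    # Assumed fallback: nothing matched, so use the document roots.
    return root_a, root_b

page_a = ET.fromstring(
    '<body><div id="sidebar"><ul><li><a href="/x">Home</a></li></ul></div></body>')
page_b = ET.fromstring(
    '<body><div id="sidebar"><ul><li><a href="/x">Start</a></li></ul></div></body>')
block_a, block_b = common_block(
    page_a, page_a.find(".//a"), page_b, page_b.find(".//a"))
print(block_a.tag, block_a.attrib)
```

Matching on tag plus attributes is a crude stand-in for "the same block"; a real implementation would probably want a looser similarity test.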
|||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||
If you were going to start from scratch, a useful search term would be "tree diff".

There's a pretty awesome blog post here, although I just found it by googling "daisydiff python", so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's

There's also html-tree-diff on PyPI, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm

There's some theoretical stuff about tree diffing at "efficient diff algorithm for trees" and "Levenshtein distance" on cstheory.stackexchange.

BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)
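To make "tree diff" concrete, here is a deliberately naive sketch. It is not the algorithm from any of the tools or posts mentioned above: it only compares children position by position, ignores attributes and tail text, and assumes the input already parses with `xml.etree.ElementTree`. The function name `tree_diff` is hypothetical.

```python
import xml.etree.ElementTree as ET

def tree_diff(a, b, path=""):
    """Yield (path, description) pairs for positional differences."""
    if a.tag != b.tag:
        yield (path or "/", f"tag {a.tag!r} -> {b.tag!r}")
        return
    here = f"{path}/{a.tag}"
    text_a, text_b = (a.text or "").strip(), (b.text or "").strip()
    if text_a != text_b:
        yield (here, f"text {text_a!r} -> {text_b!r}")
    kids_a, kids_b = list(a), list(b)
    for i in range(max(len(kids_a), len(kids_b))):
        if i >= len(kids_a):
            yield (here, f"added <{kids_b[i].tag}>")
        elif i >= len(kids_b):
            yield (here, f"removed <{kids_a[i].tag}>")
        else:
            # Same position in both trees: recurse into the pair.
            yield from tree_diff(kids_a[i], kids_b[i], here)

old = ET.fromstring("<div><p>Hello</p><ul><li>a</li></ul></div>")
new = ET.fromstring("<div><p>Hello!</p><ul><li>a</li><li>b</li></ul></div>")
for path, change in tree_diff(old, new):
    print(path, change)
```

Because matching is purely positional, inserting one element at the front of a list makes every later sibling look changed. Avoiding that is exactly what the real tree-diff algorithms are for.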
You could start by using BeautifulSoup to parse both documents. Then you have a choice:
The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.
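As a rough illustration of discarding presentation-only elements before diffing, the sketch below flattens each document into a list of tokens and runs `difflib` over them. To keep it runnable without extra installs it uses the standard library's `html.parser` instead of BeautifulSoup. The `PRESENTATIONAL` set and the names `ContentTokens`, `content_tokens` and `content_diff` are assumptions made up for this example.

```python
import difflib
from html.parser import HTMLParser

# Assumed list of tags that only affect presentation.
PRESENTATIONAL = {"b", "i", "u", "font", "span", "center"}

class ContentTokens(HTMLParser):
    """Flatten HTML into structural tags and whitespace-normalized text."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._text = []

    def _flush(self):
        # Merge text that was only split up by presentational tags.
        text = " ".join("".join(self._text).split())
        if text:
            self.tokens.append(text)
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag not in PRESENTATIONAL:
            self._flush()
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL:
            self._flush()
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self._text.append(data)

    def close(self):
        super().close()
        self._flush()

def content_tokens(html):
    parser = ContentTokens()
    parser.feed(html)
    parser.close()
    return parser.tokens

def content_diff(old_html, new_html):
    """Unified diff over content tokens; empty if only presentation changed."""
    return list(difflib.unified_diff(
        content_tokens(old_html), content_tokens(new_html), lineterm="", n=0))

print(content_diff("<p>Hello <b>world</b></p>", "<p>Hello world</p>"))
print(content_diff("<p>Hi</p>", "<p>Hello</p>"))
```

Attributes are dropped entirely here, which is a design choice for the sketch; keeping `href` or `id` values in the tokens would catch link changes as well.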
I know this question is related to Python, but you could take a look at 3DM - XML 3-way Merging and Differencing Tool (the default implementation is in Java). Here is the actual paper describing the algorithm used: http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site. The drawback is that you have to clean up the document so that it can be parsed as XML.
Asked 4 years ago; viewed 2145 times.