module documentation

Undocumented

Class DEL_END Undocumented
Class DEL_START Undocumented
Class href_token Represents the href in an anchor tag. Unlike other words, we only show the href when it changes.
Class InsensitiveSequenceMatcher Acts like SequenceMatcher, but tries not to find very small equal blocks amidst large spans of changes
Class NoDeletes Raised when the document no longer contains any pending deletes (DEL_START/DEL_END)
Class tag_token Represents a token that is actually a tag. Currently this is just the <img> tag, which takes up visible space just like a word but is only represented in a document by a tag.
Class token No summary
Function cleanup_delete Cleans up any DEL_START/DEL_END markers in the document, replacing them with <del></del>. To do this while keeping the document valid, it may need to drop some tags (either start or end tags).
Function cleanup_html This 'cleans' the HTML, meaning that any page structure is removed (only the contents of <body> are used, if there is any <body>). Also <ins> and <del> tags are removed.
Function compress_merge_back Merge tok into the last element of tokens (modifying the list of tokens in-place).
Function compress_tokens Combine adjacent tokens when there is no HTML between the tokens, and they share an annotation
Function copy_annotations Copy annotations from the tokens listed in src to the tokens in dest
Function default_markup Undocumented
Function end_tag The text representation of an end tag for a tag. Includes trailing whitespace when appropriate.
Function expand_tokens Given a list of tokens, return a generator of the chunks of text for the data in the tokens.
Function fixup_chunks This function takes a list of chunks and produces a list of tokens.
Function fixup_ins_del_tags Given an html string, move any <ins> or <del> tags inside of any block-level elements, e.g. transform <ins><p>word</p></ins> to <p><ins>word</ins></p>
Function flatten_el Takes an lxml element el, and generates all the text chunks for that tag. Each start tag is a chunk, each word is a chunk, and each end tag is a chunk.
Function html_annotate doclist should be ordered from oldest to newest, like:
Function html_annotate_merge_annotations Merge the annotations from tokens_old into tokens_new, when the tokens in the new document already existed in the old document.
Function htmldiff Do a diff of the old and new document. The documents are HTML fragments (str/UTF8 or unicode), they are not complete documents (i.e., no <html> tag).
Function htmldiff_tokens Does a diff on the tokens themselves, returning a list of text chunks (not tokens).
Function is_end_tag Undocumented
Function is_start_tag Undocumented
Function is_word Undocumented
Function locate_unbalanced_end like locate_unbalanced_start, except handling end tags and possibly moving the point earlier in the document.
Function locate_unbalanced_start No summary
Function markup_serialize_tokens Serialize the list of tokens into a list of text chunks, calling markup_func around text to add annotations.
Function merge_delete Adds the text chunks in del_chunks to the document doc (another list of text chunks) with markers to show it is a delete. cleanup_delete later resolves these markers into <del> tags.
Function merge_insert doc is the already-handled document (as a list of text chunks); here we add <ins>ins_chunks</ins> to the end of that.
Function parse_html Parses an HTML fragment, returning an lxml element. Note that the HTML will be wrapped in a <div> tag that was not in the original document.
Function serialize_html_fragment Serialize a single lxml element as HTML. The serialized form includes the element's tail.
Function split_delete No summary
Function split_trailing_whitespace This function takes a word, such as 'test\n\n', and returns ('test', '\n\n').
Function split_unbalanced Return (unbalanced_start, balanced, unbalanced_end), where each is a list of text and tag chunks.
Function split_words Splits some text into words. Includes trailing whitespace on each word when appropriate.
Function start_tag The text representation of the start tag for a tag.
Function tokenize Parses the given HTML and returns token objects (words with attached tags).
Function tokenize_annotated Tokenize a document and add an annotation attribute to each token
Variable block_level_container_tags Undocumented
Variable block_level_tags Undocumented
Variable empty_tags Undocumented
Variable end_whitespace_re Undocumented
Variable split_words_re Undocumented
Variable start_whitespace_re Undocumented
Function _contains_block_level_tag True if the element contains any block-level elements, like <p>, <td>, etc.
Function _fixup_ins_del_tags fixup_ins_del_tags that works on an lxml document in-place
Function _merge_element_contents Removes an element, but merges its contents into its place, e.g., given <p>Hi <i>there!</i></p>, if you remove the <i> element you get <p>Hi there!</p>
Function _move_el_inside_block helper for _fixup_ins_del_tags; actually takes the <ins> etc tags and moves them inside any block-level tags.
Variable _body_re Undocumented
Variable _end_body_re Undocumented
Variable _ins_del_re Undocumented
def cleanup_delete(chunks):

Cleans up any DEL_START/DEL_END markers in the document, replacing them with <del></del>. To do this while keeping the document valid, it may need to drop some tags (either start or end tags).

It may also move the del into adjacent tags to try to move it to a similar location to where it was originally located (e.g., moving a delete into the preceding <div> tag, if the del looks like (DEL_START, 'Text</div>', DEL_END)).
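For illustration, a rough sketch of the marker convention, assuming these helpers are importable as lxml.html.diff (the exact whitespace handling in the result may differ):

>>> from lxml.html.diff import DEL_START, DEL_END, cleanup_delete
>>> chunks = ['<p>', 'Hello ', DEL_START, 'old text ', DEL_END, 'world ', '</p>']
>>> cleaned = cleanup_delete(chunks)
>>> # ''.join(cleaned) now reads roughly '<p>Hello <del>old text</del> world </p>'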

def cleanup_html(html):
This 'cleans' the HTML, meaning that any page structure is removed (only the contents of <body> are used, if there is any <body>). Also <ins> and <del> tags are removed.
def compress_merge_back(tokens, tok):
Merge tok into the last element of tokens (modifying the list of tokens in-place).
def compress_tokens(tokens):
Combine adjacent tokens when there is no HTML between the tokens, and they share an annotation
def copy_annotations(src, dest):
Copy annotations from the tokens listed in src to the tokens in dest
def default_markup(text, version):

Undocumented

def end_tag(el):
The text representation of an end tag for a tag. Includes trailing whitespace when appropriate.
def expand_tokens(tokens, equal=False):
Given a list of tokens, return a generator of the chunks of text for the data in the tokens.
def fixup_chunks(chunks):
This function takes a list of chunks and produces a list of tokens.
def fixup_ins_del_tags(html):
Given an html string, move any <ins> or <del> tags inside of any block-level elements, e.g. transform <ins><p>word</p></ins> to <p><ins>word</ins></p>
def flatten_el(el, include_hrefs, skip_tag=False):

Takes an lxml element el, and generates all the text chunks for that tag. Each start tag is a chunk, each word is a chunk, and each end tag is a chunk.

If skip_tag is true, then the outermost container tag is not returned (just its contents).

def html_annotate(doclist, markup=default_markup):

doclist should be ordered from oldest to newest, like:

>>> version1 = 'Hello World'
>>> version2 = 'Goodbye World'
>>> print(html_annotate([(version1, 'version 1'),
...                      (version2, 'version 2')]))
<span title="version 2">Goodbye</span> <span title="version 1">World</span>

The documents must be fragments (str/UTF8 or unicode), not complete documents

The markup argument is a function to markup the spans of words. This function is called like markup('Hello', 'version 2'), and returns HTML. The first argument is text and never includes any markup. The default uses a span with a title:

>>> print(default_markup('Some Text', 'by Joe'))
<span title="by Joe">Some Text</span>
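A custom markup function only needs to accept (text, version) and return HTML. For example, this hypothetical markup_class helper emits a class attribute instead of a title (the output shown is approximate and may vary slightly between versions):

>>> def markup_class(text, version):
...     return '<span class="annotate-%s">%s</span>' % (version, text)
>>> print(html_annotate([(version1, 'v1'), (version2, 'v2')],
...                     markup=markup_class))
<span class="annotate-v2">Goodbye</span> <span class="annotate-v1">World</span>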
def html_annotate_merge_annotations(tokens_old, tokens_new):
Merge the annotations from tokens_old into tokens_new, when the tokens in the new document already existed in the old document.
def htmldiff(old_html, new_html):

Do a diff of the old and new document. The documents are HTML fragments (str/UTF8 or unicode), they are not complete documents (i.e., no <html> tag).

Returns HTML with <ins> and <del> tags added around the appropriate text.

Markup is generally ignored, with the markup from new_html preserved, and possibly some markup from old_html (though it is considered acceptable to lose some of the old markup). Only the words in the HTML are diffed. The exception is <img> tags, which are treated like words, and the href attribute of <a> tags, which are noted inside the tag itself when there are changes.
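For example, assuming the module is importable as lxml.html.diff (the exact whitespace in the output may vary between versions):

>>> from lxml.html.diff import htmldiff
>>> html1 = '<p>Here is some text.</p>'
>>> html2 = '<p>Here is <b>a lot</b> of <i>text</i>.</p>'
>>> print(htmldiff(html1, html2))
<p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>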

def htmldiff_tokens(html1_tokens, html2_tokens):
Does a diff on the tokens themselves, returning a list of text chunks (not tokens).
def is_end_tag(tok):

Undocumented

def is_start_tag(tok):

Undocumented

def is_word(tok):

Undocumented

def locate_unbalanced_end(unbalanced_end, pre_delete, post_delete):
like locate_unbalanced_start, except handling end tags and possibly moving the point earlier in the document.
def locate_unbalanced_start(unbalanced_start, pre_delete, post_delete):

pre_delete and post_delete implicitly point to a place in the document (where the two were split). This moves that point (by popping items from one and pushing them onto the other). It moves the point to try to find a place where unbalanced_start applies.

As an example:

>>> unbalanced_start = ['<div>']
>>> doc = ['<p>', 'Text', '</p>', '<div>', 'More Text', '</div>']
>>> pre, post = doc[:3], doc[3:]
>>> pre, post
(['<p>', 'Text', '</p>'], ['<div>', 'More Text', '</div>'])
>>> locate_unbalanced_start(unbalanced_start, pre, post)
>>> pre, post
(['<p>', 'Text', '</p>', '<div>'], ['More Text', '</div>'])

As you can see, we moved the point so that the dangling <div> that we found will be effectively replaced by the div in the original document. If this doesn't work out, we just throw away unbalanced_start without doing anything.

def markup_serialize_tokens(tokens, markup_func):
Serialize the list of tokens into a list of text chunks, calling markup_func around text to add annotations.
def merge_delete(del_chunks, doc):
Adds the text chunks in del_chunks to the document doc (another list of text chunks) with markers to show it is a delete. cleanup_delete later resolves these markers into <del> tags.
def merge_insert(ins_chunks, doc):
doc is the already-handled document (as a list of text chunks); here we add <ins>ins_chunks</ins> to the end of that.
def parse_html(html, cleanup=True):

Parses an HTML fragment, returning an lxml element. Note that the HTML will be wrapped in a <div> tag that was not in the original document.

If cleanup is true, make sure there's no <head> or <body>, and get rid of any <ins> and <del> tags.
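A small sketch of the wrapping behaviour (assuming the module is importable as lxml.html.diff):

>>> from lxml.html.diff import parse_html
>>> el = parse_html('<p>Hi <ins>there</ins></p>', cleanup=True)
>>> el.tag        # the fragment is wrapped in an artificial <div>
'div'
>>> len(el)       # the original <p> is the wrapper's only child
1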

def serialize_html_fragment(el, skip_outer=False):

Serialize a single lxml element as HTML. The serialized form includes the element's tail.

If skip_outer is true, then don't serialize the outermost tag

def split_delete(chunks):
Returns (stuff_before_DEL_START, stuff_inside_DEL_START_END, stuff_after_DEL_END). Returns the first case found (there may be more DEL_STARTs in stuff_after_DEL_END). Raises NoDeletes if there's no DEL_START found.
def split_trailing_whitespace(word):
This function takes a word, such as 'test\n\n', and returns ('test', '\n\n').

def split_unbalanced(chunks):

Return (unbalanced_start, balanced, unbalanced_end), where each is a list of text and tag chunks.

unbalanced_start is a list of all the tags that are opened, but not closed in this span. Similarly, unbalanced_end is a list of tags that are closed but were not opened. Extracting these might mean some reordering of the chunks.
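A quick illustration with a span that opens <b> but closes </i> (a hedged sketch; split_unbalanced is an internal helper, so the exact chunk representation shown here is an assumption):

>>> from lxml.html.diff import split_unbalanced
>>> split_unbalanced(['<b>', 'hello ', '</i>'])
(['<b>'], ['hello '], ['</i>'])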

def split_words(text):
Splits some text into words. Includes trailing whitespace on each word when appropriate.
def start_tag(el):
The text representation of the start tag for a tag.
def tokenize(html, include_hrefs=True):

Parses the given HTML and returns token objects (words with attached tags).

This parses only the content of a page; anything in the head is ignored, and the <head> and <body> elements are themselves optional. The content is then parsed by lxml, which ensures the validity of the resulting parsed document (though lxml may make incorrect guesses when the markup is particularly bad).

<ins> and <del> tags are also eliminated from the document, as that gets confusing.

If include_hrefs is true, then the href attribute of <a> tags is included as a special kind of diffable token.
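A rough sketch of the resulting tokens (hedged: tokenize and the token attributes are internal API, so the details shown here may differ between versions):

>>> from lxml.html.diff import tokenize
>>> tokens = tokenize('<p>Some <b>bold</b> text</p>')
>>> [str(t) for t in tokens]      # tokens behave like the plain words...
['Some', 'bold', 'text']
>>> tokens[0].pre_tags            # ...with the surrounding markup attached
['<p>']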

def tokenize_annotated(doc, annotation):
Tokenize a document and add an annotation attribute to each token
block_level_container_tags: tuple[str, ...] =

Undocumented

block_level_tags: tuple[str, ...] =

Undocumented

empty_tags: tuple[str, ...] =

Undocumented

end_whitespace_re =

Undocumented

split_words_re =

Undocumented

start_whitespace_re =

Undocumented

def _contains_block_level_tag(el):
True if the element contains any block-level elements, like <p>, <td>, etc.
def _fixup_ins_del_tags(doc):
fixup_ins_del_tags that works on an lxml document in-place
def _merge_element_contents(el):
Removes an element, but merges its contents into its place, e.g., given <p>Hi <i>there!</i></p>, if you remove the <i> element you get <p>Hi there!</p>
def _move_el_inside_block(el, tag):
helper for _fixup_ins_del_tags; actually takes the <ins> etc tags and moves them inside any block-level tags.
_body_re =

Undocumented

_end_body_re =

Undocumented

_ins_del_re =

Undocumented