module documentation

A cleanup tool for HTML.

Removes unwanted tags and content. See the Cleaner class for details.

Class Cleaner Instances clean the document of each possible offending element. The cleaning is controlled by attributes; you can override the attributes in a subclass or set them in the constructor.
Function autolink Turn any URLs into links.
Function autolink_html Undocumented
Function word_break Breaks any long words found in the body of the text (not attributes).
Function word_break_html Undocumented
Variable basestring Undocumented
Variable clean Undocumented
Function _break_text Undocumented
Function _has_javascript_scheme Undocumented
Function _insert_break Undocumented
Function _link_text Undocumented
Variable _avoid_classes Undocumented
Variable _avoid_elements Undocumented
Variable _avoid_hosts Undocumented
Variable _avoid_word_break_classes Undocumented
Variable _avoid_word_break_elements Undocumented
Variable _break_prefer_re Undocumented
Variable _conditional_comment_re Undocumented
Variable _find_external_links Undocumented
Variable _find_image_dataurls Undocumented
Variable _find_styled_elements Undocumented
Variable _is_unsafe_image_type Undocumented
Variable _link_regexes Undocumented
Variable _looks_like_tag_content Undocumented
Variable _possibly_malicious_schemes Undocumented
Variable _replace_css_import Undocumented
Variable _replace_css_javascript Undocumented
Variable _substitute_whitespace Undocumented
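
The Cleaner summary above says that cleaning options are attributes which can be overridden in a subclass or set in the constructor. A minimal sketch of that general pattern follows. OptionHolder, KeepComments and the option names strip_scripts and strip_comments are hypothetical and exist only for illustration; this is not the Cleaner class itself.

```python
class OptionHolder:
    """Hypothetical stand-in for the attribute-override pattern; not lxml's Cleaner."""

    # Class-level defaults (option names are made up for this sketch).
    strip_scripts = True
    strip_comments = True

    def __init__(self, **kw):
        # Keyword arguments override the class defaults on this instance only.
        for name, value in kw.items():
            if not hasattr(self, name):
                raise TypeError("Unknown option: %s" % name)
            setattr(self, name, value)


class KeepComments(OptionHolder):
    # A subclass changes the default for every instance it creates.
    strip_comments = False


default = OptionHolder()
per_instance = OptionHolder(strip_scripts=False)
subclassed = KeepComments()
```

Constructor arguments suit one-off configurations, while a subclass is convenient when the same set of overrides is reused in several places.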
def autolink(el, link_regexes=_link_regexes, avoid_elements=_avoid_elements, avoid_hosts=_avoid_hosts, avoid_classes=_avoid_classes):

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (default localhost and 127.0.0.1).
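
To make the matching and host-avoidance rules above concrete, here is a minimal standard-library sketch that works on a plain string. It is an illustration only, not the module's implementation: sketch_link_text, LINK_RE and AVOID_HOSTS are hypothetical names, and the patterns are simplified assumptions rather than the real defaults.

```python
import re
from html import escape
from urllib.parse import urlsplit

# Assumed, simplified patterns; the module's real defaults are not shown on this page.
LINK_RE = re.compile(r"(?:https?://|mailto:)[^\s<>]+", re.I)
AVOID_HOSTS = [re.compile(r"^localhost$", re.I), re.compile(r"^127\.0\.0\.1$")]


def sketch_link_text(text, link_re=LINK_RE, avoid_hosts=AVOID_HOSTS):
    """Wrap matching URLs in <a> tags unless their host matches avoid_hosts."""
    out = []
    pos = 0
    for match in link_re.finditer(text):
        url = match.group(0)
        host = urlsplit(url).hostname or ""  # mailto: URLs have no host
        out.append(escape(text[pos:match.start()]))
        if any(pattern.search(host) for pattern in avoid_hosts):
            out.append(escape(url))  # avoided host: leave as plain text
        else:
            out.append('<a href="%s">%s</a>' % (escape(url), escape(url)))
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

The sketch ignores avoid_elements and avoid_classes, which only make sense when walking an element tree rather than a string.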

If you pass in an element, only the contents of the element are processed; the element's tail text is not substituted.
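
For readers unfamiliar with the term, "tail" is the ElementTree name for the text that follows an element's closing tag, before the next sibling or the parent's end. The sketch below uses the standard library's xml.etree.ElementTree, which follows the same text/tail model, to show which text belongs to an element and which is its tail. The sample markup and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one URL inside <p>, one in the text that follows </p>.
root = ET.fromstring(
    "<div><p>inside http://a.example</p> after http://b.example</div>"
)
p = root.find("p")

inside = p.text  # content of <p>: what would be processed when <p> is passed in
after = p.tail   # text following </p>: the tail, which would be left untouched
```

To have the trailing text processed as well, the natural approach is to pass the parent element instead, since the tail of p is part of the parent's contents.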

def autolink_html(html, *args, **kw):

Undocumented

def word_break(el, max_width=40, avoid_elements=_avoid_word_break_elements, avoid_classes=_avoid_word_break_classes, break_character=unichr(8203)):

Breaks any long words found in the body of the text (not attributes).

Doesn't affect any of the tags in avoid_elements, by default <textarea> and <pre>.

Breaks words by inserting &#8203;, the Unicode Zero Width Space character. This generally takes up no space in rendering, but does copy as a space, and in monospace contexts usually takes up space.

See http://www.cs.tut.fi/~jkorpela/html/nobr.html for a discussion
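
The zero-width-space insertion described above can be sketched on a plain string as follows. This is an assumption-laden illustration, not the module's code: break_long_words is a hypothetical helper that splits at fixed widths and ignores any preferred break points, and it does not walk an element tree or honour avoid_elements.

```python
import re

ZWSP = "\u200b"  # Zero Width Space, the character written as &#8203; in HTML


def break_long_words(text, max_width=40, break_character=ZWSP):
    """Insert break_character into every run of non-space text longer than max_width."""

    def split_word(match):
        word = match.group(0)
        # Cut the word into pieces of at most max_width characters.
        chunks = [word[i:i + max_width] for i in range(0, len(word), max_width)]
        return break_character.join(chunks)

    # Only runs strictly longer than max_width are touched.
    return re.sub(r"\S{%d,}" % (max_width + 1), split_word, text)
```

Because the inserted character is invisible, comparing results by length or with repr() is the easiest way to confirm that breaks were added.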

def word_break_html(html, *args, **kw):

Undocumented

basestring =

Undocumented

clean =

Undocumented

def _break_text(text, max_width, break_character):

Undocumented

def _has_javascript_scheme(s):

Undocumented

def _insert_break(word, width, break_character):

Undocumented

def _link_text(text, link_regexes, avoid_hosts, factory):

Undocumented

_avoid_classes: list[str] =

Undocumented

_avoid_elements: list[str] =

Undocumented

_avoid_hosts =

Undocumented

_avoid_word_break_classes: list[str] =

Undocumented

_avoid_word_break_elements: list[str] =

Undocumented

_break_prefer_re =

Undocumented

_conditional_comment_re =

Undocumented

_find_external_links =

Undocumented

_find_image_dataurls =

Undocumented

_find_styled_elements =

Undocumented

_is_unsafe_image_type =

Undocumented

_link_regexes =

Undocumented

_looks_like_tag_content =

Undocumented

_possibly_malicious_schemes =

Undocumented

_replace_css_import =

Undocumented

_replace_css_javascript =

Undocumented

_substitute_whitespace =

Undocumented