class Cleaner(object):
Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.
host_whitelist: A list or set of hosts that you can use for embedded content (for content like <object>, <link rel="stylesheet">, etc). You can also implement/override the method allow_embedded_url(el, url) or allow_element(el) to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) embedded.
Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.
Note that you may also need to set whitelist_tags.
This modifies the document in place.
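The note about absolute links can be illustrated with a small standard-library sketch. The function host_allowed below is hypothetical and is not lxml's implementation; it only assumes that a host check compares the URL's network location against the whitelist. A relative URL carries no host, so it can never match.

```python
from urllib.parse import urljoin, urlsplit


def host_allowed(url, host_whitelist):
    """Hypothetical check: is the URL's host in the whitelist?

    Illustrative only; not the code lxml uses internally.
    """
    parts = urlsplit(url)
    # Only plain web URLs are considered for embedding in this sketch.
    if parts.scheme not in ("http", "https"):
        return False
    # hostname is lower-cased and excludes any port; it is None if absent.
    return parts.hostname in host_whitelist


whitelist = {"www.youtube.com"}

# An absolute URL names its host, so it can be matched.
print(host_allowed("https://www.youtube.com/embed/abc", whitelist))

# A relative URL has no scheme or host, so the check rejects it.
print(host_allowed("/embed/abc", whitelist))

# Resolving against a (hypothetical) base URL first makes it checkable.
absolute = urljoin("https://www.youtube.com/watch", "/embed/abc")
print(host_allowed(absolute, whitelist))
```

With lxml, calling make_links_absolute() on the parsed document before cleaning is one way to get URLs into this checkable form.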
Method | __call__ |
Cleans the document. |
Method | __init__ |
Undocumented |
Method | allow_element |
Decide whether an element is configured to be accepted or rejected. |
Method | allow_embedded_url |
Decide whether a URL that was found in an element's attributes or text is configured to be accepted or rejected. |
Method | allow_follow |
Override to suppress rel="nofollow" on some anchors. |
Method | clean_html |
Undocumented |
Method | kill_conditional_comments |
IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional. |
Class Variable | add_nofollow |
Undocumented |
Class Variable | allow_tags |
Undocumented |
Class Variable | annoying_tags |
Undocumented |
Class Variable | comments |
Undocumented |
Class Variable | embedded |
Undocumented |
Class Variable | forms |
Undocumented |
Class Variable | frames |
Undocumented |
Class Variable | host_whitelist |
Undocumented |
Class Variable | javascript |
Undocumented |
Class Variable | kill_tags |
Undocumented |
Class Variable | links |
Undocumented |
Class Variable | meta |
Undocumented |
Class Variable | page_structure |
Undocumented |
Class Variable | processing_instructions |
Undocumented |
Class Variable | remove_tags |
Undocumented |
Class Variable | safe_attrs_only |
Undocumented |
Class Variable | scripts |
Undocumented |
Class Variable | style |
Undocumented |
Class Variable | whitelist_tags |
Undocumented |
Instance Variable | inline_style |
Undocumented |
Instance Variable | remove_unknown_tags |
Undocumented |
Method | _has_sneaky_javascript |
Depending on the browser, stuff like e x p r e s s i o n(...) can get interpreted, or expre/* stuff */ssion(...). This checks for attempts to do stuff like this. |
Method | _kill_elements |
Undocumented |
Method | _remove_javascript_link |
Undocumented |
Class Variable | _substitute_comments |
Undocumented |
Class Variable | _tag_link_attrs |
Undocumented |
Method | allow_element |
Parameters | |
el | an element. |
Returns | |
true to accept the element or false to reject/discard it. |
Method | allow_embedded_url |
Parameters | |
el | an element. |
url | a URL found on the element. |
Returns | |
true to accept the URL and false to reject it. |
Method | _has_sneaky_javascript |
Depending on the browser, stuff like e x p r e s s i o n(...) can get interpreted, or expre/* stuff */ssion(...). This checks for attempts to do stuff like this.
Typically the response will be to kill the entire style. If you have just a bit of JavaScript in the style, another rule will catch that and remove only the JavaScript from the style; this check catches the sneakier attempts.
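A minimal sketch of this kind of check, using only the standard library, might look like the following. The name looks_sneaky and the exact patterns are assumptions made for illustration; this is not the code lxml uses. The idea is to strip CSS comments and whitespace before searching, so that both obfuscations described above collapse to the plain keyword.

```python
import re

# Matches CSS comments such as /* stuff */, including multi-line ones.
_CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# Matches any run of whitespace.
_WHITESPACE = re.compile(r"\s+")


def looks_sneaky(style):
    """Hypothetical detector for obfuscated script in a CSS string."""
    # Remove comments first, so expre/* stuff */ssion collapses.
    collapsed = _CSS_COMMENT.sub("", style)
    # Then remove whitespace, so e x p r e s s i o n collapses too.
    collapsed = _WHITESPACE.sub("", collapsed).lower()
    return "expression(" in collapsed or "javascript:" in collapsed


print(looks_sneaky("width: e x p r e s s i o n(alert(1))"))
print(looks_sneaky("width: expre/* stuff */ssion(alert(1))"))
print(looks_sneaky("color: red; width: 10px"))
```

A caller following the approach described above would drop the whole style when such a check returns a true value, rather than trying to repair it.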