lxml.html.HtmlMixin

class documentation

class HtmlMixin(object):

Known subclasses: lxml.html.HtmlComment, lxml.html.HtmlElement, lxml.html.HtmlEntity, lxml.html.HtmlProcessingInstruction

Undocumented

Method	`classes.setter`	Undocumented
Method	`cssselect`	Run the CSS expression on this element and its children, returning a list of the results.
Method	`drop_tag`	Remove the tag, but not its children or text. The children and text are merged into the parent.
Method	`drop_tree`	Removes this element from the tree, including its children and text. The tail text is joined to the previous element or parent.
Method	`find_class`	Find any elements with the given class name.
Method	`find_rel_links`	Find any links like `<a rel="{rel}">...</a>`; returns a list of elements.
Method	`get_element_by_id`	Get the first element in a document with the given id. If none is found, return the default argument if provided or raise KeyError otherwise.
Method	`iterlinks`	Yield (element, attribute, link, pos) for each link found in the document.
Method	`label.deleter`	Undocumented
Method	`label.setter`	Undocumented
Method	`make_links_absolute`	Make all links in the document absolute, given the base_url for the document.
Method	`resolve_base_href`	Find any `<base href>` tag in the document, and apply its values to all links found in the document. Also remove the tag once it has been applied.
Method	`rewrite_links`	Rewrite all the links in the document. For each link `link_repl_func(link)` will be called, and the return value will replace the old link.
Method	`set`	Sets an element attribute.
Method	`text_content`	Return the text content of the tag (and the text in any children).
Property	`base_url`	Returns the base URL, given when the page was parsed.
Property	`body`	Return the <body> element. Can be called from a child element to get the document's body.
Property	`classes`	A set-like wrapper around the 'class' attribute.
Property	`forms`	Return a list of all the forms.
Property	`head`	Returns the <head> element. Can be called from a child element to get the document's head.
Property	`label`	Get or set any <label> element associated with this element.

@classes.setter
def classes(self, classes):

Undocumented

def cssselect(self, expr, translator='html'):

Run the CSS expression on this element and its children, returning a list of the results.

Equivalent to lxml.cssselect.CSSSelector(expr, translator=translator)(self). Note that pre-compiling the expression can provide a substantial speedup.

def drop_tag(self):

Remove the tag, but not its children or text. The children and text are merged into the parent.

Example:

>>> h = fragment_fromstring('<div>Hello <b>World!</b></div>')
>>> h.find('.//b').drop_tag()
>>> print(tostring(h, encoding='unicode'))
<div>Hello World!</div>

def drop_tree(self):

Removes this element from the tree, including its children and text. The tail text is joined to the previous element or parent.

def find_class(self, class_name):

Find any elements with the given class name.
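A minimal sketch of how this might be used, assuming a made-up fragment. It suggests that the class name is matched as a whole whitespace-separated token, not as a substring.

```python
from lxml.html import fragment_fromstring

# Hypothetical markup: 'notes' should not match a search for 'note'.
h = fragment_fromstring(
    '<div>'
    '<p class="note warning">A</p>'
    '<p class="notes">B</p>'
    '<p class="note">C</p>'
    '</div>'
)

matches = h.find_class('note')
texts = [el.text for el in matches]
print(texts)
```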

def find_rel_links(self, rel):

Find any links like <a rel="{rel}">...</a>; returns a list of elements.
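For illustration, a short sketch with made-up markup. Only <a> elements whose rel attribute equals the requested value are returned.

```python
from lxml.html import fromstring

# Hypothetical fragment with three links, two of which carry a rel attribute.
doc = fromstring(
    '<div>'
    '<a rel="nofollow" href="/x">X</a>'
    '<a rel="tag" href="/y">Y</a>'
    '<a href="/z">Z</a>'
    '</div>'
)

nofollow_hrefs = [a.get('href') for a in doc.find_rel_links('nofollow')]
print(nofollow_hrefs)
```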

def get_element_by_id(self, id, *default):

Get the first element in a document with the given id. If none is found, return the default argument if provided or raise KeyError otherwise.

Note that there can be more than one element with the same id, and this isn't uncommon in HTML documents found in the wild. Browsers return only the first match, and this function does the same.
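The three outcomes described above (first match, default value, KeyError) can be sketched as follows. The document and the ids are invented for the example.

```python
from lxml.html import fromstring

# Hypothetical document with a duplicated id, as found in the wild.
doc = fromstring(
    '<html><body>'
    '<p id="intro">First</p>'
    '<p id="intro">Second</p>'
    '</body></html>'
)

# Only the first element with the id is returned.
first_text = doc.get_element_by_id('intro').text

# With a default argument, a missing id returns the default.
missing = doc.get_element_by_id('no-such-id', None)

# Without a default, a missing id raises KeyError.
try:
    doc.get_element_by_id('no-such-id')
    raised = False
except KeyError:
    raised = True

print(first_text, missing, raised)
```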

def iterlinks(self):

Yield (element, attribute, link, pos), where attribute may be None (indicating the link is in the text). pos is the position where the link occurs; often 0, but sometimes something else in the case of links in stylesheets or style tags.

Note: <base href> is not taken into account in any way. The link you get is exactly the link in the document.

Note: multiple links inside of a single text string or attribute value are returned in reversed order. This makes it possible to replace or delete them from the text string value based on their reported text positions. Otherwise, a modification at one text position can change the positions of links reported later on.
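As a rough illustration of the yielded tuples, the sketch below collects the links of a made-up fragment. It only covers links in attributes, where pos is 0.

```python
from lxml.html import fromstring

# Hypothetical fragment with one link in href and one in src.
doc = fromstring('<div><a href="/a">A</a><img src="pic.png"></div>')

# Each item is (element, attribute, link, pos); the element is reduced
# to its tag name here to make the result easy to read.
links = [(el.tag, attr, link, pos) for el, attr, link, pos in doc.iterlinks()]
print(links)
```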

@label.deleter
def label(self):

Undocumented

@label.setter
def label(self, label):

Undocumented

def make_links_absolute(self, base_url=None, resolve_base_href=True, handle_failures=None):

Make all links in the document absolute, given the base_url for the document (the full URL where the document came from), or if no base_url is given, then the .base_url of the document.

If resolve_base_href is true, then any <base href> tags in the document are used and removed from the document. If it is false then any such tag is ignored.

If handle_failures is None (default), a failure to process a URL will abort the processing. If set to 'ignore', errors are ignored. If set to 'discard', failing URLs will be removed.
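A minimal sketch of the basic case, assuming an invented base URL and fragment. The links are rewritten in place.

```python
from lxml.html import fromstring

# Hypothetical fragment with a root-relative and a document-relative link.
doc = fromstring(
    '<div><a href="/about">About</a><a href="page.html">Page</a></div>'
)

# 'http://example.com/docs/' is an assumed URL for the example.
doc.make_links_absolute('http://example.com/docs/')

hrefs = [a.get('href') for a in doc.iter('a')]
print(hrefs)
```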

def resolve_base_href(self, handle_failures=None):

Find any <base href> tag in the document, and apply its values to all links found in the document. Also remove the tag once it has been applied.

If handle_failures is None (default), a failure to process a URL will abort the processing. If set to 'ignore', errors are ignored. If set to 'discard', failing URLs will be removed.
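The effect can be sketched with a made-up document that contains a <base href> tag. After the call, the link is absolute and the tag is gone.

```python
from lxml.html import fromstring

# Hypothetical document; the base URL is invented for the example.
doc = fromstring(
    '<html><head><base href="http://example.com/site/"></head>'
    '<body><a href="page.html">Page</a></body></html>'
)

doc.resolve_base_href()

href = doc.find('.//a').get('href')
remaining_base_tags = doc.findall('.//base')
print(href, remaining_base_tags)
```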

def rewrite_links(self, link_repl_func, resolve_base_href=True, base_href=None):

Rewrite all the links in the document. For each link link_repl_func(link) will be called, and the return value will replace the old link.

Note that links may not be absolute (unless you first called make_links_absolute()), and may be internal (e.g., '#anchor'). They can also be values like 'mailto:email' or 'javascript:expr'.

If you give base_href then all links passed to link_repl_func() will take that into account.

If the link_repl_func returns None, the attribute or tag text will be removed completely.
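The following sketch shows one possible replacement function. The host names and the helper name `repl` are hypothetical; the function moves links to a new host, leaves internal anchors alone, and returns None to remove mailto links.

```python
from lxml.html import fromstring

# Hypothetical fragment with an absolute, an internal, and a mailto link.
doc = fromstring(
    '<div>'
    '<a href="http://old.example.com/a">A</a>'
    '<a href="#top">Top</a>'
    '<a href="mailto:me@example.com">Mail</a>'
    '</div>'
)

def repl(link):
    """Illustrative link_repl_func; not part of lxml."""
    if link.startswith('http://old.example.com/'):
        return link.replace('http://old.example.com/', 'https://new.example.com/', 1)
    if link.startswith('mailto:'):
        return None  # the href attribute is removed
    return link  # unchanged

doc.rewrite_links(repl)

hrefs = [a.get('href') for a in doc.iter('a')]
print(hrefs)
```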

def set(self, key, value=None):

Sets an element attribute. If no value is provided, or if the value is None, creates a 'boolean' attribute without value, e.g. "<form novalidate></form>" for form.set('novalidate').
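A short sketch of both forms, using the form example from the description above. The 'method' attribute is added only for contrast.

```python
from lxml.html import fragment_fromstring, tostring

form = fragment_fromstring('<form></form>')

# No value: a boolean attribute is created.
form.set('novalidate')
# With a value: a regular attribute is created.
form.set('method', 'post')

html = tostring(form, encoding='unicode')
print(html)
```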

def text_content(self):

Return the text content of the tag (and the text in any children).
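For illustration, a minimal sketch on the fragment used in the drop_tag() example:

```python
from lxml.html import fragment_fromstring

h = fragment_fromstring('<div>Hello <b>World!</b></div>')

# The text of the element and of all its descendants, without markup.
text = h.text_content()
print(text)
```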

@property
base_url =

Returns the base URL, given when the page was parsed.

Use with urllib.parse.urljoin(el.base_url, href) to get absolute URLs (on Python 2, urljoin lives in the urlparse module).
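A minimal sketch of this pattern, assuming Python 3 (where urljoin is found in urllib.parse) and an invented base URL passed at parse time:

```python
from urllib.parse import urljoin

from lxml.html import fromstring

# The base_url keyword records where the (hypothetical) page came from.
doc = fromstring(
    '<div><a href="page.html">Page</a></div>',
    base_url='http://example.com/docs/',
)

a = doc.find('.//a')
absolute = urljoin(a.base_url, a.get('href'))
print(absolute)
```

Unlike make_links_absolute(), this leaves the document itself unchanged.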

@property
body =

Return the <body> element. Can be called from a child element to get the document's body.

@property
classes =

A set-like wrapper around the 'class' attribute.
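As a rough sketch of the set-like interface, the example below adds and discards class names on a made-up element. Changes are written back to the 'class' attribute.

```python
from lxml.html import fragment_fromstring

p = fragment_fromstring('<p class="note">Hi</p>')

p.classes.add('warning')     # class="note warning"
p.classes.discard('note')    # class="warning"

has_warning = 'warning' in p.classes
class_attr = p.get('class')
print(has_warning, class_attr)
```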

@property
forms =

Return a list of all the forms.

@property
head =

Returns the <head> element. Can be called from a child element to get the document's head.
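A small sketch, using an invented document, of reaching <head> (and likewise <body>) from a nested element:

```python
from lxml.html import fromstring

doc = fromstring(
    '<html><head><title>T</title></head><body><p>Hi</p></body></html>'
)

# Start from a child element deep in the tree.
p = doc.find('.//p')

head_tag = p.head.tag
title = p.head.find('title').text
body_tag = p.body.tag
print(head_tag, title, body_tag)
```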

@property
label =

Get or set any <label> element associated with this element.
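For illustration, a minimal sketch of reading the property. The form markup is made up; the label is associated with the input through matching for and id attributes.

```python
from lxml.html import fromstring

# Hypothetical form with one labelled and one unlabelled input.
form = fromstring(
    '<form>'
    '<label for="email">Email</label>'
    '<input id="email" name="email">'
    '<input id="phone" name="phone">'
    '</form>'
)

email_label = form.get_element_by_id('email').label
phone_label = form.get_element_by_id('phone').label

print(email_label.text, phone_label)
```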