Module | builder |
A set of HTML generator tags for building HTML documents. |
Module | clean |
A cleanup tool for HTML. |
Module | defs |
Data taken from https://www.w3.org/TR/html401/index/elements.html and https://www.w3.org/community/webed/wiki/HTML/New_HTML5_Elements for html5_tags. |
Module | diff |
No module docstring; 0/9 variable, 32/36 functions, 5/7 classes documented |
Module | ElementSoup |
Undocumented |
Module | formfill |
No module docstring; 0/5 variable, 0/15 function, 1/2 class documented |
Module | html5parser |
An interface to html5lib that mimics the lxml.html interface. |
Module | soupparser |
External interface to the BeautifulSoup HTML parser. |
Package | tests |
No package docstring; 1/14 module documented |
Module | usedoctest |
Doctest module for HTML comparison. |
Module | _diffcommand |
Undocumented |
Module | _html5builder |
Legacy module - don't use in new code! |
Module | _setmixin |
No module docstring; 1/1 class documented |
From __init__.py:
Class | CheckboxGroup |
Represents a group of checkboxes (<input type=checkbox>) that have the same name. |
Class | CheckboxValues |
Represents the values of the checked checkboxes in a group of checkboxes with the same name. |
Class | Classes |
Provides access to an element's class attribute as a set-like collection. |
Class | FieldsDict |
Undocumented |
Class | FormElement |
Represents a <form> element. |
Class | HtmlComment |
Undocumented |
Class | HtmlElement |
Undocumented |
Class | HtmlElementClassLookup |
A lookup scheme for HTML Element classes. |
Class | HtmlEntity |
Undocumented |
Class | HtmlMixin |
No class docstring; 6/6 properties, 12/15 methods documented |
Class | HTMLParser |
An HTML parser that is configured to return lxml.html Element objects. |
Class | HtmlProcessingInstruction |
Undocumented |
Class | InputElement |
Represents an <input> element. |
Class | InputGetter |
An accessor that represents all the input fields in a form. |
Class | InputMixin |
Mix-in for all input elements (input, select, and textarea) |
Class | LabelElement |
Represents a <label> element. |
Class | MultipleSelectOptions |
Represents all the selected options in a <select multiple> element. |
Class | RadioGroup |
This object represents several <input type=radio> elements that have the same name. |
Class | SelectElement |
<select> element. You can get the name with .name. |
Class | TextareaElement |
<textarea> element. You can get the name with .name and get/set the value with .value |
Class | XHTMLParser |
An XML parser that is configured to return lxml.html Element objects. |
Function | document_fromstring |
Undocumented |
Function | Element |
Create a new HTML Element. |
Function | fragment_fromstring |
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. |
Function | fragments_fromstring |
Parses several HTML elements, returning a list of elements. |
Function | fromstring |
Parse the html, returning a single element/document. |
Function | html_to_xhtml |
Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace. |
Function | open_http_urllib |
Undocumented |
Function | open_in_browser |
Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging. |
Function | parse |
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root. |
Function | submit_form |
Helper function to submit a form. Returns a file-like object, as from urllib.urlopen(). This object also has a .geturl() function, which shows the URL if there were any redirects. |
Function | tostring |
Return an HTML string representation of the document. |
Function | xhtml_to_html |
Convert all tags in an XHTML tree to HTML by removing their XHTML namespace. |
Constant | XHTML_NAMESPACE |
Undocumented |
Variable | basestring |
Undocumented |
Variable | find_class |
Undocumented |
Variable | find_rel_links |
Undocumented |
Variable | html_parser |
Undocumented |
Variable | iterlinks |
Undocumented |
Variable | make_links_absolute |
Undocumented |
Variable | resolve_base_href |
Undocumented |
Variable | rewrite_links |
Undocumented |
Variable | xhtml_parser |
Undocumented |
Class | _MethodFunc |
No summary |
Function | __fix_docstring |
Undocumented |
Function | _contains_block_level_tag |
Undocumented |
Function | _element_name |
Undocumented |
Function | _nons |
Undocumented |
Function | _transform_result |
Convert the result back into the input type. |
Function | _unquote_match |
Undocumented |
Variable | __bytes_replace_meta_content_type |
Undocumented |
Variable | __str_replace_meta_content_type |
Undocumented |
Variable | _archive_re |
Undocumented |
Variable | _class_xpath |
Undocumented |
Variable | _collect_string_content |
Undocumented |
Variable | _forms_xpath |
Undocumented |
Variable | _id_xpath |
Undocumented |
Variable | _iter_css_imports |
Undocumented |
Variable | _iter_css_urls |
Undocumented |
Variable | _label_xpath |
Undocumented |
Variable | _looks_like_full_html_bytes |
Undocumented |
Variable | _looks_like_full_html_unicode |
Undocumented |
Variable | _options_xpath |
Undocumented |
Variable | _parse_meta_refresh_url |
Undocumented |
Variable | _rel_links_xpath |
Undocumented |
Parses several HTML elements, returning a list of elements.
The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.
base_url will set the document's base_url attribute (and the tree's docinfo.URL).
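As a rough illustration of the behaviour described above, the following sketch parses mixed content and shows how leading text and no_leading_text interact. The input strings and variable names are made up for the example.

```python
from lxml import etree, html

# Mixed content: leading text followed by two sibling elements.
parts = html.fragments_fromstring("Hello <b>bold</b> and <i>italic</i>")

# The first item is the leading text (a plain string); the rest are elements.
leading_text, bold, italic = parts
print(repr(leading_text), bold.tag, italic.tag)

# Text between the elements is kept as the tail of the preceding element.
print(repr(bold.tail))

# With no_leading_text=True, input that starts with text is rejected.
try:
    html.fragments_fromstring("Hello <b>bold</b>", no_leading_text=True)
    rejected = False
except etree.ParserError:
    rejected = True

# Without leading text, the result is a list of elements only.
only_elements = html.fragments_fromstring(
    "<b>bold</b><i>italic</i>", no_leading_text=True
)
print(rejected, [el.tag for el in only_elements])
```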
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
If create_parent is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is also allowed, as are multiple elements as result of the parsing.
Passing a base_url will set the document's base_url attribute (and the tree's docinfo.URL).
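The difference between the strict default and create_parent can be sketched as follows. The markup is invented for illustration; the error type caught here (etree.ParserError) is what lxml raises in the versions this sketch assumes.

```python
from lxml import etree, html

# A single element parses cleanly and is returned directly.
single = html.fragment_fromstring("<p>one</p>")
print(single.tag)

# Two top-level elements are an error in the default (strict) mode.
try:
    html.fragment_fromstring("<p>one</p><p>two</p>")
    strict_failed = False
except etree.ParserError:
    strict_failed = True

# create_parent="div" wraps everything in a new <div>, so multiple
# elements and surrounding text are accepted.
wrapped = html.fragment_fromstring(
    "lead <p>one</p><p>two</p> trail", create_parent="div"
)
print(wrapped.tag, len(wrapped), repr(wrapped.text), repr(wrapped[-1].tail))

# create_parent=True uses a default parent tag instead of a named one.
default_parent = html.fragment_fromstring("<p>one</p><p>two</p>", create_parent=True)
print(default_parent.tag)
```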
Parse the html, returning a single element/document.
This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.
base_url will set the document's base_url attribute (and the tree's docinfo.URL).
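A small sketch of how fromstring treats a full document versus a fragment may help here. The sample markup and the example.com URL are placeholders, not part of the library.

```python
from lxml import html

# Input that looks like a full document comes back as the <html> root.
doc = html.fromstring(
    "<html><head><title>T</title></head><body><p>x</p></body></html>"
)
print(doc.tag)

# A lone element comes back as that element, without html/body wrappers.
frag = html.fromstring("<p>x</p>", base_url="http://example.com/page")
print(frag.tag, frag.base_url)

# Several sibling elements are returned inside one container element.
multi = html.fromstring("<p>a</p><p>b</p>")
print(multi.tag, [child.tag for child in multi])
```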
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.
You can override the base URL with the base_url keyword. This is most useful when parsing from a file-like object.
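To illustrate the base_url keyword with a file-like object, the sketch below parses from an in-memory BytesIO buffer, which has no URL of its own. The markup and the example.com URL are assumptions made for the example.

```python
from io import BytesIO

from lxml import html

# A file-like object carries no URL, so relative links cannot be resolved
# unless base_url is supplied explicitly.
source = BytesIO(b'<html><body><a href="next.html">next</a></body></html>')
tree = html.parse(source, base_url="http://example.com/docs/index.html")

# parse() returns a tree; getroot() gives the <html> element.
root = tree.getroot()
print(root.tag, tree.docinfo.URL)

# With a base URL in place, relative links can be made absolute.
root.make_links_absolute()
hrefs = root.xpath("//a/@href")
print(hrefs)
```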
Helper function to submit a form. Returns a file-like object, as from urllib.urlopen(). This object also has a .geturl() function, which shows the URL if there were any redirects.
You can use this like:
form = doc.forms[0]
form.inputs['foo'].value = 'bar'  # etc
response = submit_form(form)
doc = parse(response).getroot()
doc.make_links_absolute(response.geturl())
To change the HTTP requester, pass a function as open_http keyword argument that opens the URL for you. The function must have the following signature:
open_http(method, URL, values)
The method is one of 'GET' or 'POST', the URL is the target URL as a string, and the values are a sequence of (name, value) tuples with the form data.
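The open_http hook can be sketched with a stand-in requester that records its arguments instead of touching the network. The function recording_open_http, the form markup, and the example.com URL are hypothetical and exist only for this illustration; a real requester would return a file-like response object.

```python
from lxml import html

calls = []

def recording_open_http(method, url, values):
    """Hypothetical requester: record the call instead of sending it."""
    calls.append((method, url, list(values)))
    return None  # a real implementation would return a file-like response

# Parse a small form; base_url lets the relative action be resolved.
doc = html.fromstring(
    '<form action="/search" method="post">'
    '<input type="text" name="q">'
    "</form>",
    base_url="http://example.com/",
)
form = doc.forms[0]
form.inputs["q"].value = "lxml"

# submit_form() collects the form data and hands it to the requester.
html.submit_form(form, open_http=recording_open_http)
print(calls)
```

Swapping the requester this way is also a convenient route to plugging in a different HTTP client, as long as it is wrapped in a function with the same three parameters.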
Return an HTML string representation of the document.
Note: if include_meta_content_type is true, this will create a <meta http-equiv="Content-Type" ...> tag in the head. Regardless of the value of include_meta_content_type, any existing <meta http-equiv="Content-Type" ...> tag will be removed.
The encoding argument controls the output encoding (defaults to ASCII, with &#...; character references for any characters outside of ASCII). Note that you can pass the name 'unicode' as encoding argument to serialise to a Unicode string.
The method argument defines the output method. It defaults to 'html', but can also be 'xml' for xhtml output, or 'text' to serialise to plain text without markup.
To leave out the tail text of the top-level element that is being serialised, pass with_tail=False.
The doctype option allows passing in a plain string that will be serialised before the XML tree. Note that passing in non well-formed content here will make the XML output non well-formed. Also, an existing doctype in the document tree will not be removed when serialising an ElementTree instance.
Example:
>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

>>> html.tostring(root)
b'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
b'<p>Hello<br>world!</p>'

>>> html.tostring(root, method='xml')
b'<p>Hello<br/>world!</p>'

>>> html.tostring(root, method='text')
b'Helloworld!'

>>> html.tostring(root, method='text', encoding='unicode')
u'Helloworld!'

>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding='unicode')
u'Helloworld!TAIL'

>>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
u'Helloworld!'

>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding='unicode')
u'<html><body><p>Hello<br>world!</p></body></html>'

>>> print(html.tostring(doc, method='html', encoding='unicode',
...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><p>Hello<br>world!</p></body></html>