uTools-Manuals/docs/python/html.parser.html
2019-04-21 11:50:48 +08:00

<div class="body" role="main"><div class="section" id="module-html.parser"><h1><span class="yiyi-st" id="yiyi-10">20.2. <a class="reference internal" href="#module-html.parser" title="html.parser: A simple parser that can handle HTML and XHTML."><code class="xref py py-mod docutils literal"><span class="pre">html.parser</span></code></a> - Simple HTML and XHTML parser</span></h1><p><span class="yiyi-st" id="yiyi-11"><strong>Source code:</strong> <a class="reference external" href="https://hg.python.org/cpython/file/3.5/Lib/html/parser.py">Lib/html/parser.py</a></span></p><p><span class="yiyi-st" id="yiyi-12">This module defines a class <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.</span></p><dl class="class"><dt id="html.parser.HTMLParser"><span class="yiyi-st" id="yiyi-13"> <em class="property">class </em><code class="descclassname">html.parser.</code><code class="descname">HTMLParser</code><span class="sig-paren">(</span><em>*</em>, <em>convert_charrefs=True</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-14">Create a parser instance able to parse invalid markup.</span></p><p><span class="yiyi-st" id="yiyi-15">If <em>convert_charrefs</em> is <code class="docutils literal"><span class="pre">True</span></code> (the default), all character references (except the ones in <code class="docutils literal"><span class="pre">script</span></code>/<code class="docutils literal"><span class="pre">style</span></code> elements) are automatically converted to the corresponding Unicode characters.</span></p><p><span class="yiyi-st" id="yiyi-16">An <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. </span><span class="yiyi-st" id="yiyi-17">The user should subclass <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> and override its methods to implement the desired behavior.</span></p><p><span class="yiyi-st" id="yiyi-18">This parser does not check that end tags match start tags, nor does it call the end-tag handler for elements that are closed implicitly by closing an outer element.</span></p><div class="versionchanged"><p><span class="yiyi-st" id="yiyi-19"><span class="versionmodified">Changed in version 3.4:</span> The <em>convert_charrefs</em> keyword argument was added.</span></p></div><div class="versionchanged"><p><span class="yiyi-st" id="yiyi-20"><span class="versionmodified">Changed in version 3.5:</span> The default value of the <em>convert_charrefs</em> argument is now <code class="docutils literal"><span class="pre">True</span></code>.</span></p></div></dd></dl>
<div class="section" id="example-html-parser-application"><h2><span class="yiyi-st" id="yiyi-21">20.2.1. </span><span class="yiyi-st" id="yiyi-22">Example HTML Parser Application</span></h2><p><span class="yiyi-st" id="yiyi-23">As a basic example, below is a simple HTML parser that uses the <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> class to print out start tags, end tags, and data as they are encountered:</span></p><pre><code class="language-python"><span></span><span class="kn">from</span> <span class="nn">html.parser</span> <span class="k">import</span> <span class="n">HTMLParser</span>
<span class="c1"># Subclass HTMLParser and override only the handlers of interest.</span>
<span class="k">class</span> <span class="nc">MyHTMLParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">handle_starttag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">,</span> <span class="n">attrs</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Encountered a start tag:"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_endtag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Encountered an end tag :"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Encountered some data :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

<span class="c1"># feed() calls the handlers as each piece of markup is recognized.</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">MyHTMLParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;html&gt;&lt;head&gt;&lt;title&gt;Test&lt;/title&gt;&lt;/head&gt;'</span>
            <span class="s1">'&lt;body&gt;&lt;h1&gt;Parse me!&lt;/h1&gt;&lt;/body&gt;&lt;/html&gt;'</span><span class="p">)</span>
</code></pre><p><span class="yiyi-st" id="yiyi-24">The output will then be:</span></p><div class="highlight-none"><div class="highlight"><pre><span></span>Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
</pre></div></div></div><div class="section" id="htmlparser-methods"><h2><span class="yiyi-st" id="yiyi-25">20.2.2. </span><span class="yiyi-st" id="yiyi-26"><a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> Methods</span></h2><p><span class="yiyi-st" id="yiyi-27"><a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> instances have the following methods:</span></p><dl class="method"><dt id="html.parser.HTMLParser.feed"><span class="yiyi-st" id="yiyi-28"> <code class="descclassname">HTMLParser.</code><code class="descname">feed</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-29">Feed some text to the parser. </span><span class="yiyi-st" id="yiyi-30">It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or <a class="reference internal" href="#html.parser.HTMLParser.close" title="html.parser.HTMLParser.close"><code class="xref py py-meth docutils literal"><span class="pre">close()</span></code></a> is called. </span><span class="yiyi-st" id="yiyi-31"><em>data</em> must be <a class="reference internal" href="stdtypes.html#str" title="str"><code class="xref py py-class docutils literal"><span class="pre">str</span></code></a>.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.close"><span class="yiyi-st" id="yiyi-32"> <code class="descclassname">HTMLParser.</code><code class="descname">close</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-33">Force processing of all buffered data as if it were followed by an end-of-file mark. </span><span class="yiyi-st" id="yiyi-34">This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> base class method <a class="reference internal" href="#html.parser.HTMLParser.close" title="html.parser.HTMLParser.close"><code class="xref py py-meth docutils literal"><span class="pre">close()</span></code></a>.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.reset"><span class="yiyi-st" id="yiyi-35"> <code class="descclassname">HTMLParser.</code><code class="descname">reset</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-36">Reset the instance. </span><span class="yiyi-st" id="yiyi-37">All unprocessed data is lost. </span><span class="yiyi-st" id="yiyi-38">This is called implicitly at instantiation time.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.getpos"><span class="yiyi-st" id="yiyi-39"> <code class="descclassname">HTMLParser.</code><code class="descname">getpos</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-40">Return the current line number and offset.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.get_starttag_text"><span class="yiyi-st" id="yiyi-41"> <code class="descclassname">HTMLParser.</code><code class="descname">get_starttag_text</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-42">Return the text of the most recently opened start tag. </span><span class="yiyi-st" id="yiyi-43">This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes </span><span class="yiyi-st" id="yiyi-44">(whitespace between attributes can be preserved, etc.).</span></p></dd></dl><p><span class="yiyi-st" id="yiyi-45">The following methods are called when data or markup elements are encountered, and they are meant to be overridden in a subclass. </span><span class="yiyi-st" id="yiyi-46">The base class implementations do nothing (except for <a class="reference internal" href="#html.parser.HTMLParser.handle_startendtag" title="html.parser.HTMLParser.handle_startendtag"><code class="xref py py-meth docutils literal"><span class="pre">handle_startendtag()</span></code></a>):</span></p><dl class="method"><dt id="html.parser.HTMLParser.handle_starttag"><span class="yiyi-st" id="yiyi-47"> <code
class="descclassname">HTMLParser.</code><code class="descname">handle_starttag</code><span class="sig-paren">(</span><em>tag</em>, <em>attrs</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-48">This method is called to handle the start of a tag (e.g. </span><span class="yiyi-st" id="yiyi-49"><code class="docutils literal"><span class="pre">&lt;div</span> <span class="pre">id="main"&gt;</span></code>).</span></p><p><span class="yiyi-st" id="yiyi-50">The <em>tag</em> argument is the name of the tag converted to lower case. </span><span class="yiyi-st" id="yiyi-51">The <em>attrs</em> argument is a list of <code class="docutils literal"><span class="pre">(name,</span> <span class="pre">value)</span></code> pairs containing the attributes found inside the tag's <code class="docutils literal"><span class="pre">&lt;&gt;</span></code> brackets. </span><span class="yiyi-st" id="yiyi-52">The <em>name</em> is converted to lower case, quotes in the <em>value</em> have been removed, and character and entity references have been replaced.</span></p><p><span class="yiyi-st" id="yiyi-53">For instance, for the tag <code class="docutils literal"><span class="pre">&lt;A</span> <span class="pre">HREF="https://www.cwi.nl/"&gt;</span></code>, this method would be called as <code class="docutils literal"><span class="pre">handle_starttag('a',</span> <span class="pre">[('href',</span> <span class="pre">'https://www.cwi.nl/')])</span></code>.</span></p><p><span class="yiyi-st" id="yiyi-54">All entity references from <a class="reference internal" href="html.entities.html#module-html.entities" title="html.entities: Definitions of HTML general entities."><code class="xref py py-mod docutils literal"><span class="pre">html.entities</span></code></a> are replaced in the attribute values.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.handle_endtag"><span class="yiyi-st" id="yiyi-55"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_endtag</code><span class="sig-paren">(</span><em>tag</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-56">This method is called to handle the end tag of an element (e.g. </span><span class="yiyi-st" id="yiyi-57"><code class="docutils literal"><span class="pre">&lt;/div&gt;</span></code>).</span></p><p><span class="yiyi-st" id="yiyi-58">The <em>tag</em> argument is the name of the tag converted to lower case.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_startendtag"><span class="yiyi-st" id="yiyi-59"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_startendtag</code><span class="sig-paren">(</span><em>tag</em>, <em>attrs</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-60">Similar to <a class="reference internal" href="#html.parser.HTMLParser.handle_starttag" title="html.parser.HTMLParser.handle_starttag"><code class="xref py py-meth docutils literal"><span class="pre">handle_starttag()</span></code></a>, but called when the parser encounters an XHTML-style empty tag (<code class="docutils literal"><span class="pre">&lt;img</span> <span class="pre">...</span> <span class="pre">/&gt;</span></code>). </span><span class="yiyi-st" id="yiyi-61">This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls <a class="reference internal" href="#html.parser.HTMLParser.handle_starttag" title="html.parser.HTMLParser.handle_starttag"><code class="xref py py-meth docutils literal"><span class="pre">handle_starttag()</span></code></a> and <a class="reference internal" href="#html.parser.HTMLParser.handle_endtag" title="html.parser.HTMLParser.handle_endtag"><code class="xref py py-meth docutils literal"><span class="pre">handle_endtag()</span></code></a>.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.handle_data"><span class="yiyi-st" id="yiyi-62"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_data</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-63">This method is called to process arbitrary data (e.g. </span><span class="yiyi-st" id="yiyi-64">text nodes and the content of <code class="docutils literal"><span class="pre">&lt;script&gt;...&lt;/script&gt;</span></code> and <code class="docutils literal"><span class="pre">&lt;style&gt;...&lt;/style&gt;</span></code>).</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_entityref"><span class="yiyi-st" id="yiyi-65"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_entityref</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-66">This method is called to process a named character reference of the form <code class="docutils literal"><span class="pre">&amp;name;</span></code> (e.g. </span><span class="yiyi-st" id="yiyi-67"><code class="docutils literal"><span class="pre">&amp;gt;</span></code>), where <em>name</em> is a general entity reference (e.g. </span><span class="yiyi-st" id="yiyi-68"><code class="docutils literal"><span class="pre">'gt'</span></code>). </span><span class="yiyi-st" id="yiyi-69">This method is never called if <em>convert_charrefs</em> is <code class="docutils literal"><span class="pre">True</span></code>.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.handle_charref"><span class="yiyi-st" id="yiyi-70"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_charref</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-71">This method is called to process decimal and hexadecimal numeric character references of the form <code class="docutils literal"><span class="pre">&amp;#NNN;</span></code> and <code class="docutils literal"><span class="pre">&amp;#xNNN;</span></code>. </span><span class="yiyi-st" id="yiyi-72">For example, the decimal equivalent for <code class="docutils literal"><span class="pre">&amp;gt;</span></code> is <code class="docutils literal"><span class="pre">&amp;#62;</span></code>, whereas the hexadecimal is <code class="docutils literal"><span class="pre">&amp;#x3E;</span></code>; in this case the method will receive <code class="docutils literal"><span class="pre">'62'</span></code> or <code class="docutils literal"><span class="pre">'x3E'</span></code>. </span><span class="yiyi-st" id="yiyi-73">This method is never called if <em>convert_charrefs</em> is <code class="docutils literal"><span class="pre">True</span></code>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_comment"><span class="yiyi-st" id="yiyi-74"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_comment</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-75">This method is called when a comment is encountered (e.g. </span><span class="yiyi-st" id="yiyi-76"><code class="docutils literal"><span class="pre">&lt;!--comment--&gt;</span></code>).</span></p><p><span class="yiyi-st" id="yiyi-77">For example, the comment <code class="docutils literal"><span class="pre">&lt;!--</span> <span class="pre">comment</span> <span class="pre">--&gt;</span></code> will cause this method to be called with the argument <code class="docutils literal"><span class="pre">'</span> <span class="pre">comment</span> <span class="pre">'</span></code>.</span></p><p><span class="yiyi-st" id="yiyi-78">The content of Internet Explorer conditional comments (condcoms) will also be sent to this method, so, for <code class="docutils literal"><span class="pre">&lt;!--[if</span> <span class="pre">IE</span> <span class="pre">9]&gt;IE9-specific</span> <span class="pre">content&lt;![endif]--&gt;</span></code>, this method will receive <code class="docutils literal"><span class="pre">'[if</span> <span class="pre">IE</span> <span class="pre">9]&gt;IE9-specific</span> <span class="pre">content&lt;![endif]'</span></code>.</span></p></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.handle_decl"><span class="yiyi-st" id="yiyi-79"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_decl</code><span class="sig-paren">(</span><em>decl</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-80">This method is called to handle an HTML doctype declaration (e.g. </span><span class="yiyi-st" id="yiyi-81"><code class="docutils literal"><span class="pre">&lt;!DOCTYPE</span> <span class="pre">html&gt;</span></code>).</span></p><p><span class="yiyi-st" id="yiyi-82">The <em>decl</em> parameter will be the entire contents of the declaration inside the <code class="docutils literal"><span class="pre">&lt;!...&gt;</span></code> markup (e.g. </span><span class="yiyi-st" id="yiyi-83"><code class="docutils literal"><span class="pre">'DOCTYPE</span> <span class="pre">html'</span></code>).</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_pi"><span class="yiyi-st" id="yiyi-84"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_pi</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-85">This method is called when a processing instruction is encountered. </span><span class="yiyi-st" id="yiyi-86">The <em>data</em> parameter will contain the entire processing instruction. </span><span class="yiyi-st" id="yiyi-87">For example, for the processing instruction <code class="docutils literal"><span class="pre">&lt;?proc</span> <span class="pre">color='red'&gt;</span></code>, this method would be called as <code class="docutils literal"><span class="pre">handle_pi("proc</span> <span class="pre">color='red'")</span></code>. </span><span class="yiyi-st" id="yiyi-88">It is intended to be overridden by a derived class; the base class implementation does nothing.</span></p><div class="admonition note"><p class="first admonition-title"><span class="yiyi-st" id="yiyi-89">Note</span></p><p class="last"><span class="yiyi-st" id="yiyi-90">The <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> class uses the SGML syntactic rules for processing instructions. </span><span class="yiyi-st" id="yiyi-91">An XHTML processing instruction using the trailing <code class="docutils literal"><span class="pre">'?'</span></code> </span><span class="yiyi-st" id="yiyi-92">will cause the <code class="docutils literal"><span class="pre">'?'</span></code> </span><span class="yiyi-st" id="yiyi-93">to be included in <em>data</em>.</span></p></div></dd></dl><dl class="method"><dt id="html.parser.HTMLParser.unknown_decl"><span class="yiyi-st" id="yiyi-94"> <code class="descclassname">HTMLParser.</code><code class="descname">unknown_decl</code><span class="sig-paren">(</span><em>data</em><span
class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-95">This method is called when an unrecognized declaration is read by the parser.</span></p><p><span class="yiyi-st" id="yiyi-96">The <em>data</em> parameter will be the entire contents of the declaration inside the <code class="docutils literal"><span class="pre">&lt;![...]&gt;</span></code> markup. </span><span class="yiyi-st" id="yiyi-97">It is sometimes useful to override it in a derived class. </span><span class="yiyi-st" id="yiyi-98">The base class implementation does nothing.</span></p></dd></dl></div><div class="section" id="examples"><h2><span class="yiyi-st" id="yiyi-99">20.2.3. </span><span class="yiyi-st" id="yiyi-100">Examples</span></h2><p><span class="yiyi-st" id="yiyi-101">The following class implements a parser that will be used to illustrate more examples:</span></p><pre><code class="language-python"><span></span><span class="kn">from</span> <span class="nn">html.parser</span> <span class="k">import</span> <span class="n">HTMLParser</span>
<span class="kn">from</span> <span class="nn">html.entities</span> <span class="k">import</span> <span class="n">name2codepoint</span>
<span class="k">class</span> <span class="nc">MyHTMLParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">handle_starttag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">,</span> <span class="n">attrs</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Start tag:"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>
        <span class="c1"># attrs is a list of (name, value) pairs</span>
        <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">attrs</span><span class="p">:</span>
            <span class="nb">print</span><span class="p">(</span><span class="s2">" attr:"</span><span class="p">,</span> <span class="n">attr</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_endtag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"End tag :"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Data :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_comment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Comment :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_entityref</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="c1"># name is the entity name without '&amp;' and ';', e.g. 'gt'</span>
        <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="n">name2codepoint</span><span class="p">[</span><span class="n">name</span><span class="p">])</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Named ent:"</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_charref</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="c1"># name is '62' for a decimal reference or 'x3E' for a hexadecimal one</span>
        <span class="k">if</span> <span class="n">name</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'x'</span><span class="p">):</span>
            <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="mi">16</span><span class="p">))</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Num ent :"</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_decl</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Decl :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

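<span class="c1"># convert_charrefs=False so that handle_entityref() and handle_charref() are called</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">MyHTMLParser</span><span class="p">(</span><span class="n">convert_charrefs</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>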
</code></pre><p><span class="yiyi-st" id="yiyi-102">Parsing a doctype:</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '</span>
<span class="gp">... </span> <span class="s1">'"http://www.w3.org/TR/html4/strict.dtd"&gt;'</span><span class="p">)</span>
<span class="go">Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"</span>
</code></pre><p><span class="yiyi-st" id="yiyi-103">Parsing an element with a few attributes and a title:</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;img src="python-logo.png" alt="The Python logo"&gt;'</span><span class="p">)</span>
<span class="go">Start tag: img</span>
<span class="go"> attr: ('src', 'python-logo.png')</span>
<span class="go"> attr: ('alt', 'The Python logo')</span>
<span class="go">&gt;&gt;&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;h1&gt;Python&lt;/h1&gt;'</span><span class="p">)</span>
<span class="go">Start tag: h1</span>
<span class="go">Data : Python</span>
<span class="go">End tag : h1</span>
</code></pre><p><span class="yiyi-st" id="yiyi-104">The content of <code class="docutils literal"><span class="pre">script</span></code> and <code class="docutils literal"><span class="pre">style</span></code> elements is returned as is, without further parsing:</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;style type="text/css"&gt;#python { color: green }&lt;/style&gt;'</span><span class="p">)</span>
<span class="go">Start tag: style</span>
<span class="go"> attr: ('type', 'text/css')</span>
<span class="go">Data : #python { color: green }</span>
<span class="go">End tag : style</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;script type="text/javascript"&gt;'</span>
<span class="gp">... </span> <span class="s1">'alert("&lt;strong&gt;hello!&lt;/strong&gt;");&lt;/script&gt;'</span><span class="p">)</span>
<span class="go">Start tag: script</span>
<span class="go"> attr: ('type', 'text/javascript')</span>
<span class="go">Data : alert("&lt;strong&gt;hello!&lt;/strong&gt;");</span>
<span class="go">End tag : script</span>
</code></pre><p><span class="yiyi-st" id="yiyi-105">Parsing comments:</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;!-- a comment --&gt;'</span>
<span class="gp">... </span> <span class="s1">'&lt;!--[if IE 9]&gt;IE-specific content&lt;![endif]--&gt;'</span><span class="p">)</span>
<span class="go">Comment : a comment</span>
<span class="go">Comment : [if IE 9]&gt;IE-specific content&lt;![endif]</span>
</code></pre><p><span class="yiyi-st" id="yiyi-106">Parsing named and numeric character references and converting them to the correct character (note: these 3 references are all equivalent to <code class="docutils literal"><span class="pre">'&gt;'</span></code>):</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&amp;gt;&amp;#62;&amp;#x3E;'</span><span class="p">)</span>
<span class="go">Named ent: &gt;</span>
<span class="go">Num ent : &gt;</span>
<span class="go">Num ent : &gt;</span>
</code></pre><p><span class="yiyi-st" id="yiyi-107">Feeding incomplete chunks to <a class="reference internal" href="#html.parser.HTMLParser.feed" title="html.parser.HTMLParser.feed"><code class="xref py py-meth docutils literal"><span class="pre">feed()</span></code></a> works, but <a class="reference internal" href="#html.parser.HTMLParser.handle_data" title="html.parser.HTMLParser.handle_data"><code class="xref py py-meth docutils literal"><span class="pre">handle_data()</span></code></a> might be called more than once (unless <em>convert_charrefs</em> is set to <code class="docutils literal"><span class="pre">True</span></code>):</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">'&lt;sp'</span><span class="p">,</span> <span class="s1">'an&gt;buff'</span><span class="p">,</span> <span class="s1">'ered '</span><span class="p">,</span> <span class="s1">'text&lt;/s'</span><span class="p">,</span> <span class="s1">'pan&gt;'</span><span class="p">]:</span>
<span class="gp">... </span>    <span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
<span class="gp">...</span>
<span class="go">Start tag: span</span>
<span class="go">Data : buff</span>
<span class="go">Data : ered</span>
<span class="go">Data : text</span>
<span class="go">End tag : span</span>
</code></pre><p><span class="yiyi-st" id="yiyi-108">Parsing invalid HTML (e.g. </span><span class="yiyi-st" id="yiyi-109">unquoted attributes) also works:</span></p><pre><code class="language-python"><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&lt;p&gt;&lt;a class=link href=#main&gt;tag soup&lt;/p &gt;&lt;/a&gt;'</span><span class="p">)</span>
<span class="go">Start tag: p</span>
<span class="go">Start tag: a</span>
<span class="go"> attr: ('class', 'link')</span>
<span class="go"> attr: ('href', '#main')</span>
<span class="go">Data : tag soup</span>
<span class="go">End tag : p</span>
<span class="go">End tag : a</span>
</code></pre></div></div></div>
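<p>The handlers described above can also collect results instead of printing them. The following is a minimal sketch under stated assumptions, not part of the original documentation: the class name <code>LinkCollector</code> and its <code>links</code> and <code>text_parts</code> attributes are hypothetical, chosen only for illustration. It relies on the documented behavior that tag and attribute names arrive in lower case and that character references in text are converted when <em>convert_charrefs</em> is <code>True</code>.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values and text into lists."""

    def __init__(self):
        # convert_charrefs=True is the documented default; it is spelled out
        # here so that '&amp;' reaches handle_data() already converted to '&'.
        super().__init__(convert_charrefs=True)
        self.links = []       # href values of <a> start tags, in document order
        self.text_parts = []  # every piece of text passed to handle_data()

    def handle_starttag(self, tag, attrs):
        # tag and attribute names are lower-cased by the parser, so the
        # upper-case <A HREF=...> in the input below still matches.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


collector = LinkCollector()
collector.feed('<p>Visit <A HREF="https://www.cwi.nl/">CWI &amp; friends</A></p>')
collector.close()  # flush anything still buffered

text = "".join(collector.text_parts)
print(collector.links)  # ['https://www.cwi.nl/']
print(text)             # Visit CWI & friends
```

<p>Storing results on the instance, rather than printing inside the handlers, lets the calling code decide what to do with them once <code>feed()</code> and <code>close()</code> have returned.</p>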