mirror of
https://github.com/fofolee/uTools-Manuals.git
synced 2025-06-08 23:14:06 +08:00
<div class="body" role="main"><div class="section" id="module-html.parser"><h1><span class="yiyi-st" id="yiyi-10">20.2. <a class="reference internal" href="#module-html.parser" title="html.parser: A simple parser that can handle HTML and XHTML."><code class="xref py py-mod docutils literal"><span class="pre">html.parser</span></code></a> - Simple HTML and XHTML parser</span></h1><p><span class="yiyi-st" id="yiyi-11"><strong>Source code:</strong> <a class="reference external" href="https://hg.python.org/cpython/file/3.5/Lib/html/parser.py">Lib/html/parser.py</a></span></p><p><span class="yiyi-st" id="yiyi-12">This module defines a class <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.</span></p>
<dl class="class"><dt id="html.parser.HTMLParser"><span class="yiyi-st" id="yiyi-13"> <em class="property">class </em><code class="descclassname">html.parser.</code><code class="descname">HTMLParser</code><span class="sig-paren">(</span><em>*</em>, <em>convert_charrefs=True</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-14">Create a parser instance able to parse invalid markup.</span></p><p><span class="yiyi-st" id="yiyi-15">If <em>convert_charrefs</em> is <code class="docutils literal"><span class="pre">True</span></code> (the default), all character references (except the ones in <code class="docutils literal"><span class="pre">script</span></code>/<code class="docutils literal"><span class="pre">style</span></code> elements) are automatically converted to the corresponding Unicode characters.</span></p><p><span class="yiyi-st" id="yiyi-16">An <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered.</span><span class="yiyi-st" id="yiyi-17"> The user should subclass <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> and override its methods to implement the desired behavior.</span></p><p><span class="yiyi-st" id="yiyi-18">This parser does not check that end tags match start tags, and it does not call the end-tag handler for elements which are closed implicitly by closing an outer element.</span></p><div class="versionchanged"><p><span class="yiyi-st" id="yiyi-19"><span class="versionmodified">Changed in version 3.4:</span> The <em>convert_charrefs</em> keyword argument was added.</span></p></div><div class="versionchanged"><p><span class="yiyi-st" id="yiyi-20"><span class="versionmodified">Changed in version 3.5:</span> The default value for argument <em>convert_charrefs</em> is now <code class="docutils literal"><span class="pre">True</span></code>.</span></p></div></dd></dl>
<div class="section" id="example-html-parser-application"><h2><span class="yiyi-st" id="yiyi-21">20.2.1. </span><span class="yiyi-st" id="yiyi-22">Example HTML Parser Application</span></h2><p><span class="yiyi-st" id="yiyi-23">As a basic example, below is a simple HTML parser that uses the <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> class to print out start tags, end tags, and data as they are encountered:</span></p><pre><code class="language-python"><span></span><span class="kn">from</span> <span class="nn">html.parser</span> <span class="k">import</span> <span class="n">HTMLParser</span>

<span class="k">class</span> <span class="nc">MyHTMLParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">handle_starttag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">,</span> <span class="n">attrs</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Encountered a start tag:"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_endtag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Encountered an end tag :"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Encountered some data :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">MyHTMLParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<html><head><title>Test</title></head>'</span>
            <span class="s1">'<body><h1>Parse me!</h1></body></html>'</span><span class="p">)</span>
</code></pre><p><span class="yiyi-st" id="yiyi-24">The output will then be:</span></p><div class="highlight-none"><div class="highlight"><pre><span></span>Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
</pre></div></div></div><div class="section" id="htmlparser-methods"><h2><span class="yiyi-st" id="yiyi-25">20.2.2. </span><span class="yiyi-st" id="yiyi-26"><a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> Methods</span></h2><p><span class="yiyi-st" id="yiyi-27"><a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> instances have the following methods:</span></p>
<dl class="method"><dt id="html.parser.HTMLParser.feed"><span class="yiyi-st" id="yiyi-28"> <code class="descclassname">HTMLParser.</code><code class="descname">feed</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-29">Feed some text to the parser.</span><span class="yiyi-st" id="yiyi-30"> It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or <a class="reference internal" href="#html.parser.HTMLParser.close" title="html.parser.HTMLParser.close"><code class="xref py py-meth docutils literal"><span class="pre">close()</span></code></a> is called.</span><span class="yiyi-st" id="yiyi-31"> <em>data</em> must be <a class="reference internal" href="stdtypes.html#str" title="str"><code class="xref py py-class docutils literal"><span class="pre">str</span></code></a>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.close"><span class="yiyi-st" id="yiyi-32"> <code class="descclassname">HTMLParser.</code><code class="descname">close</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-33">Force processing of all buffered data as if it were followed by an end-of-file mark.</span><span class="yiyi-st" id="yiyi-34"> This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> base class method <a class="reference internal" href="#html.parser.HTMLParser.close" title="html.parser.HTMLParser.close"><code class="xref py py-meth docutils literal"><span class="pre">close()</span></code></a>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.reset"><span class="yiyi-st" id="yiyi-35"> <code class="descclassname">HTMLParser.</code><code class="descname">reset</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-36">Reset the instance.</span><span class="yiyi-st" id="yiyi-37"> Loses all unprocessed data.</span><span class="yiyi-st" id="yiyi-38"> This is called implicitly at instantiation time.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.getpos"><span class="yiyi-st" id="yiyi-39"> <code class="descclassname">HTMLParser.</code><code class="descname">getpos</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-40">Return the current line number and offset.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.get_starttag_text"><span class="yiyi-st" id="yiyi-41"> <code class="descclassname">HTMLParser.</code><code class="descname">get_starttag_text</code><span class="sig-paren">(</span><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-42">Return the text of the most recently opened start tag.</span><span class="yiyi-st" id="yiyi-43"> This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).</span><span class="yiyi-st" id="yiyi-44"></span></p></dd></dl>
<p><span class="yiyi-st" id="yiyi-45">The following methods are called when data or markup elements are encountered, and they are meant to be overridden in a subclass.</span><span class="yiyi-st" id="yiyi-46"> The base class implementations do nothing (except for <a class="reference internal" href="#html.parser.HTMLParser.handle_startendtag" title="html.parser.HTMLParser.handle_startendtag"><code class="xref py py-meth docutils literal"><span class="pre">handle_startendtag()</span></code></a>):</span></p>
<dl class="method"><dt id="html.parser.HTMLParser.handle_starttag"><span class="yiyi-st" id="yiyi-47"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_starttag</code><span class="sig-paren">(</span><em>tag</em>, <em>attrs</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-48">This method is called to handle the start of a tag (e.g. </span><span class="yiyi-st" id="yiyi-49"><code class="docutils literal"><span class="pre"><div</span> <span class="pre">id="main"></span></code>).</span></p><p><span class="yiyi-st" id="yiyi-50">The <em>tag</em> argument is the name of the tag converted to lower case.</span><span class="yiyi-st" id="yiyi-51"> The <em>attrs</em> argument is a list of <code class="docutils literal"><span class="pre">(name,</span> <span class="pre">value)</span></code> pairs containing the attributes found inside the tag’s <code class="docutils literal"><span class="pre"><></span></code> brackets. </span><span class="yiyi-st" id="yiyi-52">The <em>name</em> is converted to lower case, quotes in the <em>value</em> are removed, and character and entity references are replaced.</span></p><p><span class="yiyi-st" id="yiyi-53">For instance, for the tag <code class="docutils literal"><span class="pre"><A</span> <span class="pre">HREF="https://www.cwi.nl/"></span></code>, this method would be called as <code class="docutils literal"><span class="pre">handle_starttag('a',</span> <span class="pre">[('href',</span> <span class="pre">'https://www.cwi.nl/')])</span></code>.</span></p><p><span class="yiyi-st" id="yiyi-54">All entity references from <a class="reference internal" href="html.entities.html#module-html.entities" title="html.entities: Definitions of HTML general entities."><code class="xref py py-mod docutils literal"><span class="pre">html.entities</span></code></a> are replaced in the attribute values.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_endtag"><span class="yiyi-st" id="yiyi-55"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_endtag</code><span class="sig-paren">(</span><em>tag</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-56">This method is called to handle the end tag of an element (e.g. </span><span class="yiyi-st" id="yiyi-57"><code class="docutils literal"><span class="pre"></div></span></code>).</span></p><p><span class="yiyi-st" id="yiyi-58">The <em>tag</em> argument is the name of the tag converted to lower case.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_startendtag"><span class="yiyi-st" id="yiyi-59"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_startendtag</code><span class="sig-paren">(</span><em>tag</em>, <em>attrs</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-60">Similar to <a class="reference internal" href="#html.parser.HTMLParser.handle_starttag" title="html.parser.HTMLParser.handle_starttag"><code class="xref py py-meth docutils literal"><span class="pre">handle_starttag()</span></code></a>, but called when the parser encounters an XHTML-style empty tag (<code class="docutils literal"><span class="pre"><img</span> <span class="pre">...</span> <span class="pre">/></span></code>). </span><span class="yiyi-st" id="yiyi-61">This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls <a class="reference internal" href="#html.parser.HTMLParser.handle_starttag" title="html.parser.HTMLParser.handle_starttag"><code class="xref py py-meth docutils literal"><span class="pre">handle_starttag()</span></code></a> and <a class="reference internal" href="#html.parser.HTMLParser.handle_endtag" title="html.parser.HTMLParser.handle_endtag"><code class="xref py py-meth docutils literal"><span class="pre">handle_endtag()</span></code></a>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_data"><span class="yiyi-st" id="yiyi-62"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_data</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-63">This method is called to process arbitrary data (e.g. </span><span class="yiyi-st" id="yiyi-64">text nodes and the content of <code class="docutils literal"><span class="pre"><script>...</script></span></code> and <code class="docutils literal"><span class="pre"><style>...</style></span></code>).</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_entityref"><span class="yiyi-st" id="yiyi-65"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_entityref</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-66">This method is called to process a named character reference of the form <code class="docutils literal"><span class="pre">&name;</span></code> (e.g. </span><span class="yiyi-st" id="yiyi-67"><code class="docutils literal"><span class="pre">&gt;</span></code>), where <em>name</em> is a general entity reference (e.g. </span><span class="yiyi-st" id="yiyi-68"><code class="docutils literal"><span class="pre">'gt'</span></code>).</span><span class="yiyi-st" id="yiyi-69"> This method is never called if <em>convert_charrefs</em> is <code class="docutils literal"><span class="pre">True</span></code>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_charref"><span class="yiyi-st" id="yiyi-70"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_charref</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-71">This method is called to process decimal and hexadecimal numeric character references of the form <code class="docutils literal"><span class="pre">&#NNN;</span></code> and <code class="docutils literal"><span class="pre">&#xNNN;</span></code>.</span><span class="yiyi-st" id="yiyi-72"> For example, the decimal equivalent for <code class="docutils literal"><span class="pre">&gt;</span></code> is <code class="docutils literal"><span class="pre">&#62;</span></code>, whereas the hexadecimal is <code class="docutils literal"><span class="pre">&#x3E;</span></code>; in this case the method will receive <code class="docutils literal"><span class="pre">'62'</span></code> or <code class="docutils literal"><span class="pre">'x3E'</span></code>.</span><span class="yiyi-st" id="yiyi-73"> This method is never called if <em>convert_charrefs</em> is <code class="docutils literal"><span class="pre">True</span></code>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_comment"><span class="yiyi-st" id="yiyi-74"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_comment</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-75">This method is called when a comment is encountered (e.g. </span><span class="yiyi-st" id="yiyi-76"><code class="docutils literal"><span class="pre"><!--comment--></span></code>).</span></p><p><span class="yiyi-st" id="yiyi-77">For example, the comment <code class="docutils literal"><span class="pre"><!--</span> <span class="pre">comment</span> <span class="pre">--></span></code> will cause this method to be called with the argument <code class="docutils literal"><span class="pre">'</span> <span class="pre">comment</span> <span class="pre">'</span></code>.</span></p><p><span class="yiyi-st" id="yiyi-78">The content of Internet Explorer conditional comments (condcoms) will also be sent to this method, so, for <code class="docutils literal"><span class="pre"><!--[if</span> <span class="pre">IE</span> <span class="pre">9]>IE9-specific</span> <span class="pre">content<![endif]--></span></code>, this method will receive <code class="docutils literal"><span class="pre">'[if</span> <span class="pre">IE</span> <span class="pre">9]>IE9-specific</span> <span class="pre">content<![endif]'</span></code>.</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_decl"><span class="yiyi-st" id="yiyi-79"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_decl</code><span class="sig-paren">(</span><em>decl</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-80">This method is called to handle an HTML doctype declaration (e.g. </span><span class="yiyi-st" id="yiyi-81"><code class="docutils literal"><span class="pre"><!DOCTYPE</span> <span class="pre">html></span></code>).</span></p><p><span class="yiyi-st" id="yiyi-82">The <em>decl</em> parameter will be the entire contents of the declaration inside the <code class="docutils literal"><span class="pre"><!...></span></code> markup (e.g. </span><span class="yiyi-st" id="yiyi-83"><code class="docutils literal"><span class="pre">'DOCTYPE</span> <span class="pre">html'</span></code>).</span></p></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.handle_pi"><span class="yiyi-st" id="yiyi-84"> <code class="descclassname">HTMLParser.</code><code class="descname">handle_pi</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-85">This method is called when a processing instruction is encountered.</span><span class="yiyi-st" id="yiyi-86"> The <em>data</em> parameter will contain the entire processing instruction. </span><span class="yiyi-st" id="yiyi-87">For example, for the processing instruction <code class="docutils literal"><span class="pre"><?proc</span> <span class="pre">color='red'></span></code>, this method would be called as <code class="docutils literal"><span class="pre">handle_pi("proc</span> <span class="pre">color='red'")</span></code>. </span><span class="yiyi-st" id="yiyi-88">It is intended to be overridden by a derived class; the base class implementation does nothing.</span></p><div class="admonition note"><p class="first admonition-title"><span class="yiyi-st" id="yiyi-89">Note</span></p><p class="last"><span class="yiyi-st" id="yiyi-90">The <a class="reference internal" href="#html.parser.HTMLParser" title="html.parser.HTMLParser"><code class="xref py py-class docutils literal"><span class="pre">HTMLParser</span></code></a> class uses the SGML syntactic rules for processing instructions.</span><span class="yiyi-st" id="yiyi-91"> An XHTML processing instruction using the trailing <code class="docutils literal"><span class="pre">'?'</span></code></span><span class="yiyi-st" id="yiyi-92"> will cause the <code class="docutils literal"><span class="pre">'?'</span></code></span><span class="yiyi-st" id="yiyi-93"> to be included in <em>data</em>.</span></p></div></dd></dl>
<dl class="method"><dt id="html.parser.HTMLParser.unknown_decl"><span class="yiyi-st" id="yiyi-94"> <code class="descclassname">HTMLParser.</code><code class="descname">unknown_decl</code><span class="sig-paren">(</span><em>data</em><span class="sig-paren">)</span></span></dt><dd><p><span class="yiyi-st" id="yiyi-95">This method is called when an unrecognized declaration is read by the parser.</span></p><p><span class="yiyi-st" id="yiyi-96">The <em>data</em> parameter will be the entire contents of the declaration inside the <code class="docutils literal"><span class="pre"><![...]></span></code> markup.</span><span class="yiyi-st" id="yiyi-97"> It is sometimes useful to override this method in a derived class.</span><span class="yiyi-st" id="yiyi-98"> The base class implementation does nothing.</span></p></dd></dl></div>
<div class="section" id="examples"><h2><span class="yiyi-st" id="yiyi-99">20.2.3. </span><span class="yiyi-st" id="yiyi-100">Examples</span></h2><p><span class="yiyi-st" id="yiyi-101">The following class implements a parser that will be used to illustrate more examples:</span></p><pre><code class="language-python"><span></span><span class="kn">from</span> <span class="nn">html.parser</span> <span class="k">import</span> <span class="n">HTMLParser</span>
<span class="kn">from</span> <span class="nn">html.entities</span> <span class="k">import</span> <span class="n">name2codepoint</span>

<span class="k">class</span> <span class="nc">MyHTMLParser</span><span class="p">(</span><span class="n">HTMLParser</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">handle_starttag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">,</span> <span class="n">attrs</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Start tag:"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">attrs</span><span class="p">:</span>
            <span class="nb">print</span><span class="p">(</span><span class="s2">" attr:"</span><span class="p">,</span> <span class="n">attr</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_endtag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tag</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"End tag :"</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Data :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_comment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Comment :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_entityref</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="n">name2codepoint</span><span class="p">[</span><span class="n">name</span><span class="p">])</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Named ent:"</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_charref</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">name</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'x'</span><span class="p">):</span>
            <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="mi">16</span><span class="p">))</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Num ent :"</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">handle_decl</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"Decl :"</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">MyHTMLParser</span><span class="p">()</span>
</code></pre><p><span class="yiyi-st" id="yiyi-102">Parsing a doctype:</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '</span>
<span class="gp">... </span>            <span class="s1">'"http://www.w3.org/TR/html4/strict.dtd">'</span><span class="p">)</span>
<span class="go">Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"</span>
</code></pre><p><span class="yiyi-st" id="yiyi-103">Parsing an element with a few attributes and a title:</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<img src="python-logo.png" alt="The Python logo">'</span><span class="p">)</span>
<span class="go">Start tag: img</span>
<span class="go"> attr: ('src', 'python-logo.png')</span>
<span class="go"> attr: ('alt', 'The Python logo')</span>
<span class="go">>>></span>
<span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<h1>Python</h1>'</span><span class="p">)</span>
<span class="go">Start tag: h1</span>
<span class="go">Data : Python</span>
<span class="go">End tag : h1</span>
</code></pre><p><span class="yiyi-st" id="yiyi-104">The content of <code class="docutils literal"><span class="pre">script</span></code> and <code class="docutils literal"><span class="pre">style</span></code> elements is returned as is, without further parsing:</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<style type="text/css">#python { color: green }</style>'</span><span class="p">)</span>
<span class="go">Start tag: style</span>
<span class="go"> attr: ('type', 'text/css')</span>
<span class="go">Data : #python { color: green }</span>
<span class="go">End tag : style</span>

<span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<script type="text/javascript">'</span>
<span class="gp">... </span>            <span class="s1">'alert("<strong>hello!</strong>");</script>'</span><span class="p">)</span>
<span class="go">Start tag: script</span>
<span class="go"> attr: ('type', 'text/javascript')</span>
<span class="go">Data : alert("<strong>hello!</strong>");</span>
<span class="go">End tag : script</span>
</code></pre><p><span class="yiyi-st" id="yiyi-105">Parsing comments:</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<!-- a comment -->'</span>
<span class="gp">... </span>            <span class="s1">'<!--[if IE 9]>IE-specific content<![endif]-->'</span><span class="p">)</span>
<span class="go">Comment : a comment</span>
<span class="go">Comment : [if IE 9]>IE-specific content<![endif]</span>
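</code></pre><p><span class="yiyi-st" id="yiyi-106">Parsing named and numeric character references and converting them to the correct char (note: these 3 references are all equivalent to <code class="docutils literal"><span class="pre">'>'</span></code>). The handlers that print this output are only called when the parser was created with <em>convert_charrefs</em> set to <code class="docutils literal"><span class="pre">False</span></code>; with the default of <code class="docutils literal"><span class="pre">True</span></code> the references are converted and passed to <code class="docutils literal"><span class="pre">handle_data()</span></code> instead:</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'&gt;&#62;&#x3E;'</span><span class="p">)</span>
<span class="go">Named ent: ></span>
<span class="go">Num ent : ></span>
<span class="go">Num ent : ></span>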
</code></pre><p><span class="yiyi-st" id="yiyi-107">Feeding incomplete chunks to <a class="reference internal" href="#html.parser.HTMLParser.feed" title="html.parser.HTMLParser.feed"><code class="xref py py-meth docutils literal"><span class="pre">feed()</span></code></a> works, but <a class="reference internal" href="#html.parser.HTMLParser.handle_data" title="html.parser.HTMLParser.handle_data"><code class="xref py py-meth docutils literal"><span class="pre">handle_data()</span></code></a> might be called more than once (unless <em>convert_charrefs</em> is set to <code class="docutils literal"><span class="pre">True</span></code>):</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">'<sp'</span><span class="p">,</span> <span class="s1">'an>buff'</span><span class="p">,</span> <span class="s1">'ered '</span><span class="p">,</span> <span class="s1">'text</s'</span><span class="p">,</span> <span class="s1">'pan>'</span><span class="p">]:</span>
<span class="gp">... </span>    <span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
<span class="gp">...</span>
<span class="go">Start tag: span</span>
<span class="go">Data : buff</span>
<span class="go">Data : ered</span>
<span class="go">Data : text</span>
<span class="go">End tag : span</span>
</code></pre><p><span class="yiyi-st" id="yiyi-108">Parsing invalid HTML (e.g. </span><span class="yiyi-st" id="yiyi-109">unquoted attributes) also works:</span></p><pre><code class="language-python"><span></span><span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span><span class="s1">'<p><a class=link href=#main>tag soup</p ></a>'</span><span class="p">)</span>
<span class="go">Start tag: p</span>
<span class="go">Start tag: a</span>
<span class="go"> attr: ('class', 'link')</span>
<span class="go"> attr: ('href', '#main')</span>
<span class="go">Data : tag soup</span>
<span class="go">End tag : p</span>
<span class="go">End tag : a</span>
</code></pre></div></div></div>
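<p>The effect of <em>convert_charrefs</em> on which handler receives a character reference can be sketched as follows. This is a minimal illustration rather than part of the module: the <code>EventRecorder</code> class and the <code>record()</code> helper are hypothetical names introduced here, and they collect handler calls in a list instead of printing them so that the two modes are easy to compare.</p>

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records handler calls instead of printing them."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lower-cased
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # only reached when convert_charrefs is False
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # only reached when convert_charrefs is False
        self.events.append(("charref", name))


def record(markup, *, convert_charrefs=True):
    """Parse markup in one go and return the list of recorded events."""
    parser = EventRecorder(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.events


if __name__ == "__main__":
    for flag in (True, False):
        print(flag, record("<p>a &gt; b</p>", convert_charrefs=flag))
```

<p>With the default setting the reference is folded into the text passed to <code>handle_data()</code>; with <code>convert_charrefs=False</code> the text is split around the reference and <code>handle_entityref()</code> receives <code>'gt'</code>.</p>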