mirror of
https://github.com/fofolee/uTools-Manuals.git
synced 2025-06-08 23:14:06 +08:00
2006 lines
205 KiB
HTML
<div class="body">
|
||
<div class="section" id="id4">
|
||
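<h1>Quick Start</h1>
<p>The following HTML document is used as an example throughout this guide. It is an excerpt from <em>Alice's Adventures in Wonderland</em> (referred to below as the <em>Alice</em> document):</p>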
<pre><code class="language-python"><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
|
||
<span class="s"><html><head><title>The Dormouse's story</title></head></span>
|
||
<span class="s"><body></span>
|
||
<span class="s"><p class="title"><b>The Dormouse's story</b></p></span>
|
||
|
||
<span class="s"><p class="story">Once upon a time there were three little sisters; and their names were</span>
|
||
<span class="s"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,</span>
|
||
<span class="s"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span>
|
||
<span class="s"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span>
|
||
<span class="s">and they lived at the bottom of a well.</p></span>
|
||
|
||
<span class="s"><p class="story">...</p></span>
|
||
<span class="s">"""</span>
|
||
</code></pre>
|
||
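<p>Parsing this document with Beautiful Soup produces a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, which represents the document as a nested data structure and can print it with standard indentation:</p>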
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <html></span>
|
||
<span class="c"># <head></span>
|
||
<span class="c"># <title></span>
|
||
<span class="c"># The Dormouse's story</span>
|
||
<span class="c"># </title></span>
|
||
<span class="c"># </head></span>
|
||
<span class="c"># <body></span>
|
||
<span class="c"># <p class="title"></span>
|
||
<span class="c"># <b></span>
|
||
<span class="c"># The Dormouse's story</span>
|
||
<span class="c"># </b></span>
|
||
<span class="c"># </p></span>
|
||
<span class="c"># <p class="story"></span>
|
||
<span class="c"># Once upon a time there were three little sisters; and their names were</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1"></span>
|
||
<span class="c"># Elsie</span>
|
||
<span class="c"># </a></span>
|
||
<span class="c"># ,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2"></span>
|
||
<span class="c"># Lacie</span>
|
||
<span class="c"># </a></span>
|
||
<span class="c"># and</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3"></span>
|
||
<span class="c"># Tillie</span>
|
||
<span class="c"># </a></span>
|
||
<span class="c"># ; and they lived at the bottom of a well.</span>
|
||
<span class="c"># </p></span>
|
||
<span class="c"># <p class="story"></span>
|
||
<span class="c"># ...</span>
|
||
<span class="c"># </p></span>
|
||
<span class="c"># </body></span>
|
||
<span class="c"># </html></span>
|
||
</code></pre>
|
||
<p>Here are some simple ways to navigate that data structure:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">title</span>
|
||
<span class="c"># <title>The Dormouse's story</title></span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">name</span>
|
||
<span class="c"># u'title'</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
|
||
<span class="c"># u'The Dormouse's story'</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span>
|
||
<span class="c"># u'head'</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">p</span>
|
||
<span class="c"># <p class="title"><b>The Dormouse's story</b></p></span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="c"># ['title']</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></span>
|
||
</code></pre>
|
||
<p>A common task is extracting the URLs of all <a> tags in the document:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">):</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">))</span>
|
||
<span class="c"># http://example.com/elsie</span>
|
||
<span class="c"># http://example.com/lacie</span>
|
||
<span class="c"># http://example.com/tillie</span>
|
||
</code></pre>
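<p>The same extraction can also collect the URLs into a list. The following is a minimal sketch rather than part of the original guide: the two-link <tt class="docutils literal">fragment</tt> string is made up for illustration, and the parser is passed explicitly as <tt class="docutils literal">"html.parser"</tt> so that the result does not depend on which parsers happen to be installed.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link deliberately has no href attribute.
fragment = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> and '
    '<a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(fragment, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)  # ['http://example.com/elsie']

# Tag.get() returns None instead of raising KeyError for a missing attribute.
print(soup.find(id="link2").get("href"))  # None
```

<p>Filtering with <tt class="docutils literal">href=True</tt> avoids a <tt class="docutils literal">KeyError</tt> on anchors that have no target.</p>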
|
||
<p>Another common task is extracting all the text from the document:</p>
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">())</span>
|
||
<span class="c"># The Dormouse's story</span>
|
||
<span class="c">#</span>
|
||
<span class="c"># The Dormouse's story</span>
|
||
<span class="c">#</span>
|
||
<span class="c"># Once upon a time there were three little sisters; and their names were</span>
|
||
<span class="c"># Elsie,</span>
|
||
<span class="c"># Lacie and</span>
|
||
<span class="c"># Tillie;</span>
|
||
<span class="c"># and they lived at the bottom of a well.</span>
|
||
<span class="c">#</span>
|
||
<span class="c"># ...</span>
|
||
</code></pre>
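<p>As the output above shows, <tt class="docutils literal">get_text()</tt> keeps the whitespace of the original markup. A small sketch of the <tt class="docutils literal">separator</tt> and <tt class="docutils literal">strip</tt> arguments follows; the one-line markup is an illustrative assumption, not part of the <em>Alice</em> document.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Elsie, <b>Lacie</b> and Tillie  </p>", "html.parser")

# Default: every string is concatenated exactly as it appears in the markup.
raw = soup.get_text()
# separator joins the pieces; strip=True trims whitespace around each piece.
clean = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(clean)  # Elsie, Lacie and Tillie
```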
|
||
<p>Does this look like what you need? If so, read on.</p>
</div>
|
||
<div class="section" id="id5">
|
||
<h1>Installing Beautiful Soup</h1>
<p>If you are using a recent version of Debian or Ubuntu, you can install Beautiful Soup with the system package manager (the Python 3 build is packaged as <tt class="docutils literal"><span class="pre">python3-bs4</span></tt>):</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-bs4</span></tt></p>
<p>Beautiful Soup 4 is published through PyPI, so if you cannot install it with the system package manager, you can install it with <tt class="docutils literal"><span class="pre">easy_install</span></tt> or <tt class="docutils literal"><span class="pre">pip</span></tt>. The package name is <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt>, and the same package works on Python 2 and Python 3.</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></tt></p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">beautifulsoup4</span></tt></p>
<p>(PyPI also hosts a package named <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>, but that is probably not what you want. It is the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a> release. Many projects still use BS3, so the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> package remains available, but new code should install <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt>.)</p>
<p>If you have neither <tt class="docutils literal"><span class="pre">easy_install</span></tt> nor <tt class="docutils literal"><span class="pre">pip</span></tt> installed, you can <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/download/4.x/">download the Beautiful Soup 4 source</a> and install it with setup.py:</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">python</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
<p>If none of these methods work, the Beautiful Soup license allows you to bundle the BS4 code with your project, so it can be used without being installed.</p>
<p>The author develops Beautiful Soup under Python 2.7 and Python 3.2. In principle, it should work with all current Python versions.</p>
<div class="section" id="id8">
|
||
<h2>Problems after installation</h2>
<p>Beautiful Soup is packaged as Python 2 code. When you install it under Python 3, the code is automatically converted to Python 3. If the package is never actually installed (for example, the files were only copied into place), the conversion does not happen.</p>
<p>If the code raises an <tt class="docutils literal"><span class="pre">ImportError</span></tt> with the message “No module named HTMLParser”, you are running the Python 2 version of the code under Python 3.</p>
<p>If the code raises an <tt class="docutils literal"><span class="pre">ImportError</span></tt> with the message “No module named html.parser”, you are running the Python 3 version of the code under Python 2.</p>
<p>In both cases, the best fix is to reinstall Beautiful Soup 4.</p>
<p>If you get a <tt class="docutils literal"><span class="pre">SyntaxError</span></tt> “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', the BS4 code needs to be converted from Python 2 to Python 3. You can do this by reinstalling BS4:</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">python3</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
<p>or by running Python's code conversion script on the bs4 directory:</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">2to3-3.2</span> <span class="pre">-w</span> <span class="pre">bs4</span></tt></p>
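<p>Before reinstalling, it can help to confirm which interpreter and which copy of the package are in play. The sketch below is only a diagnostic aid and not part of Beautiful Soup itself; the helper name <tt class="docutils literal">describe_bs4</tt> is hypothetical.</p>

```python
import importlib.util
import sys


def describe_bs4():
    """Return a one-line report on the bs4 package seen by this interpreter."""
    version = "Python %d.%d" % sys.version_info[:2]
    if importlib.util.find_spec("bs4") is None:
        return "bs4 is not installed for " + version
    try:
        import bs4
    except (ImportError, SyntaxError) as exc:
        # Typical symptom of code written for the other major Python version.
        return "bs4 is present but failed to import under %s: %s" % (version, exc)
    return "bs4 %s imported under %s" % (bs4.__version__, version)


report = describe_bs4()
print(report)
```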
|
||
</div>
|
||
<div class="section" id="id9">
|
||
<h2>Installing a parser</h2>
<p>Beautiful Soup supports the HTML parser in Python's standard library, and it also supports several third-party parsers. One of them is <a class="reference external" href="http://lxml.de/">lxml</a>. Depending on your setup, you can install lxml with one of these commands:</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-lxml</span></tt></p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">lxml</span></tt></p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">lxml</span></tt></p>
<p>Another option is the pure-Python <a class="reference external" href="http://code.google.com/p/html5lib/">html5lib</a>, which parses HTML the way a web browser does. Depending on your setup, you can install html5lib with one of these commands:</p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-html5lib</span></tt></p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">html5lib</span></tt></p>
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">html5lib</span></tt></p>
<p>The following table summarizes the main parsers and their advantages and disadvantages:</p>
<table border="1" class="docutils">
<colgroup>
<col width="22%">
<col width="26%">
<col width="26%">
<col width="26%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Parser</th>
<th class="head">Typical usage</th>
<th class="head">Advantages</th>
<th class="head">Disadvantages</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>Python standard library</td>
<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html.parser")</span></tt></td>
<td><ul class="first last simple">
<li>Built into Python</li>
<li>Decent speed</li>
<li>Lenient with malformed markup</li>
</ul>
</td>
<td><ul class="first last simple">
<li>Not very lenient in Python versions before 2.7.3 (Python 2) or 3.2.2 (Python 3)</li>
</ul>
</td>
</tr>
<tr class="row-odd"><td>lxml HTML parser</td>
<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"lxml")</span></tt></td>
<td><ul class="first last simple">
<li>Fast</li>
<li>Lenient with malformed markup</li>
</ul>
</td>
<td><ul class="first last simple">
<li>Requires an external C library</li>
</ul>
</td>
</tr>
<tr class="row-even"><td>lxml XML parser</td>
<td><p class="first"><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">["lxml",</span> <span class="pre">"xml"])</span></tt></p>
<p class="last"><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"xml")</span></tt></p>
</td>
<td><ul class="first last simple">
<li>Fast</li>
<li>The only XML parser supported</li>
</ul>
</td>
<td><ul class="first last simple">
<li>Requires an external C library</li>
</ul>
</td>
</tr>
<tr class="row-odd"><td>html5lib</td>
<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html5lib")</span></tt></td>
<td><ul class="first last simple">
<li>The most lenient parser</li>
<li>Parses pages the same way a web browser does</li>
<li>Produces valid HTML5</li>
</ul>
</td>
<td><ul class="first last simple">
<li>Slow</li>
<li>Requires an external Python package (pure Python, no C extension)</li>
</ul>
</td>
</tr>
</tbody>
</table>
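<p>lxml is the recommended parser because it is faster. If you are using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into those Python versions is not reliable enough.</p>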
<p>Note: if a document is not well-formed HTML or XML, different parsers may build different trees from it. See <a class="reference internal" href="#id49">Differences between parsers</a> for details.</p>
|
||
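<p>One way to act on this recommendation is to prefer lxml when it is importable and fall back to the standard-library parser otherwise. The sketch below assumes that policy; the helper name <tt class="docutils literal">pick_parser</tt> is hypothetical and not part of Beautiful Soup. Because the two parsers may repair invalid markup differently, the printed tree can vary with the parser chosen.</p>

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Prefer lxml when it is importable, otherwise use the standard library."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
# Invalid markup: the closing </p> has no matching opening tag.
soup = BeautifulSoup("<a></p>", parser_name)
print(parser_name, "->", soup)
```

<p>Naming the parser explicitly keeps results reproducible across machines that have different parsers installed.</p>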
</div>
|
||
</div>
|
||
<div class="section" id="id10">
|
||
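<h1>How to use</h1>
<p>To parse a document, pass it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor. You can pass in either a string or an open file handle.</p>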
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
|
||
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s">"index.html"</span><span class="p">))</span>
|
||
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<html>data</html>"</span><span class="p">)</span>
|
||
</code></pre>
|
||
<p>First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:</p>
<div class="highlight-python"><pre>BeautifulSoup("Sacr&eacute; bleu!")
|
||
<html><head></head><body>Sacré bleu!</body></html></pre>
|
||
</div>
|
||
<p>Beautiful Soup then parses the document with the most suitable parser available. If you specify a parser explicitly, Beautiful Soup uses that one instead (see <a class="reference internal" href="#xml">Parsing XML</a>).</p>
|
||
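<p>The two input forms can be compared side by side. This is a sketch under a few assumptions: the file is a temporary one created only for the demonstration, the file handle is closed with a <tt class="docutils literal">with</tt> block, and the parser is named explicitly.</p>

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><p>Sacr&eacute; bleu!</p></html>"

# Write a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(markup)
    # Pass an open file handle; the document is read while the file is open.
    with open(path, encoding="utf-8") as fh:
        from_file = BeautifulSoup(fh, "html.parser")

# Pass the same markup as a string.
from_string = BeautifulSoup(markup, "html.parser")

# In both cases the &eacute; entity has become the Unicode character é.
print(from_file.p.string)
print(from_string.p.string)
```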
</div>
|
||
<div class="section" id="id11">
|
||
<h1>Kinds of objects</h1>
<p>Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. All of these objects fall into four kinds: <tt class="docutils literal"><span class="pre">Tag</span></tt>, <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>, and <tt class="docutils literal"><span class="pre">Comment</span></tt>.</p>
<div class="section" id="tag">
|
||
<h2>Tag</h2>
|
||
<p>A <tt class="docutils literal"><span class="pre">Tag</span></tt> object corresponds to an XML or HTML tag in the original document:</p>
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<b class="boldest">Extremely bold</b>'</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
|
||
<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
|
||
<span class="c"># <class 'bs4.element.Tag'></span>
|
||
</code></pre>
|
||
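<p>Tags have many attributes and methods, which are covered in detail in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>. For now, the most important features of a tag are its name and its attributes.</p>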
<div class="section" id="name">
|
||
<h3>Name</h3>
|
||
<p>Every tag has a name, accessible as <tt class="docutils literal"><span class="pre">.name</span></tt>:</p>
<pre><code class="language-python"><span class="n">tag</span><span class="o">.</span><span class="n">name</span>
|
||
<span class="c"># u'b'</span>
|
||
</code></pre>
|
||
<p>If you change a tag's name, the change is reflected in any HTML markup generated from the current Beautiful Soup object:</p>
<pre><code class="language-python"><span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"blockquote"</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <blockquote class="boldest">Extremely bold</blockquote></span>
|
||
</code></pre>
|
||
</div>
|
||
<div class="section" id="attributes">
|
||
<h3>Attributes</h3>
|
||
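<p>A tag may have any number of attributes. The tag <tt class="docutils literal"><span class="pre"><b</span> <span class="pre">class="boldest"></span></tt> has an attribute “class” whose value is “boldest”. You can access a tag's attributes by treating the tag like a dictionary:</p>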
<pre><code class="language-python"><span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="c"># ['boldest']</span>
|
||
</code></pre>
|
||
<p>You can also access the whole attribute dictionary directly as <tt class="docutils literal"><span class="pre">.attrs</span></tt>:</p>
<pre><code class="language-python"><span class="n">tag</span><span class="o">.</span><span class="n">attrs</span>
|
||
<span class="c"># {'class': ['boldest']}</span>
|
||
</code></pre>
|
||
<p>You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:</p>
<pre><code class="language-python"><span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'verybold'</span>
|
||
<span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <blockquote class="verybold" id="1">Extremely bold</blockquote></span>
|
||
|
||
<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <blockquote>Extremely bold</blockquote></span>
|
||
|
||
<span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="c"># KeyError: 'class'</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'class'</span><span class="p">))</span>
|
||
<span class="c"># None</span>
|
||
</code></pre>
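<p>The dictionary-style operations above can be condensed into one short run. This sketch uses an <tt class="docutils literal">id</tt> attribute instead of <tt class="docutils literal">class</tt> so that every value is a plain string; the attribute names and values are illustrative assumptions.</p>

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser").b

tag["title"] = "emphasis"   # add a new attribute
tag["id"] = "verybold"      # modify an existing attribute
del tag["title"]            # delete an attribute

print(tag.attrs)            # {'id': 'verybold'}
print(tag.get("title"))     # None: get() does not raise for a missing key
print(tag.has_attr("id"))   # True
print(tag)                  # <b id="verybold">Extremely bold</b>
```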
|
||
<div class="section" id="id12">
|
||
<h4>Multi-valued attributes</h4>
<p>HTML 4 defines a few attributes that can have multiple values. HTML5 removes some of them but defines several more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include <tt class="docutils literal"><span class="pre">rel</span></tt>, <tt class="docutils literal"><span class="pre">rev</span></tt>, <tt class="docutils literal"><span class="pre">accept-charset</span></tt>, <tt class="docutils literal"><span class="pre">headers</span></tt>, and <tt class="docutils literal"><span class="pre">accesskey</span></tt>. Beautiful Soup returns the value of a multi-valued attribute as a list:</p>
<pre><code class="language-python"><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<p class="body strikeout"></p>'</span><span class="p">)</span>
|
||
<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="c"># ["body", "strikeout"]</span>
|
||
|
||
<span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<p class="body"></p>'</span><span class="p">)</span>
|
||
<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="c"># ["body"]</span>
|
||
</code></pre>
|
||
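<p>If an attribute looks like it has more than one value, but no version of the HTML standard defines it as multi-valued, Beautiful Soup leaves it alone and returns it as a string:</p>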
<pre><code class="language-python"><span class="n">id_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<p id="my id"></p>'</span><span class="p">)</span>
|
||
<span class="n">id_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
|
||
<span class="c"># 'my id'</span>
|
||
</code></pre>
|
||
<p>When a tag is turned back into a string, the values of a multi-valued attribute are joined into a single value:</p>
<pre><code class="language-python"><span class="n">rel_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<p>Back to the <a rel="index">homepage</a></p>'</span><span class="p">)</span>
|
||
<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'rel'</span><span class="p">]</span>
|
||
<span class="c"># ['index']</span>
|
||
<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'rel'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">'index'</span><span class="p">,</span> <span class="s">'contents'</span><span class="p">]</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">rel_soup</span><span class="o">.</span><span class="n">p</span><span class="p">)</span>
|
||
<span class="c"># <p>Back to the <a rel="index contents">homepage</a></p></span>
|
||
</code></pre>
|
||
<p>If the document is parsed as XML, there are no multi-valued attributes:</p>
<pre><code class="language-python"><span class="n">xml_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<p class="body strikeout"></p>'</span><span class="p">,</span> <span class="s">'xml'</span><span class="p">)</span>
|
||
<span class="n">xml_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="c"># u'body strikeout'</span>
|
||
</code></pre>
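<p>The list and string cases can be checked together. The sketch below assumes an HTML parse with the standard-library parser; the markup is made up for illustration.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

print(p["class"])                 # ['body', 'strikeout']
print(p["id"])                    # my id
print("strikeout" in p["class"])  # True: test membership on the list

# Serializing the tag joins the class values back with a single space.
print(p)                          # <p class="body strikeout" id="my id"></p>
```

<p>Testing membership on the list avoids false matches that a substring test on the raw attribute text could produce.</p>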
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id13">
|
||
<h2>NavigableString</h2>
<p>A string corresponds to a piece of text within a tag. Beautiful Soup uses the <tt class="docutils literal"><span class="pre">NavigableString</span></tt> class to wrap these pieces of text:</p>
<pre><code class="language-python"><span class="n">tag</span><span class="o">.</span><span class="n">string</span>
|
||
<span class="c"># u'Extremely bold'</span>
|
||
<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
|
||
<span class="c"># <class 'bs4.element.NavigableString'></span>
|
||
</code></pre>
|
||
<p>A <tt class="docutils literal"><span class="pre">NavigableString</span></tt> behaves like a Python Unicode string, except that it also supports some of the features described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>. You can convert a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> to a plain Unicode string with <tt class="docutils literal"><span class="pre">unicode()</span></tt> (on Python 3, use <tt class="docutils literal"><span class="pre">str()</span></tt>):</p>
<pre><code class="language-python"><span class="n">unicode_string</span> <span class="o">=</span> <span class="nb">unicode</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
|
||
<span class="n">unicode_string</span>
|
||
<span class="c"># u'Extremely bold'</span>
|
||
<span class="nb">type</span><span class="p">(</span><span class="n">unicode_string</span><span class="p">)</span>
|
||
<span class="c"># <type 'unicode'></span>
|
||
</code></pre>
|
||
<p>The string inside a tag cannot be edited in place, but it can be replaced with another string using <a class="reference internal" href="#replace-with">replace_with()</a>:</p>
<pre><code class="language-python"><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="s">"No longer bold"</span><span class="p">)</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <blockquote>No longer bold</blockquote></span>
|
||
</code></pre>
|
||
<p>A <tt class="docutils literal"><span class="pre">NavigableString</span></tt> supports most of the features described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>, but not all of them. In particular, a string cannot contain anything (whereas a tag can contain strings or other tags), so strings do not support the <tt class="docutils literal"><span class="pre">.contents</span></tt> or <tt class="docutils literal"><span class="pre">.string</span></tt> attributes, or the <tt class="docutils literal"><span class="pre">find()</span></tt> method.</p>
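<p>If you want to use a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> outside of Beautiful Soup, call <tt class="docutils literal"><span class="pre">unicode()</span></tt> (<tt class="docutils literal"><span class="pre">str()</span></tt> on Python 3) on it to turn it into a normal Unicode string. Otherwise the string keeps a reference to the entire Beautiful Soup parse tree even after you are done with Beautiful Soup, which wastes memory.</p>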
|
||
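<p>The difference between the wrapped string and its plain copy can be seen directly. This sketch assumes Python 3, where the conversion function is <tt class="docutils literal">str()</tt>; the variable names are illustrative.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav = soup.b.string   # NavigableString, still attached to the parse tree
plain = str(nav)      # plain str, detached from the tree

print(type(nav).__name__)        # NavigableString
print(type(plain).__name__)      # str
print(nav.parent.name)           # b: the wrapped string knows its parent tag
print(hasattr(plain, "parent"))  # False: the plain copy holds no tree reference
```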
</div>
|
||
<div class="section" id="beautifulsoup">
|
||
<h2>BeautifulSoup</h2>
|
||
<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object represents the document as a whole. For most purposes, you can treat it as a <tt class="docutils literal"><span class="pre">Tag</span></tt> object: it supports most of the methods described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>.</p>
<p>Because the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is sometimes useful to look at its <tt class="docutils literal"><span class="pre">.name</span></tt>, however, so the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object has been given the special <tt class="docutils literal"><span class="pre">.name</span></tt> “[document]”:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">name</span>
|
||
<span class="c"># u'[document]'</span>
|
||
</code></pre>
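<p>A short check of how the document object relates to the tags inside it. The markup here is an illustrative assumption.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>bold</b>", "html.parser")

print(soup.name)              # [document]
# The top-level tag's parent is the BeautifulSoup object itself.
print(soup.b.parent is soup)  # True
# Tag-style navigation works on the document object too.
print(soup.b.string)          # bold
```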
|
||
</div>
|
||
<div class="section" id="id14">
|
||
<h2>Comments and other special strings</h2>
<p><tt class="docutils literal"><span class="pre">Tag</span></tt>, <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, and <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> cover almost everything you will see in an HTML or XML file, but a few special objects remain. The one you are most likely to need to deal with is the comment:</p>
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">"<b><!--Hey, buddy. Want to buy a used parser?--></b>"</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
|
||
<span class="nb">type</span><span class="p">(</span><span class="n">comment</span><span class="p">)</span>
|
||
<span class="c"># <class 'bs4.element.Comment'></span>
|
||
</code></pre>
|
||
<p>The <tt class="docutils literal"><span class="pre">Comment</span></tt> object is a special type of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>:</p>
<pre><code class="language-python"><span class="n">comment</span>
|
||
<span class="c"># u'Hey, buddy. Want to buy a used parser?'</span>
|
||
</code></pre>
|
||
<p>But when it appears as part of an HTML document, a <tt class="docutils literal"><span class="pre">Comment</span></tt> is output with special formatting:</p>
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <b></span>
|
||
<span class="c"># <!--Hey, buddy. Want to buy a used parser?--></span>
|
||
<span class="c"># </b></span>
|
||
</code></pre>
|
||
<p>Beautiful Soup also defines classes for anything else that might show up in an XML document: <tt class="docutils literal"><span class="pre">CData</span></tt>, <tt class="docutils literal"><span class="pre">ProcessingInstruction</span></tt>, <tt class="docutils literal"><span class="pre">Declaration</span></tt>, and <tt class="docutils literal"><span class="pre">Doctype</span></tt>. Like <tt class="docutils literal"><span class="pre">Comment</span></tt>, these classes are subclasses of <tt class="docutils literal"><span class="pre">NavigableString</span></tt> that add something extra to the string. Here is an example that replaces the comment with a CDATA block:</p>
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">CData</span>
|
||
<span class="n">cdata</span> <span class="o">=</span> <span class="n">CData</span><span class="p">(</span><span class="s">"A CDATA block"</span><span class="p">)</span>
|
||
<span class="n">comment</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">cdata</span><span class="p">)</span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <b></span>
|
||
<span class="c"># <![CDATA[A CDATA block]]></span>
|
||
<span class="c"># </b></span>
|
||
</code></pre>
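<p>Because <tt class="docutils literal">Comment</tt> is a subclass of <tt class="docutils literal">NavigableString</tt>, an <tt class="docutils literal">isinstance</tt> check is one way to tell comments apart from ordinary text. The sketch below uses it to find and remove comments. It assumes a Beautiful Soup release in which <tt class="docutils literal">find_all()</tt> accepts the <tt class="docutils literal">string</tt> argument (older releases call it <tt class="docutils literal">text</tt>), and the markup is made up for illustration.</p>

```python
from bs4 import BeautifulSoup, Comment

markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Select only the strings that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)  # [' hidden note ']

# extract() removes each comment from the tree.
for comment in comments:
    comment.extract()

print(soup)      # <p>Visible text</p>
```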
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id15">
|
||
<h1>Navigating the tree</h1>
<p>Here is the <em>Alice</em> document again:</p>
<pre><code class="language-python"><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
|
||
<span class="s"><html><head><title>The Dormouse's story</title></head></span>
|
||
<span class="s"><body></span>
<span class="s"><p class="title"><b>The Dormouse's story</b></p></span>
|
||
|
||
<span class="s"><p class="story">Once upon a time there were three little sisters; and their names were</span>
|
||
<span class="s"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,</span>
|
||
<span class="s"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span>
|
||
<span class="s"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span>
|
||
<span class="s">and they lived at the bottom of a well.</p></span>
|
||
|
||
<span class="s"><p class="story">...</p></span>
|
||
<span class="s">"""</span>
|
||
|
||
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
|
||
</code></pre>
|
||
<p>This example is used to show how to move from one part of a document to another.</p>
<div class="section" id="id16">
|
||
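<h2>Going down</h2>
<p>A tag may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.</p>
<p>Note that Beautiful Soup strings do not support any of these attributes, because a string cannot have children.</p>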
<div class="section" id="id17">
|
||
<h3>Navigating using tag names</h3>
<p>The simplest way to navigate the parse tree is to name the tag you want. If you want the <head> tag, just use <tt class="docutils literal"><span class="pre">soup.head</span></tt>:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">head</span>
|
||
<span class="c"># <head><title>The Dormouse's story</title></head></span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">title</span>
|
||
<span class="c"># <title>The Dormouse's story</title></span>
|
||
</code></pre>
|
||
<p>这是个获取tag的小窍门,可以在文档树的tag中多次调用这个方法.下面的代码可以获取<body>标签中的第一个<b>标签:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">body</span><span class="o">.</span><span class="n">b</span>
|
||
<span class="c"># <b>The Dormouse's story</b></span>
|
||
</code></pre>
|
||
<p>通过点取属性的方式只能获得当前名字的第一个tag:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
|
||
</code></pre>
|
||
<p>如果想要得到所有的<a>标签,或是通过名字得到比一个tag更多的内容的时候,就需要用到 <cite>Searching the tree</cite> 中描述的方法,比如: find_all()</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
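<p>As a rough sketch of the difference between attribute access and <tt class="docutils literal"><span class="pre">find_all()</span></tt>, the snippet below uses a small made-up document. The markup and variable names are assumptions for illustration, and <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> is chosen only so the example needs no extra parser library. It suggests that <tt class="docutils literal"><span class="pre">soup.a</span></tt> behaves like <tt class="docutils literal"><span class="pre">soup.find("a")</span></tt>, and that a tag name that does not occur yields <tt class="docutils literal"><span class="pre">None</span></tt>:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, not the "Alice" document from the text.
doc = '<div><a id="x">one</a><a id="y">two</a></div>'
soup = BeautifulSoup(doc, "html.parser")

first_link = soup.a              # attribute access: first <a> only
same_link = soup.find("a")       # the explicit equivalent
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # no <table> in this document

print(first_link["id"])                  # x
print(first_link is same_link)           # True
print([a["id"] for a in all_links])      # ['x', 'y']
print(missing)                           # None
```

<p>Because a missing tag gives <tt class="docutils literal"><span class="pre">None</span></tt>, chained access such as <tt class="docutils literal"><span class="pre">soup.table.tr</span></tt> would raise an <tt class="docutils literal"><span class="pre">AttributeError</span></tt> on such a document, so a check for <tt class="docutils literal"><span class="pre">None</span></tt> is usually worthwhile.</p>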
</div>
<div class="section" id="contents-children">
<h3>.contents and .children</h3>
<p>A tag's <tt class="docutils literal"><span class="pre">.contents</span></tt> attribute returns the tag's children as a list:</p>
<div class="highlight-python"><pre>head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]</pre>
</div>
<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself has children. In this case, the <html> tag is the child of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
<pre><code class="language-python"><span class="nb">len</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">)</span>
<span class="c"># 1</span>
<span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">name</span>
<span class="c"># 'html'</span>
</code></pre>
<p>A string does not have a <tt class="docutils literal"><span class="pre">.contents</span></tt> attribute, because it cannot contain anything:</p>
<pre><code class="language-python"><span class="n">text</span> <span class="o">=</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">text</span><span class="o">.</span><span class="n">contents</span>
<span class="c"># AttributeError: 'NavigableString' object has no attribute 'contents'</span>
</code></pre>
<p>Instead of getting the children as a list, you can iterate over them with the tag's <tt class="docutils literal"><span class="pre">.children</span></tt> generator:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">children</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
<span class="c"># The Dormouse's story</span>
</code></pre>
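<p>Since a tag's children can be a mix of tags and strings, code that walks <tt class="docutils literal"><span class="pre">.children</span></tt> often needs to tell the two apart. The following is a minimal sketch on an invented one-paragraph document; the markup, the <tt class="docutils literal"><span class="pre">kinds</span></tt> list and the choice of <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> are assumptions for illustration:</p>

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical paragraph with two strings and one nested tag.
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p = soup.p

kinds = []
for child in p.children:
    if isinstance(child, Tag):
        kinds.append(("tag", child.name))
    elif isinstance(child, NavigableString):
        kinds.append(("string", str(child)))

print(kinds)
# [('string', 'Hello '), ('tag', 'b'), ('string', ' world')]
print(len(p.contents))   # 3, the same children as a list
```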
</div>
<div class="section" id="descendants">
<h3>.descendants</h3>
<p>The <tt class="docutils literal"><span class="pre">.contents</span></tt> and <tt class="docutils literal"><span class="pre">.children</span></tt> attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:</p>
<pre><code class="language-python"><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
<span class="c"># [<title>The Dormouse's story</title>]</span>
</code></pre>
<p>But the <title> tag itself has a child: the string "The Dormouse's story". That string is therefore also a descendant of the <head> tag. The <tt class="docutils literal"><span class="pre">.descendants</span></tt> attribute lets you iterate recursively over all of a tag's descendants <a class="footnote-reference" href="#id86" id="id18">[5]</a>:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">head_tag</span><span class="o">.</span><span class="n">descendants</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
<span class="c"># <title>The Dormouse's story</title></span>
<span class="c"># The Dormouse's story</span>
</code></pre>
<p>In the example above, the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child string. The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object has only one direct child (the <html> tag), but it has many descendants:</p>
<pre><code class="language-python"><span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">children</span><span class="p">))</span>
<span class="c"># 1</span>
<span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">descendants</span><span class="p">))</span>
<span class="c"># 25</span>
</code></pre>
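<p>To make the child/descendant distinction concrete, here is a small sketch on an invented fragment (the markup and the <tt class="docutils literal"><span class="pre">labels</span></tt> helper list are assumptions, as is the use of <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>). It lists the descendants in the depth-first order in which <tt class="docutils literal"><span class="pre">.descendants</span></tt> visits them:</p>

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment: <div> has one child, but four descendants.
soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
div = soup.div

direct = list(div.children)      # only the <p> tag
nested = list(div.descendants)   # <p>, 'one ', <b>, 'two'

# Show tags by name and strings by their text.
labels = [node.name if isinstance(node, Tag) else str(node) for node in nested]

print(len(direct), len(nested))  # 1 4
print(labels)                    # ['p', 'one ', 'b', 'two']
```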
</div>
<div class="section" id="string">
<h3>.string</h3>
<p>If a tag has only one child, and that child is a <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, the child is available as <tt class="docutils literal"><span class="pre">.string</span></tt>:</p>
<pre><code class="language-python"><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span>
<span class="c"># "The Dormouse's story"</span>
</code></pre>
<p>If a tag's only child is another tag, and that tag has a <tt class="docutils literal"><span class="pre">.string</span></tt>, then the parent tag is considered to have the same <tt class="docutils literal"><span class="pre">.string</span></tt> as its child:</p>
<pre><code class="language-python"><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
<span class="c"># [<title>The Dormouse's story</title>]</span>

<span class="n">head_tag</span><span class="o">.</span><span class="n">string</span>
<span class="c"># "The Dormouse's story"</span>
</code></pre>
<p>If a tag contains more than one child, it is not clear which child <tt class="docutils literal"><span class="pre">.string</span></tt> should refer to, so <tt class="docutils literal"><span class="pre">.string</span></tt> is <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
<span class="c"># None</span>
</code></pre>
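<p>The three cases above can be sketched on one invented fragment. The markup and variable names are assumptions for illustration. The last line uses <tt class="docutils literal"><span class="pre">get_text()</span></tt>, which is not covered in this section, as one possible way to obtain the combined text when <tt class="docutils literal"><span class="pre">.string</span></tt> is <tt class="docutils literal"><span class="pre">None</span></tt>:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first <p> has a single chain of children,
# the second <p> has two children (a string and an <i> tag).
soup = BeautifulSoup(
    "<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser"
)
first_p, second_p = soup.find_all("p")

print(first_p.string)       # only  (reached through the single <b> child)
print(second_p.string)      # None  (two children, so .string is ambiguous)
print(second_p.get_text())  # one two
```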
</div>
<div class="section" id="strings-stripped-strings">
<h3>.strings and stripped_strings</h3>
<p>If a tag contains more than one string <a class="footnote-reference" href="#id83" id="id19">[2]</a>, you can loop over them with <tt class="docutils literal"><span class="pre">.strings</span></tt>:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">strings</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
<span class="c"># "The Dormouse's story"</span>
<span class="c"># '\n\n'</span>
<span class="c"># "The Dormouse's story"</span>
<span class="c"># '\n\n'</span>
<span class="c"># 'Once upon a time there were three little sisters; and their names were\n'</span>
<span class="c"># 'Elsie'</span>
<span class="c"># ',\n'</span>
<span class="c"># 'Lacie'</span>
<span class="c"># ' and\n'</span>
<span class="c"># 'Tillie'</span>
<span class="c"># ';\nand they lived at the bottom of a well.'</span>
<span class="c"># '\n\n'</span>
<span class="c"># '...'</span>
<span class="c"># '\n'</span>
</code></pre>
<p>These strings tend to contain a lot of extra whitespace and blank lines, which you can remove by using <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> instead:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
<span class="c"># "The Dormouse's story"</span>
<span class="c"># "The Dormouse's story"</span>
<span class="c"># 'Once upon a time there were three little sisters; and their names were'</span>
<span class="c"># 'Elsie'</span>
<span class="c"># ','</span>
<span class="c"># 'Lacie'</span>
<span class="c"># 'and'</span>
<span class="c"># 'Tillie'</span>
<span class="c"># ';\nand they lived at the bottom of a well.'</span>
<span class="c"># '...'</span>
</code></pre>
<p>Strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of each string is removed.</p>
</div>
</div>
<div class="section" id="id20">
<h2>Parent nodes</h2>
<p>Continuing with the document tree: every tag and every string has a parent, the tag that contains it.</p>
<div class="section" id="parent">
<h3>.parent</h3>
<p>You can access an element's parent with the <tt class="docutils literal"><span class="pre">.parent</span></tt> attribute. In the "Alice" document, the <head> tag is the parent of the <title> tag:</p>
<pre><code class="language-python"><span class="n">title_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">title</span>
<span class="n">title_tag</span>
<span class="c"># <title>The Dormouse's story</title></span>
<span class="n">title_tag</span><span class="o">.</span><span class="n">parent</span>
<span class="c"># <head><title>The Dormouse's story</title></head></span>
</code></pre>
<p>The title string itself has a parent: the <title> tag that contains it:</p>
<pre><code class="language-python"><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">parent</span>
<span class="c"># <title>The Dormouse's story</title></span>
</code></pre>
<p>The parent of a top-level tag such as <html> is the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself:</p>
<pre><code class="language-python"><span class="n">html_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">html</span>
<span class="nb">type</span><span class="p">(</span><span class="n">html_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
<span class="c"># <class 'bs4.BeautifulSoup'></span>
</code></pre>
<p>The <tt class="docutils literal"><span class="pre">.parent</span></tt> of a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object is None:</p>
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
<span class="c"># None</span>
</code></pre>
</div>
<div class="section" id="parents">
<h3>.parents</h3>
<p>You can iterate over all of an element's ancestors with <tt class="docutils literal"><span class="pre">.parents</span></tt>. This example uses <tt class="docutils literal"><span class="pre">.parents</span></tt> to travel from an <a> tag up to the root of the document:</p>
<pre><code class="language-python"><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
<span class="n">link</span>
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
<span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">link</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="c"># p</span>
<span class="c"># body</span>
<span class="c"># html</span>
<span class="c"># [document]</span>
</code></pre>
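<p>One way to use <tt class="docutils literal"><span class="pre">.parents</span></tt> is to collect the chain of ancestor names for an element. This is a minimal sketch on an invented document (the markup and the <tt class="docutils literal"><span class="pre">ancestors</span></tt> name are assumptions); the last entry, <tt class="docutils literal"><span class="pre">[document]</span></tt>, is the name of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical document with explicit <html> and <body> tags.
soup = BeautifulSoup(
    "<html><body><p><a>link</a></p></body></html>", "html.parser"
)

# Walk from the <a> tag up to the root, nearest ancestor first.
ancestors = [parent.name for parent in soup.a.parents]

print(ancestors)     # ['p', 'body', 'html', '[document]']
print(soup.parent)   # None, the root has no parent
```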
</div>
</div>
<div class="section" id="id21">
<h2>Sibling nodes</h2>
<p>Consider a simple document like this:</p>
<pre><code class="language-python"><span class="n">sibling_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a><b>text1</b><c>text2</c></a>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
<span class="c"># <html></span>
<span class="c">#  <body></span>
<span class="c">#   <a></span>
<span class="c">#    <b></span>
<span class="c">#     text1</span>
<span class="c">#    </b></span>
<span class="c">#    <c></span>
<span class="c">#     text2</span>
<span class="c">#    </c></span>
<span class="c">#   </a></span>
<span class="c">#  </body></span>
<span class="c"># </html></span>
</code></pre>
<p>The <b> tag and the <c> tag are at the same level: they are both direct children of the same tag, so they are called siblings. When a document is pretty-printed, siblings appear at the same indentation level. You can also use this relationship in code.</p>
<div class="section" id="next-sibling-previous-sibling">
<h3>.next_sibling and .previous_sibling</h3>
<p>Use <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> and <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> to navigate between elements on the same level of the parse tree:</p>
<pre><code class="language-python"><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">next_sibling</span>
<span class="c"># <c>text2</c></span>

<span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">previous_sibling</span>
<span class="c"># <b>text1</b></span>
</code></pre>
<p>The <b> tag has a <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> but no <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt>, because nothing comes before it on the same level of the tree. For the same reason, the <c> tag has a <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> but no <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>:</p>
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">previous_sibling</span><span class="p">)</span>
<span class="c"># None</span>
<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
<span class="c"># None</span>
</code></pre>
<p>The strings "text1" and "text2" are not siblings, because they do not have the same parent:</p>
<pre><code class="language-python"><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
<span class="c"># 'text1'</span>

<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
<span class="c"># None</span>
</code></pre>
<p>In real documents, the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> or <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> of a tag is usually a string containing text or whitespace. Look at the "Alice" document again:</p>
<div class="highlight-python"><pre><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></pre>
</div>
<p>You might expect the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of the first <a> tag to be the second <a> tag. It is not: it is a string, the comma and newline that separate the first <a> tag from the second:</p>
<pre><code class="language-python"><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
<span class="n">link</span>
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>

<span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span>
<span class="c"># ',\n'</span>
</code></pre>
<p>The second <a> tag is the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of that comma string:</p>
<pre><code class="language-python"><span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span><span class="o">.</span><span class="n">next_sibling</span>
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></span>
</code></pre>
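<p>When the goal is the next sibling <em>tag</em> rather than whatever node happens to follow, chaining <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> calls can be fragile. The sketch below shows one alternative, the <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt> method, on a shortened, made-up version of the links above (the markup and variable names are assumptions for illustration):</p>

```python
from bs4 import BeautifulSoup

# Hypothetical two-link paragraph; a comma and newline sit between the links.
doc = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

# The immediate sibling is the separating string, not the second link.
print(repr(first.next_sibling))       # ',\n'

# find_next_sibling() skips ahead to the first sibling that matches.
second = first.find_next_sibling("a")
print(second["id"])                   # link2
```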
</div>
<div class="section" id="next-siblings-previous-siblings">
<h3>.next_siblings and .previous_siblings</h3>
<p>You can iterate over a node's siblings with <tt class="docutils literal"><span class="pre">.next_siblings</span></tt> or <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt>:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">next_siblings</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
<span class="c"># ',\n'</span>
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></span>
<span class="c"># ' and\n'</span>
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></span>
<span class="c"># ';\nand they lived at the bottom of a well.'</span>

<span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span><span class="o">.</span><span class="n">previous_siblings</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
<span class="c"># ' and\n'</span>
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></span>
<span class="c"># ',\n'</span>
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
<span class="c"># 'Once upon a time there were three little sisters; and their names were\n'</span>
</code></pre>
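<p>Because these iterators yield strings as well as tags, a filter is often useful. The following sketch, on an invented sentence with three links (the markup and variable names are assumptions), keeps only the sibling tags going forward and shows that <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt> yields the nearest sibling first:</p>

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sentence with three links separated by text.
soup = BeautifulSoup("<p><a>1</a>, <a>2</a> and <a>3</a>.</p>", "html.parser")
links = soup.find_all("a")

# Forward from the first link, keeping tags and dropping strings.
later_tags = [s for s in links[0].next_siblings if isinstance(s, Tag)]
print([str(t.string) for t in later_tags])   # ['2', '3']

# Backward from the last link: nearest sibling comes first.
earlier = [str(s) for s in links[-1].previous_siblings]
print(earlier)   # [' and ', '<a>2</a>', ', ', '<a>1</a>']
```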
</div>
</div>
<div class="section" id="id22">
<h2>Going back and forth</h2>
<p>Take a look at the beginning of the "Alice" document:</p>
<div class="highlight-python"><pre><html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p></pre>
</div>
<p>An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the order of that initial parse.</p>
<div class="section" id="next-element-previous-element">
<h3>.next_element and .previous_element</h3>
<p>The <tt class="docutils literal"><span class="pre">.next_element</span></tt> attribute of a string or tag points to whatever was parsed immediately afterwards. It may be the same as <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>, but it is usually different.</p>
<p>Here is the last <a> tag in the "Alice" document. Its <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> is a string: the rest of the sentence that was interrupted <a class="footnote-reference" href="#id83" id="id23">[2]</a> by the start of the <a> tag:</p>
<pre><code class="language-python"><span class="n">last_a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span>
<span class="n">last_a_tag</span>
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></span>

<span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_sibling</span>
<span class="c"># ';\nand they lived at the bottom of a well.'</span>
</code></pre>
<p>But the <tt class="docutils literal"><span class="pre">.next_element</span></tt> of that <a> tag, the thing that was parsed immediately after the opening <a> tag, is not the rest of the sentence: it is the string "Tillie":</p>
<pre><code class="language-python"><span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_element</span>
<span class="c"># 'Tillie'</span>
</code></pre>
<p>That is because in the original markup, the word "Tillie" appears before the semicolon. The parser encountered an <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was encountered first.</p>
<p>The <tt class="docutils literal"><span class="pre">.previous_element</span></tt> attribute is the exact opposite of <tt class="docutils literal"><span class="pre">.next_element</span></tt>. It points to whatever was parsed immediately before the current element:</p>
<pre><code class="language-python"><span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span>
<span class="c"># ' and\n'</span>
<span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span><span class="o">.</span><span class="n">next_element</span>
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></span>
</code></pre>
</div>
<div class="section" id="next-elements-previous-elements">
<h3>.next_elements and .previous_elements</h3>
<p>With the <tt class="docutils literal"><span class="pre">.next_elements</span></tt> and <tt class="docutils literal"><span class="pre">.previous_elements</span></tt> iterators you can move forward or backward through the document in the order it was parsed:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_elements</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">element</span><span class="p">))</span>
<span class="c"># 'Tillie'</span>
<span class="c"># ';\nand they lived at the bottom of a well.'</span>
<span class="c"># '\n\n'</span>
<span class="c"># <p class="story">...</p></span>
<span class="c"># '...'</span>
<span class="c"># '\n'</span>
</code></pre>
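<p>The contrast between sibling order and parse order can be sketched on a shortened, invented fragment (the markup and variable names are assumptions for illustration). <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> stays on the same level, while <tt class="docutils literal"><span class="pre">.next_element</span></tt> first descends into the tag's own contents:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link, the rest of its sentence, then a new paragraph.
soup = BeautifulSoup("<p><a>Tillie</a>; the end</p><p>...</p>", "html.parser")
a_tag = soup.a

print(repr(a_tag.next_sibling))   # '; the end'  (same level as <a>)
print(repr(a_tag.next_element))   # 'Tillie'     (parsed right after <a> opened)

# Everything parsed after the <a> tag, in parse order.
order = [str(element) for element in a_tag.next_elements]
print(order)   # ['Tillie', '; the end', '<p>...</p>', '...']
```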
</div>
</div>
</div>
<div class="section" id="id24">
<h1>Searching the tree</h1>
<p>Beautiful Soup defines many methods for searching the parse tree. This section concentrates on two: <tt class="docutils literal"><span class="pre">find()</span></tt> and <tt class="docutils literal"><span class="pre">find_all()</span></tt>. The other methods take similar arguments and are used in much the same way.</p>
<p>Once again, the "Alice" document serves as the example:</p>
<pre><code class="language-python"><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
<span class="s"><html><head><title>The Dormouse's story</title></head></span>

<span class="s"><p class="title"><b>The Dormouse's story</b></p></span>

<span class="s"><p class="story">Once upon a time there were three little sisters; and their names were</span>
<span class="s"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,</span>
<span class="s"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span>
<span class="s"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span>
<span class="s">and they lived at the bottom of a well.</p></span>

<span class="s"><p class="story">...</p></span>
<span class="s">"""</span>

<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
</code></pre>
<p>By passing a filter to a method like <tt class="docutils literal"><span class="pre">find_all()</span></tt>, you can locate the parts of the document you are interested in.</p>
<div class="section" id="id25">
<h2>Filters</h2>
<p>Before looking at <tt class="docutils literal"><span class="pre">find_all()</span></tt> in detail, here are the kinds of filters <a class="footnote-reference" href="#id84" id="id26">[3]</a> you can pass to it. These filters appear throughout the search API. You can use them to filter on a tag's name, on its attributes, on the text of a string, or on some combination of these.</p>
<div class="section" id="id27">
<h3>String</h3>
<p>The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches against that exact string. This code finds all the <b> tags in the document:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'b'</span><span class="p">)</span>
<span class="c"># [<b>The Dormouse's story</b>]</span>
</code></pre>
<p>If you pass in a byte string, Beautiful Soup assumes it is encoded as UTF-8. You can avoid decoding errors by passing in a Unicode string instead.</p>
</div>
<div class="section" id="id28">
<h3>Regular expression</h3>
<p>If you pass in a regular expression object, Beautiful Soup filters against it using its <tt class="docutils literal"><span class="pre">search()</span></tt> method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:</p>
<pre><code class="language-python"><span class="kn">import</span> <span class="nn">re</span>
<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"^b"</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="c"># body</span>
<span class="c"># b</span>
</code></pre>
<p>This code finds all the tags whose names contain the letter "t":</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"t"</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="c"># html</span>
<span class="c"># title</span>
</code></pre>
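<p>The two regular-expression examples above can be reproduced on a compact, invented document. The markup and variable names below are assumptions for illustration. Note that an unanchored pattern matches anywhere in the tag name, which is why <tt class="docutils literal"><span class="pre">html</span></tt> is found by the pattern <tt class="docutils literal"><span class="pre">"t"</span></tt>:</p>

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document containing the tags html, head, title, body and b.
doc = "<html><head><title>t</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Anchored pattern: tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored pattern: tag names that contain "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)   # ['body', 'b']
print(contains_t)      # ['html', 'title']
```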
</div>
<div class="section" id="id29">
<h3>List</h3>
<p>If you pass in a list, Beautiful Soup returns everything that matches any item in that list. This code finds all the <a> tags and all the <b> tags:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">([</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">])</span>
<span class="c"># [<b>The Dormouse's story</b>,</span>
<span class="c">#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
<span class="c">#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
<span class="c">#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
</code></pre>
</div>
<div class="section" id="true">
<h3>True</h3>
<p>The value <tt class="docutils literal"><span class="pre">True</span></tt> matches everything it can. This code finds all the tags in the document, but none of the string nodes:</p>
<pre><code class="language-python"><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="bp">True</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="c"># html</span>
<span class="c"># head</span>
<span class="c"># title</span>
<span class="c"># body</span>
<span class="c"># p</span>
<span class="c"># b</span>
<span class="c"># p</span>
<span class="c"># a</span>
<span class="c"># a</span>
<span class="c"># a</span>
<span class="c"># p</span>
</code></pre>
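<p>The list filter and the <tt class="docutils literal"><span class="pre">True</span></tt> filter can be compared side by side on a tiny invented fragment (the markup and variable names are assumptions for illustration). Results come back in document order, not in the order of the list:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three different inline tags.
soup = BeautifulSoup("<div><a>1</a><b>2</b><i>3</i></div>", "html.parser")

# A list matches any of its items; note "b" is listed first here.
some_tags = [tag.name for tag in soup.find_all(["b", "a"])]

# True matches every tag, but never a string node.
every_tag = [tag.name for tag in soup.find_all(True)]

print(some_tags)   # ['a', 'b']
print(every_tag)   # ['div', 'a', 'b', 'i']
```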
</div>
<div class="section" id="id30">
<h3>Function</h3>
<p>If none of the other filters fits, you can define a function that takes an element as its only argument <a class="footnote-reference" href="#id85" id="id31">[4]</a>. The function should return <tt class="docutils literal"><span class="pre">True</span></tt> if the element matches, and <tt class="docutils literal"><span class="pre">False</span></tt> otherwise.</p>
<p>This function returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag defines the <tt class="docutils literal"><span class="pre">class</span></tt> attribute but does not define the <tt class="docutils literal"><span class="pre">id</span></tt> attribute:</p>
<pre><code class="language-python"><span class="k">def</span> <span class="nf">has_class_but_no_id</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">'class'</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">'id'</span><span class="p">)</span>
</code></pre>
<p>Pass this function to <tt class="docutils literal"><span class="pre">find_all()</span></tt> and you get all the <p> tags:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">has_class_but_no_id</span><span class="p">)</span>
<span class="c"># [<p class="title"><b>The Dormouse's story</b></p>,</span>
<span class="c">#  <p class="story">Once upon a time there were...</p>,</span>
<span class="c">#  <p class="story">...</p>]</span>
</code></pre>
<p>The result contains the <p> tags but not the <a> tags, because the <a> tags define both "class" and "id". It does not contain <html> or <head> either, because those tags do not define "class".</p>
<p>This function finds every tag that is surrounded by string nodes:</p>
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">NavigableString</span>
<span class="k">def</span> <span class="nf">surrounded_by_strings</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">next_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">)</span>
            <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">previous_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">))</span>

<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">surrounded_by_strings</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="c"># p</span>
<span class="c"># a</span>
<span class="c"># a</span>
<span class="c"># a</span>
<span class="c"># p</span>
</code></pre>
|
||
<p>Now let's look at the search methods in detail.</p>
</div>
|
||
</div>
|
||
<div class="section" id="find-all">
|
||
<h2>find_all()</h2>
|
||
<p>find_all( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> looks through the descendants of the current tag and returns every one that matches the filters. Here are a few examples:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># [<p class="title"><b>The Dormouse's story</b></p>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link2"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]</span>
|
||
|
||
<span class="kn">import</span> <span class="nn">re</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"sisters"</span><span class="p">))</span>
|
||
<span class="c"># u'Once upon a time there were three little sisters; and their names were\n'</span>
|
||
</code></pre>
|
||
<p>Some of these calls should look familiar, but others are new. What does it mean to pass a value for <tt class="docutils literal"><span class="pre">text</span></tt> or <tt class="docutils literal"><span class="pre">id</span></tt>? Why does <tt class="docutils literal"><span class="pre">find_all("p",</span> <span class="pre">"title")</span></tt> return a <p> tag whose CSS class is "title"? Let's look at the arguments of <tt class="docutils literal"><span class="pre">find_all()</span></tt> one by one.</p>
<div class="section" id="id32">
|
||
<h3>The name argument</h3>
<p>Pass a value for <tt class="docutils literal"><span class="pre">name</span></tt> and Beautiful Soup considers only tags with that name. Text strings are ignored, as are tags whose names do not match.</p>
<p>The simplest usage:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
</code></pre>
|
||
<p>Recall that the value of <tt class="docutils literal"><span class="pre">name</span></tt> can be any kind of <a class="reference internal" href="#id25">filter</a>: a string, a regular expression, a list, a function, or <tt class="docutils literal"><span class="pre">True</span></tt>.</p>
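<p>As a rough illustration of the filter types listed above, the sketch below passes each one as the <tt class="docutils literal"><span class="pre">name</span></tt> value. The tiny document and the <tt class="docutils literal"><span class="pre">names()</span></tt> helper are assumptions made for this example only.</p>

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b> <a href='#'>link</a></p></body></html>",
    "html.parser",
)


def names(result):
    """Helper for this sketch: list the tag names in a result set."""
    return [tag.name for tag in result]


by_string = names(soup.find_all("b"))                # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))    # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))           # any name in the list
by_true = names(soup.find_all(True))                 # every tag, no strings
by_function = names(soup.find_all(lambda tag: len(tag.name) == 1))

print(by_string, by_regex, by_list, by_true, by_function)
```

<p>Results always come back in document order, whatever order the names appear in the list filter.</p>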
</div>
|
||
<div class="section" id="keyword">
|
||
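<h3>The keyword arguments</h3>
<p>Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name. If you pass a value for an argument called <tt class="docutils literal"><span class="pre">id</span></tt>, Beautiful Soup filters against each tag's "id" attribute:</p>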
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">'link2'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]</span>
|
||
</code></pre>
|
||
<p>If you pass a value for <tt class="docutils literal"><span class="pre">href</span></tt>, Beautiful Soup filters against each tag's "href" attribute:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"elsie"</span><span class="p">))</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
|
||
</code></pre>
|
||
<p>An attribute can be filtered with a <a class="reference internal" href="#id27">string</a>, a <a class="reference internal" href="#id28">regular expression</a>, a <a class="reference internal" href="#id29">list</a>, a function, or <a class="reference internal" href="#true">True</a>.</p>
<p>The following code finds every tag in the document that has an <tt class="docutils literal"><span class="pre">id</span></tt> attribute, whatever the value of <tt class="docutils literal"><span class="pre">id</span></tt> is:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
|
||
<p>You can filter on several attributes at once by passing more than one keyword argument:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"elsie"</span><span class="p">),</span> <span class="nb">id</span><span class="o">=</span><span class="s">'link1'</span><span class="p">)</span>
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
</code></pre>
<p>Some attributes have names that cannot be used as keyword arguments, such as the HTML5 data-* attributes:</p>
<pre><code class="language-python"><span class="n">data_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<div data-foo="value">foo!</div>'</span><span class="p">)</span>
|
||
<span class="n">data_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">data</span><span class="o">-</span><span class="n">foo</span><span class="o">=</span><span class="s">"value"</span><span class="p">)</span>
|
||
<span class="c"># SyntaxError: keyword can't be an expression</span>
|
||
</code></pre>
|
||
<p>You can still search on such attributes by putting them in a dictionary and passing it as the <tt class="docutils literal"><span class="pre">attrs</span></tt> argument of <tt class="docutils literal"><span class="pre">find_all()</span></tt>:</p>
<pre><code class="language-python"><span class="n">data_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"data-foo"</span><span class="p">:</span> <span class="s">"value"</span><span class="p">})</span>
|
||
<span class="c"># [<div data-foo="value">foo!</div>]</span>
|
||
</code></pre>
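<p>The values in an <tt class="docutils literal"><span class="pre">attrs</span></tt> dictionary accept the same filter types as keyword arguments. A minimal sketch, assuming a made-up three-element document:</p>

```python
import re
from bs4 import BeautifulSoup

data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div>'
    '<div data-foo="other">bar!</div>'
    '<div>baz!</div>',
    "html.parser",
)

# Exact attribute value.
exact = data_soup.find_all(attrs={"data-foo": "value"})
# Any tag that defines the attribute at all.
present = data_soup.find_all(attrs={"data-foo": True})
# Attribute value matched by a regular expression.
pattern = data_soup.find_all(attrs={"data-foo": re.compile("^oth")})

print([div.get_text() for div in exact])
print([div.get_text() for div in present])
print([div.get_text() for div in pattern])
```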
|
||
</div>
|
||
<div class="section" id="css">
|
||
<h3>Searching by CSS class</h3>
<p>Searching for tags by CSS class is very useful, but <tt class="docutils literal"><span class="pre">class</span></tt>, the name of the CSS attribute, is a reserved word in Python, so using <tt class="docutils literal"><span class="pre">class</span></tt> as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the <tt class="docutils literal"><span class="pre">class_</span></tt> keyword argument:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"sister"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
|
||
<p>Like any keyword argument, <tt class="docutils literal"><span class="pre">class_</span></tt> accepts the different kinds of filter: a string, a regular expression, a function, or <tt class="docutils literal"><span class="pre">True</span></tt>:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"itl"</span><span class="p">))</span>
|
||
<span class="c"># [<p class="title"><b>The Dormouse's story</b></p>]</span>
|
||
|
||
<span class="k">def</span> <span class="nf">has_six_characters</span><span class="p">(</span><span class="n">css_class</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">css_class</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">css_class</span><span class="p">)</span> <span class="o">==</span> <span class="mi">6</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">has_six_characters</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
|
||
<p>Remember that a tag's <tt class="docutils literal"><span class="pre">class</span></tt> attribute can hold <a class="reference internal" href="#id12">multiple values</a>. When you search by CSS class, each of the tag's classes is matched separately:</p>
<pre><code class="language-python"><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<p class="body strikeout"></p>'</span><span class="p">)</span>
|
||
<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"strikeout"</span><span class="p">)</span>
|
||
<span class="c"># [<p class="body strikeout"></p>]</span>
|
||
|
||
<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"body"</span><span class="p">)</span>
|
||
<span class="c"># [<p class="body strikeout"></p>]</span>
|
||
</code></pre>
|
||
<p>You can also search for the exact string value of the <tt class="docutils literal"><span class="pre">class</span></tt> attribute:</p>
<pre><code class="language-python"><span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"body strikeout"</span><span class="p">)</span>
|
||
<span class="c"># [<p class="body strikeout"></p>]</span>
|
||
</code></pre>
|
||
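<p>An exact match on the <tt class="docutils literal"><span class="pre">class</span></tt> value is order-sensitive: if the class names are given in a different order from the document, the search finds nothing. The <tt class="docutils literal"><span class="pre">class</span></tt> attribute can also be matched through the <tt class="docutils literal"><span class="pre">attrs</span></tt> dictionary, which avoids the reserved word:</p>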
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span> <span class="s">"sister"</span><span class="p">})</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
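<p>The order sensitivity of the exact-string form can be checked directly. The sketch below also shows one possible workaround, a filter function that compares class names as a set. The helper <tt class="docutils literal"><span class="pre">has_both_classes</span></tt> is hypothetical and not part of Beautiful Soup.</p>

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# The exact-string form compares against the attribute value as written,
# so the same classes in a different order find nothing.
same_order = css_soup.find_all("p", class_="body strikeout")
swapped = css_soup.find_all("p", class_="strikeout body")


def has_both_classes(tag):
    """Hypothetical filter: a <p> carrying both classes, in any order."""
    return tag.name == "p" and {"body", "strikeout"} <= set(tag.get("class", []))


any_order = css_soup.find_all(has_both_classes)

print(len(same_order), len(swapped), len(any_order))
```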
|
||
</div>
|
||
<div class="section" id="text">
|
||
<h3>The <tt class="docutils literal"><span class="pre">text</span></tt> argument</h3>
<p>With <tt class="docutils literal"><span class="pre">text</span></tt> you can search for strings in the document instead of tags. As with <tt class="docutils literal"><span class="pre">name</span></tt>, the <tt class="docutils literal"><span class="pre">text</span></tt> argument accepts a <a class="reference internal" href="#id27">string</a>, a <a class="reference internal" href="#id28">regular expression</a>, a <a class="reference internal" href="#id29">list</a>, a function, or <a class="reference internal" href="#true">True</a>. Here are some examples:</p>
<div class="highlight-python"><pre>soup.find_all(text="Elsie")
# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']</pre>
</div>
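<p>Although <tt class="docutils literal"><span class="pre">text</span></tt> is meant for finding strings, it can be combined with arguments that find tags. Beautiful Soup then returns the tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> matches the value of <tt class="docutils literal"><span class="pre">text</span></tt>. This code finds the <a> tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> is "Elsie":</p>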
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s">"Elsie"</span><span class="p">)</span>
|
||
<span class="c"># [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]</span>
|
||
</code></pre>
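<p>The difference between searching for strings and searching for tags by their string can be sketched as follows. The example assumes Beautiful Soup 4.4.0 or later, where the same argument is also available under the name <tt class="docutils literal"><span class="pre">string</span></tt>. The two-link document is made up for illustration.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)

# Searching on the string alone returns the strings themselves.
strings = soup.find_all(string="Elsie")
# Adding a tag name returns the tags whose .string equals the value.
tags = soup.find_all("a", string="Elsie")

print([str(s) for s in strings])
print([tag["id"] for tag in tags])
```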
|
||
</div>
|
||
<div class="section" id="limit">
|
||
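<h3>The <tt class="docutils literal"><span class="pre">limit</span></tt> argument</h3>
<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> returns every result that matches, which can be slow when the document is large. If you do not need all the results, pass a number for <tt class="docutils literal"><span class="pre">limit</span></tt>. It works like the LIMIT keyword in SQL: once the number of results reaches <tt class="docutils literal"><span class="pre">limit</span></tt>, Beautiful Soup stops searching and returns what it has found.</p>
<p>Three tags in the document match the search, but only two are returned because the number of results is limited:</p>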
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]</span>
|
||
</code></pre>
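<p>A short sketch of the two cases, using a made-up document with three links: a limit below the number of matches truncates the result, and a limit above it simply returns everything.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>1</a><a>2</a><a>3</a>", "html.parser")

first_two = soup.find_all("a", limit=2)    # stops after two matches
all_three = soup.find_all("a", limit=10)   # fewer matches than the limit

print([a.string for a in first_two], len(all_three))
```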
|
||
</div>
|
||
<div class="section" id="recursive">
|
||
<h3>The <tt class="docutils literal"><span class="pre">recursive</span></tt> argument</h3>
<p>When you call <tt class="docutils literal"><span class="pre">find_all()</span></tt> on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. If you only want the tag's direct children to be considered, pass <tt class="docutils literal"><span class="pre">recursive=False</span></tt>.</p>
<p>Consider this part of the document:</p>
<div class="highlight-python"><pre><html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...</pre>
</div>
<p>Here is how the results differ with and without the <tt class="docutils literal"><span class="pre">recursive</span></tt> argument:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
|
||
<span class="c"># []</span>
|
||
</code></pre>
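<p>The empty result above follows from <title> being a grandchild of <html>, not a child. A minimal sketch with a made-up document shows which tags a non-recursive search actually considers:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)

# Only <head> and <body> are direct children of <html>.
direct_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]
# <title> is a grandchild of <html>, so the non-recursive search misses it ...
title_direct = soup.html.find_all("title", recursive=False)
# ... while the same search started from <head> finds it as a direct child.
title_from_head = soup.head.find_all("title", recursive=False)

print(direct_children, len(title_direct), title_from_head)
```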
|
||
</div>
|
||
</div>
|
||
<div class="section" id="find-all-tag">
|
||
<h2>Calling a tag is like calling <tt class="docutils literal"><span class="pre">find_all()</span></tt></h2>
<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> is the most commonly used search method in Beautiful Soup, so there is a shortcut for it. A <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or a <tt class="docutils literal"><span class="pre">tag</span></tt> object can be called like a function, and the result is the same as calling <tt class="docutils literal"><span class="pre">find_all()</span></tt> on that object. These two lines of code are equivalent:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
|
||
</code></pre>
|
||
<p>These two lines are also equivalent:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
|
||
</code></pre>
|
||
</div>
|
||
<div class="section" id="find">
|
||
<h2>find()</h2>
|
||
<p>find( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> returns every tag in the document that matches, but sometimes you only want one result. If the document has only one <body> tag, scanning the whole document with <tt class="docutils literal"><span class="pre">find_all()</span></tt> is wasteful. Rather than calling <tt class="docutils literal"><span class="pre">find_all</span></tt> with <tt class="docutils literal"><span class="pre">limit=1</span></tt>, you can use the <tt class="docutils literal"><span class="pre">find()</span></tt> method. These two lines of code are nearly equivalent:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'title'</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span>
|
||
<span class="c"># <title>The Dormouse's story</title></span>
|
||
</code></pre>
|
||
<p>The only difference is that <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns a list containing the single result, while <tt class="docutils literal"><span class="pre">find()</span></tt> returns the result itself.</p>
<p>If <tt class="docutils literal"><span class="pre">find_all()</span></tt> finds nothing, it returns an empty list. If <tt class="docutils literal"><span class="pre">find()</span></tt> finds nothing, it returns <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"nosuchtag"</span><span class="p">))</span>
|
||
<span class="c"># None</span>
|
||
</code></pre>
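<p>Because a failed <tt class="docutils literal"><span class="pre">find()</span></tt> returns <tt class="docutils literal"><span class="pre">None</span></tt>, chaining another call onto it raises an error. The sketch below, with a made-up document and illustrative variable names, shows the failure and one way to guard against it.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

missing = soup.find("nosuchtag")          # None, not an empty list
missing_all = soup.find_all("nosuchtag")  # empty list

# Chaining on a failed find() raises AttributeError,
# because None has no find() method.
try:
    soup.find("nosuchtag").find("title")
    chained_ok = True
except AttributeError:
    chained_ok = False

# Guarded version: check for None before using the result.
title = soup.find("title")
title_text = title.string if title is not None else None

print(missing, len(missing_all), chained_ok, title_text)
```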
|
||
<p><tt class="docutils literal"><span class="pre">soup.head.title</span></tt> is a shortcut for navigating by <a class="reference internal" href="#id17">tag name</a>. It works by repeatedly calling <tt class="docutils literal"><span class="pre">find()</span></tt> on the current tag:</p>
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">head</span><span class="o">.</span><span class="n">title</span>
|
||
<span class="c"># <title>The Dormouse's story</title></span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"head"</span><span class="p">)</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># <title>The Dormouse's story</title></span>
|
||
</code></pre>
|
||
</div>
|
||
<div class="section" id="find-parents-find-parent">
|
||
<h2>find_parents() and find_parent()</h2>
<p>find_parents( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>find_parent( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>We have spent a lot of space on <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt>. Beautiful Soup defines ten more search methods. Five of them take the same arguments as <tt class="docutils literal"><span class="pre">find_all()</span></tt>, and the other five take the same arguments as <tt class="docutils literal"><span class="pre">find()</span></tt>. The only difference is which part of the document they search.</p>
<p>Remember that <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt> work their way down the tree, looking at a node's children, grandchildren, and so on. <tt class="docutils literal"><span class="pre">find_parents()</span></tt> and <tt class="docutils literal"><span class="pre">find_parent()</span></tt> do the opposite: they work their way up the tree, looking at a node's parents, and they accept the same filters as the ordinary tag searches. Let's start from a leaf node of the document:</p>
<div class="highlight-python"><pre>a_string = soup.find(text="Lacie")
a_string
# u'Lacie'

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>

a_string.find_parents("p", class_="title")
# []</pre>
</div>
<p>One of the <a> tags is the direct parent of the leaf node, so the search finds it. One of the <p> tags is an indirect parent of the leaf node, so it is found as well. The <p> tag with the class "title" is not a parent of the leaf node, so <tt class="docutils literal"><span class="pre">find_parents()</span></tt> cannot find it.</p>
<p><tt class="docutils literal"><span class="pre">find_parent()</span></tt> and <tt class="docutils literal"><span class="pre">find_parents()</span></tt> may remind you of the <a class="reference internal" href="#parent">.parent</a> and <a class="reference internal" href="#parents">.parents</a> attributes, and the connection is very close. The parent search methods iterate over <tt class="docutils literal"><span class="pre">.parents</span></tt> and check each parent against the filter.</p>
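<p>The relationship described above can be sketched by walking <tt class="docutils literal"><span class="pre">.parents</span></tt> by hand and comparing the outcome with the search method. The small document below is an assumption made for the example.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p class="story"><a id="link2">Lacie</a></p></body></html>',
    "html.parser",
)
# Assumes Beautiful Soup 4.4.0 or later for the string= argument name.
a_string = soup.find(string="Lacie")

# Walk upwards by hand, keeping only the <p> ancestors ...
manual = [parent for parent in a_string.parents if parent.name == "p"]
# ... and run the equivalent search.
searched = a_string.find_parents("p")

# Every ancestor, nearest first; the last one is the BeautifulSoup object.
ancestor_names = [parent.name for parent in a_string.parents]
print(ancestor_names)
```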
</div>
|
||
<div class="section" id="find-next-siblings-find-next-sibling">
|
||
<h2>find_next_siblings() and find_next_sibling()</h2>
<p>find_next_siblings( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>find_next_sibling( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>These two methods use <a class="reference internal" href="#next-siblings-previous-siblings">.next_siblings</a> to iterate over the siblings that come after the current tag in the document <a class="footnote-reference" href="#id86" id="id33">[5]</a>. <tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> returns all the following siblings that match, and <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt> returns only the first one:</p>
<pre><code class="language-python"><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
<span class="n">first_link</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
|
||
|
||
<span class="n">first_link</span><span class="o">.</span><span class="n">find_next_siblings</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="s">"story"</span><span class="p">)</span>
|
||
<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_next_sibling</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
|
||
<span class="c"># <p class="story">...</p></span>
|
||
</code></pre>
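<p>Since <tt class="docutils literal"><span class="pre">.next_siblings</span></tt> yields strings as well as tags, a hand-written equivalent has to filter both by type and by name. A rough sketch with a made-up one-paragraph document:</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>;</p>',
    "html.parser",
)
first_link = soup.a

# Manual iteration: skip the strings between the links, keep <a> tags.
manual = [
    sibling
    for sibling in first_link.next_siblings
    if isinstance(sibling, Tag) and sibling.name == "a"
]
searched = first_link.find_next_siblings("a")
nearest = first_link.find_next_sibling("a")

print([tag["id"] for tag in searched], nearest["id"])
```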
|
||
</div>
|
||
<div class="section" id="find-previous-siblings-find-previous-sibling">
|
||
<h2>find_previous_siblings() and find_previous_sibling()</h2>
<p>find_previous_siblings( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>find_previous_sibling( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>These two methods use <a class="reference internal" href="#next-siblings-previous-siblings">.previous_siblings</a> to iterate over the siblings that come before the current tag in the document <a class="footnote-reference" href="#id86" id="id34">[5]</a>. <tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> returns all the preceding siblings that match, and <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt> returns only the first one:</p>
<pre><code class="language-python"><span class="n">last_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span>
|
||
<span class="n">last_link</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></span>
|
||
|
||
<span class="n">last_link</span><span class="o">.</span><span class="n">find_previous_siblings</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
|
||
|
||
<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="s">"story"</span><span class="p">)</span>
|
||
<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_previous_sibling</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
|
||
<span class="c"># <p class="title"><b>The Dormouse's story</b></p></span>
|
||
</code></pre>
|
||
</div>
|
||
<div class="section" id="find-all-next-find-next">
|
||
<h2>find_all_next() and find_next()</h2>
<p>find_all_next( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>find_next( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>These two methods use <a class="reference internal" href="#next-elements-previous-elements">.next_elements</a> to iterate over the tags and strings that come after the current tag in the document <a class="footnote-reference" href="#id86" id="id35">[5]</a>. <tt class="docutils literal"><span class="pre">find_all_next()</span></tt> returns all the matches, and <tt class="docutils literal"><span class="pre">find_next()</span></tt> returns only the first one:</p>
<pre><code class="language-python"><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
<span class="n">first_link</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
|
||
|
||
<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_next</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
|
||
<span class="c"># [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',</span>
|
||
<span class="c"># u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']</span>
|
||
|
||
<span class="n">first_link</span><span class="o">.</span><span class="n">find_next</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
|
||
<span class="c"># <p class="story">...</p></span>
|
||
</code></pre>
|
||
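<p>In the first example, the string "Elsie" shows up even though it is contained within the <a> tag we started from. In the second example, the last <p> tag in the document shows up even though it is not in the same part of the tree as that <a> tag. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.</p>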
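<p>The difference between a document-order search and a sibling search can be sketched as follows. The shortened markup and the <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> argument are assumptions made for illustration; they are not part of the running example above.</p>

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (assumption for illustration).
html_doc = (
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.find("a", id="link1")

# Sibling search: only nodes that share first_link's parent are candidates.
sibling_p = first_link.find_next_sibling("p")
print(sibling_p)  # None

# Document-order search: anything parsed after first_link is a candidate.
later_p = first_link.find_next("p")
print(later_p)  # <p class="story">...</p>
```

<p>The <a> tag has no <p> sibling, so the sibling search finds nothing, while <tt class="docutils literal"><span class="pre">find_next()</span></tt> reaches the following paragraph.</p>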
|
||
</div>
|
||
<div class="section" id="find-all-previous-find-previous">
|
||
<h2>find_all_previous() and find_previous()</h2>
|
||
<p>find_all_previous( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>find_previous( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
|
||
<p>These two methods use <a class="reference internal" href="#next-elements-previous-elements">.previous_elements</a> to iterate over the tags and strings that come before <a class="footnote-reference" href="#id86" id="id36">[5]</a> the current node in the document. <tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> returns every node that matches, and <tt class="docutils literal"><span class="pre">find_previous()</span></tt> returns only the first match:</p>
|
||
<pre><code class="language-python"><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
<span class="n">first_link</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></span>
|
||
|
||
<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_previous</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
|
||
<span class="c"># [<p class="story">Once upon a time there were three little sisters; ...</p>,</span>
|
||
<span class="c"># <p class="title"><b>The Dormouse's story</b></p>]</span>
|
||
|
||
<span class="n">first_link</span><span class="o">.</span><span class="n">find_previous</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># <title>The Dormouse's story</title></span>
|
||
</code></pre>
|
||
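<p>The call to <tt class="docutils literal"><span class="pre">find_all_previous("p")</span></tt> finds the first paragraph in the document (the one with class="title"), but it also finds the second paragraph, the <p> tag that contains the <a> tag we started from. This should not be surprising: the search covers every tag that appears earlier in the document than the starting <a> tag, and a <p> tag that contains an <a> tag must open before the <a> tag it contains.</p>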
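<p>A small sketch of the result order, assuming a cut-down two-paragraph document and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> (both chosen here for illustration): the search walks backwards, so the enclosing paragraph comes first in the list.</p>

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the sample document (assumption for illustration).
html_doc = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Walks backwards through the document, nearest match first.
paragraphs = first_link.find_all_previous("p")
classes = [p["class"] for p in paragraphs]
print(classes)  # [['story'], ['title']]

# The first hit is the very <p> that contains the starting <a>.
print(paragraphs[0] is first_link.parent)  # True
```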
|
||
</div>
|
||
<div class="section" id="id37">
|
||
<h2>CSS selectors</h2>
|
||
<p>Beautiful Soup supports most CSS selectors <a class="footnote-reference" href="#id87" id="id38">[6]</a>. Pass a selector string to the <tt class="docutils literal"><span class="pre">.select()</span></tt> method of a <tt class="docutils literal"><span class="pre">Tag</span></tt> object or of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself to find tags using CSS selector syntax:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"p:nth-of-type(3)"</span><span class="p">)</span>
|
||
<span class="c"># [<p class="story">...</p>]</span>
|
||
</code></pre>
|
||
<p>Find tags beneath other tags:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"body a"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"html head title"</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
</code></pre>
|
||
<p>Find tags directly beneath other tags <a class="footnote-reference" href="#id87" id="id39">[6]</a>:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"head > title"</span><span class="p">)</span>
|
||
<span class="c"># [<title>The Dormouse's story</title>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"p > a"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"p > a:nth-of-type(2)"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"p > #link1"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"body > a"</span><span class="p">)</span>
|
||
<span class="c"># []</span>
|
||
</code></pre>
|
||
<p>Find the siblings of tags:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"#link1 ~ .sister"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"#link1 + .sister"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]</span>
|
||
</code></pre>
|
||
<p>Find tags by CSS class:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">".sister"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"[class~=sister]"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
|
||
<p>Find tags by ID:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"#link1"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"a#link2"</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]</span>
|
||
</code></pre>
|
||
<p>Test for the existence of an attribute:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href]'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
</code></pre>
|
||
<p>Find tags by attribute value:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href="http://example.com/elsie"]'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href^="http://example.com/"]'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href$="tillie"]'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href*=".com/el"]'</span><span class="p">)</span>
|
||
<span class="c"># [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]</span>
|
||
</code></pre>
|
||
<p>Match language codes:</p>
|
||
<pre><code class="language-python"><span class="n">multilingual_markup</span> <span class="o">=</span> <span class="s">"""</span>
|
||
<span class="s"> <p lang="en">Hello</p></span>
|
||
<span class="s"> <p lang="en-us">Howdy, y'all</p></span>
|
||
<span class="s"> <p lang="en-gb">Pip-pip, old fruit</p></span>
|
||
<span class="s"> <p lang="fr">Bonjour mes amis</p></span>
|
||
<span class="s">"""</span>
|
||
<span class="n">multilingual_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">multilingual_markup</span><span class="p">)</span>
|
||
<span class="n">multilingual_soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'p[lang|=en]'</span><span class="p">)</span>
|
||
<span class="c"># [<p lang="en">Hello</p>,</span>
|
||
<span class="c"># <p lang="en-us">Howdy, y'all</p>,</span>
|
||
<span class="c"># <p lang="en-gb">Pip-pip, old fruit</p>]</span>
|
||
</code></pre>
|
||
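<p>This is a convenience for users who already know CSS selector syntax; everything shown here can also be done with the rest of the Beautiful Soup API. If CSS selectors are all you need, you may as well use <tt class="docutils literal"><span class="pre">lxml</span></tt> directly, which is faster and supports more CSS selectors. What Beautiful Soup offers is the ability to combine CSS selectors with its own API.</p>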
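<p>One way to combine the two styles is sketched below. The markup is a trimmed copy of the sample document and the <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> argument is an assumption; the point is only that <tt class="docutils literal"><span class="pre">select()</span></tt> returns ordinary <tt class="docutils literal"><span class="pre">Tag</span></tt> objects.</p>

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (assumption for illustration).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() returns a list of Tag objects, so attribute access works on each result.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]
print(hrefs)

# A Tag found with a selector can be the starting point for find-style navigation.
second = soup.select("#link2")[0]
next_id = second.find_next_sibling("a")["id"]
parent_class = second.find_parent("p")["class"]
print(next_id)       # link3
print(parent_class)  # ['story']
```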
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id40">
|
||
<h1>Modifying the tree</h1>
|
||
<p>Beautiful Soup's main strength is searching the parse tree, but you can also modify the tree with little effort.</p>
|
||
<div class="section" id="id41">
|
||
<h2>Changing tag names and attributes</h2>
|
||
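<p>This was covered earlier, in <a class="reference internal" href="#attributes">Attributes</a>, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:</p>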
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'<b class="boldest">Extremely bold</b>'</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
|
||
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"blockquote"</span>
|
||
<span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'verybold'</span>
|
||
<span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <blockquote class="verybold" id="1">Extremely bold</blockquote></span>
|
||
|
||
<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
|
||
<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <blockquote>Extremely bold</blockquote></span>
|
||
</code></pre>
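<p>Because attributes behave like dictionary entries, a cleanup pass can be written as an ordinary loop. The following is a minimal sketch that keeps only <tt class="docutils literal"><span class="pre">href</span></tt>; the markup and the <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> argument are assumptions for illustration.</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Copy the attribute names first: deleting while iterating over tag.attrs
# itself would modify the dictionary during iteration.
for name in list(tag.attrs):
    if name != "href":
        del tag[name]

print(tag)  # <a href="http://example.com/">Elsie</a>
```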
|
||
</div>
|
||
<div class="section" id="id42">
|
||
<h2>Modifying .string</h2>
|
||
<p>If you set a tag's <tt class="docutils literal"><span class="pre">.string</span></tt> attribute, the tag's existing contents are replaced with the string you give:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"New link text."</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <a href="http://example.com/">New link text.</a></span>
|
||
</code></pre>
|
||
<p>Note: if the tag contains other tags, assigning to its <tt class="docutils literal"><span class="pre">.string</span></tt> attribute destroys all of the original contents, including those child tags.</p>
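<p>The loss of child tags can be checked directly. This sketch reuses the markup from the example above and assumes the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

child_before = tag.i          # the <i> tag exists before the assignment
tag.string = "New link text."
child_after = tag.i           # no <i> tag is left afterwards

print(child_before)       # <i>example.com</i>
print(child_after)        # None
print(len(tag.contents))  # 1
```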
|
||
</div>
|
||
<div class="section" id="append">
|
||
<h2>append()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">Tag.append()</span></tt> adds content to a tag. It works just like calling <tt class="docutils literal"><span class="pre">.append()</span></tt> on a Python list:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a>Foo</a>"</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">"Bar"</span><span class="p">)</span>
|
||
|
||
<span class="n">soup</span>
|
||
<span class="c"># <html><head></head><body><a>FooBar</a></body></html></span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">contents</span>
|
||
<span class="c"># [u'Foo', u'Bar']</span>
|
||
</code></pre>
|
||
</div>
|
||
<div class="section" id="beautifulsoup-new-string-new-tag">
|
||
<h2>BeautifulSoup.new_string() and .new_tag()</h2>
|
||
<p>If you need to add a string to a document, you can pass a Python string to <tt class="docutils literal"><span class="pre">append()</span></tt>, or you can call the factory method <tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt>:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<b></b>"</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">"Hello"</span><span class="p">)</span>
|
||
<span class="n">new_string</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">" there"</span><span class="p">)</span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_string</span><span class="p">)</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <b>Hello there</b></span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
|
||
<span class="c"># [u'Hello', u' there']</span>
|
||
</code></pre>
|
||
<p>To create a comment, or an instance of any other subclass of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, pass that class as the second argument to <tt class="docutils literal"><span class="pre">new_string()</span></tt>:</p>
|
||
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">Comment</span>
|
||
<span class="n">new_comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">"Nice to see you."</span><span class="p">,</span> <span class="n">Comment</span><span class="p">)</span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_comment</span><span class="p">)</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <b>Hello there<!--Nice to see you.--></b></span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
|
||
<span class="c"># [u'Hello', u' there', u'Nice to see you.']</span>
|
||
</code></pre>
|
||
<p>(This feature was added in Beautiful Soup 4.2.1.)</p>
|
||
<p>The best way to create a whole new tag is to call the factory method <tt class="docutils literal"><span class="pre">BeautifulSoup.new_tag()</span></tt>:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<b></b>"</span><span class="p">)</span>
|
||
<span class="n">original_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
|
||
|
||
<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="s">"http://www.example.com"</span><span class="p">)</span>
|
||
<span class="n">original_tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
|
||
<span class="n">original_tag</span>
|
||
<span class="c"># <b><a href="http://www.example.com"></a></b></span>
|
||
|
||
<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"Link text."</span>
|
||
<span class="n">original_tag</span>
|
||
<span class="c"># <b><a href="http://www.example.com">Link text.</a></b></span>
|
||
</code></pre>
|
||
<p>Only the first argument, the tag name, is required; all other arguments are optional.</p>
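<p><tt class="docutils literal"><span class="pre">new_tag()</span></tt> and <tt class="docutils literal"><span class="pre">append()</span></tt> are often used together to build a fragment in a loop. A minimal sketch, assuming an empty <ul> as the starting markup and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

for name in ["Elsie", "Lacie", "Tillie"]:
    item = soup.new_tag("li")  # create a detached <li> tag
    item.string = name         # give it text content
    soup.ul.append(item)       # attach it as the last child of <ul>

print(soup.ul)
# <ul><li>Elsie</li><li>Lacie</li><li>Tillie</li></ul>
```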
|
||
</div>
|
||
<div class="section" id="insert">
|
||
<h2>insert()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">Tag.insert()</span></tt> is like <tt class="docutils literal"><span class="pre">Tag.append()</span></tt>, except that the new element does not necessarily go at the end of its parent's <tt class="docutils literal"><span class="pre">.contents</span></tt>: it is inserted at the numeric position you specify. It works just like <tt class="docutils literal"><span class="pre">.insert()</span></tt> on a Python list:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"but did not endorse "</span><span class="p">)</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a></span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
|
||
<span class="c"># [u'I linked to ', u'but did not endorse ', <i>example.com</i>]</span>
|
||
</code></pre>
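<p>The position argument is an index into <tt class="docutils literal"><span class="pre">.contents</span></tt>, so position 0 prepends and a position equal to the current length appends. A short sketch with assumed markup and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b></p>", "html.parser")
tag = soup.p

tag.insert(0, "one ")                    # becomes the first child
tag.insert(len(tag.contents), " three")  # same effect as append()

print(tag)                # <p>one <b>two</b> three</p>
print(len(tag.contents))  # 3
```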
|
||
</div>
|
||
<div class="section" id="insert-before-insert-after">
|
||
<h2>insert_before() and insert_after()</h2>
|
||
<p>The <tt class="docutils literal"><span class="pre">insert_before()</span></tt> method inserts content immediately before the current tag or text node:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<b>stop</b>"</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"i"</span><span class="p">)</span>
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"Don't"</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">insert_before</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
|
||
<span class="c"># <b><i>Don't</i>stop</b></span>
|
||
</code></pre>
|
||
<p>The <tt class="docutils literal"><span class="pre">insert_after()</span></tt> method inserts content immediately after the current tag or text node:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">insert_after</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">" ever "</span><span class="p">))</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
|
||
<span class="c"># <b><i>Don't</i> ever stop</b></span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">contents</span>
|
||
<span class="c"># [<i>Don't</i>, u' ever ', u'stop']</span>
|
||
</code></pre>
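<p>Both methods position the new node relative to an existing node rather than by index. The sketch below surrounds a tag with plain strings; the markup and the <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> argument are assumptions for illustration.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>middle</b></p>", "html.parser")

# The new nodes become siblings of <b> inside the same parent.
soup.b.insert_before("before ")
soup.b.insert_after(" after")

print(soup.p)  # <p>before <b>middle</b> after</p>
```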
|
||
</div>
|
||
<div class="section" id="clear">
|
||
<h2>clear()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">Tag.clear()</span></tt> removes the contents of the current tag:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
|
||
<span class="n">tag</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
|
||
<span class="n">tag</span>
|
||
<span class="c"># <a href="http://example.com/"></a></span>
|
||
</code></pre>
|
||
</div>
|
||
<div class="section" id="extract">
|
||
<h2>extract()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">PageElement.extract()</span></tt> removes the current tag or string from the tree and returns the element that was removed:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
|
||
<span class="n">i_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
|
||
|
||
<span class="n">a_tag</span>
|
||
<span class="c"># <a href="http://example.com/">I linked to </a></span>
|
||
|
||
<span class="n">i_tag</span>
|
||
<span class="c"># <i>example.com</i></span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">i_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
|
||
<span class="c"># None</span>
|
||
</code></pre>
|
||
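<p>At this point you effectively have two parse trees: one rooted at the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call <tt class="docutils literal"><span class="pre">extract</span></tt> on a child of the extracted element:</p>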
|
||
<pre><code class="language-python"><span class="n">my_string</span> <span class="o">=</span> <span class="n">i_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
|
||
<span class="n">my_string</span>
|
||
<span class="c"># u'example.com'</span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">my_string</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
|
||
<span class="c"># None</span>
|
||
<span class="n">i_tag</span>
|
||
<span class="c"># <i></i></span>
|
||
</code></pre>
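<p>Since the extracted element stays usable, <tt class="docutils literal"><span class="pre">extract()</span></tt> can serve as the first half of a move. A minimal sketch, with hypothetical <tt class="docutils literal"><span class="pre">id</span></tt> values and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> assumed:</p>

```python
from bs4 import BeautifulSoup

# The ids "a" and "b" are hypothetical names chosen for this sketch.
markup = '<div id="a"><span>move me</span></div><div id="b"></div>'
soup = BeautifulSoup(markup, "html.parser")

span = soup.span.extract()       # detach <span> from the first <div>
soup.find(id="b").append(span)   # attach it to the second <div>

print(soup)
# <div id="a"></div><div id="b"><span>move me</span></div>
```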
|
||
</div>
|
||
<div class="section" id="decompose">
|
||
<h2>decompose()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">Tag.decompose()</span></tt> removes the current node from the tree and then completely destroys it and its contents:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">decompose</span><span class="p">()</span>
|
||
|
||
<span class="n">a_tag</span>
|
||
<span class="c"># <a href="http://example.com/">I linked to </a></span>
|
||
</code></pre>
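<p>A common use is discarding every tag of a given kind when the removed nodes are not needed afterwards. The sketch below drops all <script> tags; the markup and the <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> argument are assumptions for illustration.</p>

```python
from bs4 import BeautifulSoup

markup = "<div><script>alert(1)</script><p>Hello</p><script>x = 2</script></div>"
soup = BeautifulSoup(markup, "html.parser")

# find_all() builds its result list up front, so it is safe to
# destroy each match while looping over that list.
for script in soup.find_all("script"):
    script.decompose()

print(soup)  # <div><p>Hello</p></div>
```

<p>If the removed nodes are still needed, <tt class="docutils literal"><span class="pre">extract()</span></tt> is the better fit, because it returns them intact.</p>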
|
||
</div>
|
||
<div class="section" id="replace-with">
|
||
<h2>replace_with()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">PageElement.replace_with()</span></tt> removes a tag or string from the tree and puts the tag or string of your choice in its place:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
|
||
<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"b"</span><span class="p">)</span>
|
||
<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"example.net"</span>
|
||
<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
|
||
|
||
<span class="n">a_tag</span>
|
||
<span class="c"># <a href="http://example.com/">I linked to <b>example.net</b></a></span>
|
||
</code></pre>
|
||
<p><tt class="docutils literal"><span class="pre">replace_with()</span></tt> returns the tag or string that was replaced, so you can examine it or add it back to another part of the tree.</p>
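<p>The replacement can also be a plain string, and the return value is the detached original. A short sketch under assumed markup, using the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>:</p>

```python
from bs4 import BeautifulSoup

markup = '<p>Visit <a href="http://example.com/">our site</a> today</p>'
soup = BeautifulSoup(markup, "html.parser")

# Swap the link for its bare text; keep the removed tag for inspection.
old_link = soup.a.replace_with("our site")

print(soup.p)           # <p>Visit our site today</p>
print(old_link)         # <a href="http://example.com/">our site</a>
print(old_link.parent)  # None
```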
|
||
</div>
|
||
<div class="section" id="wrap">
|
||
<h2>wrap()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">PageElement.wrap()</span></tt> wraps the specified element in a tag <a class="footnote-reference" href="#id89" id="id43">[8]</a> and returns the new wrapper:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<p>I wish I was bold.</p>"</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"b"</span><span class="p">))</span>
|
||
<span class="c"># <b>I wish I was bold.</b></span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"div"</span><span class="p">))</span>
|
||
<span class="c"># <div><p><b>I wish I was bold.</b></p></div></span>
|
||
</code></pre>
|
||
<p>This method was added in Beautiful Soup 4.0.5.</p>
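<p>Each call to <tt class="docutils literal"><span class="pre">wrap()</span></tt> needs its own wrapper tag, so wrapping several elements means creating a new tag per element. A minimal sketch with assumed markup and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>one</a><a>two</a>", "html.parser")

# Create a fresh <li> for every link; reusing one wrapper tag
# would move it from link to link instead of wrapping each.
for link in soup.find_all("a"):
    link.wrap(soup.new_tag("li"))

print(soup)  # <li><a>one</a></li><li><a>two</a></li>
```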
|
||
</div>
|
||
<div class="section" id="unwrap">
|
||
<h2>unwrap()</h2>
|
||
<p><tt class="docutils literal"><span class="pre">Tag.unwrap()</span></tt> is the opposite of <tt class="docutils literal"><span class="pre">wrap()</span></tt>. It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
|
||
|
||
<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
|
||
<span class="n">a_tag</span>
|
||
<span class="c"># <a href="http://example.com/">I linked to example.com</a></span>
|
||
</code></pre>
|
||
<p>Like <tt class="docutils literal"><span class="pre">replace_with()</span></tt>, <tt class="docutils literal"><span class="pre">unwrap()</span></tt> returns the tag that was replaced.</p>
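<p>Stripping inline formatting from a paragraph might look like the sketch below. The markup, the choice of <b> and <i> as the tags to strip, and the <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> argument are assumptions for illustration.</p>

```python
from bs4 import BeautifulSoup

markup = "<p>Some <b>bold</b> and <i>italic</i> text</p>"
soup = BeautifulSoup(markup, "html.parser")

# Passing a list to find_all() matches any of the named tags.
for tag in soup.find_all(["b", "i"]):
    tag.unwrap()  # keep the text, drop the surrounding tag

print(soup)  # <p>Some bold and italic text</p>
```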
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id44">
|
||
<h1>Output</h1>
|
||
<div class="section" id="id45">
|
||
<h2>Pretty-printing</h2>
|
||
<p>The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each XML/HTML tag on its own line:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">()</span>
|
||
<span class="c"># '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'</span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <html></span>
|
||
<span class="c"># <head></span>
|
||
<span class="c"># </head></span>
|
||
<span class="c"># <body></span>
|
||
<span class="c"># <a href="http://example.com/"></span>
|
||
<span class="c"># I linked to</span>
|
||
<span class="c"># <i></span>
|
||
<span class="c"># example.com</span>
|
||
<span class="c"># </i></span>
|
||
<span class="c"># </a></span>
|
||
<span class="c"># </body></span>
|
||
<span class="c"># </html></span>
|
||
</code></pre>
|
||
<p>You can call <tt class="docutils literal"><span class="pre">prettify()</span></tt> on the top-level <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or on any of its tag nodes:</p>
|
||
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <a href="http://example.com/"></span>
|
||
<span class="c"># I linked to</span>
|
||
<span class="c"># <i></span>
|
||
<span class="c"># example.com</span>
|
||
<span class="c"># </i></span>
|
||
<span class="c"># </a></span>
|
||
</code></pre>
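<p>The layout rule can be checked programmatically. The sketch below assumes Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>; it strips the indentation from each line because the exact indentation is a formatting detail that is not relied on here:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

pretty = soup.a.prettify()

# Drop the leading indentation so only the content of each line remains.
lines = [line.strip() for line in pretty.splitlines()]

# Every tag and every run of text ends up on a line of its own.
for line in lines:
    print(line)
```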
|
||
</div>
|
||
<div class="section" id="id46">
|
||
<h2>Non-pretty printing</h2>
|
||
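<p>If you just want a string, with no fancy formatting, you can call Python's <tt class="docutils literal"><span class="pre">unicode()</span></tt> or <tt class="docutils literal"><span class="pre">str()</span></tt> on a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or a <tt class="docutils literal"><span class="pre">Tag</span></tt> object (<tt class="docutils literal"><span class="pre">unicode()</span></tt> exists only in Python 2; in Python 3, <tt class="docutils literal"><span class="pre">str()</span></tt> already returns Unicode text):</p>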
|
||
<pre><code class="language-python"><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
|
||
<span class="c"># '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'</span>
|
||
|
||
<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="p">)</span>
|
||
<span class="c"># u'<a href="http://example.com/">I linked to <i>example.com</i></a>'</span>
|
||
</code></pre>
|
||
<p>Under Python 2, <tt class="docutils literal"><span class="pre">str()</span></tt> returns a string encoded in UTF-8. See <a class="reference internal" href="#id51">Encodings</a> for other options.</p>
|
||
<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> to get a bytestring, or <tt class="docutils literal"><span class="pre">decode()</span></tt> to get Unicode.</p>
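<p>A short sketch of the three calls side by side, assuming Python 3 (where <tt class="docutils literal"><span class="pre">str()</span></tt> returns Unicode text) and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. The variable names are illustrative:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup.a)              # Unicode text, no added whitespace
as_bytes = soup.a.encode("utf-8")  # bytestring in the requested encoding
as_text = soup.a.decode()          # Unicode text, same as str() here

print(compact)
print(type(as_bytes).__name__, type(as_text).__name__)
```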
|
||
</div>
|
||
<div class="section" id="id47">
|
||
<h2>Output formatters</h2>
|
||
<p>If you give Beautiful Soup a document that contains HTML entities such as “&amp;ldquo;”, they are converted to Unicode characters:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&ldquo;Dammit!&rdquo; he said."</span><span class="p">)</span>
|
||
<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
|
||
<span class="c"># u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'</span>
|
||
</code></pre>
|
||
<p>If you then convert the document to a string, the Unicode characters are encoded as UTF-8. You will not get the HTML entities back:</p>
|
||
<pre><code class="language-python"><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
|
||
<span class="c"># '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'</span>
|
||
</code></pre>
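<p>The same round trip as a Python 3 sketch, assuming the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. It shows that the entity names are gone after parsing and that encoding the tree produces the UTF-8 bytes of the characters themselves:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

# After parsing, the entities are ordinary Unicode characters.
text = soup.get_text()

# Encoding writes those characters out as UTF-8 bytes, not as entities.
utf8_bytes = soup.encode("utf-8")

print(repr(text))
print(utf8_bytes)
```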
|
||
</div>
|
||
<div class="section" id="get-text">
|
||
<h2>get_text()</h2>
|
||
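<p>If you only want the text part of a document or tag, you can call the <tt class="docutils literal"><span class="pre">get_text()</span></tt> method. It returns all the text in a document or beneath a tag, including the text of descendant tags, as a single Unicode string:</p>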
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">'<a href="http://example.com/"></span><span class="se">\n</span><span class="s">I linked to <i>example.com</i></span><span class="se">\n</span><span class="s"></a>'</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
|
||
<span class="c"># u'\nI linked to example.com\n'</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
|
||
<span class="c"># u'example.com'</span>
|
||
</code></pre>
|
||
<p>You can specify a string to be used to join the bits of text together:</p>
|
||
<pre><code class="language-python"><span class="c"># soup.get_text("|")</span>
|
||
<span class="s">u'</span><span class="se">\n</span><span class="s">I linked to |example.com|</span><span class="se">\n</span><span class="s">'</span>
|
||
</code></pre>
|
||
<p>You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:</p>
|
||
<pre><code class="language-python"><span class="c"># soup.get_text("|", strip=True)</span>
|
||
<span class="s">u'I linked to|example.com'</span>
|
||
</code></pre>
|
||
<p>Alternatively, use the <a class="reference internal" href="#strings-stripped-strings">.stripped_strings</a> generator and process the resulting pieces of text yourself:</p>
|
||
<pre><code class="language-python"><span class="p">[</span><span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">]</span>
|
||
<span class="c"># [u'I linked to', u'example.com']</span>
|
||
</code></pre>
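<p>The three variants above can be compared in one runnable Python 3 sketch. It assumes the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>, which does not wrap the fragment in extra tags, so the results contain only the text of the link:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

raw = soup.get_text()                    # text exactly as it appears
joined = soup.get_text("|", strip=True)  # stripped pieces joined by "|"
pieces = list(soup.stripped_strings)     # the same pieces as a list

print(repr(raw))     # '\nI linked to example.com\n'
print(repr(joined))  # 'I linked to|example.com'
print(pieces)        # ['I linked to', 'example.com']
```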
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id48">
|
||
<h1>Specifying the parser to use</h1>
|
||
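<p>If you just need to parse some HTML, you can pass the markup to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor and it will usually be fine. Beautiful Soup picks a parser for you and parses the data. You can also pass an extra argument to choose which parser is used for the current document.</p>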
|
||
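<p>The first argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor is a string or an open filehandle containing the markup to parse. The second argument says how the markup should be parsed. If you leave it out, Beautiful Soup picks the best parser installed on the system, ranking lxml first, then html5lib, then Python's built-in parser. You can override that choice by specifying one of the following:</p>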
|
||
<blockquote>
|
||
<div><ul class="simple">
|
||
<li>The type of markup to parse. Currently supported: “html”, “xml”, and “html5”.</li>
|
||
<li>The name of the parser library to use. Currently supported: “lxml”, “html5lib”, and “html.parser”.</li>
|
||
</ul>
|
||
</div></blockquote>
|
||
<p>The <a class="reference internal" href="#id9">Installing a parser</a> section lists the available parsers and explains how to install them.</p>
|
||
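<p>If the parser you ask for is not installed, Beautiful Soup picks a different one. Right now, the only supported XML parser is lxml. If lxml is not installed, you cannot get a parsed XML document when you create a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, whether you ask for an XML parser or for “lxml” by name.</p>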
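<p>Naming the parser explicitly makes the choice visible in code. The sketch below is one possible way to try parsers in a preferred order; it assumes a recent Beautiful Soup release, which raises <tt class="docutils literal"><span class="pre">bs4.FeatureNotFound</span></tt> when the requested parser library is missing rather than silently falling back. The helper name <tt class="docutils literal"><span class="pre">make_soup</span></tt> is hypothetical and not part of Beautiful Soup:</p>

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # The requested parser library is missing; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_name = make_soup("<p>Hello</p>")
print(parser_name, soup.p.string)
```

<p>Because <tt class="docutils literal"><span class="pre">html.parser</span></tt> ships with Python, the last candidate in the default list is always available.</p>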
|
||
<div class="section" id="id49">
|
||
<h2>Differences between parsers</h2>
|
||
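<p>Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers can create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here is a short fragment parsed as HTML:</p>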
|
||
<pre><code class="language-python"><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a><b /></a>"</span><span class="p">)</span>
|
||
<span class="c"># <html><head></head><body><a><b></b></a></body></html></span>
|
||
</code></pre>
|
||
<p>Since an empty &lt;b /&gt; tag is not valid HTML, the parser turns it into a &lt;b&gt;&lt;/b&gt; tag pair.</p>
|
||
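<p>Here is the same document parsed as XML (parsing XML requires lxml to be installed). Note that the empty &lt;b /&gt; tag is left alone, and that the document is given an XML declaration instead of being wrapped in an &lt;html&gt; tag:</p>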
|
||
<pre><code class="language-python"><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a><b /></a>"</span><span class="p">,</span> <span class="s">"xml"</span><span class="p">)</span>
|
||
<span class="c"># <?xml version="1.0" encoding="utf-8"?></span>
|
||
<span class="c"># <a><b/></a></span>
|
||
</code></pre>
|
||
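<p>There are also differences between the HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter. One parser may be faster than another, but they all return a correct parse tree.</p>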
|
||
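<p>But if the document is not perfectly formed, different parsers can give different results. In the following example, lxml parses a malformed document and the dangling &lt;/p&gt; tag is simply ignored:</p>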
|
||
<pre><code class="language-python"><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a></p>"</span><span class="p">,</span> <span class="s">"lxml"</span><span class="p">)</span>
|
||
<span class="c"># <html><body><a></a></body></html></span>
|
||
</code></pre>
|
||
<p>Parsing the same document with html5lib gives a different result:</p>
|
||
<pre><code class="language-python"><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a></p>"</span><span class="p">,</span> <span class="s">"html5lib"</span><span class="p">)</span>
|
||
<span class="c"># <html><head></head><body><a><p></p></a></body></html></span>
|
||
</code></pre>
|
||
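<p>Instead of ignoring the dangling &lt;/p&gt; tag, html5lib pairs it with an opening &lt;p&gt; tag. It also adds an empty &lt;head&gt; tag to the tree.</p>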
|
||
<p>Here is the same document parsed with Python's built-in HTML parser:</p>
|
||
<pre><code class="language-python"><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"<a></p>"</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">)</span>
|
||
<span class="c"># <a></a></span>
|
||
</code></pre>
|
||
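<p>Like lxml <a class="footnote-reference" href="#id88" id="id50">[7]</a>, Python's built-in parser ignores the closing &lt;/p&gt; tag. Unlike html5lib, it makes no attempt to create a well-formed HTML document or to wrap the fragment in a &lt;body&gt; tag. Unlike lxml, it does not even add an &lt;html&gt; tag.</p>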
|
||
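<p>Since the fragment “&lt;a&gt;&lt;/p&gt;” is invalid, none of these techniques is the single “correct” way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim to being “correct”, but all three results are legitimate.</p>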
|
||
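<p>Differences between parsers can affect the results of your code. If you plan to distribute code that uses <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>, it is best to state which parser it was written for, to avoid unnecessary trouble.</p>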
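<p>To see which results apply on a given machine, the comparison above can be run against whichever parsers happen to be installed. This is a sketch assuming a recent Beautiful Soup release that raises <tt class="docutils literal"><span class="pre">bs4.FeatureNotFound</span></tt> for a missing parser; the <tt class="docutils literal"><span class="pre">results</span></tt> dictionary is illustrative:</p>

```python
from bs4 import BeautifulSoup, FeatureNotFound

results = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        results[parser] = str(BeautifulSoup("<a></p>", parser))
    except FeatureNotFound:
        # This parser library is not installed on this machine.
        results[parser] = None

for parser, tree in results.items():
    print(parser, "->", tree if tree is not None else "(not installed)")
```

<p>Only the <tt class="docutils literal"><span class="pre">html.parser</span></tt> entry is guaranteed to be present, since that parser ships with Python.</p>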
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id51">
|
||
<h1>Encodings</h1>
|
||
<p>Any HTML or XML document is written in a specific encoding such as ASCII or UTF-8. But when you load that document into Beautiful Soup, it is converted to Unicode:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">"<h1>Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!</h1>"</span>
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">h1</span>
|
||
<span class="c"># <h1>Sacré bleu!</h1></span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">h1</span><span class="o">.</span><span class="n">string</span>
|
||
<span class="c"># u'Sacr\xe9 bleu!'</span>
|
||
</code></pre>
|
||
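<p>This is not magic, although it can look like it. Beautiful Soup uses a sub-library called <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the <tt class="docutils literal"><span class="pre">.original_encoding</span></tt> attribute of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>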
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">original_encoding</span>
|
||
<span class="s">'utf-8'</span>
|
||
</code></pre>
|
||
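<p><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly only after a byte-by-byte search of the whole document, which is slow. If you know a document's encoding ahead of time, you can avoid both the mistakes and the delay by passing it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as the <tt class="docutils literal"><span class="pre">from_encoding</span></tt> argument.</p>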
|
||
<p>The following document is written in ISO-8859-8. It is so short that Beautiful Soup misidentifies it as ISO-8859-7:</p>
|
||
<div class="highlight-python"><pre>markup = b"<h1>\xed\xe5\xec\xf9</h1>"
|
||
soup = BeautifulSoup(markup)
|
||
soup.h1
|
||
<h1>νεμω</h1>
|
||
soup.original_encoding
|
||
'ISO-8859-7'</pre>
|
||
</div>
|
||
<p>You can fix this by passing the correct encoding as <tt class="docutils literal"><span class="pre">from_encoding</span></tt>:</p>
|
||
<div class="highlight-python"><pre>soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
|
||
soup.h1
|
||
<h1>םולש</h1>
|
||
soup.original_encoding
|
||
'iso8859-8'</pre>
|
||
</div>
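<p>A Python 3 sketch of the same fix, assuming the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. It compares the parsed text with what Python's own ISO-8859-8 codec produces for the same bytes, so no expected string has to be typed by hand:</p>

```python
from bs4 import BeautifulSoup

# Four Hebrew letters encoded as ISO-8859-8, wrapped in an <h1> tag.
markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# Tell Beautiful Soup the encoding instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

expected = b"\xed\xe5\xec\xf9".decode("iso-8859-8")
print(soup.h1.string == expected)
print(soup.original_encoding)
```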
|
||
<p>Rarely (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode is to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �) <a class="footnote-reference" href="#id90" id="id52">[9]</a>. If Beautiful Soup has to do this while guessing the encoding, it sets the <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> attribute to <tt class="docutils literal"><span class="pre">True</span></tt> on the <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> or <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object. This tells you that the Unicode representation is not an exact copy of the original and that some data was lost. If a document contains � but <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> is <tt class="docutils literal"><span class="pre">False</span></tt>, the � was present in the original document and does not stand in for missing data.</p>
|
||
<div class="section" id="id53">
|
||
<h2>Output encoding</h2>
|
||
<p>When you write out a document from Beautiful Soup, you get a UTF-8 document, whatever the encoding of the input was. The following input document is written in Latin-1:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">'''</span>
|
||
<span class="s"><html></span>
|
||
<span class="s"> <head></span>
|
||
<span class="s"> <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" /></span>
|
||
<span class="s"> </head></span>
|
||
<span class="s"> <body></span>
|
||
<span class="s"> <p>Sacr</span><span class="se">\xe9</span><span class="s"> bleu!</p></span>
|
||
<span class="s"> </body></span>
|
||
<span class="s"></html></span>
|
||
<span class="s">'''</span>
|
||
|
||
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <html></span>
|
||
<span class="c"># <head></span>
|
||
<span class="c"># <meta content="text/html; charset=utf-8" http-equiv="Content-type" /></span>
|
||
<span class="c"># </head></span>
|
||
<span class="c"># <body></span>
|
||
<span class="c"># <p></span>
|
||
<span class="c"># Sacré bleu!</span>
|
||
<span class="c"># </p></span>
|
||
<span class="c"># </body></span>
|
||
<span class="c"># </html></span>
|
||
</code></pre>
|
||
<p>Note that the &lt;meta&gt; tag in the output has been rewritten to reflect the fact that the document is now in UTF-8.</p>
|
||
<p>If you do not want UTF-8, you can pass an encoding into <tt class="docutils literal"><span class="pre">prettify()</span></tt>:</p>
|
||
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">))</span>
|
||
<span class="c"># <html></span>
|
||
<span class="c"># <head></span>
|
||
<span class="c"># <meta content="text/html; charset=latin-1" http-equiv="Content-type" /></span>
|
||
<span class="c"># ...</span>
|
||
</code></pre>
|
||
<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> on the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or on any node in the tree, just as you would call <tt class="docutils literal"><span class="pre">encode()</span></tt> on a Python string:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
|
||
<span class="c"># '<p>Sacr\xe9 bleu!</p>'</span>
|
||
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">)</span>
|
||
<span class="c"># '<p>Sacr\xc3\xa9 bleu!</p>'</span>
|
||
</code></pre>
|
||
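<p>Any characters that cannot be represented in the chosen encoding are converted into numeric XML entity references. The following document contains the Unicode character SNOWMAN:</p>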
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="s">u"<b></span><span class="se">\N{SNOWMAN}</span><span class="s"></b>"</span>
|
||
<span class="n">snowman_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
|
||
<span class="n">tag</span> <span class="o">=</span> <span class="n">snowman_soup</span><span class="o">.</span><span class="n">b</span>
|
||
</code></pre>
|
||
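<p>The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but some encodings, such as ISO-Latin-1 and ASCII, have no representation for it. For those encodings it is converted to “&amp;#9731;”:</p>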
|
||
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">))</span>
|
||
<span class="c"># <b>☃</b></span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">))</span>
|
||
<span class="c"># <b>&#9731;</b></span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"ascii"</span><span class="p">))</span>
|
||
<span class="c"># <b>&#9731;</b></span>
|
||
</code></pre>
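<p>The substitution behaves like Python's own <tt class="docutils literal"><span class="pre">xmlcharrefreplace</span></tt> error handler for <tt class="docutils literal"><span class="pre">str.encode()</span></tt>. A Python 3 sketch, assuming the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>, that puts the two side by side:</p>

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser").b

utf8_bytes = tag.encode("utf-8")   # UTF-8 can represent the snowman directly
ascii_bytes = tag.encode("ascii")  # ASCII cannot, so a reference is used

# The plain-Python equivalent of the substitution, for comparison.
stdlib_bytes = "\N{SNOWMAN}".encode("ascii", errors="xmlcharrefreplace")

print(utf8_bytes)    # b'<b>\xe2\x98\x83</b>'
print(ascii_bytes)   # b'<b>&#9731;</b>'
print(stdlib_bytes)  # b'&#9731;'
```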
|
||
</div>
|
||
<div class="section" id="unicode-dammit">
|
||
<h2>Unicode, Dammit</h2>
|
||
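<p>You can use <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> without using the rest of Beautiful Soup. It is useful whenever you have data in an unknown encoding and just want it to become Unicode:</p>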
|
||
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">UnicodeDammit</span>
|
||
<span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">b</span><span class="s">"Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!"</span><span class="p">)</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
|
||
<span class="c"># Sacré bleu!</span>
|
||
<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
|
||
<span class="c"># 'utf-8'</span>
|
||
</code></pre>
|
||
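<p>The guesses become much more accurate if the <tt class="docutils literal"><span class="pre">chardet</span></tt> or <tt class="docutils literal"><span class="pre">cchardet</span></tt> Python library is installed. The more data you give Unicode, Dammit, the more accurately it guesses. If you have your own suspicions about what the encoding might be, you can pass them in as a list, and those encodings are tried first:</p>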
|
||
<pre><code class="language-python"><span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">"Sacr</span><span class="se">\xe9</span><span class="s"> bleu!"</span><span class="p">,</span> <span class="p">[</span><span class="s">"latin-1"</span><span class="p">,</span> <span class="s">"iso-8859-1"</span><span class="p">])</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
|
||
<span class="c"># Sacré bleu!</span>
|
||
<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
|
||
<span class="c"># 'latin-1'</span>
|
||
</code></pre>
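<p>Under Python 3 the input has to be a bytestring, because text that is already a <tt class="docutils literal"><span class="pre">str</span></tt> needs no decoding. A small sketch of the call with suggested encodings:</p>

```python
from bs4 import UnicodeDammit

# Latin-1 bytes: \xe9 is "é" in that encoding.
data = b"Sacr\xe9 bleu!"

# The second argument lists encodings to try before any guessing happens.
dammit = UnicodeDammit(data, ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)
print(dammit.original_encoding)
```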
|
||
<p><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> has two special features that Beautiful Soup itself does not use.</p>
|
||
<div class="section" id="id54">
|
||
<h3>Smart quotes</h3>
|
||
<p>You can use Unicode, Dammit to convert Microsoft smart quotes <a class="footnote-reference" href="#id91" id="id55">[10]</a> to HTML or XML entities:</p>
|
||
<pre><code class="language-python"><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">"<p>I just </span><span class="se">\x93</span><span class="s">love</span><span class="se">\x94</span><span class="s"> Microsoft Word</span><span class="se">\x92</span><span class="s">s smart quotes</p>"</span>
|
||
|
||
<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">"html"</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
|
||
<span class="c"># u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'</span>
|
||
|
||
<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">"xml"</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
|
||
<span class="c"># u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'</span>
|
||
</code></pre>
|
||
<p>You can also convert Microsoft smart quotes to ASCII quotes:</p>
|
||
<pre><code class="language-python"><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">"ascii"</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
|
||
<span class="c"># u'<p>I just "love" Microsoft Word\'s smart quotes</p>'</span>
|
||
</code></pre>
|
||
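<p>This feature is useful, but Beautiful Soup does not use it. By default, Beautiful Soup converts Microsoft smart quotes to Unicode characters along with everything else:</p>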
|
||
<pre><code class="language-python"><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">])</span><span class="o">.</span><span class="n">unicode_markup</span>
|
||
<span class="c"># u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'</span>
|
||
</code></pre>
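<p>The ASCII conversion and the default behaviour can be compared directly. A Python 3 sketch using the same markup as above; the variable names are illustrative:</p>

```python
from bs4 import UnicodeDammit

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Ask for plain ASCII quotes in place of the Windows-1252 smart quotes.
as_ascii = UnicodeDammit(
    markup, ["windows-1252"], smart_quotes_to="ascii"
).unicode_markup

# Without smart_quotes_to, the quotes become the matching Unicode characters.
as_unicode = UnicodeDammit(markup, ["windows-1252"]).unicode_markup

print(as_ascii)
print(as_unicode)
```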
|
||
</div>
|
||
<div class="section" id="id56">
|
||
<h3>Inconsistent encodings</h3>
|
||
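<p>Sometimes a document is mostly in UTF-8 but also contains Windows-1252 characters such as Microsoft smart quotes <a class="footnote-reference" href="#id91" id="id57">[10]</a>. This tends to happen when a website includes data from multiple sources. The <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> method turns such a document into pure UTF-8. Here is a simple example:</p>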
|
||
<pre><code class="language-python"><span class="n">snowmen</span> <span class="o">=</span> <span class="p">(</span><span class="s">u"</span><span class="se">\N{SNOWMAN}</span><span class="s">"</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span>
|
||
<span class="n">quote</span> <span class="o">=</span> <span class="p">(</span><span class="s">u"</span><span class="se">\N{LEFT DOUBLE QUOTATION MARK}</span><span class="s">I like snowmen!</span><span class="se">\N{RIGHT DOUBLE QUOTATION MARK}</span><span class="s">"</span><span class="p">)</span>
|
||
<span class="n">doc</span> <span class="o">=</span> <span class="n">snowmen</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf8"</span><span class="p">)</span> <span class="o">+</span> <span class="n">quote</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"windows_1252"</span><span class="p">)</span>
|
||
</code></pre>
|
||
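<p>This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both, because their encodings differ:</p>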
|
||
<pre><code class="language-python"><span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
|
||
<span class="c"># ☃☃☃�I like snowmen!�</span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"windows-1252"</span><span class="p">))</span>
|
||
<span class="c"># â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”</span>
|
||
</code></pre>
|
||
<p>Decoding the document as UTF-8 raises a <tt class="docutils literal"><span class="pre">UnicodeDecodeError</span></tt>, and decoding it as Windows-1252 gives you gibberish. Fortunately, <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> converts the string to pure UTF-8, which lets you display the snowmen and the quotation marks at the same time:</p>
|
||
<pre><code class="language-python"><span class="n">new_doc</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="o">.</span><span class="n">detwingle</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
|
||
<span class="k">print</span><span class="p">(</span><span class="n">new_doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf8"</span><span class="p">))</span>
|
||
<span class="c"># ☃☃☃“I like snowmen!”</span>
|
||
</code></pre>
|
||
<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> only knows how to handle Windows-1252 content embedded in UTF-8, but this is the most common case.</p>
|
||
<p>Make sure to call <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> on your data before passing it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> or <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> constructor, so that the document has a single, consistent encoding. If you try to parse a UTF-8 document that contains Windows-1252 characters, you get gibberish such as â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.</p>
|
||
<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> is new in Beautiful Soup 4.1.0.</p>
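<p>The whole sequence can be run as one Python 3 sketch. It first confirms that the mixed document cannot be decoded as UTF-8, then repairs it with <tt class="docutils literal"><span class="pre">detwingle()</span></tt>. The flag <tt class="docutils literal"><span class="pre">decode_failed</span></tt> is illustrative:</p>

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = ("\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
         "\N{RIGHT DOUBLE QUOTATION MARK}")

# One bytestring that mixes two encodings.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# The Windows-1252 quote bytes are not valid UTF-8, so this fails.
try:
    doc.decode("utf8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# detwingle() rewrites the Windows-1252 bytes as UTF-8.
fixed = UnicodeDammit.detwingle(doc).decode("utf8")

print(decode_failed)
print(fixed)
```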
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id58">
|
||
<h1>Parsing only part of a document</h1>
|
||
<p>Suppose you only want to look at a document's &lt;a&gt; tags. Parsing the entire document just to find them wastes time and memory. It is much faster to ignore everything that is not an &lt;a&gt; tag in the first place. The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class lets you choose which parts of an incoming document are parsed, so the whole document does not have to be parsed before you search it. Create a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> object and pass it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as the <tt class="docutils literal"><span class="pre">parse_only</span></tt> argument.</p>
|
||
<div class="section" id="soupstrainer">
|
||
<h2>SoupStrainer</h2>
|
||
<p>The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class takes the same arguments as a typical search method: <a class="reference internal" href="#id32">name</a>, <a class="reference internal" href="#css">attrs</a>, <a class="reference internal" href="#recursive">recursive</a>, <a class="reference internal" href="#text">text</a>, and <a class="reference internal" href="#keyword">**kwargs</a>. Here are three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
|
||
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">SoupStrainer</span>
|
||
|
||
<span class="n">only_a_tags</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
|
||
|
||
<span class="n">only_tags_with_id_link2</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link2"</span><span class="p">)</span>
|
||
|
||
<span class="k">def</span> <span class="nf">is_short_string</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
|
||
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">)</span> <span class="o"><</span> <span class="mi">10</span>
|
||
|
||
<span class="n">only_short_strings</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">is_short_string</span><span class="p">)</span>
|
||
</code></pre>
|
||
<p>Here is the “Alice” document again, parsed with each of the three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects in turn to show how the results differ:</p>
|
||
<pre><code class="language-python"><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
|
||
<span class="s"><html><head><title>The Dormouse's story</title></head></span>
|
||
|
||
<span class="s"><p class="title"><b>The Dormouse's story</b></p></span>
|
||
|
||
<span class="s"><p class="story">Once upon a time there were three little sisters; and their names were</span>
|
||
<span class="s"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,</span>
|
||
<span class="s"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span>
|
||
<span class="s"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span>
|
||
<span class="s">and they lived at the bottom of a well.</p></span>
|
||
|
||
<span class="s"><p class="story">...</p></span>
|
||
<span class="s">"""</span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_a_tags</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/elsie" id="link1"></span>
|
||
<span class="c"># Elsie</span>
|
||
<span class="c"># </a></span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2"></span>
|
||
<span class="c"># Lacie</span>
|
||
<span class="c"># </a></span>
|
||
<span class="c"># <a class="sister" href="http://example.com/tillie" id="link3"></span>
|
||
<span class="c"># Tillie</span>
|
||
<span class="c"># </a></span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_tags_with_id_link2</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># <a class="sister" href="http://example.com/lacie" id="link2"></span>
|
||
<span class="c"># Lacie</span>
|
||
<span class="c"># </a></span>
|
||
|
||
<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_short_strings</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
|
||
<span class="c"># Elsie</span>
|
||
<span class="c"># ,</span>
|
||
<span class="c"># Lacie</span>
|
||
<span class="c"># and</span>
|
||
<span class="c"># Tillie</span>
|
||
<span class="c"># ...</span>
|
||
<span class="c">#</span>
|
||
</code></pre>
|
||
<p>You can also pass a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> into any of the methods covered in <a class="reference internal" href="#id24">Searching the tree</a>. This is probably not a common use, but it is worth mentioning:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
|
||
<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">only_short_strings</span><span class="p">)</span>
|
||
<span class="c"># [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',</span>
|
||
<span class="c"># u'\n\n', u'...', u'\n']</span>
|
||
</code></pre>
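<p>A compact sketch of <tt class="docutils literal"><span class="pre">parse_only</span></tt> in practice, assuming Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. The shortened markup and the variable names are made up for this example:</p>

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

# Only <a> tags (and their contents) are turned into objects.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# The surrounding <p> was never built, so it cannot be found.
print(soup.find("p"))  # None
```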
|
||
</div>
|
||
</div>
|
||
<div class="section" id="id59">
|
||
<h1>Troubleshooting</h1>
|
||
<div class="section" id="id60">
|
||
<h2>diagnose()</h2>
|
||
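<p>If you want to know what Beautiful Soup is doing to a document, pass the document to the <tt class="docutils literal"><span class="pre">diagnose()</span></tt> function (new in Beautiful Soup 4.2.0). Beautiful Soup prints a report that shows how the different parsers handle the document and notes which parsers are available:</p>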
|
||
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4.diagnose</span> <span class="kn">import</span> <span class="n">diagnose</span>
|
||
<span class="n">data</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"bad.html"</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
|
||
<span class="n">diagnose</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
|
||
|
||
<span class="c"># Diagnostic running on Beautiful Soup 4.2.0</span>
|
||
<span class="c"># Python version 2.7.3 (default, Aug 1 2012, 05:16:07)</span>
|
||
<span class="c"># I noticed that html5lib is not installed. Installing it may help.</span>
|
||
<span class="c"># Found lxml version 2.3.2.0</span>
|
||
<span class="c">#</span>
|
||
<span class="c"># Trying to parse your data with html.parser</span>
|
||
<span class="c"># Here's what html.parser did with the document:</span>
|
||
<span class="c"># ...</span>
|
||
</code></pre>
|
||
<p>The output of <tt class="docutils literal"><span class="pre">diagnose()</span></tt> may help you find the cause of the problem. If it does not, you can copy the output and include it when asking others for help.</p>
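<p>Because <tt class="docutils literal"><span class="pre">diagnose()</span></tt> prints its report rather than returning it, one way to keep the text for a bug report is to redirect standard output. This is a minimal sketch. The <tt class="docutils literal"><span class="pre">markup</span></tt> string is a made-up stand-in for the contents of <tt class="docutils literal"><span class="pre">bad.html</span></tt>, and the report body depends on which parsers are installed:</p>

```python
import io
from contextlib import redirect_stdout

from bs4.diagnose import diagnose

# Hypothetical malformed snippet standing in for the contents of "bad.html".
markup = "<p>Unclosed paragraph<b>bold text"

# diagnose() prints its report, so redirect stdout to capture it as a string.
buffer = io.StringIO()
with redirect_stdout(buffer):
    diagnose(markup)
report = buffer.getvalue()

# The first line names the Beautiful Soup version that produced the report.
print(report.splitlines()[0])
```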
|
||
</div>
|
||
<div class="section" id="id61">
|
||
<h2>Errors when parsing a document</h2>
|
||
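<p>There are two kinds of parse errors. One is a crash: Beautiful Soup tries to parse a document and raises an exception, usually <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError</span></tt>. The other is unexpected behavior: the parse tree that Beautiful Soup produces looks very different from the document it was built from.</p>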
|
||
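<p>These errors are almost never caused by Beautiful Soup itself. That is not because Beautiful Soup is exceptionally well written, but because it contains no parsing code of its own. The exceptions come from the underlying parser, so if a parser cannot handle the current document, the best fix is to switch to a different parser. See <a class="reference internal" href="#id9">Installing a parser</a> for details.</p>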
|
||
<p>The most common parse errors are <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">malformed</span> <span class="pre">start</span> <span class="pre">tag</span></tt> and <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">bad</span> <span class="pre">end</span> <span class="pre">tag</span></tt>. Both are raised by Python's built-in parser, and the solution is to <a class="reference internal" href="#id9">install lxml or html5lib</a>.</p>
|
||
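<p>The most common unexpected behavior is failing to find a tag that you can plainly see in the document: <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns [] and <tt class="docutils literal"><span class="pre">find()</span></tt> returns None. This is another problem with Python's built-in parser, which sometimes skips tags it does not understand. Again, the solution is to <a class="reference internal" href="#id9">install lxml or html5lib</a>.</p>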
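<p>Switching parsers can be made explicit in code. The sketch below tries a list of parsers in order and uses the first one that is installed. The helper name <tt class="docutils literal"><span class="pre">make_soup</span></tt> and the preference order are assumptions made for illustration. <tt class="docutils literal"><span class="pre">FeatureNotFound</span></tt> is the exception Beautiful Soup raises when the requested parser is unavailable:</p>

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # That parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Hello, <b>world</b></p>")
print(parser_used)
print(soup.find("b"))
```

<p>Returning the parser name alongside the soup makes it easy to log which parser produced a given tree when results differ between machines.</p>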
|
||
</div>
|
||
<div class="section" id="id62">
|
||
<h2>Version mismatch problems</h2>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">SyntaxError:</span> <span class="pre">Invalid</span> <span class="pre">syntax</span></tt> (on the line <tt class="docutils literal"><span class="pre">ROOT_TAG_NAME</span> <span class="pre">=</span> <span class="pre">u'[document]'</span></tt>): caused by running the Python 2 version of the code under Python 3 without converting it.</li>
|
||
<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">HTMLParser</span></tt>: caused by running the Python 2 version of Beautiful Soup under Python 3.</li>
|
||
<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">html.parser</span></tt>: caused by running the Python 3 version of Beautiful Soup under Python 2.</li>
|
||
<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">BeautifulSoup</span></tt>: caused by running Beautiful Soup 3 code in an environment that does not have Beautiful Soup 3 installed, or by forgetting that Beautiful Soup 4 code must import from the <tt class="docutils literal"><span class="pre">bs4</span></tt> package.</li>
|
||
<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">bs4</span></tt>: caused by running Beautiful Soup 4 code in an environment that does not have Beautiful Soup 4 installed.</li>
|
||
</ul>
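<p>Most of the errors listed above come down to which interpreter is running and which Beautiful Soup it can import. A quick check along the following lines can confirm both. This is only a sketch, and the printed messages are illustrative wording:</p>

```python
import sys

try:
    import bs4  # Beautiful Soup 4 lives in the "bs4" package
except ImportError:
    bs4 = None  # Beautiful Soup 4 is not installed for this interpreter

print("Python executable:", sys.executable)
print("Python version:", sys.version.split()[0])
if bs4 is None:
    print("bs4 is not importable here; install the beautifulsoup4 package")
else:
    print("Beautiful Soup version:", bs4.__version__)
    print("Loaded from:", bs4.__file__)
```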
|
||
</div>
|
||
<div class="section" id="xml">
|
||
<h2>Parsing XML</h2>
|
||
<p>By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass "xml" as the second argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor:</p>
|
||
<pre><code class="language-python"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="s">"xml"</span><span class="p">)</span>
|
||
</code></pre>
|
||
<p>You will also need to <a class="reference internal" href="#id9">install lxml</a>.</p>
|
||
</div>
|
||
<div class="section" id="id63">
|
||
<h2>Other parser problems</h2>
|
||
<ul class="simple">
|
||
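<li>If the same code produces different results in different environments, the two environments probably have different parsers available. For example, one environment may have lxml installed while the other has only html5lib. <a class="reference internal" href="#id49">Differences between parsers</a> explains why this matters. The fix is to name a specific parser in the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.</li>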
|
||
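<li>Because HTML tags and attributes are <a class="reference external" href="http://www.w3.org/TR/html5/syntax.html#syntax">case-insensitive</a>, all three HTML parsers convert tag and attribute names to lowercase when they process a document. For example, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve uppercase or mixed-case tags and attributes, <a class="reference internal" href="#xml">parse the document as XML</a>.</li>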
|
||
</ul>
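<p>Both points above can be seen in a few lines. The sketch below names the parser explicitly and shows the lowercasing done by an HTML parser. The XML branch is wrapped in a <tt class="docutils literal"><span class="pre">try</span></tt> block because it only works when lxml is installed. The sample markup is invented for the example:</p>

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = '<TAG ATTR="value">text</TAG>'

# Naming the parser explicitly gives the same tree in every environment
# where that parser is installed.
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup)  # tag and attribute names come back in lowercase

# The "xml" parser keeps the original case, but it needs lxml.
try:
    xml_soup = BeautifulSoup(markup, "xml")
    print(xml_soup.find("TAG"))
except FeatureNotFound:
    xml_soup = None
    print("lxml is not installed, so the XML parser is unavailable")
```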
|
||
</div>
|
||
<div class="section" id="id65">
|
||
<h2>Miscellaneous</h2>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">UnicodeEncodeError:</span> <span class="pre">'charmap'</span> <span class="pre">codec</span> <span class="pre">can't</span> <span class="pre">encode</span> <span class="pre">character</span> <span class="pre">u'\xfoo'</span> <span class="pre">in</span> <span class="pre">position</span> <span class="pre">bar</span></tt> (or just about any other <tt class="docutils literal"><span class="pre">UnicodeEncodeError</span></tt>): this has two main causes, neither of which is a problem with Beautiful Soup. The first is that the console you are printing to cannot display some Unicode characters (see the <a class="reference external" href="http://wiki.Python.org/moin/PrintFails">Python wiki</a>). The second is that you are writing to a file whose encoding does not support some Unicode characters. In that case, encode the string as UTF-8 with <tt class="docutils literal"><span class="pre">u.encode("utf8")</span></tt> before writing it.</li>
|
||
<li><tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">[attr]</span></tt>: caused by accessing <tt class="docutils literal"><span class="pre">tag['attr']</span></tt> when the tag does not define that attribute. The most common cases are <tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">'href'</span></tt> and <tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">'class'</span></tt>. If you are not sure an attribute is defined, use <tt class="docutils literal"><span class="pre">tag.get('attr')</span></tt>, just as you would with a Python dictionary.</li>
|
||
<li><tt class="docutils literal"><span class="pre">AttributeError:</span> <span class="pre">'ResultSet'</span> <span class="pre">object</span> <span class="pre">has</span> <span class="pre">no</span> <span class="pre">attribute</span> <span class="pre">'foo'</span></tt>: usually caused by treating the result of <tt class="docutils literal"><span class="pre">find_all()</span></tt> as a single tag or string. It actually returns a list of tags and strings, a <tt class="docutils literal"><span class="pre">ResultSet</span></tt> object. Iterate over the list and read the <tt class="docutils literal"><span class="pre">.foo</span></tt> attribute of each item, or use <tt class="docutils literal"><span class="pre">find()</span></tt> if you only want one result.</li>
|
||
<li><tt class="docutils literal"><span class="pre">AttributeError:</span> <span class="pre">'NoneType'</span> <span class="pre">object</span> <span class="pre">has</span> <span class="pre">no</span> <span class="pre">attribute</span> <span class="pre">'foo'</span></tt>: usually caused by calling <tt class="docutils literal"><span class="pre">find()</span></tt> and then immediately accessing the .foo attribute of the result. In this case <tt class="docutils literal"><span class="pre">find()</span></tt> did not find anything, so it returned <tt class="docutils literal"><span class="pre">None</span></tt> instead of a tag or string. You need to work out why <tt class="docutils literal"><span class="pre">find()</span></tt> returned <tt class="docutils literal"><span class="pre">None</span></tt>.</li>
|
||
</ul>
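<p>The defensive patterns behind the four items above can be sketched together. The markup and the variable names are invented for illustration. The behavior shown (<tt class="docutils literal"><span class="pre">get()</span></tt> returning <tt class="docutils literal"><span class="pre">None</span></tt>, <tt class="docutils literal"><span class="pre">find_all()</span></tt> returning a list-like object, <tt class="docutils literal"><span class="pre">find()</span></tt> returning <tt class="docutils literal"><span class="pre">None</span></tt>) is what the list describes:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="first" href="/one">one</a><a id="second">two</a>', "html.parser"
)

# KeyError: tag["href"] raises when the attribute is missing,
# while tag.get() returns None or a fallback of your choosing.
second = soup.find("a", id="second")
print(second.get("href"))
print(second.get("href", "#"))

# ResultSet: find_all() returns a list-like object, so loop over it.
ids = [tag["id"] for tag in soup.find_all("a")]
print(ids)

# NoneType: find() returns None when nothing matches, so check before use.
table = soup.find("table")
caption = table.get_text() if table is not None else ""

# UnicodeEncodeError: encode explicitly before writing to a byte-oriented target.
text = "caf\u00e9"
encoded = text.encode("utf8")
print(encoded)
```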
|
||
</div>
|
||
<div class="section" id="id66">
|
||
<h2>Improving performance</h2>
|
||
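<p>Beautiful Soup will never be faster than the parser it sits on top of. If response time is critical, or if computer time costs you more than programmer time, you should use <a class="reference external" href="http://lxml.de/">lxml</a> directly.</p>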
|
||
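<p>That said, there are ways to speed up Beautiful Soup. If you are not already using lxml as the underlying parser, start there: Beautiful Soup parses documents significantly faster with lxml than with html5lib or Python's built-in parser.</p>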
|
||
<p>Installing <a class="reference external" href="http://pypi.Python.org/pypi/cchardet/">cchardet</a> makes encoding detection much faster.</p>
|
||
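<p><a class="reference internal" href="#id58">Parsing only part of a document</a> will not save much parsing time, but it can save a lot of memory, and it makes searching the document much faster.</p>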
|
||
</div>
|
||
</div>
|
||
<div class="section" id="beautiful-soup-3">
|
||
<h1>Beautiful Soup 3</h1>
|
||
<p>Beautiful Soup 3 is the previous release series and is no longer maintained. It is still packaged by several major Linux distributions:</p>
|
||
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-beautifulsoup</span></tt></p>
|
||
<p>It is also published on PyPI under the name <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>:</p>
|
||
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">BeautifulSoup</span></tt></p>
|
||
<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">BeautifulSoup</span></tt></p>
|
||
<p>Alternatively, install it from the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz">Beautiful Soup 3.2.0 source tarball</a>.</p>
|
||
<p>The Beautiful Soup 3 documentation is available <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">online</a>, and there is also a <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">Chinese translation</a>. Reading it alongside this document shows what has changed in Beautiful Soup 4.</p>
|
||
<div class="section" id="id70">
|
||
<h2>Porting code to BS4</h2>
|
||
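<p>Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one small change: the way the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> class is imported. Find the import that looks like this:</p>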
|
||
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">BeautifulSoup</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
|
||
</code></pre>
|
||
<p>Change it to:</p>
|
||
<pre><code class="language-python"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
|
||
</code></pre>
|
||
<ul class="simple">
|
||
<li>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> "No module named BeautifulSoup", you are probably trying to run Beautiful Soup 3 code in an environment that only has Beautiful Soup 4 installed.</li>
|
||
<li>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> "No module named bs4", you are probably trying to run Beautiful Soup 4 code in an environment that only has Beautiful Soup 3 installed.</li>
|
||
</ul>
|
||
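<p>Although BS4 is mostly backwards compatible with BS3, most BS3 method names have been deprecated and renamed to follow <a class="reference external" href="http://www.Python.org/dev/peps/pep-0008/">PEP 8</a>. Many names changed, but only a few of the changes break backwards compatibility.</p>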
|
||
<p>The following sections cover what you need to know when porting BS3 code to BS4.</p>
|
||
<div class="section" id="id71">
|
||
<h3>You need a parser</h3>
|
||
<p>Beautiful Soup 3 used Python's <tt class="docutils literal"><span class="pre">SGMLParser</span></tt>, a module that was removed in Python 3. Beautiful Soup 4 uses the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> by default, but you can use the third-party lxml or html5lib libraries instead. See <a class="reference internal" href="#id9">Installing a parser</a>.</p>
|
||
<p>Because <tt class="docutils literal"><span class="pre">html.parser</span></tt> is not the same parser as <tt class="docutils literal"><span class="pre">SGMLParser</span></tt>, the two produce different results on invalid markup. Usually <tt class="docutils literal"><span class="pre">html.parser</span></tt> raises an exception in those cases, which is why installing a third-party parser is recommended. Sometimes <tt class="docutils literal"><span class="pre">html.parser</span></tt> produces a tree with a different structure than <tt class="docutils literal"><span class="pre">SGMLParser</span></tt> did. If that happens, you need to update your BS3 scraping code to work with the new tree.</p>
|
||
</div>
|
||
<div class="section" id="id72">
|
||
<h3>Method names</h3>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">renderContents</span></tt> -> <tt class="docutils literal"><span class="pre">encode_contents</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">replaceWith</span></tt> -> <tt class="docutils literal"><span class="pre">replace_with</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">replaceWithChildren</span></tt> -> <tt class="docutils literal"><span class="pre">unwrap</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findAll</span></tt> -> <tt class="docutils literal"><span class="pre">find_all</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findAllNext</span></tt> -> <tt class="docutils literal"><span class="pre">find_all_next</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findAllPrevious</span></tt> -> <tt class="docutils literal"><span class="pre">find_all_previous</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findNext</span></tt> -> <tt class="docutils literal"><span class="pre">find_next</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findNextSibling</span></tt> -> <tt class="docutils literal"><span class="pre">find_next_sibling</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findNextSiblings</span></tt> -> <tt class="docutils literal"><span class="pre">find_next_siblings</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findParent</span></tt> -> <tt class="docutils literal"><span class="pre">find_parent</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findParents</span></tt> -> <tt class="docutils literal"><span class="pre">find_parents</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findPrevious</span></tt> -> <tt class="docutils literal"><span class="pre">find_previous</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findPreviousSibling</span></tt> -> <tt class="docutils literal"><span class="pre">find_previous_sibling</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">findPreviousSiblings</span></tt> -> <tt class="docutils literal"><span class="pre">find_previous_siblings</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">nextSibling</span></tt> -> <tt class="docutils literal"><span class="pre">next_sibling</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">previousSibling</span></tt> -> <tt class="docutils literal"><span class="pre">previous_sibling</span></tt></li>
|
||
</ul>
|
||
<p>Some arguments to the BeautifulSoup constructor were also renamed:</p>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">BeautifulSoup(parseOnlyThese=...)</span></tt> -> <tt class="docutils literal"><span class="pre">BeautifulSoup(parse_only=...)</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">BeautifulSoup(fromEncoding=...)</span></tt> -> <tt class="docutils literal"><span class="pre">BeautifulSoup(from_encoding=...)</span></tt></li>
|
||
</ul>
|
||
<p>One method was renamed for compatibility with Python 3:</p>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">Tag.has_key()</span></tt> -> <tt class="docutils literal"><span class="pre">Tag.has_attr()</span></tt></li>
|
||
</ul>
|
||
<p>One attribute was renamed to use more accurate terminology:</p>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">Tag.isSelfClosing</span></tt> -> <tt class="docutils literal"><span class="pre">Tag.is_empty_element</span></tt></li>
|
||
</ul>
|
||
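<p>Three attributes were renamed to avoid clashing with names that have special meaning in Python. Unlike the other renames, these changes are not backwards compatible: BS3 code that uses these attributes will break under BS4 until you update it.</p>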
|
||
<ul class="simple">
|
||
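<li><tt class="docutils literal"><span class="pre">UnicodeDammit.unicode</span></tt> -> <tt class="docutils literal"><span class="pre">UnicodeDammit.unicode_markup</span></tt></li>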
|
||
<li><tt class="docutils literal"><span class="pre">Tag.next</span></tt> -> <tt class="docutils literal"><span class="pre">Tag.next_element</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">Tag.previous</span></tt> -> <tt class="docutils literal"><span class="pre">Tag.previous_element</span></tt></li>
|
||
</ul>
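<p>A short sketch of BS4-style code using several of the new names from this section, with the BS3 name noted beside each call. The markup is made up for the example:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1" href="/elsie">Elsie</a><br/><a id="link2">Lacie</a></p>',
    "html.parser",
)

links = soup.find_all("a")                     # BS3: findAll
first, second = links

next_link_id = first.find_next_sibling("a")["id"]   # BS3: findNextSibling
sibling_name = first.next_sibling.name              # BS3: nextSibling
has_href = second.has_attr("href")                  # BS3: has_key
br_is_empty = soup.br.is_empty_element              # BS3: isSelfClosing
following = first.next_element                      # BS3: next

print(next_link_id, sibling_name, has_href, br_is_empty, repr(following))
```

<p>Note that <tt class="docutils literal"><span class="pre">next_sibling</span></tt> stays on the same level of the tree (here the <tt class="docutils literal"><span class="pre">br</span></tt> tag), while <tt class="docutils literal"><span class="pre">next_element</span></tt> follows parse order and so descends into the link's own text first.</p>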
|
||
</div>
|
||
<div class="section" id="id73">
|
||
<h3>Generators</h3>
|
||
<p>The following generators were given PEP 8-compliant names and turned into properties:</p>
|
||
<ul class="simple">
|
||
<li><tt class="docutils literal"><span class="pre">childGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">children</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">nextGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">next_elements</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">nextSiblingGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">next_siblings</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">previousGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">previous_elements</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">previousSiblingGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">previous_siblings</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">recursiveChildGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">descendants</span></tt></li>
|
||
<li><tt class="docutils literal"><span class="pre">parentGenerator()</span></tt> -> <tt class="docutils literal"><span class="pre">parents</span></tt></li>
|
||
</ul>
|
||
<p>So when porting to BS4, look for code like this:</p>
|
||
<pre><code class="language-python"><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parentGenerator</span><span class="p">():</span>
|
||
<span class="o">...</span>
|
||
</code></pre>
|
||
<p>Replace it with:</p>
|
||
<pre><code class="language-python"><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
|
||
<span class="o">...</span>
|
||
</code></pre>
|
||
<p>(The old form still works as well.)</p>
|
||
<p>Some BS3 generators yielded <tt class="docutils literal"><span class="pre">None</span></tt> after they were done and then stopped. That was a bug. The new generators no longer yield <tt class="docutils literal"><span class="pre">None</span></tt>.</p>
|
||
<p>BS4 adds two new generators, <a class="reference internal" href="#strings-stripped-strings">.strings and .stripped_strings</a>. <tt class="docutils literal"><span class="pre">.strings</span></tt> yields NavigableString objects, and <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> yields Python strings with leading and trailing whitespace removed.</p>
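<p>The property-style generators can be tried on a tiny document. The sketch below uses <tt class="docutils literal"><span class="pre">.parents</span></tt>, <tt class="docutils literal"><span class="pre">.strings</span></tt> and <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt>. The markup and variable names are illustrative:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p> Hello <b> world </b></p></body></html>", "html.parser"
)

# .parents walks upward from a tag to the BeautifulSoup object itself.
bold = soup.b
parent_names = [parent.name for parent in bold.parents]

# .strings yields every string as-is; .stripped_strings trims whitespace
# and skips strings that are whitespace only.
raw = list(soup.strings)
clean = list(soup.stripped_strings)

print(parent_names)
print(raw)
print(clean)
```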
|
||
</div>
|
||
<div class="section" id="id74">
|
||
<h3>XML</h3>
|
||
<p>BS4 no longer has a <tt class="docutils literal"><span class="pre">BeautifulStoneSoup</span></tt> class for parsing XML. To parse XML, pass "xml" as the second argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor. For the same reason, the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the <tt class="docutils literal"><span class="pre">isHTML</span></tt> argument.</p>
|
||
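<p>Beautiful Soup's handling of empty-element XML tags has been improved. Previously, when parsing XML you had to state explicitly which tags were empty-element tags. The <tt class="docutils literal"><span class="pre">selfClosingTags</span></tt> constructor argument is no longer recognized. Instead, Beautiful Soup treats any empty tag as an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.</p>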
|
||
</div>
|
||
<div class="section" id="id75">
|
||
<h3>Entities</h3>
|
||
<p>Incoming HTML or XML entities are always converted into the corresponding Unicode characters. Beautiful Soup 3 offered several ways of handling entities, and they have all been removed. The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer accepts the <tt class="docutils literal"><span class="pre">smartQuotesTo</span></tt> or <tt class="docutils literal"><span class="pre">convertEntities</span></tt> arguments. <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> still has a <tt class="docutils literal"><span class="pre">smart_quotes_to</span></tt> argument, but its default is now to turn smart quotes into Unicode. The constants <tt class="docutils literal"><span class="pre">HTML_ENTITIES</span></tt>, <tt class="docutils literal"><span class="pre">XML_ENTITIES</span></tt> and <tt class="docutils literal"><span class="pre">XHTML_ENTITIES</span></tt> have been removed, because they configured a feature that no longer exists.</p>
|
||
<p>If you want to turn Unicode characters back into HTML entities on output, rather than writing them as UTF-8 characters, use an <a class="reference internal" href="#id47">output formatter</a>.</p>
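<p>As a rough illustration of the round trip described above, the sketch below parses an entity into a Unicode character and then writes it back out with the built-in <tt class="docutils literal"><span class="pre">"minimal"</span></tt> and <tt class="docutils literal"><span class="pre">"html"</span></tt> formatters. The sample sentence is made up for the example:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")

# On input the entity has already become a Unicode character.
text = soup.p.string

# The default "minimal" formatter leaves that character as it is.
minimal = soup.p.decode(formatter="minimal")

# The "html" formatter converts it back into a named HTML entity.
as_entities = soup.p.decode(formatter="html")

print(text)
print(minimal)
print(as_entities)
```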
|
||
</div>
|
||
<div class="section" id="id76">
|
||
<h3>Miscellaneous</h3>
|
||
<p><a class="reference internal" href="#string">Tag.string</a> now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string.</p>
|
||
<p><a class="reference internal" href="#id12">Multi-valued attributes</a> such as <tt class="docutils literal"><span class="pre">class</span></tt> have a list of strings as their value, not a single string. This may affect the way you search for tags by CSS class.</p>
|
||
<p>If you pass both a <a class="reference internal" href="#text">text argument</a> and a <a class="reference internal" href="#id32">name argument</a> to one of the <tt class="docutils literal"><span class="pre">find*</span></tt> methods, Beautiful Soup searches for tags with the given name whose <a class="reference internal" href="#string">Tag.string</a> matches the text argument. It does not return the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and searched only for strings.</p>
|
||
<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the markupMassage argument. It is now the parser's responsibility to handle markup correctly.</p>
|
||
<p>The rarely used alternate parser classes, such as <tt class="docutils literal"><span class="pre">ICantBelieveItsBeautifulSoup</span></tt> and <tt class="docutils literal"><span class="pre">BeautifulSOAP</span></tt>, have been removed. It is now entirely up to the parser to decide how to handle ambiguous markup.</p>
|
||
<p>The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method now returns a Unicode string, not a bytestring.</p>
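<p>Several of the behavior changes in this section can be checked on a small document. The markup below is invented for the example. The sketch uses the <tt class="docutils literal"><span class="pre">string</span></tt> keyword, which newer Beautiful Soup 4 releases (4.4.0 and later) accept as the preferred name for the text argument:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="body strikeout"><b>one</b></p><a>Elsie</a><a>Lacie</a>',
    "html.parser",
)

# Tag.string is recursive: <p> holds only <b>, so both give the same string.
p_string = soup.p.string
b_string = soup.b.string

# A multi-valued attribute such as class is a list of strings.
classes = soup.p["class"]
by_class = soup.find_all(class_="strikeout")

# A name plus a string filter returns matching tags, not the strings.
elsie_tags = soup.find_all("a", string="Elsie")

# prettify() returns text (str), not bytes.
pretty = soup.prettify()

print(p_string, classes, elsie_tags, type(pretty).__name__)
```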
|
||
<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">Beautiful Soup 3 documentation</a></p>
|
||
<table class="docutils footnote" frame="void" id="id82" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
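<tr><td class="label"><a class="fn-backref" href="#id3">[1]</a></td><td>The Beautiful Soup Google discussion group is not very active, perhaps because the library is already fairly mature, but the author still does his best to help with problems.</td></tr>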
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id83" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
<tr><td class="label">[2]</td><td><em>(<a class="fn-backref" href="#id19">1</a>, <a class="fn-backref" href="#id23">2</a>)</em> The document is parsed into a tree, so the next element in parse order is a child of the current node.</td></tr>
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id84" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
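<tr><td class="label"><a class="fn-backref" href="#id26">[3]</a></td><td>A filter can only be used as an argument when searching the document. "Argument type" might be a more accurate name, but the original text uses <tt class="docutils literal"><span class="pre">filter</span></tt>, so that term is kept here.</td></tr>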
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id85" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
<tr><td class="label"><a class="fn-backref" href="#id31">[4]</a></td><td>The element argument is a tag node in the HTML document. It cannot be a text node.</td></tr>
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id86" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
<tr><td class="label">[5]</td><td><em>(<a class="fn-backref" href="#id18">1</a>, <a class="fn-backref" href="#id33">2</a>, <a class="fn-backref" href="#id34">3</a>, <a class="fn-backref" href="#id35">4</a>, <a class="fn-backref" href="#id36">5</a>)</em> Uses pre-order traversal.</td></tr>
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id87" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
<tr><td class="label">[6]</td><td><em>(<a class="fn-backref" href="#id38">1</a>, <a class="fn-backref" href="#id39">2</a>)</em> CSS selectors are a separate syntax for searching documents. See <a class="reference external" href="http://www.w3school.com.cn/css/css_selector_type.asp">http://www.w3school.com.cn/css/css_selector_type.asp</a></td></tr>
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id88" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
<tr><td class="label"><a class="fn-backref" href="#id50">[7]</a></td><td>The original text says html5lib. The translator believes this is a typo in the original document.</td></tr>
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id89" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
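<tr><td class="label"><a class="fn-backref" href="#id43">[8]</a></td><td>"Wrap" means to package or bundle, but here the wrapping is not applied from the outside. Instead, the inner content of the current tag is wrapped in a tag. The new tag that wraps the original content remains inside the tag on which <a class="reference internal" href="#wrap">wrap()</a> was called.</td></tr>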
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id90" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
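<tr><td class="label"><a class="fn-backref" href="#id52">[9]</a></td><td>Beautiful Soup automatically replaces characters it cannot decode with a special character (usually U+FFFD REPLACEMENT CHARACTER, �). If you need a document that mixes several encodings to be converted completely correctly, you have to preprocess it by hand and unify its encoding first.</td></tr>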
|
||
</tbody>
|
||
</table>
|
||
<table class="docutils footnote" frame="void" id="id91" rules="none">
|
||
<colgroup><col class="label"><col></colgroup>
|
||
<tbody valign="top">
|
||
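<tr><td class="label">[10]</td><td><em>(<a class="fn-backref" href="#id55">1</a>, <a class="fn-backref" href="#id57">2</a>)</em> Smart quotes are common in Microsoft Word: within a paragraph, each quotation mark is automatically converted to a left or right quotation mark according to the order in which it appears.</td></tr>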
|
||
</tbody>
|
||
</table>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
|
||
|
||
</div> |