Scraping Web Search Results with Python

Author: Xherlock
Published: October 9, 2021
Category: python, web scraping

Python~XPath

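XPath is a language for locating content in XML documents.
HTML is not strictly a subset of XML, but a parsed HTML page forms the same kind of element tree, so XPath works on it too.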

The lxml module provides XPath support in Python.

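/a    the root a tag

/a/b    b tags directly under a

/a/b/text()    the text content of the b tags under a

/a/b//c/text()    the text of every c tag below b, at any depth (child, grandchild, and so on)

/a/b/*/c/text()    the text of c tags exactly one level below b, where * matches any intermediate tag (span, div, etc.)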
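The difference between // and * is easiest to see on a small tree. The following is a minimal sketch, assuming a made-up XML document (the tag contents are hypothetical and only serve the illustration):

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
doc = etree.XML(
    "<a>"
    "<b>"
    "<c>direct child</c>"
    "<span><c>inside span</c></span>"
    "<div><p><c>two levels down</c></p></div>"
    "</b>"
    "</a>"
)

b_tags = doc.xpath("/a/b")                # b elements directly under the root a
any_depth = doc.xpath("/a/b//c/text()")   # text of c at any depth below b
one_level = doc.xpath("/a/b/*/c/text()")  # text of c exactly one wrapper element below b

print(len(b_tags))
print(any_depth)
print(one_level)
```

Here // picks up all three c tags, while * only matches the one wrapped in a single intermediate element.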

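Examples:

/html/body/ol/li/a[@href='hello']    selects only the a tags whose href attribute equals 'hello'

/html/body/ol/li/a/@href    returns the href attribute values of the a tags on that path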
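As a rough illustration of these two expressions, the sketch below runs them against a hypothetical HTML snippet (the list items and link targets are invented). Note that a predicate such as [@href='hello'] returns elements, while /@href returns plain attribute strings:

```python
from lxml import etree

# Hypothetical HTML snippet, made up for this illustration.
page = etree.HTML(
    "<html><body><ol>"
    "<li><a href='hello'>first</a></li>"
    "<li><a href='world'>second</a></li>"
    "</ol></body></html>"
)

matched = page.xpath("/html/body/ol/li/a[@href='hello']")  # list of elements
hrefs = page.xpath("/html/body/ol/li/a/@href")             # list of attribute strings

print([a.text for a in matched])
print(hrefs)
```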

In the browser's developer tools (F12), you can right-click an element and copy its full XPath.

The example below uses zbj.com (猪八戒网) to practice scraping with XPath. Page: Beijing SaaS outsourcing listings on zbj.com, https://beijing.zbj.com/search/f/?kw=saas

import requests
from lxml import etree

url = "https://beijing.zbj.com/search/f/?type=new&kw=saas"
resp = requests.get(url)
html = etree.HTML(resp.text)

# The absolute XPath must match the page layout exactly; it breaks if the site changes
divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div")
print("\033[1;31;48mPrice\tTitle\tFirm\033[0m")  # ANSI escape codes: header row in bold red
for div in divs:
    price = div.xpath("./div/div/a[2]/div[2]/div[1]/span[1]/text()")[0].strip("¥")
    # The highlighted keyword splits the title text, so put it back when joining the pieces
    info = "saas".join(div.xpath("./div/div/a[2]/div[2]/div[2]/p/text()"))
    firm = div.xpath("./div/div/a[1]/div[1]/p/text()")[1].strip("\n")  # strip surrounding newlines
    print(f"{price}\t{info}\t{firm}")
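The script indexes the XPath results directly with [0] and [1], which raises IndexError as soon as one listing lacks the expected node. One possible guard is sketched below; text_at is a hypothetical helper, and the markup is invented rather than taken from the real zbj.com layout:

```python
from lxml import etree


def text_at(node, path, index=0, default=""):
    """Return the stripped text result at `index` for `path`, or `default` if it is missing."""
    results = node.xpath(path)
    if index < len(results):
        return results[index].strip()
    return default


# Hypothetical markup, not the real zbj.com layout.
card = etree.HTML("<div><span class='price'>¥100</span></div>")

price = text_at(card, "//span[@class='price']/text()").strip("¥")
firm = text_at(card, "//p/text()", index=1, default="N/A")  # no p tag here, so the default is used

print(price)
print(firm)
```

With a helper like this, a listing with a missing field prints a placeholder instead of stopping the whole loop.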

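[Figure: output of the script]

When locating the title element, the page wraps the matched search keyword in an hl (highlight) tag, which splits the title text. The text() query therefore returns a list with two elements instead of a single string.

[Figure: output before the fix]

So join is needed to concatenate the list elements, with the keyword put back between them. One bug remains: if the keyword sits at the very start or end of the title, the list has only one element and the keyword is not restored.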
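One way to sidestep this bug, offered as a sketch rather than as the author's method, is to ask XPath for the full string value of the element with string(.), which concatenates all descendant text including the highlighted keyword. The fragment below is hypothetical and is parsed as XML to keep it minimal; the real tag names on zbj.com may differ:

```python
from lxml import etree

# Hypothetical fragment imitating titles whose keyword is wrapped in a highlight tag.
snippet = etree.XML(
    "<div>"
    "<p><hl>saas</hl> system development</p>"
    "<p>custom <hl>saas</hl> platform</p>"
    "</div>"
)

rows = []
for p in snippet.xpath("//p"):
    pieces = p.xpath("./text()")   # direct text nodes only, the keyword is missing
    joined = "saas".join(pieces)   # the join approach used in the script above
    full = p.xpath("string(.)")    # all descendant text, in document order
    rows.append((joined, full))
    print(repr(joined), repr(full))
```

For the first title, where the keyword comes first, the join approach loses it, while string(.) returns the complete text in both cases.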

Last modified: December 29, 2022

© Reposting is permitted with proper attribution
