Python web-scraping exercises

Database installation tutorial
https://blog.csdn.net/m0_63451989/article/details/131948723?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522169957815816800192248363%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=169957815816800192248363&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2_blogfirst_rank_ecpm_v1~rank_v31_ecpm-5-131948723-null-null.nonecase&utm_term=mysql%208.2.0&spm=1018.2226.3001.4450

If you forget the database password
https://blog.csdn.net/m0_46278037/article/details/113923726

After installing the database, log in and create a database:
mysql> create database test;
Reference for basic SQL statements: https://blog.csdn.net/m0_60494863/article/details/124364800

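Test the database connection from Python

import pymysql

# pymysql.connect raises an exception on failure, so catch it instead of testing `if conn:`
try:
    conn = pymysql.connect(host="127.0.0.1", user="root", password="123456", database="test", charset="utf8")
    print("true")
    conn.close()
except pymysql.MySQLError as e:
    print("connection failed:", e)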

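Scraping exercise 1:

import requests
# Target page
url = "https://www.xxx.edu.cn/"
# In the browser, right-click and choose Inspect, open the Network tab, filter by Doc, and refresh the page.
# "User-Agent" appears in the request headers of the document; it is simply your browser's identification string.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
}
try:
    response = requests.get(url, headers=headers)
    # Avoid garbled Chinese characters
    response.encoding = 'utf-8'
    # A status code of 200 means the request succeeded and the content can be used.
    # (response.raise_for_status() would instead raise requests.HTTPError on 4xx/5xx responses.)
    if response.status_code == 200:
        # response.text holds the whole page; slice it, as here, to limit the characters printed
        print(response.text[:100])
except requests.RequestException:
    print("Scraping failed")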
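As an alternative to comparing `status_code` by hand, requests provides `Response.raise_for_status()`, which raises `requests.HTTPError` for 4xx/5xx responses. Below is a minimal sketch of that variant. The helper name `fetch_text` and the 10-second timeout are assumptions made for illustration, not part of the original exercise.

```python
import requests


def fetch_text(url, headers=None, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raises requests.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
        response.encoding = "utf-8"
        return response.text
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print("request failed:", exc)
        return None
```

Catching `requests.RequestException` rather than using a bare `except:` keeps unrelated bugs, such as a typo in a variable name, from being silently reported as a scraping failure.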

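Scraping exercise 2: save the scraped data to an HTML file, then parse the file to extract the target information
For example, extract the college's majors and put them in a list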

import requests
url = 'https://www.xxx.edu.cn/yqdx/zyyl.htm'
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
}
response = requests.get(url, headers=headers)
response.encoding='utf-8'
if response.status_code == 200:
    print(response.text)

The page content has now been fetched; next, write it to an HTML file

from bs4 import BeautifulSoup
x = BeautifulSoup(response.text, 'lxml')      # use lxml as the underlying parser
print(x.prettify())     # re-indent into tidy HTML
bytes_obj = x.prettify().encode()
with open('baidus.html', 'wb') as f:
    f.write(bytes_obj)
print('ok')

Extract the college names and tidy them up

from lxml import etree
import re
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('baidus.html', parser=parser)
# Avoid naming variables str or list, which would shadow the built-ins
texts = html.xpath('//body/div//table/tbody/tr/td/p/span/text()')
print(texts)
colleges = []
keyword = "学院"
for text in texts:
    # Substring match
    if re.search(keyword, text):
        # Strip whitespace from both ends
        text = text.strip()
        print(text)
        colleges.append(text)
print(colleges)
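Because the pattern here is a fixed string, the plain `in` operator selects the same entries as `re.search`. Below is a small sketch on made-up sample text; the strings are placeholders, not scraped data.

```python
# Placeholder strings standing in for the XPath text() results.
texts = ["  计算机学院 ", "\n", "软件工程", " 外国语学院"]

keyword = "学院"
# Keep only entries containing the keyword, stripped of surrounding whitespace.
colleges = [t.strip() for t in texts if keyword in t]
print(colleges)
```

A regular expression only becomes necessary once the match is a real pattern, for example several alternative suffixes.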

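Done.
Here is a supplementary note on parsing HTML files. There are two ways to parse an HTML file. The first is XPath, via lxml:
1. To use XPath syntax, call the Element.xpath method to run the XPath selection.
result = html.xpath('//li')
For node-selecting expressions like this one, xpath always returns a list.
2. To get text, use the text() function in XPath. Example:
html.xpath('//li/a[1]/text()')
3. To run xpath within a specific tag and get that tag's descendants, put a . before the slash, meaning the search starts from the current element.
address = tr.xpath('./td[4]/text()')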
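The three points above can be tried on an inline snippet. This is a minimal sketch; the sample table and its values are invented for illustration.

```python
from lxml import etree

# Hypothetical sample markup, only for demonstrating the three points above.
sample = """
<table>
  <tr><td>1</td><td>Alice</td><td>HR</td><td>Beijing</td></tr>
  <tr><td>2</td><td>Bob</td><td>IT</td><td>Shanghai</td></tr>
</table>
"""
root = etree.HTML(sample)

# 1. Selecting nodes returns a list of elements.
rows = root.xpath("//tr")
# 2. text() returns the text nodes instead of the elements.
first_cells = root.xpath("//tr/td[1]/text()")
# 3. A leading "." makes the path relative to the current <tr>.
addresses = []
for tr in rows:
    addresses.extend(tr.xpath("./td[4]/text()"))

print(len(rows), first_cells, addresses)
```

Without the leading dot, a path starting with `//` would search the whole document again, even when called on a single row.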
Suppose there is a file named hello.html in the same directory as the Python script:

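<html>
<head>
    <title>Title</title>
</head>
<body>
<ul>
    <li>法治的细节</li>
    <li id="model">纸质书</li>
    <li class="price">30元</li>
    <li>罗翔</li>
</ul>
</body>
</html>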

from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('hello.html', parser=parser)

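#result = html.xpath('//head/title/text()')
# All li text under body/ul: ['法治的细节', '纸质书', '30元', '罗翔']
result = html.xpath('//body/ul/li/text()')
# Intermediate levels can be skipped with //; the result is the same as above
result = html.xpath('//body//li/text()')
# With several identical tags, [n] selects the n-th one. XPath positions start at 1, so li[1] is the first li: ['法治的细节']
result = html.xpath('//body//li[1]/text()')
# Attributes can also be used to tell the tags apart; this outputs ['纸质书']
result = html.xpath('//body//li[@id="model"]/text()')
result = html.xpath('//body//li[@class="price"]/text()')
print(result)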

Reference: https://blog.csdn.net/qq_44087994/article/details/126417444?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522169989476116800225567403%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=169989476116800225567403&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2_blogfirst_rank_ecpm_v1~rank_v31_ecpm-2-126417444-null-null.nonecase&utm_term=xpath%E7%88%AC%E8%99%AB%E7%BB%93%E6%9E%9C%E6%9C%89%5Cn&spm=1018.2226.3001.4450
The second method is Beautiful Soup, which converts a complex HTML document into a tree structure in which every node is a Python object.
To search the document tree, the two most commonly used methods are find and find_all. find returns as soon as it finds the first matching tag, so it returns a single element (or None when nothing matches). find_all returns every matching tag.

For example, parsing the same hello file again:

from bs4 import BeautifulSoup

with open("hello.html", "r", encoding='utf-8') as f:
    html = f.read()
soup = BeautifulSoup(html, 'lxml')
ul = soup.find('ul')          # the first <ul> only
items = soup.find_all('li')   # every <li>
for item in items:
    print(item)
print(ul)

Output: all li elements

法治的细节
纸质书
30元
罗翔

Output: the full content of the ul

法治的细节
纸质书
30元
罗翔
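As a small complement to the example above, the sketch below contrasts the return values of find and find_all and shows how to pull out just the text. The markup is a stand-in written for illustration, not the real hello.html, and Python's built-in "html.parser" is used here only to keep the sketch self-contained.

```python
from bs4 import BeautifulSoup

# Stand-in markup for illustration; not the real hello.html.
html = "<ul><li id='a'>first</li><li class='price'>second</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")            # a single Tag: the first match
all_items = soup.find_all("li")    # a list-like ResultSet of every match
missing = soup.find("table")       # None, because nothing matches

# get_text() returns the text inside a tag without the markup.
texts = [li.get_text(strip=True) for li in all_items]
# Filter by attribute; class_ is used because class is a Python keyword.
priced = soup.find("li", class_="price")

print(first.get_text(), texts, missing, priced.get_text())
```

Since find can return None, checking the result before calling methods on it avoids an AttributeError when the page layout changes.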

Exercise: parse the file below using different methods

<supermarket name="永辉超市" address="肖家河大厦">
    <staffs>
        <staff  id="s001">
            <name>小明</name>
            <position>收营员</position>
            <salary>4000</salary>
        </staff>
        <staff  id="s002">
            <name>小花</name>
            <position>促销员</position>
            <salary>3500</salary>
        </staff>
        <staff  id="s003">
            <name>张三</name>
            <position>保洁</position>
            <salary>3000</salary>
        </staff>
        <staff  id="s004">
            <name>李四</name>
            <position>收营员</position>
            <salary>4000</salary>
        </staff>
        <staff  id="s005">
            <name>王五</name>
            <position>售货员</position>
            <salary>3800</salary>
        </staff>
    </staffs>
    
    <goodsList>
        <goods discount="0.9">
            <name>泡面</name>
            <price>3.5</price>
            <count>120</count>
        </goods>
        <goods>
            <name>火腿肠</name>
            <price>1.5</price>
            <count>332</count>
        </goods>
        <goods>
            <name>矿泉水</name>
            <price>2</price>
            <count>549</count>
        </goods>
        <goods discount="8.5">
            <name>面包</name>
            <price>5.5</price>
            <count>29</count>
        </goods>
    </goodsList>
</supermarket>
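One possible approach to the exercise uses the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This is a minimal sketch, not a reference solution; it assumes a shortened copy of the file embedded as a string, keeping only two staff and two goods entries.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the exercise file, embedded so the sketch is self-contained.
xml_text = """
<supermarket name="永辉超市" address="肖家河大厦">
    <staffs>
        <staff id="s001"><name>小明</name><position>收营员</position><salary>4000</salary></staff>
        <staff id="s002"><name>小花</name><position>促销员</position><salary>3500</salary></staff>
    </staffs>
    <goodsList>
        <goods discount="0.9"><name>泡面</name><price>3.5</price><count>120</count></goods>
        <goods><name>火腿肠</name><price>1.5</price><count>332</count></goods>
    </goodsList>
</supermarket>
"""
root = ET.fromstring(xml_text)

# Attribute on the root element.
store_name = root.get("name")
# Name of every <staff>, wherever it sits in the tree.
staff_names = [s.findtext("name") for s in root.iter("staff")]
# Element text is always a string, so convert before summing.
total_salary = sum(int(s.findtext("salary")) for s in root.iter("staff"))
# [@discount] keeps only the <goods> elements that carry that attribute.
discounted = [g.findtext("name") for g in root.findall("./goodsList/goods[@discount]")]

print(store_name, staff_names, total_salary, discounted)
```

The same queries can be written with lxml's full XPath or with Beautiful Soup's find_all, which makes this file a convenient way to compare the methods.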
