I Scraped 5,000 Quality Cloud-Drive Resources (Source Code Included)

Notes on scraping quality cloud-drive resources with Python

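I recently came across a website with a surprisingly rich collection of content.

Its resources are still being updated, too.

For now the resources are free, and you only need to log in to get them.

Still, as a learning exercise, I decided to scrape all 4,000-plus entries on the site.

Looking at the site, its page URLs follow this format: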

https://****.org/skills/4065.html
https://****.org/skills/4066.html
https://****.org/skills/4173.html
https://****.org/skills/4183.html
https://****.org/skills/4184.html

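The pattern is easy to spot: https://****.org/skills/****.html, where the final **** is a numeric page ID.

So it looks as if we only need to crawl from 1 up to the highest number the site has reached so far.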
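
As a rough sketch of that idea, the page URLs can be generated from a range of IDs. The host `example.org` is a placeholder (the real domain is masked in this post), and `BASE_URL` and `page_urls` are names made up for illustration.

```python
# Placeholder host: the real domain is masked in this post.
BASE_URL = "https://example.org/skills/%s.html"


def page_urls(first, last):
    """Yield the detail-page URL for every ID from first to last, inclusive."""
    for num in range(first, last + 1):
        yield BASE_URL % num


# Build the URLs for the first three IDs.
urls = list(page_urls(1, 3))
```

Using a generator keeps memory use flat even when the ID range grows to several thousand.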

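But is it really that simple?

Not quite. Along the way I found that some of these links point to pages that do not exist.

The fix is simple: check whether the page contains the text “访问的页面不存在” ("the page you visited does not exist").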
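
A minimal sketch of that check, run on two hand-written HTML strings rather than real responses. The helper name `page_exists` and both sample strings are assumptions for illustration. The marker is the shorter substring “不存在”.

```python
# Substring that appears on the site's "page does not exist" notice.
NOT_FOUND_MARKER = "不存在"


def page_exists(html):
    """Return True when the HTML does not contain the not-found notice."""
    return NOT_FOUND_MARKER not in html


# Hand-written samples standing in for real responses.
missing_page = "<html><body>您访问的页面不存在</body></html>"
normal_page = "<html><body><h1 class='metas-title'>Some resource</h1></body></html>"

print(page_exists(missing_page))
print(page_exists(normal_page))
```

One caveat of a plain substring test: a real resource whose own text happens to contain “不存在” would also be skipped, so matching the longer phrase may be safer.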

After some tinkering, I ended up with the following code:

import requests
from bs4 import BeautifulSoup
import csv
import time

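def get_info(num):
    # Paste the cookie from your own logged-in session here
    headers = {'cookie': 'your cookie here'}
    # NOTE: the sample URLs above use /skills/ while this path is /ziyuan/;
    # use whichever path the site actually serves
    url = "https://****.org/ziyuan/%s.html" % num
    resp = requests.get(url, headers=headers, timeout=10)
    # Skip IDs whose page reports that it does not exist ("不存在")
    if "不存在" in resp.text:
        print("%s page does not exist" % num)
        return
    soup = BeautifulSoup(resp.text, 'html.parser')
    info = soup.find('div', class_="panel-body")
    info_list = info.find('div', class_='entry-meta').find_all('li')
    # Get the title, update date, drive type, and resource category
    title = info.find('h1', class_='metas-title').text.strip()
    update_time = info.find('span', class_='comment-num').text.strip()
    # Strip the on-page labels "网盘分类：" (drive type) and "资源分类：" (resource category)
    drive_type = info_list[0].text.strip().replace("网盘分类：", "")
    tag = info_list[1].text.strip().replace("资源分类：", "")
    # Get the drive link, dropping the site's redirect prefix
    link = info.find('a', id='lj')['href'].replace('/go/?url=', '')
    # Get the extraction code; pages without one have no 'tiquma' span
    password_tag = info.find('span', id='tiquma')
    password = password_tag.text.strip() if password_tag else ""
    # Append the scraped row to the CSV file
    with open("wpfz_info1.csv", 'a', encoding='utf-8', newline='') as f:
        csv_write = csv.writer(f)
        csv_write.writerow([num, title, update_time, drive_type, tag, link, password])
    print("%s done" % num)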


if __name__ == '__main__':
    # Loop over every page ID and fetch its content
    for i in range(1, 4817):
        get_info(i)
        time.sleep(1)
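
Because the script writes rows without a header line, reading the CSV back needs the column names supplied by hand. The sketch below shows one way to do that with the standard library. The `FIELDS` labels and the `load_rows` helper are made up for illustration. They mirror the order of the values passed to `writerow`.

```python
import csv

# Column labels in the same order as the writerow() call in the scraper.
FIELDS = ["num", "title", "update_time", "type", "tag", "link", "password"]


def load_rows(path):
    """Read the header-less CSV and return one dict per scraped page."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, fieldnames=FIELDS))
```

Calling `load_rows("wpfz_info1.csv")` after a crawl returns a list of dicts, which makes it easy to filter by `tag` or `type` later.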