
I. Getting the Resources

Target URL (first page):

https://www.shicimingju.com/category/all

(Subsequent pages)

https://www.shicimingju.com/category/all_2
https://www.shicimingju.com/category/all_3
...
https://www.shicimingju.com/category/all_652
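
The URL pattern above can be captured in a small helper. This is only a sketch: the name page_url and the BASE_URL constant are introduced here for illustration, and the page count of 652 is taken from the list above.

```python
BASE_URL = "https://www.shicimingju.com/category/all"


def page_url(page: int) -> str:
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # The first page has no suffix; later pages append "_<page>".
    return BASE_URL if page == 1 else f"{BASE_URL}_{page}"


# All 652 listing URLs, in page order.
urls = [page_url(p) for p in range(1, 653)]
print(urls[0])
print(urls[-1])
```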

Author name (CSS selector):

.cont1 .list li .infor .name

Introduction (CSS selector):

.cont1 .list li .infor .text

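Decide on a crawling policy that respects robots.txt. Most websites publish a robots.txt file that states which pages crawlers may and may not fetch. Before you start, check the site's robots.txt and follow its rules. You can request the file directly and read its contents to make sure the crawler respects the site's wishes.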

import requests

robots_url = 'https://www.shicimingju.com/robots.txt'
robots_response = requests.get(robots_url, timeout=10)
print(robots_response.text)  # Inspect robots.txt and follow its rules
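
Instead of reading the rules by eye, the standard library's urllib.robotparser can evaluate them. The sketch below parses a made-up robots.txt (SAMPLE_ROBOTS is hypothetical and is not the real file from shicimingju.com) so that it runs without network access. In a real run you would pass robots_response.text instead. The crawler name "my-crawler" is also an assumption.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules, used only to demonstrate the API.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_ROBOTS)
# can_fetch(useragent, url) reports whether the rules allow the request.
print(parser.can_fetch("my-crawler", "https://example.com/category/all"))
print(parser.can_fetch("my-crawler", "https://example.com/private/page"))
```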

II. Sending the Request

response = requests.get(url)

III. Parsing the Data

# Parse the HTML
selector = Selector(response.text)
# Get the author names
authors_name = selector.css('.cont1 .list li .infor .name::text').getall()
# Get the author introductions
authors_introduction = selector.css('.cont1 .list li .infor .text::text').getall()
# Clean the introductions: remove extra newlines and spaces
cleaned_introductions = [
    ' '.join(intro.strip().replace('\n', ' ').split())
    for intro in authors_introduction
]

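IV. Saving the Data

# Create the DataFrame
authors_data = pd.DataFrame({
    'name': authors_name,
    'introduction': cleaned_introductions[:len(authors_name)]  # Trim surplus introductions so the lengths match
})
# Save to a CSV file; utf-8-sig (UTF-8 with a BOM) keeps the Chinese text from appearing garbled in Excel
authors_data.to_csv('authors_info.csv', index=False, encoding='utf-8-sig')
print("Data written to authors_info.csv")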

V. Complete Code

import requests
from parsel import Selector
import pandas as pd

# Target URL
url = 'https://www.shicimingju.com/category/all'  # Replace with the URL you actually want to crawl
# Send the request
response = requests.get(url)
# Check the response status
if response.status_code == 200:
    # Parse the HTML
    selector = Selector(response.text)
    # Get the author names
    authors_name = selector.css('.cont1 .list li .infor .name::text').getall()
    # Get the author introductions
    authors_introduction = selector.css('.cont1 .list li .infor .text::text').getall()
    # Clean the introductions: remove extra newlines and spaces
    cleaned_introductions = [
        ' '.join(intro.strip().replace('\n', ' ').split())
        for intro in authors_introduction
    ]
    # Create the DataFrame
    authors_data = pd.DataFrame({
        'name': authors_name,
        'introduction': cleaned_introductions[:len(authors_name)]  # Trim surplus introductions so the lengths match
    })
    # Save to a CSV file; utf-8-sig keeps the Chinese text from appearing garbled in Excel
    authors_data.to_csv('authors_info.csv', index=False, encoding='utf-8-sig')
    print("Data written to authors_info.csv")
else:
    print(f"Request failed, status code: {response.status_code}")

﹍ Further optimization ﹍

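VI. Avoiding Abusive Crawling

The measures below make the program more stable and less likely to be blocked. They also keep the load on the target site low: the request rate for any page should stay reasonable so that the crawler does not harm the server. To avoid behaving like a malicious crawler while coping with the site's anti-crawling mechanisms, add strategies such as request delays to the crawler code: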

1. Add a random delay
Calling time.sleep(random.uniform(1, 3)) between requests pauses for a random 1 to 3 seconds, which further reduces the load on the target site.
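
A minimal sketch of that delay as a reusable helper. The function name polite_sleep and its sleep parameter are introduced here for illustration; the sleep parameter only exists so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Pause for a random duration in [min_s, max_s] and return it."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

Calling polite_sleep() once per loop iteration, right before each requests.get call, spaces the requests 1 to 3 seconds apart.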

2. User-Agent spoofing
Set the User-Agent request header so that requests look as if they come from a variety of common browsers rather than from an automated crawler. You can put common browser User-Agent strings in the request headers.

import random
import pandas as pd

total_pages = 652

# Pool of User-Agent strings to choose from at random
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
]
# Create a pandas Excel writer
with pd.ExcelWriter('authors_info.xlsx', engine='openpyxl') as writer:
    for page in range(1, total_pages + 1):  # Loop from 1 to the total page count (652)
        # Build the URL from the page number
        url = 'https://www.shicimingju.com/category/all' if page == 1 else f'https://www.shicimingju.com/category/all_{page}'
        # Pick a random User-Agent
        headers = {'User-Agent': random.choice(USER_AGENTS)}

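3. Limit the request rate
Controlling how often requests are sent reduces the risk of the server identifying the crawler as malicious. Send each request only after a set interval has passed.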
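
One way to enforce such an interval is a small limiter that remembers when the previous request was sent. The class below is a sketch under assumptions: the name MinIntervalLimiter and the injectable clock and sleep parameters are not from the original code and are there to make the behavior easy to check.

```python
import time


class MinIntervalLimiter:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous call, None before the first one

    def wait(self):
        """Sleep if the previous call was too recent; return the time slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

A crawler would create one limiter, for example MinIntervalLimiter(2.0), and call limiter.wait() before every request. The instance is not thread-safe, so with several threads it would need a lock around wait().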

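4. IP proxy pool
A proxy server hides your real IP address and spreads requests across several addresses. You can build a proxy pool and pick a proxy at random for each request. With requests, the proxy is not set in the request headers; it is passed through the proxies parameter:

proxies = {
    "http": "http://your_proxy_ip:port",
    "https": "http://your_proxy_ip:port",  # HTTPS traffic is usually tunnelled through an http:// proxy URL
}
# Pass the proxies argument when sending the request
response = requests.get(url, headers=headers, proxies=proxies)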
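
Picking a proxy at random from a pool could look like the sketch below. The addresses in PROXY_POOL are placeholders, not working proxies, and pick_proxies is a name introduced here.

```python
import random

# Placeholder addresses; replace them with proxies you are allowed to use.
PROXY_POOL = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def pick_proxies(pool=PROXY_POOL):
    """Return a requests-style proxies dict using one randomly chosen proxy."""
    proxy = random.choice(pool)
    # Use the same proxy for both schemes so one request leaves from one address.
    return {"http": proxy, "https": proxy}


print(pick_proxies())
```

The result can be passed directly as requests.get(url, headers=headers, proxies=pick_proxies()).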

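5. Avoid requesting the same page repeatedly
Make sure the same page is not requested again within a short time. A cache avoids re-fetching data you already have, and a planned crawl schedule lets you track how the data changes over different time periods.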
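
A simple in-memory cache with an expiry time is one possible form of this. The sketch below is an assumption about how it might be structured: PageCache, ttl_seconds and the injected fetch function are illustrative names, and fake_fetch stands in for a real HTTP call so the example runs offline.

```python
import time


class PageCache:
    """Cache page bodies by URL and re-fetch only after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # url -> (time fetched, body)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # Still fresh: no new request is sent
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body


calls = []


def fake_fetch(url):
    """Stand-in for a real request; records each call."""
    calls.append(url)
    return f"<html>{url}</html>"


cache = PageCache(fake_fetch, ttl_seconds=60.0)
cache.get("https://example.com/a")
cache.get("https://example.com/a")  # Served from the cache
print(len(calls))
```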

VII. Other Optimizations

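1. Multithreading
Use the concurrent.futures module, which provides a high-level interface for asynchronously executing callables. ThreadPoolExecutor runs the requests on a pool of threads, and max_workers sets the number of threads.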
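
The pattern can be sketched without any network access as follows. Here fetch_all and fake_fetch are illustrative names, and fake_fetch simulates one failing page. Keying the futures by page number avoids looking the page up again afterwards.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(pages, fetch, max_workers=4):
    """Run fetch(page) on a thread pool and return {page: result or None}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_page = {executor.submit(fetch, page): page for page in pages}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(future_to_page):
            page = future_to_page[future]
            try:
                results[page] = future.result()
            except Exception as exc:
                results[page] = None
                print(f"page {page} failed: {exc}")
    return results


def fake_fetch(page):
    """Stand-in for a real request; page 3 fails on purpose."""
    if page == 3:
        raise RuntimeError("simulated failure")
    return f"<html>page {page}</html>"


results = fetch_all(range(1, 6), fake_fetch)
print(sorted(results))
```
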
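2. Handling missing introductions
In the parse_authors_data function, a list comprehension processes authors_introduction. If an introduction is empty (intro.strip() returns an empty string), it is replaced with '-'. This ensures every author has a corresponding introduction, even when the text is missing.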
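
The same rule, pulled out into a standalone function so it can be tried on sample strings. The name clean_intro and the sample texts are made up for illustration.

```python
def clean_intro(raw, placeholder="-"):
    """Collapse all whitespace runs to single spaces; use placeholder if empty."""
    text = " ".join(raw.split())
    return text if text else placeholder


samples = [
    "  Tang dynasty poet,\n   courtesy name Taibai. ",
    "   \n ",
    "",
]
for sample in samples:
    print(repr(clean_intro(sample)))
```

Note that this only covers introductions that are present but blank. If the introduction element is missing from the page entirely, the selector returns a shorter list and the names and introductions can drift out of alignment.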

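3. Checking for empty content
Check whether the HTML content is empty, in case a request did not return valid page content. Also check whether the DataFrame is empty: before writing a worksheet, make sure the authors_data DataFrame contains rows, and if it does not, print a message saying that nothing was written.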

VIII. Optimized Code

One sheet per page

import time
import random
import requests
from parsel import Selector
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

# Constants
TOTAL_PAGES = 652  # Adjust as needed
MAX_RETRIES = 5
TIMEOUT = 10  # Timeout for requests, in seconds
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
]

"""
# Proxy list (replace with your own proxies)
PROXIES = [
    "http://183.164.243.143:8089",
    "http://200.37.186.66:8080",
    "http://183.89.247.182:8080",
]

# Helper function
def get_random_proxy():
    return {'http': random.choice(PROXIES), 'https': random.choice(PROXIES)}
"""


def fetch_data(url):
    retry_count = 0
    while retry_count < MAX_RETRIES:
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        print(f"Requesting {url}...")  # Print the requested URL
        try:
            response = requests.get(url, headers=headers, timeout=TIMEOUT)
            # response = requests.get(url, headers=headers, proxies=get_random_proxy(), timeout=TIMEOUT)
            response.raise_for_status()  # Raise an error for bad responses
            return response.text
        except requests.RequestException as e:
            delay_time = random.uniform(2, 6)  # Random delay between retries
            # e.response is None when no HTTP response was received (e.g. a timeout)
            status = e.response.status_code if e.response is not None else 'N/A'
            print(f"Request failed: {e}, status code: {status}, retrying in {delay_time:.2f} s...")
            time.sleep(delay_time)
            retry_count += 1
    print(f"Maximum retries reached, giving up on {url}.")
    return None


def parse_authors_data(html):
    selector = Selector(html)
    authors_name = selector.css('.cont1 .list li .infor .name::text').getall()
    authors_introduction = selector.css('.cont1 .list li .infor .text::text').getall()
    # Clean the introductions; fill in '-' when one is empty
    cleaned_introductions = [
        ' '.join(intro.strip().replace('\n', ' ').split()) if intro.strip() else '-'
        for intro in authors_introduction
    ]
    # Trim surplus introductions so the lengths match
    return pd.DataFrame({
        'name': authors_name,
        'introduction': cleaned_introductions[:len(authors_name)]
    })


def main():
    with pd.ExcelWriter('authors_info.xlsx', engine='openpyxl') as writer:
        # Create a list of URLs to scrape
        urls = [
            'https://www.shicimingju.com/category/all' if page == 1
            else f'https://www.shicimingju.com/category/all_{page}'
            for page in range(1, TOTAL_PAGES + 1)
        ]
        # Use ThreadPoolExecutor to fetch data concurrently
        with ThreadPoolExecutor(max_workers=10) as executor:  # You can adjust the number of workers
            future_to_url = {executor.submit(fetch_data, url): url for url in urls}
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                page_number = urls.index(url) + 1  # Get the page number
                try:
                    html = future.result()
                    if not html:
                        print(f"The HTML content of page {page_number} is empty.")
                        continue
                    authors_data = parse_authors_data(html)
                    if authors_data.empty:
                        print(f"No data parsed from page {page_number}; nothing written.")
                        continue
                    authors_data.to_excel(writer, sheet_name=str(page_number), index=False)
                    print(f"Data written to the worksheet for page {page_number}")
                except Exception as e:
                    print(f"Error while processing {url}: {e}")
    print("All data has been written to authors_info.xlsx")


if __name__ == "__main__":
    main()

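All pages in a single sheet
Write all poets and their introductions to a single sheet of one workbook, with the page number in the first column.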

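After the run, the rows can be sorted by the page-number column. This is also why the data is not split across sheets: a single sheet shows the total count at a glance, whereas several hundred sheets would be unwieldy.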


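The figure 13035 assumes roughly 20 poets per page, with 15 on the last page. Some pages may actually hold fewer than 20, so the real total will be lower than this.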
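
The estimate can be reproduced with a few lines of arithmetic. The per-page counts below are the assumptions stated above, not measured values.

```python
TOTAL_PAGES = 652
PER_PAGE = 20   # Assumed number of poets on a full page
LAST_PAGE = 15  # Assumed number of poets on the final page

# 651 full pages plus the shorter last page.
upper_bound = (TOTAL_PAGES - 1) * PER_PAGE + LAST_PAGE
print(upper_bound)  # 13035
```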

import time
import random
import requests
from parsel import Selector
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

# Constants
TOTAL_PAGES = 652  # Adjust as needed
MAX_RETRIES = 5
TIMEOUT = 10  # Timeout for requests, in seconds
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
]

"""
# Proxy list (replace with your own proxies)
PROXIES = [
    "http://183.164.243.143:8089",
    "http://200.37.186.66:8080",
    "http://183.89.247.182:8080",
]

# Helper function
def get_random_proxy():
    return {'http': random.choice(PROXIES), 'https': random.choice(PROXIES)}
"""


def fetch_data(url):
    retry_count = 0
    while retry_count < MAX_RETRIES:
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        print(f"Requesting {url}...")  # Print the requested URL
        try:
            response = requests.get(url, headers=headers, timeout=TIMEOUT)
            # response = requests.get(url, headers=headers, proxies=get_random_proxy(), timeout=TIMEOUT)
            response.raise_for_status()  # Raise an error for bad responses
            return response.text
        except requests.RequestException as e:
            delay_time = random.uniform(2, 6)  # Random delay between retries
            # e.response is None when no HTTP response was received (e.g. a timeout)
            status = e.response.status_code if e.response is not None else 'N/A'
            print(f"Request failed: {e}, status code: {status}, retrying in {delay_time:.2f} s...")
            time.sleep(delay_time)
            retry_count += 1
    print(f"Maximum retries reached, giving up on {url}.")
    return None


def parse_authors_data(html, page_number):
    selector = Selector(html)
    authors_name = selector.css('.cont1 .list li .infor .name::text').getall()
    authors_introduction = selector.css('.cont1 .list li .infor .text::text').getall()
    # Clean the introductions; fill in '-' when one is empty
    cleaned_introductions = [
        ' '.join(intro.strip().replace('\n', ' ').split()) if intro.strip() else '-'
        for intro in authors_introduction
    ]
    # Create the DataFrame and add the page-number column
    return pd.DataFrame({
        'page': [page_number] * len(authors_name),  # Page-number column
        'name': authors_name,
        'introduction': cleaned_introductions[:len(authors_name)]
    })


def main():
    frames = []  # Per-page DataFrames, concatenated once at the end
    # Create a list of URLs to scrape
    urls = [
        'https://www.shicimingju.com/category/all' if page == 1
        else f'https://www.shicimingju.com/category/all_{page}'
        for page in range(1, TOTAL_PAGES + 1)
    ]
    # Use ThreadPoolExecutor to fetch data concurrently
    with ThreadPoolExecutor(max_workers=10) as executor:  # You can adjust the number of workers
        future_to_url = {executor.submit(fetch_data, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            # Compute the page number first so that both branches below can use it
            page_number = urls.index(url) + 1
            try:
                html = future.result()
                # Check if HTML is returned properly
                if html:
                    frames.append(parse_authors_data(html, page_number))
                else:
                    print(f"The HTML content of page {page_number} is empty.")
            except Exception as e:
                print(f"Error while processing {url}: {e}")
    # Write all of the data to a single worksheet
    all_authors_data = (
        pd.concat(frames, ignore_index=True)
        if frames else pd.DataFrame(columns=['page', 'name', 'introduction'])
    )
    all_authors_data.to_excel('authors_info.xlsx', index=False)
    print("All data has been written to authors_info.xlsx")


if __name__ == "__main__":
    main()