
Date: 2025/7/12 6:49:58 Source: https://blog.csdn.net/meisongqing/article/details/147086136

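Below is a crawler for the Java modifier types page on 菜鸟教程 (runoob.com). It extracts the content of the main page and then fetches the sub-pages that the main page links to: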

```python
import os
import re
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Configuration
BASE_URL = 'https://www.runoob.com/java/java-modifier-types.html'
OUTPUT_DIR = 'java_modifiers_data'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}


def init_directory():
    """Create the output directory if it does not exist yet."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)


def get_page_content(url):
    """Fetch a page and return its HTML, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        response.encoding = 'utf-8'
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None


def get_title(soup):
    """Return the text of the first <h1>, or an empty string if there is none."""
    h1 = soup.find('h1')
    return h1.text.strip() if h1 else ''


def parse_main_page(html):
    """Parse the main page into a title, a list of sections, and sub-page links."""
    soup = BeautifulSoup(html, 'lxml')
    main_content = soup.find('div', class_='article-body')
    if not main_content:
        return None

    title = get_title(soup)
    print(f"正在解析主页面: {title}")

    # Group paragraphs and code blocks under the nearest preceding heading.
    sections = []
    current_section = None
    for section in main_content.find_all(['h2', 'h3', 'p', 'pre']):
        if section.name in ['h2', 'h3']:
            current_section = {'title': section.text.strip(), 'content': []}
            sections.append(current_section)
            continue
        if current_section is None:
            # Content that appears before the first heading goes into an untitled section.
            current_section = {'title': '', 'content': []}
            sections.append(current_section)
        kind = 'code' if section.name == 'pre' else 'text'
        current_section['content'].append((kind, section.text.strip()))

    # Collect links to other pages, resolved to absolute URLs.
    sub_links = []
    for link in main_content.find_all('a', href=True):
        full_url = urljoin(BASE_URL, link['href'])
        if full_url != BASE_URL and 'java-modifier-types' not in full_url:
            sub_links.append(full_url)

    return {
        'title': title,
        'sections': sections,
        'sub_links': list(dict.fromkeys(sub_links)),  # deduplicate, keep first-seen order
    }


def parse_sub_page(html):
    """Parse a sub-page into a title and a flat list of typed content items."""
    soup = BeautifulSoup(html, 'lxml')
    main_content = soup.find('div', class_='article-body')
    if not main_content:
        return None

    title = get_title(soup)
    print(f"正在解析子页面: {title}")

    content = []
    for element in main_content.find_all(['h2', 'h3', 'p', 'pre', 'table']):
        if element.name in ['h2', 'h3']:
            content.append(('header', element.text.strip()))
        elif element.name == 'pre':
            content.append(('code', element.text.strip()))
        elif element.name == 'table':
            # Keep each table as a list of rows, each row a list of cell texts.
            rows = []
            for tr in element.find_all('tr'):
                cells = [td.text.strip() for td in tr.find_all(['th', 'td'])]
                rows.append(cells)
            content.append(('table', rows))
        else:
            content.append(('text', element.text.strip()))

    return {'title': title, 'content': content}


def save_data(data, filename):
    """Write parsed data (or a plain string) to a text file in OUTPUT_DIR."""
    path = os.path.join(OUTPUT_DIR, filename)
    with open(path, 'w', encoding='utf-8') as f:
        if isinstance(data, dict):
            f.write(f"标题: {data.get('title', '')}\n\n")
            if 'sections' in data:  # main page
                for section in data['sections']:
                    f.write(f"## {section['title']} ##\n")
                    for content_type, text in section['content']:
                        if content_type == 'code':
                            f.write(f"\n代码示例:\n{text}\n")
                        else:
                            f.write(f"{text}\n")
                    f.write("\n")
            else:  # sub-page
                for content_type, content in data['content']:
                    if content_type == 'header':
                        f.write(f"\n### {content} ###\n")
                    elif content_type == 'code':
                        f.write(f"\n代码示例:\n{content}\n")
                    elif content_type == 'table':
                        f.write("\n表格数据:\n")
                        for row in content:
                            f.write(" | ".join(row) + "\n")
                    else:
                        f.write(f"{content}\n")
        else:
            f.write(data)
    print(f"已保存文件: {filename}")


def main():
    init_directory()

    # Main page
    main_html = get_page_content(BASE_URL)
    if not main_html:
        return
    main_data = parse_main_page(main_html)
    if not main_data:
        return
    save_data(main_data, '00_主页面.txt')

    # Sub-pages
    for idx, sub_url in enumerate(main_data['sub_links'], 1):
        time.sleep(1)  # polite delay between requests
        sub_html = get_page_content(sub_url)
        if not sub_html:
            continue
        sub_data = parse_sub_page(sub_html)
        if sub_data:
            # Replace characters that are not allowed in file names.
            safe_title = re.sub(r'[\\/:*?"<>|]', '_', sub_data['title'])
            save_data(sub_data, f"{idx:02d}_{safe_title}.txt")


if __name__ == '__main__':
    main()
```

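Code notes:

Initial configuration:

Sets the target URL and the output directory
Adds a User-Agent header so that requests look like they come from a browser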

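Main functions:

get_page_content(): sends the HTTP request and returns the page HTML
parse_main_page(): parses the main page and extracts its sections and sub-page links
parse_sub_page(): parses a sub-page, recognizing tables and code blocks
save_data(): writes the parsed data to a structured text file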
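The section grouping inside parse_main_page() can be tried without a network connection or an HTML parser. The sketch below is only an illustration: `group_into_sections` is a hypothetical helper, not a function from the script above, and it works on plain `(tag, text)` pairs instead of BeautifulSoup elements. It also assumes that content appearing before the first heading should land in an untitled section.

```python
def group_into_sections(elements):
    """Group (tag, text) pairs under the nearest preceding h2/h3 heading."""
    sections = []
    current = None
    for tag, text in elements:
        if tag in ('h2', 'h3'):
            # A heading starts a new section.
            current = {'title': text, 'content': []}
            sections.append(current)
            continue
        if current is None:
            # Assumption: content before the first heading gets an untitled section.
            current = {'title': '', 'content': []}
            sections.append(current)
        kind = 'code' if tag == 'pre' else 'text'
        current['content'].append((kind, text))
    return sections


# Hypothetical sample input standing in for the parsed page elements.
sample = [
    ('p', 'Intro paragraph'),
    ('h2', 'Access modifiers'),
    ('p', 'Java has four access levels.'),
    ('pre', 'public class Demo {}'),
]
for section in group_into_sections(sample):
    print(repr(section['title']), section['content'])
```

Keeping the grouping logic separate from the parsing makes it easy to test against small hand-written inputs when the site layout changes.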

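Data handling:

Recognizes headings, body text, code samples, and tables in a page
Converts relative links to absolute links
Removes duplicate sub-page links
Prefixes each saved file with a sequence number so the files sort naturally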
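As a rough illustration of the link handling described above, the following sketch resolves hrefs against the page URL and removes duplicates while keeping first-seen order. `collect_links` is a hypothetical helper and the sample hrefs are made up. Unlike the script, which filters by substring, this variant strips URL fragments with `urldefrag` so that in-page anchors are recognized as links back to the same page.

```python
from urllib.parse import urljoin, urldefrag

BASE_URL = 'https://www.runoob.com/java/java-modifier-types.html'


def collect_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs to absolute URLs, skip self-links, keep first-seen order."""
    base_clean, _ = urldefrag(base_url)
    seen = {}
    for href in hrefs:
        # urljoin handles relative, root-relative, and absolute hrefs alike.
        absolute, _ = urldefrag(urljoin(base_url, href))
        if absolute == base_clean:
            continue  # anchor or link pointing back at the page itself
        seen.setdefault(absolute, None)  # dict keys preserve insertion order
    return list(seen)


# Hypothetical hrefs for illustration only.
links = collect_links(['page-a.html', '/java/page-a.html', '#top', 'page-b.html'])
for link in links:
    print(link)
```

Using a dict instead of a set keeps the links in the order they were found, which matters when file numbers are assigned from that order.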

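Output file layout:

The main page is saved as 00_主页面.txt
Sub-pages are saved as XX_title.txt, where XX is a sequence number assigned in the order the links are processed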
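Page titles can contain characters that are not valid in file names, such as `/` or `|`. A minimal sketch of building the numbered file name safely is shown below; `make_filename` and its `max_length` parameter are assumptions for illustration and are not part of the original script.

```python
import re


def make_filename(index, title, max_length=80):
    """Build a numbered .txt file name from a page title."""
    # Collapse runs of whitespace and characters that are invalid
    # on common file systems into a single underscore.
    safe = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    # Fall back to a placeholder if nothing usable remains.
    safe = safe[:max_length] or 'untitled'
    return f"{index:02d}_{safe}.txt"


print(make_filename(1, 'Java 修饰符 | 菜鸟教程'))
```

The two-digit prefix keeps files sorted correctly up to 99 sub-pages; a wider format such as `{index:03d}` would be needed beyond that.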

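Dealing with anti-scraping measures:

Sends a realistic User-Agent
Waits 1 second between requests
Decodes every response as UTF-8 to avoid encoding problems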
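A fixed one-second sleep is the simplest way to space out requests. One possible refinement, sketched here under the assumption that a minimum interval between request starts is what matters, is a small throttle that only sleeps for the time still remaining. The `Throttle` class is hypothetical; its `clock` and `sleep` parameters exist only so the behaviour can be tested without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the crawler loop, calling `throttle.wait()` right before each request would take the place of `time.sleep(1)`.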

Usage:

Install the dependencies:

```bash
pip install requests beautifulsoup4 lxml
```

Running the script creates:

The java_modifiers_data directory
Text files for the main page and for every sub-page

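Notes:

Follow the site's robots.txt rules and keep the crawl rate low
Adjust the parsing logic whenever the site structure changes
Consider adding more thorough exception handling and a retry mechanism to make the crawler more robust
Save important data promptly so that pages do not have to be crawled again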
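The retry mechanism suggested above could look roughly like the following. This is a sketch under assumptions, not part of the original script: `retry` is a hypothetical helper, the doubling delay is one common choice among many, and the `sleep` parameter is injectable only to make testing easier.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Because get_page_content() catches request errors and returns None, the retry would have to wrap the raw request instead, for example `retry(lambda: requests.get(url, headers=HEADERS, timeout=10), exceptions=(requests.exceptions.RequestException,))`.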

To handle more complex content or to save in another format such as JSON, modify save_data() and the parsing logic.
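As one way to do this, the dictionaries returned by the parsing functions can be written out with the standard `json` module. The `save_json` helper below is a hypothetical sketch; its default `output_dir` simply reuses the directory name from the script.

```python
import json
import os


def save_json(data, filename, output_dir='java_modifiers_data'):
    """Write parsed page data to a UTF-8 JSON file and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese text readable in the output file.
        json.dump(data, f, ensure_ascii=False, indent=2)
    return path
```

Note that JSON has no tuple type, so `('code', '...')` pairs come back as two-element lists when the file is loaded again.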
