Date: 2025/8/6 1:15:00 Source: https://blog.csdn.net/rootb/article/details/146871251

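Scraping Behance Projects with Playwright: A Detailed Guide 📸

Hey everyone! Today I'm sharing a complete example that uses Playwright and Python to scrape projects from Behance. Behance is a platform for showcasing creative work, so if you want to pull the portfolios that match a given keyword, this code is a decent starting point. Let's take a look! 🤓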

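Code Walkthrough

Below is scrape_behance_projects, a function that scrapes projects from the Behance website. It opens a Behance search page, extracts each project's title and URL, and returns the collected projects. You can tune max_projects, scroll_delay, and proxy to fit your needs.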

from playwright.sync_api import sync_playwright
import time
import json


def clean_title(title):
    """Clean up a project title."""
    return title.strip() if title else "No Title"


def scrape_behance_projects(max_projects=10, scroll_delay=1.5, proxy=None):
    """Scrape Behance projects.

    Args:
        max_projects: maximum number of projects to collect
        scroll_delay: seconds to wait for content to load after each scroll
        proxy: proxy settings to use (if needed)

    Returns:
        A tuple (projects, project_urls): the list of projects, each with a
        title and a detail URL, and the list of URLs on their own.
    """
    with sync_playwright() as p:
        browser = None
        try:
            browser = p.chromium.launch(headless=False)
            context_options = {"viewport": {"width": 1920, "height": 1080}}
            if proxy:
                context_options["proxy"] = proxy
            context = browser.new_context(**context_options)
            page = context.new_page()

            url = "https://www.behance.net/search/projects/jetour?tracking_source=typeahead_nav_recent_suggestion"
            page.goto(url)
            print(f"Visiting page: {url}")
            page.wait_for_load_state("networkidle")

            projects = []
            seen_urls = set()
            count = 0
            print(f"Starting scrape, target: {max_projects} projects")

            while count < max_projects:
                links = page.locator("a.ProjectCoverNeue-coverLink-U39").element_handles()
                print(f"Found {len(links)} projects on the current page")
                new_found = False

                for link in links:
                    href = link.get_attribute("href")
                    title = link.get_attribute("title")
                    # Turn relative links into absolute URLs
                    if href and href.startswith("/"):
                        href = f"https://www.behance.net{href}"
                    # Skip links without an href and links already collected
                    if not href or href in seen_urls:
                        continue
                    clean_text = clean_title(title)
                    projects.append({"title": clean_text, "url": href})
                    seen_urls.add(href)
                    new_found = True
                    count += 1
                    print(f"[{count}/{max_projects}]  Found project: {clean_text}")
                    if count >= max_projects:
                        break

                if count >= max_projects:
                    break

                if not new_found:
                    # Nothing new in this pass: scroll up to 5 times and stop
                    # as soon as a link we have not seen yet shows up.
                    print(f"Scrolling to load more... current: {count}/{max_projects}")
                    more_loaded = False
                    for _ in range(5):
                        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                        time.sleep(scroll_delay)
                        for link in page.locator("a.ProjectCoverNeue-coverLink-U39").element_handles():
                            href = link.get_attribute("href")
                            if href and href.startswith("/"):
                                href = f"https://www.behance.net{href}"
                            if href and href not in seen_urls:
                                more_loaded = True
                                break
                        if more_loaded:
                            break
                    if not more_loaded:
                        print("Reached the bottom of the content, no more projects to load")
                        break

            print(f"Scrape finished! Collected {len(projects)} projects")
            print(json.dumps(projects[:5], ensure_ascii=False))  # print the first 5 projects
            project_urls = [project["url"] for project in projects]
            return projects, project_urls
        finally:
            # Always close the browser, even if an error occurred above
            if browser:
                browser.close()
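The function above hard-codes the search URL for the keyword jetour. If you want to search for a different keyword, a small helper like the one below is one way to build the URL. This is only a sketch: build_search_url and BEHANCE_SEARCH_BASE are names I made up for illustration, and it assumes the /search/projects/<keyword> path pattern seen in the URL above keeps working without the tracking_source query parameter.

```python
from urllib.parse import quote

# Assumption: path pattern copied from the search URL used in the example above.
BEHANCE_SEARCH_BASE = "https://www.behance.net/search/projects"


def build_search_url(keyword: str) -> str:
    """Return a Behance project-search URL for `keyword` (hypothetical helper)."""
    keyword = keyword.strip()
    if not keyword:
        raise ValueError("keyword must not be empty")
    # quote() percent-encodes spaces and non-ASCII characters so the
    # keyword is safe to use as a single URL path segment.
    return f"{BEHANCE_SEARCH_BASE}/{quote(keyword, safe='')}"


if __name__ == "__main__":
    print(build_search_url("jetour"))
    print(build_search_url("car design"))
```

To use it, you would replace the hard-coded url in scrape_behance_projects with build_search_url(your_keyword), for example by adding a keyword parameter to the function.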

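Key Features and Usage

Initializing Playwright:
- sync_playwright() starts the Playwright environment.
- p.chromium.launch() launches a Chromium browser. headless=False opens a visible browser window, which is handy for debugging.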
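Page interaction:
- page.goto(url) navigates to the given page.
- page.wait_for_load_state("networkidle") waits until there have been no network connections for at least 500 ms, a rough signal that the page has finished loading.
- page.locator("a.ProjectCoverNeue-coverLink-U39") locates the project links with a CSS selector.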
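One thing to watch with that selector: the -U39 suffix on the class name looks like a build-generated hash (that is my assumption, not something the site documents), so it may change when the site is redeployed. A possible workaround is to match on the stable-looking prefix with a CSS substring attribute selector. The helper name class_prefix_selector below is hypothetical.

```python
def class_prefix_selector(tag: str, class_prefix: str) -> str:
    """Build a CSS selector for `tag` elements whose class attribute
    contains `class_prefix` (hypothetical helper)."""
    if '"' in class_prefix or "\\" in class_prefix:
        raise ValueError("class_prefix must not contain quotes or backslashes")
    # [class*="..."] is the standard CSS "attribute contains substring" selector.
    return f'{tag}[class*="{class_prefix}"]'


# Assumption: everything before the "-U39" suffix stays stable across deployments.
PROJECT_LINK_SELECTOR = class_prefix_selector("a", "ProjectCoverNeue-coverLink")

if __name__ == "__main__":
    print(PROJECT_LINK_SELECTOR)
```

With an open Playwright page you would then call page.locator(PROJECT_LINK_SELECTOR) instead of passing the full class name.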
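Scrolling and dynamic loading:
- page.evaluate("window.scrollTo(0, document.body.scrollHeight)") scrolls to the bottom of the page to trigger dynamic loading.
- time.sleep(scroll_delay) simply pauses to give the new content time to load; it does not check that loading has actually finished.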
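The scroll-then-wait idea can also be pulled out into a small, browser-independent helper, which makes it easy to test without opening a page. The sketch below keeps scrolling until the item count stops growing for a few rounds in a row. scroll_until_stable and its parameters are names I chose for illustration, and the stop rule (three idle rounds) is an arbitrary assumption.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll: Callable[[], None],
    count_items: Callable[[], int],
    delay: float = 1.5,
    max_idle_rounds: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the item count stops growing (hypothetical helper).

    Stops after `max_idle_rounds` consecutive scrolls that load nothing
    new, and returns the last observed item count.
    """
    last_count = count_items()
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        scroll()
        sleep(delay)  # give lazily loaded content time to appear
        current = count_items()
        if current > last_count:
            last_count = current
            idle_rounds = 0  # progress was made, reset the idle counter
        else:
            idle_rounds += 1
    return last_count


if __name__ == "__main__":
    # Simulated page: the item count grows twice, then stays at 30.
    fake_counts = iter([10, 20, 30, 30, 30, 30])
    total = scroll_until_stable(
        scroll=lambda: None,
        count_items=lambda: next(fake_counts),
        delay=0,
    )
    print(total)
```

With a real page you could pass scroll=lambda: page.evaluate("window.scrollTo(0, document.body.scrollHeight)") and count_items=lambda: page.locator(selector).count(). On a feed that never ends this loop would not stop by itself, so in practice you would also cap the total number of rounds.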
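Data storage and deduplication:
- The seen_urls set stores the URLs of projects already processed, so duplicates are skipped.
- Each project's title and URL are saved to the projects list.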
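Here is a minimal sketch of the same dedupe-and-store idea as standalone functions, plus one way to save the results to a JSON file. dedupe_projects and save_projects are hypothetical names, and it uses urljoin rather than the string concatenation from the original code; unlike clean_title, it also treats a whitespace-only title as missing.

```python
import json
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = "https://www.behance.net"


def dedupe_projects(raw_items):
    """Turn (href, title) pairs into unique project dicts (hypothetical helper).

    Relative hrefs are resolved against BASE_URL; items without an href
    and items whose URL was already seen are skipped.
    """
    seen_urls = set()
    projects = []
    for href, title in raw_items:
        if not href:
            continue
        url = urljoin(BASE_URL, href)  # absolute hrefs pass through unchanged
        if url in seen_urls:
            continue
        seen_urls.add(url)
        projects.append({"title": (title or "").strip() or "No Title", "url": url})
    return projects


def save_projects(projects, path):
    """Write the project list to `path` as UTF-8 JSON."""
    text = json.dumps(projects, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    sample = [("/gallery/1/A", " A "), ("/gallery/1/A", "A again"), ("/gallery/2/B", None)]
    print(dedupe_projects(sample))
```

Keeping ensure_ascii=False matches the original code and leaves non-English titles readable in the output file.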
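Error handling:
- A try...finally block makes sure the browser is closed properly before the program exits, even if an error occurs along the way.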
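If you'd like to see the cleanup guarantee in isolation, here is a tiny sketch that uses contextlib.closing from the standard library, which calls close() on exit just like the try...finally above. FakeBrowser is a stand-in I invented so the example runs without Playwright, and run_with_cleanup is a hypothetical helper.

```python
from contextlib import closing


class FakeBrowser:
    """Stand-in for a browser object; it only records whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_cleanup(browser, work):
    """Run work(browser) and close the browser even if work raises."""
    with closing(browser):
        return work(browser)


if __name__ == "__main__":
    browser = FakeBrowser()
    try:
        run_with_cleanup(browser, lambda b: 1 / 0)  # simulate a scraping error
    except ZeroDivisionError:
        pass
    print(browser.closed)
```

The error still propagates to the caller; the only guarantee is that close() has been called by then.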

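Tips and Caveats

Headless mode: for real runs you can set headless=True. The browser window is then not shown, which uses fewer resources.
Anti-scraping measures: the site may have anti-scraping protections in place, so keep your request rate down and set sensible delays.
Proxy support: the code includes optional proxy support through the proxy parameter, which you can enable when you need it.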
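For the proxy tip, Playwright expects the proxy setting as a dict with a required server key and optional username and password keys. A small sketch for building one is below. make_proxy is a hypothetical helper, and the address in the example is a placeholder, not a real proxy.

```python
from typing import Optional


def make_proxy(
    server: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> dict:
    """Build a Playwright-style proxy settings dict (hypothetical helper)."""
    if not server:
        raise ValueError("server must not be empty")
    proxy = {"server": server}
    # Credentials are only included when provided.
    if username is not None:
        proxy["username"] = username
    if password is not None:
        proxy["password"] = password
    return proxy


if __name__ == "__main__":
    # Placeholder address: replace it with your own proxy.
    print(make_proxy("http://127.0.0.1:8080"))
```

You would then call scrape_behance_projects(proxy=make_proxy("http://127.0.0.1:8080")), and the dict is passed through to browser.new_context.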

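This example shows how to use Playwright for simple web scraping, and it offers a few ideas for dealing with sites that load their content dynamically. If you're curious about scraping techniques, keep exploring and experimenting! Happy Scraping! 🎉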