Multithreaded Product Data Scraping with Python and Selenium: A Tutorial

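In this tutorial, we use Python and the Selenium library to build a multithreaded scraper that extracts product data from a website. We parse the page content with BeautifulSoup and save the extracted data to a CSV file. Our goals are:

Scrape the product page behind each URL in a given list of URLs.
Extract the relevant data from each product page, such as the product name, type, SKU, story, and main picture URLs.
Save the extracted data to a CSV file, handling any errors that occur along the way.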

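Prerequisites

Before starting, make sure the following software and libraries are installed:

Python (version 3.6 or later)
Selenium (to drive the browser)
BeautifulSoup (to parse page content)
The Chrome browser
ChromeDriver (matching the installed browser version)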
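
As a quick sanity check of the list above, the following sketch reports which prerequisites appear to be available. It is not part of the original tutorial. The function name check_environment is made up for illustration, and the check assumes the driver binary is called "chromedriver" and lives on PATH, which may not match your setup.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict describing which prerequisites appear to be available."""
    return {
        # The tutorial asks for Python 3.6 or later.
        "python>=3.6": sys.version_info >= (3, 6),
        # find_spec returns None when a top-level package is not installed.
        "selenium": importlib.util.find_spec("selenium") is not None,
        "bs4": importlib.util.find_spec("bs4") is not None,
        # Assumption: the driver binary is named "chromedriver" and is on PATH.
        "chromedriver on PATH": shutil.which("chromedriver") is not None,
    }


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'found' if ok else 'missing'}")
```

A missing "chromedriver on PATH" entry is not necessarily a problem: recent Selenium releases can usually locate or download a matching driver on their own.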

Implementation

First, import the required libraries:

import time
import csv
import threading
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import json

Next, we define a few global variables and a function that processes a link and saves its data:

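# Lock object for thread synchronization (guards writes to the CSV files)
lock = threading.Lock()
# Limit the number of threads running at the same time
max_threads = 8
thread_semaphore = threading.Semaphore(max_threads)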
# Define the function that processes a single link
def process_link(url):
    driver = None
    try:
        # Create a Chrome browser instance (inside the try block, so that a
        # failed launch still releases the semaphore in the finally block)
        driver = webdriver.Chrome()
        # Open the target page
        driver.get("https://www.xxx.com" + url)
        # Get the page source
        content = driver.page_source
        # Parse the page source with BeautifulSoup
        soup = BeautifulSoup(content, "html.parser")
        # Find the <script> tag and extract its JSON data
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
        if script_tag:
            json_data = script_tag.contents[0]
            data = json.loads(json_data)
            # Extract the field values
            template = data['props']['pageProps']['productTemplate']
            name = template['name']
            silhouette = template['silhouette']
            sku = template['sku'].replace(' ', '-')
            story = template.get('story', '')
            # Handle fields that may be missing
            pictures = template.get('productTemplateExternalPictures', [])
            if pictures:
                main_picture_urls = ','.join([picture.get('mainPictureUrl', '') for picture in pictures])
            else:
                image_url = template.get('image_url', '')
                main_picture_urls = image_url
            # Append the row to the CSV file
            csv_file = 'output.csv'
            with lock:
                with open(csv_file, 'a', newline='', encoding='utf-8') as file:
                    writer = csv.writer(file)
                    writer.writerow([name, silhouette, sku, story, main_picture_urls])
            print(f"Data saved to CSV file: {csv_file}")
        else:
            raise Exception("No <script> tag containing JSON data was found")
    except Exception as e:
        print(f"Error while processing link: {url}")
        print(e)
        # Save the failed link to a separate CSV file
        csv_file_error = 'error_links.csv'
        with lock:
            with open(csv_file_error, 'a', newline='', encoding='utf-8') as file:
                writer = csv.writer(file)
                writer.writerow([url])
        print(f"Failed link saved to CSV file: {csv_file_error}")
    finally:
        # Close the browser if it was started
        if driver is not None:
            driver.quit()
        # Release the semaphore so that another thread can run
        thread_semaphore.release()

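In the code above, we define a process_link function that takes a URL as its argument and performs the following steps:

Create a Chrome browser instance.
Open the target page with Selenium.
Parse the page content with BeautifulSoup.
Find the <script> tag that contains the JSON data and extract the required field values.
Handle fields that may be missing, and save the extracted data to a CSV file.
If an error occurs, save the failed link to a separate CSV file.
Finally, close the browser and release the thread semaphore.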
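
The field-extraction step can be tried in isolation, without a browser. The sketch below pulls the same five fields out of an already parsed payload and applies the same fallback from productTemplateExternalPictures to image_url. The helper name extract_product_fields and the sample payload are made up for illustration; the real page data will be much larger and may differ in shape.

```python
import json


def extract_product_fields(data):
    """Pull the tutorial's five fields out of a parsed __NEXT_DATA__ payload."""
    template = data["props"]["pageProps"]["productTemplate"]
    # Treat both a missing key and an explicit null as "no pictures".
    pictures = template.get("productTemplateExternalPictures") or []
    if pictures:
        main_picture_urls = ",".join(p.get("mainPictureUrl", "") for p in pictures)
    else:
        # Fall back to the single image_url field when no pictures are listed.
        main_picture_urls = template.get("image_url", "")
    return [
        template["name"],
        template["silhouette"],
        template["sku"].replace(" ", "-"),
        template.get("story", ""),
        main_picture_urls,
    ]


# Made-up payload, used only to exercise the function.
sample = json.loads("""
{"props": {"pageProps": {"productTemplate": {
    "name": "Example Shoe",
    "silhouette": "Runner",
    "sku": "AB 123",
    "image_url": "https://example.com/a.jpg"
}}}}
""")
row = extract_product_fields(sample)
print(row)
```

Keeping the extraction in a small pure function like this makes it possible to test the parsing logic against saved payloads before running the full scraper.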

Now we need to read the list of links and create a thread to process each one. Here is the corresponding code:

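# File that contains the list of links
urls_file = 'result.csv'
# List of started threads
threads = []
# Open the link file
with open(urls_file, 'r', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        url = row[0]
        # Acquire the semaphore to limit the number of concurrent threads
        thread_semaphore.acquire()
        # Create and start the thread
        thread = threading.Thread(target=process_link, args=(url,))
        thread.start()
        threads.append(thread)
        # Pause briefly to reduce resource contention from starting many threads at once
        time.sleep(0.5)
# Wait for all threads to finish
for thread in threads:
    thread.join()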

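In the code above, we first specify the path of the link file and create a list to hold the threads. We then open the file and read the URLs from it. For each URL, we acquire the semaphore, which caps the number of threads running at once, and start a thread to process the link. After starting each thread, we pause briefly to reduce resource contention from launching many threads at once. Finally, we call join() on every thread to wait for all of them to finish.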
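
As an alternative to managing a semaphore and a thread list by hand, the standard library's concurrent.futures.ThreadPoolExecutor can enforce the same concurrency limit. The sketch below is not the tutorial's method. It uses a hypothetical stand-in, fake_process_link, that sleeps instead of opening a browser, so it can be run without Chrome, and it records the peak number of simultaneous workers to show that the limit holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # mirrors max_threads in the tutorial

# Shared counters used only to observe how many workers run at once.
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def fake_process_link(url):
    """Hypothetical stand-in for process_link: sleeps instead of opening a browser."""
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # pretend to load and parse a page
    with stats_lock:
        stats["active"] -= 1
    return url


def run_all(urls, max_workers=MAX_WORKERS):
    """Process every URL with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; leaving the with-block waits for all tasks.
        return list(pool.map(fake_process_link, urls))


urls = [f"/product/{i}" for i in range(30)]
results = run_all(urls)
print(len(results), "links processed, peak concurrency:", stats["peak"])
```

If the real process_link were run through a pool like this, the calls to thread_semaphore.acquire() and thread_semaphore.release() would no longer be needed, because the pool itself caps the number of running workers.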