Learning Something in My Spare Time
The goal of this project is to scrape product review data from e-commerce platforms for user data analysis; in other words, to write a crawler for that purpose. To get there:
We need to analyze the page source in the Chrome browser and locate the key tags that carry the data we want. Working out which parts of the page change and which stay fixed is what makes automated traversal of the deeper pages possible, so this analysis is worth doing carefully.
Analysis of an Amazon product review page: the tags in bold are the content we want to scrape. Below I use one of our company's products as an example of the steps:
1) rating-star: 4.0 out of 5 stars
2) review-title: A Significant Entry by Jabra
3) review-author: Barredbard
4) review-date: on June 9, 2017
5) review-body:
Review page URL information, using the JOBS review pages as an example:
Page 1:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1
Page 2:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
Page 3:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3
Page 4:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_4?ie=UTF8&reviewerType=all_reviews&pageNumber=4
From the review page URLs above you can see how a page relates to its address: only the page number changes (it appears both in the ref=...btm_ suffix and in the pageNumber parameter), so every page can be visited with the code below.
# iterate over all review pages
total_page = 100  # replace with the actual number of review pages for the product
for i in range(1, total_page + 1):
    JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
# the draft code
from lxml import html
import csv, os, json
import requests
from time import sleep

def AmazonParser(url):
    # Pretend to be a normal browser so the target site is less likely to reject the scraper.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    try:
        sleep(3)  # pause between requests to avoid hammering the server
        page = requests.get(url, headers=headers)
        return page.text
    except requests.RequestException:
        print("AmazonParser error: failed to fetch " + url)
        return None

total_page = 100  # replace with the actual number of review pages
for i in range(1, total_page + 1):
    JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
    page_html = AmazonParser(JOBS_review_link)
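The draft only fetches the raw HTML; nothing extracts the five fields listed earlier yet. Below is a minimal parsing sketch built on the lxml import. The XPath selectors are placeholders taken from the field labels above and are not verified against Amazon's live markup, so replace them with the real class/attribute names you find in Chrome's inspector.
from lxml import html as lxml_html

def parse_reviews(page_html):
    # Placeholder selectors: substitute the class/attribute names observed in Chrome.
    tree = lxml_html.fromstring(page_html)
    reviews = []
    for block in tree.xpath('//div[contains(@class, "review")]'):
        reviews.append({
            'rating': ' '.join(block.xpath('.//*[contains(@class, "rating-star")]//text()')).strip(),
            'title':  ' '.join(block.xpath('.//*[contains(@class, "review-title")]//text()')).strip(),
            'author': ' '.join(block.xpath('.//*[contains(@class, "review-author")]//text()')).strip(),
            'date':   ' '.join(block.xpath('.//*[contains(@class, "review-date")]//text()')).strip(),
            'body':   ' '.join(block.xpath('.//*[contains(@class, "review-body")]//text()')).strip(),
        })
    return reviews
parse_reviews(page_html) would then be called inside the page loop, and the resulting dictionaries written out with the csv module.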
Now let's look at how to scrape review data from JD.com. As before, use Chrome's developer tools to find the URL of the productCommentPage request and fetch the raw data as JSON. Put the list of products you want to scrape into an Excel file, like this:
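What that spreadsheet has to contain can be read off from the script further down: one row per product, with (at least) the columns product, pages, fetchJson, url_1 and url_2. The values below are placeholders, not real entries:
product:   JabraEliteSport             (also used as the output file name, <product>.tsv)
pages:     50                          (number of comment pages to fetch)
fetchJson: fetchJSON_comment98vvXXXX   (the JSONP callback name seen in Chrome's network panel)
url_1:     the productCommentPage URL up to the page number
url_2:     the rest of that URL after the page number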
Then a function call reads the product data from the Excel file, and the script fetches the pages as needed.
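The script imports excel_table_byindex from a local helper module op_excel1 that is not listed here. A minimal sketch of such a helper, assuming the first row of the sheet holds the column headers and using openpyxl (the original helper may well use a different library), could look like this:
# op_excel1.py -- minimal sketch, not the original helper
from openpyxl import load_workbook

def excel_table_byindex(file, sheet_index=0):
    # Read one worksheet and return a list of dicts, one per data row,
    # keyed by the header names found in the first row.
    wb = load_workbook(file, read_only=True)
    ws = wb.worksheets[sheet_index]
    rows = list(ws.iter_rows(values_only=True))
    headers = [str(h) for h in rows[0]]
    return [dict(zip(headers, row)) for row in rows[1:]]
With that, table[n] is the n-th product row as a dictionary, which is exactly how the main script below indexes it.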
# -*- coding: utf-8 -*-
"""
Created on Jan-22-2018
@author: jerry zhong
"""
import urllib.request
import json
import time
import random
import csv
from op_excel1 import excel_table_byindex

def crawlProductComment(url, product="", fetchJson=""):
    # Read the raw response (note the gbk encoding) and strip the JSONP wrapper.
    html = urllib.request.urlopen(url).read().decode('gbk')
    html = html.replace(fetchJson + '(', '')
    html = html.replace(');', '')
    # Parse the remaining JSON text into a dictionary.
    data = json.loads(html)
    # Walk through the list of product comments.
    for i in data['comments']:
        nickName = i['nickname']
        Score = str(i['score'])
        userClientShow = i['userClientShow']
        productColor = i['productColor']
        isMobile = str(i['isMobile'])
        commentTime = i['creationTime']
        content = i['content']
        # Write the key fields of each comment to a tab-separated file.
        try:
            with open(product + '.tsv', 'a', encoding='utf-8') as fh:
                fh.write(nickName + "\t" + Score + "\t" + userClientShow + "\t" + productColor + "\t" + isMobile + "\t" + commentTime + "\t" + content + "\n")
        except IOError:
            print("Failed to write the data into the file, or the list index is out of range.")

if __name__ == '__main__':
    print("Please input the product_list name: ")
    file = input() + '.xlsx'
    table = excel_table_byindex(file)
    # Ask for the row number of the product to scrape.
    print("Please input the row number of the product: ")
    product_row_num = int(input())
    product_row = table[product_row_num]
    # Number of comment pages to fetch.
    page_number = int(product_row["pages"])
    # Product name, used for the output file name.
    product = product_row["product"]
    # JSONP callback name to strip from the response.
    fetchJson = product_row['fetchJson']
    # The two halves of the comment URL, around the page number.
    url_1 = product_row["url_1"]
    url_2 = product_row["url_2"]
    for i in range(0, page_number + 1):
        print("Downloading page {} ...".format(i + 1))
        # JD comment URL; here the example is the comments for Jabra products.
        url = url_1 + str(i) + url_2
        crawlProductComment(url, product, fetchJson)
        # Sleep for a random interval between requests.
        time.sleep(random.randint(10, 15))
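To make the two replace() calls in crawlProductComment clearer: the comment endpoint returns JSONP, i.e. the JSON payload wrapped in the callback name stored in the fetchJson column. Schematically (the callback name and the comment below are made up):
import json

raw = 'fetchJSON_comment98vvXXXX({"comments": [{"nickname": "j***y", "score": 5}]});'
fetchJson = 'fetchJSON_comment98vvXXXX'
# Stripping the "callback(" prefix and the ");" suffix leaves plain JSON.
cleaned = raw.replace(fetchJson + '(', '').replace(');', '')
data = json.loads(cleaned)
print(data['comments'][0]['nickname'])  # j***y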
Run the script and it scrapes the data automatically.
The scraped data is saved per product as a TSV file, which can easily be imported into Excel for further analysis. Following this approach, you should have little trouble writing a crawler that fits your own needs.
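If you would rather stay in Python, the TSV can also be loaded directly, for example with pandas (not used above; the column names simply follow the order in which crawlProductComment writes the fields):
import pandas as pd

columns = ['nickName', 'score', 'userClientShow', 'productColor', 'isMobile', 'commentTime', 'content']
reviews = pd.read_csv('JabraEliteSport.tsv', sep='\t', names=columns, encoding='utf-8')  # example file name
print(reviews['score'].value_counts())  # e.g. the distribution of star ratings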