Learning Something in My Spare Time
The goal of this project is to scrape product review data from e-commerce platforms for user data analysis; in other words, to write a crawler for that purpose. To get there:
We need to analyze the page source in the Chrome browser and locate the key tags that carry the data we want. Working out which parts of the page change and which stay fixed is what makes automated traversal of the deeper pages possible, so this analysis is worth doing carefully.
Analysis of an Amazon product review page: the tags in bold are the content we want to scrape. Below I use one of our company's products as an example of the steps:
1) rating-star: 4.0 out of 5 stars
2) review-title: A Significant Entry by Jabra
3) review-author: Barredbard
4) review-date: on June 9, 2017
5) review-body:
Review page URL information, using the JOBS review pages as an example:
Page 1:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1
Page 2:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
Page 3:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3
Page 4:
https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_4?ie=UTF8&reviewerType=all_reviews&pageNumber=4
From the review page URLs above you can see how a page relates to its address: only the page number changes (it appears both in the ref=...btm_ suffix and in the pageNumber parameter), so every page can be visited with the code below.
# iterate over all review pages
total_page = 100  # replace with the actual number of review pages for the product
for i in range(1, total_page + 1):
    JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
# the draft code
from lxml import html
import csv, os, json
import requests
from time import sleep

def AmazonParser(url):
    # Pretend to be a normal browser so the target site is less likely to reject the scraper.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    try:
        sleep(3)  # pause between requests to avoid hammering the server
        page = requests.get(url, headers=headers)
        return page.text
    except requests.RequestException:
        print("AmazonParser error: failed to fetch " + url)
        return None

total_page = 100  # replace with the actual number of review pages
for i in range(1, total_page + 1):
    JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
    page_html = AmazonParser(JOBS_review_link)
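The draft only fetches the raw HTML; nothing extracts the five fields listed earlier yet. Below is a minimal parsing sketch built on the lxml import. The XPath selectors are placeholders taken from the field labels above and are not verified against Amazon's live markup, so replace them with the real class/attribute names you find in Chrome's inspector.
from lxml import html as lxml_html

def parse_reviews(page_html):
    # Placeholder selectors: substitute the class/attribute names observed in Chrome.
    tree = lxml_html.fromstring(page_html)
    reviews = []
    for block in tree.xpath('//div[contains(@class, "review")]'):
        reviews.append({
            'rating': ' '.join(block.xpath('.//*[contains(@class, "rating-star")]//text()')).strip(),
            'title':  ' '.join(block.xpath('.//*[contains(@class, "review-title")]//text()')).strip(),
            'author': ' '.join(block.xpath('.//*[contains(@class, "review-author")]//text()')).strip(),
            'date':   ' '.join(block.xpath('.//*[contains(@class, "review-date")]//text()')).strip(),
            'body':   ' '.join(block.xpath('.//*[contains(@class, "review-body")]//text()')).strip(),
        })
    return reviews
parse_reviews(page_html) would then be called inside the page loop, and the resulting dictionaries written out with the csv module.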
Now let's look at how to scrape review data from JD.com. As before, use Chrome's developer tools to find the URL of the productCommentPage request and fetch the raw data as JSON. Put the list of products you want to scrape into an Excel file, like this:
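What that spreadsheet has to contain can be read off from the script further down: one row per product, with (at least) the columns product, pages, fetchJson, url_1 and url_2. The values below are placeholders, not real entries:
product:   JabraEliteSport             (also used as the output file name, <product>.tsv)
pages:     50                          (number of comment pages to fetch)
fetchJson: fetchJSON_comment98vvXXXX   (the JSONP callback name seen in Chrome's network panel)
url_1:     the productCommentPage URL up to the page number
url_2:     the rest of that URL after the page number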
Then a function call reads the product data from the Excel file, and the script fetches the pages as needed.
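The script imports excel_table_byindex from a local helper module op_excel1 that is not listed here. A minimal sketch of such a helper, assuming the first row of the sheet holds the column headers and using openpyxl (the original helper may well use a different library), could look like this:
# op_excel1.py -- minimal sketch, not the original helper
from openpyxl import load_workbook

def excel_table_byindex(file, sheet_index=0):
    # Read one worksheet and return a list of dicts, one per data row,
    # keyed by the header names found in the first row.
    wb = load_workbook(file, read_only=True)
    ws = wb.worksheets[sheet_index]
    rows = list(ws.iter_rows(values_only=True))
    headers = [str(h) for h in rows[0]]
    return [dict(zip(headers, row)) for row in rows[1:]]
With that, table[n] is the n-th product row as a dictionary, which is exactly how the main script below indexes it.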
# -*- coding: utf-8 -*-
"""
Created on Jan-22-2018
@author: jerry zhong
"""
import urllib.request
import json
import time
import random
import csv
from op_excel1 import excel_table_byindex

def crawlProductComment(url, product="", fetchJson=""):
    # Read the raw response (note the gbk encoding) and strip the JSONP wrapper.
    html = urllib.request.urlopen(url).read().decode('gbk')
    html = html.replace(fetchJson + '(', '')
    html = html.replace(');', '')
    # Parse the remaining JSON text into a dictionary.
    data = json.loads(html)
    # Walk through the list of product comments.
    for i in data['comments']:
        nickName = i['nickname']
        Score = str(i['score'])
        userClientShow = i['userClientShow']
        productColor = i['productColor']
        isMobile = str(i['isMobile'])
        commentTime = i['creationTime']
        content = i['content']
        # Write the key fields of each comment to a tab-separated file.
        try:
            with open(product + '.tsv', 'a', encoding='utf-8') as fh:
                fh.write(nickName + "\t" + Score + "\t" + userClientShow + "\t" + productColor + "\t" + isMobile + "\t" + commentTime + "\t" + content + "\n")
        except IOError:
            print("Failed to write the data into the file, or the list index is out of range.")

if __name__ == '__main__':
    print("Please input the product_list name: ")
    file = input() + '.xlsx'
    table = excel_table_byindex(file)
    # Ask for the row number of the product to scrape.
    print("Please input the row number of the product: ")
    product_row_num = int(input())
    product_row = table[product_row_num]
    # Number of comment pages to fetch.
    page_number = int(product_row["pages"])
    # Product name, used for the output file name.
    product = product_row["product"]
    # JSONP callback name to strip from the response.
    fetchJson = product_row['fetchJson']
    # The two halves of the comment URL, around the page number.
    url_1 = product_row["url_1"]
    url_2 = product_row["url_2"]
    for i in range(0, page_number + 1):
        print("Downloading page {} ...".format(i + 1))
        # JD comment URL; here the example is the comments for Jabra products.
        url = url_1 + str(i) + url_2
        crawlProductComment(url, product, fetchJson)
        # Sleep for a random interval between requests.
        time.sleep(random.randint(10, 15))
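To make the two replace() calls in crawlProductComment clearer: the comment endpoint returns JSONP, i.e. the JSON payload wrapped in the callback name stored in the fetchJson column. Schematically (the callback name and the comment below are made up):
import json

raw = 'fetchJSON_comment98vvXXXX({"comments": [{"nickname": "j***y", "score": 5}]});'
fetchJson = 'fetchJSON_comment98vvXXXX'
# Stripping the "callback(" prefix and the ");" suffix leaves plain JSON.
cleaned = raw.replace(fetchJson + '(', '').replace(');', '')
data = json.loads(cleaned)
print(data['comments'][0]['nickname'])  # j***y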
Run the script and it scrapes the data automatically.
The scraped data is saved per product as a TSV file, which can easily be imported into Excel for further analysis. Following this approach, you should have little trouble writing a crawler that fits your own needs.
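If you would rather stay in Python, the TSV can also be loaded directly, for example with pandas (not used above; the column names simply follow the order in which crawlProductComment writes the fields):
import pandas as pd

columns = ['nickName', 'score', 'userClientShow', 'productColor', 'isMobile', 'commentTime', 'content']
reviews = pd.read_csv('JabraEliteSport.tsv', sep='\t', names=columns, encoding='utf-8')  # example file name
print(reviews['score'].value_counts())  # e.g. the distribution of star ratings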