Collecting and organizing data: a simple example with requests and pandas

October 2017 · 1 minute read

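People say this is the age of big data: it is not only the weather that gets recorded as data, nearly every aspect of the world is captured in data to some degree.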

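The biggest problem is that data resources are very unevenly distributed. In some areas data is massively redundant and its value is still waiting to be mined, while in others data is scarce or hard to obtain, so a lot of worthwhile work cannot even start.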

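Collecting and organizing data is basic groundwork. Today we will use two Python packages, requests and pandas, to try fetching Taobao's daily Top 100 list of items with rising attention, found at top.taobao.com.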

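First, use the browser to see how the data is loaded. For a while I could not find the data source in the F12 developer tools, which felt odd: would I really have to fall back on selenium? Then I found the data endpoint from the pagination link URLs, and a quick curl showed that the data is fairly simple. (Opening a pagination URL in the browser, however, always redirects straight to the home page URL, and the page source becomes the home page source, so the newly loaded data is not visible there.)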
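
To make the paging pattern concrete, here is a minimal sketch that only builds the page URLs, using the standard library so it runs without network access. The helper name page_urls and the constants are made up for illustration, and the page size of 20 is an assumption read off the s=20, s=40, ... offsets in the pagination links.

```python
from urllib.parse import urlencode

BASE_URL = 'https://top.taobao.com/index.php'
PAGE_SIZE = 20  # assumption: inferred from the s=20, s=40, ... offsets


def page_urls(total=100, page_size=PAGE_SIZE):
    """Return one URL per page; the first page carries no s offset."""
    urls = []
    for offset in range(0, total, page_size):
        params = {'rank': 'focus', 'type': 'up'}
        if offset:
            params['s'] = offset
        urls.append(BASE_URL + '?' + urlencode(params))
    return urls


for url in page_urls():
    print(url)
```

Each of these URLs can be tried with curl to confirm that the response actually carries the list data before writing any scraping code.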

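So we can fetch the pages directly with requests, use a regular expression to pull out the data we need, parse that string as JSON, and then turn it into a pandas DataFrame, which is easy to save as a CSV file or write to a database. The code is below; Python keeps things simple as always: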

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Taobao daily top search 100.

The list is paged through the s offset in steps of 20:

https://top.taobao.com/index.php?rank=focus&type=up
https://top.taobao.com/index.php?rank=focus&type=up&s=20
https://top.taobao.com/index.php?rank=focus&type=up&s=40
https://top.taobao.com/index.php?rank=focus&type=up&s=60
https://top.taobao.com/index.php?rank=focus&type=up&s=80
"""

import json
import re
from datetime import datetime

import pandas as pd
import requests


upOrDown = {
    1: 1,   # 1 means up
    2: 0,   # 2 means no change
    0: -1,  # 0 means down
}


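def taobao_top100():
    """Fetch every page of the list, save it as a timestamped CSV, and return it."""
    url = 'https://top.taobao.com/index.php'
    params = dict(rank='focus', type='up')
    frames = []
    for i in range(5):  # 5 pages x 20 entries = Top 100
        params.update(dict(s=i * 20))
        r = requests.get(url, params=params, timeout=10)
        if r.status_code != requests.codes.ok:
            print('Error with code {}'.format(r.status_code))
            continue
        # The entries sit in a JSON array right after "list": in the page source.
        # The non-greedy match assumes that array contains no nested "]".
        match = re.search(r'(?<="list":)\[.*?\]', r.text)
        if match is None:
            print('No list data found at offset {}'.format(i * 20))
            continue
        items = json.loads(match.group(0))
        rank = [item['col1']['text'] for item in items]
        name = [item['col2']['text'] for item in items]
        search_num = [item['col4']['num'] for item in items]
        # Sign the rank change via upOrDown; this assumes col5 text is numeric.
        rank_change = [item['col5']['text'] * upOrDown.get(item['col5']['upOrDown'], 0) for item in items]
        # Prefix falling ratios with '-'.
        rank_change_ratio = [('-' if upOrDown.get(item['col6']['upOrDown'], 0) < 0 else '') + item['col6']['text'] for item in items]
        frames.append(pd.DataFrame(dict(
            name=name, search_num=search_num,
            rank_change=rank_change, rank_change_ratio=rank_change_ratio), index=rank))
    if not frames:
        # Nothing was fetched, so there is nothing to concatenate or save.
        return pd.DataFrame()
    data = pd.concat(frames)
    data.to_csv('{}.csv'.format(datetime.now().strftime('%Y%m%d%H%M%S')), encoding='utf8')
    return data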


if __name__ == '__main__':
    taobao_top100()
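
The parsing step can be tried offline before touching the network. The sketch below runs the same regular expression over a made-up payload and then stores the rows in an in-memory SQLite database, as one possible reading of "write to a database". Everything in SAMPLE is hypothetical: its shape is inferred from the keys the script reads, not copied from a real response. The helper names, the top100 table, and its columns are also assumptions, and it uses the standard library sqlite3 module rather than pandas so that it runs with no extra packages.

```python
import json
import re
import sqlite3

# Hypothetical payload: shape inferred from the keys the script reads,
# not taken from a real response.
SAMPLE = (
    'var data = {"list":['
    '{"col1":{"text":"1"},"col2":{"text":"item A"},"col4":{"num":"1200"},'
    '"col5":{"text":"3","upOrDown":1},"col6":{"text":"12.5%","upOrDown":1}},'
    '{"col1":{"text":"2"},"col2":{"text":"item B"},"col4":{"num":"900"},'
    '"col5":{"text":"2","upOrDown":0},"col6":{"text":"4.0%","upOrDown":0}}'
    '],"total":100};'
)

SIGN = {1: 1, 2: 0, 0: -1}  # same mapping as upOrDown in the script


def parse_rows(text):
    """Extract the "list" array from the page text and flatten it into tuples."""
    match = re.search(r'(?<="list":)\[.*?\]', text)
    if match is None:
        return []
    rows = []
    for item in json.loads(match.group(0)):
        sign = SIGN.get(item['col5']['upOrDown'], 0)
        rows.append((
            int(item['col1']['text']),         # rank
            item['col2']['text'],              # name
            int(item['col4']['num']),          # search count
            int(item['col5']['text']) * sign,  # signed rank change
        ))
    return rows


def save_rows(conn, rows):
    """Create the (hypothetical) top100 table if needed and insert the rows."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS top100 '
        '(rank_no INTEGER, name TEXT, search_num INTEGER, rank_change INTEGER)'
    )
    conn.executemany('INSERT INTO top100 VALUES (?, ?, ?, ?)', rows)
    conn.commit()


conn = sqlite3.connect(':memory:')
rows = parse_rows(SAMPLE)
save_rows(conn, rows)
print(conn.execute('SELECT name, rank_change FROM top100 ORDER BY rank_no').fetchall())
```

Converting the text fields to int here is a choice made for the sample data. If the real endpoint returns non-numeric text in these fields, the conversion would need a fallback.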