Logging in to Baidu Space with a Python script
May 15, 2013 · Programming

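Over the past six months Baidu Space has changed several times. My WordPress plugin that syncs posts to Baidu Space stopped working last year and has not been maintained since. Baidu Space now seems to have largely stabilized, and comparing packet captures shows that the login process is very different from before. The plan is to simulate the login with a Python script first, then port it to PHP in order to update the sync plugin.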

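The login process works roughly as follows:

1. First obtain the login cookies; without them, Baidu Space login fails. Request the following URL to get them:

https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=false

2. Obtain the login token by requesting the same URL again, this time sending the cookies the server returned in step 1:

https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=false

3. Send the account information (user name, password, and so on) to the following URL, again sending the cookies the server returned in step 2:

http://passport.baidu.com/v2/api/?login

4. The login is now complete.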
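
The steps above can be sketched in Python 3 as follows. This is only a minimal sketch under stated assumptions, not the script from this post: the helper names `make_opener`, `extract_token`, `build_login_form` and `login` are made up for illustration, the token pattern comes from the full script in this post, the form carries only a subset of that script's fields, and none of it has been checked against Baidu's current servers.

```python
import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

GETAPI_URL = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=false"
LOGIN_URL = "http://passport.baidu.com/v2/api/?login"

# Token pattern as used in this post's script; the response format is an assumption.
TOKEN_RE = re.compile(r"bdPass\.api\.params\.login_token='(.*?)';")


def make_opener():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def extract_token(body):
    """Return the login token found in a getapi response, or None."""
    match = TOKEN_RE.search(body)
    return match.group(1) if match else None


def build_login_form(user, password, token):
    """URL-encode a subset of the login fields as bytes for a POST body."""
    fields = {"username": user, "password": password, "token": token,
              "tpl": "mn", "charset": "UTF-8"}
    return urllib.parse.urlencode(fields).encode("ascii")


def login(user, password):
    """Run steps 1 to 3; this performs real network requests when called."""
    opener, jar = make_opener()
    opener.open(GETAPI_URL).read()  # step 1: collect the initial cookies
    body = opener.open(GETAPI_URL).read().decode("utf-8", "replace")  # step 2
    token = extract_token(body)
    if token is None:
        raise RuntimeError("login token not found")
    opener.open(LOGIN_URL, build_login_form(user, password, token)).read()  # step 3
    return jar
```

Keeping the token parsing and form building in separate functions means they can be checked against a saved response without touching the network.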

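Code example:

The sample code below logs in to Baidu Space and then automatically backs up every post to the local disk; the login part follows the steps described above.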

#!/usr/bin/python
# coding: utf8
# Python 2 script: log in to Baidu Space and back up every post locally.

import cookielib, urllib2, urllib
import os, sys, socket, re

# Matches the pager info, giving the total post count and the page size
pageStr = r"""var PagerInfo = \{\s*allCount\s*:\s*'(\d+)',\s*pageSize\s*:\s*'(\d+)',\s*curPage\s*:\s*'\d+'\s*\};"""
pageObj = re.compile(pageStr, re.DOTALL)

# Matches the login token
login_tokenStr = r'''bdPass\.api\.params\.login_token='(.*?)';'''
login_tokenObj = re.compile(login_tokenStr, re.DOTALL)

# Matches the URL and title of each post
blogStr = r'''<div><a href=".*?" target=_blank>.*?</a></div><a href="(.*?)" target=_blank>(.*?)</a></div>'''
blogObj = re.compile(blogStr, re.DOTALL)

class Baidu(object):
    def __init__(self, user='', psw='', blog=''):
        self.user = user
        self.psw = psw
        self.blog = blog
        if not user or not psw or not blog:
            print "Please pass 3 params: user, psw, blog"
            sys.exit(1)
        if not os.path.exists(self.user):
            os.mkdir(self.user)
        self.cookiename = 'baidu%s.coockie' % (self.user)
        self.token = ''
        self.allCount = 0
        self.pageSize = 10
        self.totalpage = 0
        self.logined = False
        self.cj = cookielib.LWPCookieJar()
        # Reuse the cookies saved by an earlier run, if the file exists
        try:
            self.cj.revert(self.cookiename)
            self.logined = True
            print "OK"
        except Exception, e:
            print e
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        self.opener.addheaders = [('User-agent', 'Opera/9.23')]
        # Install globally so that urllib2.urlopen() also sends the cookies
        urllib2.install_opener(self.opener)
        socket.setdefaulttimeout(30)

    # Log in to Baidu
    def login(self):
        # Only simulate a login when no saved cookies were loaded
        if self.logined:
            return
        print "logon to baidu ..."
        # First request: only there to obtain and save a cookie
        qurl = '''https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=false'''
        r = self.opener.open(qurl)
        self.cj.save(self.cookiename)
        # Second request: fetch the token, sending the cookie from above
        r = self.opener.open(qurl)
        rsp = r.read()
        self.cj.save(self.cookiename)
        # Extract the token with the regular expression
        matched_objs = login_tokenObj.findall(rsp)
        if not matched_objs:
            print "Login Fail"
            sys.exit(1)
        self.token = matched_objs[0]
        print 'token =', self.token
        # Then simulate the login with the token
        post_data = urllib.urlencode({'username': self.user,
                                      'password': self.psw,
                                      'token': self.token,
                                      'charset': 'UTF-8',
                                      'callback': 'parent.bd12Pass.api.login._postCallback',
                                      'index': '0',
                                      'isPhone': 'false',
                                      'mem_pass': 'on',
                                      'loginType': '1',
                                      'safeflg': '0',
                                      'staticpage': 'https://passport.baidu.com/v2Jump.html',
                                      'tpl': 'mn',
                                      'u': 'http://www.baidu.com/',
                                      'verifycode': '',
                                      })
        #path = 'http://passport.baidu.com/?login'
        path = 'http://passport.baidu.com/v2/api/?login'
        headers = {
            "Accept": "image/gif, */*",
            "Referer": "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F",
            "Accept-Language": "zh-cn",
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept-Encoding": "gzip, deflate",
            "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
            "Host": "passport.baidu.com",
            "Connection": "Keep-Alive",
            "Cache-Control": "no-cache"
        }
        req = urllib2.Request(path, post_data, headers=headers)
        rsp = self.opener.open(req).read()
        self.cj.save(self.cookiename)
        # For testing the login:
        #qurl = '''http://hi.baidu.com/pub/show/createtext'''
        #rsp = self.opener.open(qurl).read()
        #file_object = open('login.txt', 'w')
        #file_object.write(rsp)
        #file_object.close()

    # Get the number of pages in the blog. If there are private posts,
    # the result differs depending on whether you are logged in.
    def getTotalPage(self):
        req2 = urllib2.Request(self.blog)
        rsp = urllib2.urlopen(req2).read()
        if rsp:
            rsp = rsp.replace('\r', '').replace('\n', '').replace('\t', '')
            matched_objs = pageObj.findall(rsp)
            if matched_objs:
                obj0, obj1 = matched_objs[0]
                self.allCount = int(obj0)
                self.pageSize = int(obj1)
                # Round up; the original "allCount / pageSize + 1" requested one
                # extra, empty page whenever allCount was a multiple of pageSize
                self.totalpage = (self.allCount + self.pageSize - 1) // self.pageSize
                print 'allCount:%d, pageSize:%d, totalpage:%d' % (self.allCount, self.pageSize, self.totalpage)

    # Download every post linked from one listing page
    def fetchPage(self, url):
        req = urllib2.Request(url)
        rsp = urllib2.urlopen(req).read()
        if rsp:
            rsp = rsp.replace('\r', '').replace('\n', '').replace('\t', '')
            matched_objs = blogObj.findall(rsp)
            for obj in matched_objs:
                # This could be rewritten with threads; one thread is slow
                self.download(obj[0], obj[1])

    def downloadBywinget(self, url, title):
        # Placeholder: call a third-party tool such as wget with your own arguments
        pass

    # Download one post, retrying up to 5 times
    def download(self, url, title):
        path = '%s/%s.html' % (self.user, title.decode('utf-8'))
        url = 'http://hi.baidu.com%s' % (url)
        print "Download url %s" % (url)
        nFail = 0
        while nFail < 5:
            try:
                # urllib2.urlopen goes through the installed opener and so sends
                # the login cookies; the original urllib.urlopen call did not
                sock = urllib2.urlopen(url)
                htmlSource = sock.read()
                sock.close()
                myfile = open(path, 'wb')
                myfile.write(htmlSource)
                myfile.close()
                return
            except Exception:
                nFail += 1
        print ('download blog fail:%s' % (url))

    def downloadall(self):
        for page in range(1, self.totalpage + 1):
            url = "%s?page=%d" % (self.blog, page)
            # This could be rewritten with threads; one thread is slow
            self.fetchPage(url)

def main():
    user = 'runsheng2005'  # your Baidu login name
    psw = 'password'  # your Baidu password; without credentials, private posts are not returned
    blog = "http://hi.baidu.com/zhourunsheng"  # the URL of your own Baidu blog
    baidu = Baidu(user, psw, blog)
    baidu.login()
    baidu.getTotalPage()
    baidu.downloadall()

if __name__ == '__main__':
    main()
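
The comments in the script note that downloading with a single thread is slow and could be rewritten with threads. One possible way to do that is sketched below in Python 3 with `concurrent.futures`. The names `download_all` and `fetch` are hypothetical and do not appear in the script; `fetch` is passed in as a parameter so that any function taking a URL and returning the page body can be used.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(posts, fetch, max_workers=4):
    """Fetch every (url, title) pair concurrently.

    `fetch` is any callable that takes a URL and returns the page body.
    Returns (results, failures): results maps title -> body, and failures
    maps title -> the exception raised while fetching it.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which title each future belongs to
        futures = {pool.submit(fetch, url): title for url, title in posts}
        for future in as_completed(futures):
            title = futures[future]
            try:
                results[title] = future.result()
            except Exception as exc:  # keep going and report the failure afterwards
                failures[title] = exc
    return results, failures
```

A small `max_workers` value is probably wise here, since many parallel requests from one account may trigger rate limiting on the server side.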

For example, my user name is runsheng2005, so the script creates a file named baidurunsheng2005.coockie in the working directory to store the cookies.

Its contents look like this:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="25B820FB17B13E5F4F7C9836FB465C96:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2043-05-07 14:28:07Z"; version=0
Set-Cookie3: BDUSS=mJjbjFrZmp3WXNNbUhIQUxkWDJIMjFaR2dSZjdLaHdwcnhhRDBRLVNxcjQxcmxSQVFBQUFBJCQAAAAAAAAAAAEAAABv9HAAcnVuc2hlbmcyMDA1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPhJklH4SZJRW; path="/"; domain=".baidu.com"; path_spec; expires="2021-07-31 14:28:08Z"; version=0
Set-Cookie3: HOSUPPORT=1; path="/"; domain=".passport.baidu.com"; path_spec; expires="2021-07-31 14:28:07Z"; httponly=None; version=0
Set-Cookie3: PTOKEN=2bb1ab99373dbeeeec6b69af75e6a4c6; path="/"; domain=".passport.baidu.com"; path_spec; expires="2021-07-31 14:28:08Z"; version=0
Set-Cookie3: SAVEUSERID=a00277ba04dba8956259a5c4dfec4d40; path="/"; domain=".passport.baidu.com"; path_spec; expires="2021-07-31 14:28:08Z"; version=0
Set-Cookie3: STOKEN=1f4790267126b2e7dddb1f735f29074f; path="/"; domain=".passport.baidu.com"; path_spec; expires="2021-07-31 14:28:08Z"; version=0
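
A file in this format can be read back with `LWPCookieJar`, which lives in `http.cookiejar` in Python 3 (`cookielib` in Python 2). The sketch below lists the cookie names stored in such a file; `cookie_names` is a hypothetical helper written for this illustration.

```python
from http.cookiejar import LWPCookieJar


def cookie_names(path):
    """Load an LWP-format cookie file and return the sorted cookie names."""
    jar = LWPCookieJar()
    # Keep expired and session-only cookies too, since the goal is only
    # to inspect what the file contains.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(cookie.name for cookie in jar)
```

For the file shown above, this would be expected to list names such as BAIDUID, BDUSS and PTOKEN.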

All posts are downloaded into a subdirectory named runsheng2005,
for example 打造个人的云端笔记本(CareyDiary).html, and so on.
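
Because post titles are used directly as file names, a title containing `/` or another reserved character would produce an invalid path. A minimal sketch of one way to guard against that is shown below (Python 3). The function `safe_filename` and the set of characters it replaces are assumptions for illustration, not part of the script.

```python
import re

# Characters that are not allowed in file names on common file systems
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, suffix=".html"):
    """Replace unsafe characters with "_" and append the suffix."""
    cleaned = _UNSAFE.sub("_", title).strip(" .")
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or "untitled") + suffix
```

Non-ASCII titles pass through unchanged; only the reserved characters are replaced.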
