<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Bgods</title>
<subtitle>Everything stems purely from interest</subtitle>
<link href="/atom.xml" rel="self"/>
<link href="http://bgods.top/"/>
<updated>2016-09-16T07:23:00.000Z</updated>
<id>http://bgods.top/</id>
<author>
<name>Bgods</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>MySQL Study Notes</title>
<link href="http://bgods.top/2016/07/10/mysql%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/"/>
<id>http://bgods.top/2016/07/10/mysql学习笔记/</id>
<published>2016-07-09T17:57:45.000Z</published>
<updated>2016-09-16T07:23:00.000Z</updated>
<content type="html"><p>MySQL study notes</p>
<a id="more"></a>
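<h1 id="MySQL语法规划"><a href="#MySQL语法规划" class="headerlink" title="MySQL语法规划"></a>MySQL Syntax Conventions</h1><ol>
<li>Write keywords and function names in uppercase;</li>
<li>Write database, table, and column names in lowercase;</li>
<li>End every SQL statement with a semicolon.</li>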
</ol>
<h1 id="常用命令"><a href="#常用命令" class="headerlink" title="常用命令"></a>Common Commands</h1><ul>
<li>Show the current server version</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="keyword">VERSION</span>();</span><br></pre></td></tr></table></figure>
<ul>
<li>Show the current time</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="keyword">NOW</span>();</span><br></pre></td></tr></table></figure>
<ul>
<li>Show the current user</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="keyword">USER</span>();</span><br></pre></td></tr></table></figure>
<ul>
<li>List all databases</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SHOW</span> <span class="keyword">DATABASES</span>;</span><br></pre></td></tr></table></figure>
<ul>
<li><p>Switch to a database</p>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">USE</span> db_name;</span><br></pre></td></tr></table></figure>
</li>
<li><p>Show the current database</p>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SELECT</span> <span class="keyword">DATABASE</span>();</span><br></pre></td></tr></table></figure>
</li>
<li><p>List all tables in the current database</p>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SHOW</span> <span class="keyword">TABLES</span>;</span><br></pre></td></tr></table></figure>
</li>
<li><p>Show warnings</p>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">SHOW</span> <span class="keyword">WARNINGS</span>;</span><br></pre></td></tr></table></figure>
</li>
</ul>
<h1 id="MySQL数据库操作"><a href="#MySQL数据库操作" class="headerlink" title="MySQL数据库操作"></a>MySQL Database Operations</h1><ul>
<li>Create a database<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">CREATE</span> &#123;<span class="keyword">DATABASE</span> | <span class="keyword">SCHEMA</span>&#125; [<span class="keyword">IF</span> <span class="keyword">NOT</span> <span class="keyword">EXISTS</span>] db_name [<span class="keyword">DEFAULT</span>] <span class="built_in">CHARACTER</span> <span class="keyword">SET</span> [=] charset_name;</span><br></pre></td></tr></table></figure>
</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">-- Create database t1</span></span><br><span class="line"><span class="keyword">CREATE</span> <span class="keyword">DATABASE</span> t1;</span><br><span class="line"></span><br><span class="line"><span class="comment">-- Create database t1 only if it does not already exist</span></span><br><span class="line"><span class="keyword">CREATE</span> <span class="keyword">DATABASE</span> <span class="keyword">IF</span> <span class="keyword">NOT</span> <span class="keyword">EXISTS</span> t1;</span><br><span class="line"></span><br><span class="line"><span class="comment">-- Create database t2 with the gbk character set</span></span><br><span class="line"><span class="keyword">CREATE</span> <span class="keyword">DATABASE</span> <span class="keyword">IF</span> <span class="keyword">NOT</span> <span class="keyword">EXISTS</span> t2 <span class="built_in">CHARACTER</span> <span class="keyword">SET</span> gbk;</span><br><span class="line"></span><br><span class="line"><span class="comment">-- Show the statement that was used to create database t2</span></span><br><span class="line"><span class="keyword">SHOW</span> <span class="keyword">CREATE</span> <span class="keyword">DATABASE</span> t2;</span><br></pre></td></tr></table></figure>
<ul>
<li>Alter a database<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">ALTER</span> &#123;<span class="keyword">DATABASE</span> | <span class="keyword">SCHEMA</span>&#125; [db_name] [<span class="keyword">DEFAULT</span>] <span class="built_in">CHARACTER</span> <span class="keyword">SET</span> [=] charset_name;</span><br></pre></td></tr></table></figure>
</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">-- Change the character set of database t2 to utf8</span></span><br><span class="line"><span class="keyword">ALTER</span> <span class="keyword">DATABASE</span> t2 <span class="built_in">CHARACTER</span> <span class="keyword">SET</span> UTF8;</span><br></pre></td></tr></table></figure>
<ul>
<li>Drop a database<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">DROP</span> &#123;<span class="keyword">DATABASE</span> | <span class="keyword">SCHEMA</span>&#125; [<span class="keyword">IF</span> <span class="keyword">EXISTS</span>] db_name;</span><br></pre></td></tr></table></figure>
</li>
</ul>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">DROP</span> <span class="keyword">DATABASE</span> t1;</span><br></pre></td></tr></table></figure>
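<h1 id="MySQL数据类型"><a href="#MySQL数据类型" class="headerlink" title="MySQL数据类型"></a>MySQL Data Types</h1><p>A data type describes the data characteristics of a column, stored-procedure parameter, expression, or local variable. It determines how the data is stored and represents a distinct kind of information.</p>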
<ul>
<li>Integer types</li>
</ul>
<table>
<thead>
<tr>
<th style="text-align:left">Data type</th>
<th style="text-align:left">Range (signed)</th>
<th style="text-align:left">Range (unsigned)</th>
<th style="text-align:left">Bytes</th>
<th style="text-align:left">Use</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">TINYINT</td>
<td style="text-align:left">(-128,127)</td>
<td style="text-align:left">(0,255)</td>
<td style="text-align:left">1</td>
<td style="text-align:left">Small integer values</td>
</tr>
<tr>
<td style="text-align:left">SMALLINT</td>
<td style="text-align:left">(-32768,32767)</td>
<td style="text-align:left">(0,65535)</td>
<td style="text-align:left">2</td>
<td style="text-align:left">Large integer values</td>
</tr>
<tr>
<td style="text-align:left">MEDIUMINT</td>
<td style="text-align:left">(-8388608,8388607)</td>
<td style="text-align:left">(0,16777215)</td>
<td style="text-align:left">3</td>
<td style="text-align:left">Large integer values</td>
</tr>
<tr>
<td style="text-align:left">INT</td>
<td style="text-align:left">(-2147483648,2147483647)</td>
<td style="text-align:left">(0,4294967295)</td>
<td style="text-align:left">4</td>
<td style="text-align:left">Large integer values</td>
</tr>
<tr>
<td style="text-align:left">BIGINT</td>
<td style="text-align:left">(-9223372036854775808,9223372036854775807)</td>
<td style="text-align:left">(0,18446744073709551615)</td>
<td style="text-align:left">8</td>
<td style="text-align:left">Very large integer values</td>
</tr>
</tbody>
</table>
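<p>The ranges in the table follow directly from the byte counts: an n-byte signed integer covers -2^(8n-1) through 2^(8n-1)-1, and the unsigned form covers 0 through 2^(8n)-1. The following is a minimal Python sketch of that arithmetic; the helper name int_range and the dictionary MYSQL_INT_BYTES are illustrative assumptions, not part of MySQL.</p>

```python
def int_range(num_bytes, signed=True):
    """Return the (min, max) pair an integer stored in num_bytes bytes can hold."""
    bits = 8 * num_bytes
    if signed:
        # One bit is used for the sign, so the magnitude has bits - 1 bits.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


# Byte sizes copied from the table above (hypothetical helper mapping).
MYSQL_INT_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}

for name, size in MYSQL_INT_BYTES.items():
    print(name, int_range(size), int_range(size, signed=False))
```

<p>Running the loop prints one row per type, which can be compared against the table as a quick check.</p>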
<ul>
<li>Floating-point types</li>
</ul>
<table>
<thead>
<tr>
<th style="text-align:left">Data type</th>
<th style="text-align:left">Range</th>
<th style="text-align:left">Bytes</th>
<th style="text-align:left">Use</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">FLOAT[(M,D)]</td>
<td style="text-align:left">(-3.402823466E+38,-1.175494351E-38),0,(1.175494351E-38,3.402823466E+38)</td>
<td style="text-align:left">4</td>
<td style="text-align:left">Single-precision floating-point values</td>
</tr>
<tr>
<td style="text-align:left">DOUBLE[(M,D)]</td>
<td style="text-align:left">(-1.7976931348623157E+308,-2.2250738585072014E-308),0,(2.2250738585072014E-308,1.7976931348623157E+308)</td>
<td style="text-align:left">8</td>
<td style="text-align:left">Double-precision floating-point values</td>
</tbody>
</table>
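<p>M is the total number of digits and D is the number of digits after the decimal point. If M and D are omitted, values are stored to the limits allowed by the hardware. A single-precision floating-point number is accurate to approximately 7 decimal places.</p>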
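<p>To get a feel for the roughly 7-digit limit of single precision, the sketch below rounds a value to a 4-byte IEEE 754 float using Python's struct module and compares it with the original. This only illustrates the storage format; it assumes MySQL's FLOAT behaves like an ordinary 4-byte float and does not query MySQL. The helper name to_float32 is hypothetical.</p>

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to single precision and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


original = 0.123456789
stored = to_float32(original)

# The two values agree in the leading digits and then drift apart.
print(original, stored)
print(round(original, 7) == round(stored, 7))
```

<p>Digits beyond about the seventh are not preserved, which is why exact values such as monetary amounts are usually better kept in a DECIMAL column.</p>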
<p>To be continued...</p>
</content>
<summary type="html">
<p>MySQL study notes</p>
</summary>
<category term="mysql" scheme="http://bgods.top/tags/mysql/"/>
</entry>
<entry>
<title>Installing R on CentOS</title>
<link href="http://bgods.top/2016/07/10/CentOS%E5%AE%89%E8%A3%85R%E8%AF%AD%E8%A8%80/"/>
<id>http://bgods.top/2016/07/10/CentOS安装R语言/</id>
<published>2016-07-09T16:27:24.000Z</published>
<updated>2016-09-16T08:03:46.000Z</updated>
<content type="html"><h1 id="环境准备"><a href="#环境准备" class="headerlink" title="环境准备"></a><strong>Prerequisites</strong></h1><p>Before compiling R, install the following packages as root:<br><a id="more"></a></p>
<ol>
<li><p>Install gcc-gfortran</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">yum install gcc-gfortran</span><br></pre></td></tr></table></figure>
</li>
<li><p>Install gcc and gcc-c++</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">yum install gcc gcc-c++</span><br></pre></td></tr></table></figure>
</li>
<li><p>Install readline-devel</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">yum install readline-devel</span><br></pre></td></tr></table></figure>
</li>
<li><p>Install libXt-devel</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">yum install libXt-devel</span><br></pre></td></tr></table></figure>
</li>
</ol>
<hr>
<h1 id="安装"><a href="#安装" class="headerlink" title="安装"></a><strong>Installation</strong></h1><ol>
<li>First, download the R source code with wget<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">wget https://cran.r-project.org/src/base/R-3/R-3.2.2.tar.gz</span><br></pre></td></tr></table></figure>
</li>
</ol>
<p>The R source archives are available at <a href="https://cran.r-project.org/src/base/" target="_blank" rel="external">https://cran.r-project.org/src/base/</a>; this guide uses version 3.2.2 as the example.</p>
<ol>
<li><p>Extract the archive</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tar zxvf R-3.2.2.tar.gz</span><br></pre></td></tr></table></figure>
</li>
<li><p>Change into the R-3.2.2 directory</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cd</span> R-3.2.2</span><br></pre></td></tr></table></figure>
</li>
<li><p>Compile the source</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">./configure</span><br><span class="line">make</span><br><span class="line">make check</span><br><span class="line">make install</span><br></pre></td></tr></table></figure>
</li>
</ol>
<h1 id="启动R"><a href="#启动R" class="headerlink" title="启动R"></a><strong>Starting R</strong></h1><p><img src="/img/bgods006.png" alt=""><br><img src="/img/bgods007.png" alt=""></p>
</content>
<summary type="html">
<h1 id="环境准备"><a href="#环境准备" class="headerlink" title="环境准备"></a><strong>Prerequisites</strong></h1><p>Before compiling R, install the following packages as root:<br>
</summary>
<category term="R" scheme="http://bgods.top/tags/R/"/>
<category term="CentOS" scheme="http://bgods.top/tags/CentOS/"/>
</entry>
<entry>
<title>The Scrapy Framework</title>
<link href="http://bgods.top/2016/06/29/scrapy%E6%A1%86%E6%9E%B6/"/>
<id>http://bgods.top/2016/06/29/scrapy框架/</id>
<published>2016-06-29T10:46:49.000Z</published>
<updated>2016-09-16T08:08:11.000Z</updated>
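<content type="html"><p>Data processing in the Scrapy framework is controlled by the Scrapy engine; the overall architecture is shown in the figure. The green arrows in the figure indicate the direction of data flow.<br><a id="more"></a></p>
<p><center><br><img src="/img/bgods008.png" alt=""><br></center><br>Scrapy runs as follows:</p>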
<ol>
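<li>The Scrapy engine opens an initial domain, locates the spider that handles URLs for that domain, and asks the spider for the first URL to crawl;</li>
<li>The Scrapy engine receives the first URL to crawl from the spider, wraps it in a request, specifies the callback function that will handle the response to that request, and sends the request to the scheduler;</li>
<li>The Scrapy engine asks the scheduler for the next page to crawl;</li>
<li>The scheduler returns the next URL to crawl to the Scrapy engine as a request, and the engine sends that request to the downloader through the downloader middleware;</li>
<li>Once the downloader has executed the request and downloaded the page, the page content is sent back to the Scrapy engine through the downloader middleware;</li>
<li>After receiving the downloaded data from the downloader, the Scrapy engine sends the response to the spider through the spider middleware for processing;</li>
<li>The spider parses the downloaded page and returns the parsed data, then wraps any extracted URLs that still need to be crawled into new requests and sends them to the Scrapy engine;</li>
<li>The Scrapy engine sends the parsed data to the item pipeline and forwards the new crawl requests to the scheduler;</li>
<li>The system repeats steps 2-8 until the scheduler has no more requests, at which point the crawler shuts down.</li>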
</ol>
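<p>The loop described in the steps above can be sketched in plain Python. This is a toy simulation under stated assumptions, not Scrapy's actual implementation or API: FAKE_WEB, download, parse, and crawl are hypothetical names, the "web" is an in-memory dictionary, middleware is omitted, and the duplicate check on URLs is an added simplification that the steps above do not mention.</p>

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to (page text, links found on the page).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Stand-in for the downloader: return the 'response' for a URL."""
    return FAKE_WEB[url]


def parse(url, response):
    """Stand-in for the spider callback: return one item plus follow-up URLs."""
    text, links = response
    return {"url": url, "text": text}, links


def crawl(start_url):
    scheduler = deque([start_url])  # queue of pending requests
    seen = {start_url}              # assumption: skip URLs already scheduled
    pipeline = []                   # items handed to the "item pipeline"
    while scheduler:                # stop once the scheduler is empty
        url = scheduler.popleft()   # engine asks the scheduler for the next request
        response = download(url)    # downloader fetches the page
        item, links = parse(url, response)  # spider parses the response
        pipeline.append(item)       # parsed data goes to the pipeline
        for link in links:          # new requests go back to the scheduler
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return pipeline


items = crawl("http://example.com/")
print([item["url"] for item in items])
```

<p>In real Scrapy the engine, scheduler, and downloader run asynchronously; the sequential loop here only mirrors the order in which data moves between the components.</p>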
<hr>
<p>Reposted from: <a href="http://www.360doc.com/content/14/0325/22/9482_363730690.shtml" target="_blank" rel="external">An Introduction to the Scrapy Crawling Framework</a></p>
</content>
<summary type="html">
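<p>Data processing in the Scrapy framework is controlled by the Scrapy engine; the overall architecture is shown in the figure. The green arrows in the figure indicate the direction of data flow.<br>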
</summary>
<category term="python" scheme="http://bgods.top/tags/python/"/>
<category term="爬虫" scheme="http://bgods.top/tags/%E7%88%AC%E8%99%AB/"/>
<category term="scrapy" scheme="http://bgods.top/tags/scrapy/"/>
</entry>
<entry>
<title>Python File and Directory Operations</title>
<link href="http://bgods.top/2016/06/27/Python%E6%96%87%E4%BB%B6%E3%80%81%E7%9B%AE%E5%BD%95%E6%93%8D%E4%BD%9C/"/>
<id>http://bgods.top/2016/06/27/Python文件、目录操作/</id>
<published>2016-06-27T15:21:32.000Z</published>
<updated>2016-09-16T08:35:17.000Z</updated>
<content type="html"><p>File and directory operations in Python rely on the os and shutil modules.</p>
<a id="more"></a>
<ol>
<li><p>Create an empty file</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">os.mknod(<span class="string">"test.txt"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Open a file directly, creating it if it does not exist</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">open(<span class="string">"test.txt"</span>, <span class="string">"w"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Create a directory</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">os.mkdir(<span class="string">"file"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Create nested directories:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> os</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">mkdirs</span><span class="params">(path)</span>:</span></span><br><span class="line">    <span class="comment"># Strip leading and trailing whitespace</span></span><br><span class="line">    path = path.strip()</span><br><span class="line">    <span class="comment"># Strip any trailing backslash</span></span><br><span class="line">    path = path.rstrip(<span class="string">"\\"</span>)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># os.path.exists returns True if the path exists, False otherwise</span></span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(path):</span><br><span class="line">        <span class="comment"># The path does not exist: create it, including missing parent directories</span></span><br><span class="line">        os.makedirs(path)</span><br><span class="line">        <span class="keyword">print</span>(path + <span class="string">" created successfully"</span>)</span><br><span class="line">        <span class="keyword">return</span> <span class="keyword">True</span></span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        <span class="comment"># The directory already exists: do not create it, just report that</span></span><br><span class="line">        <span class="keyword">print</span>(path + <span class="string">" already exists"</span>)</span><br><span class="line">        <span class="keyword">return</span> <span class="keyword">False</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>Copy a file</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">shutil.copyfile(<span class="string">"oldfile"</span>, <span class="string">"newfile"</span>) <span class="comment"># oldfile and newfile must both be files</span></span><br><span class="line">shutil.copy(<span class="string">"oldfile"</span>, <span class="string">"newfile"</span>) <span class="comment"># oldfile must be a file; newfile may be a file or a destination directory</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>Copy a directory</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">shutil.copytree(<span class="string">"olddir"</span>, <span class="string">"newdir"</span>) <span class="comment"># olddir and newdir must both be directories, and newdir must not already exist</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>Rename a file or directory</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">os.rename(<span class="string">"oldname"</span>, <span class="string">"newname"</span>) <span class="comment"># the same call works for files and directories</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>Move a file or directory</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">shutil.move(<span class="string">"oldpos"</span>, <span class="string">"newpos"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Delete a file</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">os.remove(<span class="string">"file"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Delete a directory</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">os.rmdir(<span class="string">"dir"</span>) <span class="comment"># removes empty directories only</span></span><br><span class="line">shutil.rmtree(<span class="string">"dir"</span>) <span class="comment"># removes a directory whether or not it is empty</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>Change the working directory</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">os.chdir(<span class="string">"path"</span>) <span class="comment"># switch to the given path</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>Check a target</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">os.path.exists(<span class="string">"goal"</span>) <span class="comment"># check whether the target exists</span></span><br><span class="line">os.path.isdir(<span class="string">"goal"</span>) <span class="comment"># check whether the target is a directory</span></span><br><span class="line">os.path.isfile(<span class="string">"goal"</span>) <span class="comment"># check whether the target is a file</span></span><br></pre></td></tr></table></figure>
</li>
</ol>
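<p>The calls listed above can be combined into one short, self-contained run. The sketch below is only an illustration: it works inside a temporary directory created with tempfile.mkdtemp so that nothing on the real file system is touched, and the file and directory names (a, b, old.txt, copy.txt, renamed.txt) are made up for the example.</p>

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so the real file system is left alone.
base = tempfile.mkdtemp()

nested = os.path.join(base, "a", "b")
os.makedirs(nested)                      # create nested directories in one call

src = os.path.join(nested, "old.txt")
with open(src, "w") as handle:           # mode "w" creates the file if it is missing
    handle.write("hello")

copy_path = os.path.join(base, "copy.txt")
shutil.copyfile(src, copy_path)          # file-to-file copy

renamed = os.path.join(base, "renamed.txt")
os.rename(copy_path, renamed)            # rename the copy

moved = shutil.move(renamed, nested)     # move the file into an existing directory

checks = (os.path.isfile(moved), os.path.isdir(nested), os.path.exists(copy_path))
print(checks)

shutil.rmtree(base)                      # remove the directory tree, empty or not
print(os.path.exists(base))
```

<p>Using a with block for open closes the file automatically, which matters on Windows, where an open file cannot be renamed or deleted.</p>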
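<p><font color="red"><strong>Note</strong></font>: In Python 2, if a path contains Chinese characters on Windows (where the encoding is GBK), the directory name must be encoded as GBK, for example dir.encode('GBK').</p>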
<hr>
<p>Reposted from: <a href="http://l90z11.blog.163.com/blog/static/187389042201312153318389/" target="_blank" rel="external">http://l90z11.blog.163.com/blog/static/187389042201312153318389/</a></p>
</content>
<summary type="html">
<p>File and directory operations in Python rely on the os and shutil modules.</p>
</summary>
<category term="python" scheme="http://bgods.top/tags/python/"/>
</entry>
<entry>
<title>R: Expressions, Mathematical Formulas, and Special Symbols</title>
<link href="http://bgods.top/2016/06/27/R%E8%AF%AD%E8%A8%80%EF%BC%9A%E8%A1%A8%E8%BE%BE%E5%BC%8F%E3%80%81%E6%95%B0%E5%AD%A6%E5%85%AC%E5%BC%8F%E3%80%81%E7%89%B9%E6%AE%8A%E7%AC%A6%E5%8F%B7/"/>
<id>http://bgods.top/2016/06/27/R语言:表达式、数学公式、特殊符号/</id>
<published>2016-06-27T12:17:01.000Z</published>
<updated>2016-09-16T09:36:49.000Z</updated>
<content type="html"><p>In R's plotting functions, if a text argument is a valid R expression, that expression is typeset using TeX-like rules.</p>
<a id="more"></a>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">y &lt;- <span class="keyword">function</span>(x)&#123;</span><br><span class="line"> (exp(-(x^<span class="number">2</span>)/<span class="number">2</span>))/sqrt(<span class="number">2</span>*pi)</span><br><span class="line">&#125;</span><br><span class="line">plot(y, -<span class="number">5</span>, <span class="number">5</span>, </span><br><span class="line"> main = expression(f(x) == frac(<span class="number">1</span>,sqrt(<span class="number">2</span>*pi))*e^(-frac(x^<span class="number">2</span>,<span class="number">2</span>))), </span><br><span class="line"> lwd = <span class="number">3</span>, </span><br><span class="line"> col = <span class="string">"blue"</span></span><br><span class="line">)</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151028223417872" alt=""></p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(ggplot2)</span><br><span class="line"></span><br><span class="line">x &lt;- seq(<span class="number">0</span>, <span class="number">2</span>*pi, by = <span class="number">0.01</span>)</span><br><span class="line">y &lt;- sin(x)</span><br><span class="line">data &lt;- data.frame(x, y)</span><br><span class="line">p &lt;- ggplot(data, aes(x, y)) + geom_line()</span><br><span class="line">p + geom_area(fill = <span class="string">'blue'</span>, alpha = <span class="number">0.3</span>) +</span><br><span class="line"> scale_x_continuous(breaks = c(<span class="number">0</span>, pi, <span class="number">2</span>*pi), labels = c(<span class="string">'0'</span>, expression(pi), expression(<span class="number">2</span>*pi))) +</span><br><span class="line"> geom_text(parse = <span class="literal">T</span>, aes(x = pi/<span class="number">2</span>,y = <span class="number">0.3</span>, label = <span class="string">'integral(sin(x)*dx, 0, pi)'</span>))</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151029130139187" alt=""></p>
<hr>
<h1 id="R语言的“表达式”"><a href="#R语言的“表达式”" class="headerlink" title="R语言的“表达式”"></a><strong>"Expressions" in R</strong></h1><p>In R, the term "expression" has a narrow sense and a broad sense. In the narrow sense, it refers to objects of class expression, which are produced by the expression function. In the broad sense, it covers both the expression class and R's language class. expression and language are two special data classes in R:</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">getClass(<span class="string">"expression"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Class "expression" [package "methods"]</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># No Slots, prototype of class "expression"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Extends: "vector"</span></span><br></pre></td></tr></table></figure>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">getClass(<span class="string">"language"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment">#Virtual Class "language" [package "methods"]</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># No Slots, prototype of class "name"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Known Subclasses: </span></span><br><span class="line"><span class="comment"># Class "name", directly</span></span><br><span class="line"><span class="comment"># Class "call", directly</span></span><br><span class="line"><span class="comment"># Class "&#123;", directly</span></span><br><span class="line"><span class="comment"># Class "if", directly</span></span><br><span class="line"><span class="comment"># Class "&lt;-", directly</span></span><br><span class="line"><span class="comment"># Class "for", directly</span></span><br><span class="line"><span class="comment"># Class "while", directly</span></span><br><span class="line"><span class="comment"># Class "repeat", directly</span></span><br><span class="line"><span class="comment"># Class "(", directly</span></span><br><span class="line"><span class="comment"># Class ".name", by class "name", distance 2, with explicit coerce</span></span><br></pre></td></tr></table></figure>
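<p>As the output shows, the expression class derives from vector, whereas language is a virtual class that includes the familiar control-flow keywords and symbols as well as the name and call subclasses.</p>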
<hr>
<h1 id="产生“表达式”的函数"><a href="#产生“表达式”的函数" class="headerlink" title="产生“表达式”的函数"></a><strong>产生“表达式”的函数</strong></h1><p> 虽然我们在R终端键入的任何有效语句都是表达式,但这些表达式在输入后即被求值(evaluate)了,获得未经求值的纯粹“表达式”就要使用函数。下面我们从函数参数和返回值两方面了解expression、quote、bquote和substitute这几个常用函数。</p>
<h2 id="expression-函数"><a href="#expression-函数" class="headerlink" title="expression 函数"></a><strong>The expression function</strong></h2><p>expression() takes one or more arguments and handles them like the elements of a list: each argument becomes one unevaluated element of the result. The return value is therefore an expression vector whose length equals the number of arguments, and every single-bracket subset of it is again an object of type expression:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">(ex &lt;- expression(x = <span class="number">1</span>, <span class="number">1</span> + sqrt(a)))</span><br><span class="line"><span class="comment">## expression(x = 1, 1 + sqrt(a))</span></span><br><span class="line">length(ex)</span><br><span class="line"><span class="comment">## [1] 2</span></span><br><span class="line">ex[<span class="number">1</span>]</span><br><span class="line"><span class="comment">## expression(x = 1)</span></span><br><span class="line">mode(ex[<span class="number">1</span>])</span><br><span class="line"><span class="comment">## [1] "expression"</span></span><br><span class="line">typeof(ex[<span class="number">1</span>])</span><br><span class="line"><span class="comment">## [1] "expression"</span></span><br><span class="line">ex[<span class="number">2</span>]</span><br><span class="line"><span class="comment">## expression(1 + sqrt(a))</span></span><br><span class="line">mode(ex[<span class="number">2</span>])</span><br><span class="line"><span class="comment">## [1] "expression"</span></span><br><span class="line">typeof(ex[<span class="number">2</span>])</span><br><span class="line"><span class="comment">## [1] "expression"</span></span><br></pre></td></tr></table></figure></p>
<p>Because expression() treats its arguments like list elements, whatever appears on the left of "=" must be a valid argument name, just as when writing list elements; otherwise R reports an error, for example:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">expression(x+<span class="number">11</span>=<span class="number">1</span>)</span><br></pre></td></tr></table></figure></p>
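<p>The naming rule can be checked without crashing a script. The sketch below is not from the original article: it parses the failing line from text inside try(), and then shows that quoting the same name with backticks is accepted. All variable names are made up for illustration.</p>

```r
# Parsing the unquoted form from text lets us catch the syntax error.
unquoted <- try(parse(text = "expression(x+11=1)"), silent = TRUE)
unquoted_fails <- inherits(unquoted, "try-error")

# Backticks make any string usable as an argument name.
ex_named <- expression(`x+11` = 1, y = 2)
arg_names <- names(ex_named)

# Elements can then be extracted by that name.
first_value <- ex_named[["x+11"]]
```

<p>Backtick quoting is the general way to use a non-syntactic name anywhere R expects a name.</p>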
<h2 id="quote函数"><a href="#quote函数" class="headerlink" title="quote函数"></a><strong>The quote function</strong></h2><p>quote() accepts exactly one argument. Its return value is usually a call; if the argument is a single variable the result is a name, and if it is a constant the result has the same storage mode as that constant:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">(cl &lt;- quote(<span class="number">1</span> + sqrt(a) + b^c))</span><br><span class="line"><span class="comment">## 1 + sqrt(a) + b^c</span></span><br><span class="line">mode(cl)</span><br><span class="line"><span class="comment">## [1] "call"</span></span><br><span class="line">typeof(cl)</span><br><span class="line"><span class="comment">## [1] "language"</span></span><br><span class="line">(cl &lt;- quote(a))</span><br><span class="line"><span class="comment">## a</span></span><br><span class="line">mode(cl)</span><br><span class="line"><span class="comment">## [1] "name"</span></span><br><span class="line">typeof(cl)</span><br><span class="line"><span class="comment">## [1] "symbol"</span></span><br><span class="line">(cl &lt;- quote(<span class="number">1</span>))</span><br><span class="line"><span class="comment">## [1] 1</span></span><br><span class="line">mode(cl)</span><br><span class="line"><span class="comment">## [1] "numeric"</span></span><br><span class="line">typeof(cl)</span><br><span class="line"><span class="comment">## [1] "double"</span></span><br></pre></td></tr></table></figure></p>
<p>If quote() returns a name or a constant, its length is 1. If it returns a call, the length is n+1, where n is the number of arguments of the function or operator; the extra element is the function or operator name itself.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">length(quote(a)) <span class="comment"># name or constant: length 1</span></span><br><span class="line"><span class="comment">## [1] 1</span></span><br><span class="line">length(quote(!a)) <span class="comment"># unary operator: length 2</span></span><br><span class="line"><span class="comment">## [1] 2</span></span><br><span class="line">length(quote(-b)) <span class="comment"># unary operator: length 2</span></span><br><span class="line"><span class="comment">## [1] 2</span></span><br><span class="line">length(quote(a + b)) <span class="comment"># binary operator: length 3</span></span><br><span class="line"><span class="comment">## [1] 3</span></span><br><span class="line">length(quote((a + b) * c)) <span class="comment"># with several operators, only the outermost (lowest-precedence) call counts</span></span><br><span class="line"><span class="comment">## [1] 3</span></span><br></pre></td></tr></table></figure></p>
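<p>The n+1 rule follows from the fact that a call can be indexed like a list, with the function name in position 1. The following sketch illustrates this on the last example above; the helper count_calls is hypothetical and only serves to show that nested calls are themselves calls.</p>

```r
cl <- quote((a + b) * c)

head_fn <- cl[[1]]      # the function being called: `*`
first   <- cl[[2]]      # first argument: the parenthesised call (a + b)
inner   <- first[[2]]   # what is inside the parentheses: a + b

# Hypothetical helper: count every call nested inside an expression,
# looking only at the arguments of each call (not at its head).
count_calls <- function(e) {
  if (!is.call(e)) return(0L)
  1L + sum(vapply(as.list(e)[-1], count_calls, integer(1)))
}
n_calls <- count_calls(cl)   # `*`, `(` and `+`
```

<p>Note that the parentheses are a call to the function `(` in their own right, which is why the first argument has length 2.</p>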
<h2 id="bquote-和-substitute-函数"><a href="#bquote-和-substitute-函数" class="headerlink" title="bquote 和 substitute 函数"></a><strong>The bquote and substitute functions</strong></h2><p>When nothing is substituted (no .( ) terms and no environment or list argument, called at top level), bquote and substitute return the same result as quote.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">bquote(<span class="number">1</span> + sqrt(a) + b^c) == quote(<span class="number">1</span> + sqrt(a) + b^c)</span><br><span class="line"><span class="comment">## [1] TRUE</span></span><br><span class="line">substitute(<span class="number">1</span> + sqrt(a) + b^c) == quote(<span class="number">1</span> + sqrt(a) + b^c)</span><br><span class="line"><span class="comment">## [1] TRUE</span></span><br></pre></td></tr></table></figure></p>
<p>However, bquote and substitute can splice values into an expression, so that variables are replaced by their current values at run time. The two differ in how the replacement is specified: in bquote the terms to replace are wrapped in .( ), while in substitute the replacements are supplied as a list (or environment) argument. For simple value substitution like this, the two give the same result:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">a &lt;- <span class="number">3</span></span><br><span class="line">b &lt;- <span class="number">2</span></span><br><span class="line">(bq &lt;- bquote(y == sqrt(.(a), .(b))))</span><br><span class="line"><span class="comment">## y == sqrt(3, 2) </span></span><br><span class="line">(ss &lt;- substitute(y == sqrt(a, b), list(a = <span class="number">3</span>, b = <span class="number">2</span>)))</span><br><span class="line"><span class="comment">## y == sqrt(3, 2) </span></span><br><span class="line">bq == ss</span><br><span class="line"><span class="comment">## [1] TRUE</span></span><br></pre></td></tr></table></figure></p>
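<p>A minimal sketch of where the two functions diverge, under the assumption that the values live in an explicit environment (env, capture and the numbers are all made up for illustration). substitute can also capture the unevaluated argument of a function, and bquote only touches what is wrapped in .( ):</p>

```r
# Hypothetical helper: return the unevaluated expression passed as `arg`.
capture <- function(arg) substitute(arg)
captured <- capture(1 + sqrt(a))   # works even if `a` does not exist

# Value substitution against an explicit environment.
env <- list2env(list(a = 3, b = 2))
bq <- bquote(y == sqrt(.(a), .(b)), where = env)
ss <- substitute(y == sqrt(a, b), env)
same <- identical(bq, ss)          # compare the two calls structurally

# bquote leaves anything outside .() alone, even if it is bound in env.
partial <- bquote(a + .(b), where = env)
```

<p>Here identical() is used for the comparison because it compares the structure of the two calls directly.</p>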
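<p>Two functions with largely overlapping functionality are nothing unusual in R; there are many such pairs, perhaps to accommodate different habits. The help page for bquote says it is similar to the backquote macro in LISP; for someone like me who does not know LISP, substitute is the more comfortable choice. A typical use of substitute is to replace variables inside an expression: when a label should contain a variable whose value changes as the program runs, substitute does the job.</p>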
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">par(mar = rep(<span class="number">0.1</span>, <span class="number">4</span>), cex = <span class="number">2</span>)</span><br><span class="line">plot.new()</span><br><span class="line">plot.window(c(<span class="number">0</span>, <span class="number">10</span>), c(<span class="number">0</span>, <span class="number">1</span>))</span><br><span class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:<span class="number">9</span>) text(i, <span class="number">0.5</span>, substitute(sqrt(x, a), list(a = i + <span class="number">1</span>)))</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151029095415911" alt=""></p>
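<p>The loop above can also be written without calling text() nine times. This is a sketch of one alternative, assuming a null PDF device so that it runs without a display; the names lab_list and ex_labels are made up. The labels are built first and then combined into a single expression vector, over which text() is vectorised.</p>

```r
# Build one unevaluated label per root degree.
lab_list <- lapply(1:9, function(i) substitute(sqrt(x, a), list(a = i + 1)))
ex_labels <- as.expression(lab_list)   # one expression vector of length 9

pdf(NULL)                              # assumption: null device, no file is written
par(mar = rep(0.1, 4), cex = 2)
plot.new()
plot.window(c(0, 10), c(0, 1))
text(1:9, 0.5, ex_labels)              # one call draws all nine labels
invisible(dev.off())
```

<p>Keeping the labels in a variable also makes them easy to inspect with deparse() before drawing.</p>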
<h2 id="parse-函数"><a href="#parse-函数" class="headerlink" title="parse 函数"></a><strong>The parse function</strong></h2><p>parse() reads text from a file and turns it into expressions; its return value is of type expression. It is a useful function, and an example appears later. Note that the expression-producing functions do not evaluate their arguments, so code that would fail when evaluated can still be captured:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- <span class="number">1</span></span><br><span class="line">x + <span class="string">"x"</span></span><br><span class="line"><span class="comment">## Error: non-numeric argument to binary operator</span></span><br><span class="line">expression(x + <span class="string">"x"</span>)</span><br><span class="line"><span class="comment">## expression(x + "x")</span></span><br><span class="line">quote(x + <span class="string">"x"</span>)</span><br><span class="line"><span class="comment">## x + "x"</span></span><br></pre></td></tr></table></figure></p>
<p>However, R still checks the operators when it parses an expression, and an expression that breaks the operator syntax raises an error:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">expression(x + +++y)</span><br><span class="line"><span class="comment">## expression(x + +++y) </span></span><br><span class="line"></span><br><span class="line">expression(x - ---y)</span><br><span class="line"><span class="comment">## expression(x - ---y) </span></span><br><span class="line"><span class="comment">## expression(x****y) (Not run) expression(x////y) (Not run) </span></span><br><span class="line"><span class="comment">## expression(1&lt;=x&lt;=4) (Not run) </span></span><br><span class="line"></span><br><span class="line">quote(x + +++y)</span><br><span class="line"><span class="comment">## x + +++y </span></span><br><span class="line"></span><br><span class="line">quote(x - ---y)</span><br><span class="line"><span class="comment">## x - ---y </span></span><br><span class="line"><span class="comment">## quote(x****y) (Not run) quote(x////y) (Not run) quote(1&lt;=x&lt;=4) (Not run)</span></span><br></pre></td></tr></table></figure></p>
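<p>The "Not run" cases can be tested safely by handing the text to parse() and catching the failure. A minimal sketch; the helper is_valid_syntax is hypothetical and not part of the original article:</p>

```r
# Hypothetical helper: TRUE if the text is syntactically valid R code.
is_valid_syntax <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

candidates <- c("x + +++y", "x - ---y", "x****y", "x////y", "1<=x<=4")
valid <- vapply(candidates, is_valid_syntax, logical(1))
```

<p>Only the first two candidates are accepted; the other three are rejected by the parser before any evaluation takes place.</p>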
<p>Repeated + and - are accepted because they can also be read as unary plus and minus. For other operators, the restriction can be worked around by using paste() inside the expression-producing function. In this setting paste handles its arguments the same way the expression-producing functions do: operators are checked but variable names are not, and awkward operator sequences can be passed as strings. Using NULL as an operand produces a surprisingly useful effect:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">ex &lt;- expression(paste(x, <span class="string">"////"</span>, y))</span><br><span class="line">cl &lt;- quote(paste(x, <span class="string">"****"</span>, y))</span><br><span class="line">par(mar = rep(<span class="number">0.1</span>, <span class="number">4</span>), cex = <span class="number">2</span>)</span><br><span class="line">plot.new()</span><br><span class="line">plot.window(c(<span class="number">0</span>, <span class="number">1.2</span>), c(<span class="number">0</span>, <span class="number">1</span>))</span><br><span class="line">text(<span class="number">0.2</span>, <span class="number">0.5</span>, ex)</span><br><span class="line">text(<span class="number">0.6</span>, <span class="number">0.5</span>, cl)</span><br><span class="line">cl &lt;- quote(paste(<span class="number">1</span> &lt;= x, <span class="literal">NULL</span> &lt;= <span class="number">4</span>))</span><br><span class="line">text(<span class="number">1</span>, <span class="number">0.5</span>, cl)</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151029101039626" alt=""></p>
<hr>
<h1 id="R绘图函数对文本参数中的表达式的处理"><a href="#R绘图函数对文本参数中的表达式的处理" class="headerlink" title="R绘图函数对文本参数中的表达式的处理"></a><strong>How R plotting functions handle expressions in text arguments</strong></h1><p>quote, bquote and substitute return one of three types: call, name, or constant. The elements of the result of expression() are in fact of these same three types. Since expression() returns an expression vector, we can extract its elements with [[ ]] and check:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line">(ex &lt;- expression(<span class="number">1</span> + sqrt(x), x, <span class="number">1</span>))</span><br><span class="line"><span class="comment">## expression(1 + sqrt(x), x, 1)</span></span><br><span class="line"></span><br><span class="line">ex[[<span class="number">1</span>]]</span><br><span class="line"><span class="comment">## 1 + sqrt(x)</span></span><br><span class="line"></span><br><span class="line">mode(ex[[<span class="number">1</span>]])</span><br><span class="line"><span class="comment">## [1] "call"</span></span><br><span class="line"></span><br><span class="line">typeof(ex[[<span class="number">1</span>]])</span><br><span class="line"><span class="comment">## [1] "language"</span></span><br><span class="line"></span><br><span class="line">ex[[<span class="number">2</span>]]</span><br><span class="line"><span class="comment">## x</span></span><br><span class="line"></span><br><span class="line">mode(ex[[<span class="number">2</span>]])</span><br><span class="line"><span class="comment">## [1] "name"</span></span><br><span class="line"></span><br><span class="line">typeof(ex[[<span class="number">2</span>]])</span><br><span class="line"><span class="comment">## [1] "symbol"</span></span><br><span class="line"></span><br><span class="line">ex[[<span class="number">3</span>]]</span><br><span class="line"><span class="comment">## [1] 1</span></span><br><span class="line"></span><br><span class="line">mode(ex[[<span class="number">3</span>]])</span><br><span class="line"><span class="comment">## [1] "numeric"</span></span><br><span class="line"></span><br><span class="line">typeof(ex[[<span class="number">3</span>]])</span><br><span class="line"><span class="comment">## [1] "double"</span></span><br></pre></td></tr></table></figure></p>
<p>That is indeed the case. Plotting functions therefore have three cases to handle when a text argument is an expression. First, the results:</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">par(mar = rep(<span class="number">0.1</span>, <span class="number">4</span>), cex = <span class="number">2</span>)</span><br><span class="line">plot.new()</span><br><span class="line">plot.window(c(<span class="number">0</span>, <span class="number">1.2</span>), c(<span class="number">0</span>, <span class="number">1</span>))</span><br><span class="line">text(<span class="number">0.2</span>, <span class="number">0.5</span>, ex[<span class="number">1</span>])</span><br><span class="line">text(<span class="number">0.6</span>, <span class="number">0.5</span>, ex[<span class="number">2</span>])</span><br><span class="line">text(<span class="number">1</span>, <span class="number">0.5</span>, ex[<span class="number">3</span>])</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151029101647891" alt=""></p>
<p>Names and constants are simple: they are drawn as plain text. Calls are less obvious. We said earlier that the length of a call depends on the number of arguments of the function or operator. How does that show up here? Since a text argument ultimately yields text, let us inspect calls with as.character:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">as.character(quote(x - y))</span><br><span class="line"><span class="comment">## [1] "-" "x" "y"</span></span><br><span class="line"></span><br><span class="line">as.character(quote(<span class="number">1</span> - x + y))</span><br><span class="line"><span class="comment">## [1] "+" "1 - x" "y"</span></span><br><span class="line"></span><br><span class="line">as.character(quote((<span class="number">1</span> + x) * y))</span><br><span class="line"><span class="comment">## [1] "*" "(1 + x)" "y"</span></span><br><span class="line"></span><br><span class="line">as.character(quote(!a))</span><br><span class="line"><span class="comment">## [1] "!" "a"</span></span><br><span class="line"></span><br><span class="line">as.character(quote(sqrt(x)))</span><br><span class="line"><span class="comment">## [1] "sqrt" "x"</span></span><br></pre></td></tr></table></figure></p>
<p>After conversion to a character vector, the first element is the operator or function name, followed by the arguments (if an argument itself contains an operator or function name, R parses it in turn). Operators and functions are handled the same way. In fact, every operator in R (arithmetic and logical alike) is a function, and can be called in function form:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">`+`(<span class="number">2</span>, <span class="number">4</span>)</span><br><span class="line"><span class="comment">## [1] 6</span></span><br><span class="line"></span><br><span class="line">`-`(<span class="number">2</span>, <span class="number">4</span>)</span><br><span class="line"><span class="comment">## [1] -2</span></span><br><span class="line"></span><br><span class="line">`&lt;=`(<span class="number">2</span>, <span class="number">4</span>)</span><br><span class="line"><span class="comment">## [1] TRUE</span></span><br><span class="line"></span><br><span class="line">`&gt;=`(<span class="number">2</span>, <span class="number">4</span>)</span><br><span class="line"><span class="comment">## [1] FALSE</span></span><br></pre></td></tr></table></figure></p>
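<p>Because operators are ordinary functions, they can be passed around like any other function and used to build calls programmatically. A short sketch with made-up variable names:</p>

```r
total   <- Reduce(`+`, 1:4)            # fold the operator over a vector
shifted <- sapply(1:3, `-`, 1)         # pass the operator as a function argument
cmp     <- do.call("<=", list(2, 4))   # call the operator by name

built <- call("+", 2, 4)               # an unevaluated call, same as quote(2 + 4)
value <- eval(built)                   # evaluate it explicitly
```

<p>The call built with call() prints in infix form, although it is stored with the operator name first, exactly like the as.character results above.</p>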
<p>For the function names and arguments contained in an expression, R plotting functions first apply TeX-like text-formatting rules. The details are documented in ?plotmath and mostly concern how mathematical formulas and symbols are written. If you copy the example strings from that help page into a file maths.txt in the current working directory, the following code draws the table shown below:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">ex &lt;- parse(<span class="string">"maths.txt"</span>)</span><br><span class="line">labs &lt;- readLines(<span class="string">"maths.txt"</span>)</span><br><span class="line">n &lt;- length(ex)</span><br><span class="line">par(mar = rep(<span class="number">0.1</span>, <span class="number">4</span>), cex = <span class="number">0.8</span>)</span><br><span class="line">plot.new()</span><br><span class="line">plot.window(c(<span class="number">0</span>, <span class="number">8</span>), c(<span class="number">0</span>, n/<span class="number">4</span>))</span><br><span class="line">y &lt;- seq(n/<span class="number">4</span>, by = -<span class="number">1</span>, length = n/<span class="number">4</span>)</span><br><span class="line">x &lt;- seq(<span class="number">0.1</span>, by = <span class="number">2</span>, length = <span class="number">4</span>)</span><br><span class="line">xy &lt;- expand.grid(x, y)</span><br><span class="line">text(xy, labs, adj = c(<span class="number">0</span>, <span class="number">0.5</span>))</span><br><span class="line">xy &lt;- expand.grid(x + <span class="number">1.3</span>, y)</span><br><span class="line">text(xy, ex, adj = c(<span class="number">0</span>, <span class="number">0.5</span>), col = <span class="string">"blue"</span>)</span><br><span class="line">box(lwd = <span class="number">2</span>)</span><br><span class="line">abline(v = seq(<span class="number">1.3</span>, by = <span class="number">2</span>, length = <span class="number">4</span>), lty = <span class="number">3</span>)</span><br><span class="line">abline(v = seq(<span class="number">2</span>, by = <span class="number">2</span>, length = <span class="number">3</span>), lwd = <span class="number">1.5</span>)</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151029105049990" alt=""></p>
<p>The odd columns of the table hold the strings (expressions) and the even columns (blue) show the TeX-style rendering. Besides the rules listed in the table, there are Latin and Greek symbols, which can be written in an expression with the symbol function or by name (such as alpha); look them up when you need them. If a function name (operators included) has a corresponding formatting rule, the name and its arguments are drawn according to that rule; if not, it is treated as an ordinary R function:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">ex &lt;- expression(sqrt(x), x + y, x^<span class="number">2</span>, x %<span class="keyword">in</span>% A, x &lt;= y, mean(x, y, z), x | y, x &amp; y)</span><br><span class="line">n &lt;- length(ex)</span><br><span class="line">par(mar = rep(<span class="number">0.1</span>, <span class="number">4</span>), cex = <span class="number">1.5</span>)</span><br><span class="line">col &lt;- c(<span class="string">"red"</span>, <span class="string">"blue"</span>)</span><br><span class="line">plot.new()</span><br><span class="line">plot.window(c(<span class="number">0</span>, n), c(<span class="number">0</span>, <span class="number">1</span>))</span><br><span class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:n) text(i - <span class="number">0.5</span>, <span class="number">0.5</span>, ex[i], col = col[i%%<span class="number">2</span> + <span class="number">1</span>])</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151029103624683" alt=""></p>
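<p>In the example above, the first five functions and operators have corresponding mathematical notation, so they are drawn (symbols and order) as in ordinary mathematical writing. The last three have none, so they are drawn in plain function form: the function name first, then the arguments in parentheses separated by commas. There are further minor rules, which you can look up yourself.</p>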
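<p>Putting the pieces together, a common use of these rules is a title or axis label that mixes mathematical notation with a computed value. The sketch below assumes a null PDF device so that it runs without a display; fit_r2 is a made-up number, not a result from the article.</p>

```r
fit_r2 <- 0.87                        # made-up value for illustration
lab <- bquote(R^2 == .(fit_r2))       # splice the current value into the label

pdf(NULL)                             # assumption: null device, no file is written
plot(1:10,
     main = lab,                      # a call, drawn with plotmath rules
     xlab = expression(alpha),        # a name with a plotmath meaning
     ylab = quote(sqrt(x)))           # a call with a plotmath rule
invisible(dev.off())
```

<p>All three kinds of object discussed above (expression vectors and calls built with quote or bquote) are accepted wherever a plotting function takes a text argument.</p>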
<p>This article is reposted from <a href="http://blog.sciencenet.cn/blog-253562-818554.html" target="_blank" rel="external"><font color="red" size="3" face="黑体">Li Pengcheng's blog on ScienceNet (科学网)</font></a></p>
</content>
<summary type="html">
<p>In R plotting functions, if a text argument is a valid R expression, that expression is typeset using TeX-like rules.</p>
</summary>
<category term="R" scheme="http://bgods.top/tags/R/"/>
</entry>
<entry>
<title>jiebaR Chinese Word Segmentation: A Quick Start</title>
<link href="http://bgods.top/2016/06/27/jiebaR%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E5%BF%AB%E9%80%9F%E5%85%A5%E9%97%A8/"/>
<id>http://bgods.top/2016/06/27/jiebaR中文分词快速入门/</id>
<published>2016-06-27T11:58:57.000Z</published>
<updated>2016-09-16T10:07:11.000Z</updated>
<content type="html"><p>These are my notes based on the <a href="http://qinwenfeng.com/jiebaR/index.html" target="_blank" rel="external">jiebaR Chinese word segmentation documentation</a>, kept for later study. See also the <a href="https://cran.r-project.org/web/packages/jiebaR/jiebaR.pdf" target="_blank" rel="external">official English manual</a> and the <a href="https://cran.r-project.org/web/packages/jiebaR/index.html" target="_blank" rel="external">jiebaR page on CRAN</a>.<br><a id="more"></a></p>
<hr>
<h1 id="分词"><a href="#分词" class="headerlink" title="分词"></a><strong>Word segmentation</strong></h1><p>jiebaR provides four segmentation modes. Initialize a segmentation engine with worker() and segment text with segment(); see ?worker for details.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">text &lt;- <span class="string">'你要明白,这仅仅是一个测试文本'</span></span><br><span class="line">mixseg &lt;- worker() <span class="comment"># default arguments: mixed model (MixSegment)</span></span><br><span class="line"></span><br><span class="line">segment(text, mixseg)</span><br><span class="line"><span class="comment"># equivalent to mixseg[text]</span></span><br><span class="line"><span class="comment"># also equivalent to mixseg &lt;= text</span></span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019210108164" alt=""></p>
<p>Typing mixseg on its own prints the configuration of this worker:<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mixseg</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019210344931" alt=""></p>
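<p>Some worker settings can be changed with R's usual \$ operator; for example, WorkerName\$symbol = T keeps punctuation in the output. Other settings are fixed at initialization and cannot be modified; they can be inspected via WorkerName\$PrivateVarible.</p>
<p><strong>Files can be segmented directly, which saves reading the file in before segmenting</strong></p>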
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">segment(<span class="string">'D:/test.txt'</span>, mixseg) <span class="comment"># the input file encoding is detected automatically; by default the output file is written to the same directory</span></span><br><span class="line"><span class="comment"># equivalent to mixseg['D:/test.txt']</span></span><br><span class="line"><span class="comment"># also equivalent to mixseg &lt;= 'D:/test.txt'</span></span><br></pre></td></tr></table></figure>
<hr>
<h2 id="最大概率法(MPSegment)"><a href="#最大概率法(MPSegment)" class="headerlink" title="最大概率法(MPSegment)"></a><strong>Maximum probability (MPSegment)</strong></h2><p>MPSegment builds a directed acyclic graph from a Trie and runs dynamic programming over it; it is the core of the segmentation algorithm.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">mpseg &lt;- worker(<span class="string">'mp'</span>) <span class="comment"># maximum probability (MPSegment)</span></span><br><span class="line">mpseg[text]</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019210856454" alt=""></p>
<h2 id="隐式马尔科夫模型(HMMSegment)"><a href="#隐式马尔科夫模型(HMMSegment)" class="headerlink" title="隐式马尔科夫模型(HMMSegment)"></a><strong>Hidden Markov Model (HMMSegment)</strong></h2><p>HMMSegment segments text with an HMM built from corpora such as the People's Daily. The main idea is to represent the hidden state of each character with one of four states (B, E, M, S). The HMM is provided by dict/hmm_model.utf8, and segmentation uses the Viterbi algorithm.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">hmmseg &lt;- worker(<span class="string">'hmm'</span>) <span class="comment"># Hidden Markov Model (HMMSegment)</span></span><br><span class="line">hmmseg[text]</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019211005038" alt=""></p>
<h2 id="混合模型(MixSegment)"><a href="#混合模型(MixSegment)" class="headerlink" title="混合模型(MixSegment)"></a><strong>Mixed model (MixSegment)</strong></h2><p>MixSegment gives the better segmentation results among the four engines; it combines the maximum probability method with the Hidden Markov Model.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">mixseg &lt;- worker(<span class="string">'mix'</span>) <span class="comment"># mixed model (MixSegment)</span></span><br><span class="line">mixseg[text]</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019211044468" alt=""></p>
<h2 id="索引模型(QuerySegment)"><a href="#索引模型(QuerySegment)" class="headerlink" title="索引模型(QuerySegment)"></a><strong>Index model (QuerySegment)</strong></h2><p>QuerySegment first segments with the mixed model; then, for the longer words it produces, it enumerates all possible sub-words and keeps those found in the dictionary.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">queryseg &lt;- worker(<span class="string">'query'</span>) <span class="comment"># index model (QuerySegment)</span></span><br><span class="line">queryseg[text]</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019211147663" alt=""></p>
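<p>To compare the four engines side by side, the same text can be run through each of them in a loop. This is a sketch that assumes the jiebaR package is installed (the code is skipped otherwise); it uses only worker() and segment() as shown above, and the names modes and results are made up.</p>

```r
text <- "你要明白,这仅仅是一个测试文本"
modes <- c("mp", "hmm", "mix", "query")   # the four engine types described above

if (requireNamespace("jiebaR", quietly = TRUE)) {
  # One worker per mode, each applied to the same sentence.
  results <- lapply(modes, function(m) jiebaR::segment(text, jiebaR::worker(m)))
  names(results) <- modes
  print(results)
}
```

<p>Creating a worker loads its dictionaries, so in real code it is better to create each worker once and reuse it.</p>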
<hr>
<h1 id="标注词性"><a href="#标注词性" class="headerlink" title="标注词性"></a><strong>Part-of-speech tagging</strong></h1><p>Use &lt;=.tagger or tag to segment and tag parts of speech in one step. Tagging segments with the mixed model and uses labels compatible with ictclas.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">tagseg &lt;- worker(<span class="string">'tag'</span>)</span><br><span class="line">tagseg[text]</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019205943503" alt=""> </p>
<h1 id="提取关键词"><a href="#提取关键词" class="headerlink" title="提取关键词"></a><strong>Keyword extraction</strong></h1><p>The inverse document frequency (IDF) corpus used for keyword extraction can be switched to the path of a custom corpus; usage is similar to segmentation. The topn argument sets the number of keywords.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">keys = worker(<span class="string">'keywords'</span>, topn = <span class="number">2</span>) <span class="comment"># topn: number of top-ranked keywords to extract</span></span><br><span class="line">keys &lt;= text</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019212542289" alt=""></p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">keys &lt;= <span class="string">"filename.txt"</span> <span class="comment"># keywords can likewise be extracted from a file</span></span><br></pre></td></tr></table></figure>
<h1 id="simhash计算"><a href="#simhash计算" class="headerlink" title="simhash计算"></a><strong>Computing simhash</strong></h1><p>This computes the simhash value of a Chinese document. simhash is an algorithm Google uses for text deduplication and is now widely applied in text processing. The Simhash engine first segments the text and extracts keywords, then computes the simhash value and the Hamming distance.<br><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">simhasher = worker(<span class="string">"simhash"</span>, topn = <span class="number">2</span>)</span><br><span class="line">simhasher &lt;= text</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019213020089" alt=""></p>
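<p>To make the idea concrete, here is a minimal Python 3 sketch of a simhash fingerprint and the Hamming distance between two fingerprints. It assumes the keywords and their weights are already available, and it uses MD5 as the per-word hash purely for convenience; jiebaR's own hash function and weighting are not reproduced here.</p>

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Fold (word, weight) pairs into one fingerprint; bits must be a multiple of 8, at most 128."""
    totals = [0.0] * bits
    for word, weight in weighted_words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Add the weight where the word's hash has a 1 bit, subtract it otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical keyword/weight lists for two similar documents.
doc1 = [("segmentation", 3.0), ("keyword", 2.0), ("simhash", 1.0)]
doc2 = [("segmentation", 3.0), ("keyword", 2.0), ("hash", 1.0)]
fp1, fp2 = simhash(doc1), simhash(doc2)
distance = hamming_distance(fp1, fp2)
```

<p>Documents that share their heaviest keywords tend to produce fingerprints that differ in few bits, so a small Hamming distance suggests near-duplicate text.</p>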
<h1 id="快速模式"><a href="#快速模式" class="headerlink" title="快速模式"></a><strong>Quick mode</strong></h1><p>Quick mode starts an engine with default parameters and segments immediately, with no call to worker(). Use <strong>qseg</strong> (quick segmentation): it builds a segmentation engine automatically with the default segmentation mode, much like the qplot function in the ggplot2 package.<br><figure class="highlight r"><table><tr><td class="code"><pre><span class="line">qseg &lt;= text</span><br></pre></td></tr></table></figure></p>
<p><img src="http://img.blog.csdn.net/20151019213310248" alt=""></p>
<figure class="highlight r"><table><tr><td class="code"><pre><span class="line">worker(<span class="string">'mix'</span>) <span class="comment"># show the configuration of worker('mix')</span></span><br><span class="line">qseg <span class="comment"># show the configuration of qseg; both calls give the result below</span></span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151019214331195" alt=""></p>
<p>In fact, the first run starts the default engine quick_worker, which is equivalent to first running:<br><figure class="highlight r"><table><tr><td class="code"><pre><span class="line">qseg = worker(<span class="string">'mix'</span>)</span><br></pre></td></tr></table></figure></p>
<hr>
<ul>
<li>Model parameters can be reset through <font color="#8B0000">qseg$</font>; doing so changes the defaults used on every later default start-up;</li>
<li>To change model parameters only temporarily, use the non-quick-mode style of modification, <font color="#8B0000">quick_worker$</font>.<figure class="highlight r"><table><tr><td class="code"><pre><span class="line">qseg$type = <span class="string">"mp"</span> <span class="comment"># reset a model parameter and restart the engine; the setting persists the next time the package is loaded</span></span><br><span class="line"></span><br><span class="line">quick_worker$detect = <span class="literal">T</span> <span class="comment"># temporary change; the original default is restored the next time the package is loaded</span></span><br><span class="line">get_qsegmodel() <span class="comment"># get the current default parameters of quick mode</span></span><br></pre></td></tr></table></figure>
</li>
</ul>
</content>
<summary type="html">
<p> These notes follow the <a href="http://qinwenfeng.com/jiebaR/index.html">jiebaR Chinese word segmentation help documentation</a> and are kept for later study. See also the <a href="https://cran.r-project.org/web/packages/jiebaR/jiebaR.pdf">official English documentation</a> and the <a href="https://cran.r-project.org/web/packages/jiebaR/index.html">jiebaR page on CRAN</a>.<br>
</summary>
<category term="R" scheme="http://bgods.top/tags/R/"/>
<category term="jiebaR" scheme="http://bgods.top/tags/jiebaR/"/>
</entry>
<entry>
<title>Chinese word segmentation with jiebaR, plus a word cloud (R)</title>
<link href="http://bgods.top/2016/06/27/jiebaR%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%EF%BC%8C%E5%B9%B6%E5%81%9A%E8%AF%8D%E4%BA%91%EF%BC%88R%E8%AF%AD%E8%A8%80%EF%BC%89/"/>
<id>http://bgods.top/2016/06/27/jiebaR中文分词,并做词云(R语言)/</id>
<published>2016-06-27T11:55:40.000Z</published>
<updated>2016-09-16T10:17:23.000Z</updated>
<content type="html"><p> Segment crawled <a href="http://blog.csdn.net/songzhilian22/article/details/49081581" target="_blank" rel="external">Sina News</a> text with jiebaR (the jieba Chinese word segmenter), count word frequencies, and draw a word cloud with the wordcloud package.</p>
<a id="more"></a>
<h1 id="1、读入数据"><a href="#1、读入数据" class="headerlink" title="1、读入数据"></a><strong>1. Read the data</strong></h1><hr>
<p> The data below was crawled <a href="http://blog.csdn.net/songzhilian22/article/details/49081581" target="_blank" rel="external">here</a>. Only the society-news category is tested, and the file is still fairly large: segmentation yields more than ten million words, and nearly 300,000 distinct words remain after processing.</p>
<p><img src="http://img.blog.csdn.net/20151017131951178" alt=""><br><figure class="highlight r"><table><tr><td class="code"><pre><span class="line"><span class="keyword">library</span>(jiebaR)</span><br><span class="line"><span class="keyword">library</span>(wordcloud)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Read with '\n' as the separator and UTF-8 encoding; what='' reads each item as a character string</span></span><br><span class="line">f &lt;- scan(<span class="string">'D:/数据/News/shxw.txt'</span>,sep=<span class="string">'\n'</span>,what=<span class="string">''</span>,encoding=<span class="string">"UTF-8"</span>)</span><br></pre></td></tr></table></figure></p>
<h1 id="2、数据处理"><a href="#2、数据处理" class="headerlink" title="2、数据处理"></a><strong>2. Process the data</strong></h1><hr>
<figure class="highlight r"><table><tr><td class="code"><pre><span class="line">seg &lt;- qseg[f] <span class="comment"># segment with qseg and store the result in seg</span></span><br><span class="line">seg &lt;- seg[nchar(seg)&gt;<span class="number">1</span>] <span class="comment"># drop words shorter than 2 characters</span></span><br><span class="line"></span><br><span class="line">seg &lt;- table(seg) <span class="comment"># count word frequencies</span></span><br><span class="line"></span><br><span class="line">seg &lt;- seg[!grepl(<span class="string">'[0-9]+'</span>,names(seg))] <span class="comment"># drop words that contain digits</span></span><br><span class="line">length(seg) <span class="comment"># number of distinct words left after processing</span></span><br><span class="line"><span class="comment"># [1] 288955</span></span><br><span class="line">seg &lt;- sort(seg, decreasing = <span class="literal">TRUE</span>)[<span class="number">1</span>:<span class="number">100</span>] <span class="comment"># sort in descending order and keep the 100 most frequent words</span></span><br><span class="line">seg <span class="comment"># show the 100 most frequent words</span></span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151017130828586" alt=""></p>
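<p>For readers who want to check the filtering logic outside R, the same three steps (drop one-character tokens, drop tokens containing digits, keep the most frequent words) can be sketched in Python 3. The function name top_words and the sample tokens are made up for this illustration; the input is assumed to be already segmented.</p>

```python
import re
from collections import Counter


def top_words(tokens, n=100):
    """Drop one-character tokens and tokens containing digits, then return the n most frequent."""
    kept = [t for t in tokens if len(t) > 1 and not re.search(r"[0-9]", t)]
    return Counter(kept).most_common(n)


# Hypothetical segmented tokens standing in for the output of the segmenter.
tokens = ["新闻", "的", "社会", "新闻", "2015年", "社会", "新闻"]
top = top_words(tokens, n=2)
```

<p>Filtering before counting keeps the frequency table small, which matters when the input runs to millions of tokens.</p>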
<h1 id="3、做词云"><a href="#3、做词云" class="headerlink" title="3、做词云"></a><strong>3. Draw the word cloud</strong></h1><hr>
<figure class="highlight r"><table><tr><td class="code"><pre><span class="line">bmp(<span class="string">"comment_cloud.bmp"</span>, width = <span class="number">500</span>, height = <span class="number">500</span>) <span class="comment"># set up the canvas</span></span><br><span class="line">par(bg = <span class="string">"black"</span>) <span class="comment"># background color</span></span><br><span class="line">wordcloud(names(seg), seg, colors = rainbow(<span class="number">100</span>), random.order=<span class="literal">F</span>)</span><br><span class="line">dev.off()</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151017141130747" alt=""></p>
</content>
<summary type="html">
<p> Segment crawled <a href="http://blog.csdn.net/songzhilian22/article/details/49081581">Sina News</a> text with jiebaR (the jieba Chinese word segmenter), count word frequencies, and draw a word cloud with the wordcloud package.</p>
</summary>
<category term="R" scheme="http://bgods.top/tags/R/"/>
<category term="jiebaR" scheme="http://bgods.top/tags/jiebaR/"/>
</entry>
<entry>
<title>Configuring JDK environment variables (Linux)</title>
<link href="http://bgods.top/2016/06/27/JDK%E7%8E%AF%E5%A2%83%E5%8F%98%E9%87%8F%E9%85%8D%E7%BD%AE-linux/"/>
<id>http://bgods.top/2016/06/27/JDK环境变量配置-linux/</id>
<published>2016-06-27T11:41:30.000Z</published>
<updated>2016-09-16T10:26:00.000Z</updated>
<content type="html"><p> Notes on configuring JDK environment variables on Linux.</p>
<a id="more"></a>
<hr>
<ul>
<li>Create the directory "/usr/java"</li>
</ul>
<p><img src="http://img.blog.csdn.net/20151011112912411" alt=""></p>
<ul>
<li>Copy the JDK archive into "/usr/java"</li>
</ul>
<p><img src="http://img.blog.csdn.net/20151011113357014" alt=""></p>
<ul>
<li>Run "tar zxvf jdk-7u71-linux-x64.gz" to extract the archive into the current directory; this produces the directory "jdk1.7.0_71"</li>
</ul>
<p><img src="http://img.blog.csdn.net/20151011113535669" alt=""></p>
<ul>
<li>Run "vi /etc/profile" and append the following lines to the end of "/etc/profile"</li>
</ul>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">export</span> JAVA_HOME=/usr/java/jdk1.7.0_71</span><br><span class="line"><span class="built_in">export</span> JRE_HOME=/usr/java/jdk1.7.0_71/jre</span><br><span class="line"><span class="built_in">export</span> PATH=<span class="variable">$PATH</span>:<span class="variable">$JAVA_HOME</span>/bin:<span class="variable">$JRE_HOME</span>/bin</span><br><span class="line"><span class="built_in">export</span> CLASSPATH=.:<span class="variable">$JAVA_HOME</span>/lib/dt.jar:<span class="variable">$JAVA_HOME</span>/lib/tools.jar:<span class="variable">$JRE_HOME</span>/lib</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151011113806486" alt=""></p>
<ul>
<li>Run "source /etc/profile" so the changes take effect immediately, then run "java -version" to check that the Java variables are configured correctly</li>
</ul>
<p><img src="http://img.blog.csdn.net/20151011113941815" alt=""></p>
<p>If output like the above appears, the configuration succeeded.</p>
</content>
<summary type="html">
<p> Notes on configuring JDK environment variables on Linux.</p>
</summary>
<category term="jdk" scheme="http://bgods.top/tags/jdk/"/>
</entry>
<entry>
<title>Using cookies in Python to log in to Sina Weibo and fetch user information</title>
<link href="http://bgods.top/2016/06/27/python%E4%BD%BF%E7%94%A8cookie%E7%99%BB%E9%99%86%E6%96%B0%E6%B5%AA%E5%BE%AE%E5%8D%9A%E7%94%A8%E6%88%B7%E4%BF%A1%E6%81%AF/"/>
<id>http://bgods.top/2016/06/27/python使用cookie登陆新浪微博用户信息/</id>
<published>2016-06-27T11:28:02.000Z</published>
<updated>2016-09-16T11:35:01.000Z</updated>
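<content type="html"><p> The previous post, <a href="http://blog.csdn.net/songzhilian22/article/details/48396545" target="_blank" rel="external">Simulating a Sina Weibo login in Python: getting the cookies</a>, logged in to Sina Weibo and saved the cookies. This post uses those cookies to visit other Sina Weibo pages and fetch the information we want.</p>
<p> For reference, I use Python 2.7.10 (64-bit) with PyCharm as the IDE on Windows 8.1. The packages used are base64, rsa, binascii, re, and requests.</p>
<ul>
<li>First, I visit my own Sina Weibo home page and fetch all of my posts;</li>
<li>then I open my following page and fetch the user ID and user name of each user I follow;</li>
<li>finally, I visit each of those users' Weibo pages and fetch all of their posts.<a id="more"></a>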
</li>
</ul>
<hr>
<h1 id="获取我的uid与用户名"><a href="#获取我的uid与用户名" class="headerlink" title="获取我的uid与用户名"></a><strong>Getting my uid and user name</strong></h1><p> The code below fetches my uid and user name; it needs the <a href="http://blog.csdn.net/songzhilian22/article/details/48396545" target="_blank" rel="external">cookies</a> as its argument.<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_myuid</span><span class="params">(cookies)</span>:</span></span><br><span class="line"></span><br><span class="line">    url = <span class="string">'http://weibo.com/'</span></span><br><span class="line">    html = requests.get(url,cookies=cookies).content <span class="comment"># GET the Weibo home page, passing the cookies so the request is logged in</span></span><br><span class="line"></span><br><span class="line">    a = html.find(<span class="string">'[\'uid\']='</span>)</span><br><span class="line">    b = html[a:].find(<span class="string">';'</span>)</span><br><span class="line">    myuid = html[a + len(<span class="string">'[\'uid\']='</span>): a + b][<span class="number">1</span>:<span class="number">-1</span>] <span class="comment"># my uid, with the surrounding quotes stripped</span></span><br><span class="line"></span><br><span class="line">    a = html.find(<span class="string">'[\'nick\']='</span>)</span><br><span class="line">    b = html[a:].find(<span class="string">';'</span>)</span><br><span class="line">    myname = html[a + len(<span class="string">'[\'nick\']='</span>): a + b][<span class="number">1</span>:<span class="number">-1</span>] <span class="comment"># my user name, with the surrounding quotes stripped</span></span><br><span class="line">    <span class="keyword">return</span> myuid,myname</span><br></pre></td></tr></table></figure></p>
<p> The following call assigns the two returned strings to the variables myuid and myname:<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line">myuid,myname = get_myuid(cookies)</span><br></pre></td></tr></table></figure></p>
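<p>The find-and-slice approach above returns garbage when the marker is missing, because find then returns -1. As a hedged alternative, the Python 3 sketch below pulls the same kind of value out with a regular expression and returns None when nothing matches. The helper name extract_config_value and the sample string are invented for this example; the real page layout is only assumed to resemble it.</p>

```python
import re


def extract_config_value(html, key):
    """Return the quoted value assigned to ['key'] in the page source, or None."""
    pattern = r"\['" + re.escape(key) + r"'\]\s*=\s*'([^']*)'"
    match = re.search(pattern, html)
    return match.group(1) if match else None


# Hypothetical fragment shaped like the assignments the function above searches for.
sample = "$CONFIG['uid']='1234567890'; $CONFIG['nick']='example_user';"
uid = extract_config_value(sample, "uid")
nick = extract_config_value(sample, "nick")
```

<p>Returning None makes an expired cookie or a changed page layout easy to detect before the uid is used to build further URLs.</p>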
<hr>
<h1 id="获取我关注用户的uid、用户名"><a href="#获取我关注用户的uid、用户名" class="headerlink" title="获取我关注用户的uid、用户名"></a><strong>Getting the uid and user name of the users I follow</strong></h1><p> Below is the code for get_follow(myuid,cookies), which fetches the uid and user name of each user I follow.<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_follow</span><span class="params">(myuid,cookies)</span>:</span></span><br><span class="line"></span><br><span class="line">    <span class="string">'''Get the uid and user name of the Weibo users being followed'''</span></span><br><span class="line">    url = <span class="string">'http://weibo.com/'</span> + myuid + <span class="string">'/follow'</span></span><br><span class="line">    html = requests.get(url,cookies=cookies).content</span><br><span class="line"></span><br><span class="line">    c = html.find(<span class="string">'member_ul clearfix'</span>)<span class="number">-13</span></span><br><span class="line">    html = html[c:]</span><br><span class="line">    u = re.findall(<span class="string">r'[uid=]&#123;4&#125;([0-9]+)[&amp;nick=]&#123;6&#125;(.*?)\\"'</span>,html)</span><br><span class="line"></span><br><span class="line">    user_id = []</span><br><span class="line">    uname = []</span><br><span class="line">    <span class="keyword">for</span> i <span class="keyword">in</span> u:</span><br><span class="line">        user_id.append(i[<span class="number">0</span>]) <span class="comment"># store the uid in the list user_id</span></span><br><span class="line">        uname.append(i[<span class="number">1</span>]) <span class="comment"># store the user name in the list uname</span></span><br><span class="line">    <span class="keyword">return</span> user_id,uname <span class="comment"># return the two lists</span></span><br></pre></td></tr></table></figure></p>
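<p>The pattern in get_follow uses character classes such as [uid=]{4}, which match any four characters drawn from that set rather than the literal text uid=. A stricter variant, sketched here in Python 3, spells the prefixes out literally. The names FOLLOW_PATTERN and parse_follow and the sample fragment are assumptions for illustration; whether the stricter pattern fits the live page has not been checked.</p>

```python
import re

# Literal prefixes instead of character classes: "uid=" then digits, "&nick=" then
# the name, ending at an escaped quote (a backslash followed by a double quote).
FOLLOW_PATTERN = re.compile(r'uid=([0-9]+)&nick=(.*?)\\"')


def parse_follow(fragment):
    """Return parallel lists of uids and user names found in the fragment."""
    pairs = FOLLOW_PATTERN.findall(fragment)
    return [uid for uid, _ in pairs], [name for _, name in pairs]


# Hypothetical fragment with escaped quotes, shaped like the text the pattern targets.
sample = r'action-data=\"uid=111&nick=alice\" action-data=\"uid=222&nick=bob\"'
user_ids, names = parse_follow(sample)
```

<p>Because findall returns one tuple per match, the two lists stay aligned by index, which is what the saving loop later relies on.</p>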
<h1 id="获取微博"><a href="#获取微博" class="headerlink" title="获取微博"></a><strong>Fetching posts</strong></h1><p> The URL below opens a user's Weibo page directly, where uid is the uid described above. Changing uid opens a different user's page; replacing profile with fans (or follow) opens that user's followers (or following) page.<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line">url = <span class="string">'http://weibo.com/'</span>+uid+<span class="string">'/profile'</span></span><br></pre></td></tr></table></figure></p>
<p> Below is the code for get_weibo(uid,cookies,page); it uses plain regular-expression matching with the re module.<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_weibo</span><span class="params">(uid,cookies,page)</span>:</span></span><br><span class="line">    <span class="string">'''Get the posts on the first `page` pages'''</span></span><br><span class="line"></span><br><span class="line">    url = <span class="string">'http://weibo.com/'</span>+uid+<span class="string">'/profile'</span></span><br><span class="line">    my_weibo = []</span><br><span class="line">    <span class="keyword">for</span> p <span class="keyword">in</span> range(<span class="number">1</span>,page+<span class="number">1</span>):</span><br><span class="line"></span><br><span class="line">        <span class="comment"># each Sina Weibo page loads asynchronously in three requests, each with different parameters</span></span><br><span class="line">        <span class="keyword">for</span> pb <span class="keyword">in</span> range(<span class="number">-1</span>,<span class="number">2</span>):</span><br><span class="line">            data = &#123;<span class="string">'pagebar'</span>:str(pb),</span><br><span class="line">                    <span class="string">'pre_page'</span>:str(p),</span><br><span class="line">                    <span class="string">'page'</span>:str(p),</span><br><span class="line">                    &#125;</span><br><span class="line">            <span class="keyword">if</span> p == <span class="number">1</span>:</span><br><span class="line">                <span class="keyword">if</span> pb == <span class="number">-1</span>:</span><br><span class="line">                    html = requests.get(url,cookies=cookies).content</span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    html = requests.get(url,cookies=cookies,params=data).content</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                html = requests.get(url,cookies=cookies,params=data).content</span><br><span class="line"></span><br><span class="line">            hlist = html.split(<span class="string">'node-type=\\"feed_list_content\\"'</span>)[<span class="number">1</span>:]</span><br><span class="line">            <span class="keyword">for</span> i <span class="keyword">in</span> hlist:</span><br><span class="line">                i = i.split(<span class="string">'&lt;\/div&gt;'</span>)[<span class="number">0</span>]</span><br><span class="line">                s = re.findall(<span class="string">'&gt;(.*?)&lt;'</span>,i)</span><br><span class="line">                weibo = <span class="string">''</span></span><br><span class="line">                <span class="keyword">for</span> j <span class="keyword">in</span> s:</span><br><span class="line">                    weibo = weibo + j.strip(<span class="string">'\\n /\\'</span>)</span><br><span class="line">                <span class="keyword">if</span> len(weibo) != <span class="number">0</span>:</span><br><span class="line">                    my_weibo.append(weibo) <span class="comment"># keep the extracted text in my_weibo if it is not empty</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> my_weibo <span class="comment"># return a list of post texts</span></span><br></pre></td></tr></table></figure></p>
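<p>The nested if/else in get_weibo only decides which query parameters accompany each of the three loads per page. That request plan can be written out on its own, as in the Python 3 sketch below; page_requests is a hypothetical helper name, and no network call is made.</p>

```python
def page_requests(page):
    """Yield the query parameters for each of the three loads of pages 1..page.

    None stands for a plain request with no extra parameters, matching the
    first load of page 1 in the function above.
    """
    for p in range(1, page + 1):
        for pb in range(-1, 2):
            if p == 1 and pb == -1:
                yield None
            else:
                yield {"pagebar": str(pb), "pre_page": str(p), "page": str(p)}


plan = list(page_requests(2))
```

<p>Separating the plan from the fetching makes it easy to see that two pages cost six requests, and to test the parameter logic without logging in.</p>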
<hr>
<h1 id="获取微博并保存到文本"><a href="#获取微博并保存到文本" class="headerlink" title="获取微博并保存到文本"></a><strong>Fetching posts and saving them to a text file</strong></h1><p> With the four functions written so far, Get_cookies(username,password), get_myuid(cookies), get_follow(myuid,cookies), and get_weibo(uid,cookies,page), we can fetch the uid, user name, and posts of ourselves and of the users we follow. The code is as follows:<br><figure class="highlight python"><table><tr><td class="code"><pre><span class="line">cookies = Get_cookies(username,password) <span class="comment"># get the cookies after login</span></span><br><span class="line">myuid,myname = get_myuid(cookies) <span class="comment"># get my uid and user name</span></span><br><span class="line">uid,uname = get_follow(myuid,cookies) <span class="comment"># get the uid and user name of followed users</span></span><br><span class="line"></span><br><span class="line">s = open(<span class="string">'user_weibo.txt'</span>,<span class="string">'a'</span>)</span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> range(len(uid)):</span><br><span class="line">    my_weibo = get_weibo(uid[i],cookies,<span class="number">3</span>) <span class="comment"># get the first three pages of this user's posts</span></span><br><span class="line">    <span class="keyword">for</span> j <span class="keyword">in</span> my_weibo:</span><br><span class="line">        s.write(uid[i]+<span class="string">' '</span>+uname[i]+<span class="string">' '</span>+j+<span class="string">'\n'</span>)</span><br><span class="line">    <span class="keyword">print</span> str(i+<span class="number">1</span>)+<span class="string">'/'</span>+str(len(uid)) <span class="comment"># show progress</span></span><br><span class="line"></span><br><span class="line">s.close()</span><br><span class="line"><span class="keyword">print</span> <span class="string">'All users fetched'</span></span><br></pre></td></tr></table></figure></p>
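<p> The fetched data is shown below:<br><img src="http://img.blog.csdn.net/20150924171838106" alt=""></p>
<p> That completes fetching Sina Weibo information. Going one level further, to the posts of the users followed by the users I follow, is not covered here.</p>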
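<p>The saving loop writes one line per post in the form "uid name text". A small Python 3 sketch of that step is shown below, with the writing pulled into a function so it can be exercised against an in-memory buffer. The function name write_posts and the sample values are assumptions; with a real file, a with open('user_weibo.txt', 'a', encoding='utf-8') block would close the file even if a request fails partway through.</p>

```python
import io


def write_posts(out, uid, uname, posts):
    """Write one 'uid name text' line per post and return how many were written."""
    for text in posts:
        out.write(uid + " " + uname + " " + text + "\n")
    return len(posts)


# In-memory stand-in for a file opened in append mode.
buffer = io.StringIO()
written = write_posts(buffer, "111", "alice", ["first post", "second post"])
lines = buffer.getvalue().splitlines()
```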
<hr>
<p>====================== divider ==========================</p>
<p><strong>以下是完整代码:</strong><br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span 
class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span 
class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#-*- encoding:utf-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">import</span> re</span><br><span class="line"><span class="keyword">import</span> base64,rsa,binascii</span><br><span class="line"></span><br><span class="line">username = <span class="string">'这里输入用户名'</span> <span class="comment">#用户名</span></span><br><span class="line">password = <span class="string">'这里输入密码'</span> <span class="comment">#密码</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">Get_cookies</span><span class="params">(username,password)</span>:</span></span><br><span class="line"> <span class="string">'''登陆新浪微博,获取登陆后的Cookie,返回到变量cookies中'''</span></span><br><span class="line"> url = <span class="string">'http://login.sina.com.cn/sso/prelogin.php?entry=sso&amp;callback=sinaSSOController.preloginCallBack&amp;su=%s&amp;rsakt=mod&amp;client=ssologin.js(v1.4.15)%'</span>+username</span><br><span class="line"> html = requests.get(url).content</span><br><span class="line"> </span><br><span class="line"> servertime = re.findall(<span class="string">'"servertime":(.*?),'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"> nonce = re.findall(<span class="string">'"nonce":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"> pubkey = 
re.findall(<span class="string">'"pubkey":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"> rsakv = re.findall(<span class="string">'"rsakv":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line"> username = base64.b64encode(username) <span class="comment">#加密用户名</span></span><br><span class="line"></span><br><span class="line"> rsaPublickey = int(pubkey, <span class="number">16</span>)</span><br><span class="line"> key = rsa.PublicKey(rsaPublickey, <span class="number">65537</span>) <span class="comment">#创建公钥</span></span><br><span class="line"> message = str(servertime) + <span class="string">'\t'</span> + str(nonce) + <span class="string">'\n'</span> + str(password) <span class="comment">#拼接明文js加密文件中得到</span></span><br><span class="line"> passwd = rsa.encrypt(message, key) <span class="comment">#加密</span></span><br><span class="line"> passwd = binascii.b2a_hex(passwd) <span class="comment">#将加密信息转换为16进制。</span></span><br><span class="line"></span><br><span class="line"> login_url = <span class="string">'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.4)'</span></span><br><span class="line"> data = &#123;<span class="string">'entry'</span>: <span class="string">'weibo'</span>,</span><br><span class="line"> <span class="string">'gateway'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'from'</span>: <span class="string">''</span>,</span><br><span class="line"> <span class="string">'savestate'</span>: <span class="string">'30'</span>,</span><br><span class="line"> <span class="string">'userticket'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'ssosimplelogin'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'vsnf'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'vsnval'</span>: 
<span class="string">''</span>,</span><br><span class="line"> <span class="string">'su'</span>: username,</span><br><span class="line"> <span class="string">'service'</span>: <span class="string">'miniblog'</span>,</span><br><span class="line"> <span class="string">'servertime'</span>: servertime,</span><br><span class="line"> <span class="string">'nonce'</span>: nonce,</span><br><span class="line"> <span class="string">'pwencode'</span>: <span class="string">'rsa2'</span>,</span><br><span class="line"> <span class="string">'sp'</span>: passwd,</span><br><span class="line"> <span class="string">'encoding'</span>: <span class="string">'UTF-8'</span>,</span><br><span class="line"> <span class="string">'prelt'</span>: <span class="string">'115'</span>,</span><br><span class="line"> <span class="string">'rsakv'</span> : rsakv,</span><br><span class="line"> <span class="string">'url'</span>: <span class="string">'http://weibo.com/ajaxlogin.php?framelogin=1&amp;callback=parent.sinaSSOController.feedBackUrlCallBack'</span>,</span><br><span class="line"> <span class="string">'returntype'</span>: <span class="string">'META'</span></span><br><span class="line"> &#125;</span><br><span class="line"> html = requests.post(login_url,data=data).content</span><br><span class="line"></span><br><span class="line"> urlnew = re.findall(<span class="string">'location.replace\(\'(.*?)\''</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line"> html = requests.get(urlnew)</span><br><span class="line"> cookies = html.cookies</span><br><span class="line"> <span class="keyword">return</span> cookies</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_myuid</span><span class="params">(cookies)</span>:</span></span><br><span class="line"></span><br><span class="line"> url = <span class="string">'http://weibo.com/'</span></span><br><span class="line"> html = 
requests.get(url,cookies=cookies).content <span class="comment">#用get请求加入cookies参数登陆微博主页</span></span><br><span class="line"></span><br><span class="line"> a = html.find(<span class="string">'[\'uid\']='</span>)</span><br><span class="line"> b = html[a:].find(<span class="string">';'</span>)</span><br><span class="line"> myuid = html[a + len(<span class="string">'[\'uid\']='</span>): a + b][<span class="number">1</span>:<span class="number">-1</span>] <span class="comment">#获取我的uid</span></span><br><span class="line"></span><br><span class="line"> a = html.find(<span class="string">'[\'nick\']='</span>)</span><br><span class="line"> b = html[a:].find(<span class="string">';'</span>)</span><br><span class="line"> myname = html[a + len(<span class="string">'[\'nick\']='</span>): a + b][<span class="number">1</span>:<span class="number">-1</span>] <span class="comment">#获取我的用户名</span></span><br><span class="line"> <span class="keyword">return</span> myuid,myname</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_weibo</span><span class="params">(uid,cookies,page)</span>:</span></span><br><span class="line"> <span class="string">'''获取我的前page页的微博'''</span></span><br><span class="line"></span><br><span class="line"> url = <span class="string">'http://weibo.com/'</span>+uid+<span class="string">'/profile'</span></span><br><span class="line"> my_weibo = []</span><br><span class="line"> <span class="keyword">for</span> p <span class="keyword">in</span> range(<span class="number">1</span>,page+<span class="number">1</span>):</span><br><span class="line"></span><br><span class="line"> <span class="comment">#新浪微博每一页信息是异步加载的,分三次加载</span></span><br><span class="line"> <span class="keyword">for</span> pb <span class="keyword">in</span> range(<span class="number">-1</span>,<span class="number">2</span>):</span><br><span class="line"> data = &#123;<span 
class="string">'pagebar'</span>:str(pb),</span><br><span class="line"> <span class="string">'pre_page'</span>:str(p),</span><br><span class="line"> <span class="string">'page'</span>:str(p),</span><br><span class="line"> &#125;</span><br><span class="line"> <span class="keyword">if</span> p == <span class="number">1</span>:</span><br><span class="line"> <span class="keyword">if</span> pb == <span class="number">-1</span>:</span><br><span class="line"> html = requests.get(url,cookies=cookies).content</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> html = requests.get(url,cookies=cookies,params=data).content</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> html = requests.get(url,cookies=cookies,params=data).content</span><br><span class="line"></span><br><span class="line"> hlist = html.split(<span class="string">'node-type=\\"feed_list_content\\"'</span>)[<span class="number">1</span>:]</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> hlist:</span><br><span class="line"> i = i.split(<span class="string">'&lt;\/div&gt;'</span>)[<span class="number">0</span>]</span><br><span class="line"> s = re.findall(<span class="string">'&gt;(.*?)&lt;'</span>,i)</span><br><span class="line"> weibo = <span class="string">''</span></span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> s:</span><br><span class="line"> weibo = weibo + j.strip(<span class="string">'\\n /\\'</span>)</span><br><span class="line"> <span class="keyword">if</span> len(weibo) != <span class="number">0</span>:</span><br><span class="line"> my_weibo.append(weibo)</span><br><span class="line"> <span class="keyword">return</span> my_weibo</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">get_follow</span><span 
class="params">(myuid,cookies)</span>:</span></span><br><span class="line"></span><br><span class="line"> <span class="string">'''Get the uid and username of every user I follow'''</span></span><br><span class="line"> url = <span class="string">'http://weibo.com/'</span> + myuid + <span class="string">'/follow'</span></span><br><span class="line"> html = requests.get(url,cookies=cookies).content</span><br><span class="line"></span><br><span class="line"> c = html.find(<span class="string">'member_ul clearfix'</span>)<span class="number">-13</span></span><br><span class="line"> html = html[c:]</span><br><span class="line"> u = re.findall(<span class="string">r'uid=([0-9]+)&amp;nick=(.*?)\\"'</span>,html)</span><br><span class="line"></span><br><span class="line"> user_id = []</span><br><span class="line"> uname = []</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> u:</span><br><span class="line"> user_id.append(i[<span class="number">0</span>]) <span class="comment"># store the uid in the list user_id</span></span><br><span class="line"> uname.append(i[<span class="number">1</span>]) <span class="comment"># store the username in the list uname</span></span><br><span class="line"> <span class="keyword">return</span> user_id,uname</span><br><span class="line"> </span><br><span class="line"><span class="comment">#- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span></span><br><span class="line">cookies = Get_cookies(username,password) <span class="comment"># get the cookies after login</span></span><br><span class="line">myuid,myname = get_myuid(cookies) <span class="comment"># get my uid and username</span></span><br><span class="line">uid,uname = get_follow(myuid,cookies) <span class="comment"># get the uid and username of the users I follow</span></span><br><span class="line"></span><br><span class="line">s = open(<span class="string">'user_weibo.txt'</span>,<span class="string">'a'</span>)</span><br><span class="line"><span class="keyword">for</span> 
i <span class="keyword">in</span> range(len(uid)):</span><br><span class="line"> my_weibo = get_weibo(uid[i],cookies,<span class="number">3</span>) <span class="comment"># fetch the first three pages of the user's posts</span></span><br><span class="line"> <span class="keyword">for</span> j <span class="keyword">in</span> my_weibo:</span><br><span class="line"> s.write(uid[i]+<span class="string">' '</span>+uname[i]+<span class="string">' '</span>+j+<span class="string">'\n'</span>)</span><br><span class="line"> <span class="keyword">print</span> str(i+<span class="number">1</span>)+<span class="string">'/'</span>+str(len(uid))</span><br><span class="line"></span><br><span class="line">s.close()</span><br><span class="line"><span class="keyword">print</span> <span class="string">'All users fetched'</span></span><br></pre></td></tr></table></figure></p>
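<p>The find-and-slice logic in get_myuid can also be expressed as one regular expression. The following is a minimal Python 3 sketch, assuming the home page source contains assignments of the form ['uid']='...'; as the string searches above imply. The helper name extract_config_value and the sample page fragment are hypothetical.</p>

```python
import re


def extract_config_value(html, key):
    """Return the quoted value assigned to ['<key>']= in the page source, or None."""
    # Matches e.g. ['uid']='1234567890' and captures the text between the quotes.
    pattern = r"\['" + re.escape(key) + r"'\]\s*=\s*['\"]([^'\"]*)['\"]"
    match = re.search(pattern, html)
    return match.group(1) if match else None


# Hypothetical fragment of the home page source, for illustration only.
sample = "$CONFIG['uid']='1234567890'; $CONFIG['nick']='example_user';"
print(extract_config_value(sample, "uid"))
print(extract_config_value(sample, "nick"))
```

<p>Returning None for a missing key makes an expired cookie easier to spot than an empty slice would.</p>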
</content>
<summary type="html">
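<p>The previous post, <a href="http://blog.csdn.net/songzhilian22/article/details/48396545">Simulating Sina Weibo login in Python: getting the cookies</a>, logged in to Sina Weibo and saved the cookies. This post uses those cookies to visit other Weibo pages and extract the information we want.</p>
<p>For reference, my setup is Python 2.7.10 (64-bit) with PyCharm as the IDE on Windows 8.1. The packages used are base64, rsa, binascii, re, and requests.</p>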
<ul>
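<li>First, I visit my own Sina Weibo home page and fetch all of my posts;</li>
<li>then I open my following page and collect the user ID and username of every user I follow;</li>
<li>finally, I visit each of those users' Weibo pages and fetch all of their posts.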
</summary>
<category term="python" scheme="http://bgods.top/tags/python/"/>
<category term="爬虫" scheme="http://bgods.top/tags/%E7%88%AC%E8%99%AB/"/>
</entry>
<entry>
<title>Simulating Sina Weibo login in Python: getting the cookies</title>
<link href="http://bgods.top/2016/06/27/python%E6%A8%A1%E6%8B%9F%E6%96%B0%E6%B5%AA%E5%BE%AE%E5%8D%9A%E7%99%BB%E9%99%86%E4%B9%8B%E8%8E%B7%E5%8F%96cookies/"/>
<id>http://bgods.top/2016/06/27/python模拟新浪微博登陆之获取cookies/</id>
<published>2016-06-27T11:19:28.000Z</published>
<updated>2016-09-16T13:05:30.000Z</updated>
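<content type="html"><p>First, thanks to <a href="http://www.cnblogs.com/mouse-coder/archive/2013/03/03/2941265.html?utm_source=tuicool" target="_blank" rel="external">敲代码的耗子</a>. I had never understood how the Sina Weibo login works until I read his article, which made the basic mechanism clear. This post mainly implements the process from that article in code.</p>
<p>Pages are fetched with requests and matched with re; base64, rsa, and binascii are also needed. Of these, only requests and rsa are third-party packages; re, base64, and binascii ship with Python. If pip is installed, run "pip install package-name" in cmd (or in a terminal on Linux). There are many other ways to install packages, which I won't cover here.</p>
<p>The implementation is actually fairly easy. The key point of the Sina Weibo login is that the username and password in the POSTed form are both processed before they are submitted.<br><a id="more"></a></p>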
<h1 id="利用网页分析工具监控"><a href="#利用网页分析工具监控" class="headerlink" title="利用网页分析工具监控"></a><strong>Monitoring the login with a web analysis tool</strong></h1><p>I monitored the login process with HttpAnalyzerStdV7 (other tools work too).<br>First, open the Sina passport page <a href="http://login.sina.com.cn/" target="_blank" rel="external">http://login.sina.com.cn/</a><br><img src="http://img.blog.csdn.net/20150912201335702" alt=""></p>
<p>Then open HttpAnalyzerStdV7 and click the button that starts monitoring.</p>
<p><img src="http://img.blog.csdn.net/20150912201936705" alt=""></p>
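<p>Finally, enter the username and password. After the login succeeds, go back to HttpAnalyzerStdV7, click "Post Data", and find the URL of the POST request. You will see the form that is submitted for the final login, as shown below:</p>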
<p><img src="http://img.blog.csdn.net/20150912203254297" alt=""></p>
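<p>The form submitted by the POST request contains the username (su, short for sinauser) and the password (sp, short for sinapassword);<br>it also contains some less obvious parameters (servertime, nonce, rsakv), all of which are required for login (thanks again to 敲代码的耗子);<br>there are other parameters as well, such as savestate, which is probably the cookie lifetime; set these as you see fit.</p>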
<h1 id="get请求验证数据"><a href="#get请求验证数据" class="headerlink" title="get请求验证数据"></a><strong>Fetching the verification data with a GET request</strong></h1><p>Before sending the POST request, the browser first sends a GET request to the URL below (username is the user name). The values of servertime, nonce, pubkey, and rsakv are then extracted from the response with regular expressions, as follows:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">url = <span class="string">'http://login.sina.com.cn/sso/prelogin.php?entry=sso&amp;callback=sinaSSOController.preloginCallBack&amp;su=%s&amp;rsakt=mod&amp;client=ssologin.js(v1.4.4)'</span> % username</span><br><span class="line"></span><br><span class="line">html = requests.get(url).content</span><br><span class="line">servertime = re.findall(<span class="string">'"servertime":(.*?),'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line">nonce = re.findall(<span class="string">'"nonce":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line">pubkey = re.findall(<span class="string">'"pubkey":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line">rsakv = re.findall(<span class="string">'"rsakv":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br></pre></td></tr></table></figure></p>
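<p>The callback parameter in the URL and the patterns above suggest that the prelogin response is a JSON object wrapped in a callback, so the four values could also be read with the json module instead of four separate regular expressions. A minimal Python 3 sketch under that assumption; the function name parse_prelogin and the sample response body are made up for illustration.</p>

```python
import json
import re


def parse_prelogin(body):
    """Extract servertime, nonce, pubkey and rsakv from a JSONP-style prelogin response."""
    # Strip the callback wrapper: keep everything between the first "(" and the last ")".
    match = re.search(r"\((.*)\)", body, re.S)
    if match is None:
        raise ValueError("no JSON payload found in prelogin response")
    payload = json.loads(match.group(1))
    return tuple(payload[name] for name in ("servertime", "nonce", "pubkey", "rsakv"))


# Made-up response body; real values come from the server.
sample = ('sinaSSOController.preloginCallBack({"retcode":0,"servertime":1442069342,'
          '"nonce":"ABC123","pubkey":"EB2A38","rsakv":"1330428213"})')
servertime, nonce, pubkey, rsakv = parse_prelogin(sample)
print(servertime, nonce, pubkey, rsakv)
```

<p>A missing field then raises a KeyError that names the field, rather than an IndexError from an empty findall result.</p>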
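<h1 id="加密过程"><a href="#加密过程" class="headerlink" title="加密过程"></a><strong>Encryption</strong></h1><p>The traffic analysis in the first section showed that the username and password in the POSTed data are both processed, so we must encode the username and encrypt the password before sending the POST request.</p>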
<p>Encode the username (Base64 is an encoding rather than encryption):<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">username = base64.b64encode(username)</span><br></pre></td></tr></table></figure></p>
<p>The password is encrypted with RSA:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">rsaPublickey = int(pubkey, <span class="number">16</span>)</span><br><span class="line">key = rsa.PublicKey(rsaPublickey, <span class="number">65537</span>) <span class="comment"># build the public key</span></span><br><span class="line">message = str(servertime) + <span class="string">'\t'</span> + str(nonce) + <span class="string">'\n'</span> + str(password) <span class="comment"># build the plaintext; the format is taken from the JS encryption file</span></span><br><span class="line">passwd = rsa.encrypt(message, key) <span class="comment"># encrypt</span></span><br><span class="line">passwd = binascii.b2a_hex(passwd) <span class="comment"># convert the ciphertext to hexadecimal</span></span><br></pre></td></tr></table></figure></p>
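<p>The snippets above are Python 2, where str is a byte string. In Python 3, base64.b64encode and rsa.encrypt both expect bytes, so the inputs need an explicit encoding step. The following is a minimal Python 3 sketch of preparing the two inputs using only the standard library; the function names and the placeholder credentials are hypothetical, and UTF-8 is an assumption.</p>

```python
import base64


def encode_username(username):
    """Base64-encode the username; this is an encoding, not encryption."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def build_plaintext(servertime, nonce, password):
    """Join servertime, nonce and password in the format used above, as bytes."""
    message = "{}\t{}\n{}".format(servertime, nonce, password)
    return message.encode("utf-8")


# Placeholder values for illustration only.
su = encode_username("user@example.com")
plaintext = build_plaintext(1442069342, "ABC123", "secret")
print(su)
print(plaintext)
```

<p>The resulting bytes would then be passed to rsa.encrypt(plaintext, key) in place of the str message.</p>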
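<h1 id="post登陆新浪通行证"><a href="#post登陆新浪通行证" class="headerlink" title="post登陆新浪通行证"></a><strong>Logging in to the Sina passport with POST</strong></h1><p>Once username has been encoded and passwd encrypted, the POST request can be sent to log in. First create a dict named data to hold the request form, fill in the key parameters username, passwd, servertime, nonce, and rsakv at their positions, and leave everything else unchanged. (To save effort, I copied data directly from the captured request.)</p>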
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"> login_url = <span class="string">'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.4)'</span></span><br><span class="line"> data = &#123;<span class="string">'entry'</span>: <span class="string">'weibo'</span>,</span><br><span class="line"> <span class="string">'gateway'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'from'</span>: <span class="string">''</span>,</span><br><span class="line"> <span class="string">'savestate'</span>: <span class="string">'7'</span>,</span><br><span class="line"> <span class="string">'userticket'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'ssosimplelogin'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'vsnf'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'vsnval'</span>: <span class="string">''</span>,</span><br><span class="line"> <span class="string">'su'</span>: username,</span><br><span class="line"> <span class="string">'service'</span>: <span class="string">'miniblog'</span>,</span><br><span class="line"> <span 
class="string">'servertime'</span>: servertime,</span><br><span class="line"> <span class="string">'nonce'</span>: nonce,</span><br><span class="line"> <span class="string">'pwencode'</span>: <span class="string">'rsa2'</span>,</span><br><span class="line"> <span class="string">'sp'</span>: passwd,</span><br><span class="line"> <span class="string">'encoding'</span>: <span class="string">'UTF-8'</span>,</span><br><span class="line"> <span class="string">'prelt'</span>: <span class="string">'115'</span>,</span><br><span class="line"> <span class="string">'rsakv'</span> : rsakv,</span><br><span class="line"> <span class="string">'url'</span>: <span class="string">'http://weibo.com/ajaxlogin.php?framelogin=1&amp;callback=parent.sinaSSOController.feedBackUrlCallBack'</span>,</span><br><span class="line"> <span class="string">'returntype'</span>: <span class="string">'META'</span></span><br><span class="line"> &#125;</span><br><span class="line"> html = requests.post(login_url,data=data).content</span><br></pre></td></tr></table></figure>
<p>Running print html afterwards gives the following result:<br><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag">&lt;<span class="name">html</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;<span class="name">head</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;<span class="name">title</span>&gt;</span>新浪通行证<span class="tag">&lt;/<span class="name">title</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;<span class="name">meta</span> <span class="attr">http-equiv</span>=<span class="string">"Content-Type"</span> <span class="attr">content</span>=<span class="string">"text/html; charset=GBK"</span> /&gt;</span></span><br><span class="line"></span><br><span class="line"><span class="tag">&lt;<span class="name">script</span> <span class="attr">charset</span>=<span class="string">"utf-8"</span> <span class="attr">src</span>=<span class="string">"http://i.sso.sina.com.cn/js/ssologin.js"</span>&gt;</span><span class="undefined"></span><span class="tag">&lt;/<span class="name">script</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;/<span class="name">head</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;<span class="name">body</span>&gt;</span></span><br><span class="line">正在登录 ...</span><br><span class="line"><span class="tag">&lt;<span class="name">script</span>&gt;</span><span class="undefined"></span><br><span 
class="line">try&#123;sinaSSOController.setCrossDomainUrlList(&#123;"retcode":0,"arrURL":["http:\/\/crosdom.weicaifu.com\/sso\/crosdom?action=login&amp;savestate=1473605342","http:\/\/passport.97973.com\/sso\/crossdomain?action=login&amp;savestate=1473605342","http:\/\/passport.weibo.cn\/sso\/crossdomain?action=login&amp;savestate=1"]&#125;);&#125;catch(e)&#123;&#125;try&#123;sinaSSOController.crossDomainAction('login',function()&#123;location.replace('http://weibo.com/ajaxlogin.php?framelogin=1&amp;callback=parent.sinaSSOController.feedBackUrlCallBack&amp;ssosavestate=1473605342&amp;ticket=ST-MzU2MjI3OTUwMQ==-1442069342-xd-E69DF7FE9517FBB10247C39F3D1C20F5&amp;retcode=0');&#125;);&#125;catch(e)&#123;&#125;</span><br><span class="line"></span><span class="tag">&lt;/<span class="name">script</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;/<span class="name">body</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;/<span class="name">html</span>&gt;</span></span><br></pre></td></tr></table></figure></p>
<p>If retcode=0 the login succeeded; retcode=101 means it failed. The page returned after a successful login contains a URL (inside the parentheses after location.replace); extract it with a regular expression:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">urlnew = re.findall(<span class="string">'location.replace\(\'(.*?)\''</span>,html,re.S)[<span class="number">0</span>]</span><br></pre></td></tr></table></figure></p>
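<p>Indexing the findall result with [0] raises an IndexError when nothing matches, and it does not look at retcode at all. The following is a minimal Python 3 sketch of a stricter variant, assuming retcode appears as a query parameter of the redirect URL as in the output shown earlier; the function name extract_redirect_url and the shortened sample pages are hypothetical.</p>

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_redirect_url(html):
    """Return the location.replace() target, or raise if it is missing or reports failure."""
    match = re.search(r"location\.replace\('(.*?)'\)", html, re.S)
    if match is None:
        raise ValueError("no location.replace() call found in login response")
    url = match.group(1)
    # retcode is a query parameter of the redirect URL; "0" means success.
    retcode = parse_qs(urlparse(url).query).get("retcode", [None])[0]
    if retcode != "0":
        raise ValueError("login failed, retcode=%s" % retcode)
    return url


# Shortened, made-up response fragments for illustration.
ok_page = "location.replace('http://weibo.com/ajaxlogin.php?framelogin=1&ticket=ST-xxx&retcode=0');"
bad_page = "location.replace('http://weibo.com/ajaxlogin.php?framelogin=1&retcode=101');"
print(extract_redirect_url(ok_page))
```

<p>Raising early keeps a failed login from surfacing later as an unexplained empty cookie jar.</p>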
<p>Then send a GET request to the extracted urlnew and save the cookies:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cookies = requests.get(urlnew).cookies</span><br></pre></td></tr></table></figure></p>
<p>At this point the cookies have been saved. They can now be used to access Sina Weibo, and while they remain valid there is no need to log in with the username and password again.</p>
<p>=================================================<br><strong>The complete code follows. I wrapped it in a function: call cookies = Get_cookies(), which prompts for the username and password; the resulting cookies are stored in the cookies variable.</strong><br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="comment">#-*- encoding:utf-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> base64</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">import</span> re</span><br><span class="line"><span class="keyword">import</span> rsa</span><br><span class="line"><span class="keyword">import</span> binascii</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">Get_cookies</span><span class="params">()</span>:</span></span><br><span class="line"> <span class="string">'''Log in to Sina Weibo and return the cookies obtained after login'''</span></span><br><span class="line"> username = raw_input(<span class="string">u'Username: '</span>)</span><br><span class="line"> password = raw_input(<span class="string">u'Password: '</span>)</span><br><span class="line"></span><br><span class="line"> url = <span class="string">'http://login.sina.com.cn/sso/prelogin.php?entry=sso&amp;callback=sinaSSOController.preloginCallBack&amp;su=%s&amp;rsakt=mod&amp;client=ssologin.js(v1.4.4)'</span> % username</span><br><span class="line"> html = requests.get(url).content</span><br><span class="line"></span><br><span class="line"> servertime = re.findall(<span class="string">'"servertime":(.*?),'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"> nonce = re.findall(<span class="string">'"nonce":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"> pubkey = re.findall(<span class="string">'"pubkey":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"> rsakv = re.findall(<span class="string">'"rsakv":"(.*?)"'</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line"> username = base64.b64encode(username) <span 
class="comment"># encode the username</span></span><br><span class="line"> rsaPublickey = int(pubkey, <span class="number">16</span>)</span><br><span class="line"> key = rsa.PublicKey(rsaPublickey, <span class="number">65537</span>) <span class="comment"># build the public key</span></span><br><span class="line"> message = str(servertime) + <span class="string">'\t'</span> + str(nonce) + <span class="string">'\n'</span> + str(password) <span class="comment"># build the plaintext; the format is taken from the JS encryption file</span></span><br><span class="line"> passwd = rsa.encrypt(message, key) <span class="comment"># encrypt</span></span><br><span class="line"> passwd = binascii.b2a_hex(passwd) <span class="comment"># convert the ciphertext to hexadecimal</span></span><br><span class="line"></span><br><span class="line"> login_url = <span class="string">'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.4)'</span></span><br><span class="line"> data = &#123;<span class="string">'entry'</span>: <span class="string">'weibo'</span>,</span><br><span class="line"> <span class="string">'gateway'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'from'</span>: <span class="string">''</span>,</span><br><span class="line"> <span class="string">'savestate'</span>: <span class="string">'7'</span>,</span><br><span class="line"> <span class="string">'userticket'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'ssosimplelogin'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'vsnf'</span>: <span class="string">'1'</span>,</span><br><span class="line"> <span class="string">'vsnval'</span>: <span class="string">''</span>,</span><br><span class="line"> <span class="string">'su'</span>: username,</span><br><span class="line"> <span class="string">'service'</span>: <span class="string">'miniblog'</span>,</span><br><span class="line"> <span class="string">'servertime'</span>: servertime,</span><br><span class="line"> <span class="string">'nonce'</span>: 
nonce,</span><br><span class="line"> <span class="string">'pwencode'</span>: <span class="string">'rsa2'</span>,</span><br><span class="line"> <span class="string">'sp'</span>: passwd,</span><br><span class="line"> <span class="string">'encoding'</span>: <span class="string">'UTF-8'</span>,</span><br><span class="line"> <span class="string">'prelt'</span>: <span class="string">'115'</span>,</span><br><span class="line"> <span class="string">'rsakv'</span> : rsakv,</span><br><span class="line"> <span class="string">'url'</span>: <span class="string">'http://weibo.com/ajaxlogin.php?framelogin=1&amp;callback=parent.sinaSSOController.feedBackUrlCallBack'</span>,</span><br><span class="line"> <span class="string">'returntype'</span>: <span class="string">'META'</span></span><br><span class="line"> &#125;</span><br><span class="line"> html = requests.post(login_url,data=data).content</span><br><span class="line"> urlnew = re.findall(<span class="string">'location.replace\(\'(.*?)\''</span>,html,re.S)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line"> <span class="comment"># send the GET request and save the cookies</span></span><br><span class="line"> cookies = requests.get(urlnew).cookies</span><br><span class="line"> <span class="keyword">return</span> cookies</span><br></pre></td></tr></table></figure></p>
</content>
<summary type="html">
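<p>First, thanks to <a href="http://www.cnblogs.com/mouse-coder/archive/2013/03/03/2941265.html?utm_source=tuicool">敲代码的耗子</a>. I had never understood how the Sina Weibo login works until I read his article, which made the basic mechanism clear. This post mainly implements the process from that article in code.</p>
<p>Pages are fetched with requests and matched with re; base64, rsa, and binascii are also needed. Of these, only requests and rsa are third-party packages; re, base64, and binascii ship with Python. If pip is installed, run "pip install package-name" in cmd (or in a terminal on Linux). There are many other ways to install packages, which I won't cover here.</p>
<p>The implementation is actually fairly easy. The key point of the Sina Weibo login is that the username and password in the POSTed form are both processed before they are submitted.<br>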
</summary>
<category term="python" scheme="http://bgods.top/tags/python/"/>
<category term="爬虫" scheme="http://bgods.top/tags/%E7%88%AC%E8%99%AB/"/>
</entry>
<entry>
<title>About Blog</title>
<link href="http://bgods.top/2016/06/27/About-Blog/"/>
<id>http://bgods.top/2016/06/27/About-Blog/</id>
<published>2016-06-27T10:18:44.000Z</published>
<updated>2016-09-16T11:59:06.000Z</updated>
<content type="html"><p>This blog is built with <a href="https://hexo.io/" target="_blank" rel="external">Hexo</a>, uses the <a href="https://github.com/MOxFIVE/hexo-theme-yelee" target="_blank" rel="external">yelee</a> theme, and is hosted on <a href="https://github.com/Bgods/Bgods.github.io" target="_blank" rel="external">github</a>.</p>
<a id="more"></a>
</content>
<summary type="html">
<p>This blog is built with <a href="https://hexo.io/">Hexo</a>, uses the <a href="https://github.com/MOxFIVE/hexo-theme-yelee">yelee</a> theme, and is hosted on <a href="https://github.com/Bgods/Bgods.github.io">github</a>.</p>
</summary>
<category term="hexo" scheme="http://bgods.top/tags/hexo/"/>
</entry>
<entry>
<title>Passwordless SSH login</title>
<link href="http://bgods.top/2016/06/27/ssh%E6%97%A0%E5%AF%86%E7%A0%81%E7%99%BB%E9%99%86/"/>
<id>http://bgods.top/2016/06/27/ssh无密码登陆/</id>
<published>2016-06-27T10:17:13.000Z</published>
<updated>2016-09-16T12:25:15.000Z</updated>
<content type="html"><h1 id="修改主机(非必要)"><a href="#修改主机(非必要)" class="headerlink" title="修改主机(非必要)"></a><strong>Changing the hostname (optional)</strong></h1><p>For convenience later on, change the hostname: on CentOS edit "/etc/sysconfig/network", on Ubuntu edit "/etc/hostname".<br><a id="more"></a><br>For example, I have two hosts here, named "m" and "s":<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">HOSTNAME=m</span><br></pre></td></tr></table></figure></p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">HOSTNAME=s</span><br></pre></td></tr></table></figure>
<p>Edit /etc/hosts, add the following address mappings, and save:</p>
<p><img src="http://img.blog.csdn.net/20151018221219080?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt=""></p>
<h1 id="配置ssh无密码登陆"><a href="#配置ssh无密码登陆" class="headerlink" title="配置ssh无密码登陆"></a><strong>Configuring passwordless SSH login</strong></h1><p>Switch back to the hadoop (regular) user. The order of the following steps matters:</p>
<ol>
<li>On both s and m, run "ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa" to generate a key pair;</li>
<li>On m, run "cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys" to append m's public key (id_dsa.pub) to the authorized_keys file;</li>
<li>On m, run "scp ~/.ssh/authorized_keys 192.168.128.101:~/.ssh/" to copy the authorized_keys file to s;</li>
<li>On s, run "cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys" to append s's public key (id_dsa.pub) to the authorized_keys file;</li>
<li>On s, run "scp ~/.ssh/authorized_keys 192.168.128.100:~/.ssh/" to copy the authorized_keys file back to m;</li>
</ol>
<p>Finally, verify the setup.</p>
<p><img src="http://img.blog.csdn.net/20151018224618054" alt=""></p>
<p><strong>How mutual passwordless SSH login works:</strong></p>
<ul>
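<li>each machine generates its own public key (id_dsa.pub);</li>
<li>the public keys of all machines are then added to one shared authorized_keys file;</li>
<li>finally, that file is placed at ~/.ssh/authorized_keys on every machine.</li>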
</ul>
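<p>The principle above amounts to building one authorized_keys file that holds every machine's public key. The following is a minimal Python 3 sketch of that merge step, working on strings rather than real files; the function name merge_authorized_keys and the shortened key strings are hypothetical.</p>

```python
def merge_authorized_keys(existing, public_keys):
    """Append each public key to the authorized_keys text unless it is already present."""
    lines = [line.strip() for line in existing.splitlines() if line.strip()]
    for key in public_keys:
        key = key.strip()
        if key and key not in lines:
            lines.append(key)
    return "\n".join(lines) + "\n" if lines else ""


# Shortened placeholder keys; real id_dsa.pub lines are much longer.
key_m = "ssh-dss AAAAB3NzaC1kc3MAAA... hadoop@m"
key_s = "ssh-dss AAAAB3NzaC1kc3MAAA... hadoop@s"
merged = merge_authorized_keys(key_m + "\n", [key_m, key_s])
print(merged, end="")
```

<p>Skipping keys that are already present keeps the file from growing when the steps are repeated, which a plain "cat &gt;&gt;" does not do.</p>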
</content>
<summary type="html">
<h1 id="修改主机(非必要)"><a href="#修改主机(非必要)" class="headerlink" title="修改主机(非必要)"></a><strong>Changing the hostname (optional)</strong></h1><p>For convenience later on, change the hostname: on CentOS edit "/etc/sysconfig/network", on Ubuntu edit "/etc/hostname".<br>
</summary>
<category term="ssh" scheme="http://bgods.top/tags/ssh/"/>
</entry>
<entry>
<title>Installing and configuring MySQL on CentOS</title>
<link href="http://bgods.top/2016/06/27/centos-mysql-%E5%AE%89%E8%A3%85%E5%8F%8A%E9%85%8D%E7%BD%AE/"/>
<id>http://bgods.top/2016/06/27/centos-mysql-安装及配置/</id>
<published>2016-06-27T10:17:07.000Z</published>
<updated>2016-09-16T13:04:59.000Z</updated>
<content type="html"><p>Installing and configuring MySQL on CentOS<br><a id="more"></a></p>
<h1 id="安装Mysql"><a href="#安装Mysql" class="headerlink" title="安装Mysql"></a><strong>Installing MySQL</strong></h1><ol>
<li><p>List the available mysql-server packages</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">yum list mysql-server</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151108135153826" alt=""></p>
</li>
<li><p>Install the version that matches your system</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">yum install mysql-server.x86_64</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151108140221404" alt=""><br><img src="http://img.blog.csdn.net/20151108140727499" alt=""></p>
</li>
</ol>
<h1 id="设置Mysql的服务"><a href="#设置Mysql的服务" class="headerlink" title="设置Mysql的服务"></a><strong>Setting up the MySQL service</strong></h1><ol>
<li><p>Start the MySQL service</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">service mysqld start</span><br></pre></td></tr></table></figure>
</li>
<li><p>Enable MySQL at boot</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">chkconfig mysqld on</span><br></pre></td></tr></table></figure>
</li>
<li><p>Open port 3306 and save the rule</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">/sbin/iptables -I INPUT -p tcp --dport 3306 -j ACCEPT</span><br><span class="line">/etc/rc.d/init.d/iptables save</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151108141137159" alt=""></p>
</li>
</ol>
<h1 id="修改密码并设置远程访问"><a href="#修改密码并设置远程访问" class="headerlink" title="修改密码并设置远程访问"></a><strong>Changing the password and enabling remote access</strong></h1><ol>
<li><p>Connect to the MySQL server</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mysql -uroot -p</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151108141316181" alt=""></p>
</li>
<li><p>Set the password</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"># select the mysql database</span><br><span class="line">use mysql;</span><br><span class="line"></span><br><span class="line"># set the root user's password to root</span><br><span class="line">update user set password=password(&apos;root&apos;) where user=&apos;root&apos;;</span><br><span class="line"></span><br><span class="line"># force MySQL to reload the privileges so the change takes effect immediately</span><br><span class="line">flush privileges;</span><br></pre></td></tr></table></figure>
</li>
<li><p>Enable remote access to MySQL. The statement below sets the username (root) and password (123456) required for remote login.</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">grant all privileges on *.* to &apos;root&apos;@&apos;%&apos; identified by &apos;123456&apos; with grant option;</span><br></pre></td></tr></table></figure>
</li>
</ol>
<h1 id="解决Mysql乱码问题"><a href="#解决Mysql乱码问题" class="headerlink" title="解决Mysql乱码问题"></a><strong>Fixing garbled characters in MySQL</strong></h1><ol>
<li><p>Locate the sample configuration file and copy it to /etc/ as my.cnf</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cp /usr/share/doc/mysql-server-5.1.73/my-medium.cnf /etc/my.cnf</span><br></pre></td></tr></table></figure>
<p><img src="http://img.blog.csdn.net/20151108142422765" alt=""></p>
</li>
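<li><p>Add "default-character-set=utf8" under both [client] and [mysqld]. This matches the MySQL 5.1 used here; from MySQL 5.5 on, the server no longer accepts this option under [mysqld], where character-set-server=utf8 is used instead.</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">vim /etc/my.cnf</span><br></pre></td></tr></table></figure>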
<p><img src="http://img.blog.csdn.net/20151108142545559" alt=""></p>
</li>
<li><p>Finally, restart the service</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">service mysqld restart</span><br></pre></td></tr></table></figure>
</li>
</ol>
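<p>The edit in step 2 can also be scripted. The following is only a minimal sketch in Python 3, assuming a simple option file: SAMPLE_CNF and add_utf8_charset are hypothetical names used for illustration, and configparser drops comments and cannot handle duplicate keys or !include lines, so back up the real /etc/my.cnf before trying anything like this.</p>

```python
import configparser
import io

# Hypothetical, shortened stand-in for /etc/my.cnf (assumption for illustration).
SAMPLE_CNF = """\
[client]
port = 3306

[mysqld]
port = 3306
skip-external-locking
"""


def add_utf8_charset(cnf_text):
    """Return cnf_text with default-character-set=utf8 under [client] and [mysqld]."""
    # allow_no_value accepts bare flags such as skip-external-locking;
    # interpolation=None keeps any literal % characters untouched.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    for section in ("client", "mysqld"):
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, "default-character-set", "utf8")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


print(add_utf8_charset(SAMPLE_CNF))
```

<p>The function works on text rather than on the file itself, so the result can be inspected before it is written anywhere.</p>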
</content>
<summary type="html">
<p> Installing and configuring MySQL on CentOS<br>
</summary>
<category term="mysql" scheme="http://bgods.top/tags/mysql/"/>
<category term="centos" scheme="http://bgods.top/tags/centos/"/>
</entry>
<entry>
<title>60 Practical R Tips (Repost)</title>
<link href="http://bgods.top/2016/06/27/60-%E4%B8%AA%E5%AE%9E%E7%94%A8%E7%9A%84-R-%E8%AF%AD%E8%A8%80%E6%8A%80%E5%B7%A7%EF%BC%88%E8%BD%AC%E8%BD%BD%EF%BC%89/"/>
<id>http://bgods.top/2016/06/27/60-个实用的-R-语言技巧(转载)/</id>
<published>2016-06-27T10:15:35.000Z</published>
<updated>2016-09-16T14:46:03.000Z</updated>
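<content type="html"><p> This article is based on 60 R Tips from Rstatistics.net, a collection of tips and suggestions its authors accumulated over years of using R. I found the material useful, and it is not something you will find in books, so I reposted and translated it. The main addition is the examples, because the text alone can be hard to follow.</p>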
<a id="more"></a>
<p><font color="red" size="4"><strong>Reposted from:</strong></font> 60 Practical R Tips | EthanDeng <a href="http://ddswhu.com/2015/09/07/60-r-tips/" target="_blank" rel="external">http://ddswhu.com/2015/09/07/60-r-tips/</a></p>
<p>When reposting, please cite the original address above.</p>
<hr>
<ol>
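<li><p>When converting a factor to a numeric variable, never call as.numeric() on it directly, because that returns the internal level codes; the correct form is as.numeric(as.character(myFactorVar)).</p>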
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">cha2fac &lt;- as.factor(c(<span class="string">"4"</span>,<span class="string">"8"</span>,<span class="string">"10"</span>,<span class="string">"15"</span>))</span><br><span class="line">as.numeric(cha2fac)</span><br><span class="line">as.numeric(as.character(cha2fac))</span><br></pre></td></tr></table></figure>
</li>
<li><p>Setting options(show.error.messages = F) suppresses error messages.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">class(x) <span class="comment"># error msg: Error: object 'x' not found</span></span><br><span class="line">options(show.error.messages = <span class="literal">F</span>)</span><br><span class="line">class(x)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Use file.path() to build file paths; this keeps the code portable across operating systems.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">setwd(file.path(<span class="string">"F:"</span>, <span class="string">"git"</span>, <span class="string">"roxygen2"</span>))</span><br></pre></td></tr></table></figure>
</li>
<li><p>When sorting strings that contain numbers, mixedsort() from the gtools package orders the embedded numbers numerically, so its result differs from that of sort().</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">Treatment &lt;- c(<span class="string">"Control"</span>, <span class="string">"Asprin 10mg/day"</span>, <span class="string">"Asprin 50mg/day"</span>, <span class="string">"Asprin 100mg/day"</span>, <span class="string">"Acetomycin 100mg/day"</span>, <span class="string">"Acetomycin 1000mg/day"</span>)</span><br><span class="line">sort(Treatment)</span><br><span class="line"><span class="keyword">require</span>(gtools)</span><br><span class="line">mixedsort(Treatment)</span><br></pre></td></tr></table></figure>
</li>
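<li><p>When plotting, ylim adjusts the range of the Y axis. Note that ylim = range(myNumericData) + 10 shifts both limits up by 10; to widen the range, add an interval such as range(myNumericData) + c(-10, 10) or scale it by a factor, as in the example.</p>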
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- seq(<span class="number">1</span>:<span class="number">10</span>)</span><br><span class="line">set.seed(<span class="number">1101</span>)</span><br><span class="line">y &lt;- <span class="number">10</span>*rnorm(<span class="number">10</span>)</span><br><span class="line">plot(x, y)</span><br><span class="line">plot(x, y, ylim = <span class="number">1.25</span>*range(y))</span><br></pre></td></tr></table></figure>
</li>
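<li><p>When plotting with plot(), the las parameter sets the orientation of the axis labels (the numbers). las takes the values {0,1,2,3}, meaning {parallel to the axis (default), always horizontal (a good choice), perpendicular to the axis, always vertical}.</p>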
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">plot(x, y, las = <span class="number">1</span>)</span><br><span class="line">plot(x, y, las = <span class="number">2</span>)</span><br></pre></td></tr></table></figure>
</li>
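<li><p>Use memory.limit(size=2500) to set the limit (in MB) on the memory R may use. The function is Windows-only and is no longer supported as of R 4.2.</p>
</li>
<li><p>alarm() can be added at the end of a function or procedure to signal that the work has finished. (Note: it has no effect in RStudio.)</p>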
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:<span class="number">5</span>) &#123;</span><br><span class="line">Sys.sleep(<span class="number">1</span>)</span><br><span class="line">alarm()</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
</li>
<li><p>eval(parse(text=paste("a &lt;- 10"))) creates a vector a and assigns it the value 10. This construct runs a string as an R command.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">eval(parse(text=paste(<span class="string">"a &lt;- 10"</span>)))</span><br><span class="line">a</span><br></pre></td></tr></table></figure>
</li>
<li><p>sessionInfo() reports the R version, the environment, and the packages that are loaded.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sessionInfo()</span><br></pre></td></tr></table></figure>
</li>
<li><p>The number of edits needed to turn word1 into word2 (the edit distance) can be computed with adist(word1, word2).</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">adist(<span class="string">"hello world"</span>,<span class="string">"hello wordx"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Setting options(max.print=1000000) raises the limit on how much output is printed to the console.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">options(max.print=<span class="number">1000000</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Run R code from cmd: "C:\your-R-path\R.exe" CMD BATCH --vanilla --slave</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">"c:\project-path\my_script.R"</span> <span class="comment"># (useful for scripts that run automatically, such as batch jobs)</span></span><br></pre></td></tr></table></figure>
</li>
<li><p>When several R sessions are running, Sys.getpid() returns the unique process ID of each one.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">Sys.getpid()</span><br></pre></td></tr></table></figure>
</li>
<li><p>unname() removes the names attribute from an R object.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">y &lt;- quantile(mtcars$mpg)</span><br><span class="line">unname(y)</span><br></pre></td></tr></table></figure>
</li>
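<li><p>To test whether two objects x and y are exactly the same, use identical(x, y); all.equal(x, y) tests for near-equality (within a numeric tolerance) and reports the differences it finds, including differing attributes.</p>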
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- c(<span class="number">1</span>, <span class="number">2</span>)</span><br><span class="line">y &lt;- as.vector(x)</span><br><span class="line">identical(x, y)</span><br><span class="line">all.equal(x, y)</span><br><span class="line">y2 &lt;- c(y, <span class="number">3</span>)</span><br><span class="line">all.equal(x, y2)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Fetching Twitter tweets with R (for text analysis): <a href="http://rstatistics.net/extracting-tweets-with-r/" target="_blank" rel="external">http://rstatistics.net/extracting-tweets-with-r/</a></p>
</li>
<li>A short introduction to time series analysis: <a href="http://rstatistics.net/time-series-analysis/" target="_blank" rel="external">http://rstatistics.net/time-series-analysis/</a></li>
<li>When a step runs for too long (beyond a preset time limit), withTimeout() from the R.utils package can interrupt it so that execution continues with the next step.</li>
<li><p>dist() computes the distances between the rows of a matrix (Euclidean distance by default).</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- matrix(seq(<span class="number">1</span>:<span class="number">20</span>), ncol = <span class="number">4</span>, byrow = <span class="literal">FALSE</span>)</span><br><span class="line">dist(x, method = <span class="string">"euclidean"</span>, upper = <span class="literal">TRUE</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>diff() computes lagged (and iterated) differences of a vector.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- c(seq(<span class="number">1</span>:<span class="number">5</span>), seq(from = <span class="number">1</span>, to = <span class="number">9</span>, by = <span class="number">2</span>))</span><br><span class="line">diff(x, <span class="number">2</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>Setting options(scipen=999) turns off scientific notation when numbers are printed.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="number">1e-5</span></span><br><span class="line">options(scipen=<span class="number">999</span>)</span><br><span class="line"><span class="number">1e-5</span></span><br></pre></td></tr></table></figure>
</li>
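<li><p>bagEarth() from the caret package (which builds on earth) fits bagged MARS (multivariate adaptive regression splines) models.</p>
</li>
<li>setClass('myClass') defines a class named myClass; setAs() allows further customization, such as coercion methods.</li>
<li><p>To create many variables, use assign("varName", 10): because the name is passed as a string, it can be generated programmatically (for example in a loop).</p>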
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">assign(<span class="string">"x"</span>, <span class="number">10</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>dim(matrix) returns the number of rows and columns of a matrix.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">my.Matrix &lt;- matrix(<span class="number">1</span>:<span class="number">20</span>, ncol = <span class="number">4</span>)</span><br><span class="line">dim(my.Matrix)</span><br></pre></td></tr></table></figure>
</li>
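<li><p>Two tricks for writing functions: 1. Use ... to pass arguments through to an existing function. 2. Use invisible() to hide the output.<br>Video: <a href="https://www.youtube.com/watch?v=ahRHTXNjixU" target="_blank" rel="external">https://www.youtube.com/watch?v=ahRHTXNjixU</a></p>
</li>
<li>data.matrix() converts a data frame to a numeric matrix and converts factor columns correctly as well.</li>
<li>invisible(..) suppresses printing of a result; it is often used when defining functions.</li>
<li><p>cat("\014") clears the R console (like pressing CTRL + L, and quite handy).</p>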
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cat(<span class="string">"\014"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>dir("folder.path") lists the contents of a folder, much like dir in cmd.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">dir()</span><br><span class="line">dir(<span class="string">"subfolder.path"</span>)</span><br></pre></td></tr></table></figure>
</li>
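<li><p>If a factor variable contains missing values, it is advisable to turn them into a level of their own, UNKNOWN. First add the level with levels(Var) &lt;- c(levels(Var), "UNKNOWN"), then assign it to the missing entries with Var[is.na(Var)] &lt;- "UNKNOWN".</p>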
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">my.Factor &lt;- as.factor(c(<span class="string">"First"</span>, <span class="string">"Second"</span>, <span class="string">"Third"</span>, <span class="literal">NA</span>))</span><br><span class="line">levels(my.Factor) &lt;-c(levels(my.Factor), <span class="string">"UNKNOWN"</span>)</span><br><span class="line">my.Factor</span><br></pre></td></tr></table></figure>
</li>
<li><p>Load several packages at once with lapply(x, require, character.only = T), where x is a character vector of package names.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">lapply(c(<span class="string">"dplyr"</span>, <span class="string">"tidyr"</span>), <span class="keyword">require</span>, character.only = <span class="literal">T</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>rev() reverses a vector.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- seq(<span class="number">1</span>:<span class="number">10</span>)</span><br><span class="line">rev(x)</span><br></pre></td></tr></table></figure>
</li>
<li><p>complete.cases(), as the name suggests, returns an index of the complete observations (those without missing values); it is used to drop the rows of a data frame that contain missing values.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">nrow(mtcars)</span><br><span class="line">mtcars$mpg[mtcars$disp &gt; <span class="number">200</span>] &lt;- <span class="literal">NA</span></span><br><span class="line">mtcars</span><br><span class="line">mtcars2 &lt;- mtcars[complete.cases(mtcars), ]</span><br><span class="line">mtcars2</span><br></pre></td></tr></table></figure>
</li>
<li><p>avNNet() from the caret package (which builds on nnet) fits averaged neural network models.</p>
</li>
<li><p>file.remove('filepath') deletes a file from a folder; it is handy for cleaning up intermediate files that are generated repeatedly.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">file.create(<span class="string">"tempfile.R"</span>)</span><br><span class="line">file.remove(<span class="string">"tempfile.R"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>ada() from the ada package fits boosted classification trees.</p>
</li>
<li><p>unclass() breaks an lm object down into a plain list, which makes it easy to reach elements that are not printed by default.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">mod &lt;- lm(wt ~ disp + cyl, data = mtcars)</span><br><span class="line">unclass(mod)</span><br></pre></td></tr></table></figure>
</li>
<li><p>To sort a data frame (df) by two columns, use df[order(df$col1, df$col2), ]</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">mtcars</span><br><span class="line">mtcars[order(mtcars$carb, mtcars$hp), ]</span><br></pre></td></tr></table></figure>
</li>
<li><p>The simplest way to turn a factor with N levels into N 0-1 variables is model.matrix(~as.factor(Data)+0)</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">model.matrix(~as.factor(mtcars$carb)+<span class="number">0</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>To remove the seasonal component from a time series, use seasadj() from the forecast package: <a href="http://goo.gl/Oio7s2" target="_blank" rel="external">http://goo.gl/Oio7s2</a>.</p>
</li>
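<li>To assign to a variable outside a function from inside that function, use &lt;&lt;- rather than &lt;-.</li>
<li>On Windows, memory.limit(size=desired-size) sets the limit on the memory R may use; on other operating systems, use mem.limits(). (See also tip 7.)</li>
<li>file.copy(from=fromFile, to = toFile, overwrite = TRUE) copies a file.</li>
<li>debugonce() debugs a function for a single call; unlike debug(), it does not need undebug() to leave debugging mode.</li>
<li>In R, a factor variable can be turned into a set of 0/1 dummy variables with bins &lt;- model.matrix(~ 0 + varName, data); this is often needed in regression. (Same as tip 41.)</li>
<li>discretize() from the arules package conveniently converts a continuous variable into a categorical one.</li>
<li>NROW() is similar to nrow() but also works on vectors, which makes it more robust than length().</li>
<li>commandArgs() returns the command-line arguments that were passed when the R script was started from cmd.</li>
<li>Setting attr(myFunc, "AttrName") &lt;- myVal inside a function makes the AttrName attribute available the next time myFunc is called.</li>
<li>object.size() reports the memory used by a given R object.</li>
<li>In larger R projects, ls.str() shows the structure of the R objects in the workspace.</li>
<li>dir(path='dir_path') lists all files and folders under dir_path.</li>
<li><p>library(help = libname) lists all functions and bundled data sets of the package libname. (The package must be installed.)</p>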
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">install.packages(<span class="string">"AER"</span>)</span><br><span class="line"><span class="keyword">library</span>(help = AER)</span><br></pre></td></tr></table></figure>
</li>
<li><p>get("objectNameString") retrieves the object whose name is objectNameString. If the object lives in a particular environment, pass the envir argument.</p>
<figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">x &lt;- c(<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>)</span><br><span class="line">get(<span class="string">"x"</span>)</span><br></pre></td></tr></table></figure>
</li>
<li><p>cor.test(x,y) estimates the correlation between x and y and tests its significance.</p>
</li>
<li>For interactive data presentation, Shiny is a good choice; here is a Shiny cheat sheet: <a href="http://bit.ly/1pFWGJW" target="_blank" rel="external">http://bit.ly/1pFWGJW</a></li>
</ol>
<p><font color="red"><strong>Reposted from:</strong></font> <a href="http://ddswhu.com/2015/09/07/60-r-tips/" target="_blank" rel="external">60 Practical R Tips</a></p>
</content>
<summary type="html">
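<p> This article is based on 60 R Tips from Rstatistics.net, a collection of tips and suggestions its authors accumulated over years of using R. I found the material useful, and it is not something you will find in books, so I reposted and translated it. The main addition is the examples, because the text alone can be hard to follow.</p>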
</summary>
<category term="R" scheme="http://bgods.top/tags/R/"/>
</entry>
<entry>
<title>First Look at Scrapy: Hands-On (Part 1)</title>
<link href="http://bgods.top/2016/06/27/%E5%88%9D%E8%AF%86Scrapy-%E5%AE%9E%E6%88%98%EF%BC%88%E4%B8%80%EF%BC%89/"/>
<id>http://bgods.top/2016/06/27/初识Scrapy-实战(一)/</id>
<published>2016-06-27T10:07:24.000Z</published>
<updated>2016-09-16T13:31:40.000Z</updated>
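<content type="html"><p> I have been working with web crawlers for a while. At first I always fetched data with the requests library and never used a crawler framework. Scrapy was something I had only been curious about, but over the past two days I read the <a href="http://scrapy-chs.readthedocs.org/zh_CN/1.0/intro/tutorial.html" target="_blank" rel="external">scrapy</a> documentation and tried crawling some data with it, and found it quite convenient.<br><a id="more"></a><br> The example below crawls the sales index from <a href="http://index.bitauto.com/xiaoliang/" target="_blank" rel="external">Yiche (易车网)</a>.</p>
<p>The fields to crawl are:</p>
<ul>
<li>Date (year and month);</li>
<li>Category (small, mini, mid-size, compact, mid-to-large, SUV, MPV);</li>
<li>Model (second-level category);</li>
<li>Sales volume.</li>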
</ul>
<hr>
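<h1 id="分析网站结构"><a href="#分析网站结构" class="headerlink" title="分析网站结构"></a><strong>Analyzing the Site Structure</strong></h1><p> Start by analyzing the site structure: how pagination works, the different categories, how the data is loaded (aspx or html), the request type (POST or GET), and so on.</p>
<h2 id="静态方式"><a href="#静态方式" class="headerlink" title="静态方式"></a><strong>Static Approach</strong></h2><ol>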
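<li>Clicking the pagination buttons changes the URL. For example, the second page for compact cars in March 2016 is <a href="http://index.bitauto.com/xiaoliang/jincouxingche/2016m3/2/" target="_blank" rel="external">http://index.bitauto.com/xiaoliang/<font color="red">jincouxingche/2016m3/2/</font></a>;</li>
<li>In the red part of the URL above, 2016 is the year, 3 is the month, 2 is the page number, and jincouxingche denotes compact cars;</li>
<li>To query quarterly data, just change the m in the URL to s; for example, "2016s2" queries the data for the second quarter of 2016;</li>
<li>So changing the year, month, page number and category in the URL is enough to request different data.</li>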
</ol>
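<p>The URL scheme described above can be sketched as a small helper. This is only an illustration in Python 3: build_url and its parameter names are hypothetical, and it assumes that quarterly URLs accept a page segment in the same position as monthly ones, which the text does not confirm.</p>

```python
BASE_URL = "http://index.bitauto.com/xiaoliang"


def build_url(level, year, period, page=1, quarterly=False):
    """Build a sales-index listing URL.

    level  -- category slug, e.g. "jincouxingche" (compact cars)
    period -- month (1-12), or quarter (1-4) when quarterly=True
    page   -- page number within the listing
    """
    marker = "s" if quarterly else "m"
    return "{}/{}/{}{}{}/{}/".format(BASE_URL, level, year, marker, period, page)


# Second page for compact cars in March 2016, and Q2 2016 (first page).
print(build_url("jincouxingche", 2016, 3, page=2))
print(build_url("jincouxingche", 2016, 2, quarterly=True))
```

<p>Keeping the URL logic in one function makes it easy to generate the start URLs for every category, year and month in a loop.</p>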
<h2 id="动态方式"><a href="#动态方式" class="headerlink" title="动态方式"></a><strong>Dynamic Approach</strong></h2><p> Digging further, we find that clicking the date-switch button on the page (implemented in JavaScript) does not change the URL; the data comes back from an aspx page.<br>A traffic-capture tool (I used the one built into Firefox) shows which URL is requested and what data is submitted.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># URL </span></span><br><span class="line">URL = <span class="string">"http://index.bitauto.com/Interface/GetData.aspx?"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Submitted parameters</span></span><br><span class="line">data = &#123;</span><br><span class="line">    <span class="string">"indexType"</span>: <span class="string">"xiaoliang"</span>,</span><br><span class="line">    <span class="string">"brandType"</span>: <span class="string">"level"</span>,</span><br><span class="line">    <span class="string">"itemID"</span>: <span class="string">"0"</span>,</span><br><span class="line">    <span class="string">"dateType"</span>: <span class="string">"m"</span>, <span class="comment"># date type (m = month, s = quarter)</span></span><br><span class="line">    <span class="string">"dateValue"</span>: <span class="string">"1"</span>, <span class="comment"># month (1-12) or quarter (1-4)</span></span><br><span class="line">    <span class="string">"cityID"</span>: <span class="string">"0"</span>, <span class="comment"># city code; 0 means nationwide</span></span><br><span class="line">    <span class="string">"dateYear"</span>: <span class="string">"2016"</span>, <span class="comment"># year</span></span><br><span class="line">    <span class="string">"pageBlock"</span>: <span class="string">"indexListMore"</span>,</span><br><span class="line">    <span class="string">"levelSpell"</span>: <span class="string">"jincouxingche"</span>, <span class="comment"># vehicle category</span></span><br><span class="line">    <span class="string">"0.418664796167437"</span>: <span class="string">""</span></span><br><span class="line">&#125;</span><br><span class="line"><span class="comment"># Changing dateType, dateValue, dateYear and levelSpell in data requests different data.</span></span><br></pre></td></tr></table></figure></p>
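<p>The city codes can also be obtained by capturing traffic; <a href="http://api.admin.bitauto.com/city/getcity.ashx?callback=City_Select._$JSON_callback.$JSON&amp;requesttype=json&amp;bizCity=1" target="_blank" rel="external">this is the address I found with the capture tool</a>.</p>
<p> This approach can query the data along more dimensions, but it has one problem: so far I have not found which parameter controls pagination. The first (static) approach is therefore the one used here.</p>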
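<p>For reference, the request of the dynamic approach could be assembled as follows. This is a sketch in Python 3 under two assumptions: that the parameters travel as a GET query string (the trailing ? in the captured URL suggests so, but the capture does not prove it), and that the numeric key 0.418664796167437 is a random cache-buster that can be left out. build_query is a hypothetical helper name.</p>

```python
from urllib.parse import urlencode

ENDPOINT = "http://index.bitauto.com/Interface/GetData.aspx"


def build_query(level, year, value, date_type="m", city_id="0"):
    """Return the full request URL for one category and period.

    date_type -- "m" for a month (value 1-12), "s" for a quarter (value 1-4)
    city_id   -- city code as a string; "0" means nationwide
    """
    params = [
        ("indexType", "xiaoliang"),
        ("brandType", "level"),
        ("itemID", "0"),
        ("dateType", date_type),
        ("dateValue", str(value)),
        ("cityID", city_id),
        ("dateYear", str(year)),
        ("pageBlock", "indexListMore"),
        ("levelSpell", level),
    ]
    # urlencode percent-escapes the values and joins the pairs in order.
    return ENDPOINT + "?" + urlencode(params)


print(build_query("jincouxingche", 2016, 1))
```

<p>A list of pairs is used instead of a dict so that the parameter order matches the captured request.</p>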
<h1 id="编写spiders"><a href="#编写spiders" class="headerlink" title="编写spiders"></a><strong>Writing the Spiders</strong></h1><ol>
<li><p>yicheSpider.py</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding:utf-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> scrapy</span><br><span class="line"><span class="keyword">from</span> scrapy.http <span class="keyword">import</span> Request</span><br><span class="line"><span class="keyword">from</span> yiche.items <span class="keyword">import</span> YicheItem</span><br><span class="line"><span class="keyword">import</span> re</span><br><span class="line"></span><br><span class="line"><span class="comment"># create UrlList</span></span><br><span class="line">url_list = []</span><br><span class="line">Type = [<span class="string">'jincouxingche'</span>,<span class="string">'xiaoxingche'</span>,<span class="string">'weixingche'</span>,<span class="string">'zhongxingche'</span>,<span class="string">'zhongdaxingche'</span>,<span class="string">'suv'</span>,<span class="string">'mpv'</span>]</span><br><span class="line"><span class="keyword">for</span> t <span class="keyword">in</span> Type:</span><br><span class="line">    <span class="keyword">for</span> year <span class="keyword">in</span> range(<span class="number">2010</span>,<span class="number">2016</span>):</span><br><span class="line">        <span class="keyword">for</span> m <span class="keyword">in</span> range(<span class="number">1</span>,<span class="number">13</span>):</span><br><span class="line">            url = <span class="string">'http://index.bitauto.com/xiaoliang/'</span>+t+<span class="string">'/'</span>+str(year)+<span class="string">'m'</span>+str(m)+<span class="string">'/1'</span></span><br><span class="line">            url_list.append(url)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">YicheSpider</span><span class="params">(scrapy.spiders.Spider)</span>:</span></span><br><span class="line">    name = <span class="string">"yiche"</span></span><br><span class="line">    allowed_domains = [<span class="string">"index.bitauto.com"</span>]</span><br><span class="line"></span><br><span class="line">    start_urls = url_list</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse</span><span class="params">(self, response)</span>:</span></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Extract the data from the current page</span></span><br><span class="line">        s = response.url</span><br><span class="line">        t,year,m = re.findall(<span class="string">'xiaoliang/(.*?)/(\d+)m(\d+)'</span>,s,re.S)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line">        <span class="keyword">for</span> sel <span class="keyword">in</span> response.xpath(<span class="string">'//ol/li'</span>):</span><br><span class="line"></span><br><span class="line">            Name = sel.xpath(<span class="string">'a/text()'</span>).extract()[<span class="number">0</span>]</span><br><span class="line">            SalesNum = sel.xpath(<span class="string">'span/text()'</span>).extract()[<span class="number">0</span>]</span><br><span class="line">            <span class="comment">#print Name,SalesNum</span></span><br><span class="line">            items = YicheItem()</span><br><span class="line">            items[<span class="string">'Date'</span>] = str(year)+<span class="string">'/'</span>+str(m)</span><br><span class="line">            items[<span class="string">'CarName'</span>] = Name</span><br><span class="line">            items[<span class="string">'Type'</span>] = t</span><br><span class="line">            items[<span class="string">'SalesNum'</span>] = SalesNum</span><br><span class="line">            <span class="keyword">yield</span> items</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Check whether there is a next page: skip if there is none, otherwise crawl it</span></span><br><span class="line">        <span class="keyword">if</span> len(response.xpath(<span class="string">'//div[@class="the_pages"]/@class'</span>).extract())==<span class="number">0</span>:</span><br><span class="line">            <span class="keyword">pass</span></span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            next_pageclass = response.xpath(<span class="string">'//div[@class="the_pages"]/div/span[@class="next_off"]/@class'</span>).extract()</span><br><span class="line">            next_page = response.xpath(<span class="string">'//div[@class="the_pages"]/div/span[@class="next_off"]/text()'</span>).extract()</span><br><span class="line"></span><br><span class="line">            <span class="keyword">if</span> len(next_page)!=<span class="number">0</span> <span class="keyword">and</span> len(next_pageclass)!=<span class="number">0</span>:</span><br><span class="line">                <span class="keyword">pass</span></span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                next_url = <span class="string">'http://index.bitauto.com'</span>+response.xpath(<span class="string">'//div[@class="the_pages"]/div/a/@href'</span>)[<span class="number">-1</span>].extract()</span><br><span class="line">                <span class="keyword">yield</span> Request(next_url, callback=self.parse)</span><br></pre></td></tr></table></figure>
</li>
<li><p>items.py</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> scrapy</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">YicheItem</span><span class="params">(scrapy.Item)</span>:</span></span><br><span class="line"> <span class="comment"># define the fields for your item here like:</span></span><br><span class="line"> Date = scrapy.Field()</span><br><span class="line"> CarName = scrapy.Field()</span><br><span class="line"> Type = scrapy.Field()</span><br><span class="line"> SalesNum = scrapy.Field()</span><br></pre></td></tr></table></figure>
</li>
</ol>
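<p>The pagination step at the end of the spider can be looked at on its own. The following is a minimal sketch, not the spider's actual code: the helper name <code>next_page_url</code> and the sample hrefs are hypothetical, and it uses <code>urljoin</code> where the spider uses plain string concatenation. It stops when the pager shows the disabled "next_off" marker and otherwise follows the last link in the pager.</p>

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, the version used in this post

BASE_URL = 'http://index.bitauto.com'


def next_page_url(pager_hrefs, next_disabled):
    """Return the absolute URL of the next page, or None on the last page.

    pager_hrefs: href values of the links inside the pager, in page order.
    next_disabled: True when the pager contains a span with class "next_off".
    """
    if next_disabled or not pager_hrefs:
        return None
    # The last link in the pager is the "next page" link.
    return urljoin(BASE_URL, pager_hrefs[-1])


# Hypothetical hrefs, for illustration only.
middle_page = next_page_url(['/list?page=1', '/list?page=3'], False)
last_page = next_page_url(['/list?page=1'], True)
```

<p>In the spider, a non-empty result would be passed to <code>Request(next_url, callback=self.parse)</code>; an empty result ends the crawl.</p>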
<h1 id="保存到数据库"><a href="#保存到数据库" class="headerlink" title="保存到数据库"></a><strong>Saving to the database</strong></h1><ol>
<li><p>Edit settings.py</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">ITEM_PIPELINES = &#123;</span><br><span class="line"> <span class="string">'yiche.pipelines.YichePipeline'</span>: <span class="number">300</span>,</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
</li>
<li><p>Edit the pipeline file</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> MySQLdb</span><br><span class="line"><span class="keyword">import</span> MySQLdb.cursors</span><br><span class="line"><span class="keyword">import</span> logging</span><br><span class="line"><span class="keyword">from</span> twisted.enterprise <span class="keyword">import</span> adbapi</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">YichePipeline</span><span class="params">(object)</span>:</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self)</span>:</span></span><br><span class="line"> 
self.dbpool = adbapi.ConnectionPool(</span><br><span class="line"> dbapiName =<span class="string">'MySQLdb'</span>,<span class="comment"># database driver; MySQL here</span></span><br><span class="line"> host =<span class="string">'127.0.0.1'</span>,<span class="comment"># host IP; localhost here</span></span><br><span class="line"> db = <span class="string">'scrapy'</span>,<span class="comment"># database name</span></span><br><span class="line"> user = <span class="string">'root'</span>,<span class="comment"># user name</span></span><br><span class="line"> passwd = <span class="string">'root'</span>,<span class="comment"># password</span></span><br><span class="line"> cursorclass = MySQLdb.cursors.DictCursor,</span><br><span class="line"> charset = <span class="string">'utf8'</span>,<span class="comment"># character encoding</span></span><br><span class="line"> use_unicode = <span class="keyword">False</span></span><br><span class="line"> )</span><br><span class="line"></span><br><span class="line"> <span class="comment"># default pipeline hook, called for every item</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">process_item</span><span class="params">(self, item, spider)</span>:</span></span><br><span class="line"> query = self.dbpool.runInteraction(self._conditional_insert, item)</span><br><span class="line"> logging.debug(query)</span><br><span class="line"> <span class="keyword">return</span> item</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"> <span class="comment"># insert the item into the database</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">_conditional_insert</span><span class="params">(self, tx, item)</span>:</span></span><br><span class="line"> parms = (item[<span class="string">'Date'</span>],item[<span class="string">'CarName'</span>],item[<span class="string">'Type'</span>],item[<span class="string">'SalesNum'</span>])</span><br><span class="line"> sql = <span 
class="string">"insert into yiche (Date,CarName,Type,SalesNum) values(%s,%s,%s,%s)"</span></span><br><span class="line"> <span class="comment"># pass parms separately so the driver escapes the values</span></span><br><span class="line"> tx.execute(sql, parms)</span><br></pre></td></tr></table></figure>
</li>
</ol>
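<p>To try the insert step without a MySQL server, here is a minimal sketch that uses Python's built-in <code>sqlite3</code> module as a stand-in. It hands the values to the driver through placeholders, so a quote inside a car name cannot break the statement. Note that <code>sqlite3</code> writes placeholders as <code>?</code> while MySQLdb writes them as <code>%s</code>. The column types, the helper name <code>insert_item</code> and the sample item are assumptions made for illustration.</p>

```python
import sqlite3


def insert_item(conn, item):
    """Insert one scraped item using driver-side placeholders."""
    params = (item['Date'], item['CarName'], item['Type'], item['SalesNum'])
    conn.execute(
        "insert into yiche (Date, CarName, Type, SalesNum) values (?, ?, ?, ?)",
        params,
    )


# In-memory database; the table mirrors the item fields (types are assumed).
conn = sqlite3.connect(':memory:')
conn.execute(
    "create table yiche (Date text, CarName text, Type text, SalesNum text)"
)

# Hypothetical item; the apostrophe would break naive string formatting.
insert_item(conn, {'Date': '2016-05', 'CarName': "O'Car",
                   'Type': 'SUV', 'SalesNum': '123'})
rows = conn.execute("select CarName, SalesNum from yiche").fetchall()
```

<p>The same pattern carries over to the Twisted pipeline: pass the SQL text and the parameter tuple to <code>execute</code> as two separate arguments.</p>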
<h1 id="开始爬取"><a href="#开始爬取" class="headerlink" title="开始爬取"></a><strong>Start crawling</strong></h1><ul>
<li>Run the following command in a terminal to start the crawl<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">scrapy crawl yiche</span><br></pre></td></tr></table></figure>
</li>
</ul>
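<p>When the crawl finishes, the log shows that 701 GET requests were sent and that 701 responses came back with status code 200, so every request succeeded. The log also contains other statistics.</p>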
<p><img src="/img/bgods002.png" alt=""></p>
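<p>The same check can be written against the stats dictionary that Scrapy dumps at the end of a run. This is a minimal sketch: the helper name <code>all_requests_succeeded</code> is hypothetical, and the dictionary below is typed in by hand from the two numbers quoted above rather than read from a real crawler.</p>

```python
def all_requests_succeeded(stats):
    """Return True when every GET request got a 200 response."""
    sent = stats.get('downloader/request_method_count/GET', 0)
    ok = stats.get('downloader/response_status_count/200', 0)
    return sent > 0 and sent == ok


# Hand-written stats holding only the two counters discussed in the text.
stats = {
    'downloader/request_method_count/GET': 701,
    'downloader/response_status_count/200': 701,
}
crawl_ok = all_requests_succeeded(stats)
```

<p>A mismatch between the two counters would point to redirects, errors or retries that are worth checking in the log.</p>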
<p>Finally, let's check the database to see how much data was scraped.</p>
<p><img src="/img/bgods003.png" alt=""></p>
<p>As you can see, there are a little over 20,000 records. Compared with the data on the website, the result is quite complete.</p>
<hr>
<font size="3" color="red"><strong>Note: the environment used in this article is Ubuntu + Python 2.7.11 + Scrapy 1.03</strong></font>
</content>
<summary type="html">
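<p> I have been working with web crawlers for a while now. At first I scraped data with the request library and never used a crawler framework. Scrapy was only something I was curious about until, over the past couple of days, I read the <a href="http://scrapy-chs.readthedocs.org/zh_CN/1.0/intro/tutorial.html">scrapy</a> documentation and tried scraping some data with it, which turned out to be quite convenient.<br>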
</summary>
<category term="python" scheme="http://bgods.top/tags/python/"/>
<category term="爬虫" scheme="http://bgods.top/tags/%E7%88%AC%E8%99%AB/"/>
<category term="scrapy" scheme="http://bgods.top/tags/scrapy/"/>
</entry>
<entry>
<title>CSS Notes</title>
<link href="http://bgods.top/2016/06/27/CSS%E7%AC%94%E8%AE%B0/"/>
<id>http://bgods.top/2016/06/27/CSS笔记/</id>
<published>2016-06-27T00:50:37.000Z</published>
<updated>2016-09-16T13:55:49.000Z</updated>
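<content type="html"><h1 id="CSS-规则"><a href="#CSS-规则" class="headerlink" title="CSS 规则:"></a><strong>CSS rules:</strong></h1><ol>
<li>A rule consists of a selector and one or more declarations;</li>
<li>The selector is usually the HTML element whose style you want to change;</li>
<li>Each declaration consists of a property and a value;</li>
<li>Each declaration ends with a semicolon (;), which is optional only after the last declaration in a block, and the declaration block is enclosed in braces ({}).</li>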
</ol>
<a id="more"></a>
<p><img src="/img/bgods001.jpg" alt=""><br><figure class="highlight css"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="selector-tag">p</span> &#123;</span><br><span class="line"> <span class="attribute">color</span>:red;</span><br><span class="line"> <span class="attribute">text-align</span>:center;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure></p>
<hr>
<h1 id="CSS-注释"><a href="#CSS-注释" class="headerlink" title="CSS 注释:"></a><strong>CSS comments:</strong></h1><ul>
<li>A CSS comment starts with "/*" and ends with "*/".</li>
</ul>
<hr>
<h1 id="选择器"><a href="#选择器" class="headerlink" title="选择器"></a><strong>Selectors</strong></h1><ol>
<li><p><strong>Tag selector:</strong></p>