Native Crawler

Source repository: here


Using channels: crawling Zhenai.com (珍爱网)

Single-task crawler architecture

+--------------+            +--------------+      response      +-----------------+
|              |  request   |              +------------------->+                 |
|     Seed     +----------->+    Engine    |  requests,items    |     Parser      |
|              |            |              |<-------------------|                 |
+--------------+            ++-----------+-+                    +-----------------+
                     +------^^           ^ ^------+                       
                     |-------|           |        |
                     ||                  |        |
                     ||                  |url     |response
                     vv                  |        |
     +---------------++------+           | +------+--------------+
     |                       |           | |                     |
     |     Task Queue        |           +->       Fetcher       |
     |                       |             |                     |
     +-----------------------+             +---------------------+
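
The single-task loop in the diagram above can be sketched in Go roughly as follows: the engine pops a request from the task queue, passes its URL to the fetcher, hands the response to the parser, and queues whatever requests come back. The `Request`, `ParseResult`, and `Fetcher` types, the `Run` function, and the in-memory `demoFetch` site are assumptions made for illustration, not code from the source repository; a real fetcher would wrap `net/http`.

```go
package main

import (
	"fmt"
	"strings"
)

// Request pairs a URL with the parser that understands that page (hypothetical type).
type Request struct {
	URL   string
	Parse func(body []byte) ParseResult
}

// ParseResult carries follow-up requests and extracted items back to the engine.
type ParseResult struct {
	Requests []Request
	Items    []string
}

// Fetcher returns the body for a URL.
type Fetcher func(url string) ([]byte, error)

// Run is the single-task engine: pop a request, fetch it, parse it,
// queue the new requests, and collect the items, one page at a time.
func Run(fetch Fetcher, seeds ...Request) []string {
	queue := append([]Request(nil), seeds...) // the Task Queue
	var items []string
	for len(queue) > 0 {
		r := queue[0]
		queue = queue[1:]
		body, err := fetch(r.URL)
		if err != nil {
			fmt.Printf("fetch %s: %v\n", r.URL, err)
			continue
		}
		result := r.Parse(body)
		queue = append(queue, result.Requests...)
		items = append(items, result.Items...)
	}
	return items
}

// demoFetch serves a tiny in-memory site instead of hitting the network.
func demoFetch(url string) ([]byte, error) {
	pages := map[string]string{
		"/cities": "/city/a,/city/b",
		"/city/a": "alice,bob",
		"/city/b": "carol",
	}
	body, ok := pages[url]
	if !ok {
		return nil, fmt.Errorf("no such page")
	}
	return []byte(body), nil
}

// parseCityList turns each listed URL into a request for a city page.
func parseCityList(body []byte) ParseResult {
	var res ParseResult
	for _, u := range strings.Split(string(body), ",") {
		res.Requests = append(res.Requests, Request{URL: u, Parse: parseCity})
	}
	return res
}

// parseCity treats every comma-separated name on a city page as an item.
func parseCity(body []byte) ParseResult {
	return ParseResult{Items: strings.Split(string(body), ",")}
}

func main() {
	items := Run(demoFetch, Request{URL: "/cities", Parse: parseCityList})
	fmt.Println(items) // [alice bob carol]
}
```

Because the queue is drained first-in, first-out and only one page is handled at a time, the items come out in a fixed order.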

Concurrent crawler architecture

                                       +----------------------------------------------+
                                       |                                              |
                                       |                                     Worker   |
                                       |                                              |
                                       |                                              |
+--------------+            +--------------+      response      +-----------------+   |
|              |  request   |          |   +------------------->+                 |   |
|     Seed     +----------->+    Engine|   |  requests,items    |     Parser      |   |
|              |            |          |   +<-------------------+                 |   |
+--------------+            +------------+-+                    +-----------------+   |
                     +------->         | > ^------+                                   |
                     |-------+         | |        |                                   |
                     ||                | |        |                                   |
                     ||                | |url     |response                           |
                     <>                | |        |                                   |
     +-----------------------+         | |        +-------------+------------------+  |
     |                       |         | |                      |                  |  |
     |     Task Queue        |         | ----------------------->     Fetcher      |  |
     |                       |         |                        |                  |  |
     +-----------------------+         |                        +------------------+  |
                                       |                                              |
                                       +----------------------------------------------+
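
In the concurrent design the Fetcher and Parser are grouped into a single Worker, so that one unit of work can run in its own goroutine. A minimal sketch of such a worker, assuming the same hypothetical `Request`, `ParseResult`, and `Fetcher` types (none of these names come from the source repository):

```go
package main

import (
	"fmt"
	"strings"
)

// Request pairs a URL with the parser that understands that page (hypothetical type).
type Request struct {
	URL   string
	Parse func(body []byte) ParseResult
}

// ParseResult carries follow-up requests and extracted items.
type ParseResult struct {
	Requests []Request
	Items    []string
}

// Fetcher returns the body for a URL.
type Fetcher func(url string) ([]byte, error)

// worker runs the Fetcher step and then the Parser step for one request.
func worker(fetch Fetcher, r Request) (ParseResult, error) {
	body, err := fetch(r.URL)
	if err != nil {
		return ParseResult{}, fmt.Errorf("fetch %s: %w", r.URL, err)
	}
	return r.Parse(body), nil
}

// demoFetch is an in-memory stand-in for a real HTTP fetcher.
func demoFetch(url string) ([]byte, error) {
	pages := map[string]string{"/city/a": "alice,bob"}
	body, ok := pages[url]
	if !ok {
		return nil, fmt.Errorf("no such page")
	}
	return []byte(body), nil
}

// parseCity treats every comma-separated name on the page as an item.
func parseCity(body []byte) ParseResult {
	return ParseResult{Items: strings.Split(string(body), ",")}
}

func main() {
	res, err := worker(demoFetch, Request{URL: "/city/a", Parse: parseCity})
	fmt.Println(res.Items, err) // [alice bob] <nil>

	_, err = worker(demoFetch, Request{URL: "/missing", Parse: parseCity})
	fmt.Println(err) // fetch /missing: no such page
}
```

Since `worker` touches no shared state, any number of copies can run side by side; the implementations that follow differ only in how requests reach them.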

Implementation I:


                           +---------------+
                           |   Output      |
                           |               |
                           +---------------+
                                  <>
                                  ||
                                  || Items
                                  ||
                                  ||
+--------------+            +--------------+                    +-----------------+
|              |  request   |              |   requests,items   | +-----------------+
|     Seed     +----------->+    Engine    +--------------------+ | +-----------------+
|              |            |              |                    | | |                 |
+--------------+            +-----------+--+                    +-+ |      Worker     |
                                        |                         +-+                 |
                                        | request                   +--+--------------+
                                        |                              |
                                        |                              |
                                        v                              |
                                      +-+------------------------------+----+
                                      |                                     |
                                      |               Scheduler             |
                                      |                                     |
                                      +-------------------------------------+
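
One way to read the diagram above is that the scheduler pushes every request into a single channel shared by all workers, and the workers send their results straight back to the engine. The sketch below assumes that reading; `SimpleScheduler`, `Submit`, `startWorker`, and the pending counter used to stop the demo are hypothetical names, not taken from the source repository.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Request pairs a URL with the parser for that page (hypothetical type).
type Request struct {
	URL   string
	Parse func(body []byte) ParseResult
}

// ParseResult carries follow-up requests and extracted items.
type ParseResult struct {
	Requests []Request
	Items    []string
}

// Fetcher returns the body for a URL.
type Fetcher func(url string) ([]byte, error)

// SimpleScheduler feeds every request into one channel shared by all workers.
type SimpleScheduler struct{ workerChan chan Request }

// Submit blocks until some worker is free to receive the request.
func (s *SimpleScheduler) Submit(r Request) { s.workerChan <- r }

// startWorker fetches and parses requests until the shared channel is closed.
func startWorker(fetch Fetcher, in <-chan Request, out chan<- ParseResult) {
	go func() {
		for r := range in {
			var result ParseResult
			if body, err := fetch(r.URL); err == nil {
				result = r.Parse(body)
			}
			out <- result // report even on failure so the engine's count stays right
		}
	}()
}

// Run is the engine: it submits requests, collects items, and stops once
// every submitted request has reported back.
func Run(fetch Fetcher, workerCount int, seed Request) []string {
	s := &SimpleScheduler{workerChan: make(chan Request)}
	out := make(chan ParseResult)
	for i := 0; i < workerCount; i++ {
		startWorker(fetch, s.workerChan, out)
	}
	s.Submit(seed)
	pending := 1
	var items []string
	for pending > 0 {
		result := <-out
		pending--
		items = append(items, result.Items...)
		for _, r := range result.Requests {
			s.Submit(r)
			pending++
		}
	}
	close(s.workerChan) // lets the worker goroutines exit
	sort.Strings(items) // workers finish in any order
	return items
}

// demoFetch serves a tiny in-memory site instead of hitting the network.
func demoFetch(url string) ([]byte, error) {
	pages := map[string]string{"/cities": "/city/a,/city/b", "/city/a": "alice,bob", "/city/b": "carol"}
	if body, ok := pages[url]; ok {
		return []byte(body), nil
	}
	return nil, fmt.Errorf("no such page")
}

func parseCityList(body []byte) ParseResult {
	var res ParseResult
	for _, u := range strings.Split(string(body), ",") {
		res.Requests = append(res.Requests, Request{URL: u, Parse: parseCity})
	}
	return res
}

func parseCity(body []byte) ParseResult {
	return ParseResult{Items: strings.Split(string(body), ",")}
}

func main() {
	fmt.Println(Run(demoFetch, 2, Request{URL: "/cities", Parse: parseCityList})) // [alice bob carol]
}
```

With a blocking `Submit`, the engine can stall when a single result yields more requests than there are idle workers: the engine waits to hand out a request while every worker waits to hand back a result. The demo stays clear of this by running two workers against a page that yields two requests.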

Implementation II: dispatching Requests concurrently

                                       +-----------------+
                                       | +-----------------+
                                       | | +-----------------+
                                       | | |                 |
         +                             +-+ |      Worker     |
         | request                       +-+                 |
         |                                 +--+--------------+
         |                                    ^
         |                                    | request
         v                                    |
+--------+---------+    create for     +------------------+
|                  |    each request   | +------------------+
|     Scheduler    +------------------>+ | +-------------------+
|                  |                   | | |                   |
+------------------+                   +-+ |     Goroutine     |
                                         +-+                   |
                                           +-------------------+
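
The diagram above has the scheduler create one goroutine per request, and that goroutine waits for a free worker. A minimal sketch under that reading follows; the only intended difference from a blocking scheduler is the `go` statement inside `Submit`. All type and function names are hypothetical, and the demo deliberately runs a single worker against a page that yields three requests.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Request pairs a URL with the parser for that page (hypothetical type).
type Request struct {
	URL   string
	Parse func(body []byte) ParseResult
}

// ParseResult carries follow-up requests and extracted items.
type ParseResult struct {
	Requests []Request
	Items    []string
}

// Fetcher returns the body for a URL.
type Fetcher func(url string) ([]byte, error)

// SimpleScheduler feeds every request into one channel shared by all workers.
type SimpleScheduler struct{ workerChan chan Request }

// Submit returns at once; a short-lived goroutine waits for a free worker.
func (s *SimpleScheduler) Submit(r Request) {
	go func() { s.workerChan <- r }()
}

// startWorker fetches and parses requests until the shared channel is closed.
func startWorker(fetch Fetcher, in <-chan Request, out chan<- ParseResult) {
	go func() {
		for r := range in {
			var result ParseResult
			if body, err := fetch(r.URL); err == nil {
				result = r.Parse(body)
			}
			out <- result // report even on failure so the engine's count stays right
		}
	}()
}

// Run is the engine; it never blocks on Submit, only on reading results.
func Run(fetch Fetcher, workerCount int, seed Request) []string {
	s := &SimpleScheduler{workerChan: make(chan Request)}
	out := make(chan ParseResult)
	for i := 0; i < workerCount; i++ {
		startWorker(fetch, s.workerChan, out)
	}
	s.Submit(seed)
	pending := 1
	var items []string
	for pending > 0 {
		result := <-out
		pending--
		items = append(items, result.Items...)
		for _, r := range result.Requests {
			s.Submit(r)
			pending++
		}
	}
	// pending == 0 means every Submit goroutine has delivered, so closing is safe.
	close(s.workerChan)
	sort.Strings(items) // workers finish in any order
	return items
}

// demoFetch serves a tiny in-memory site instead of hitting the network.
func demoFetch(url string) ([]byte, error) {
	pages := map[string]string{
		"/cities": "/city/a,/city/b,/city/c",
		"/city/a": "alice,bob", "/city/b": "carol", "/city/c": "dave",
	}
	if body, ok := pages[url]; ok {
		return []byte(body), nil
	}
	return nil, fmt.Errorf("no such page")
}

func parseCityList(body []byte) ParseResult {
	var res ParseResult
	for _, u := range strings.Split(string(body), ",") {
		res.Requests = append(res.Requests, Request{URL: u, Parse: parseCity})
	}
	return res
}

func parseCity(body []byte) ParseResult {
	return ParseResult{Items: strings.Split(string(body), ",")}
}

func main() {
	fmt.Println(Run(demoFetch, 1, Request{URL: "/cities", Parse: parseCityList})) // [alice bob carol dave]
}
```

The trade-off is that the scheduler no longer controls which worker receives which request, and the number of waiting goroutines grows with the backlog.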

Implementation III: Request queue and Worker queue

                +
                |
                | request
                |
                |
                v                          +--------------------------------+
        +-------+---------------+          |                                |
        |                       |          |  +----------+     +----------+ |
        |       Scheduler       +--------->+  |  Request +---->+  Worker  | |
        |                       |          |  +----------+     +----------+ |
        +-+-------------------+-+          |                                |
          |                   |            +--------------------------------+
          |                   |
          v                   |
+---------+-------+  +--------+---------+                   +------------------+
|                 |  |                  |                   | +------------------+
|  Request Queue  |  |   Worker Queue   +-------------------+ | +-------------------+
|                 |  |                  |                   | | |                   |
+-----------------+  +------------------+                   +-+ |      Worker       |
                                                              +-+                   |
                                                                +-------------------+
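
The two queues in the diagram above can be sketched with a scheduler goroutine that owns a request queue and a queue of idle workers, and pairs the heads of both whenever neither is empty. Each worker has its own input channel and announces itself when it is ready. The sketch is a simplification under stated assumptions: `QueuedScheduler`, `WorkerReady`, and `Crawl` are hypothetical names, and the worker body is a stub that stands in for the fetch and parse steps.

```go
package main

import (
	"fmt"
	"sort"
)

// Request is reduced to a URL here; fetching and parsing are stubbed out.
type Request struct{ URL string }

// QueuedScheduler owns a queue of pending requests and a queue of idle workers.
type QueuedScheduler struct {
	requestChan chan Request
	workerChan  chan chan Request
}

// NewQueuedScheduler creates the channels and starts the scheduling loop.
func NewQueuedScheduler() *QueuedScheduler {
	s := &QueuedScheduler{
		requestChan: make(chan Request),
		workerChan:  make(chan chan Request),
	}
	go s.run()
	return s
}

// Submit hands a request to the scheduler, which queues it.
func (s *QueuedScheduler) Submit(r Request) { s.requestChan <- r }

// WorkerReady registers a worker's private input channel as idle.
func (s *QueuedScheduler) WorkerReady(w chan Request) { s.workerChan <- w }

// run pairs the oldest request with the oldest idle worker.
func (s *QueuedScheduler) run() {
	var requestQ []Request
	var workerQ []chan Request
	for {
		var next Request
		var idle chan Request // stays nil, disabling the send case, until both queues have entries
		if len(requestQ) > 0 && len(workerQ) > 0 {
			next, idle = requestQ[0], workerQ[0]
		}
		select {
		case r := <-s.requestChan:
			requestQ = append(requestQ, r)
		case w := <-s.workerChan:
			workerQ = append(workerQ, w)
		case idle <- next:
			requestQ, workerQ = requestQ[1:], workerQ[1:]
		}
	}
}

// startWorker announces itself as idle, handles one request, and repeats.
func startWorker(s *QueuedScheduler, out chan<- string) {
	in := make(chan Request)
	go func() {
		for {
			s.WorkerReady(in)
			r := <-in
			out <- "fetched " + r.URL // stand-in for fetch + parse
		}
	}()
}

// Crawl submits every URL and waits for one result per URL.
// The scheduler and workers keep running until the program exits;
// a real crawler would add a shutdown signal.
func Crawl(workerCount int, urls ...string) []string {
	s := NewQueuedScheduler()
	out := make(chan string)
	for i := 0; i < workerCount; i++ {
		startWorker(s, out)
	}
	for _, u := range urls {
		s.Submit(Request{URL: u}) // queued by the scheduler even when all workers are busy
	}
	results := make([]string, 0, len(urls))
	for range urls {
		results = append(results, <-out)
	}
	sort.Strings(results) // workers finish in any order
	return results
}

func main() {
	fmt.Println(Crawl(2, "/city/a", "/city/b", "/city/c"))
	// [fetched /city/a fetched /city/b fetched /city/c]
}
```

Sending on a nil channel blocks forever, so the `idle <- next` case is only eligible when both queues are non-empty. This keeps all queue state inside one goroutine without locks and leaves the scheduler free to choose which worker gets which request.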

Acknowledgements