
Getting Started: Basics

If you haven't grabbed a local copy of our Examples, click here to learn how.

Code

The following is based on the ai.preferred.crawler.single package in our Examples.

Let's learn

To get started with crawling, we need to complete a few basic steps:

  1. Create a Fetcher (SingleCrawler.java)
  2. Create a Crawler (SingleCrawler.java)
  3. Create a Handler (SingleHandler.java)
  4. Submit the request (SingleCrawler.java)

1. Create a Fetcher (SingleCrawler.java)

In today's example, we will create an AsyncFetcher. The AsyncFetcher is similar to a web client or browser: it goes to a given URL and obtains the specified page.

Let's write a method that creates an AsyncFetcher with the default options.

  private static Fetcher createFetcher() {
    return AsyncFetcher.buildDefault();
  }

Alternatively, if you need more control over the options, you can use the builder directly.

  private static Fetcher createFetcher() {
    return AsyncFetcher.builder()
        // Browse the builder for the various options you can set, e.g.
        // .setUserAgent(() -> "UserAgent")
        .build();
  }

There are tons of options in the builder; for a full list of the options you can set, check out the Documentation.
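For instance, uncommenting the option above gives a fetcher with a custom user agent. Note that setUserAgent takes a supplier, hence the lambda; the user agent string below is just an illustration.

  private static Fetcher createFetcher() {
    return AsyncFetcher.builder()
        // The user agent string here is only an example value
        .setUserAgent(() -> "MyCrawler/1.0")
        .build();
  }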


2. Create a Crawler (SingleCrawler.java)

The Crawler is the brain behind the process: it controls the sleep time between requests, handles retries, and processes the responses returned by the fetcher.

Let's write a method that creates a Crawler from a given Fetcher.

  private static Crawler createCrawler(Fetcher fetcher) {
    // Browse the builder for the various options you can set
    return Crawler.builder()
        .setFetcher(fetcher)
        .build();
  }

Of course, just like the fetcher, the Crawler builder has tons of options; to check them out, visit our Documentation.
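As an illustration, here is a sketch of a crawler with a retry limit and a randomized sleep between requests. This assumes the builder exposes setMaxTries and setSleepScheduler as in other Venom examples, so do verify these against the Documentation.

  private static Crawler createCrawler(Fetcher fetcher) {
    return Crawler.builder()
        .setFetcher(fetcher)
        // Assumed option: retry each request up to 5 times
        .setMaxTries(5)
        // Assumed option: sleep between 0 and 2000 ms between requests
        .setSleepScheduler(new SleepScheduler(0, 2000))
        .build();
  }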


3. Create a Handler (SingleHandler.java)

The Handler processes the response from the page you requested; it determines how the information is stored and whether more requests should be scheduled. The handle method is called automatically when a response to your request is received.

Let's check out SingleHandler.java for the handle method.

  @Override
  public void handle(Request request, VResponse response, Scheduler scheduler, Session session, Worker worker) {
    LOGGER.info("Processing {}", request.getUrl());

    // Get content type
    System.out.println(response.getContentType());

    // Get HTML
    final String html = response.getHtml();
    System.out.println(html);

    // Get our IP
    final Document document = response.getJsoup();
    final String ip = document.select("#content > div.main-box > div > div.column > div > strong")
        .first().text();

    LOGGER.info("My IP is {}, let's go to {} to verify", ip, request.getUrl());
  }

This handler prints the content type of the page we requested, the HTML source of the page, and our IP address, which is found using the CSS selector:

#content > div.main-box > div > div.column > div > strong

There are a few notable methods available, such as response.getHtml(), which returns the HTML source of the web page you requested, and response.getJsoup(), which returns a jsoup document that you can query with CSS selectors. More about jsoup can be found in jsoup's cookbook on Extracting data.
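Since handle also receives the Scheduler, a handler can queue follow-up requests of its own, using the same scheduler.add() call shown in the next step. Below is a minimal sketch; the h1 selector and the next-page URL are purely illustrative placeholders.

  @Override
  public void handle(Request request, VResponse response, Scheduler scheduler, Session session, Worker worker) {
    final Document document = response.getJsoup();

    // Illustrative query: grab the first <h1> on the page, if any
    final Element heading = document.selectFirst("h1");
    if (heading != null) {
      LOGGER.info("Found heading: {}", heading.text());
    }

    // Schedule a follow-up request, to be handled by this same handler
    // (placeholder URL for illustration)
    scheduler.add(new VRequest("https://example.com/next"), this);
  }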


4. Submit the request (SingleCrawler.java)

In order to run the application, we will need a main method.

  public static void main(String[] args) {

    final Fetcher fetcher = createFetcher();
    try (Crawler crawler = createCrawler(fetcher).start()) {
      LOGGER.info("Starting crawler...");
      ...
    } catch (Exception e) {
      LOGGER.error("Could not run crawler: ", e);
    }

  }

In the method above, we create the Fetcher and the Crawler. It is essential that you remember to call start() on the Crawler; otherwise, the crawler will not process any requests.

To submit a request, we add these three lines inside the try block:

      // pass in URL and handler
      final Request request = new VRequest(URL);
      final Handler handler = new SingleHandler();
      crawler.getScheduler().add(request, handler);

We create a Request with the URL and a Handler to handle the response, then schedule the request using getScheduler().add(), attaching the handler to it.
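Putting it all together, the completed main method looks like this, with the three lines above placed where the ellipsis was (URL is the page to crawl, defined in SingleCrawler.java):

  public static void main(String[] args) {

    final Fetcher fetcher = createFetcher();
    try (Crawler crawler = createCrawler(fetcher).start()) {
      LOGGER.info("Starting crawler...");

      // Pass in URL and handler
      final Request request = new VRequest(URL);
      final Handler handler = new SingleHandler();
      crawler.getScheduler().add(request, handler);

    } catch (Exception e) {
      LOGGER.error("Could not run crawler: ", e);
    }

  }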

Congratulations

You have now learnt to create a crawler and a fetcher, for which you can specify different options, and you can run this example to obtain your current IP address.