Saturday, September 19, 2015

Krwkrw 0.1.3 Released

Just pushed the latest release (0.1.3) of Krwkrw to Maven Central.

Krwkrw is a web scraper. You can read how it came into being here.

A quick rundown of the changes in 0.1.3:
  1. Ability to express URLs to be included or excluded using a regex pattern. For example:
    Krwkrw crawler = new Krwkrw(action);
    crawler.match("(\\S+)(/projects/)(\\S+)");
    
    makes sure that only content under the /projects/ path is processed, while
    Krwkrw crawler = new Krwkrw(action);
    crawler.skip("(\\S+)(/projects/)(\\S+)");
    
    fetches and processes all content except what is under the /projects/ path.
  2. Ability to have random delays between requests.
    Previously it was only possible to set a fixed number of seconds to wait between requests. For example:
    Krwkrw crawler = new Krwkrw(action);
    crawler.setDelay(5); // waits 5 seconds between requests
    
    With the 0.1.3 release, the delay can be random: each request is delayed by a number of seconds picked randomly between a lower and an upper bound. For example:
    Krwkrw crawler = new Krwkrw(action);
    // the wait will be any number of seconds between 5 and 20
    crawler.setDelay(5, 20);
    
  3. Change in API: the doKrawl method has been replaced with crawl.
  4. Fixed an issue where the crawler could crawl pages outside the origin URL.
  5. Some minor improvements here and there...
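
To make the first two changes more concrete, here is a minimal sketch of how include/exclude patterns and a random delay could be computed with nothing but the JDK. This is not Krwkrw's actual implementation: the class CrawlPolicySketch and its methods are hypothetical names made up for illustration, and the sketch assumes that a pattern has to match the whole URL and that both delay bounds are inclusive, neither of which is confirmed by the release notes.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Krwkrw: illustrates include/exclude
// filtering and random delays using only JDK classes.
class CrawlPolicySketch {
    private final Pattern include; // null means "include everything"
    private final Pattern exclude; // null means "exclude nothing"

    CrawlPolicySketch(String includeRegex, String excludeRegex) {
        this.include = includeRegex == null ? null : Pattern.compile(includeRegex);
        this.exclude = excludeRegex == null ? null : Pattern.compile(excludeRegex);
    }

    // Assumption: the pattern must match the entire URL (Matcher.matches()).
    boolean shouldProcess(String url) {
        if (exclude != null && exclude.matcher(url).matches()) {
            return false; // explicitly skipped
        }
        return include == null || include.matcher(url).matches();
    }

    // Picks a delay in seconds between min and max.
    // Assumption: both bounds are inclusive.
    static int randomDelaySeconds(int min, int max) {
        if (min < 0 || max < min) {
            throw new IllegalArgumentException("expected 0 <= min <= max");
        }
        return ThreadLocalRandom.current().nextInt(min, max + 1);
    }

    public static void main(String[] args) {
        String projects = "(\\S+)(/projects/)(\\S+)";
        CrawlPolicySketch onlyProjects = new CrawlPolicySketch(projects, null);
        CrawlPolicySketch skipProjects = new CrawlPolicySketch(null, projects);

        String inPath = "http://example.com/projects/krwkrw";
        String outOfPath = "http://example.com/blog/post";

        System.out.println(onlyProjects.shouldProcess(inPath));    // true
        System.out.println(onlyProjects.shouldProcess(outOfPath)); // false
        System.out.println(skipProjects.shouldProcess(inPath));    // false
        System.out.println(skipProjects.shouldProcess(outOfPath)); // true

        // A value somewhere in the range 5..20; it differs from run to run.
        System.out.println(randomDelaySeconds(5, 20));
    }
}
```

A crawler built along these lines would call shouldProcess before fetching each URL and sleep for randomDelaySeconds(5, 20) seconds between requests, so the request timing looks less mechanical than a fixed interval.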
If you use Maven as your build tool, you can add it to your project via:
<dependency>
  <groupId>com.blogspot.geekabyte.krwkrw</groupId>
  <artifactId>krwler</artifactId>
  <version>0.1.3</version>
</dependency>

If you use Gradle, then:
dependencies {
  compile "com.blogspot.geekabyte.krwkrw:krwler:0.1.3"
}

The Javadoc is accessible online here.

You can also check out Krwkrw on GitHub.
