sha1dy edited this page Sep 14, 2010 · 10 revisions

The main motivation for Logly (a log crawler using Ruby and NoSQL as a backend) is the lack of free or open source solutions. Splunk is the best option available, but it costs an enormous amount of money.

Requirements

Preliminary requirements:

  • Cross-platform (works on Windows and Solaris)
  • Provide the ability to search through logs for a particular string
  • Store log files somewhere and provide the ability to view not only current log files but also files that have already been rotated

It should be something like Splunk: you should be able to view logs right from a web frontend without logging in to the actual boxes, including logs that have already been cleaned up (such as yesterday's logs). There should also be search functionality that looks for an entered string across all logs (current and historical) and shows a list of matching log lines, with links to those lines for further investigation.
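
The search requirement can be illustrated with a minimal Ruby sketch. It assumes the logs are plain text files reachable through a glob pattern; the `search_logs` method and the `Hit` struct are hypothetical names introduced here, not part of Logly. A real implementation would query a search index rather than rescan every file.

```ruby
# Hypothetical helper: scan every file matching a glob for a literal string
# and return one hit per matching line.
Hit = Struct.new(:path, :line_number, :text)

def search_logs(glob, needle)
  hits = []
  # Sort the paths so results come back in a stable order.
  Dir.glob(glob).sort.each do |path|
    # Line numbers start at 1 so a hit can be turned into a link to that line.
    File.foreach(path).with_index(1) do |line, number|
      hits << Hit.new(path, number, line.chomp) if line.include?(needle)
    end
  end
  hits
end
```

Each hit carries the file path and line number, which is enough for a frontend to build a link to the exact log line. A glob such as `app.log*` covers both the current file and its rotated copies.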

Some additional requirements:

  • The search index and log storage should be distributed in order to make use of all available boxes and space (for instance, we have 3 development boxes with 10 GB of space, and we need to use them all because our logs amount to 20 GB each day).
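
One simple way to spread data over the boxes is to hash a key (for example a log file name) and take it modulo the number of nodes. The sketch below is only an illustration under that assumption: the box names and the `node_for` method are hypothetical, and the storage candidates discussed below each bring their own sharding mechanism.

```ruby
require "digest"

# Hypothetical names for the three development boxes.
NODES = %w[dev1 dev2 dev3].freeze

# Map a key (for example a log file name) to one node.
def node_for(key, nodes = NODES)
  # MD5 serves only as a stable hash here; String#hash differs between
  # Ruby processes, so it cannot be used for placement.
  index = Digest::MD5.hexdigest(key).to_i(16) % nodes.size
  nodes[index]
end
```

Modulo placement is easy to reason about, but adding or removing a box moves most keys to a different node, which is one reason to prefer a store with built-in sharding.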

Technical aspect:

  • Make use of Ruby?
  • Make use of NoSql?

Log storing solutions

Requirements

  • Work on Windows and Solaris
  • Sharding of data across nodes
  • Storing data in compact format

MongoDB

MongoDB is a document-oriented database written in C++.
Pros:

  • Has good-quality drivers for Ruby and other languages.
  • Supports sharding (via sharded collections).
  • Supports capped collections.
  • Supports sorting.
  • Fast on inserts.
  • As a document-oriented database, it is well suited for storing varied log messages.
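
To make the last point concrete, here is a sketch of how a log line might be turned into a schemaless document. The line layout (`timestamp LEVEL [host] message`), the field names, and the `log_document` method are all assumptions made for illustration; no database driver is involved, and the resulting hash is simply what would be handed to a document store.

```ruby
# Assumed line layout: "2010-09-14 12:00:01 ERROR [web01] Disk full"
LINE_PATTERN = /\A(?<timestamp>\S+ \S+) (?<level>[A-Z]+) \[(?<host>[^\]]+)\] (?<message>.*)\z/

# Build a document (a plain Hash) from one log line.
def log_document(line, source_file)
  text = line.chomp
  match = LINE_PATTERN.match(text)
  # Lines that do not fit the pattern are stored whole, so nothing is lost.
  return { "file" => source_file, "message" => text } unless match

  match.named_captures.merge("file" => source_file)
end
```

Because documents in the same collection need not share fields, parsed and unparsed lines can sit side by side.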

Cons:

  • (showstopper) Doesn’t work on Solaris.
  • Has a 2 GB size limit on 32-bit OSes.

CouchDB

CouchDB is a document-oriented database written in Erlang.
Pros:

  • Possibly works on Solaris.
  • Supports querying for ranges.
  • Supports sorting (sorts only by index keys).
  • Has frameworks for Ruby.
  • As a document-oriented database, it is well suited for storing varied log messages.

Cons:

  • Possibly a disk space hog.
  • Possibly best suited for small amounts of data.
  • (showstopper) Unofficial sharding support (via smart proxy, collections are distributed randomly across shards).
  • Slow on inserts.

Cassandra

Cassandra is a BigTable variation written in Java.
Pros:

  • Supports sharding (via partitioning).
  • Supports querying for ranges.
  • Supports sorting (sorting is defined in schema).
  • Has frameworks for Ruby.

Cons:

  • Lacks good documentation.
  • Disk space hog (200 MB of logs are stored in Cassandra as 550 MB of files).
  • Semi-automatic load balancing (requires the loadbalance tool to be run frequently to rebalance the cluster).
  • Schema driven.
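
The disk space figure can be set against the stated log volume with a quick calculation. This is a rough sketch that assumes the 200 MB to 550 MB ratio holds at larger volumes, that each of the 3 boxes has 10 GB, and that no replication is used; `stored_size_gb` is a hypothetical helper.

```ruby
# Estimate on-disk size from raw size, using the measured sample
# (200 MB of logs stored as 550 MB) as the default ratio.
def stored_size_gb(raw_gb, raw_sample_mb: 200.0, stored_sample_mb: 550.0)
  raw_gb * stored_sample_mb / raw_sample_mb
end

daily_on_disk_gb = stored_size_gb(20) # 20 GB of raw logs per day
total_capacity_gb = 3 * 10            # assumed: 3 boxes with 10 GB each

puts "One day of logs on disk: #{daily_on_disk_gb} GB"
puts "Total capacity:          #{total_capacity_gb} GB"
```

Under these assumptions a single day of logs would take about 55 GB, more than the 30 GB available, which is why compact storage is listed as a requirement.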

HBase

HBase is a BigTable variation written in Java running on Hadoop.
Pros:

  • Supports compression (via hadoop).
  • Supports querying for ranges (via scanners).
  • Supports sorting (sorting is defined in schema).
  • Supports sharding (via regions).
  • Flexible schema (very similar to document-oriented database).

Cons:

  • Lacks good documentation.
  • Requires cygwin on Windows.
  • Lack of good Ruby frameworks.
  • Keys are sorted as bytes (log10 sorts before log2).
  • Requires considerable effort to set up and deploy.
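
The byte-order issue is easy to reproduce in plain Ruby, whose string comparison is also byte-wise. A common workaround is to zero-pad numeric parts of the key; the `padded_key` helper and the width of 5 are assumptions for illustration.

```ruby
# Byte-wise ordering puts "log10" between "log1" and "log2".
names = %w[log1 log2 log10]
byte_order = names.sort

# Hypothetical helper: zero-pad the number so byte order matches numeric order.
def padded_key(prefix, number, width = 5)
  "#{prefix}#{number.to_s.rjust(width, '0')}"
end

padded_order = [1, 2, 10].map { |n| padded_key("log", n) }.sort

p byte_order
p padded_order
```

The padding width has to be chosen up front, since it caps the largest number that still sorts correctly.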

HDFS

HDFS is a distributed file system written in Java as part of Hadoop.
Pros:

  • Well suited for storing unstructured data in a distributed environment.
  • Supports compression.
  • Supports querying for ranges (via File API).
  • Supports sharding (files are split into blocks distributed across nodes).

Cons:

  • Lacks good documentation.
  • Requires cygwin on Windows.
  • API exists only for Java.
  • Requires some effort to set up and deploy.