This repository has been archived by the owner on Dec 19, 2023. It is now read-only.

Scalavro

A runtime reflection-based Avro library in Scala.

A full description of Avro is outside the scope of this documentation, but here is an introduction from avro.apache.org:

Apache Avro ™ is a data serialization system.

Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc.

Scalavro takes a code-first, reflection-based approach to schema generation and (de)serialization. This yields a very low-overhead interface, but it also imposes some costs. In general, Scalavro assumes you know what types you're reading and writing. No built-in support is provided (as yet) for so-called schema resolution (taking the writer's schema into account when reading data).

Goals

  1. To provide an in-memory representation of Avro schemas and protocols.
  2. To synthesize Avro schemas and protocols dynamically for a useful subset of Scala types.
  3. To dynamically generate Scala bindings for reading and writing Avro-mapped Scala types to and from Avro binary.
  4. Generally, to minimize fuss required to create an Avro-capable Scala application.

Overview

Interacting with Scalavro is easy from a user's perspective. Scalavro will use reflection to inspect some Scala type you supply and return an instance of AvroType. With this object in hand, you are one or two method calls away from schema generation, type dependency graph inspection, or binary/JSON (de)serialization.

"Crash Course" of Major Features, for the Impatient:

import com.gensler.scalavro.types.AvroType
import com.gensler.scalavro.io.AvroTypeIO
import scala.util.{ Try, Success, Failure }

// obtaining an instance of AvroType
val intSeqType = AvroType[Seq[Int]]

// obtaining an Avro schema for a given AvroType
intSeqType.schema

// obtaining an AvroTypeIO object for a given AvroType (via the `io` method)
val io: AvroTypeIO[Seq[Int]] = intSeqType.io

// binary I/O (outputStream and inputStream are any java.io.OutputStream / java.io.InputStream)
io.write(Seq(1, 2, 3), outputStream)
val Success(binaryResult) = io read inputStream

// json I/O
val json = io writeJson Seq(1, 2, 3) // [1,2,3]
val Success(jsonResult) = io readJson json

Obtaining Scalavro

The current release is 0.7.1-magine, built against Scala 2.11.8.

Using SBT:

libraryDependencies += "com.gensler" %% "scalavro" % "0.7.1-magine"

API Documentation

Index of Examples

Type Mapping Strategy

Primitive Types

| Scala Type | Avro Type |
|---|---|
| Unit | null |
| Boolean | boolean |
| Byte | int |
| Char | int |
| Short | int |
| Int | int |
| Long | long |
| Float | float |
| Double | double |
| String | string |
| scala.xml.Node | string |
| scala.collection.Seq[Byte] | bytes |

Complex Types

| Scala Type | Avro Type |
|---|---|
| scala.Array[T] | array |
| scala.collection.Seq[T] | array |
| scala.collection.Set[T] | array |
| scala.collection.Map[String, T] | map |
| scala.Enumeration#Value | enum |
| Java enum | enum |
| scala.util.Either[A, B] | union |
| scala.util.Option[T] | union |
| com.gensler.scalavro.util.Union[U] | union |
| com.gensler.scalavro.util.FixedData | fixed |
| Supertypes of case classes without type parameters | union |
| Case classes without type parameters | record |

General Information

  • Built against Scala 2.11.1 with SBT 0.13.5
  • Depends upon spray-json
  • Depends upon the Apache Java implementation of Avro (Version 1.7.7)

Current Capabilities

  • Dynamic Avro schema generation from vanilla Scala types
  • Avro protocol definitions and schema generation
  • Support for recursively defined record types
  • Convenient, dynamic binary and JSON (de)serialization
  • Avro RPC protocol representation and schema generation
  • Schema conversion to "Parsing Canonical Form" (useful for Avro RPC protocol applications)

Current Limitations

  • Schema resolution (taking the writer's schema into account when reading) is not yet implemented
  • Although recursively defined records (case classes) are supported, not every instance of such a type can be serialized. In particular, reading and writing cyclic object graphs is not supported.
  • Although records are supported (via case classes), only the case class's default constructor parameters are serialized.
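
The last limitation can be illustrated without Scalavro. The sketch below assumes that "default constructor" means the primary constructor, and uses a hypothetical `Counter` case class together with the standard `productIterator` method, which exposes exactly the constructor parameters of a case class. It is an analogy for what a round trip preserves, not Scalavro's actual implementation.

```scala
object ConstructorParamsSketch {
  // Hypothetical record type: `start` is a constructor parameter,
  // `hits` is mutable state declared in the class body.
  case class Counter(start: Int) {
    var hits: Int = 0
  }

  // The values a case class exposes as a Product: constructor parameters only.
  def constructorValues(p: Product): List[Any] = p.productIterator.toList

  // Simulates a round trip that carries only the constructor parameters,
  // so anything stored in the class body is reset to its initial value.
  def roundTrip(c: Counter): Counter = Counter(c.start)
}
```

In this sketch, a `Counter` whose `hits` field was modified comes back from `roundTrip` with `hits` reset to `0`, even though the two instances still compare equal.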

Scalavro by Example: Schema Generation

Arrays

scala.Array

import com.gensler.scalavro.types.AvroType
AvroType[Array[String]].schema

Which yields:

{
  "type" : "array",
  "items" : "string"
}

scala.Seq

import com.gensler.scalavro.types.AvroType
AvroType[Seq[String]].schema

Which yields:

{
  "type" : "array",
  "items" : "string"
}

scala.Set

import com.gensler.scalavro.types.AvroType
AvroType[Set[String]].schema

Which yields:

{
  "type" : "array",
  "items" : "string"
}

Maps

import com.gensler.scalavro.types.AvroType
AvroType[Map[String, Double]].schema

Which yields:

{
  "type" : "map",
  "values" : "double"
}

Enums

scala.Enumeration

package com.gensler.scalavro.test
import com.gensler.scalavro.types.AvroType

object CardinalDirection extends Enumeration {
  type CardinalDirection = Value
  val N, NE, E, SE, S, SW, W, NW = Value
}

import CardinalDirection._
AvroType[CardinalDirection].schema

Which yields:

{
  "name" : "CardinalDirection",
  "type" : "enum",
  "symbols" : ["N","NE","E","SE","S","SW","W","NW"],
  "namespace" : "com.gensler.scalavro.test.CardinalDirection"
}
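
As a rough illustration of where the `symbols` list can come from, the sketch below reads the value names of an `Enumeration` using only the standard library. The helper name `symbols` is hypothetical, and whether Scalavro derives the list this way internally is an assumption.

```scala
object EnumSymbolsSketch {
  object CardinalDirection extends Enumeration {
    type CardinalDirection = Value
    val N, NE, E, SE, S, SW, W, NW = Value
  }

  // Enumeration values are ordered by id, which follows declaration order here,
  // and each value's toString is the name of the val it was assigned to.
  def symbols(e: Enumeration): List[String] = e.values.toList.map(_.toString)
}
```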

Java enum

Definition (Java):

package com.gensler.scalavro.test;
enum JCardinalDirection { N, NE, E, SE, S, SW, W, NW };

Use (Scala):

import com.gensler.scalavro.types.AvroType
import com.gensler.scalavro.test.JCardinalDirection

AvroType[JCardinalDirection].schema

Which yields:

{
  "name" : "JCardinalDirection",
  "type" : "enum",
  "symbols" : ["N","NE","E","SE","S","SW","W","NW"],
  "namespace" : "com.gensler.scalavro.test"
}

Unions

scala.Either

package com.gensler.scalavro.test
import com.gensler.scalavro.types.AvroType

AvroType[Either[Int, Boolean]].schema

Which yields:

["int", "boolean"]

and

AvroType[Either[Seq[Double], Map[String, Seq[Int]]]].schema

Which yields:

[{
  "type" : "array",
  "items" : "double"
},
{
  "type" : "map",
  "values" : {
    "type" : "array",
    "items" : "int"
  }
}]

scala.Option

package com.gensler.scalavro.test
import com.gensler.scalavro.types.AvroType

AvroType[Option[String]].schema

Which yields:

["null", "string"]
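
The Avro specification encodes a union value by first writing the zero-based position of the value's schema within the union. The sketch below shows how `Option` and `Either` values would line up with the union schemas shown above under that rule. The helper names are hypothetical, and that Scalavro's encoder selects branches exactly this way is an assumption.

```scala
object UnionBranchSketch {
  // Zero-based position of the value's schema within ["null", T].
  def optionBranch(value: Option[_]): Int = value match {
    case None    => 0 // the "null" branch
    case Some(_) => 1 // the T branch
  }

  // Zero-based position of the value's schema within [A, B].
  def eitherBranch(value: Either[_, _]): Int = value match {
    case Left(_)  => 0 // the A branch
    case Right(_) => 1 // the B branch
  }
}
```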

com.gensler.scalavro.util.Union.union

import com.gensler.scalavro.types.AvroType
import com.gensler.scalavro.util.Union._

AvroType[union [Int] #or [String] #or [Boolean]].schema

Which yields:

["int", "string", "boolean"]

Fixed-Length Data

package com.gensler.scalavro.test

import com.gensler.scalavro.types.AvroType
import com.gensler.scalavro.util.FixedData
import scala.collection.immutable

@FixedData.Length(16)
case class MD5(override val bytes: immutable.Seq[Byte])
           extends FixedData(bytes)

AvroType[MD5].schema

Which yields:

{
  "name": "MD5",
  "type": "fixed",
  "size": 16,
  "namespace": "com.gensler.scalavro.test"
}
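
To produce bytes that fit such a 16-byte fixed type, one option is the JDK's `MessageDigest`. The sketch below is standalone and does not touch Scalavro; the helper names `md5Bytes` and `fitsFixed` are hypothetical.

```scala
import java.security.MessageDigest
import scala.collection.immutable

object FixedLengthSketch {
  // Computes the MD5 digest of a string as an immutable byte sequence.
  // An MD5 digest is always 16 bytes long.
  def md5Bytes(text: String): immutable.Seq[Byte] =
    MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8")).toList

  // Hypothetical guard: true if the sequence has exactly the declared fixed size.
  def fitsFixed(bytes: immutable.Seq[Byte], size: Int): Boolean =
    bytes.length == size
}
```

The result of `FixedLengthSketch.md5Bytes("hello")` could then be passed to the `MD5` case class defined above.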

Records

From case classes

package com.gensler.scalavro.test
import com.gensler.scalavro.types.AvroType

case class Person(name: String, age: Int)

val personAvroType = AvroType[Person]
personAvroType.schema

Which yields:

{
  "name": "com.gensler.scalavro.test.Person",
  "type": "record",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    }
  ]
}

And perhaps more interestingly:

case class SantaList(nice: Seq[Person], naughty: Seq[Person])

val santaListAvroType = AvroType[SantaList]
santaListAvroType.schema

Which yields:

{
  "name": "com.gensler.scalavro.test.SantaList",
  "type": "record",
  "fields": [
    {
      "name": "nice",
      "type": {
        "type": "array",
        "items": [
          {
            "name": "com.gensler.scalavro.test.Person",
            "type": "record",
            "fields": [
              {
                "name": "name",
                "type": "string"
              },
              {
                "name": "age",
                "type": "int"
              }
            ]
          },
          {
            "name": "com.gensler.scalavro.Reference",
            "type": "record",
            "fields": [
              {
                "name": "id",
                "type": "long"
              }
            ]
          }
        ]
      }
    },
    {
      "name": "naughty",
      "type": {
        "type": "array",
        "items": [
          "com.gensler.scalavro.test.Person",
          "com.gensler.scalavro.Reference"
        ]
      }
    }
  ]
}

Whoa -- what happened there?!

As of version 0.5.1, Scalavro supports reference tracking for record instances. Every time Scalavro writes a record to binary, it saves the source object reference and assigns it a reference number. If that same instance needs to be written again, Scalavro simply writes the reference number instead. Scalavro reverses this process when reading from binary. Therefore, if references to shared data exist in the source object graph, those references will also be shared in the deserialized object graph. This imposes little performance penalty during serialization, and in general reduces serialized data size as well as deserialization time.

Scalavro implements this by replacing any nested record type within a schema with a binary union of the target type and a Reference schema. References are encoded as an Avro long value. Here is the schema for Reference:

{
  "name": "com.gensler.scalavro.Reference",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "type": "long"
    }
  ]
}
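
The reference-tracking idea described above can be sketched in a few lines of plain Scala. This is a simplified model, not Scalavro's code: the `Token`, `Full`, and `Ref` types, the `write` and `read` helpers, and the choice to start reference numbers at 0 are all assumptions made for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object ReferenceTrackingSketch {
  case class Person(name: String, age: Int)

  // What gets written for each record: the full value or a back-reference.
  sealed trait Token
  case class Full(person: Person) extends Token
  case class Ref(id: Long) extends Token

  // Writing: the first time an instance is seen it is written in full and
  // remembered; later occurrences of the same instance become references.
  // Instances are compared by identity (eq), not by structural equality.
  def write(people: Seq[Person]): List[Token] = {
    val seen = ArrayBuffer.empty[Person] // index in this buffer = reference number
    people.toList.map { p =>
      val id = seen.indexWhere(_ eq p) // linear scan, kept simple for clarity
      if (id >= 0) Ref(id.toLong)
      else { seen += p; Full(p) }
    }
  }

  // Reading: full records are rebuilt and remembered in the same order,
  // so a reference can be resolved to the instance that was already read.
  def read(tokens: Seq[Token]): List[Person] = {
    val seen = ArrayBuffer.empty[Person]
    tokens.toList.map {
      case Full(p) =>
        val fresh = p.copy() // stands in for a newly deserialized instance
        seen += fresh
        fresh
      case Ref(id) => seen(id.toInt)
    }
  }
}
```

In this model, writing `Seq(eve, twin, eve)`, where `twin` is equal to `eve` but a distinct instance, yields two full records followed by one reference, and reading the tokens back restores the sharing between the first and third elements.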

For comparison, in versions of Scalavro before 0.5.1, the SantaList schema looked like this:

{
  "name": "SantaList",
  "type": "record",
  "fields": [
    {
      "name": "nice",
      "type": {"type": "array", "items": "Person"}
    },
    {
      "name": "naughty",
      "type": {"type": "array", "items": "Person"}
    }
  ],
  "namespace": "com.gensler.scalavro.test"
}

Here is an example of a simple recursively defined type (a singly-linked list):

package com.gensler.scalavro.test
import com.gensler.scalavro.types.AvroType

case class Strings(data: String, next: Option[Strings])

AvroType[Strings].schema

Which yields:

{
  "name": "com.gensler.scalavro.test.Strings",
  "type": "record",
  "fields": [
    {
      "name": "data",
      "type": "string"
    },
    {
      "name": "next",
      "type": [
        "null",
        [
          "com.gensler.scalavro.test.Strings",
          {
            "name": "com.gensler.scalavro.Reference",
            "type": "record",
            "fields": [
              {
                "name": "id",
                "type": "long"
              }
            ]
          }
        ]
      ]
    }
  ]
}

From supertypes of case classes

Given:

package com.gensler.scalavro.test

abstract class Alpha { def magic: Double }
class Beta extends Alpha { val magic = math.Pi }
case class Gamma(magic: Double) extends Alpha
case class Delta() extends Beta
case class Epsilon[T]() extends Beta
case class AlphaWrapper(inner: Alpha) extends Alpha { def magic = inner.magic }

Usage:

import com.gensler.scalavro.types.AvroType
AvroType[Alpha].schema

Which yields:

[
  [
    {
      "name": "com.gensler.scalavro.test.Delta",
      "type": "record",
      "fields": []
    },
    {
      "name": "com.gensler.scalavro.Reference",
      "type": "record",
      "fields": [
        {
          "name": "id",
          "type": "long"
        }
      ]
    }
  ],
  [
    {
      "name": "com.gensler.scalavro.test.Gamma",
      "type": "record",
      "fields": [
        {
          "name": "magic",
          "type": "double"
        }
      ]
    },
    "com.gensler.scalavro.Reference"
  ],
  [
    {
      "name": "com.gensler.scalavro.test.AlphaWrapper",
      "type": "record",
      "fields": [
        {
          "name": "inner",
          "type": [
            [
              "com.gensler.scalavro.test.Delta",
              "com.gensler.scalavro.Reference"
            ],
            [
              "com.gensler.scalavro.test.Gamma",
              "com.gensler.scalavro.Reference"
            ],
            [
              "com.gensler.scalavro.test.AlphaWrapper",
              "com.gensler.scalavro.Reference"
            ]
          ]
        }
      ]
    },
    "com.gensler.scalavro.Reference"
  ]
]

Note that in the above example:

  • Alpha is excluded from the union because it is abstract (and not a case class)
  • Beta is excluded from the union because it is not a case class
  • Epsilon is excluded from the union because it takes type parameters
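
The exclusion rules in this list can be approximated with plain Java reflection. The sketch below is only a rough stand-in: Scalavro uses Scala runtime reflection, the `qualifies` helper is hypothetical, and "implements `Product`" is used here as an imperfect proxy for "is a case class".

```scala
import java.lang.reflect.Modifier

object UnionMemberSketch {
  abstract class Alpha { def magic: Double }
  class Beta extends Alpha { val magic: Double = math.Pi }
  case class Gamma(magic: Double) extends Alpha
  case class Delta() extends Beta
  case class Epsilon[T]() extends Beta

  // A class qualifies as a union member in this sketch if it looks like a
  // concrete case class that takes no type parameters.
  def qualifies(c: Class[_]): Boolean =
    classOf[Product].isAssignableFrom(c) &&   // proxy for "is a case class"
    !Modifier.isAbstract(c.getModifiers) &&   // must be concrete
    c.getTypeParameters.isEmpty               // must not be generic
}
```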

Scalavro by Example: Binary IO

import com.gensler.scalavro.types.AvroType
import com.gensler.scalavro.io.AvroTypeIO
import scala.util.{Try, Success, Failure}

case class Person(name: String, age: Int)
case class SantaList(nice: Seq[Person], naughty: Seq[Person])

val santaList = SantaList(
  nice = Seq(
    Person("John", 17),
    Person("Eve", 3)
  ),
  naughty = Seq(
    Person("Jane", 25),
    Person("Alice", 65)
  )
)

val santaListType = AvroType[SantaList]

// an in-memory stream stands in for any java.io.OutputStream
val outStream = new java.io.ByteArrayOutputStream

santaListType.io.write(santaList, outStream)

// read back the bytes that were just written
val inStream = new java.io.ByteArrayInputStream(outStream.toByteArray)

santaListType.io.read(inStream) match {
  case Success(readResult) => assert(readResult == santaList) // true
  case Failure(cause)      => // handle failure...
}

Scalavro by Example: JSON IO

import com.gensler.scalavro.types.AvroType
import com.gensler.scalavro.io.AvroTypeIO
import scala.util.{Try, Success, Failure}

case class Person(name: String, age: Int)
case class SantaList(nice: Seq[Person], naughty: Seq[Person])

val santaList = SantaList(
  nice = Seq(
    Person("John", 17),
    Person("Eve", 3)
  ),
  naughty = Seq(
    Person("Jane", 25),
    Person("Alice", 65)
  )
)

val santaListType = AvroType[SantaList]

val json = santaListType.io writeJson santaList

/*
  json.prettyPrint now yields:

  {
    "nice": [{
      "name": "John",
      "age": 17
    }, {
      "name": "Eve",
      "age": 3
    }],
    "naughty": [{
      "name": "Jane",
      "age": 25
    }, {
      "name": "Alice",
      "age": 65
    }]
  }

*/

santaListType.io.readJson(json) match {
  case Success(readResult) => assert(readResult == santaList) // true
  case Failure(cause)      => // handle failure...
}

A Neat Fact about Scalavro's IO Capabilities

Scalavro tries to produce read results whose runtime types are as accurate as possible for collections (the supported collection types are Seq, Set, and Map). It accomplishes this by looking for either a varargs apply factory method or a nullary Builder-valued method on the target type's companion object. This is why AvroType[ArrayBuffer[Int]].io.read(…) is able to return a Try[ArrayBuffer[Int]], for example.

This also works for custom subtypes of the supported collection types -- as long as you define at least one of the acceptable factory methods described above in the companion, you're good to go.
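
For reference, here is what the two kinds of factory method look like on a standard collection companion, using `ArrayBuffer` from the Scala library. How Scalavro locates and invokes them through reflection is not shown; the wrapper names `viaApply` and `viaBuilder` are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

object CollectionFactorySketch {
  // Option 1: a varargs `apply` factory method on the companion object.
  def viaApply(items: Seq[Int]): ArrayBuffer[Int] = ArrayBuffer(items: _*)

  // Option 2: a nullary, Builder-valued method on the companion object.
  def viaBuilder(items: Seq[Int]): ArrayBuffer[Int] = {
    val builder = ArrayBuffer.newBuilder[Int]
    builder ++= items
    builder.result()
  }
}
```

A custom collection type whose companion offers either of these shapes should, per the description above, be usable as a read target.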

Legal

Apache Avro is a trademark of The Apache Software Foundation.

Scalavro is distributed under the BSD 2-Clause License, the text of which follows:

Copyright (c) 2013, Gensler
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.