com.twitter.summingbird.example
Summingbird's implementation of the batch/realtime merge requires that the Storm-based workflow store (K, BatchID) -> V pairs and that the Hadoop-based workflow store K -> (BatchID, V) pairs.
The following two injections use Bijection's "Bufferable" object to generate injections that take (T, BatchID) or (BatchID, T) to bytes.
For true production applications, I'd suggest defining a Thrift or protobuf "pair" structure that can safely store these pairs over the long term.
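To make the two pair layouts concrete, here is a minimal, self-contained sketch of how a (T, BatchID) pair and a (BatchID, T) pair could each be taken to bytes and back. It does not use Bijection's Bufferable. BatchId, PairBytes, the method names and the length-prefixed layout are all hypothetical stand-ins chosen for illustration, with T fixed to String.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.Try

// Hypothetical stand-in for Summingbird's BatchID: just a wrapped Long.
final case class BatchId(id: Long)

object PairBytes {
  // (key, batch) -> bytes. Assumed layout for this sketch only:
  // 4-byte key length, UTF-8 key bytes, 8-byte batch id.
  def keyBatchToBytes(pair: (String, BatchId)): Array[Byte] = {
    val (key, batch) = pair
    val keyBytes = key.getBytes(UTF_8)
    ByteBuffer
      .allocate(4 + keyBytes.length + 8)
      .putInt(keyBytes.length)
      .put(keyBytes)
      .putLong(batch.id)
      .array()
  }

  // bytes -> (key, batch). Decoding can fail on malformed input, so the
  // result is wrapped in Try.
  def bytesToKeyBatch(bytes: Array[Byte]): Try[(String, BatchId)] = Try {
    val buf = ByteBuffer.wrap(bytes)
    val len = buf.getInt()
    require(len >= 0 && len <= buf.remaining() - 8, s"bad key length: $len")
    val keyBytes = new Array[Byte](len)
    buf.get(keyBytes)
    (new String(keyBytes, UTF_8), BatchId(buf.getLong()))
  }

  // (batch, value) -> bytes reuses the same layout by swapping the tuple.
  def batchValueToBytes(pair: (BatchId, String)): Array[Byte] =
    keyBatchToBytes(pair.swap)

  def bytesToBatchValue(bytes: Array[Byte]): Try[(BatchId, String)] =
    bytesToKeyBatch(bytes).map(_.swap)
}
```

Because the layout is written out by hand, it does not change unless someone edits it, which is the property a Thrift or protobuf pair structure would provide with better tooling.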
This Injection converts the twitter4j.Status objects that Storm and Scalding will process into Strings.
We can chain the Status <-> String injection above with the library-supplied String <-> Array[Byte] injection to generate a full-on serializer for Status objects of the type Injection[Status, Array[Byte]]. Our Storm and Scalding sources can now pull in this injection using Scala's implicit resolution and properly register the serializer.
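The chaining idea can be sketched without any library dependency. The Inj trait below is a hypothetical, simplified stand-in for Bijection's Injection (a forward conversion that always succeeds and an inverse that may fail), and Tweet is a hypothetical stand-in for twitter4j.Status. The andThen method, the "id|text" string format and all names in ChainDemo are assumptions made for this example, not the library's API.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.Try

// Hypothetical stand-in for an injection: apply always succeeds,
// invert may fail on input that apply could not have produced.
trait Inj[A, B] { self =>
  def apply(a: A): B
  def invert(b: B): Try[A]

  // Chain two injections: A -> B -> C going forward, C -> B -> A inverting.
  def andThen[C](next: Inj[B, C]): Inj[A, C] = new Inj[A, C] {
    def apply(a: A): C = next(self(a))
    def invert(c: C): Try[A] = next.invert(c).flatMap(b => self.invert(b))
  }
}

// Hypothetical stand-in for twitter4j.Status.
final case class Tweet(id: Long, text: String)

object ChainDemo {
  // Tweet <-> String, using a toy "id|text" format (an assumption).
  val tweetToString: Inj[Tweet, String] = new Inj[Tweet, String] {
    def apply(t: Tweet): String = s"${t.id}|${t.text}"
    def invert(s: String): Try[Tweet] = Try {
      s.split("\\|", 2) match {
        case Array(id, text) => Tweet(id.toLong, text)
        case _               => throw new IllegalArgumentException(s"not a tweet: $s")
      }
    }
  }

  // String <-> Array[Byte] via UTF-8.
  val stringToBytes: Inj[String, Array[Byte]] = new Inj[String, Array[Byte]] {
    def apply(s: String): Array[Byte] = s.getBytes(UTF_8)
    def invert(b: Array[Byte]): Try[String] = Try(new String(b, UTF_8))
  }

  // The chained serializer: Tweet <-> Array[Byte].
  val tweetToBytes: Inj[Tweet, Array[Byte]] = tweetToString.andThen(stringToBytes)
}
```

Calling ChainDemo.tweetToBytes(tweet) yields the bytes, and ChainDemo.tweetToBytes.invert(bytes) returns a Try that fails if the bytes do not decode to a well-formed tweet string.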
Serialization is often the most important (and hairy) configuration issue for any system that needs to store its data over the long term. Summingbird controls serialization through the "Injection" interface.
By maintaining identical Injections from K and V to Array[Byte], one can guarantee that data written one day will be readable the next. This isn't the case with serialization engines like Kryo, where serialization format depends on unstable parameters, like the serializer registration order for the given Kryo instance.