Scalaz Streams: Parallel computing and better testing

Hi there,

I finally made some time to write this post after being back home for holidays and switching teams in my current job. I’ve been working in this new project since February where we are widely using Scalaz Streams, which gave me the possibility to learn a lot (thanks Fred!).

Last year I wrote about designing a full project using Akka Streams being able to test every piece of the stream as a partial flow graph. Now I’ll talk about doing something similar using Scalaz Streams which has the advantage of being always in control of the concurrency and parallelism. If you want to see a comparison between them, check out the good Adam Warski’s post about it… So let’s start by defining a case!

Definition of the case

We need to consume messages of Orders of Items from Kafka, persisting every order to Cassandra, update the item prices by applying a margination profit and publish the updated order downstream to Rabbit MQ.

That’s the real case but for the sake of simplicity, in the following example I’m not going to use Kafka, Cassandra and Rabbit MQ but an in memory representation to simulate the same behavior.

So I started by drawing the streaming flow and how the different components are connected:


Looks pretty simple isn’t it? Well in this design every order has to be consumed from Kafka, be logged, persisted to Cassandra, updated by the pricer service and finally published to Rabbit MQ. In general, a processing overhead always occurs when I/O operations are involved, in this case the persisting phase. Hence parallelising the streaming between writing and reading is recommended. So this is how the graphic looks like after the flow has been split out in two:

With this design we can continuously consume and persisting orders and in parallel we can read, update and publish the orders downstream.

Okay, show me the code….

Fair enough! In Scalaz Streams a stream is represented as a Process[+F[_], +A] where F is the effect side and A the value side. The most general use case has a Task in place of F[_]. Then we have different type aliases of a Process that are more meaningful like Sink (used for side effects processing), Channel (effectful process that can send values and get back responses) and Writer (a Process[+F[_], +A] where A is a disjunction) among others. We also have an Exchange that represents the interconnection between two systems, a read side and a write side. Knowing about this components we can model the graphic in this way:

Note: If you are not so familiar with all this concepts I recommend you to read this user notes written by Devon Miller.

Kakfa      - consumer: Process[Task, Order]
Logger     - logger: Sink[Task, Order]
Cassandra  - storageEx: Exchange[Order, Order]
Pricer     - pricer: Channel[Task, Order, Order]
Rabbit MQ  - publisher: Sink[Task, Order])

Then we can connect every component as in the graphic and specify the parallel processing using

    consumer          observe   logger    to  storageEx.write,    through   pricer    to  publisher

The code above deserves a bit of explanation. I’m using that can merge N streams (Process[+F[_], +A]) in a nondeterministic way. This means that given stream1 and stream2 is going to take an element from the faster publisher. This means nondeterministic parallelization. It’s the way that the merge function works for Wye using two streams but more optimized (using nondeterminism.njoin under the hood).

This code is part of the PricerStream object. I’ve also created another process to generate random Orders to be consumed by the main process. All the source code for this demo is available on GitHub, check it out here!


Having the definition of the streaming flow isolated makes it easy to test the behavior in isolation. The flow method in the PricerStream object only defines how the components are connected and accepts each one of them as a parameter.  All we have to do is created an in memory representation of this components and invoke the flow. Then we can assert whether an order has been published or not.

For example, this is how we test the main flow defined above:

val (consumer, logger, storageEx, pricer, publisherEx, kafkaQ) = createStreams
val result = for {
  _           <-  eval(kafkaQ.enqueueOne(testOrder))
  _           <-  PricerStream.flow(consumer, logger, storageEx, pricer, publisherEx.write)
  update      <-
} yield {
  update.items foreach { item =>
    item.price should be (OrderTestGenerator.FixedUpdatedPrice)

Take a look at the source code to discover what’s going on here! The createStreams method is returning all the mocks that we need to simulate Kafka, Cassandra and RabbitMQ for instance. Then we create a stream in a for-comprehension style by publishing “testOrder” to the Kafka topic, invoking the Pricer Stream flow and finally asserting on the Order published downstream. This is quite simple once you understand how it works and how powerful it is!

Okay, that’s pretty much what I have today! There are a few points I would like to dig in deeper like benchmarks and performance that are actually beyond the topic of this post but hopefully I can write about it later on.

Recommended Resources

There are a lot of resources out there very useful. I’d like to point you specifically to the classic Daniel Spiewak’s talk and to the excellent Derek Chen Becker’s talk given at LambdaConf 2015. I’ve created a playground project with all the stuff I learned from this talks, check it out! And of course it’s always good to check the Additional Resources wiki page of FS2.

Until next post!


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s