Apache Spark and Storm. There is no war here.


Usually it's framed as a war of the giants, with Spark running in micro-batches and Storm processing streams in real time.

At BPRISE, we found that the two can co-exist beautifully in our ecosystem. Each of these Apache systems has matured in its own field. While Spark has added quite a bit of depth for stream analysis, it is still not on par with Apache Storm for that job.

As I write this post, we are well on our way to analyzing over 1.5 billion data points annually. That required architectural thinking to ensure that we are future-proof.

Of course, we have done our share of flip-flopping on technology over the past few months, and some of those experiments are still awaiting migration. We also know which tool is likely to come next for data analysis, something better than Spark and Storm. But product maturity and support hinder adoption of the latest in the field.

How would Apache Spark help us?

Coupled with Microsoft Azure and HDInsight, Spark lets us scale the solution in a geo-redundant manner, both in-country and globally. These are critical aspects to get right from the start. Of course, one can use Amazon or Google for the same purpose; both are great cloud providers. We went with Azure because Microsoft was the first to support our effort to build our solution, through the BizSpark program. And yes, Azure is very good.

The biggest challenge Spark will help us with is handling petabytes of data tomorrow on standard hardware, processing information across clusters to build answers fast.

And Apache Storm?

At any point in time, there are customers in our clients' stores around the globe. We cannot rely on historical data alone to provide solutions. Depending on the age of the data already in hand, real-time data plays a major role in defining the right strategy. Imagine I knew that, historically, you were interested in home theater systems. Will it help if I send you a message with an offer today, when you enter the store? Or should real-time information be brought into the mix? What if you are purchasing a television? Does the home theater go with it? What if you didn’t purchase the television today? Can we combine a good offer on a television and a home theater system and send something relevant?

One could argue that this can be achieved with Spark. But the game changes when you take the solution beyond in-app notifications. Without going into the specifics of the breadth of our solution, I can safely say that Storm has a major role to play wherever latency and real-time streaming analysis make or break our entire effort.
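To make the home theater scenario concrete, here is a minimal Python sketch of combining a historical interest with what is happening in the store right now. The function name, the category labels, and the offer names are all hypothetical, chosen only for illustration; this is not the actual BPRISE logic, and a real system would evaluate such rules inside the streaming pipeline rather than in a standalone function.

```python
def choose_offer(historical_interests, purchased_today):
    """Pick an offer from past interests plus today's in-store activity.

    Both arguments are sets of hypothetical category labels.
    Returns a hypothetical offer name, or None when nothing relevant applies.
    """
    interested_in_home_theater = "home_theater" in historical_interests
    bought_tv_today = "television" in purchased_today

    if not interested_in_home_theater:
        # No historical signal to act on in this simplified example.
        return None
    if bought_tv_today:
        # Real-time signal: a television was just purchased,
        # so offer the home theater that goes with it.
        return "home_theater_offer"
    # No television purchase today: combine both into one offer.
    return "tv_and_home_theater_bundle"


if __name__ == "__main__":
    print(choose_offer({"home_theater"}, {"television"}))
    print(choose_offer({"home_theater"}, set()))
    print(choose_offer({"cameras"}, {"television"}))
```

The point of the sketch is only that the decision changes with the real-time input: the same historical interest leads to different offers depending on what the customer does today.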

The challenges we see going forward?

  1. Real-time analysis also needs to deal with a lot of noise that skews data. Hence the analysis needs the application of filters (again in real-time) to smooth the data and de-noise it.
  2. Spark and Storm each favor either data scientists, who can use languages such as Python or R with a host of pre-built libraries, or Java developers, but not both. This creates challenges for visual design and for the actual analysis. At the end of the day, the customer should be able to use this data meaningfully for actionable insights.
  3. Monitoring systems are stuck in the past. It is almost impossible to monitor these systems in real time for performance. It is as if the many contributors to the Hadoop ecosystem deliberately kept this feature out to favor the companies that provide Hadoop as a service. While we are not against any such company, this places an additional burden on startups.
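As an illustration of the first challenge, here is a minimal Python sketch of a filter that smooths a stream as values arrive. It uses a rolling median, which is just one common choice for suppressing isolated spikes; the class name, the window size, and the sample readings are assumptions made for this example, not a description of the filters we actually run.

```python
from collections import deque
from statistics import median


class RollingMedianFilter:
    """Streaming median filter over the most recent `window` readings."""

    def __init__(self, window=5):
        if window < 1:
            raise ValueError("window must be at least 1")
        # A bounded deque drops the oldest reading automatically.
        self._buf = deque(maxlen=window)

    def update(self, value):
        """Add one reading and return the smoothed value so far."""
        self._buf.append(value)
        return median(self._buf)


def smooth(stream, window=5):
    """Lazily smooth any iterable of numeric readings, one value at a time."""
    f = RollingMedianFilter(window)
    for value in stream:
        yield f.update(value)


if __name__ == "__main__":
    # Hypothetical readings with one noisy spike (95).
    readings = [10, 11, 10, 95, 11, 10, 12]
    print(list(smooth(readings, window=3)))
```

With a window of three, the single spike never reaches the output, because the median ignores one outlier among three values. The trade-off is a short delay in reacting to genuine changes, which grows with the window size.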

If you are a Hadoop know-it-all, well experienced, a problem solver, and live in Mumbai, then we are looking for you. Contact us via this link.