If you've ever had to merge two Excel spreadsheets then you know how challenging it can be to work with data from multiple sources. At LinkedIn, where tables with hundreds of millions of records are far from unheard of, simply merging those giant datasets for routine queries began to take up massive amounts of time and resources.
"It slowed down the queries and even made them infeasible to execute in our Hadoop environment," says LinkedIn engineer Srinivas Vemuri, who worked on the number-crunching pipeline for XLNT, LinkedIn's A/B testing platform. "The size of the intermediate output of the join was explosive."
That led engineers working on XLNT to carefully craft a suite of Java code that pulls in the necessary data from across the company, according to a blog post by Vemuri and fellow engineer Maneesh Varshney.
"Written completely in Java and built using several novel primitives, the new system proved effective in handling joins and aggregations on hefty datasets which allowed us to successfully launch [the framework called] XLNT," the engineers wrote earlier this month.
The code divides the rows of each table into manageably sized blocks, with rows from different tables that refer to the same user guaranteed to land in corresponding blocks. That lets many useful statistics, like tallies and averages, be computed block by block without ever storing the whole merged dataset in memory, which made it possible to generate A/B test results in a reasonable amount of time.
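To make the block-by-block idea concrete, here is a minimal Python sketch. It is not LinkedIn's Java code; the table layouts, the `NUM_BLOCKS` value, and the modulo-based `block_id` function are all assumptions chosen for illustration. The point it shows is that when both tables are split by the same function of the user ID, each pair of corresponding blocks can be joined and tallied on its own.

```python
from collections import defaultdict

NUM_BLOCKS = 4  # hypothetical; a real system would size blocks to fit in memory


def block_id(user_id: int) -> int:
    """Map a user to a block. The same user always maps to the same block."""
    return user_id % NUM_BLOCKS


def make_blocks(rows, key):
    """Split a list of row dicts into blocks according to block_id(row[key])."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[block_id(row[key])].append(row)
    return blocks


def clicks_per_variant(assignments, clicks):
    """Total clicks per A/B variant, joining one pair of blocks at a time."""
    assignment_blocks = make_blocks(assignments, "user_id")
    click_blocks = make_blocks(clicks, "user_id")
    totals = defaultdict(int)
    for b in range(NUM_BLOCKS):
        # Only this block's assignments are held in the lookup table.
        variant_of = {r["user_id"]: r["variant"] for r in assignment_blocks.get(b, [])}
        for click in click_blocks.get(b, []):
            variant = variant_of.get(click["user_id"])
            if variant is not None:  # users without an assignment are skipped
                totals[variant] += click["count"]
    return dict(totals)


assignments = [
    {"user_id": 1, "variant": "A"},
    {"user_id": 2, "variant": "B"},
    {"user_id": 3, "variant": "A"},
    {"user_id": 6, "variant": "B"},
]
clicks = [
    {"user_id": 1, "count": 3},
    {"user_id": 2, "count": 5},
    {"user_id": 3, "count": 2},
    {"user_id": 6, "count": 1},
    {"user_id": 9, "count": 7},  # no matching assignment
]
print(clicks_per_variant(assignments, clicks))
```

Only the running totals survive from one block to the next, so the full joined table never has to exist at once.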
"However, we soon became victims of our own success," wrote the engineers. "We were faced with extended requirements as well as new use cases from other domains of data analytics. Adding to the challenge was maintaining the Java code and in some cases, rewriting large portions to accommodate various applications."
The company decided to build a general-purpose tool on that block-by-block principle, creating an open-source framework called Cubert. Varshney, Vemuri, and other LinkedIn staff described the principles in detail in a conference paper published in September.
"We started off with the A/B testing analytics problem, and we went ambitious and decided to generalize these primitives," says Vemuri. "We were very surprised by the diverse nature of the use cases."
With the data broken into the right blocks, Cubert (which takes its name partially as a tribute to the classic block-sliding Rubik's Cube puzzle) makes it efficient to compute statistics broken down by a variety of variables, such as user clicks by day of week, time of day, or demographic group, the engineers say.
Those kinds of statistical breakdowns are traditionally represented by an OLAP cube—a multidimensional plot where each dimension represents one of the factors in the breakdown. Different points within the cube correspond to the different possible sets of values those variables could take on.
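As a rough illustration of what an OLAP cube holds, the Python sketch below sums a measure for every combination of dimension values, using "*" to mark a dimension that has been rolled up. The function name `build_cube`, the sample rows, and the "*" convention are assumptions made for this example and say nothing about how Cubert stores or computes its cubes.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Sum `measure` over every subset of `dimensions`.

    Each cell is keyed by a tuple with one entry per dimension; "*" means
    that dimension was aggregated away.
    """
    cube = defaultdict(int)
    # Every subset of the dimensions is one grouping of the data.
    groupings = [
        set(kept)
        for size in range(len(dimensions) + 1)
        for kept in combinations(dimensions, size)
    ]
    for row in rows:
        for kept in groupings:
            cell = tuple(row[d] if d in kept else "*" for d in dimensions)
            cube[cell] += row[measure]
    return dict(cube)


click_rows = [
    {"day": "Mon", "hour": 9, "clicks": 4},
    {"day": "Mon", "hour": 17, "clicks": 1},
    {"day": "Tue", "hour": 9, "clicks": 2},
]
cube = build_cube(click_rows, ["day", "hour"], "clicks")
print(cube[("Mon", "*")])  # clicks on Mondays at any hour
print(cube[("*", 9)])      # clicks at hour 9 on any day
print(cube[("*", "*")])    # grand total
```

With two dimensions there are four groupings per row; the number doubles with each added dimension, which is one reason computing cubes efficiently over large data is hard.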
Cubert is programmable through a custom scripting language similar to SQL, so analytics experts don't have to write their own Java code to take advantage of its speed. A "blockgen" statement, for instance, breaks data down into blocks based on specified factors; a "cube" statement constructs an OLAP cube along specified dimensions.
"It should have the flexibility and the control to go and describe exactly how my algorithm should be run, and it should be simple enough as a scripting language to be written quickly and easy to discuss with somebody," says Varshney.
They also discovered the framework is surprisingly well suited to many network graph problems common at LinkedIn, like finding possible connections between friends of friends.
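A friends-of-friends search can be framed as joining a list of connections with itself on the shared middle person, which hints at why join-oriented primitives carry over to graphs. The Python sketch below is an in-memory illustration under that assumption, with made-up names and a hypothetical `friends_of_friends` helper; it does not reproduce Cubert's graph extensions.

```python
from collections import defaultdict


def friends_of_friends(edges):
    """Return {user: set of second-degree contacts} for undirected edges."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    suggestions = defaultdict(set)
    # Pair up every two friends of the same middle user (a self-join on the
    # middle user), then keep pairs that are not already connected.
    for middle, friends in neighbors.items():
        for user in friends:
            for candidate in friends:
                if candidate != user and candidate not in neighbors[user]:
                    suggestions[user].add(candidate)
    return dict(suggestions)


edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
print(friends_of_friends(edges))
```

Here "ann" and "cat" are suggested to each other through "bob", and "bob" and "dan" through "cat".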
"Graph processing is technically a very interesting subject area for us," says Vemuri. "What we found is these same sort of primitives, with some twists and some extensions, can perform graph processing very efficiently."
The engineers say they hope making Cubert open source will enable engineers from different companies to work together on solving the kinds of problems it's suited to, rather than continuing to develop their own one-off solutions.