On one of my Ruby on Rails projects, I worked on a story that involved performing calculations on some reference data stored in a database table. The table was like a dimension table (in data warehousing parlance) and was non-trivial in size: roughly 100 columns and 1000 rows. We had to calculate something similar to a median of a column, across all the rows, for each of the 100 columns. The challenge was to perform this calculation at request time without adding too much performance overhead.
My first response to any such calculation-intensive activity is to push it offline, to avoid the runtime calculation penalty. But in this case the calculation involved inputs from the user and hence did not lend itself very well to pre-processing. We could have done something like what an Oracle cube does, building aggregates of fact data beforehand to cut the processing time required. But in our case, pre-processing the results for all possible user input options would have been extremely wasteful in terms of storage. The other concern with this approach was that there would be some delay between the data being available and being usable, due to the pre-processing.
Then came the suggestion of manipulating the data in a stored procedure. It made me shudder! Having recently read Pramod Sadalage’s blog1 on the pain points of stored procedures, I certainly was not in the mood to accept this solution. The pain points outlined in the blog that particularly resonated with me were: no modern IDEs to support refactoring, needing a database just to compile a stored procedure, the immaturity of the unit testing frameworks, and vertical scaling being the only option to scale a database engine.
As we all know, Ruby is perceived to be slow and incapable of performing any computationally intensive tasks, and consequently it was not an option on the table. I thought to myself: Ruby can’t be that slow. The calculation we were doing was non-trivial, but not super involved either. I decided to give it a shot.
We started by doing all these calculations using ActiveRecord objects and found that the performance was not good at all. ActiveRecord was the culprit: it was creating all these objects in memory and considerably slowing down the calculation process. We ditched it, opted for straight SQL instead, stored the results in arrays, and performed the calculations on those arrays. Better! But not good enough. We found that we were performing operations like finding the max, finding the min, and ordering a given column’s values in Ruby, which wasn’t particularly performant. We delegated those to the database engine, since database engines are typically good at such things, and saw quite a hefty performance gain. With these simple tricks, we could pretty much get the performance we were looking for.
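Roughly, the pattern was along these lines. This is a simplified sketch assuming a Rails connection; the table and column names (reference_data, score) are made up for illustration, not the project’s actual schema:

```ruby
# Aggregates and ordering are delegated to the database engine...
max_score = ActiveRecord::Base.connection
              .select_value("SELECT MAX(score) FROM reference_data").to_f
values    = ActiveRecord::Base.connection
              .select_values("SELECT score FROM reference_data ORDER BY score")
              .map(&:to_f)

# ...while the remaining arithmetic runs on plain Ruby arrays, avoiding the
# cost of instantiating an ActiveRecord object per row.
mid    = values.length / 2
median = values.length.odd? ? values[mid] : (values[mid - 1] + values[mid]) / 2.0
```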
Even though we had solved the performance problem, our design had an unintended side effect. Given that we were processing data in arrays, outside of the objects it was fetched from, the code looked very procedural. To solve this problem, we created some meaningful abstractions to hold the data and operate on it. These weren’t at the same granularity as what ActiveRecord would have created in the first place, but at a much higher level. This was a good compromise between having too many objects on the one hand and procedural code on the other, while still getting the performance we desired.
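To give a flavour of what I mean, the abstraction could be a plain Ruby object that wraps one column’s values and owns the calculations on them. The class and method names here are hypothetical, not the ones we actually used:

```ruby
# Hypothetical sketch: one lightweight object per column, holding the raw
# values and owning the calculations on them.
class ColumnStatistics
  def initialize(name, values)
    @name   = name
    @values = values.sort
  end

  def median
    mid = @values.length / 2
    @values.length.odd? ? @values[mid] : (@values[mid - 1] + @values[mid]) / 2.0
  end

  def max
    @values.last
  end
end

# One object per column instead of one ActiveRecord object per row.
stats = ColumnStatistics.new("score", [3, 1, 4, 1, 5])
stats.median # => 3
```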
The biggest win for me in doing all these calculations in Ruby was keeping the business logic in one place, in the app, and being able to unit test the calculation logic exhaustively.
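Because the calculations live in plain Ruby objects, the tests stay fast and database-free. A minimal sketch, assuming an abstraction like the ColumnStatistics example above and using Minitest:

```ruby
require "minitest/autorun"

class ColumnStatisticsTest < Minitest::Test
  def test_median_of_even_sized_column
    stats = ColumnStatistics.new("score", [4, 2, 8, 6])
    assert_equal 5.0, stats.median
  end
end
```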
I guess none of this is revolutionary in any way, but next time I face a similar situation, I will have the conviction that Ruby is not all that slow.
1000 rows of data *is* a trivial size. That Ruby, with an in-memory version of the data, could not do simple calculations like median, max, or mean on demand is a very clear indicator that Ruby is unsuited for such calculations. You even state that your solution was to push off the work to the database layer.
So while Ruby may be fine for the top-level business logic, I’d say your evidence points to a different conclusion, that it isn’t suited for any type of calculation work.
Yes, 1000 rows is a trivial size, but not when you have 100 columns in the table. Though some of the heavy lifting was done by the database, the bulk of the calculation was done in Ruby. If the performance had been unsatisfactory, I would have thought about offloading it to a separate layer/language, say spinning up a Java/JRuby application and delegating the calculation-intensive part to it. But then you also have to pay the cost of an additional HTTP call over the network (and maintain that service).
I agree with you to some degree: I would use Ruby for the web/frontend stuff and delegate anything that is computationally intensive to some other, more performant layer (provided the ROI is significant). To be fair, web applications are mostly I/O bound and Ruby processing time does not add that much. One area I can think of where this strategy might be useful is when the app is parsing a whole bunch of XML.
Maybe I wasn’t clear. The challenge was to find the median of each column, over 1000 rows, for all 100 columns in a single request. Does that make it non-trivial now? 😉
A two-level nested – and empty – FOR loop from 1 to 16,000 (that is 256 million iterations) takes 35 seconds to finish in Ruby on my virtual machine. The same is less than 1 second in C, or in JavaScript.
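Presumably something along these lines — a sketch of the kind of loop the commenter describes; timings will vary widely by machine, Ruby version, and interpreter:

```ruby
require "benchmark"

n = 16_000
puts Benchmark.measure {
  n.times do
    n.times do
      # empty body: 16,000 * 16,000 = 256 million iterations
    end
  end
}
```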
Here’s a mandelbrot test case: http://shootout.alioth.debian.org/u32/performance.php?test=mandelbrot
Leave the compiled languages aside, as they have significant advantages.
The lack of primitives makes Ruby useless for computation-intensive tasks. Although I like and use Ruby a lot, in this case it is useful only for prototyping.
Second thought: you wrote “median of a given column”, which makes the original 100 columns meaningless (because that is 1 column only), and that doesn’t sound too big to me. Did I miss something?
As I replied to mortoray, when I am building Rails applications, I would use Ruby for the web/frontend stuff and delegate anything that is computationally intensive to some other, more performant layer (provided the ROI is significant). In this case the performance was satisfactory, and I managed to keep the business logic close to the app, which was a big win.
The challenge was to find the medians of all 100 columns over all 1000 rows. Sorry for not being clear; I will update the original text.
How is Ruby helping if you are delegating the calculations to the database? Does it have better database drivers to reduce I/O time?
Or is it simple enough in Ruby to delegate such operations and still enjoy the benefits of a dynamic language? That could be a reason not to abandon Ruby for the sake of performance.
It’s only part of the calculation logic that is being done by the database. Let’s say you have a formula like (x * y)/2 + max(z). In this example, max(z) comes from the database and is passed to the formula calculation code, which is executed in Ruby.
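For illustration, something along these lines — a sketch with a made-up table name, where x and y would come from the user’s input:

```ruby
# Sketch of the split: the aggregate (max of z) is computed by the database,
# the rest of the formula runs in Ruby. Table/column names are made up.
def calculate(x, y)
  max_z = ActiveRecord::Base.connection
            .select_value("SELECT MAX(z) FROM reference_data").to_f
  (x * y) / 2.0 + max_z
end
```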
As I mentioned to the other commenters, if it were anything more complex, I would not choose Ruby and would go with something faster, but in this case the ROI would not warrant it.
Hope that makes sense.