Performance, Programming, Ruby

Ruby is not all that slow

On one of my Ruby on Rails projects, I worked on a story that involved performing calculations on reference data stored in a database table. The table resembled a dimension table (in data warehousing parlance) and was non-trivial in size: about 100 columns and 1,000 rows. We had to calculate something similar to a median for each of the 100 columns, across all the rows. The challenge was to perform this calculation at request time without adding too much performance overhead.

My first response to any such calculation-intensive activity is to push it offline, to avoid the runtime calculation penalty. But in this case the calculation involved inputs from the user and hence did not lend itself well to pre-processing. We could have done what an Oracle cube does, building aggregates of the fact data beforehand to cut the processing time. But in our case, pre-computing the results for every possible combination of user inputs would have been extremely wasteful in terms of storage. The other concern with this approach was the delay that pre-processing would introduce between the data being available and being usable.

Then came the suggestion of manipulating the data in a stored procedure. It made me shudder! Having recently read Pramod Sadalage’s blog post [1] on the pain points of stored procedures, I was certainly in no mood to accept this solution. The pain points from that post that particularly resonated with me were: no modern IDEs to support refactoring, the need for a database to compile a stored procedure, the immaturity of the unit testing frameworks, and vertical scaling being the only option for scaling a database engine.

As we all know, Ruby is perceived to be slow and unsuited to computationally intensive tasks, and consequently it was not an option on the table. I thought to myself, Ruby can’t be that slow. The calculation we were doing was non-trivial, but not super involved either. I decided to give it a shot.

We started by doing all these calculations using ActiveRecord objects and found that the performance was not good at all. ActiveRecord was the culprit: it created an object in memory for every row, which slowed the calculation down considerably. We ditched it in favour of straight SQL, stored the results in arrays, and performed the calculations on those arrays. Better! But not good enough. We found that we were doing operations like max, min and order by on a column’s values in Ruby, which wasn’t particularly performant. We delegated those to the database engine, since databases are typically good at such things, and saw quite a hefty performance gain. With these simple tricks we got pretty much the performance we were looking for.
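A minimal sketch of the array-based step, in plain Ruby. The `rows` literal is made-up sample data standing in for what a raw SQL query returns, and `median` and `column_medians` are hypothetical helper names; the actual calculation was only "similar to" a median.

```ruby
# Rows as plain arrays, the shape a raw SQL query hands back.
# Hypothetical sample data; the real table had about 100 columns and 1,000 rows.
rows = [
  [10, 4.0, 7],
  [30, 1.0, 9],
  [20, 3.0, 8],
  [40, 2.0, 6]
]

# Median of a non-empty array of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Transpose turns row-major data into one array per column, so each
# column is processed without building a model object per row.
def column_medians(rows)
  rows.transpose.map { |column| median(column) }
end

medians = column_medians(rows)
# medians == [25.0, 2.5, 7.5]
```

The sort inside `median` is the kind of work that could instead be pushed to the database with an ORDER BY, as described above.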

Even though we had solved the performance problem, our design had an unintended side effect. Because we were processing data in arrays, outside of the objects it was fetched from, the code looked very procedural. To solve this problem, we created some meaningful abstractions to hold the data and operate on it. These weren’t at the granularity ActiveRecord would have given us in the first place, but at a much higher level. It was a good compromise between having too many objects on the one hand and procedural code on the other, while still getting the performance we desired.
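What such a higher-level abstraction might look like, as a sketch only: `ReferenceColumn` and `share_at_or_below` are invented for illustration and are not the classes from the project. The idea is one object per column of reference data rather than one per row, with the values assumed to arrive already ordered from the database.

```ruby
# Hypothetical coarse-grained abstraction: one object per column of
# reference data, instead of one ActiveRecord object per row.
class ReferenceColumn
  attr_reader :name

  # sorted_values is assumed to be ordered ascending already,
  # e.g. by an ORDER BY in the query that fetched it.
  def initialize(name, sorted_values)
    @name = name
    @values = sorted_values
  end

  def min
    @values.first
  end

  def max
    @values.last
  end

  # Fraction of reference values at or below a user-supplied value.
  def share_at_or_below(value)
    @values.count { |v| v <= value } / @values.length.to_f
  end
end

price = ReferenceColumn.new("price", [1.0, 2.0, 4.0, 9.0])
share = price.share_at_or_below(4.0)
# share == 0.75
```

Because the object only wraps an array, its logic can be unit tested without a database.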

The biggest win for me in doing all these calculations in Ruby was keeping the business logic in one place, in the app, and being able to unit test the calculation logic exhaustively.

I guess none of this is revolutionary in any way, but the next time I face a similar situation, I will have the conviction that Ruby is not all that slow.

[1] With so much pain, why are stored procedures used so much


Why isn’t Ruby 1.8.7 copy-on-write friendly?

You might have heard that Ruby 1.8.7 isn’t copy-on-write (COW) friendly. Let’s see why.

To set some context, say you have built an application in Rails using Ruby 1.8.7 and deployed it on a UNIX system. Scaling the application means spawning multiple processes to serve client requests. Since memory is not shared between processes by default, a whole lot of memory is wasted holding the same Rails code in every process. Modern UNIX systems implement fork in a way that lets a child process share memory with its parent. So essentially there is one parent process holding the memory common to all, and it spawns child processes that share that memory. When the parent writes to shared memory, the affected memory is copied first, so each child keeps the original contents while the parent’s new data goes into its own copy. Similarly, when a child writes to shared memory, the child gets its own copy of that memory and writes the new data there. This technique is called copy-on-write (COW).
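A small sketch of the behaviour visible from Ruby, assuming a UNIX system where `fork` is available. It only shows that a write in the child does not reach the parent; the page sharing itself happens in the kernel and cannot be observed directly from this code. `fork_write_demo` is a hypothetical name.

```ruby
# Returns [parent_view, child_view] after the child modifies data
# that was created before the fork.
def fork_write_demo
  data = ["loaded before fork"]
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    data[0] = "changed in child"  # the write lands in the child's own copy
    writer.puts data[0]
    writer.close
    exit!(0)                      # skip at_exit handlers inherited from the parent
  end

  writer.close
  child_view = reader.read.chomp  # what the child saw after its write
  reader.close
  Process.wait(pid)

  [data[0], child_view]           # the parent's data is untouched
end

parent_view, child_view = fork_write_demo
# parent_view == "loaded before fork", child_view == "changed in child"
```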

Now let’s see how the Ruby 1.8.7 garbage collector works. Its implementation is fairly simple. It uses a mark-and-sweep algorithm to collect unused memory. Essentially, it walks the objects and sets a “mark” flag on every object that is still in use. At the end of this cycle, it sweeps, or collects, all the objects that have not been marked. The problem is that the “mark” information is stored on the object itself, which makes each used object “dirty” and triggers a copy-on-write by the underlying UNIX system. All the used objects held in shared memory get copied into each child’s memory space. This nullifies the memory savings that copy-on-write would otherwise have made possible. This is why Ruby 1.8.7 isn’t copy-on-write friendly.
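The following is a toy model of mark-and-sweep with the flag stored on the object, written in Ruby purely for illustration. It is not the interpreter's actual collector (which is implemented in C), and `ToyObject`, `mark` and `sweep` are made-up names. The point to notice is that the mark phase writes to every reachable object.

```ruby
# Toy heap object: the mark flag lives on the object itself.
class ToyObject
  attr_reader :name, :refs
  attr_accessor :marked

  def initialize(name)
    @name = name
    @refs = []       # other ToyObjects this one points to
    @marked = false
  end
end

# Mark phase: flag everything reachable from the roots.
# Each reachable object is written to here, which is what
# dirties shared memory after a fork.
def mark(roots)
  stack = roots.dup
  until stack.empty?
    obj = stack.pop
    next if obj.marked
    obj.marked = true
    stack.concat(obj.refs)
  end
end

# Sweep phase: keep marked objects, drop the rest, reset the flags.
def sweep(heap)
  live = heap.select(&:marked)
  live.each { |obj| obj.marked = false }
  live
end

a, b, c = %w[a b c].map { |n| ToyObject.new(n) }
a.refs << b                      # a -> b is reachable; c is garbage
heap = [a, b, c]

mark([a])
written = heap.count(&:marked)   # objects modified by the mark phase
survivors = sweep(heap).map(&:name)
# written == 2, survivors == ["a", "b"]
```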

Ruby Enterprise Edition circumvents this problem by patching the GC to store the “mark” information outside the object. For more information, check out this.
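To illustrate the idea of keeping marks outside the objects, here is a toy sketch in Ruby. It is not Ruby Enterprise Edition's actual patch; `ToyNode` and `mark_externally` are hypothetical names, and the `Set` merely stands in for whatever separate structure the real GC uses. The objects are frozen to show that marking never writes to them.

```ruby
require "set"

# Toy heap object with no mark flag of its own.
ToyNode = Struct.new(:name, :refs)

# Mark phase that records reachability in a separate table
# instead of writing to the objects themselves.
def mark_externally(roots)
  marks = Set.new                     # object ids of reachable objects
  stack = roots.dup
  until stack.empty?
    node = stack.pop
    next unless marks.add?(node.object_id)  # nil if already marked
    stack.concat(node.refs)
  end
  marks
end

# Frozen objects: any attempt to write to them would raise an error.
leaf   = ToyNode.new("leaf", [].freeze).freeze
root   = ToyNode.new("root", [leaf].freeze).freeze
orphan = ToyNode.new("orphan", [].freeze).freeze

marks = mark_externally([root])
live = [root, leaf, orphan].select { |n| marks.include?(n.object_id) }.map(&:name)
# live == ["root", "leaf"]
```

Only the mark table changes during marking, so memory holding the objects themselves would stay clean and shareable.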

For a longer and better explanation, go here.