Programming, software architecture

You will make mistakes just don’t make the expensive ones!

I have a friend who is a pilot and someone asked him “Aren’t most of the controls on the airplane running on auto mode then why do you get paid so much?”. His answer was “Well, I don’t get paid for the 99% of the times when things go right, I get paid for the 1% when things go wrong and in that case I am the difference between life and death.” Sometimes, I like to view my job as a technology leader through that lens – getting paid to avoid the most expensive mistakes. Although thankfully most of the times, I am not responsible for making life and death decisions. 🙂

Now ideally, you would want to avoid all mistakes. But the matter of fact is, that we cannot avoid all mistakes because we lack necessary context, or the skill set or just bandwidth. So whats the next best thing? I like to categorize mistakes by the cost of fixing them. Here is how I categorize mistakes: multi million dollar, a million dollar or a sub-million dollar mistake. These “buckets” come from the cost of the mistakes I have made personally or experienced from my day job of building software, typically for large, enterprise clients.

Now the question is how do we categorize mistakes?

I haven’t spent enough time thinking about the heuristics for these buckets but I have some examples:

  • Multi million dollar mistake. Not arranging your software by business concepts would be a multi million dollar mistake. This is the type of mistake that doesn’t instantly hurt and is hard to detect. It dramatically limits the organization’s ability to evolve over time. I bet everyone is doing this wrong, some to a lesser degree than the others and its really important to be on top of this one. This is the one that the organization’s CTO should be actively looking at.
  • Million dollar mistake. Not having the right logic in the right layers of your application/service would be a million dollar mistake. For example, your domain logic is embedded in your database and now your processing cannot scale until you scale your database, which means buying more licenses for your database.
  • Sub-million dollar mistake. God classes would be an example of sub-million dollar mistake. This is where there most of your business features depend on one class in your code base. It is very expensive to add new business features or maintain existing ones.

The categorization of the mistakes provides the necessary focus on the ones you should be avoiding – the most expensive ones and the rest you can choose to defer or deal with at the “last reponsible moment”. Avoiding the multi million dollar mistakes takes careful planning.

Going back to our examples,when looking at a particular software, you should be focusing on if the software is arranged by the business concepts (where Domain Driven Design thinking can be a huge help), before you get too deep into the weeds of the code and solving for God classes. If you spend your time wisely avoiding the multi million dollar mistakes, you could still end up with some God classes but at least it won’t be as expensive to fix them.

Standard
Android, architecture, Design, Mobile, Programming

MDM Operation serialization

This is the third blog in the series on building custom MDM (Mobile Device Management) solution for Android tablets. The first blog gives a general overview of the MDM solution and talks about the server architecture. The second one is about the Android MDM agent architecture.

The high level idea is that we have an MDM agent that runs on the tablet, fetches ‘operations’ from the MDM server, executes them and returns back the status to the server. In this blog, I will be talking about some implementation specific details on the Android MDM agent, specifically around serializing the processing of ‘operations’, which required us to implement multi-threaded synchronization in Android.

To start with, what are operations?

Operations are tasks that the server wants the tablet to execute, for example, set password policy, install an application, block settings on the tablet, et al.

So, what’s the problem?

The problem is making sure, during the operation processing on the tablet, only one operation of a given type is executed at any given point in time. This means that two PasswordPolicy operations should not be executed at the same time. It is ok though, to have a PasswordPolicy operation execute in parallel with InstallApp operation. In fact, that is desirable. Executing operations of the same type in parallel has a serious drawback.

Race condition

Lets say the operation being executed in parallel is the PasswordPolicy operation. One of the operation wants to apply the password policy as Normal (4 digit password) whereas the other wants it to be Strong (6 alphanumeric password). Now based on which one gets executed last, the password policy will be set on the tablet and not necessarily the one that the server generated last. Worse the server could get out of sync with the tablet, based on which operation sent its status last.

So, what’s the solution?

Enter Android. As mentioned before, the MDM agent is an Android component. Android provides the IntentService that can solve a similar problem. Its basically a Service that processes one Intent (asynchronous request) at a time. It maintains a queue for the incoming Intents and spawns a single worker thread to process them sequentially. It provides a method for handling the Intent. Once the method is executed, it will pick up the next Intent waiting on the queue. Sounds like what we want, right? All we do is create an IntentService for each of the operation types and just keep firing Intents at the appropriate IntentService, as and when the operations arrive. Here’s how it would look.

mutlithreaded-serialization-naive

But wait! There is always a but, in’it?

What’s the problem now?

Enter multi-threading. Some of the operations have to jump to another thread to finish their processing, viz., the Main thread, also known as the UI thread. Android mandates that all UI activities be performed on the Main thread. So, our PasswordPolicy operation would go on to the Main thread to show the dialog box for changing the password to confirm to the new password policy. When the control is passed to the Main thread, the worker thread thinks that its done, without waiting for the response from the Main thread. It happily continues processing the next operation, while the previous one could still be waiting for a user response on the Main thread. Damn! Serialization in multi-threaded environment is not easy.

So now, what do we do?

Now what we do is, block the worker thread until the Main thread has finished doing what its doing. We provide a callback when we jump off of the worker thread. When the main thread finishes, the callback is called which unblocks the worker thread, performs finish up activities for the operation (sending status to the server) and then releases control. Android will pick up the next Intent and process the next operation. Mission accomplished! 🙂 Here is a pseudo sequence diagram for doing it.

mutlithreaded-serialization

We could have implemented this in a purely non-blocking way, by having the finishing operation trigger the next one. This way though its hard to have expiration time on the operation processing since it moves between threads. Blocking makes it very easy to do that. We use the Java CountDownLatch with a timeout on it for blocking the thread. Also conceptually its easier to understand. And since its a background/worker thread its not such a big deal blocking it.

Thanks for reading! Comments/suggestions welcome.

Standard
Android, architecture, Design, Mobile, User Experience

MDM Tablet Architecture

This is the second part of my blog series that talks about building a custom MDM (Mobile Device Management) solution. If you haven’t read the first part, you should check it out here.

As mentioned in the previous blog, as part of the MDM solution, we have the MDM agent running on the tablet that fetches operations from the server, executes them and returns back status of the operations to the MDM server. As a quick reminder, here are some examples of these operations:

  • Silent installation of application/s
  • Block settings on the tablet
  • Block application/s on the tablet
  • Reset the pin on the tablet
  • Factory reset the tablet

In this post, I am going to talk about how we realized the implementation of some of these challenging operations.

So, to put things in context, if you are developing a regular Android application, you would have access only to Android’s publicly exposed APIs. Your architecture would look something like the one below.
normal-app-tablet-architecture

But when you are building an MDM agent, the publicly exposed APIs are not sufficient for implementing some of the operations. So, for example, if you wanted to silently install an application (install happens in the background, without the user knowing about it), there is no public API to do such a thing. But if you look at the source code of Android there exists a method that does exactly that. Now what do we do?

Now here is where we need to broaden our horizon a little bit. As I mentioned before, we are responsible for the end-to-end solution, hardware and software. This gives us great leverage in terms of customizing the Android platform and here is how we do it.

We work directly with the hardware vendor for procuring the Android devices. The devices come with an Android system image that is built specifically for our needs. A system image is a combination of the Android operating system, OEM (Original Equipment Manufacturer, in this case, the hardware vendor) applications, device specific drivers, et al. As we all know, Android is an open source platform and hardware vendors are free to customize it, as long as it satisfies Android’s CTS (Compatibility Test Suite).

We built an Android service that would expose some of the non-publicly available Android APIs. Lets call it the “bridge service”. We handed this service over to our hardware vendor for it to be included in our custom system image. Including the bridge service in the system image, gives it access to Android’s non-public APIs, since the service will be considered part of the Android operating system. This bridge service will in turn be accessed by our MDM agent for realizing the implementation of some of the operations. And boom, there you go! Now we have access to Android’s non-public APIs.

We did not stop with just exposing non-public APIs; we went one step further. We have some custom Android components included in the system image. For example, one of the business requirements is to, only allow certain applications to run on the tablet. For this, we have a App Killer utility that kills applications that are not part of a “allowed apps” list. This list is fetched from the MDM server as part of an operation and fed to the App Killer utility on the tablet.

Some of the other requirements, like blocking certain settings on the tablet, are implemented by overriding the Settings application code that’s part of the Android operating system. With all these customizations, this is how our architecture looks like.

mdm-tablet-architecture

Its a pretty cool realization that we can do anything we want with the Android operating system! :). But, of course, all of this, comes at a cost.

First of all, this bridge service presents a security risk and we mitigate it by making sure that it can only be accessed by our MDM agent. The other big concern is that, when Android changes one of these APIs, which they are free to do, since they are not publicly exposed, we would have to make corresponding changes to our bridge service. Also, working out the logistics hasn’t been easy, primarily due to the long feedback cycles involved in testing the integration between system image, bridge service and the MDM agent. Each of the system image changes have to be certified by Google by passing the CTS, which adds to the delay. Making changes to the system image requires us to go back to the hardware vendor which is time consuming. At the moment, it takes the hardware vendor roughly about 3 days to turn over a single system image change, depending on the complexity of the change. Hence, we have to plan these changes well in advance.

But when you are building a mobile device management solution, you need these capabilities. It has made a vast improvement in the user experience for the tablet users. So for example, on the tablet, we can silently install applications without the user having to go through the normal process of clicking OK button on multiple screens. Imagine if the user has to do this for 100s of applications! At that point, it becomes a necessity as opposed to a ‘nice-to-have’.

Its been incredible fun “hacking” with the Android operating system. I hope you have liked the blog. Feel free to let me know what you think. Thanks for reading!

Standard
Android, architecture, Design, Mobile

Mobile Device Management

For the last year or so, I have been incredibly lucky to be working on building a custom MDM (Mobile Device Management) solution for Android tablets. This is a blog series, where I talk about the general architecture of the MDM solution and the specifics of the Android tablet component and the server component of the MDM solution.

Its broken down in to 5 parts. Here they are:

Introduction
Tablet Architecture
Operation Serialization
Tablet Compliance
Operation Processing Workflow

Hope you enjoy the series. Comments welcome.

Standard
architecture, Craftsmanship, Design, Java, Refactoring

Soul coding

Yesterday I had a 12 hour non-stop[1] code fest to refactor a thin slice of 2-tiered web application into a 3-tiered one. It was very productive and I must say this is the kind of stuff that soothes my developer soul and hence the name. 🙂

The primary driver for the refactoring was that the core logic of the application was tightly coupled on both ends to the frameworks being used. On one side, it was tied to the web application framework, Play and on the other end the ORM, Ebean. We managed to move the business logic in to a separate tier, which is good on it own, but also let us unit test the business logic without bothering with the frameworks, which can be frankly quite nasty. As a follow on effect, we also managed to split the models into 2 variants, one to support the database via Ebean and the other to generate JSON views using Jackson. This let us separate the two concerns and test them nicely in isolation. Similarly, we could test the controllers in isolation. We got rid of bulk of our functional tests that were testing unhappy paths for which we had unit tests at the appropriate place viz., the controller, view model, service and database models.

I was quite amazed at how much we could get done in a day. Here are some of the important take aways from this experiment:

  • We had a discussion the previous day about how we wanted to restructure the code, primarily focusing on separation of responsbilities and improving unit testability of the code. We agreed upon certain set of principles, which served as a nice guideline going into the refactoring. On the day of refactoring, we realized that not all the things we discussed were feasible, but we made sensible adjustments along the way.
  • Keep your eye on the end goal. Keep reminding yourself of the big picture. Do not let minor refactorings distract you, write it down on a piece of paper so you dont forget.
  • Pairing really helps. If you are used to pairing, when you are doing refactoring at this scale, it doubly helps. Helps you keep focused on the end goal, solves problems quickly due to collective knowledge and also decision making cycle time is considerably reduced when making adjustments to the initial design. Also I would say pick a pair who is aligned with you on the ground rules of how you are going to approach development. You don’t want to get into a discussion of how or why you should be writing tests and what is a good commit size.
  • Having tools handy that get you going quickly. Between me and my pair, we pretty much knew what tool to use for all the problems at hand. At one point, we got stuck with testing static methods and constructors. My pair knew about PowerMock, we gave it a spin and it worked. And there it was, included in the project. Dont spend too much time debating, pick something that works and move on. If it does not work for certain scenarios, put it on your refactoring list.
  • Thankfully for us, we had a whole bunch of functional tests already at our disposal to validate the expected behavior, which was tremendously useful to make sure we weren’t breaking stuff. If you dont have this luxury, then pick a thin slice of functionality to refactor which you can manually test quickly.
  • Small, frequent commits. Again the virtuosity of this is amplified in this kind of scenario.
  • Say no to meetings. Yes, you can do without them for a day, even if you are the president of the company. 🙂

Have you done any soul coding lately? 🙂

[1] Ok, not quite 12 hours, but it was on my mind all the time. 😉

Standard
Agile, Programming

Tasking

I have been following this seemingly innocuous practice of “tasking” when programming on a story, which I find very useful and I recommend you try it. Here are the what, when and whys of tasking.

What: Breaking the story into small tasks that could be worked on sequentially.

When: Before writing the first line of code for the implementation of the story.

Why:
Understanding: By breaking the story into smaller chunks, it gives a good grasp on the entire scope of the story. It helps define the boundary of the story and gives a good idea of what is in scope and out of scope.
Completeness: Tasking makes you think of the edge case scenarios and is a conversation starter for missed business requirements or even cross functional requirements like security, performance, et al.
Estimation: Doing tasking right in the beginning of the story gives a sense of the story size. On my current project we are following the Kanban process. Ideally we would like to have “right-sized” stories, not too-big not too-small. Tasking helps me decide if the story is too big and if it is, then how could it be possibly split into smaller ones.
Orientation: This has been a big thing for me. I feel I go much faster when I have a sense of direction. I like to know what is coming next and then just keep knocking off those items one by one.
Talking point: If you have 1 task per sticky which I recommend, it serves as a good talking point for say business tasks vs refactoring/tech tasks, prioritizing the tasks, et al.
Pair switch: If you are doing pair programming, like we do, then you could be switching pairs mid way through the story. Having a task list helps retain the original thought process when pairs switch. Stickies are transferable and they travel with you if you change locations.
Small commits: Another big one. Each task should roughly correspond to a commit. Each task completion should beg the question “Should we commit now?”. If you are doing it sooner, even better.
Minimize distration: There is a tendency as a developer to fix things outside the scope of the story like refactoring or styling changes to adjacent pieces of code. If you find yourself in that situation, just create a new task for it and play it last.

Thanks for reading and feel free to share your experiences.

Standard
Design, Programming

Simplicity via Redundancy

Typically when we do a time versus space tradeoff analysis for solving a computational problem, we have to give up one for the sake of the other. By definition, you trade reduced memory usage for slower execution or vice versa, faster execution for increased memory usage. Caching responses from a web server is a good example of space traded for time efficiency. You can store the response of a web request in a cache store hoping that you could reuse it, if the same request needs to be served again. On the other hand, if you wanted to be space efficient, you would not cache anything and perform the necessary computations on every request.

When I am having a discussion with my colleagues, this time versus space tradeoff analysis comes up quite frequently ending in a time efficient solution or a space efficient solution based on the problem requirements. Recently I got into a discussion which gave me a new way of thinking about the problem. How about if we also include simplicity of the solution as another dimension which directly contributes to development efficiency. In an industrial setting, simplicity is just as important or the lack thereof costs money just as time or space inefficient solutions would.

So the recent discussion I had was about a problem that involved some serious computation at the end of a workflow that spanned multiple web requests in a web application. This computation would have been considerably simpler, if we stored some extra state on a model in the middle of the workflow. It started to turn into a classic time versus space efficiency debate. Time was not an issue in this case and hence giving up that extra space was considered unnecessary. But I was arguing for the sake of simplicity. Quite understandably, there was some resistance to this approach. The main argument was “We dont have to store that intermediate result, then why are we storing it”. I can understand that. If there is no need, then why do it. But if it makes the solution simple, why not?

I admit there might be certain pitfalls in storing redundant information in the database, because it is not classic normalized data. Also, there could be inconsistencies in the data, if one piece changes and the other piece is not updated along with it. Also it might be a little weird storing random pieces of information on a domain. Luckily in my favor, none of these were true. The data made sense on the model and could be made immutable as it had no reason to change once set and hence guaranteeing data consistency.

Extending this principle to code, I sometimes prefer redundancy over DRYness (Don’t Repeat Yourself), if it buys me simplicity. Some developers obsess over DRYness and in the process make the code harder to read. They would put unrelated code in some shared location just to avoid duplication of code. Trying to reduce the lines of code, by generating methods on the fly that are similar to each other, using meta-programming can take it to a whole new level. It might certainly be justified in some cases but I am not sure how many developers think about the added complexity brought on by this approach.

A good way to think about reusability is thinking about common reason for change. If two classes have a similar looking method, and you want to create a reusable method and have it share between the two, then you should also think about if those two classes have a common reason for that method to change. If yes, then very clearly that method should be shared, if not, may be not so much, especially if it makes the code hard to understand.

I feel redundancy has value sometimes and simplicity should get precedence over eliminating redundancy. Thank you for reading. Comments welcome.

Standard
architecture, Programming, Scala

To graph or not to graph?

I have been recently working on the Credit Union Findr application and I think it is a pretty interesting problem to solve. To give you some idea about the functional requirements, the application intends to help people find credit unions they are “eligible” for. If you don’t know what a credit union is, it is a non-profit financial institution that is owned and operated entirely by its members. They provide a variety of financial services, including lending and saving money at a much better rate than regular banks. Credit unions can only accept members that satisfy certain eligibility criteria. The eligibility criteria are based on where you live, work, worship or volunteer. They are also based on who you work for, your religious, professional affiliations, et al.

So on the face of it, it seems like a simple enough problem. Take the eligibility criteria for credit unions, take the user’s personal details and then match the two. Sounds simple, right? Not so much. When we started building the application, there were some interesting challenges on the domain side (non-software) as well as on the implementation side (the software bits). Although the focus of this blog is to talk about the software side of the application, I will quickly run through the domain challenges.

First of all, gathering the credit union eligibility data is a monumental task. Thankfully, the product owner team, Alternate Banking System (ABS) handled that for us. Second, we were not sure which questions to ask the user. Because we could ask them everything under the sun and still not have any qualifying credit unions as a result, which is rare, but still a possibility. Moreover there were concerns around how much personal information would the user be comfortable sharing with us. All these challenges were tackled by our UX (user experience) team when building the prototype and they made the suggestion that we should treat this application like a dating site. You ask the user a minimum set of relevant personal questions, ask them what they were looking for in a credit union and then match the two. The analogy made it much simpler to visualize the system.

The match making analogy stuck with us developers. But the big question from an implementation stand point was how do we model the data. Max Lincoln, one of the brilliant engineers on our team, made a suggestion that we use a graph database. Frankly at first I found that a bit odd. I was thinking more along the lines of a document store. You throw credit union eligibility criteria on a document, user details on another document and match the two. Graph database. Really? But any way what do we do when we have two approaches to solve a problem. Well, we spike them out! Both the spikes were quick and revealed the merits and demerits of each solution. Mustafa Sezgin led the document store spike and used ElasticSearch. As I thought more and more about the problem, I became convinced that graph solution was the way to go. Let me explain how.

So lets start off by an example to see how we could model this problem as a graph. Lets say, you have a credit union that accepts people working in Manhattan. And you have a user who works in Manhattan. Now on a graph, you would have a node for the credit union which has a relationship called “acceptsWorkingIn”, represented by an edge, to the node Manhattan. And on the user side, you would have a user node, that has a relationship called “worksIn” to the same Manhattan node that also connects to the credit union. Now lets say you had another credit union that accepted people working in Manhattan. Then you would have a new credit union node to represent it and with a relationship “acceptsWorkingIn” to the same Manhattan node. Now to find the credit unions that the user is eligible for, you would simply traverse the graph starting from the user node until you find all leaf credit union nodes.

One of the reasons why the graph solution really shines is “hotspot questions”. Let me first explain what a hotspot is. So in our example, a hotspot is a node to which you have a lot of credit union nodes attached to. So in our previous example, if we keep adding all the credit union eligibility data as far as “acceptsWorkingIn” relationship is concerned, and you find out that 90% of the credit union nodes are attached to Manhattan node, then Manhattan node clearly becomes a “hotspot”. Remember what I mentioned earlier, that we had a difficulty coming up with which questions should we be asking the user. Hotspot questions! The questions that lead us to a hotspot quickly are hotspot questions. So a hotspot question in this example would be “Do you work in Manhattan?”. The UX folks were delighted because if we could ask a hotspot question as our first question and get a positive response, the user becomes eligible for 90% of the credit unions right away. More results quickly equals big win for usability! Now there is another side to this hotspot questioning. Lets say all credit unions accept people who live on the Ellis Island, and what if nobody lives on Ellis Island, you would get a negative response for all the users. Doesn’t make it a great question, does it? Now lets say we had a healthy database of users and you analyze the data from the user side and you find out that a large percentage of users work for NYC Metropolitan government and end up qualifying for some credit union. So you can find the user an eligible credit union if they also happen to work for NYC Metropolitan government. So the hotspot question would be “Do you work for NYC Metropolitan government?”.

Other reason why I liked the graph model is because a graph model is flexible and easy to extend. Lets say you have a credit union that accepts folks who work for NYC Metropolitan government and you have a user who works for NYC Transit. Now in a graph model, you could easily have a connection between NYC Metropolitan government and NYC Transit via the relationship “belongsTo”. This way, over time as you gather more information, you could have multiple levels of relationship between the user and the credit union. The technical implementation remains exactly the same. All you have to do is traverse the graph from the user node to the leaf credit union nodes.

What I have realized is that, what would be a value in a document store, is a first-class entity in a graph model. So for example, in a document store you have an attribute “worksIn” on a user document with the value Manhattan. On a graph database the value Manhattan gets its own node which could have its own properties, could have its own relationships to any other node in the system. More so, there is one unique node to represent Manhattan, thus giving rise to the concept of hotspot which has been immensely valuable in our case.

Any way we have just started building this application and I am really excited about it. For the technology stack we are using Scala, Play! and Neo4J and will be deploying to Heroku.

Thanks for reading and comments welcome!

Standard
Performance, Programming, Ruby

Ruby is not all that slow

On one of my Ruby on Rails projects, I worked on a story that involved performing calculations on some reference data stored in a table in a database. The table was like a dimension table (in data warehousing parlance) and was non-trivial in size, say about 100 columns and 1000 rows. We had to calculate, something similar to a median of a column, for all the rows, for all 100 columns. And the challenge was to perform this calculation at request time, without being too much of a performance overhead.

My first response to any such calculation intensive activity is to push it offline, to avoid the runtime calculation penalty. But in this case, the calculation involved inputs from the user and hence did not lend itself very well to pre-processing. We could have done something that a Oracle cube does, where it builds aggregates of fact data before hand, to expedite the processing time required. But in our case, pre-processing the results for all the possible user input options would have been extremely wasteful in terms of storage. The other concern with this approach would be that, there would be some time delay between the data being available and being usable, due to the pre-processing.

Then came the suggestion of manipulating the data in stored procedure. It made me shudder! Having recently read Pramod Sadalage’s blog1 on pain points of stored procedures, I certainly was not in a mood to accept this solution. The pain points outlined in the blog that particularly appealed to me were, no modern IDE’s to support refactoring, requiring a database to compile a stored procedure, immaturity of the unit testing frameworks, and vertical scaling being the only option to scale a database engine.

As we all know, Ruby is perceived to be slow and incompetent to perform any computationally intensive tasks, and consequentially was not an option on the table. I thought to myself, Ruby can’t be that slow. The calculation we were doing was non-trivial but not super involved either. I decided to give it a shot.

We started by doing all these calculations using ActiveRecord objects and found that the performance was not good at all. ActiveRecord was the culprit because it was creating all these objects in memory and considerably slowing down the calculation process. We ditched it and opted for straight SQL instead and storing the results in arrays and performing the calculations on those arrays. Better! But not good enough. We found that we were performing operations like finding max, min, order by for a given column values in Ruby and which wasn’t particularly performant. We delegated those to the database engine, since they are typically good at such things and saw quite a hefty performance gain. By doing these simple tricks, we could pretty much get the performance that we were looking for.

Even though we had solved the performance problem, we had an unintended side-effect of our design. Given that we were processing data in arrays and outside of the objects where they were fetched from, the code looked very procedural. To solve this problem, we created some meaningful abstractions to hold the data and operate on it. These weren’t at the same granularity as the ActiveRecord would have created in the first place, but at a much higher level. This way it was a good compromise between, having too many objects and procedural code on the other hand, yet getting the performance we desired.

The biggest win for me in doing all these calculations in Ruby, was keeping the business logic in one place, in the app, and unit testing exhaustively the calculation logic.

I guess none of this is revolutionary in any way, but I guess next time I face a similar situation, I would have the conviction that, Ruby is not all that slow.

1 With so much pain, why are stored procedures used so much

Standard
architecture, Integration, Programming

Integration-first development

Yay, yay, I know its another one of those X-first development, but this one has been a bit of a revelation to me. I have been convinced of its value, so late in my career, when it should have been obvious day one. Its not that I did not know about it, but I never realized its paramount importance or conversely, the havoc it can cause, if not adhered to. Oh well, its never too late to do the right thing :). Now that I talked so much about it, let me say a few words about what it means to me. It means, when you are starting a new piece of work, always start development at your integration boundaries. Yes, it is that simple and yet was so elusive, at least to me until now.

We have had numerous heart burns in the past over, not being able to successfully integrate with external systems, leading to delays and frustrations. Examples of integration points in a typical software system could be web services, databases, dropping files via FTP into obscure locations, GUI. Number one reason for our failure to integrate, has been our lack of ability to fully understand, firstly, the complexity of the data being transferred across the boundary and secondly, the medium of transfer itself. We think we understand these 2 things and then proceed with the development with mostly the sunny-day scenario in mind. We happily work on the story development until we hit this integration boundary and then realize, damn, why is this thing working differently than we expected. Not completely different from what we thought, but different enough to warrant a redesign of our system and/or having to renegotiate some integration contract. By then we have spent a lot of time working on things which are very much under our control and could have been done later.

Now what are the possibilities that we could not have potentially thought about in the first place? Let me start with a real-world example that I have come across. We were integrating with a third party vendor that delivered us some data via XML files. We had to ingest these files, translate them to HTML and display on the screen. Easy enough? We thought so too. We got a sample file from them, made sure we were parsing the XML using the schema provided and translating it into HTML that made sense for us. We were pretty good at exception handling and making sure that we did not ingest bad data. The problem started when we were handed a file of over 10 MB of data. Parsing such a huge file at request time and displaying the data on the screen in a reasonable amount of time was impossible. At this point we were forced into rethinking our initial “just-in-time” parsing strategy. We moved a lot of processing offline, as in, pre-process all the XML in to HTML and store it in the database. This alone did not solve the problem, for it was still not possible to render the raw HTML on the screen in a reasonable amount of time, since the HTML had to be further processed to be displayed correctly with all the styling. Obviously the solution was to cache the rendered HTML in memcached. We hit another road block with memcached, being unable to store items greater than 1 MB in size. Surely, we could have used gzip memcached store or some other caching library but that wouldn’t have solved our problem either. We chose to create another cache store in the database. After doing all this, some of the pages were still taking too long to load because of the sheer size and as a result, the business had to compromise on not showing those pages as links at all. But this meant that we had to know the page sizes before hand, to determine where to show links and where not to. All this led to a lot of back and forth, which is fine, but had we known about this size issue earlier, it would have saved us a lot of cycles. By the way, we could have incrementally loaded the page using Ajax but it was too late and wasn’t trivial for our use case. We could have known about the size issue if we had integrated first and worked with a realistic data set rather than a sample file.

So as per the example above, our integration point dictated our architecture in a huge way and also, more importantly, business functionality which is another reason to practice integration-first development. In an ideal world, you would like to insulate yourself from all the idiosyncrasies of your integration points but it rarely happens in a real world.

Other issues around integration points that have influenced our design are, web services going down often, and hence having to code defensively by caching the previous response from the service and setting the appropriate expiration time on it based on the time sensitivity of the data. In some cases, you might want to raise an alarm if the service goes down. If you know that it happens too often, then you might want to have it raised less frequently and be more intelligent about it.

There have been cases, when we have had no test environment to test a service and hence having to test with the production version of the service, which could be potentially dangerous. We had to do it in such a way, so as to not expose our test data in production. This meant our response had to change based on what environment they were being sent from. Certain services charged us a fee per request hence we had to reduce the number of service calls to be more economical.

Some vendors offered data via database replication and occasionally sent us bad data. In this case, we had a blue-green database setup where we had two identical databases, blue and green. The production would point to one of them, say it was pointing to blue. We would load the new data in the green database, run some sanity tests on the new data and only when the tests pass, point production to green. We would then update blue to keep it in sync.

Some of the vendors offered data via flat files. Sometimes they would not send us files on time or completely skip them. We updated our UI to reflect the latest updated time so as to be transparent. In addition to the file size episode I mentioned above, we had another problem with large file sizes. Our process would blow up when we processed large amounts of data. It was because we were saving large amounts of data to the database in one go, which wasn’t apparent initially.

Performance of external systems was an issue and obviously affected the performance of our app. We had to stress/load test the external system before hand and build the right checks into the system if the system would not function as expected.

We have tried to mitigate these integration risks in a number of ways. We implement a technical spike before integrating with an external system, to understand its interactions or even feasibility in some cases. When integrating with data focused stories, what I have found to be most useful is, as a first step, striving to get all the (minimum releasable) dataset into the system, without even worrying slightly about UI* jazz. Don’t bother the developers with UI since it can be distracting. Have the raw data displayed on the screen and then “progressively enhance” it by adding sorting, searching, pagination, etc. This is popular concept used in UI view rendering where you load the most significant bits of the page first and then add layers of UI (javascript, CSS) to spruce it up so that the user can get maximum value first.

On a recent refactoring exercise, we moved a tiny sliver to our application to a new architecture to test out fully the integration points and have a working model first. After that it was all about migrating other functionality to the new architecture.

There are so many things that could possibly go wrong when integrating and it is always a good idea to iron out the good and bad scenarios, before you start working on anything that you know you have full control over. I am not saying you could predict all these cases before hand, but if you start integrating first you have a much better chance of coming up with a better design and delivering the functionality in a reasonable time.

I would like to leave with you two thoughts: integrate-first and test with real data.

* UI can have exacting requirements too, in which case it should be looked at along with data requirements. Some UI requirements that could influence your architecture are loading data in certain amount of time or in a certain way, like all data should be available on page load.

Standard