Thursday, April 7, 2011

3xP: Predicting production performance pt. 2

In my previous post, I ranted on about testing environments, and how everything would look if we tech nerds were to run a project. Well .. everything would be tested and running smoothly, but the users might not enjoy the command-line interface.

Anyway, what do you do when you're unable to reproduce performance issues that appear in production, excuses and previous assumptions aside?

There are a number of non-intrusive metrics that are - or should be - available from your production environment. These should be used to monitor the application, and to warn whenever something is about to get out of control.

But since we're already working on this - I guess the metrics failed, or someone ignored the alerts and flags and whistles. Here's my short recipe for finding performance bottlenecks:

1) Draw a diagram of the application stack
2) Map out the different diagnostics available from each layer
3) Figure out if each layer itself believes it is healthy
4) Let each layer figure out whether the layer below is healthy
5) Start fixing issues from the bottom. With each change, re-assert that the layers below are still healthy. (A minimal sketch of steps 3-5 follows the list.)
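
To make that walk concrete, here is a minimal sketch in Java. It assumes nothing about your particular stack; the Layer interface and every name in it are mine, invented for illustration.

```java
import java.util.List;

// Hypothetical sketch of steps 3-5: every layer answers for its own
// health (step 3) and can probe the layer directly below it (step 4).
interface Layer {
    String name();
    boolean isSelfHealthy();   // the layer's own diagnostics
    boolean isBelowHealthy();  // its verdict on the layer underneath
}

final class StackWalk {
    // Walk bottom-up (step 5): stop at the first layer that fails a check.
    static void verify(List<Layer> bottomUp) {
        for (Layer layer : bottomUp) {
            if (!layer.isSelfHealthy()) {
                System.out.println(layer.name() + " reports itself unhealthy");
                return;
            }
            if (!layer.isBelowHealthy()) {
                System.out.println(layer.name() + " distrusts the layer below it");
                return;
            }
        }
        System.out.println("all layers look healthy");
    }

    public static void main(String[] args) {
        Layer database = new Layer() {
            public String name() { return "database"; }
            public boolean isSelfHealthy() { return true; }  // e.g. no locks piling up
            public boolean isBelowHealthy() { return true; } // e.g. disk I/O looks sane
        };
        verify(List.of(database));
    }
}
```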

A typical example would be a long-running SQL statement. It would be visible in database monitoring tools, and easily fixed either in the database or by altering the statement. Good.
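
For the record, "visible in database monitoring tools" can be as simple as asking the database itself. A hedged sketch, assuming an Oracle database, where the v$sql view tracks cumulative statistics per statement; the connection details are placeholders:

```java
import java.sql.*;

public class SlowQueryReport {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:oracle:thin:@//dbhost:1521/APP"; // placeholder
        // Top ten statements by total elapsed time, with a rough per-run average.
        String top10 =
            "SELECT * FROM (" +
            "  SELECT sql_id, executions, " +
            "         TRUNC(elapsed_time / GREATEST(executions, 1)) AS avg_micros, " +
            "         sql_text " +
            "  FROM v$sql ORDER BY elapsed_time DESC" +
            ") WHERE ROWNUM <= 10";
        try (Connection con = DriverManager.getConnection(url, "monitor", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(top10)) {
            while (rs.next()) {
                System.out.printf("%s: avg %d us over %d runs: %s%n",
                        rs.getString("sql_id"), rs.getLong("avg_micros"),
                        rs.getLong("executions"), rs.getString("sql_text"));
            }
        }
    }
}
```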

But it's still a bottleneck. Why? Even though things look good at the data layer, there may still be issues with transporting the query results to the application server. So, next you look at the database driver, and see that there is a lot of result pagination. Your query yielding 20 000 rows gets paginated into 1 000 chunks of 20 rows.
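
The fix, it turns out, is a one-liner. A minimal sketch in plain JDBC, with a hypothetical table and illustrative sizes:

```java
import java.sql.*;

public class FetchSizeDemo {
    static void streamRows(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, name FROM customer")) { // hypothetical table
            // Small default fetch sizes mean a 20 000-row result costs
            // thousands of network round trips. Raising the hint lets the
            // driver pull, say, 500 rows per round trip instead of 20.
            ps.setFetchSize(500);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getString("name"));
                }
            }
        }
    }

    static void process(long id, String name) { /* application logic here */ }
}
```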

So you fix the prefetch size, and now things run a bit smoother. But there are still some issues, because instantiating all of these results can take a while.

So you may be altering the data transfer model to be more precise about which objects to fetch. But in doing so, you complicate the query again, yielding a longer execution time ..
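
As a hedged illustration of what "more precise" might look like: fetch only the fields actually in play instead of hydrating full entities. The record, table, and query below are all hypothetical:

```java
import java.sql.*;
import java.util.*;

public class ProjectionDemo {
    // A lean, immutable carrier for exactly the data the caller needs.
    record CustomerSummary(long id, String name) {}

    static List<CustomerSummary> loadSummaries(Connection con) throws SQLException {
        List<CustomerSummary> out = new ArrayList<>();
        // A narrower projection is cheaper to transport and instantiate than
        // full Customer entities - but the predicates or joins needed to
        // compute it may cost the database more. Measure both ends.
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT c.id, c.name FROM customer c WHERE c.active = 1")) {
            ps.setFetchSize(500); // keep the earlier prefetch fix in place
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.add(new CustomerSummary(rs.getLong(1), rs.getString(2)));
                }
            }
        }
        return out;
    }
}
```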

And so on you weave a path upwards to the application level, possibly introducing and resolving several issues as you go. This process is orders of magnitude simpler if you are able to simulate the results instead of actually deploying small fixes to production. In addition, you may unknowingly touch upon stuff that breaks other performance metrics. So again - monitoring of the entire stack is crucial.

Finally, when you reach a conclusion and have solved all of the current issues - ensure that your current footprint or benchmark is monitored for changes. Future maintenance and development will surely introduce new issues. Get at them early instead of letting them grow and interweave with each other. Then your next job won't be untying complicated knots of system behavior.
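
A crude way of watching a benchmark for changes is to pin it down in a test. The sketch below just compares wall-clock time against a recorded baseline; the names and numbers are made up, and a real setup would rather use a proper harness or production metrics with alerting. The principle is the same:

```java
public class BaselineGuard {
    static final long BASELINE_MILLIS = 250; // recorded when all was well
    static final double TOLERANCE = 1.5;     // complain at a 50 % regression

    public static void main(String[] args) {
        long start = System.nanoTime();
        hotOperation();
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (elapsed > BASELINE_MILLIS * TOLERANCE) {
            throw new AssertionError("Performance regression: " + elapsed
                    + " ms against a baseline of " + BASELINE_MILLIS + " ms");
        }
        System.out.println("Within baseline: " + elapsed + " ms");
    }

    static void hotOperation() { /* the code path you just tuned */ }
}
```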
