Monday, December 30, 2013

Ignoring method invocations with RSpec message expectations

I often find myself writing RSpec test examples that verify log output, where the test is only concerned with verifying that a single, specific log message has been generated. Usually, though, the logger receives additional output that the test does not care about. Unfortunately, this output must still be "expected", or the test will fail. There is (apparently) no way in RSpec to write a message expectation with specific arguments that should occur "at some point", when the same message is also received at other times with different arguments. For example, this simple test fails if logger.info is called more than once by the code under test:

it "logs a specific message" do
  logger.should_receive(:info).with(/text that must be logged/)
  invoke_your_test_code()
end

RSpec will complain with something like:

expected: (/text that must be logged/)
     got: ("starting process...")


Explicitly expecting all of the other logging calls is cumbersome to write and maintain, and it makes the test fragile to log output changes that are of no concern to the test.

However, I found that one can gracefully ignore this extraneous log output by adding an any_number_of_times expectation alongside the specific expectation, as follows:

it "logs a specific message message" do
  logger.should_receive(:info).with(/text that must be logged/)
  logger.should_receive(:info).any_number_of_times
  invoke_your_test_code()
end

The any_number_of_times expectation must occur after the specific expectation that is being verified; otherwise the expected log output will be "consumed" by the any_number_of_times expectation, and the test will fail.
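
For clarity, here is the ordering that does not work; it is simply the example above with the two expectations swapped:

it "logs a specific message (broken ordering)" do
  # The catch-all expectation is declared first, so it "consumes" the matching
  # logger.info call and the specific expectation below is never satisfied.
  logger.should_receive(:info).any_number_of_times
  logger.should_receive(:info).with(/text that must be logged/)
  invoke_your_test_code()
end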

Monday, November 11, 2013

Two Things Your Configuration Management Should Make Easy

For a given story/issue/ticket, can you produce the associated code diff?

On your project, can you determine the story/issue/ticket associated with the last change that was made for a given line of code?

If your configuration management system makes it possible to do these things, can you do them quickly? Or is it merely "theoretically possible"?

[http://www.sqlite.org/talks/wroclaw-20090310.pdf, slide 141 "Situational Awareness in CM"]
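
As one concrete illustration, here is how both questions can be answered with Git, assuming your team references ticket IDs in commit messages (the ticket ID, file, and line number below are made up):

# 1. Produce the code diff associated with a given ticket:
git log --patch --grep="PROJ-123"

# 2. Find the ticket associated with the last change to a given line:
git blame -L 42,42 path/to/file    # shows the commit that last touched line 42
git show --no-patch <commit>       # its message should reference the ticket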

Wednesday, November 6, 2013

Notes: "The Trouble With Types", Martin Odersky Presentation

http://www.infoq.com/presentations/data-types-issues

Good designs are:
  • discovered, not invented
  • the opposite of random
Strongly typed languages help produce good designs. (I couldn't agree more, after spending almost two years now in the Ruby space.)

Patterns (abstractions) & Constraints (types)

Type systems should be 1) precise, 2) sound, 3) simple

Odersky addresses the issue of Scala's type system complexity, admitting that the complexity arises from the large number of possible combinations of its modular (OO) and functional typing features.

DOT (Dependent Object Type calculus) & Dotty (experimental language): Simplify typing complexity by focusing on just supporting the module typing features internally, i.e., in the compiler.

It's comforting to know that the parts of Scala I've found most confusing (e.g., when to define a class as 'class MyClass[T]' versus 'class MyClass { type T }') are at the core of what Odersky is trying to simplify.

Sunday, November 3, 2013

A Few Noteworthy UNIX Commands

A few noteworthy UNIX commands I've used recently:

  • parallel: Utilize your multiple {core,cpu}s!
  • fdupes: "Finds duplicate files in a given set of directories"
  • s3cmd sync: Like rsync, but with S3 as the file/dir destination
As it turns out, you don't need much more than this to back up your photo archives to AWS S3.

I used parallel to speed up various batches of S3 operations, where the per-request latency adds up noticeably for large batch sizes.

Note that s3cmd sync will not copy symbolic links to S3, but it will copy hard links, so fdupes -H can be used to eliminate uploading of these redundant files.
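
Put together, a rough sketch of the backup flow looks like this (the bucket name and paths are made up, and s3cmd is assumed to have already been configured via s3cmd --configure):

# report duplicate files, including hard-linked ones, so redundant uploads can be pruned
fdupes -rH ~/photos > duplicates.txt

# rsync-style, one-way copy of the archive into an S3 bucket
s3cmd sync ~/photos/ s3://my-photo-backup/photos/

# or run one sync per top-level directory concurrently with GNU parallel
ls ~/photos | parallel s3cmd sync ~/photos/{}/ s3://my-photo-backup/photos/{}/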

Tuesday, April 2, 2013

Database Optimization: Query Performance vs. Request Performance

When it comes time to optimize your web application's database performance, there are (at least) two types of tools you can leverage. You can use an analysis tool like pgFouine, which is focused on individual query performance, or a tool like Scout or New Relic, which is focused on request performance.

pgFouine is fantastic at isolating any single query that, in aggregate, is slowing down your application the most. This is a function of the query's invocation count multiplied by its average execution time. Of course, it will report not just the single most expensive query, but the top N most expensive queries. (The tool will also show you the most frequently executed queries and the overall single slowest queries, but these metrics are less useful for determining which queries to focus on.)

But what if the top N queries are all taking roughly the same amount of time? Which one(s) should you try to optimize or eliminate? pgFouine cannot help you make this decision intelligently. Instead you need a tool that lets you focus on performance at the level of web requests. Consider that each request your web application handles likely comprises many queries. So if you have one type of request that is taking 80% of your web application's processing time, you probably want to see the full set of queries that together are causing those requests to be non-performant. Perhaps you can eliminate many of these queries with an improved application design. This is when a tool such as New Relic (or Scout's "Application" feature) becomes invaluable.

For each type of request, you can see the most expensive queries, but more importantly the full set of queries that are being issued for a given request. (These tools are equally useful for finding application layer code that is non-performant, but I'm only concerned with database performance here.) With a request-level view you can start to evaluate your overall design to determine whether you can eliminate certain queries altogether. For example, you might realize that you are issuing two similar but different queries that can be combined into a single query. pgFouine might show that these two queries are equally performant, while a request-level analysis tool will show you that they are being executed side-by-side while serving a single request. This is the hint one needs to start understanding where optimizations can be made at a design level higher than an individual query.
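
As a contrived illustration of the kind of consolidation a request-level view can suggest (the table and values here are hypothetical):

-- Two similar queries issued back-to-back while serving a single request:
SELECT count(*) FROM orders WHERE user_id = 42 AND status = 'open';
SELECT count(*) FROM orders WHERE user_id = 42 AND status = 'shipped';

-- The same information retrieved in a single round trip:
SELECT sum(CASE WHEN status = 'open'    THEN 1 ELSE 0 END) AS open_count,
       sum(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_count
FROM orders
WHERE user_id = 42;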

Thursday, March 14, 2013

PostgreSQL Column-to-Row Transposition

I recently had a need to generate a geolocation history of user activity. The result set needed to be a linear history of users' activities, with each row consisting of a user identifier, activity type, location, and timestamp. Unfortunately, our database schema stored the location and timestamp of three different types of activities across three separate pairs of columns in the same table. To produce the required output, I needed to transpose the columns into rows, which I was able to do using PostgreSQL's array constructor syntax and the unnest array function.

From a table with the following columns:

  • user_id
  • activity1_location
  • activity1_timestamp
  • activity2_location
  • activity2_timestamp
  • activity3_location
  • activity3_timestamp

I issued the following query:

SELECT user_id,
       unnest(ARRAY['activity1',
                    'activity2',
                    'activity3']) as "activity_type",
       unnest(ARRAY[activity1_location,
                    activity2_location,
                    activity3_location]) as "location",
       unnest(ARRAY[activity1_timestamp,
                    activity2_timestamp,
                    activity3_timestamp]) as "timestamp"
FROM user_activities;  -- hypothetical name for the table described above

The unnest function generates multiple rows, one row per element of the specified array. This produces a result such as:

user_id | activity_type | location    | timestamp
--------+---------------+-------------+--------------------
      1 | activity1     | address1    | 2012-03-13 00:00:00
      1 | activity2     | address2    | 2012-03-13 00:00:01
      1 | activity3     | address3    | 2012-03-13 00:00:02
      1 | activity1     | address4    | 2012-03-13 00:01:00
      1 | activity2     | address5    | 2012-03-13 00:01:01
      1 | activity3     | address6    | 2012-03-13 00:01:02
      2 | activity1     | address7    | 2012-03-14 00:00:00
      2 | activity2     | address8    | 2012-03-14 00:00:01
      2 | activity3     | address9    | 2012-03-14 00:00:02
...

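As an aside, the element-by-element pairing of parallel unnest calls is easy to see with literal arrays of equal length:

SELECT unnest(ARRAY['a', 'b', 'c']) as "letter",
       unnest(ARRAY[1, 2, 3]) as "number";
-- returns three rows: (a,1), (b,2), (c,3)
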
I learned about these PostgreSQL array functions from this highly recommended slide presentation, "Postgres: The Bits You Haven't Found".

Tuesday, January 29, 2013

Quicker Code Reviews

This year I resolve to spend less time doing code reviews by only reviewing the test code. This should work, right?