Tuesday, November 1, 2011

Subversion Directory Tree Conflicts

Came across this animation on a blog while looking for some answers on how to properly resolve a Subversion tree conflict on a directory. This about describes how I feel at the moment, after having already spent a large part of the day working with merging source code branches. In fact, I often feel like this, when I can't find proper documentation for the software I'm using.


For what it's worth, the "answer" I was looking for was found in the last paragraph here, which tells me that Subversion will be of no help in resolving my particular problem. Joy!

Other tree conflicts

There are other cases which are labelled as tree conflicts simply because the conflict involves a folder rather than a file. For example, if you add a folder with the same name to both trunk and branch and then try to merge, you will get a tree conflict. If you want to keep the folder from the merge target, just mark the conflict as resolved. If you want to use the one from the merge source, then you need to svn delete the one in the target first and run the merge again. Anything more complicated than that, you have to resolve manually.
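In command form, the "keep the merge source's copy" case described above might look like the following sketch. The directory name, branch path, and commit message are hypothetical placeholders; adapt them to your repository layout:

```shell
# Keep the branch (merge source) copy of a conflicted directory:
# first remove the copy in the merge target, commit, then re-run the merge.
svn delete conflicted-dir
svn commit -m "Remove local copy of conflicted-dir so the merge can bring in the branch copy"
svn merge ^/branches/feature .

# Alternatively, to keep the merge target's copy, just mark the
# tree conflict as resolved, accepting the working copy as-is:
svn resolve --accept=working conflicted-dir
```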

Tuesday, September 20, 2011

Scala "for" iteration with indexes


In Scala, to iterate through a collection of items while keeping an index, use Seq.zipWithIndex:


for ((item, index) <- items.zipWithIndex) {
  println(item + " at index " + index)
}


(I find this especially useful when writing Scala code that calls into Java library setter methods that are index-based.)

Saturday, June 11, 2011

Getting Started is the Hardest Part

Too often, when I'm trying to get started on a small, personal software project, I'm stymied by the time it takes to get the development environment and project infrastructure set up. With a full-time job as a developer, an addiction to cycling, and the responsibilities associated with being the parent of a two-year-old child, it's hard to find the mental energy and time to work on even a small software idea. So when I do have an hour of mental energy available, the last thing I want to spend it on is project setup and configuration tasks.

Maven archetypes to the rescue! Archetypes allow you to set up your project nearly instantly, and if you have appropriate Maven support in your IDE, you'll be ready to code within seconds (okay, minutes). If--and this is a big if--you can find an appropriately up-to-date archetype that provides the exact stack of technologies upon which your project will rely. So far, I haven't had such luck (can anyone tell me where I can find a well-designed sampling of Scala-based Maven archetypes?). So instead of trying to start off with someone else's half-baked archetype each time I need to start a project, I've decided to take the time to create my own archetype(s) that I can reuse and evolve for my own needs. The following Maven reference page was all I needed to figure out how to generate my own custom archetypes: http://maven.apache.org/archetype/maven-archetype-plugin/advanced-usage.html.
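For the record, the workflow from that reference page boils down to a few commands. This is a sketch: it assumes you already have a prototype project set up the way you like, and that you're running these from its root directory:

```shell
# Turn the current project into a reusable archetype
# (generated under target/generated-sources/archetype).
mvn archetype:create-from-project

# Install the generated archetype into the local repository.
cd target/generated-sources/archetype
mvn install

# Later, bootstrap a new project from it; the local catalog
# will list your freshly installed archetype.
mvn archetype:generate -DarchetypeCatalog=local
```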

Tuesday, April 26, 2011

Find Most Recently Modified File

To find the most recently modified file in the current directory tree (GNU find prints each file's epoch mtime, human-readable mtime, and path; sort numerically descending, keep the first line, and cut away the epoch field):

find . -type f -printf '%T@\t%t\t%p\n' | sort -nr | head -n 1 | cut -f 2,3
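As a quick sanity check, here's a sketch that exercises the pipeline against a scratch directory (assumes GNU findutils/coreutils, for find -printf and touch -d):

```shell
# Create two files with different modification times and confirm
# the pipeline reports the newer one.
dir=$(mktemp -d)
touch -d '2020-01-01' "$dir/old.txt"
touch -d '2021-01-01' "$dir/new.txt"
find "$dir" -type f -printf '%T@\t%t\t%p\n' | sort -nr | head -n 1 | cut -f 2,3
# The printed line ends in .../new.txt
rm -r "$dir"
```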

Thursday, January 20, 2011

How to nest single quotes in BASH command

You can't nest single quotes in a bash command, since there's no way to escape them inside a single-quoted string, but you can do something like this instead:

alias a='perl -e '\''print "a\n"'\'
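The trick is easier to see if you read each '\'' sequence as three pieces: the first ' closes the current single-quoted string, \' appends a literal quote, and the final ' reopens single quoting. A standalone illustration (using echo rather than the alias above):

```shell
# 'outer '\''inner'\'' outer' is the shell concatenating five pieces:
#   'outer '  -> outer (plus a trailing space)
#   \'        -> a literal single quote
#   'inner'   -> inner
#   \'        -> a literal single quote
#   ' outer'  -> outer (plus a leading space)
echo 'outer '\''inner'\'' outer'
# prints: outer 'inner' outer
```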

Friday, January 14, 2011

UNIX command to analyze Amazon S3 logs

I wanted to monitor the download activity on a particular file that I made publicly available on my Amazon S3 account.  Here's how I do it from a bash command-line:
~/dev/s3-curl/s3curl.pl --id=PROFILE -- https://s3.amazonaws.com/BUCKET 2> /dev/null | xpath -e '//Contents/Key/text()' 2> /dev/null | grep '^logs/' | xargs -i ~/dev/s3-curl/s3curl.pl --id=PROFILE -- https://s3.amazonaws.com/BUCKET/{} 2> /dev/null | grep 'GET\.OBJECT.*FILENAME'
where BUCKET and FILENAME are placeholders for the name of your S3 bucket and the filename you'd like to monitor, and PROFILE is the name of the profile you configured in your ~/.s3curl file.

The above command assumes these S3 Logging options for the bucket:
  • Enabled is checked
  • Target Bucket is the same as the bucket containing the file being monitored
  • Target Prefix is "logs/"
In a nutshell, this works by first downloading the file listing of the bucket, then extracting the log file names, then downloading each log file in turn and finally grep'ing them for the download activity (GET.OBJECT).

You'll need to have these command-line tools available: s3curl.pl (Amazon's sample S3 authentication wrapper around curl) and xpath (the command-line tool that ships with Perl's XML::XPath module).

Tuesday, January 4, 2011

Internal vs. External Events in Databases

A database generally comprises data that records the (instantaneous) state of entities in the real world. If the historical states of the data are of interest, then a history of values may be recorded for these entities.  This history of values will usually be timestamped, and thus this historical list of data values can be considered as recording the occurrence of an event in the real world.  In other words, the real-world event changed the state of a real-world entity, whose updated state must then be recorded in the database.  Let's call these real-world events external events.

But now we must also consider that the process of updating the database to record the new real-world value is an event unto itself. Let's call this an internal event, since it is an event that is intimately tied to the database itself.  This event may be a human operator who manually types in a new value or hardware/software that captures the new real-world value and updates the database.  Consider that the act of recording the new value may or may not take place at the same time as the real-world value itself changed.  In other words the external event is distinct from the internal event.

I believe it is important to differentiate between external and internal events when developing a data model that captures the history of data states.  For one, recording internal events can help to audit data entry errors.  For example, erroneous data can sometimes be detected and even corrected if it is examined in the context of other temporally-proximal data entry events (e.g., a data entry operator that repeated a value from a previous data entry record).

Recording internal events explicitly can also help to troubleshoot query anomalies caused by failing to account for the difference in time between the occurrence of an external event and its corresponding internal event.  For example, why did a query run at time t not reflect the updated state from the real-world event that occurred at time t-1? It will not if the data entry--the internal event--occurred at time t+1.


Recording internal events can aid in determining the time periods during which a database is "out of sync" with real-world entity values, due to data input errors.  Ideally, a database will always correctly reflect the state of the real world, but in practice this is rarely the case. Inevitably, bad data will enter a system, and, at best, it is corrected at some point in the future.  When corrections are made and recorded as internal events, the original and the corrected internal event timestamps can be used to determine when and for how long the database maintained inaccurate state.  If internal events can be marked as having been invalid, it is then also possible to generate reports that either ignore or include erroneous data states.  The advantage is that the database is not attempting to forget or otherwise hide the fact that erroneous data existed.  Much as database designers contend that data should never be deleted, one can argue that erroneous data should also not be deleted, but simply flagged.  Knowing that a database was temporarily maintaining bad data can be just as important as storing the correct values and their history.

Note that both external and internal events can also be used to record persons and comments associated with the event, in addition to just timestamps.

Internal events are most commonly recorded in log files, rather than as data in the database itself.  It can be very useful though to record internal events directly in the database, as this avoids the need to join log file output (of internal events) with database records of external events when performing troubleshooting or auditing tasks.