Awk, Awkward?

Today, I would like to thank Alfred Aho, Peter Weinberger and Brian Kernighan for the AWK programming language. If you haven’t played with it yet, I urge you to learn it, since it might save you a lot of time and effort the next time you need to parse or edit a text-based file.

Here are a couple of samples to show you the power of AWK.

Sample 1:

Let’s say you have a huge file that you want to split wherever a certain token appears:

text1 $ text2 $
text3 $ … $
textN-1 $ textN $

Then all you have to do is:

awk 'BEGIN { RS="$"; ORS="" } { print $0 > ("text" ++i ".out") }' fileName

The BEGIN block is executed once, before awk starts to process the contents of the file. We set the Record Separator (RS) to our token (here $); the default is a newline. Then we set the Output Record Separator (ORS) to nothing ("") since the default behavior of print is to append a newline to each record it outputs (and we don’t want that in our case). The parentheses around the file name keep the concatenation unambiguous across awk implementations.

The main block is executed for each record found in the file. We print the whole record (represented by $0) and redirect the output to a generated filename.
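
To try the split on something small, here is a minimal sketch that can be pasted into a shell. The file name sample.txt and its contents are made up for illustration. Building the name in a variable and calling close() after each write is an optional precaution for inputs with many records, not part of the one-liner above.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: three chunks, each terminated by "$".
printf 'alpha$beta$gamma$' > sample.txt

# One output file per "$"-terminated record: text1.out, text2.out, ...
awk 'BEGIN { RS = "$"; ORS = "" }
     {
         out = "text" (++i) ".out"   # generated file name
         print $0 > out              # write the record without a trailing newline
         close(out)                  # release the file handle right away
     }' sample.txt

ls text*.out
```

With this input, text1.out should contain alpha, text2.out beta and text3.out gamma.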

Sample 2:

You want to know how many times each user of your web application has logged in, and all you have is a humongous application.log. Let’s say the file contains a line like this each time someone logs in:

2008-11-17 12:13:35.123 UserManager.logUser: User: Bobby has logged in

The awk program to solve this problem is still a simple one:

awk '$3 ~ /UserManager\.logUser/ { a[$5]++ } END { for (i in a) print a[i] " " i }' application.log

The block of code is executed only if $3 (the 3rd field of each record) matches the regular expression found between the /’s.

a is a map where the key is the 5th field (here the user name) and the value is the number of occurrences of that user name.

Then the END block iterates through the map and prints something like this:

2 Bobby
13 Alan
4 admin
22 Joe
55 support

Of course, if you want only the 10 most active users of your application, you would do something like this:

awk '$3 ~ /UserManager\.logUser/ { a[$5]++ } END { for (i in a) print a[i] " " i }' application.log | sort -nr | head -n 10
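
To check the whole pipeline without a real log at hand, here is a minimal sketch run against a made-up excerpt. Only the first log line comes from the example above; the other lines, including the user Alan's entry and the CacheManager line, are assumptions added to show counting and filtering.

```shell
# Hypothetical log excerpt; only the first line is taken from the example above.
log=$(mktemp)
cat > "$log" <<'EOF'
2008-11-17 12:13:35.123 UserManager.logUser: User: Bobby has logged in
2008-11-17 12:14:02.456 UserManager.logUser: User: Alan has logged in
2008-11-17 12:15:10.789 CacheManager.flush: Cache flushed
2008-11-17 12:16:44.012 UserManager.logUser: User: Bobby has logged in
EOF

# Count logins per user, then keep the 10 busiest users.
# "for (i in a)" visits keys in an unspecified order, so sort does the ranking.
counts=$(awk '$3 ~ /UserManager\.logUser/ { a[$5]++ }
              END { for (i in a) print a[i] " " i }' "$log" | sort -nr | head -n 10)

echo "$counts"
rm -f "$log"
```

With this excerpt the output should be "2 Bobby" followed by "1 Alan"; the CacheManager line is skipped because its third field does not match the pattern.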

Both samples presented here are real-life “programs” I had to write to save myself some time. (I cannot imagine myself hand-splitting 50 files containing 900 “$” into 4500 files, or hand-counting the users that logged in.) Of course, both problems could have been solved by a small Java or C# program, but it would have required around 50 to 100 lines of code and a compile step.

AWK might be an “old” language (developed in the 70’s) but, as you can see, it is well suited to quickly extracting data from text files (or redirected output) and computing statistics. So next time you need to process some files, forget about the big “ones” and bring out AWK.

Know your API

This is item 47 of the Effective Java book. And as with your enemies, your API is something you should really get to know. Why? If you don’t know what has already been done, you will start reinventing the wheel (probably a square one). Since that newly created wheel implements the same functionality as the good old API, problems will soon appear when a change in that functionality is required (do you patch both pieces of code, or only yours so you don’t affect others? Is that OK? …). So please, next time you code something, do yourself a favor and explore the available APIs; you will save both of us a lot of time.

Guidelines for successful multi-branch development

I was happy to see Jeff’s post about branching (http://www.codinghorror.com/blog/archives/000968.html), since it was a good post about the different branching strategies and the common pitfalls to avoid when branching. I’d like to share some guidelines to help developers merge their branches successfully. (I’ll assume a main product with one branch per project affecting the product.)

  1. Don’t refactor code outside your project scope. Something as harmless as a “re-organise imports” on the whole product code can break a build after a merge. Example: Eclipse will expand import java.util.* to import java.util.Vector, but imagine that the product code has meanwhile changed to use ArrayList. Subversion will successfully merge your changes into the product code, but the build is broken because ArrayList is no longer imported. The fix is simple, but it’s annoying.
  2. Don’t reformat code that is outside your project scope. OK, you don’t like the opening bracket on a new line that someone else used in some class that you happen to look at. If you are not going to modify the class code, please resist the urge to change the format. It will just make the merge more difficult because you increase the risk of conflicts.
  3. Be aware of the other projects’ scope. Ideally, projects shouldn’t impact other projects’ code. But in reality, two projects can affect the same module of your product. When that situation happens, developers should be more cautious, since the risk of merge conflicts increases. Here, communication is the key.
  4. Despite what your project manager might say, keep the product vision in mind. If the success of the project means that it affects the product in a bad way, raise the flag early. Otherwise, the project is likely to never return to the product trunk, or will need a rewrite to be integrated into the main product, since a merge will create too many conflicts.

In summary, the key to an easy merge is to stay focused on your project scope, communicate with the other teams (projects and product) and be responsible.