Mar 13 2010

How do you find a needle in a haystack?

Actually, the title should read “How do you find a needle in a haystack, not knowing that you are looking for a needle?”.

I just came across this at work and found a simple/quick solution that I thought I share. I had a few gigabytes of log file data in front of me and needed to find “weird stuff” in there. “Weird stuff” being messages that only appeared a handful of times in the millions of log lines. I did not know where the stuff was coming from and I did not know what kind of data was in there.

After some inspection of the first few hundred lines of the log files, I realized that I was looking at the same templates of messages, except that the data injected into the templates changed (those are three log lines I wrapped manually and I also changed the details to look more generic):

Mar  8 04:09:46 hostname class: TAG 08-03-2010 04:09:47.020  988-8720 INFO
(com.company.class:aMethod:1084) - Error logging in email@address.com:12345678:
95be7ab76bb6ee838dd4a55085f15b99. error: Unable to signIn: InvalidLogin
Mar  8 04:09:47 hostname2 class: TAG 08-03-2010 04:09:47.166  74-13915 INFO
(com.company.class2:bMethod:143) - PERFORMANCE: 1203 ms
Mar  8 04:09:47 hostname3 class: TAG 08-03-2010 04:09:47.182  22-78422 INFO
(com.company.class3:cMethod:714) - Successfully connected to
tcp://some.host.com:123456

Typical syslog lines, merged from multiple hosts into a single file. The timestamp is easy to detect (Mar 8 04:09:46) as well as the host that sent the message (hostname, hostname2, hostname3) and the syslog class. What followed (TAG, another timestamp, thread information and source location) was also not of special interest. The actual message seemed to start right after the “-” in each line. And that’s the part I was interested in. Using ‘sed’ on the log file, just extracted that portion:

$ sed 's/^.* \- //' file.log
Error logging in email@address.com:12345678:95be7ab76bb6ee838dd4a55085f15b99. error:
Unable to signIn: InvalidLogin
PERFORMANCE: 1203 ms
Successfully connected to tcp://some.host.com:123456
$

The trick was then to remove all variable items from the log-lines and return to the template that was used in the source. So for the first line, something like this was used in the source:

fprintf(logfh, "Error logging in %s:%d:%s. error: %s",
   emailAddress, someNumber, hexNumber, errorDetail);

I used perl and a number of regular expressions to find most of the variable data and removed it from each line. This meant that the first line was reduced to:

Error logging in ::. error: Unable to signIn: InvalidLogin

I computed an MD5 hash for the remainder and stored the hash along with location information in an associative array. The next time the same pattern was detected, I would increase the number of times that particular pattern was in the log-files and if that number was below a certain threshold, I would also record the location. The (reduced) perl function to do that, looked like this:

sub findUnusualLines {
	my($host, $lineno, $msg) = @_;
	my $orgMsg = $msg;
 
	# trim line to remove "data" - reduce to template
	# remove path names
	$msg =~ s/[A-Z]:[\\\S\.]+//g;
	# remove numbers
	$msg =~ s/(\W)\d+/$1/g;
	# remove hex numbers
	$msg =~ s/(\W)[0-9a-f]{4,}/$1/gi;
 
	# empty?
	return if ($msg =~ /^\s*$/);
 
	# compute digest for message
	my $md5=md5($msg);
 
	# first time?
	if(++$seenLines{$md5}->[0] == 1) {
		$seenLines{$md5}->[1]=$orgMsg;
	} elsif($seenLines{$md5}->[0] < MAX_MENTIONS) {
		push(@{$seenLines{$md5}},$host,$lineno);
	}
}

Once all log lines were processed, I walked over %seenLines, found all the ones where the first element of the array reference was less than MAX_MENTIONS and printed those along with the location where they appeared:

foreach my $l (keys %seenLines) {
	my $n=shift(@{$seenLines{$l}});
	next if ($n >= MAX_MENTIONS);
	my $txt=shift(@{$seenLines{$l}});
	print "$txt appears $n times at: ",
	while(scalar(@{$seenLines{$l}})) {
		my $host=shift(@{$seenLines{$l}});
                my $line=shift(@{$seenLines{$l}});
                print "$host:$line ";
        }
        print "\n";
}

Using this I was able to find a set of messages that only appeared a handful of times in the log files (some of those were critical errors we needed to look at). And on top of that, it was blazingly fast as well.

Leave a Reply