pattern matching problem

rein · October 26, 2007, 7:29am

I have a file with the following contents:

NEW 85174 MP081 /29OCT07
CNL 85986 MP098 /28OCT07
NEW 86014 MP098 /28OCT07
NEW 86051 MP097 /27OCT07
CNL 86084 MP097 /27OCT07

Now I have to retrieve every line that starts with NEW and is directly followed by a line that starts with CNL, where the MP code is the same in both lines.

So it has to return the last two lines from this example.

Yogesh_Sawant · October 26, 2007, 8:09am

This may not be the most efficient solution:

#!/usr/bin/perl
# get_lines.pl
use strict;
my $check_next_line = "no";
my ($new_line, $mp);
while (<>) {
    chomp;
    if (/^NEW/) {
        $new_line = $_;  # save the line beginning with word NEW
        $check_next_line = "yes";
        $mp = (split (/\s+/, $_))[2];  # save the field containing MP code
        next;
    }
    elsif (/^CNL/ && $check_next_line eq "yes") {  # only if the previous line began with NEW
        if ($mp eq (split (/\s+/, $_))[2]) {
            print $new_line, "\n";
            print;
            print "\n";
        }
    }
    $check_next_line = "no";
}

Run this script as:

perl get_lines.pl new_file

radoulov · October 26, 2007, 9:46am

With awk:

awk '/^CNL/&&x=="NEW"$3&&$0=y RS$0
{x=$1$3;y=$0}' filename

Use nawk or /usr/xpg4/bin/awk on Solaris.
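For anyone who finds the one-liner dense, the idea behind it can be spelled out step by step. Below is a rough Python 3 sketch of that idea, not an exact translation of the awk program: it remembers the previous line together with its first and third fields, and when the current line starts with CNL and the previous one was NEW with the same third field, it emits both. The function name `new_then_cnl` and the in-memory sample are made up for illustration.

```python
import io


def new_then_cnl(lines):
    """Yield (new_line, cnl_line) pairs: a NEW line directly followed by a
    CNL line that carries the same code in its third field."""
    prev_key = None   # plays the role of awk's x: first field + third field
    prev_line = None  # plays the role of awk's y: the previous line
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        if len(fields) < 3:
            # Too short to hold a code, so forget the previous line.
            prev_key = prev_line = None
            continue
        if line.startswith("CNL") and prev_key == "NEW" + fields[2]:
            yield prev_line, line
        prev_key = fields[0] + fields[2]
        prev_line = line


# The sample data from the first post, held in memory instead of a file.
sample = io.StringIO(
    "NEW 85174 MP081 /29OCT07\n"
    "CNL 85986 MP098 /28OCT07\n"
    "NEW 86014 MP098 /28OCT07\n"
    "NEW 86051 MP097 /27OCT07\n"
    "CNL 86084 MP097 /27OCT07\n"
)

for new_line, cnl_line in new_then_cnl(sample):
    print(new_line)
    print(cnl_line)
```

On the sample data this prints only the last two lines. To run it against a real file, pass an open file object instead of `sample`.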

rein · October 26, 2007, 9:59am

OK, great, both solutions work.

rein · October 26, 2007, 10:49am

With awk I spent 3 hours and got no result; with Java it took 30 minutes and I got a result. The awk solution is much smaller, though, but I can't understand it.

import java.io.*;


public class GetLine {

	private String prevWord = "";
	private String prevLine = "";
	
	private void parse(String fileName){
		try {
	        BufferedReader in = new BufferedReader(new FileReader(fileName));
	        System.out.println("Reading file: " +fileName);
	        String line = "";
	        while ((line = in.readLine()) != null) {
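	        	// Split on any whitespace so both tabs and spaces work.
	        	String [] word = line.trim().split("\\s+");
	        	if (word.length < 3) {
	        		// Malformed line: reset so it cannot pair with the next one.
	        		prevWord = "";
	        		prevLine = "";
	        		continue;
	        	}
	        	// Print only when a NEW line is directly followed by a CNL line
	        	// that has the same MP code.
	        	if (prevLine.startsWith("NEW") && line.startsWith("CNL")
	        			&& prevWord.equals(word[2])) {
	        		System.out.println(prevLine);
	        		System.out.println(line);
	        	}
	        	prevWord = word[2];
	        	prevLine = line;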
	        }
	        in.close();
	    } catch (IOException e) {
	    	System.err.println(e);
	    }
	}
	
	public static void main(String [] args){
		GetLine getLine = new GetLine();
		getLine.parse(args[0]);
	}
	
}

Does anyone have a solution in C, perhaps? Just for fun.

drl · October 26, 2007, 11:28am

Hi, rein.

You mentioned the awk and Java. It looked like the perl script from Yogesh Sawant would work -- did you time it?

This looks mostly I/O-bound, so apart from how the algorithm is coded, I would not expect drastically different run times. For example, my experience is that Perl is very close to C for I/O-heavy cases, but not so close for arithmetic-dense code.

The Perl might be made a bit more efficient by not splitting more fields than needed. The "o" suffix would not help here, since it only matters for patterns that interpolate variables and these patterns are constant -- but I'd think any such change makes a very small contribution ... cheers, drl
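To illustrate the "not splitting more fields than needed" idea: Perl's split takes an optional LIMIT argument, and Python's `str.split` has a comparable `maxsplit` parameter. The following is a small Python 3 sketch of the Python form only. The helper name `mp_code` is invented for this example, and whether it saves any measurable time on a file like this has not been measured.

```python
def mp_code(line):
    """Return the third whitespace-separated field, or None if it is absent."""
    # maxsplit=3 stops after three splits, so everything past the third
    # field is left as one unsplit chunk instead of being broken up further.
    parts = line.split(None, 3)
    return parts[2] if len(parts) >= 3 else None


print(mp_code("NEW 86051 MP097 /27OCT07"))
print(mp_code("NEW 86051"))
```

The first call prints the MP code and the second prints `None`, because the line has fewer than three fields.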

radoulov · October 26, 2007, 11:37am

I believe those are not execution timings.

drl · October 26, 2007, 12:02pm

Hi.

Ah, you're suggesting it was rein's time, apparently. Yes, I could see how one's familiarity with the language at hand could affect time-to-solution ... cheers, drl

ghostdog74 · October 26, 2007, 11:44pm

If you have Python, here's an alternative:

#!/usr/bin/python
data=open("file").readlines()
# stop one line early so data[num+1] never runs past the end of the file
for num in range(len(data)-1):
    if data[num].startswith("NEW") and data[num+1].startswith("CNL"):
        if data[num].split()[2] == data[num+1].split()[2]:
            print data[num],data[num+1]

output:

 # python test.py
NEW 86051 MP097 /27OCT07
CNL 86084 MP097 /27OCT07
