pattern matching problem

I have a file with the following contents:

NEW 85174 MP081 /29OCT07
CNL 85986 MP098 /28OCT07
NEW 86014 MP098 /28OCT07
NEW 86051 MP097 /27OCT07
CNL 86084 MP097 /27OCT07

I need to retrieve every line that starts with NEW and is immediately followed by a line that starts with CNL, where both lines have the same MP code (the third field).

So it has to return the last two lines from this example.

This may not be the most efficient solution:

#!/usr/bin/perl
# get_lines.pl
use strict;
my $check_next_line = "no";
my ($new_line, $mp);
while (<>) {
    chomp;
    if (/^NEW/) {
        $new_line = $_;  # save the line beginning with word NEW
        $check_next_line = "yes";
        $mp = (split (/\s+/, $_))[2];  # save the field containing MP code
        next;
    }
    elsif (/^CNL/ && $check_next_line eq "yes") {  # only if the previous line was a NEW line
        if ($mp eq (split (/\s+/, $_))[2]) {
            print $new_line, "\n";
            print;
            print "\n";
        }
    }
    $check_next_line = "no";
}

Run this script as:

perl get_lines.pl new_file

With awk:

awk '/^CNL/&&x=="NEW"$3&&$0=y RS$0
{x=$1$3;y=$0}' filename

Use nawk or /usr/xpg4/bin/awk on Solaris.
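Since the one-liner is dense, here is a rough sketch of the same idea spelled out in Python. The names matching_pairs, prev_key, prev_line and SAMPLE are made up for illustration; prev_key and prev_line stand in for the awk variables x and y. It is not an exact translation of the awk program, and it assumes whitespace-separated fields with the MP code in the third field.

```python
SAMPLE = [
    "NEW 85174 MP081 /29OCT07",
    "CNL 85986 MP098 /28OCT07",
    "NEW 86014 MP098 /28OCT07",
    "NEW 86051 MP097 /27OCT07",
    "CNL 86084 MP097 /27OCT07",
]


def matching_pairs(lines):
    """Yield (new_line, cnl_line) for each NEW line directly followed
    by a CNL line that carries the same MP code."""
    prev_key = None   # like awk's x: first field + third field of the previous line
    prev_line = None  # like awk's y: the previous line itself
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        if len(fields) < 3:
            # assumption: malformed lines break any pending NEW/CNL pair
            prev_key = prev_line = None
            continue
        # awk pattern: /^CNL/ && x == "NEW" $3
        if line.startswith("CNL") and prev_key == "NEW" + fields[2]:
            yield prev_line, line
        # awk action {x = $1 $3; y = $0}: remember this line for the next one
        prev_key = fields[0] + fields[2]
        prev_line = line


if __name__ == "__main__":
    for new_line, cnl_line in matching_pairs(SAMPLE):
        print(new_line)
        print(cnl_line)
```

In the awk version, the third part of the pattern ($0=y RS$0) replaces the current record with the saved line, a newline and the current line; because that string is non-empty the pattern is true and awk's default action prints it.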

OK, great, both solutions work.

With awk I spent 3 hours and got no result; with Java it took 30 minutes and I had a result. The awk solution is much shorter, but I can't understand it.

import java.io.*;


public class GetLine {

	private String prevWord = "";
	private String prevLine = "";
	
	private void parse(String fileName){
		try {
	        BufferedReader in = new BufferedReader(new FileReader(fileName));
	        System.out.println("Reading file: " +fileName);
	        String line = "";
	        while ((line = in.readLine()) != null) {
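	        	// split on runs of whitespace so both tabs and spaces work
	        	String [] word = line.split("\\s+");
	        	if (word.length < 3) {  // skip malformed lines and forget the previous line
	        		prevWord = "";
	        		prevLine = "";
	        		continue;
	        	}
	        	// only a NEW line directly followed by a CNL line with the same MP code counts
	        	if (prevLine.startsWith("NEW") && line.startsWith("CNL") && prevWord.equals(word[2])){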
	        		System.out.println(prevLine);
	        		System.out.println(line);
	        	}
	        	prevWord = word[2];
	        	prevLine = line;
	        }
	        in.close();
	    } catch (IOException e) {
	    	System.err.println(e);
	    }
	}
	
	public static void main(String [] args){
		GetLine getLine = new GetLine();
		getLine.parse(args[0]);
	}
	
}

Does anyone have a solution in C, perhaps? Just for fun.

Hi, rein.

You mentioned the awk and Java. It looked like the perl script from Yogesh Sawant would work -- did you time it?

This looks mostly IO bound, so apart from how the algorithm is coded, I would not expect drastically different times. For example, my experience is that Perl is very close to C for IO-bound cases, but not so close for arithmetic-dense code.

The Perl script might be made a bit more efficient by not splitting more fields than needed (the "o" suffix would not help here, since it only affects patterns that interpolate variables, and these patterns are constant) -- but I'd expect any such gain to be very small ... cheers, drl
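For anyone who wants to check the point about not splitting more fields than needed, here is a minimal timing sketch in Python rather than Perl. The function names and the padded LONG_LINE are invented for illustration; with four-field lines like the ones in this thread there is almost nothing to save, so the padded line only exaggerates the effect. No timings are claimed here, since they depend on the machine and interpreter.

```python
import timeit

LINE = "NEW 86051 MP097 /27OCT07"
# hypothetical wider record, only to make the difference measurable
LONG_LINE = LINE + " " + " ".join("col%d" % i for i in range(50))


def mp_full_split(line):
    # split every field, then take the third one
    return line.split()[2]


def mp_limited_split(line):
    # stop after three splits; everything past the third field stays in one piece
    return line.split(None, 3)[2]


if __name__ == "__main__":
    for name, func in (("full", mp_full_split), ("limited", mp_limited_split)):
        seconds = timeit.timeit(lambda: func(LONG_LINE), number=200000)
        print("%-8s %.3f s" % (name, seconds))
```

Both functions return the same MP code; only the amount of splitting work differs.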

I believe those were not execution timings :)

Hi.

Ah, you're suggesting it was rein's time, apparently. Yes, I could see how one's familiarity with the language at hand could affect time-to-solution ... cheers, drl

If you have Python, here's an alternative:

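#!/usr/bin/python
import sys
data=open("file").readlines()
# stop one line early so data[num+1] always exists
for num in range(len(data)-1):
    if data[num].startswith("NEW") and data[num+1].startswith("CNL"):
        if data[num].split()[2] == data[num+1].split()[2]:
            # write() works in Python 2 and 3; the lines keep their own newlines
            sys.stdout.write(data[num] + data[num+1])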

output:

 # python test.py
NEW 86051 MP097 /27OCT07
CNL 86084 MP097 /27OCT07
