How to extract the words common to two adjacent lines in a file

jam_ali49 · November 16, 2007, 4:28am

I use sed and awk. I am not a big expert, but I know them to some extent. I have a file like this:

PFA0165c ctg_6843
PFA0335w ctg_6843 ctg_6871 ctg_6977 ctg_6654 ctg_7052 ctg_6899 ctg_6840 ctg_7202 ctg_6638
PFA0155c ctg_6877 ctg_7169 ctg_7179 ctg_6843 ctg_6871

Now I want output like this

PFA0165c PFA0335w ctg_6843
PFA0165c PFA0335w PFA0155c ctg_6843

It means the first column of a line should be appended to that of the next line, followed by the words common to these two lines. The first whitespace in each line is a tab and the subsequent ones are single spaces. A common word may be anywhere in the line; for example, ctg_6843 is in the 5th column of the 3rd line.

Smiling_Dragon · November 18, 2007, 7:42pm

Sorry, I just can't understand what you are wanting to do.

I think you are looking for elements that appear in more than one line but I get confused after that.

Could you try explaining again? Perhaps a few more examples might help me see what you mean...

jam_ali49 · November 21, 2007, 6:11am

Thank you for taking an interest in this problem.

The input file is like this; the first whitespace is a tab and the subsequent ones are single spaces.
Here are 3 lines of the file.

PFA0165c ctg_6843
PFA0335w ctg_6843 ctg_6871 ctg_6977 ctg_6654 ctg_7052 ctg_6899 ctg_6840 ctg_7202 ctg_6638
PFA0155c ctg_6877 ctg_7169 ctg_7179 ctg_6843 ctg_6871

I want comparison like this

Compare line 1 with line 2 and take out the common
Compare line 2 with line 3 and take out the common
Compare line 3 with line 4 and take out the common
...
Compare line (n-1) with line n and take out the common

The first field of every line is unique and is tab-separated from the rest of the line, so in awk you can declare an array a[$1]=$2 with FS="\t". So the only problem is to compare $2 of two adjacent lines.

Now I want to print out
first field of line 1 and line 2 and the common
first field of line 2 and line 3 and the common

...
first field of line (n-1) and line n and the common

Hence the output will be like this
PFA0165c PFA0335w ctg_6843
PFA0335w PFA0155c ctg_6843 ctg_6871
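
The adjacent-line comparison described above could be sketched in awk roughly as follows, wrapped in a shell function so it can be piped into. This is only a sketch under the stated input format (id, a tab, then space-separated items); the function name common_adjacent and the variable names are made up for illustration, and the sample input is shortened from the lines shown in this post.

```shell
# common_adjacent is a hypothetical helper name, not something from the thread.
# Assumes each input line is "<id><TAB><item> <item> ..." as described above.
common_adjacent() {
  awk -F'\t' '
    NR > 1 {
      # Index the previous line items so each lookup is a single array test.
      split("", seen)                  # portable way to empty an awk array
      n = split(prev_items, p, " ")
      for (i = 1; i <= n; i++) seen[p[i]] = 1

      # Both ids first, then every current item that the previous line also had.
      out = prev_id " " $1
      m = split($2, c, " ")
      for (i = 1; i <= m; i++) if (c[i] in seen) out = out " " c[i]
      print out
    }
    # Remember this line for the comparison with the next one.
    { prev_id = $1; prev_items = $2 }
  ' "$@"
}

printf 'PFA0165c\tctg_6843\nPFA0335w\tctg_6843 ctg_6871\nPFA0155c\tctg_6877 ctg_6843 ctg_6871\n' | common_adjacent
```

Common items come out in the order they appear on the later of the two lines. Passing a file name, as in common_adjacent testfile.dat, should work as well, since the arguments are handed straight to awk.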

Smiling_Dragon · November 21, 2007, 9:26pm

I think I understand now: for any given line, you want to print the first element, followed by the first element of the line below, followed by any items common to both lines. Right?

Smiling_Dragon · November 21, 2007, 10:30pm

Because this requires a few things to stay in memory, it looks like it would lend itself well to awk or perl. As my awk is rather weak, I'll try perl:

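$prevleader="";
$previtems="";
while(<>) {
  if (/^(\S+)\s+(.*)$/) {
    $leader=$1;
    $items=$2;
    if ($prevleader) {
      print "$prevleader $leader";
      foreach $item (split(/\s+/,$items)) {
        # pad with spaces so the first and last items of the previous line can match too
        if (" $previtems " =~ /\s\Q$item\E\s/) {
          print " $item";
        }
      }
      print "\n";
    }
    # remember this line for the next comparison (must also happen on the very first line)
    $prevleader=$leader;
    $previtems=$items;
  }
}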

Not tested but it should do the trick or get you close.
I suspect awk can do it better though

jam_ali49 · November 22, 2007, 12:48pm

Thanks a lot, Smiling_Dragon.

ranj1 · November 23, 2007, 3:48am

awk -f test.awk testfile.dat

where test.awk contains,

NR==1 {old_cnt=split($0,old_arr,"[ \t]");}

NR!=1 {
new_cnt=split($0,new_arr,"[ \t]");
for(i=2;i<=old_cnt;i++)
 for(j=2;j<=new_cnt;j++)
 {
  if(old_arr[i]==new_arr[j]) {cmn=cmn " "  old_arr[i]}
 }

printf("%s %s ",old_arr[1],new_arr[1]);
out_cnt=split(cmn,out_arr," ");
for(i=1;i<=out_cnt;i++)
 printf("%s ",out_arr[i]);
printf("\n");
old_cnt=new_cnt;
for(i=1;i<=new_cnt;i++) old_arr[i]=new_arr[i];
cmn=" ";
}