How to extract the items common to two adjacent lines in a file

I use sed and awk. I am no expert, but I know them to some extent. I have a file like this:

PFA0165c ctg_6843
PFA0335w ctg_6843 ctg_6871 ctg_6977 ctg_6654 ctg_7052 ctg_6899 ctg_6840 ctg_7202 ctg_6638
PFA0155c ctg_6877 ctg_7169 ctg_7179 ctg_6843 ctg_6871

Now I want output like this

PFA0165c PFA0335w ctg_6843
PFA0165c PFA0335w PFA0155c ctg_6843

That is, the first column of a line should be appended to that of the next line, followed by the items common to the two lines. The first whitespace in each line is a tab and the subsequent separators are single spaces. A common word may be anywhere in the line; for example, ctg_6843 is in the 5th column of the 3rd line.

Sorry, I just can't understand what you are wanting to do :confused: :confused:

I think you are looking for elements that appear in more than one line but I get confused after that.

Could you try explaining again? Perhaps a few more examples might help me see what you mean...

Thank you for taking an interest in this problem.

The input file looks like this; the first whitespace is a tab and the subsequent ones are single spaces.
Here are 3 lines of the file.

PFA0165c ctg_6843
PFA0335w ctg_6843 ctg_6871 ctg_6977 ctg_6654 ctg_7052 ctg_6899 ctg_6840 ctg_7202 ctg_6638
PFA0155c ctg_6877 ctg_7169 ctg_7179 ctg_6843 ctg_6871

I want comparison like this

Compare line 1 with line 2 and take out the common items
Compare line 2 with line 3 and take out the common items
Compare line 3 with line 4 and take out the common items
...
Compare line (n-1) with line n and take out the common items

The first field of every line is unique and is tab-separated from the rest of the line, so in awk you can declare an array a[$1]=$2 with FS="\t". So the only problem is to compare $2 of two adjacent lines.

Now I want to print out
the first field of line 1 and line 2 and the common items
the first field of line 2 and line 3 and the common items
...
the first field of line (n-1) and line n and the common items

Hence the output will be like this
PFA0165c PFA0335w ctg_6843
PFA0335w PFA0155c ctg_6843 ctg_6871

I think I understand now, for any given line, you want to print the first element, followed by the first element of the line below, followed by any items common to both lines - right?

Because this requires a few things to stay in memory, it looks like it would lend itself well to awk or perl. As my awk is rather weak, I'll try perl:

$prevleader="";
$previtems="";
while(<>) {
  chomp;
  if (/^(\S+)\s+(.*)$/) {
    $leader=$1;
    $items=$2;
    if ($prevleader) {
      print "$prevleader $leader";
      foreach $item (split(/\s+/,$items)) {
        # pad with spaces so the first and last items can match too
        if (" $previtems " =~ /\s\Q$item\E\s/) {
          print " $item";
        }
      }
      print "\n";
    }
    # remember this line for comparison with the next one
    $prevleader=$leader;
    $previtems=$items;
  }
}

Not tested but it should do the trick or get you close.
I suspect awk can do it better though :confused:

Thanks a lot, Smiling Dragon.

awk -f test.awk testfile.dat

where test.awk contains,

NR==1 {old_cnt=split($0,old_arr,"[ \t]");}

NR!=1 {
new_cnt=split($0,new_arr,"[ \t]");
# collect every field (after the first) that occurs in both lines
for(i=2;i<=old_cnt;i++)
 for(j=2;j<=new_cnt;j++)
 {
  if(old_arr[i]==new_arr[j]) {cmn=cmn " "  old_arr[i]}
 }

printf("%s %s ",old_arr[1],new_arr[1]);
out_cnt=split(cmn,out_arr," ");
for(i=1;i<=out_cnt;i++)
 printf("%s ",out_arr[i]);
printf("\n");
# the current line becomes the previous line for the next record
old_cnt=new_cnt;
for(i=1;i<=new_cnt;i++) old_arr[i]=new_arr[i];
cmn=" ";
}
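
For comparison, here is a minimal sketch of the approach suggested earlier in the thread: split on the tab with FS="\t", keep $1 and $2 of the previous line, and look each item of the current line up in an array built from the previous line. The function name common_adjacent and the file name sample.dat are made up for this example, and it assumes the items after the tab are separated by single spaces, as described above.

```shell
# Build the three sample lines: a tab after the first field, single spaces after that.
printf 'PFA0165c\tctg_6843\n' > sample.dat
printf 'PFA0335w\tctg_6843 ctg_6871 ctg_6977 ctg_6654 ctg_7052 ctg_6899 ctg_6840 ctg_7202 ctg_6638\n' >> sample.dat
printf 'PFA0155c\tctg_6877 ctg_7169 ctg_7179 ctg_6843 ctg_6871\n' >> sample.dat

# common_adjacent is a hypothetical helper name, not part of any tool.
common_adjacent() {
    awk -F'\t' '
    NR > 1 {
        # index the items of the previous line for direct lookup
        split("", seen)                  # portable way to empty an array
        n = split(prev_items, p, " ")
        for (i = 1; i <= n; i++) seen[p[i]] = 1

        # both leaders first, then every current item also seen on the previous line
        out = prev_leader " " $1
        m = split($2, cur, " ")
        for (i = 1; i <= m; i++)
            if (cur[i] in seen) out = out " " cur[i]
        print out
    }
    # remember this line for comparison with the next one
    { prev_leader = $1; prev_items = $2 }
    ' "$@"
}

common_adjacent sample.dat
```

The array lookup replaces the nested loop, so each pair of lines is compared in a single pass over its items. Common items come out in the order they appear on the second line of each pair.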