Awk match multiple columns in multiple lines in single file

jacobs.smith · June 26, 2012, 4:00pm

Hi,

Input

7488	7389	chr1.fa	chr1.fa
3546	9887	chr5.fa	chr9.fa
7387	7898	chrX.fa	chr3.fa
7488	7389	chr21.fa	chr3.fa
7488	7389	chr1.fa	chr1.fa
3546	9887	chr9.fa	chr5.fa
7898	7387	chrX.fa	chr3.fa

Desired Output

7488	7389	chr1.fa	chr1.fa	2
3546	9887	chr5.fa	chr9.fa	2
7387	7898	chrX.fa	chr3.fa	2
7488	7389	chr21.fa	chr3.fa	1
7488	7389	chr1.fa	chr1.fa	2
3546	9887	chr9.fa	chr5.fa	2
7898	7387	chrX.fa	chr3.fa	2

I want to count each line's occurrence and print its occurrence in the fifth column.

Even though the first and second columns are interchanged (third and seventh records) and the third and fourth columns are interchanged (second and sixth records), such lines still need to be counted as the same line.

So far I have tried this, which gives the undesired output below:

awk -F, 'NR==FNR{a[$0]++;next}{print $0 "\t" a[$0]}' input input

7488	7389	chr1.fa	chr1.fa	2
3546	9887	chr5.fa	chr9.fa	1
7387	7898	chrX.fa	chr3.fa	1
7488	7389	chr21.fa	chr3.fa	1
7488	7389	chr1.fa	chr1.fa	2
3546	9887	chr9.fa	chr5.fa	1
7898	7387	chrX.fa	chr3.fa	1

---------- Post updated at 04:00 PM ---------- Previous update was at 03:34 PM ----------

Hi Corona,

Each line's occurrence.

For ex:

hello world
world hello

should be considered the same while reading the input. Then the output will be

hello world 2
world hello 2

because we consider "hello world" to be present two times in the file.

Corona688 · June 26, 2012, 4:00pm

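awk 'NR==FNR {
        # First pass: count each pair from columns 1 and 2, smaller value first
        if($1 < $2) { A=$1; B=$2 } else { A=$2; B=$1 }
        ARR[A ":" B]++; next }

        {
                # Second pass: rebuild the same key and append its count
                if($1 < $2) { A=$1; B=$2 } else { A=$2; B=$1 }
                print $0, ARR[A ":" B];
        }' OFS="\t" input input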

jacobs.smith · June 27, 2012, 9:18am

Hi corona,

Thanks for the solution.

I think it is considering only the first two columns, but the last two columns should also be considered. This is the output from your solution:

7488	7389	chr1.fa	chr1.fa	3
3546	9887	chr5.fa	chr9.fa	2
7387	7898	chrX.fa	chr3.fa	2
7488	7389	chr21.fa	chr3.fa	3
7488	7389	chr1.fa	chr1.fa	3
3546	9887	chr9.fa	chr5.fa	2
7898	7387	chrX.fa	chr3.fa	2

---------- Post updated 06-27-12 at 09:18 AM ---------- Previous update was 06-26-12 at 04:04 PM ----------

Any thoughts, anyone?
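
One possible way to extend the two-pass approach above so that columns 3 and 4 are also treated as an unordered pair is sketched below. It is only a sketch and assumes tab-separated input with four columns. The file name sample_input.tsv and the names pairkey, reckey and count_pairs are made up for illustration and do not come from the thread.

```shell
# Recreate the sample data from the first post in a hypothetical file.
printf '%s\t%s\t%s\t%s\n' \
  7488 7389 chr1.fa  chr1.fa \
  3546 9887 chr5.fa  chr9.fa \
  7387 7898 chrX.fa  chr3.fa \
  7488 7389 chr21.fa chr3.fa \
  7488 7389 chr1.fa  chr1.fa \
  3546 9887 chr9.fa  chr5.fa \
  7898 7387 chrX.fa  chr3.fa > sample_input.tsv

# count_pairs FILE: read FILE twice; the first pass tallies keys,
# the second pass prints each line followed by its tally.
count_pairs() {
  awk -F'\t' -v OFS='\t' '
    # Order-independent key for two values; "" forces a string comparison.
    function pairkey(a, b) { return ((a "") < (b "")) ? a SUBSEP b : b SUBSEP a }

    # Record key: unordered columns 1 and 2 plus unordered columns 3 and 4.
    function reckey() { return pairkey($1, $2) SUBSEP pairkey($3, $4) }

    NR == FNR { count[reckey()]++; next }   # first pass over the file
    { print $0, count[reckey()] }           # second pass over the file
  ' "$1" "$1"
}

count_pairs sample_input.tsv
```

On the sample data this should give 2 for every line except the chr21.fa line, which gets 1, matching the desired output in the first post. Lines are only matched when columns are swapped within a pair (1 with 2, or 3 with 4), not across pairs.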