Merge and Sort tabular data from different text files

I have 42 text files; each containing up to 34 lines with following structure;

file1

H-01 23
H-03 5
H-05 9
H-02 14
.
.

file2

H-01 17
H-02 43
H-04 7
H-05 8
H-03 7
.
.

file3

H-01 11
H-05 14
H-04 2
H-02 8
H-06 7
.
.

My desired output;

     file1 file2 file3 . . . file42
H-01 23 17 11
H-02 14 43 8
H-03 5 7 -
H-04 - 7 2
H-05 9 8 14
H-06 - - 7
.
.

The resulting file will contain a data matrix of 42 columns and 34 fields (excluding the headers). I have tried grep, but it prints the first occurrence only. Kindly help me.

Hello Syeda,

For file1 and file2, could you please try following and let me know if this helps.

awk 'BEGIN{print "     file1 file2"}FNR==NR{;A[$1]=$NF;next} ($1 in A){print $1 OFS A[$1] OFS $NF} !($1 in A){print $1 OFS "-" OFS $NF}' file1 file2 | sort -k1,3

Output will be as follows.

     file1 file2
H-01 23 17
H-02 14 43
H-03 5 7
H-04 - 7
H-05 9 8
H-224 - 8
 

Please do let us know if you have any queries on same.

Thanks,
R. Singh

1 Like

Thank you for the script @RavinderSingh13.

There are two queries with the current script;

  1. The output lacks the line which should contain a dash (-).
  2. The headers are appearing at the bottom of each column.

The current output is;

H-01 23 17
H-02 14 43
H-03 5 7
H-05 9 8
     file1 file2

Hello Syeda,

Not sure which command you have used, following and post#2 command provided be me works fine for me as follows.

awk 'BEGIN{print "     file1 file2"}FNR==NR{;A[$1]=$NF;next} ($1 in A){print $1 OFS A[$1] OFS $NF} !($1 in A){print $1 OFS "-" OFS $NF}' file1 file2 | sort -k1,3
     file1 file2
H-01 23 17
H-02 14 43
H-03 5 7
H-04 - 7
H-05 9 8

NOTE: It checks those contents which are present in file2 and print - for those which are present in file2 and NOT in file1.

Thanks,
R. Singh

1 Like

Thank you RavinderSingh13 for the comment. Query 1 is resolved if I place the file with less lines as file1 and that with more lines as file 2. But I need a script irrespective of the file order. And it should be applied to many files (42 in my case).

Query 2 still persists, but it is not a major issue. I will overcome it.

Perhaps something more like:

awk '
BEGIN {	printf("    ")
	s = "sort"
}
FNR == 1 {
	printf(" %s", files[++fc] = FILENAME)
}
{	d[$1, fc] = $2
	c1[$1]
}
END {	print ""
	for(i in c1) {
		printf("%s", i) | s
		for(j = 1; j <= fc; j++)
			printf(" %s", ((i, j) in d) ? d[i, j] : "-") | s
		print "" | s
	}
	close(s)
}' file1 file2 file3 ... file42

As always, if someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

When run with just your sample file1 , file2 , and file3 with the trailing lines containing periods removed, it produces the output:

     file1 file2 file3
H-01 23 17 11
H-02 14 43 8
H-03 5 7 -
H-04 - 7 2
H-05 9 8 14
H-06 - - 7
3 Likes

Thank you Mr. Don Cragun for the correct answer.