Merge and Sort tabular data from different text files

Syeda_Sumayya · October 15, 2015, 1:30am

I have 42 text files; each containing up to 34 lines with following structure;

file1

H-01 23
H-03 5
H-05 9
H-02 14
.
.

file2

H-01 17
H-02 43
H-04 7
H-05 8
H-03 7
.
.

file3

H-01 11
H-05 14
H-04 2
H-02 8
H-06 7
.
.

My desired output;

     file1 file2 file3 . . . file42
H-01 23 17 11
H-02 14 43 8
H-03 5 7 -
H-04 - 7 2
H-05 9 8 14
H-06 - - 7
.
.

The resulting file will contain a data matrix of 42 columns and 34 fields (excluding the headers). I have tried grep, but it prints the first occurrence only. Kindly help me.

RavinderSingh13 · October 15, 2015, 1:46am

Hello Syeda,

For file1 and file2, could you please try following and let me know if this helps.

awk 'BEGIN{print "     file1 file2"}FNR==NR{;A[$1]=$NF;next} ($1 in A){print $1 OFS A[$1] OFS $NF} !($1 in A){print $1 OFS "-" OFS $NF}' file1 file2 | sort -k1,3

Output will be as follows.

     file1 file2
H-01 23 17
H-02 14 43
H-03 5 7
H-04 - 7
H-05 9 8
H-224 - 8

Please do let us know if you have any queries on same.

Thanks,
R. Singh

Syeda_Sumayya · October 15, 2015, 2:54am

Thank you for the script @RavinderSingh13.

There are two queries with the current script;

The output lacks the line which should contain a dash (-).
The headers are appearing at the bottom of each column.

The current output is;

H-01 23 17
H-02 14 43
H-03 5 7
H-05 9 8
     file1 file2

RavinderSingh13 · October 15, 2015, 2:59am

Hello Syeda,

Not sure which command you have used, following and post#2 command provided be me works fine for me as follows.

awk 'BEGIN{print "     file1 file2"}FNR==NR{;A[$1]=$NF;next} ($1 in A){print $1 OFS A[$1] OFS $NF} !($1 in A){print $1 OFS "-" OFS $NF}' file1 file2 | sort -k1,3
     file1 file2
H-01 23 17
H-02 14 43
H-03 5 7
H-04 - 7
H-05 9 8

NOTE: It checks those contents which are present in file2 and print - for those which are present in file2 and NOT in file1.

Thanks,
R. Singh

Syeda_Sumayya · October 15, 2015, 3:16am

Thank you RavinderSingh13 for the comment. Query 1 is resolved if I place the file with less lines as file1 and that with more lines as file 2. But I need a script irrespective of the file order. And it should be applied to many files (42 in my case).

Query 2 still persists, but it is not a major issue. I will overcome it.

Don_Cragun · October 15, 2015, 5:40am

Perhaps something more like:

awk '
BEGIN {	printf("    ")
	s = "sort"
}
FNR == 1 {
	printf(" %s", files[++fc] = FILENAME)
}
{	d[$1, fc] = $2
	c1[$1]
}
END {	print ""
	for(i in c1) {
		printf("%s", i) | s
		for(j = 1; j <= fc; j++)
			printf(" %s", ((i, j) in d) ? d[i, j] : "-") | s
		print "" | s
	}
	close(s)
}' file1 file2 file3 ... file42

As always, if someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

When run with just your sample file1 , file2 , and file3 with the trailing lines containing periods removed, it produces the output:

     file1 file2 file3
H-01 23 17 11
H-02 14 43 8
H-03 5 7 -
H-04 - 7 2
H-05 9 8 14
H-06 - - 7

Syeda_Sumayya · October 16, 2015, 6:32am

Thank you Mr. Don Cragun for the correct answer.