Hi. I am not sure the title gives the best description of what I want to do. I also tried posting this in "UNIX for Dummies Questions & Answers", but no one there was able to help.
I have several text files that contain data in many columns. All the files are organized the same way, but the data in the columns may differ. I want to count the number of times data occur in specific columns, sort the output and make a new file. However, I want to check several files for occurrences of the same data, count the number of times each combination occurs, append the names of the files it occurs in, and make a new file sorted by the number of occurrences.
File 1:
xx xx xx aab rrt xx
xx xx xx ccd bbt xx
xx xx xx ggt iir xx
File 2:
xx xx xx ggt iir xx
xx xx xx ccd bbt xx
File 3:
xx xx xx aab rrt xx
xx xx xx ggt iir xx
First I modified each file individually (is there a better way?) so that the file name appears in the first column:
sed 's/^/File1\t/' file1.temp > 1.txt
This gives files with:
File1:
File1 xx xx xx aab rrt xx
File1 xx xx xx ccd bbt xx
File1 xx xx xx ggt iir xx
File2:
File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx
File3:
File3 xx xx xx aab rrt xx
File3 xx xx xx ggt iir xx
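On the "any better way?" point, one possible approach (a sketch, not tested beyond the sample data above) is to let awk tag every line itself, since awk sets the built-in variable FILENAME to the file it is currently reading. The scratch directory and the output name tagged.txt are assumptions made up for this illustration, and the labels come out as file1, file2, file3 because they are derived from the actual file names.

```shell
# Recreate the three sample inputs from the post in a scratch directory
# (only so that the example is self-contained).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'xx xx xx aab rrt xx' 'xx xx xx ccd bbt xx' 'xx xx xx ggt iir xx' > file1.temp
printf '%s\n' 'xx xx xx ggt iir xx' 'xx xx xx ccd bbt xx' > file2.temp
printf '%s\n' 'xx xx xx aab rrt xx' 'xx xx xx ggt iir xx' > file3.temp

# FILENAME holds the name of the file being read, so all inputs can be
# tagged in a single pass instead of one sed call per file.
# tagged.txt is a hypothetical output name.
awk '{
    name = FILENAME
    sub(/\.temp$/, "", name)   # drop the .temp extension from the label
    print name "\t" $0         # label, tab, then the original line
}' file*.temp > tagged.txt
```

The column extraction from the next step would then read tagged.txt instead of *.txt.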
Then I extracted the columns of interest and sorted them and made a new file:
awk '{print $1,$5,$6}' *.txt |sort -k2 > output.txt
The output.txt file could look like this:
File1 aab rrt
File3 aab rrt
File1 ccd bbt
File2 ccd bbt
File2 ggt iir
File3 ggt iir
File1 ggt iir
Now, I want to count the number of lines in which column 2 and column 3 are identical, and keep the first-column information in the output file, separated by commas or similar. I want the result to look like this:
2 ccd bbt File1,File2
2 aab rrt File1,File3
3 ggt iir File1,File2,File3
It would be good (but not a requirement) for the last column of the final file to be sorted (File1, File2, File3, etc.). The file names can also be placed in separate columns if that is easier.
So far I have tried to use:
awk '{print $1,$5,$6}' *.txt |sort -k2|uniq -f1 -c|sort -g > final_output.txt
However, I am not able to get the first-column data merged in the final output file. How should I go about doing that?
-James