Counting number of times content in columns occurs in other files

ncwxpanther · September 13, 2016, 3:47pm

I need to figure out how many times a location (columns 1 and 2) is present within a group of files. I figured using a combination of 'while read' and 'grep' I could count the number of instances but its not working for me.

cat file.txt | while read line
do 
grep $line *08-new.txt | wc -l
done

The contents of column 3 are arbitrary in this case. Note that all locations in file.txt (source input) should have at least 1 instance in *08-new.txt, but there could be locations in *08-new.txt that are not in file.txt (source input). The files are titled {1981..2016}08-new.txt

file.txt (source input)

   30.68   -88.24   34
   32.30   -86.41   0.26
   35.14   -111.67   1.27
   33.43   -112.00   -0.08
   34.65   -112.43   0.63
   32.13   -110.96   0.60

198108-new.txt

   30.68   -88.24   1.14
   32.30   -86.41   0.26
   35.14   -111.67   1.27
   33.43   -112.00   -0.08
   34.65   -112.43   0.63
   32.13   -110.96   0.60

198208-new.txt

   30.68   -88.24   1.14
   32.30   -86.41   0.26
   35.14   -111.67   1.27
   30.48   -87.19   0.51
   34.65   -112.43   0.63
   32.13   -110.96   0.60

201608-new.txt

   30.68   -88.24   1.14
   32.30   -86.41   0.26
   35.14   -111.67   1.27
   33.43   -112.00   -0.08
   34.65   -112.43   0.63
   39.94   -91.19   1.80

Output

grep: invalid option -- .
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
0

Expected output

   30.68   -88.24   3
   32.30   -86.41   3
   35.14   -111.67   3
   33.43   -112.00   2
   34.65   -112.43   3
   32.13   -110.96   2

In the end I need to be able to have each location and its count (column 3) stored in a variable to be used further down the script.

Padow1 · September 13, 2016, 4:14pm

Here is one way to do it. If you need the values put in to a variable for later use, follow the example below of i=`command` to have it for later use. Note that this script follows your logic of counting the number of files that the string appears in, not the actual count of the number of times that it occurs.

while read line; do 
  i=`echo $line |awk '{ print $1, " ",  $2 }'`  # need to preserve the 3 spaces
  echo "$i `grep  "$i" *new.txt| wc -l`"
  done < file.txt

Mannu2525 · September 13, 2016, 4:27pm

You can use Sort mechanism to get your desired output. Sort by keys of 1st and 2nd column, get the count and print.

sort -k1,1nr -k2,2nr file.txt *08-new.txt | uniq -c | awk '{print $2,$3,$1}

HTH

RudiC · September 13, 2016, 4:52pm

Try also

awk 'NR==FNR {T[$1,$2]; next} ($1,$2) in T {T[$1,$2]++} END {for (t in T) print t, T[t]}' SUBSEP=" " file[1-4]
33.43 -112.00 2
35.14 -111.67 3
32.30 -86.41 3
34.65 -112.43 3
30.68 -88.24 3
32.13 -110.96 2

RavinderSingh13 · September 13, 2016, 5:55pm

rudic:

Try also

awk 'NR==FNR {T[$1,$2]; next} ($1,$2) in T {T[$1,$2]++} END {for (t in T) print t, T[t]}' SUBSEP=" " file[1-4]
33.43 -112.00 2
35.14 -111.67 3
32.30 -86.41 3
34.65 -112.43 3
30.68 -88.24 3
32.13 -110.96 2

Hello RudiC,

Thank you for nice code. I think OP has requested if there are at least one occurrences of fields 1st and 2nd from source file/Input_file(file.txt).
So if have to put a condition at last to check that else it will print those lines too which are coming only in very first/source file too.

awk 'NR==FNR {T[$1,$2]; next} ($1,$2) in T {T[$1,$2]++} END {for (t in T){if(T[t]>1){print t, T[t]}}}' SUBSEP=" "  file.txt  198108-new.txt  198208-new.txt  201608-new.txt

Output will be as follows.

35.14 -111.67 3
32.30 -86.41 3
30.68 -88.24 3
34.65 -112.43 3
33.43 -112.00 2
32.13 -110.96 2

Thanks,
R. Singh

RudiC · September 14, 2016, 4:21am

ncwxpanther · September 15, 2016, 2:21pm

Thanks Padow. This worked as needed. I did not get a chance to try the other suggestions. But hope to soon.