Count and merge based on file-name

My file looks like this:

NYC0001_S100.dat
NYC0001_S101.dat
NYC0001_S102.dat
NYC0002_S100.dat
NYC0002_S101.dat
NYC0002_S103.dat
NYC0003_S101.dat
NJ0003_S104.dat
NYC0004_S101.dat
NJ0005_S107.dat

I want to split each file name on the delimiter "_", take the first part as the file-stem, and get the number of files per unique file-stem.

To get the number of input files per file-stem, I can do the following:
cat file | cut -d_ -f1 | sort |uniq -c | awk '{print $2, $1}' OFS="\t"

This gives me:

NJ0003  1
NJ0005  1
NYC0001 3
NYC0002 3
NYC0003 1
NYC0004 1
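
The same counts can also be produced in a single awk pass instead of the cut/sort/uniq chain. This is only a sketch: the scratch file and the variable names list and counts are made up here so that the snippet runs on its own, and the sample list is the one from the question.

```shell
# Recreate the sample list from the question in a scratch file.
list=$(mktemp)
printf '%s\n' NYC0001_S100.dat NYC0001_S101.dat NYC0001_S102.dat \
  NYC0002_S100.dat NYC0002_S101.dat NYC0002_S103.dat \
  NYC0003_S101.dat NJ0003_S104.dat NYC0004_S101.dat NJ0005_S107.dat > "$list"

# Count lines per stem (the text before the first "_"), then sort by stem.
counts=$(awk -F_ '{ n[$1]++ } END { for (s in n) print s "\t" n[s] }' "$list" | sort)
printf '%s\n' "$counts"

rm -f "$list"
```

awk does not guarantee any order for the "for (s in n)" loop, which is why the output is piped through sort.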

Now I want to write a script that merges all the files with the same identifier (e.g. NYC0001) using a program:

./datamerge -I <1st input file> \
            -I <2nd input file> \
            -I <3rd input file> -O NYC0001.dat

where the number of -I inputs depends on the file count for that identifier, e.g. 3 in this example.

So, for example:

./datamerge -I NYC0001_S100.dat -I NYC0001_S101.dat -I NYC0001_S102.dat -O NYC0001.dat

Assuming that the part of the input files before the underscore is what you want to work on, you could get a list of those and loop around them something like this:-

for file_prefix in $(cut -f1 -d "_" input_file | sort -u)
do
   unset input_params         # Making sure that there is nothing left over from the previous file-prefix
   for data_file in $(grep "^${file_prefix}_" input_file)
   do
      input_params="${input_params} -I ${data_file}"
   done

   printf "Running the datamerge with parameters\n\t%s\nand writing to file\n\t%s\n\n" "${input_params}" "${file_prefix}.dat"
   # input_params is left unquoted on purpose so that it splits into separate -I arguments
   ./datamerge ${input_params} -O "${file_prefix}.dat"
done

The printf is just to show you what it's doing, so you can check it is correct. You can remove it if you prefer.

You could add a counter to the inner loop if you want to report file counts. Looping over the names actually listed also removes any concern about gaps in the sequence, e.g. if you had these entries in your input file:-

Robin1_S101.dat
Robin1_S102.dat
Robin1_S104.dat
Robin1_S105.dat

Would you want it to report four files and try to read Robin1_S101 to Robin1_S104, then fail because Robin1_S103.dat is missing, or would you want it to use the names you have? You may prefer to fail and alert that something is missing. That is your choice, based on your requirement.
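
A rough sketch of the counter idea, using the Robin1 list above. The scratch file and the names file_count and line are just for illustration, and it only prints what it would pass on rather than calling datamerge:

```shell
# Scratch file standing in for the real input file (illustrative only).
input_file=$(mktemp)
printf '%s\n' Robin1_S101.dat Robin1_S102.dat Robin1_S104.dat Robin1_S105.dat > "$input_file"

for file_prefix in $(cut -d_ -f1 "$input_file" | sort -u)
do
   input_params=""
   file_count=0
   for data_file in $(grep "^${file_prefix}_" "$input_file")
   do
      input_params="${input_params} -I ${data_file}"
      file_count=$((file_count + 1))      # one more file seen for this prefix
   done
   line="${file_prefix}: ${file_count} file(s):${input_params}"
   echo "$line"
done

rm -f "$input_file"
```

Because the count comes from the names that are actually listed, the missing S103 does not matter here: it reports four files and uses exactly those four.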

I hope that this helps,
Robin

A smart solution with bash

#!/bin/bash
# bash 4 required
declare -A arr

# Read the input file into the array
while IFS="_" read -r base ext
do
  arr[$base]+=" -I ${base}_${ext}"
done < file

# Loop over the indices
for i in "${!arr[@]}"
do
  # ${arr[$i]} is left unquoted so that it splits into separate -I arguments
  echo ./datamerge ${arr[$i]} -O "$i".dat
done

Omit the echo if you want to run the lines as commands.
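
One thing to watch once the echo is gone: a quoted string reaches the program as one single argument, so the string of -I options has to be word-split for datamerge to see them separately. A small demonstration, with a made-up count_args function standing in for datamerge:

```shell
# count_args is a stand-in for datamerge; it just reports how many arguments it got.
count_args() { echo "$#"; }

input_params=" -I NYC0001_S100.dat -I NYC0001_S101.dat -I NYC0001_S102.dat"

quoted=$(count_args "$input_params")    # the whole string arrives as one argument
split=$(count_args $input_params)       # unquoted: split on whitespace into six words

echo "quoted: $quoted argument(s), unquoted: $split argument(s)"
```

This relies on the file names containing no spaces or wildcard characters, which holds for the names shown in this thread.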


I got some errors:

line 11: unexpected EOF while looking for matching `"'
./rbatte1.sh: line 14: syntax error: unexpected end of file