Select and copy .csv files based on row and column number

Dear UNIX experts,

I'm a command-line novice working on a Macintosh (Bash shell) and have found advice pertinent to my problem neither on the internet nor in this forum.

I have hundreds of .csv files in a directory, and I would like to copy the subset of files that contain exactly 6 rows and 4 columns to a new directory. Since the .csv files are indeed comma-separated, I found that

awk -F',' ' { print NF }' test.csv

yields a column of six 4's for a file I want, but I do not know how to make use of this output to select and copy numerous files in a loop.

I would be very grateful for any advice and apologize if I missed that a similar question had already been asked on this forum.

Thank you,
Robert

Welcome to the forum.

I'm afraid it's not as easy and straightforward as you assumed. Still, it's not a major problem. Try

awk -F',' '
FNR == 1        {if (NFOK && (OLDFNR == 6)) print "echo cp -v " OLDFN " new_dir/"
                 NFOK = 1
                }
NF != 4         {NFOK = 0
                }
                {OLDFNR = FNR
                 OLDFN  = FILENAME
                }
END             {if (NFOK && (OLDFNR == 6)) print  "echo cp -v " OLDFN " new_dir/"
                }
 ' *.csv | sh

This runs through all the .csv files in your current working directory, checks that each file consistently has four fields on every line and exactly 6 lines, and, for each match, prints a copy command into the pipe for the shell running downstream.
When you are happy with what you see, remove the echo so the real copy is executed.
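
Just for comparison, the same selection could also be done with a small per-file shell loop instead of generating shell code - only a rough sketch, and it assumes new_dir/ already exists; it also starts one awk process per file, so it will be slower with very many files:

for f in *.csv
do
    # reject the file as soon as a line does not have 4 fields; otherwise require exactly 6 lines
    if awk -F',' 'NF != 4 {bad = 1; exit} END {exit (bad || NR != 6)}' "$f"
    then
        cp -v "$f" new_dir/
    fi
done

Because "$f" is quoted wherever it is used, file names containing characters special to the shell need no further treatment in this variant.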

Dear Rudi,

Many thanks for your suggestion. I'm sure I'm missing something stupid, but this prompts the following error message:

sh: line 4: syntax error near unexpected token `('
sh: line 4: `echo cp -v 503_BIA_riablo_2016-Jan-13_(1).csv new_dir/'

I've created the directory new_dir/ within the directory containing the .csv's. Was there anything else I needed to do to adapt your code?

I should add: After removing the

echo

command, the script successfully copies a small number of correctly chosen .csv's before stopping with the above error message.

Thanks again,
Robert

More info needed. Drop the | sh and post the printed result. You don't need to stick to "new_dir" - feel free to adapt it to your situation; change it to an entirely different path if need be.

OK, here it comes...

awk -F',' '
FNR == 1        {if (NFOK && (OLDFNR == 6)) print "echo cp -v " OLDFN " new_dir/"
                 NFOK = 1
                }
NF != 4         {NFOK = 0
                }
                {OLDFNR = FNR
                 OLDFN  = FILENAME
                }
END             {if (NFOK && (OLDFNR == 6)) print  "echo cp -v " OLDFN " new_dir/"
                }
 ' *.csv | sh

yields

cp -v file1.csv new_dir/
cp -v 497_BIA_riablo_2016-Jan-28.csv new_dir/
cp -v 502_BIA_riablo_2016-Mar-08.csv new_dir/
sh: line 4: syntax error near unexpected token `('
sh: line 4: `echo cp -v 503_BIA_riablo_2016-Jan-13_(1).csv new_dir/'

I suppose this might be related to the parentheses in the file name?

awk -F',' '
FNR == 1        {if (NFOK && (OLDFNR == 6)) print "echo cp -v " OLDFN " new_dir/"
                 NFOK = 1
                }
NF != 4         {NFOK = 0
                }
                {OLDFNR = FNR
                 OLDFN  = FILENAME
                }
END             {if (NFOK && (OLDFNR == 6)) print  "echo cp -v " OLDFN " new_dir/"
                }
 ' *.csv

runs through smoothly and prints lines like the following for approximately 90 files, which I assume are all the files I have that meet the above conditions. Note that the line for the file with the parentheses,

echo cp -v 503_BIA_riablo_2016-Jan-13_(1).csv new_dir/

is also printed without complaint.

awk -F',' '
FNR == 1        {if (NFOK && (OLDFNR == 6)) print " cp -v " OLDFN " new_dir/"
                 NFOK = 1
                }
NF != 4         {NFOK = 0
                }
                {OLDFNR = FNR
                 OLDFN  = FILENAME
                }
END             {if (NFOK && (OLDFNR == 6)) print  " cp -v " OLDFN " new_dir/"
                }
 ' *.csv

runs through, printing something like

 cp -v 503_BIA_riablo_2016-Jan-13_(1).csv new_dir/

for the above ~90 files but does not copy anything.

By contrast,

awk -F',' '
FNR == 1        {if (NFOK && (OLDFNR == 6)) print " cp -v " OLDFN " new_dir/"
                 NFOK = 1
                }
NF != 4         {NFOK = 0
                }
                {OLDFNR = FNR
                 OLDFN  = FILENAME
                }
END             {if (NFOK && (OLDFNR == 6)) print  " cp -v " OLDFN " new_dir/"
                }
 ' *.csv | sh

does copy 3 files before stopping with the

sh: line 4: syntax error near unexpected token `('
sh: line 4: ` cp -v 503_BIA_riablo_2016-Jan-13_(1).csv new_dir/'

error message.

Yes, absolutely.

Try to quote the file names:

awk -F',' '
FNR == 1        {if (NFOK && (OLDFNR == 6)) print "echo cp -v \"" OLDFN "\" new_dir/"
                 NFOK = 1
                }
NF != 4         {NFOK = 0
                }
                {OLDFNR = FNR
                 OLDFN  = FILENAME
                }
END             {if (NFOK && (OLDFNR == 6)) print  "echo cp -v \"" OLDFN "\" new_dir/"
                }
' *.csv

I hope I didn't mess anything else up in my hurry...
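
With the escaped double quotes in place, the line generated for the problematic file should now come out as

echo cp -v "503_BIA_riablo_2016-Jan-13_(1).csv" new_dir/

so the parentheses are no longer taken as shell syntax when the output is piped to sh (the file name above is just your example, used for illustration).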


That did the trick! Thanks a huge lot, you've just made my life quite a bit easier!
Best, Robert

Hi, Robert

Good that you have a solution. I am going to present another solution, one that does not use awk. I definitely recommend that you learn awk, especially if you are going to deal with columns (I usually call them fields), but if you wish to use only simple command-line utilities, here is one approach. First, we ignore any file that does not have exactly 6 lines. Then, among the remaining files, we ignore any that do not contain exactly 18 commas in total: if each of the 6 lines has 4 fields, each line has 3 comma separators, so 6 x 3 -> 18. The utilities used are wc, to count lines and characters, and tr, which can delete (and transform) characters. So for a file that has your 6 lines and 4 fields per line, we delete everything except the commas and then count them. Any file with a count other than 18 is ignored; any file name that remains is collected into a text string (a shell variable).
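
Stripped of all the demonstration code in the script below, the core test looks roughly like this - a sketch only, with somefile.csv as a placeholder name:

# keep a file only if it has exactly 6 lines and exactly 18 commas (6 lines x 3 separators)
if [ $(wc -l < somefile.csv) -eq 6 ] &&
   [ $(tr -d -c ',' < somefile.csv | wc -c) -eq 18 ]
then
  echo "somefile.csv qualifies"
fi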

This script has a lot of other stuff in it to show the data files, results, etc., such as a debug function db. (To see what the results look like without debugging output, simply reverse the order of the 2 lines defining the db function.) Just concentrate on the ideas above and look at the files on which wc and tr operate. Ready?

#!/usr/bin/env bash

# @(#) s2       Demonstrate counting lines and fields, wc and tr.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { : ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
C=$HOME/bin/context && [ -f $C ] && $C head wc tr

FILE=${1-data?}

pl " Input data files $FILE:"
head $FILE

pl " Line counts for data files $FILE:"
wc -l $FILE

pl " Results:"
l6x4=""
for item in $FILE
do
  if [ $(wc -l <$item) != 6 ]
  then
    db " ignore file $item, not 6 lines"
    continue
  else
    if [ $(tr -d -c ',' <$item | wc -c) != 18 ]
    then
      db " ignore file $item, not 18 commas"
      continue
    fi
    l6x4="$l6x4 $item"
  fi
done
db " list of files to copy: $l6x4"
pe " sample: cp $l6x4 some-directory"

exit 0

producing:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
bash GNU bash 4.3.30
head (GNU coreutils) 8.23
wc (GNU coreutils) 8.23
tr (GNU coreutils) 8.23

-----
 Input data files data?:
==> data1 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4

==> data2 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1
4.1
5.1
6.1
7.1

==> data3 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4,3.5

==> data4 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4





==> data5 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3,6.4

==> data6 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3;6.4 <- semi-colon is not a comma

==> data7 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3,6.4

==> data8 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3,6.4,6.5

==> data9 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3         <- only 3 fields: so 2 commas
6.1,6.2,6.3,6.4,6.5 <- but  5 fields: so 4 commas

-----
 Line counts for data files data?:
  2 data1
  7 data2
  3 data3
  6 data4
  6 data5
  6 data6
  6 data7
  6 data8
  6 data9
 48 total

-----
 Results:
 db,  ignore file data1, not 6 lines
 db,  ignore file data2, not 6 lines
 db,  ignore file data3, not 6 lines
 db,  ignore file data4, not 18 commas
 db,  ignore file data6, not 18 commas
 db,  ignore file data8, not 18 commas
 db,  list of files to copy:  data5 data7 data9
 sample: cp  data5 data7 data9 some-directory

OK, so what's good here? It's fairly straightforward command-line stuff, although some of it, like the comparisons, may be new. That means you don't need to learn another language like awk or perl (but, of course, you probably should). Just simple utilities are used.

So what's bad? First, awk and perl can do many operations in a single process, whereas here each file causes a number of separate program executions. For a small number of files, say a few hundred, that probably would not matter much, especially if you can get such a script running quickly compared to the time it may take to get an awk script (or especially a perl script) running correctly. Secondly, we assumed your files are all syntactically correct, in that each line has 4 fields. If a file is not, like data9, the sum of the commas could still be 18, but it may cause an error down the line.

The latter issue could be addressed by looking at each line. Another utility (hefty though it might be) is datamash, which you can use to verify the number of fields in a file. For example:

$ datamash -t, check 4 fields <data9
line 5 (3 fields):
  5.1,5.2,5.3         <- only 3 fields: so 2 commas
datamash: check failed: line 5 has 3 fields (expecting 4)
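
If you wanted to fold that per-line check into the loop in the script above, something along these lines could supplement the comma count - again only a sketch, assuming datamash is installed (check exits with a non-zero status as soon as a line has the wrong number of fields):

    if ! datamash -t, check 4 fields <"$item" >/dev/null 2>&1
    then
      db " ignore file $item, some line does not have 4 fields"
      continue
    fi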

Because it is another tool that deals with fields, you may wish to look at it:

datamash        command-line calculations (man)
Path    : /usr/local/bin/datamash
Version : 1.2
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help    : probably available with -h,--help
Repo    : Debian 8.11 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)
Home    : http://www.gnu.org/software/datamash (doc)

The datamash package was available in my distribution's repository on a machine like:

OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
datamash (GNU datamash) 1.2

Best wishes ... cheers, drl


Hi drl,

Many thanks for taking the time to provide such an extensive answer - the support provided on this forum is really extraordinary. I could not have produced a script like yours with all the bells and whistles, but the central for/if/else loop does indeed seem pretty straightforward.

I'm aware I should probably familiarize myself with awk, but as a lay programmer (I use MATLAB quite a bit, but that's about it) I find its syntax not particularly intuitive, and time is just such a limited resource... *sigh*

In any case, many thanks again - both you and Rudi have been super helpful!

Robert
