Identify file pattern, take count of pattern, then act

ampsys · May 13, 2015, 12:52am

Guys -
Need your ideas on a section of code to finish something up. To make a long story short, I'm parsing a print output file that goes to pre-printed forms. I'm intercepting it, parsing it, formatting it, cutting it up into individual pages, grabbing the text I want in zones, building an .fdf (declares the zones), then populating (pdftk).

All of this is working fine.... In the end, I separate docs that have multiple pages (from those that don't) into a separate directory where I want to slam them together..in order. Single .pdf...for these multipages all with common "ID" name.

No problem getting the concatenation to work manually - I need your help with the code at the end of this bash script.

I've got a directory ($DIRECTORY) that has files that looks like this:

368363.pdf
368363-2.pdf
368363-3.pdf
368373.pdf
368373-2.pdf
368373-3.pdf
368389.pdf
368389-2.pdf

The beauty is, this directory will ONLY contain files that have other files matching their first six characters as I move through line by line. And it WILL be in order, with the first file being 'filename.pdf' and everything following...by line...being 'filename-2(++).pdf'.

What I want to do is simple...read the directory, take in a file, one at a time, store all its counterparts with the -'X'.pdf into a variable, then slam them all together with pdftk before I get to the next line (i.e. pdftk $ALL $pdform cat output $linem, with the m to indicate multipage, since pdftk is picky about using the same input and output names). I can put in a line at the end to move it all and clean up.

Something like:

for f in $DIRECTORY; do
         FILE=$(echo ${f##*/})         #Cut out the path leaving filename.pdf
        CATEM=$(ls -1f|grep $FILE|sort|uniq|tr '\r\n' ' ') # Gives me each file, separated by space (rather than line) to input to pdftk as a variable...

I'm lost after this. I don't think my plan is going to work because the original file will contain EVERYTHING and won't be re-read as I 'do stuff'....

Need some fresh eyes/ideas on this...this script is HUGE and does a ton of processing and I think I'm just getting tired of looking at it! Lol...

Thanks, fellas.

clx · May 13, 2015, 1:48am

You can skip further processing when a file is a "counterpart", that is, when its filename contains "-" (or meets whatever other check you think would be safe).

Something like..

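for f in "$DIRECTORY"/*.pdf; do
  FILE=${f##*/}                # strip the path, leaving filename.pdf
  if printf '%s\n' "$FILE" | grep -q -e '-'; then
    continue # pick the next file if this one is not the main file
  else
    BASE=${FILE%.pdf}          # the ID shared by the main file and its counterparts
    CATEM=$(ls -1f "$DIRECTORY" | grep "^$BASE" | sort | uniq | tr '\r\n' ' ')
    # any other stuff
  fi
done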

RudiC · May 13, 2015, 4:28am

With extended pattern matching, available in recent bash once enabled with shopt -s extglob, you could try

shopt -s extglob
for f in !(*-*).pdf
   do CATEM=$(ls "${f%.*}"*)
      echo $CATEM
   done

which prints

368363-2.pdf 368363-3.pdf 368363.pdf
368373-2.pdf 368373-3.pdf 368373.pdf
368389-2.pdf 368389.pdf

If the order is relevant, you'll need another step.
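
One possible form of that extra step, sketched under the assumption that counterparts are always named ID-N.pdf with a numeric N: split each name on the hyphen and sort numerically on the second field. The main file has no second field, which sort treats as zero, so it lands first. The function name ordered_group is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the files of one group, main file first,
# then the counterparts in numeric page order (-2, -3, ..., -10).
# Assumes names of the form ID.pdf and ID-N.pdf, with no newlines in them,
# and that at least one counterpart exists (otherwise the glob stays literal).
ordered_group() {
    local base=$1
    # Field 2 (the part after "-") is empty for ID.pdf and sorts as 0.
    printf '%s\n' "$base".pdf "$base"-*.pdf | LC_ALL=C sort -t- -k2,2n
}
```

Run from inside the directory, ordered_group 368363 lists that group one name per line; with bash 4 or later, mapfile -t FILES < <(ordered_group 368363) would load the result into an array.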

Don_Cragun · May 13, 2015, 4:34am

ampsys:

Something like:
for f in $DIRECTORY; do
   FILE=$(echo ${f##*/})         #Cut out the path leaving filename.pdf
   CATEM=$(ls -1f|grep $FILE|sort|uniq|tr '\r\n' ' ') # Gives me each file, separated by space (rather than line) to input to pdftk as a variable...

The above code snippet doesn't even come close to doing what the comments indicate it should do. The for loop will execute once with f set to the directory named by $DIRECTORY . It will set FILE to the last component of that directory; not the name of a file in that directory.
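
To make the difference concrete, here is a small sketch that loops over the entries of the directory rather than over the directory name itself. The function name list_pdf_names is made up for illustration, and its argument is assumed to be a path to a directory containing .pdf files.

```shell
#!/bin/bash
# Hypothetical helper: print the bare filename of every .pdf in a directory.
list_pdf_names() {
    local dir=$1 f
    # "$dir"/*.pdf expands to one word per matching file;
    # a plain "$dir" would run the loop body once, with f set to the directory.
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue      # no match: the pattern is left unexpanded
        printf '%s\n' "${f##*/}"     # strip the leading path
    done
}
```

Called as list_pdf_names "$DIRECTORY", it prints one filename per line, and names containing spaces survive intact because every expansion is quoted.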

Note also that with filenames like 123456.pdf and 123456-2.pdf , ls or anything else that sorts its output alphabetically will sort 123456-2.pdf before 123456.pdf since - sorts before . in ASCII and in the C Locale sort order.
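
A quick way to see this ordering for yourself, as a sketch using the two example names from the paragraph above:

```shell
#!/bin/bash
# "-" is byte 0x2D and "." is byte 0x2E, so in the C locale
# 123456-2.pdf sorts ahead of 123456.pdf.
sorted=$(printf '%s\n' 123456.pdf 123456-2.pdf | LC_ALL=C sort)
printf '%s\n' "$sorted"
# prints:
# 123456-2.pdf
# 123456.pdf
```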

There is no end to your sample for loop, so I don't know what you're planning to do after you get the list of files. I'm also not sure whether you are trying to create a list of filenames (with $DIRECTORY stripped off) that can be used in the directory where the files are located, or a list of pathnames (including $DIRECTORY) so the list can be used in another directory. It is also not clear whether $DIRECTORY is an absolute or a relative pathname for that directory. The following code will only work if $DIRECTORY expands to an absolute pathname for a directory.

Rather than creating a list of files as a scalar variable, with bash or ksh it would be much easier to do this with an array (especially if your directory structure or filenames might ever contain any whitespace characters). Perhaps the following will give you something you can build on to get what you want:

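#!/bin/bash
# Absolute pathname of the directory holding the PDF files.
DIRECTORY="$PWD/dir"
cd "$DIRECTORY" || exit 1
# Main files have exactly six characters before .pdf.
for pdf in ??????.pdf
do	base=${pdf%.pdf}
	# Main file first, then every counterpart named base-*.pdf.
	FILES=("$pdf" "$base-"*.pdf)
	PATHS=("$DIRECTORY/$pdf" "$DIRECTORY/$base-"*.pdf)
	echo "FILES (${#FILES[@]} elements):"
	printf '\t"%s"\n' "${FILES[@]}"
	echo "PATHS (${#PATHS[@]} elements):"
	printf '\t"%s"\n' "${PATHS[@]}"
	echo
done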

This loop runs in the directory specified by $DIRECTORY and creates a list of your desired filenames and a list of pathnames for those filenames. I assume you'll want one of those and can delete the code for the one you don't want. I also assume that you'll want to replace one of those printf commands with a pdftk command, but I don't see how $ALL , $linem , or $pdform from your description relate to the list of filenames or pathnames you want to use; so I'm leaving that as an exercise for the reader.

In a directory that contains the files:

1 2 3 -2.pdf
1 2 3 -3.pdf
1 2 3 -4.pdf
1 2 3 .pdf
368363-2.pdf
368363-3.pdf
368363.pdf
368389-2.pdf
368389.pdf

it produces the output:

FILES (4 elements):
	"1 2 3 .pdf"
	"1 2 3 -2.pdf"
	"1 2 3 -3.pdf"
	"1 2 3 -4.pdf"
PATHS (4 elements):
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 .pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 -2.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 -3.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 -4.pdf"

FILES (3 elements):
	"368363.pdf"
	"368363-2.pdf"
	"368363-3.pdf"
PATHS (3 elements):
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368363.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368363-2.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368363-3.pdf"

FILES (2 elements):
	"368389.pdf"
	"368389-2.pdf"
PATHS (2 elements):
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368389.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368389-2.pdf"
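
As one possible way to finish that exercise, the sketch below swaps the printf commands in the loop for a pdftk call that concatenates each group into a single file. The output name (the ID with an m appended, following the original poster's description), the merge_groups function name, and the PDFTK variable are all assumptions made for illustration; only pdftk's basic "inputs cat output file" form is relied on.

```shell
#!/bin/bash
# Hypothetical sketch: merge ID.pdf, ID-2.pdf, ... into IDm.pdf.
# PDFTK can be overridden (for example with "echo") to preview the commands.
PDFTK=${PDFTK:-pdftk}

# The body runs in a subshell, so the cd does not affect the caller.
merge_groups() (
    cd "$1" || exit 1
    for pdf in ??????.pdf
    do  [ -e "$pdf" ] || continue          # pattern matched nothing
        base=${pdf%.pdf}
        # Main file first, then counterparts; assumes at least one exists.
        FILES=("$pdf" "$base-"*.pdf)
        # The glob sorts as text, so -10 would come before -2;
        # fine below ten pages, otherwise sort numerically first.
        "$PDFTK" "${FILES[@]}" cat output "${base}m.pdf" || exit 1
    done
)
```

Setting PDFTK=echo and then running merge_groups "$DIRECTORY" prints each command instead of executing it, which is a cheap way to check the grouping before any PDF is written. Because the output name has seven characters before .pdf and no hyphen, it matches neither glob pattern.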