Combining files with specific patterns of naming in a directory

A-V · November 26, 2012, 11:06am

Greetings Unix experts,
I am having trouble combining files with different name patterns within a directory, and I would appreciate your help.
I have more than 1000 files, but they follow a specific naming pattern, e.g. 64Xtest01.txt
They are divided into two sets: test and train.
The Train set pattern is the following: e.g. 64XtrainY1.txt-James-Maggie.txt

The test set pattern is the following: e.g. 64xtest14.txt-James-Maggie.txt

And each of these files has only one line in it.
Now I want to combine the files that share the same pattern before the first ".txt" into one file per pattern, e.g.

I am wondering what is the best way to deal with it.
I have tried to combine all of them into a single file and then divide them line by line with grep, but I am sure that is not an efficient way to do it.

FILES="XXXXXXX/*"
for X in $FILES
do
	name=$(basename $X) 
	awk '{printf "%s,%s\n",FILENAME,$0}' $X 
done > test-result.txt
cat test-result.txt | grep "count/64Xtrain*" > Xtrain.txt
cat test-result.txt | grep "count/64Xtest*" >  Xtest.txt
cat test-result.txt | grep "count/64Ytrain*" > Ytrain.txt
cat test-result.txt | grep "count/64Ytest*" >  Ytest.txt
...

And then divide them based on names per line again, but it's a nightmare if you have loads of files.
So I would really appreciate any help.

Don_Cragun · November 26, 2012, 8:11pm

Greetings to you too, A-V.

I am having trouble combining files with different name patterns within a directory, and I would appreciate your help.
I have more than 1000 files, but they follow a specific naming pattern, e.g. 64Xtest01.txt
They are divided into two sets: test and train.
The Train set pattern is the following: e.g. 64XtrainY1.txt-James-Maggie.txt
1)	2 fixed digits: 64
2)	A capital letter which may vary
3)	"train"
4)	Another capital letter
5)	One digit number
6)	".txt-"
7)	Another pattern of "bla-bla" --- 25 to 50 different names
8)	.txt --- the format

OK. I understand this set of filename requirements.

The test set pattern is the following: e.g. 64xtest14.txt-James-Maggie.txt

1)	2 fixed digits: 64
2)	A capital letter which may vary
3)	"test"
4)	Two digit number, may vary
5)	".txt-"
6)	Another pattern of "bla-bla" --- 25 to 50 different names
7)	.txt --- the format

But the x in 64xtest14.txt-James-Maggie.txt doesn't match rule 2), since "x" is not a capital letter. Should the "x" be "X" instead, or is rule 2) in the above list a mistake?

And each of these files has only one line in it.
Now I want to combine the files that share the same pattern before the first ".txt" into one file per pattern, e.g.
64XtrainY1.txt
64XtrainY2.txt
64YtrainX1.txt
64Xtest01.txt
64Ytest02.txt
I am wondering what is the best way to deal with it

With the rules stated so far, this could be done with the script:

#!/bin/ksh
for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do      cat "$f" >> "${f%%.txt*}.txt"
done
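
To see what this loop does before running it on the real data, it can be tried in a scratch directory first. The sketch below assumes mktemp -d is available (it is common but not required by POSIX); the sample file names follow the stated rules, and their contents and the Ann-Bob name are invented purely for the demonstration.

```shell
#!/bin/sh
# Scratch directory with made-up one-line sample files (contents are invented).
demo=$(mktemp -d) && cd "$demo" || exit 1
echo 'line from James-Maggie' > 64XtrainY1.txt-James-Maggie.txt
echo 'line from Ann-Bob'      > 64XtrainY1.txt-Ann-Bob.txt
echo 'line from James-Maggie' > 64Xtest14.txt-James-Maggie.txt

# Same loop as above, plus a guard for a pattern that matches no files.
for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do
        [ -f "$f" ] || continue
        cat "$f" >> "${f%%.txt*}.txt"
done

wc -l < 64XtrainY1.txt   # both train source lines end up in this file
```

The guard matters because an unmatched pattern is passed to the loop as a literal string, and cat would then complain about a file that does not exist.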

I have tried to combine all of them into a single file and then divide them line by line with grep, but I am sure that is not an efficient way to do it.

FILES="XXXXXXX/*"
for X in $FILES
do
	name=$(basename $X) 
	awk '{printf "%s,%s\n",FILENAME,$0}' $X 
done > test-result.txt
cat test-result.txt | grep "count/64Xtrain*" > Xtrain.txt
cat test-result.txt | grep "count/64Xtest*" >  Xtest.txt
cat test-result.txt | grep "count/64Ytrain*" > Ytrain.txt
cat test-result.txt | grep "count/64Ytest*" >  Ytest.txt
....

Now I'm lost.
The XXXXXXX/* implies that all of these files reside in a subdirectory that was not mentioned before, and the count/64* in the grep search patterns implies that each line being searched contains the string count/ followed by the name of the file, but that hasn't been explicitly stated. (The awk command puts the pathname it was given, FILENAME, in front of the contents of each line, so count/ would appear only if XXXXXXX stands for a directory named count.)

And it looks like the desired final filenames have the 64 stripped from the front, as well as the uppercase letters and digits stripped from the end of the part before the first .txt, rather than the names shown earlier. So, do you want both sets of output files (i.e., 64XtrainY2.txt, 64YtrainX1.txt, 64Xtest01.txt, and 64Ytest02.txt AND Xtest.txt, Xtrain.txt, Ytest.txt, and Ytrain.txt), or do you just want one set of these files (and if so, which set do you want)?

Do you want to remove the original files if they are successfully merged into one of the consolidation files?

Do you want the source file's name appended to the contents of files when they are added to a consolidation file?

Do you want the consolidation files placed in the same directory as the source files, or do you want them to be created in a different directory? (If in a new directory, what directory?)

A-V · November 27, 2012, 5:56pm

Sorry for the confusion.
Q1) Yes, it is a capital X.
Q2) The directory name can be anything: XXXX or count or ...
Q3) As 64 is fixed, it does not play any important role... the name should show the letter that indicates what area the files are from, plus whether they are train or test and, if so, which group (letter+# for train and # only for test).
Q4) I don't know what difference it will make.
Q5) I am not sure I understand the question.

Q6) I am still learning Unix -- "what is a source file?" --- it can be in another directory -- it would be easier to see the results.

Oh wow... I just tested it and it works like magic.

May I ask you to explain what "f%%" does?
And how can I make it read from a higher directory and put the results in another,
such as puredate/* to count/*?

One more question:

Would it be possible to put every letter in its own new folder, which would include both the train and the test files? 64X, 64Y

Don_Cragun · November 28, 2012, 5:16am

OK. I think I understand what you want.

In this context a source file is any one of the input files that matches either your Train set pattern or your Test set pattern.

The construct ${var%%pattern} expands to the contents of the shell variable var with the longest string that matches pattern at the end of the string removed. Similarly ${var%pattern} expands to the contents of the shell variable var with the shortest string that matches pattern at the end of the string removed, ${var##pattern} expands to the contents of the shell variable var with the longest string that matches pattern at the start of the string removed, and ${var#pattern} expands to the contents of the shell variable var with the shortest string that matches pattern at the start of the string removed. If the given pattern doesn't match the appropriate part of the expansion of $var , $var is expanded in full.

So, for example if $src is set to

puredate/64Xtest14.txt-James-Maggie.txt

or to

/home/dwc/test/puredate/64Xtest14.txt-James-Maggie.txt

then the command:

sf=${src##*/}

will set sf to 64Xtest14.txt-James-Maggie.txt , and then the command:

df="${sf%%.txt*}"

will set df to 64Xtest14 , and then the commands:

df=${df#64[A-Z]train}
df=${df#64[A-Z]test}

will set df to 14 (with the 1st command leaving df unchanged and the 2nd command removing the leading 64Xtest). (With a source filename matching the pattern with train in it, the 1st command would remove the leading part of the string up to and including train and the 2nd command would leave the value unchanged.)
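
The four expansions described above can be tried directly in a POSIX shell. What follows is only a small sketch using the same example pathname; the variable names short_end and short_start are made up for the illustration and do not appear in the script below.

```shell
#!/bin/sh
src=/home/dwc/test/puredate/64Xtest14.txt-James-Maggie.txt

sf=${src##*/}           # longest match of "*/" removed from the start
df=${sf%%.txt*}         # longest match of ".txt*" removed from the end
short_end=${sf%.txt*}   # shortest match of ".txt*" removed from the end
short_start=${src#*/}   # shortest match of "*/" removed from the start

printf '%s\n' "$sf" "$df" "$short_end" "$short_start"
```

The last two lines of output show the difference between the single and doubled operators: % only removes the final .txt, and # only removes the leading slash.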

If you save the following script in a file, name it consolidate, make it executable, and execute it, it will consolidate all text in the files in and under the current working directory that match the pattern 64[A-Z]test[0-9][0-9].txt-*.txt or the pattern 64[A-Z]train[A-Z][0-9].txt-*.txt into files named 64[A-Z]/[A-Z][0-9][0-9].txt or 64[A-Z]/[A-Z][A-Z][0-9].txt under the current working directory, respectively:

#!/bin/ksh
# Usage: consolidate
#  The consolidate utility copies the contents of source files with
#  names matching one of two patterns in or under the current working
#  directory into summary files in directories (with the directory
#  name and file name derived from the name of the source file).
#   */64[A-Z]test[0-9][0-9].txt-*.txt -> 64[A-Z]/[A-Z][0-9][0-9].txt
#   */64[A-Z]train[A-Z][0-9].txt-*.txt -> 64[A-Z]/[A-Z][A-Z][0-9].txt
ec=0    # Script exit code.
find .  -name '64[A-Z]test[0-9][0-9].txt-*.txt' -o \
        -name '64[A-Z]train[A-Z][0-9].txt-*.txt' | while read src
do
        # Get last component of pathname of source file ($sf).
        sf="${src##*/}"
        # Target directory ($dir) will be "64x" (where x is a single upper case
        # letter) after throwing away train* or test*.
        dir="${sf%%t*}"
        # Create the target directory if it doesn't already exist.
        if [ ! -d "$dir" ]
        then    mkdir "$dir"
                rc=$?
                if [ $rc -ne 0 ]
                then    ec=1
                        printf "%s: \"%s\" not processed.\n" "$0" "$src" >&2
                        continue
                fi
        fi
        # Change source filename ($sf) to destination filename ($df):
        df="${sf%%.txt*}"       # Get rid of trailing ".txt-*.txt"
        df="${df#64[A-Z]train}" # Get rid of leading "64[A-Z]train" or
        df="${df#64[A-Z]test}"  #   "64[A-Z]test".
        df="${dir#64}$df.txt"   # Put back the "[A-Z]" removed in last step and
                                #   add trailing ".txt".
        cat "$src" >> "$dir"/"$df"
        rc=$?
        if [ $rc -eq 0 ]
        then    ;# printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"
        else    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi
done
exit $ec

This was written and tested using ksh, but only uses shell features specified by the POSIX standards and the Single UNIX Specifications (so it should work the same with any shell that conforms to these standards). It could be made a little more efficient using features that are only available in more recent versions of ksh, but the script shown here should work with any version of ksh as well as any other standards conforming shell.

If you would like to see a status report of the files successfully processed while this script is running, remove the ;# from the then clause of the last if command.

If you want to remove the source files after they have been successfully written into one of the consolidation files, remove the # in front of the rm command in the same then clause. Note that if you do this, you should also check the exit status of this rm command like the script does with the mkdir and cat commands.

You could also add options to be interpreted by this script to enable removing the source files that have been successfully copied, to enable printing of successfully completed copies, to set a different source directory, and to set a different destination directory, but I'll leave that as an exercise for the reader.
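
As a starting point for that exercise, option parsing could be done with the standard getopts utility. This is only a sketch: the option letters -r, -v, -s, and -d and the variable names are hypothetical choices, not part of the script above, and the set -- line merely supplies example arguments so the fragment runs on its own.

```shell
#!/bin/sh
# Hypothetical options: -r remove sources, -v report successes,
# -s source directory, -d destination directory.
remove=0 verbose=0 srcdir=. destdir=.
set -- -v -s puredate -d count     # example arguments for demonstration only
while getopts rvs:d: opt
do
        case $opt in
        r)      remove=1 ;;
        v)      verbose=1 ;;
        s)      srcdir=$OPTARG ;;
        d)      destdir=$OPTARG ;;
        *)      printf 'Usage: %s [-rv] [-s srcdir] [-d destdir]\n' "$0" >&2
                exit 2 ;;
        esac
done
shift $((OPTIND - 1))
printf 'remove=%s verbose=%s srcdir=%s destdir=%s\n' \
        "$remove" "$verbose" "$srcdir" "$destdir"
```

The main loop would then test $remove and $verbose where the commented-out rm and printf currently sit, and use $srcdir and $destdir in place of the current directory.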

Hope this helps,
Don

A-V · November 28, 2012, 12:42pm

Oh wow, this is amazing.
Thank you so much for everything.
I am going to try to understand everything and learn before trying the code.
I really appreciate your help.

I am getting syntax errors for the final if statement...
1) for the ";" just after then

bash: syntax error near unexpected token `;'

2) and if I delete that, the following is what I get

bash: syntax error near unexpected token `else'
$                 printf "%s: cat %s >> %s failed (%d)\n" \
>                         "$0" "$src" "$dir/$df" "$rc" >&2
bash: cat  >> / failed (0)
$         fi
bash: syntax error near unexpected token `fi'
$ done
bash: syntax error near unexpected token `done'

Don_Cragun · November 29, 2012, 9:18pm

As I'm sure you've noticed, I used ksh instead of bash. When I was testing it, I was using:

        if [ $rc -eq 0 ]
        then    printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"
        else    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi

to make it easy to verify the code was doing what I expected. The shell grammar specifies that there is a compound list between the then and the else in an if clause. After looking more closely at the grammar, I see that (even though ksh93 accepts the clause as written) a portable script must have at least one command between the then and the else; a semicolon alone isn't enough.

If you want to see a list of files as they are processed, remove the ;# ; if you want to remove the source files that have been successfully consolidated, change:

        then    ;# printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"

to:

        then    rm "$src"

If you want neither of those actions, change the if statement to:

        if [ $rc -ne 0 ]
        then    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi

or, of course, you could just set a variable that you'll never use before the semicolon and leave the comments as they are.
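
Another common way to keep a deliberately empty then branch is the POSIX null utility : (colon), which does nothing and returns success. A minimal sketch, with rc set by hand only so that the fragment runs on its own:

```shell
#!/bin/sh
ec=0
rc=0    # stand-in for the exit status of the cat command
if [ "$rc" -eq 0 ]
then    :       # no-op placeholder; replace with printf or rm as needed
else    ec=1
        printf 'cat failed (%d)\n' "$rc" >&2
fi
```

Because : is an ordinary command, the then branch is a valid compound list in bash as well as ksh.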

Note that if you got the bash error:

bash: cat  >> / failed (0)

from the printf command I had, it means that you had a mismatched " somewhere before the cat %s >> %s succeeded\n" .

A-V · November 30, 2012, 9:01am

Thank you so much for all the information and help...
I am quite new and still learning everything.
I will make sure I understand things, give it a go, and let you know if I face any more problems.

A-V · January 29, 2013, 11:03am

Dear Don Cragun

I am facing problems when running it, even with ksh.

However, I have tried your simple code and that works:

for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do      cat "$f" >> "${f%%.txt*}.txt"
done

Now I am trying to make it work in a different directory, but I can't manage it...
I know this is probably a stupid mistake, but I can't find where it is.

mkdir files-back
FILES="count/*"
for f in $FILES
do
	name=$(basename $f) 
	for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
	do      
		cat "$f" >> files-back/"${f%%.txt*}.txt"
	done
done

Don_Cragun · January 29, 2013, 12:11pm

a-v:

Dear Don Cragun

I am facing problems when running it, even with ksh.
$ ksh test.ksh
find: expected an expression after '-o'
test.ksh[11]: -name: not found
test.ksh[12]: syntax error: `do' unexpected

Getting this error means that, in test.ksh, you added something after the backslash ( \ ) at the end of this line from the script I gave you:

find .  -name '64[A-Z]test[0-9][0-9].txt-*.txt' -o \

The backslash character in that line has to appear just before the newline that terminates that line.
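
One way to look for this kind of problem is to search the script for a backslash that is followed by whitespace at the end of a line (a carriage return left by a DOS-style editor also counts as whitespace here). This is a sketch only: to have something to find, it builds a scratch file with a deliberately broken first line rather than reading the real test.ksh.

```shell
#!/bin/sh
script=$(mktemp) || exit 1
# First line ends in "\ " (backslash + space), second in a bare backslash.
printf '%s\n' 'find . -name a -o \ ' 'find . -name b -o \' '-name c' > "$script"

# Report the line numbers where whitespace follows the final backslash.
bad=$(grep -n '\\[[:space:]][[:space:]]*$' "$script")
printf '%s\n' "$bad"
rm -f "$script"
```

On a real script, the same grep command would be given the script's name in place of "$script"; only lines with the stray trailing whitespace are reported.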

a-v:

However, I have tried your simple code and that works:
for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do      cat "$f" >> "${f%%.txt*}.txt"
done
Now I am trying to make it work in a different directory, but I can't manage it...
I know this is probably a stupid mistake, but I can't find where it is.
mkdir files-back
FILES="count/*"
for f in $FILES
do
	name=$(basename $f) 
	for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
	do      
		cat "$f" >> files-back/"${f%%.txt*}.txt"
	done
done

You don't say how it is failing, but there are several obvious problems here.

Using f as the variable name in two nested for loops can't possibly be what you want. Setting name and never using it seems pointless. If you'd like to provide more details about what you're trying to do with this nested loop we may be able to help.

If you just want the script to do the same processing in a different directory, the simple thing to do is:

cd directory_name

and then run the script again.
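
If the consolidated files should end up somewhere other than the source directory, one possible approach is to do the cd inside a subshell and write to an absolute destination path. This is a sketch only: the directory names count and files-back come from the post above, while the scratch directory, the dest variable, and the sample file are invented for the demonstration.

```shell
#!/bin/sh
work=$(mktemp -d) && cd "$work" || exit 1
mkdir count files-back
echo 'sample line' > count/64Xtest01.txt-James-Maggie.txt   # made-up input

dest=$PWD/files-back
(
        cd count || exit 1
        for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
        do
                [ -f "$f" ] || continue          # pattern matched nothing
                cat "$f" >> "$dest/${f%%.txt*}.txt"
        done
)
ls files-back
```

Because the cd happens inside the parentheses, the calling shell stays in its original directory, and a single loop with a single variable replaces the two nested loops.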

A-V · January 31, 2013, 10:11am

Thank you so much for your guidance and help... very useful.

As for the code, I have realized there is a problem with the system I am working on, as it does not read any type of them.

However, I did manage to make the simple code work...
I didn't want to go into a sub-directory, so I changed the FOR LOOP to an IF LOOP to solve that problem.

Thanks a lot again