Looping through input/output

zajtat · August 19, 2016, 11:48pm

Hi,

I've got a directory of about 6000 txt files that look like this:

a b c d
e f g h
k l m n

I need to execute a command on them to combine them and, in the end, have one big file with all the needed columns taken from all the 6000 files. I've got the "combining" program, but my problem is that once I've combined the first two files, that output file should be the input file for adding the third one, and so on.
Here is a schematic:

combining.executable infile1 infile2 > outfile1
combining.executable infile3 outfile1 > outfile2
combining.executable infile4 outfile2 > outfile3
combining.executable infile5 outfile3 > outfile4
etc

I've created a list of all the files that need to be combined (named master-infile) and wrote this loop:

for i in $(cat master-infile);
do
   for((a=1;a<=6000;i++);
   do
      combining.executable ${i} ${i} > ${a}
   done
done

But the looping variables seem to be all wrong and all sorts of weird combinations of files get combined. I guess I need to get a count of output files? Any ideas?

Any help would be greatly appreciated!

RudiC · August 20, 2016, 3:38am

That's quite an academic specification. How about showing some more details about the concatenation of lines and columns therein? The logic behind it?
Do you know about the paste command?

zajtat · August 20, 2016, 9:01am

Each of the files consists of 7 columns: the first three columns are the same in each file and the other 4 columns are the actual data (but we need to use only the last three of those). The merging would mean to keep the first three columns and then add the last three columns of each file matched by the first column. For example, infile1 could be

a b c 1 2 3 4
d e f 5 6 7 8
g h j 9 10 11 12

infile2 could be:

a b c 11 3 4 5 
g h j 9 8 7 6
d e f 1 2 3 4

infile3 could be

d e f 2 3 4 5
a b c 5 6 7 8
g h j 9 10 11 12

Then, after merging step 1, outfile1 would look like this:

a b c 1 2 3 4 3 4 5
d e f 5 6 7 8 2 3 4
g h j 9 10 11 12 8 7 6

then after adding infile3, the outfile2 would look like this:

a b c 1 2 3 4 3 4 5 6 7 8
d e f 5 6 7 8 2 3 4 3 4 5
g h j 9 10 11 12 8 7 6 10 11 12

I hope this clears things up.

Many thanks!

Don_Cragun · August 20, 2016, 5:31pm

Hi zajtat,
If combining.executable infile1 infile2 > outfile1 produces the output you said it does, then running the command:

combining.executable infile3 outfile1 > outfile2

won't give you the output you said you want (it would put the data from infile3 before the data from the end of infile2 in outfile2, discarding everything from infile1, wouldn't it?); wouldn't you need to use:

combining.executable outfile1 infile3 > outfile2

instead???

Do all of the ~6000 text files you want to combine have names that end in .txt (or some other single filename extension)? Are there any other files in that directory that have names that end in the same filename extension? If the 1st answer is yes and the 2nd answer is no, why do you need master-infile? Why not just use for i in *.txt? And, no, you don't want nested loops for this. What you have shown us in post #1 will combine two copies of infile1 into files named 1 through 6000, and then, for each of your other input files, overwrite each of those numbered files with a combination of two copies of the next file in master-infile, keeping only the 6000 copies of the combination of two copies of the last input file named in master-infile.
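To make the overwrite problem concrete, here is a dry run of the nested loops from post #1, cut down to three made-up file names and three passes of the inner loop. It assumes the inner counter was meant to increment a rather than i, and it only prints the commands instead of running anything. The names infile1 through infile3 are placeholders, not the real input files.

```shell
#!/bin/sh
# Dry run: print the commands the nested loops would execute.
# infile1..infile3 are placeholder names (assumption for illustration).
dry_run=$(
    for i in infile1 infile2 infile3
    do
        a=1
        while [ "$a" -le 3 ]
        do
            # Same file twice as input, numbered file as output.
            echo "combining.executable $i $i > $a"
            a=$((a + 1))
        done
    done
)
printf '%s\n' "$dry_run"
```

Every input file is paired with itself, and each pass of the outer loop rewrites outputs 1 through 3, so only the results for the last file survive.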

Is my interpretation of the commands you need to run correct? If not, please explain more clearly what arguments you are trying to pass to the command combining.executable .

Do you really want to keep outfile1 through outfile6000 , or do you just want one outfile to be the combination of the 6000 input files? (Putting 12000 files in a single directory is usually a great way to slow down processing any files in that directory.)

Are you always processing exactly 6000 files, or do you just want to combine all of your text files (either based on a filename matching pattern or on the list of files in master-infile) no matter whether that is two files or a thousand files?

What is the format of the real names of your input files?

What name do you really want for your output file(s)?

zajtat · August 20, 2016, 6:44pm

Hi,

The syntax for the combining.executable is correct and it would not put the data from infile3 in front of the infile2 in outfile2. The executable will add columns from infile3 to the end of outfile1. I've used the syntax to illustrate that the outcome of one command should be the input for the other.

I'm sorry, but your interpretation of the commands is not correct. We do not need to keep outfile1 through outfile6000, we just need outfile1 to be input for the executable in order to generate outfile2, then outfile1 can be deleted. Outfile2 will be the input for the executable to generate outfile3. Then outfile2 can be deleted and outfile3 can be used as input to generate outfile4, etc.

I may not need a master file or the loop, it was just the way I tried to solve the problem. But it does not necessarily have to be the case. The point is that I need to run a program (executable) where the output of step1 is the input of step2, and the output of step2 is the input for step 3, etc.

The format of my input files is text. They are the only ones in the folder.
The name of the output files does not matter.

Don_Cragun · August 20, 2016, 7:39pm

In post #3 in this thread, you said that the command:

combining.executable infile1 infile2 > outfile1

puts the entire contents of lines from infile1 (the 1st input file operand) followed by the last three fields of matching lines from infile2 (the 2nd input file operand) into outfile1.

Then you said that the command:

combining.executable infile3 outfile1 > outfile2

changes its behavior completely and puts the entire contents of lines from outfile1 (the 2nd input file operand) followed by the last three fields of matching lines from infile3 (the 1st input file operand) into outfile2 .

Are you absolutely positive that combining.executable magically knows that the behavior should be different for the 1st pair of files being combined than it is for every other pair of files it is asked to combine?

I understand the concept of running a program repeatedly with the output of each invocation being one of the inputs to the next invocation. And that isn't hard to do; it just can't be done with the nested loops you showed us. If you answer the questions I posed above and the remaining questions I asked in my previous post, I think we will easily be able to suggest something that will work for you.

Therefore, I repeat: What is the format of the real names of your input files? (Please show us the actual names of your 1st input file and your last input file.) If all files don't have the same text with just a number that changes from file to file, we need to have a filename matching pattern that will match all of the input filenames you want to process and a way to determine the order in which those files should be processed.

And, is the number of input files a constant?

And, what name do you really want for your output file?

zajtat · August 20, 2016, 8:30pm

I'm sorry my explanations were not clear.

The executable command takes the columns of file specified in argument 1 (infile1) and adds them to the file specified in argument 2 (infile2). It creates a new file that starts with columns from infile2, followed by columns from infile1. So, the first command line is:

combining.executable infile1 infile2 > outfile1

This will take infile2 as the basis and add columns to it from infile1. The outfile1 will start with columns from infile2 and finish with columns from infile1.

Then we run the following command:

combining.executable infile3 outfile1 > outfile2

This will take the columns from the file specified as argument 1 (infile3) and add them to the file specified in the argument 2 (outfile1). It is the same procedure as in step1, so the program does the same thing, nothing changes. The resulting outfile2 will start with columns from outfile1 and finish with columns from infile3. So, all 3 files are combined and nothing is deleted/replaced.

For the next step, we'll need outfile2 and infile4. So, we can delete the outfile1 as its info is now in outfile2. And so on...

This is your next question:

What is the format of the real names of your input files? (Please show us the actual names of your 1st input file and your last input file.)

The answer: the files are simple text files, here is the name of my first input file:

9464294024_R01C01header

and the last file name is:

9479475073_R12C02header

here is a small subsample of the files:

9464294024_R01C01header  9477371149_R12C02header  9477871078_R06C01header  9477875165_R03C01header  9477885102_R05C01header  9479475073_R10C02header
9464294024_R01C02header  9477371157_R01C01header  9477871078_R06C02header  9477875165_R03C02header  9477885102_R05C02header  9479475073_R11C01header

Next question: And, is the number of input files a constant?

I'm not sure I understand this question. There are exactly 5427 files that need to be combined into one.

Next question: And, what name do you really want for your output file?

The name of the output file does not really matter for me. It can be anything as long as it contains the columns of all the 5427 files combined into one text file.

Many thanks!

Don_Cragun · August 20, 2016, 9:46pm

I apologize for guessing wrong at what combining.executable does. I made the unwarranted assumption that a file named infile1 would be your 1st input file and that a file named infile2 would be your second input file instead of infile1 being the name of your 2nd input file and infile2 being the name of your 1st input file.

With that list of files and no indication of what produced it, I'll make another wild assumption that the list is the first couple of lines of output from the command:

ls *header

and that the intent is that your input files are to be processed in increasing alphanumeric sorted order. Assuming that is correct (which based on the failure of my earlier assumptions is certainly not guaranteed), the following will combine your input files and produce an output file named outfile . It uses temporary files named 0 and 1 and renames the last used temporary file to be outfile and removes the other temporary file before it exits:

#!/bin/ksh
i=0
last=0
for file in *header
do	case "$i" in
	(0)	f1=$file
		i=1
		;;
	(1)	combining.executable "$file" "$f1" > $((out = last))
		i=2
		;;
	(2)	out=$((1 - last))
		combining.executable "$file" $last > $out
		last=$out
		;;
	esac
done
mv $out "outfile"
rm -f $((1 - out))

Although written and tested using a Korn shell, this will work with any shell that uses basic Bourne shell syntax and supports POSIX arithmetic substitutions (such as ash, bash, dash, and ksh; but not csh and its derivatives, and not a traditional Bourne shell).

zajtat · August 21, 2016, 2:19am

Hi and thank you for your solution.
I will test it out tomorrow on my files.

Many thanks for your kind help!

---------- Post updated at 12:53 AM ---------- Previous update was at 12:33 AM ----------

Hi,

I actually just tried out your script now

The combining.executable complains that the argument 2 is not specified. It creates files 0 and 1, but they are empty.

Your assumption is correct that the files can be taken in increasing alphanumeric sorted order. Your other assumption that the list of files was generated with command ls is also correct.

Many thanks for your help in advance.

---------- Post updated at 02:19 AM ---------- Previous update was at 12:53 AM ----------

p.s. there is also this:

/bin/ksh: bad interpreter: No such file or directory

Don_Cragun · August 21, 2016, 2:30am

It sounds like you may have removed some of the <space> characters from the script I suggested???

Please show us the output from the command uname -a. We need to know what operating system (including the release number) you are using.

What shell are you using (including version number)? If you don't know the version number, show us the output from the command:

shell --version

where shell is the name of the shell you are using.

Then run the command:

shell -xv script_name > log 2>&1

where shell is the name of your shell and script_name is the name of the file containing my script. Then show us the exact contents of the 1st 40 lines in the file named log (in CODE tags; not ICODE tags) and show us the output from the commands:

type combining.executable
printf '%s\n' "$PATH"
ls -l 0 1 log outfile
ls -l *header|head -n 20

(also in CODE tags).
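For anyone unsure what such a trace looks like, here is a minimal sketch of capturing one. The throwaway script demo.sh and the temporary directory are made up for the demonstration, and sh stands in for whatever shell you actually use.

```shell
#!/bin/sh
# Write a tiny throwaway script, then run it with -x (trace each command)
# and -v (echo each input line), sending stdout and stderr to one log.
# demo.sh and workdir are hypothetical names used only for this sketch.
workdir=$(mktemp -d)
cat > "$workdir/demo.sh" <<'EOF'
i=0
i=$((i + 1))
echo "i is $i"
EOF
sh -xv "$workdir/demo.sh" > "$workdir/log" 2>&1
head -n 40 "$workdir/log"
```

Lines starting with + are the commands as the shell actually ran them, after expansion, which is what makes the log useful for spotting a wrong argument.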

----------------

I just saw your PS. If /bin/ksh isn't a valid path on your system, how did you run the script to produce files 0 and 1 and if trying to run my script produced that error, what else did you do that attempted to run combining.executable ???

Please also show us the output from the commands:

ls -l "$SHELL"
type ksh

in CODE tags.

zajtat · August 21, 2016, 3:04am

Hi,

I did not alter your script, just copy pasted it into a shell and run it. When the shell produced an error of /bin/ksh: bad interpreter: No such file or directory , I deleted the first line of your code and tried it that way. Other than that, I didn't change anything.

Here are the answers to your questions:

Question1: Please show us the output from the command uname -a . We need to know what operating system (including the release number) that you are using?

Answer:

Linux hw-uger-1000 2.6.32-642.el6.x86_64 #1 SMP Wed Apr 13 00:51:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Question 2: What shell are you using (including version number)?

Answer:

GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

Question 3: show the first 40 lines of output from this command shell -xv script_name > log 2>&1

Answer:

module () {  eval `/usr/bin/modulecmd bash $*`
}
#!/bin/ksh
i=0
+ i=0
last=0
+ last=0
for file in *header
do	case "$i" in
	(0)	f1=$file
		i=1
		;;
	(1)	./match.pl -f "$file" -g "$f1" -k 1 -l 1 -v "5 6 7" > $((out = last))
		i=2
		;;
	(2)	out=$((1 - last))
		./match.pl -f "$file" -g $last -k 1 -l 1 -v "5 6 7" > $out
		last=$out
		;;
	esac
done
+ for file in '*header'
+ case "$i" in
+ f1=9477885102_R01C01header
+ i=1
+ for file in '*header'
+ case "$i" in
+ ./match.pl -f 9477885102_R01C02header -g 9477885102_R01C01header -k 1 -l 1 -v '5 6 7'
+ i=2
+ for file in '*header'
+ case "$i" in
+ out=1
+ ./match.pl -f 9477885102_R02C01header -g 0 -k 1 -l 1 -v '5 6 7'
Must specify file one using -g switch
+ last=1
+ for file in '*header'
+ case "$i" in
+ out=0
+ ./match.pl -f 9477885102_R02C02header -g 1 -k 1 -l 1 -v '5 6 7'
+ last=0

I'm sorry, but I'm not able to share the answers to the following commands because they would contain sensitive information.

type combining.executable
printf '%s\n' "$PATH"
ls -l 0 1 log outfile
ls -l *header|head -n 20

Question 4: the output from the following commands:

Please also show us the output from the commands:

ls -l "$SHELL"

answer:

-rwxr-xr-x 1 root root 941880 Dec 22  2015 /bin/bash*

type ksh

answer:

-bash: type: ksh: not found

Many thanks!

bakunin · August 21, 2016, 4:50am

In this case do what Don Cragun already (implicitly) suggested in post #8:

change the line in his script where the shell to be used to execute it is named:

#! /bin/ksh

and replace it with a line pointing to a shell valid on your system, namely (from your quoted output above)

#! /bin/bash

Further, to spare you one source of common problems with this line (which is, informally, also called a "shebang"): the line needs to be the very first line in the file and it must not be indented! Even a leading space would make it a normal comment (instead of this special one which tells the OS which shell executable to use to have it interpreted) without any further meaning.
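A quick way to check both points (the interpreter exists, and the shebang really is the first, unindented line) might look like the sketch below. The file name hello and the temporary directory are made up for the demonstration, and the fallback order ksh, bash, sh is only an assumption about what might be installed.

```shell
#!/bin/sh
# Pick the first interpreter that actually exists on this system.
interp=$(command -v ksh || command -v bash || command -v sh)

# Build a throwaway script whose shebang names that interpreter.
# "hello" and workdir are hypothetical names used only for this sketch.
workdir=$(mktemp -d)
printf '#!%s\necho "ran under %s"\n' "$interp" "$interp" > "$workdir/hello"
chmod +x "$workdir/hello"

# The shebang must be the very first characters of the file.
first_line=$(head -n 1 "$workdir/hello")
case $first_line in
    '#!'*) echo "shebang found: $first_line" ;;
    *)     echo "no shebang on line 1" ;;
esac

# Run it directly so the kernel has to honor the shebang.
result=$("$workdir/hello")
echo "$result"
```

If command -v prints nothing for a shell, as type ksh did above, a shebang naming that shell cannot work.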

I hope this helps.

bakunin

zajtat · August 21, 2016, 5:26am

Hi Bakunin and thank you for your suggestion/explanation.

I've changed the first line to

#! /bin/bash

Unfortunately, I'm getting the same error that there is no file specified as argument2.

I apologise for misunderstanding things or not explaining them clearly. I'll try to do better.

Many thanks!

wisecracker · August 21, 2016, 7:28am

I think Don's script has found a bug in what looks like your Perl script, wherever that came from!?

I could be wrong but it looks as though your script cannot differentiate between 0, (zero) as a number and 0, (zero) as a file...

-g 0

Did you test your _perl_ script with a file named "0"?

Just a thought, as Don's script looks as though it generates a file name "0".

Don_Cragun · August 21, 2016, 8:40am

Hi zajtat,
Yes, the first invocation of combining.executable :

./match.pl -f 9477885102_R01C02header -g 9477885102_R01C01header -k 1 -l 1 -v '5 6 7' > 0

creates a file named 0 and the second invocation of combining.executable :

./match.pl -f 9477885102_R02C01header -g 0 -k 1 -l 1 -v '5 6 7' > 1

uses file 0 and creates file 1 . Subsequent passes through the loop alternate between creating file 0 and using file 1 on odd numbered invocations (after invocation 1) and creating file 1 and using file 0 on even numbered invocations.

And, on OS X, the script I suggested works with both bash (version: GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin15)) and with ksh producing the output you said you wanted with the three input files you provided as samples in post #3 in this thread (except the order of the 1st two lines in the output was switched) with combining.executable being the script:

#!/bin/bash
awk '
FNR == NR {
	k[$1, $2, $3] = $0
	next
}
($1, $2, $3) in k {
	print k[$1, $2, $3], $(NF - 2), $(NF - 1), $NF
	next
}
1' "$2" "$1"

which works fine as long as the input files are text files. Note that with the number of files you are combining, the above script might or might not work. The awk utility is only required to work on text files and some of the files this script will be processing will not be text files (due to excessive line lengths).

If your perl script won't accept 0 as a filename, that is your problem. The script I provided (after changing the 1st line to use your shell) meets all of your specifications.

If you would like to give us a clear specification of the allowed filenames for the -g option to your perl script, I might be able to help you make the script I suggested use different temporary filenames.
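In the meantime, here is a rough sketch of the same loop using temporary names generated by mktemp, so the bare string 0 is never passed as a filename. This rests on a guess: in Perl the string "0" counts as false, so a test such as "was -g given?" could reject a file named 0, but nothing in this thread confirms that match.pl works that way. The combine function is only a stand-in for the real command, matching on the first column, and the three small *header files are made-up sample data.

```shell
#!/bin/sh
# Stand-in for the real command (an assumption, for demonstration only):
# appends the last three fields of matching lines in $1 to the lines of $2.
combine() {
    awk 'FNR == NR { base[$1] = $0; next }
         $1 in base { print base[$1], $(NF - 2), $(NF - 1), $NF }' "$2" "$1"
}

# Work in a scratch directory with made-up sample input files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'a b c 1 2 3 4\nd e f 5 6 7 8\n'  > 1_header
printf 'a b c 11 3 4 5\nd e f 1 2 3 4\n' > 2_header
printf 'a b c 5 6 7 8\nd e f 2 3 4 5\n'  > 3_header

# Temporary names from mktemp are never the bare string "0".
acc=$(mktemp ./acc.XXXXXX)
next=$(mktemp ./acc.XXXXXX)

first=yes
for file in *header
do
    if [ "$first" = yes ]
    then
        cp "$file" "$acc"                # first file becomes the running result
        first=no
    else
        combine "$file" "$acc" > "$next"
        tmp=$acc; acc=$next; next=$tmp   # swap roles for the next pass
    fi
done
mv "$acc" outfile
rm -f "$next"
cat outfile
```

Another option that might be enough is to keep the names 0 and 1 but pass them as ./0 and ./1, since those strings are not "0" either.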

wisecracker · August 21, 2016, 10:44am

Also note that bakunin has a typo in his bash invocation second part:-

#! /bin/bash

......should be......

#!/bin/bash

......without the whitespace between "!" and "/".

EDIT:
I forgot to add that he actually mentioned this in his text, but forgot to correct it.

zajtat · August 21, 2016, 11:30am

Thank you all for your suggestions and explanations!

bakunin · August 21, 2016, 3:52pm

Sorry to correct you, but this is NOT a typo and it was indeed intended that way. Here is a bit of UNIX history:

The UNIX kernel has to have some clue about what constitutes an executable file. OK, there is the x-bit in the filemode, but still, the kernel has to know what to do. For binary files there is the so-called "magic number", which is the first 4 bytes of an executable. Look at any first 4 bytes of binary, compiled files in your system and you will see that they are the same. These 4 bytes tell the kernel to use some member of the exec*() family of system calls to actually execute it.

Unfortunately this is not possible with the input files of the many interpreters a system has: all sorts of shells, awk, sed, many editors (ed, ex, ...) which can be scripted, and so on. Historically there was only one shell and the UNIX kernel had a special provision to try invoking the shell with the file as input in case the magic number was not understandable. Today this has been altered to the system default shell and is still in place: you can see it at work when you write a script without any shebang line. In this case the startup process is like: the kernel finds no valid magic number, then loads the system default shell (which is a normal executable binary) via normal exec*() , then feeds it the file in the hope that the shell can make sense of it. If it is indeed a shell script, it does. If not, you will see some error message, which is coming from the shell, not the kernel itself.

But as UNIX evolved and many different shells (and other interpreters of script-like text) were available the developers of UNIX searched for ways to name the specific shell/interpreter for scripts. The UNIX kernel was extended for a new magic number and this four-byte code was:

#! /

If the kernel finds this magic number it reads the rest of the first line, loads the specified interpreter and - upon succeeding in that - feeds it the file as input. Because the '#' is a comment to the shell it doesn't alter any script at all, the shell will just ignore it like any other comment.

This is the reason why you have to have the shebang in the very first line and why it is not allowed to be indented and why you have to use the absolute path of the shells executable: #! /bin/bash is a legal shebang, #! ../../bin/bash is not.

So, basically, the correct shebang line is:

#! /path/to/interpreter

Unfortunately most people were not able to reliably put a single space where it belongs and many wrote

#!/path/to/interpreter

Which was still a normal comment and the kernel would - failing to recognize the magic number - still load the default shell, but maybe not the specified one. Which is why the kernel was extended again to also recognize a special 3-byte magic number, which was the shebang without the space.

This is the situation we have now. Basically it doesn't matter any more if you write the space or not, but, being as old as i am, you have to have something that sets you apart from the youngsters (save for having to get up at three in the morning to pee). Therefore i always write the "originally correct" shebang, and not these newly-invented hacks which are only going back to the seventies.

I hope this helps.

bakunin

Don_Cragun · August 21, 2016, 4:21pm

Hi bakunin,
Sorry to disagree with you here, but back in the early UNIX days on the PDP-11, the magic number determining the type of executable was 2 bytes (16-bits) not 4 bytes. And, when #! was added to the magic numbers recognized by the kernel, a leading space was not allowed. Since then, some kernels allow one or more leading spaces, some kernels allow a single option (e.g., #!/bin/sh -xv ), and some kernels may even invoke a shell to evaluate the entire first line starting from the 3rd character as a shell command with the rest of the file as input (although I am not aware of any of these systems that are still produced today).

I believe that the PWB UNIX Systems I used when I was learning the OS treated:

#! /bin/sh

as a request to run the sh utility in the bin directory in the <space> directory, but I don't remember if it interpreted it as / /bin/sh or as ./ /bin/sh .

zajtat · August 22, 2016, 2:06am

I was going through the post again to look over things and have double-checked that the perl script works fine with files that have numbers as names. There are no bugs in the perl script and there has never been such a problem; that's why I answered the question about the name of the output file by saying that it does not matter. I did try to provide clear information and answer all the questions to the best of my abilities and I apologise that it was not understandable. Thank you all for your time and help once more!