Concatenate Numerous Files

Hey!
I wanted to find a text version of the Bible for purposes of grepping. The only files I could find, (in the translation I wanted), were Old Testament.txt and New Testament.txt. I thought, "fine, I'll just concatenate those two, no problemo." But when I unpacked them, turns out they had each major book in its own directory, often containing multiple text files. For example:

~/Desktop/New Testament/Colossians:
Colossians1.txt	Colossians2.txt	Colossians3.txt	Colossians4.txt

But, my faith in unix is strong, (possibly due to the depth of my ignorance). Can cat put all this together into one file - in order? I mean, the man page says that cat reads files sequentially, but what does that mean? If the directories were in order, (they're not now, I'll have to do that by hand), would cat work through them sequentially? I don't even see a recursive flag. Will it even move through directories? The truth is, I've only used cat to read files - not to actually concatenate them. Maybe I could feed it with ls, somehow?
I guess what I'm asking is, is there a one-liner that would get me through this, or am I expecting miracles?

LOL

If faith is a result of ignorance what does that tell us about believers? ;-)) (sorry - i can't forego such opportunities).

Seriously: "cat"'s very purpose is to conCATenate files, so what you want to do is "cat"'s core competence, so to speak.

The basic usage is

cat file1 file2 [ ... fileN] > newfile

and it works "sequentially" as the lines in "newfile" will be ordered like this:

file1, line1
file1, line2
...
file1, last line
file2, line1
file2, line2
...
file2, last line
file3, line1
...
fileN, last line

Notice that you can't use one of the input files as the output file: the shell truncates the output file before "cat" even starts reading it, so you would destroy your input.
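
A tiny demonstration with throwaway file names (don't try this on files you care about; the exact behaviour varies a bit between "cat" implementations, but in every case the original content of file1 is lost):

printf 'aaa\n' > file1
printf 'bbb\n' > file2
cat file1 file2 > file1     # the shell empties file1 before cat ever reads it
cat file1                   # shows only "bbb" - the original "aaa" is gone (some cat versions also warn "input file is output file")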

One word of caution: "cat" really concatenates the files without adding anything. Suppose you have two files of two lines each, like this (lineN-1 belongs to the first file, lineN-2 to the second):

line1-1
line2-1
line1-2
line2-2

If the first file ends immediately after its last character, without a final end-of-line, you won't notice any difference as long as you work with the files alone, but concatenate them and the result will look like this:

line1-1
line2-1line1-2
line2-2

which is probably not what you want. You can avoid this by preparing a file that contains only an end-of-line character and using it as a spacer to make sure all the files are properly delimited:

cat file1 spacerfile file2 spacerfile file3 ... > outfile

You see, it is even possible to use the same file over and over again.
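
In case you are wondering how to produce such a spacer file, here is one way (just a sketch - any method that leaves a file containing exactly one newline character will do):

printf '\n' > spacerfile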

Finally, a word about Unix philosophy and why it is a good thing you have to prepare the list of files yourself:

The design philosophy of Unix is that every tool should serve exactly one purpose and serve it as well as possible. "cat" is for concatenating files. If you want to prepare a file list use a special tool for that. Unix tools work like an orchestra: you don't expect the violinist to play the trumpet as well - you get a specialized trumpet-player if you need one. You now have a bunch of really devoted instrumentalists and they are waiting for your leadership. Step up to the podium, wave your conductor's baton and make them sound like the world-class orchestra they are.

You have a lot of directories, each with one or more "*txt" files in it. First, let us prepare a list of these files. We use another specialised program, which really knows how to find files: "find". (To understand how this trumpet player works, here's a little starter.)

find ~/Desktop/New Testament -name "*txt" -type f -print

This will produce a list of files. If you are satisfied with the contents of this list, redirect it to a file:

find ~/Desktop/New Testament -name "*txt" -type f -print > listfile

Now use your editor to change the sorting order in this file to your heart's content. You probably want to keep the canonic order, which is - for the computer - completely arbitrary. Therefore you will have to prepare this order by hand.

When you have your list file ready, issue the following command (the usage of the spacerfile is optional) :

rm resultfile ; while read file ; do cat $file spacerfile >> resultfile ; done <listfile

This will work through the list and first remove any resultfile there might be from a previous run, then set up a loop (while..do-done) where a variable "file" is filled with the filenames one after the other. This variable is then used in the body to append one file after the other to the resultfile. This is why we cleared the resultfile first; otherwise, after three runs we'd have every file in there three times.

To see and understand how the loop works, change it slightly:

while read file ; do echo == $file == ; done <listfile

Which will print the filenames in "listfile", surrounded by equal signs.

I hope this helps (and i hope to have deepened your faith in Unix even while removing some ignorance).

bakunin

2 Likes

With a name like Bakunin, I would expect no less. ; )

This is great - I really appreciate your help! I'm going to have to study this a bit to make sure I understand it enough to ask sensible questions, but I wanted to thank you immediately.

I just want to make sure I understand the problem here - cat will mix things up if the last line of any file is not followed by a newline?

It won't mix things up. What will happen is that the first line of the following file will be joined with the last line of the preceding file (as in bakunin's example).

A proper text file (most of them are) is not missing the last newline, so you probably don't have to worry about this.
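
If you want to check a particular file, here is a small test you could use (a sketch; "somefile.txt" is just a placeholder, and it assumes a tail that understands -c, which any modern one does). Command substitution strips trailing newlines, so the test only triggers when the last byte of the file is something other than a newline:

[ -n "$(tail -c 1 somefile.txt)" ] && echo "somefile.txt is missing its final newline"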

Regards,
Alister

1 Like

OK, I numbered the directories by hand so that they would sort in the canonical order. Now, they had the files within the directories numbered using single digit enumeration, so naturally they don't sort correctly:

./01_Old Testament/01_Genesis/Genesis1.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt

So I worked out a regex to place a zero in front of single digit filenames:

perl -pi -e 's/(?<=[a-z])(?=[0-9]\.txt)/0/g' ./OTfilelist.txt

I won't say how long, or how many tries it took for me to figure this out, even though I'm sure it would make for an exciting story. But, even though I am filled with a feeling of accomplishment, the filenames still do not sort correctly.

./01_Old Testament/01_Genesis/Genesis01.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt

No matter, because I want things to behave in the real world, too.
So, now that I have the magic regex in hand, how can I use it to change the actual filenames?
What I've been able to glean from the web is that there is a system call called "rename" which somehow should work with perl. But there is no mention of "rename" in the perl man page. On the other hand, there is a man page for rename, but it doesn't contain anything that I found illuminating. I'm guessing this is something that has to be called from a script?
I've also seen examples on the web of rename as a(n apparent) standalone executable, but I don't seem to have it. Nor can I find it through ports or fink.

-bash $ rename
-bash: rename: command not found

I guess I should mention I'm using Mac OS 10.6.8, Perl 5.12.4
Is there a different way to invoke this rename? Am I barking up the wrong tree altogether? Surely there's a one-liner solution to recursively rename files with a regex?

Seems you didn't check that well.
Check

rename - perldoc.perl.org
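
That page documents perl's built-in rename function. Just to illustrate, it can be called directly from a one-liner (the file names here are made up):

perl -e 'rename("Genesis1.txt", "Genesis01.txt") or die "rename failed: $!\n";'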

First off, very well done so far. You worked most of it out, but you made it more complicated for yourself than necessary.

Actually they don't have to sort correctly - i gave you a two-step plan: first produce a filelist, then work through that list with a loop:

find ~/Desktop/New Testament -name "*txt" -type f -print > listfile
rm resultfile ; while read file ; do cat $file spacerfile >> resultfile ; done <listfile

The second line will work through the listfile (actually a list of filenames, one every line) sequentially, but "find" will probably not write the files into the list in the order you want. This is why i told you to reorder the listfile by reordering the files - you would just have to move the lines around.

On second thought, you don't even have to move the lines around; there is a utility for that: "sort". So, here is what you do:

  1. Prepare the initial listfile:
find ~/Desktop/New Testament -name "*txt" -type f -print > listfile

The result will probably look like this:

/home/user/Desktop/New Testament/Colossians/Colossians1.txt
/home/user/Desktop/New Testament/Colossians/Colossians2.txt
/home/user/Desktop/New Testament/Colossians/Colossians3.txt
/home/user/Desktop/New Testament/Colossians/Colossians4.txt
/home/user/Desktop/New Testament/John/John1.txt
/home/user/Desktop/New Testament/Mark/Mark1.txt
...
  2. Sort the listfile

Now this is not sorted canonically: Mark and John should come before all the letters (epistles), but alphabetically they don't. Use your editor to add an order number at the beginning of each line:

3 /home/user/Desktop/New Testament/Colossians/Colossians1.txt
4 /home/user/Desktop/New Testament/Colossians/Colossians2.txt
5 /home/user/Desktop/New Testament/Colossians/Colossians3.txt
6 /home/user/Desktop/New Testament/Colossians/Colossians4.txt
2 /home/user/Desktop/New Testament/John/John1.txt
1 /home/user/Desktop/New Testament/Mark/Mark1.txt
...

Never mind that the numbers will not all have the same number of digits. For the nifty little tool i will show you now this is just peanuts: "sort". This, you guessed it, sorts things - not only alphabetically, but also numerically. Read the man page of "sort" and you will see how much it can do.
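
A toy demonstration of the difference (the input lines are made up, just to show the principle):

printf '10 Mark\n3 Colossians\n2 John\n' | sort          # lexical: "10 Mark" comes first, because "1" sorts before "2"
printf '10 Mark\n3 Colossians\n2 John\n' | sort -nk1     # numeric: 2, 3, 10 - which is what we want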

So, after you have added the numbers, use "sort" to sort the file:

sort -nk1 listfile > listfile.sorted

Your file should now look like this:

1 /home/user/Desktop/New Testament/Mark/Mark1.txt
2 /home/user/Desktop/New Testament/John/John1.txt
3 /home/user/Desktop/New Testament/Colossians/Colossians1.txt
4 /home/user/Desktop/New Testament/Colossians/Colossians2.txt
5 /home/user/Desktop/New Testament/Colossians/Colossians3.txt
6 /home/user/Desktop/New Testament/Colossians/Colossians4.txt
..

Check the file again with an editor to see if all worked out. Note that you still have the listfile, so you can change the numbers in there and re-run the "sort" command if not everything is to your satisfaction.

  3. Concatenate the files

Finally use the sorted listfile to create the output. As we have added numbers we need to modify the loop i showed you slightly:

rm resultfile ; while read num file ; do cat $file spacerfile >> resultfile ; done <listfile.sorted

If your files are well-formed you can remove the spacerfile from the call:

rm resultfile ; while read num file ; do cat $file >> resultfile ; done <listfile.sorted

A few words about your solution:

You shouldn't use perl for that. "perl" is a full-blown programming language - a full orchestra of its own. You don't invite a whole orchestra and then tell them you need only one triangle player, for the other instruments you have an orchestra of your own. You can use perl to do all you want to do and if you prefer "perl" above shell code that is ok. But don't write shell code and then use "perl" as a simple regex machine. The shell has its own regexp machines for that (sed, awk, ...).

The usual way is to use the regexp to create the modified name, store this information in a variable and then use this variable's content to change the filename. See below.

"rename" is probably a "perl"-command and internal to this language. In shell code you use "mv", which is short for "move".

The sketch for renaming files would look like this:

<some pipeline providing a list of filenames> | while read filename ; do
     filenew="$(echo "$filename" | sed 's/\([a-z]\)\([0-9]\)\.txt/\10\2.txt/')"
     mv "$filename" "$filenew"
done

I hope this helps.

bakunin

Yes, I do find this all extremely helpful and enlightening. Believe it or not, using sort did occur to me.
But you are right - for the immediate job at hand, renaming files is an unnecessary distraction. Unfortunately, I often get distracted with trying to order things - a symptom of my illness. On the other hand, I thought to keep the originals, and would like them to sort properly. But yes, let's leave that exercise for another time.
OK, since all directories are sorted into canonical order, and since all files have been renumbered with my little regex, this was all that was needed:

sort -n OTfilelist.txt > OTfilelistsorted.txt

They are now all in perfect order, so let's take a moment to grab a beer out of the fridge, and go back to your original instructions....

---------- Post updated at 02:22 AM ---------- Previous update was at 01:05 AM ----------

cat: Testament/21_Ecclesiastes/Ecclesiastes12.txt: No such file or directory
cat: Testament/22_Song: No such file or directory

As you can see, it lost Ecclesiastes12.txt because there's an unescaped space between Old and Testament. And it sees Song of Solomon as three different (non-existent) directories.
Also, changing the filenames in the list files was a bad idea. And in retrospect, it is clear why. So, find does not escape any spaces in filenames in its print output. It's funny, if I just drag a file onto the Terminal, it shows the path with all spaces escaped. You would expect the opposite since drag & drop is such a Mac thing, while find is a real unix program.
Is it possible to simply pipe the stdout of find directly to cat? Perhaps that could eliminate the problem of how it prints paths? Or, better yet, pipe find to sort to cat? Am I over-estimating the omnipotence of unix? I have to admit, it's powerful one-liners that get me excited. It's what really drew me into wanting to learn unix in the first place.
Then again, it may pay to go ahead and fix the actual filenames first. Since it's 02:00 where I'm at, it may be best if I come back to it tomorrow.

rm resultfile ; while read num file ; do cat "$file" >> resultfile ; done <listfile.sorted

The problem is in the listfile find generates. I need to find an app that will output properly escaped filenames, or fix the actual filenames. find's output to the list looks like this:

/01_Old Testament/39_Malachi/Malachi04.txt

I need it to look like this:

/01_Old\ Testament/39_Malachi/Malachi4.txt

Notice that the space between "Old" and "Testament" is not escaped, and so it breaks down. I was thinking the -d{n} flag might get it, (by skipping the directories altogether), then I realized cat probably needs the full path to find the files. I couldn't find a flag that would 'fix' the output of find, either.
I have to fix the list file, first.

Since the original filenames are predictable (identical to the containing directory followed by an incrementing index and the .txt extension), we can just build them until we construct one that doesn't exist. There is no need to sort.

The only information any solution to this problem needs to know is the sequence of books and where to find them.

The following script takes two arguments, $1, the path to the old testament books and, $2, the path to the new testament books. The sequence of book names is embedded in the script. The script begins looking for books in the old testament until a blank line in the embedded list signals it to switch to the new testament.

NOTE: Each book's name in the embedded list must be identical to the directory basename ("Genesis" in the case of "/home/your/Desktop/Bible/Old Testament/Genesis"). Same case. Same spacing.

ot=$1
nt=$2

t=$ot
while IFS= read -r b; do
    [ -z "$b" ] && t=$nt && continue
    i=1
    while cat "$t/$b/$b$i.txt" 2>/dev/null; do
        i=$((i+1))
    done
done <<'END_OF_DAYS'
Genesis
Exodus
...
Zechariah
Malachi

Matthew
Mark
...
Jude
Revelation
END_OF_DAYS

Note the blank line before Matthew (iirc, beginning of the NT); it's critical.

If the script were stored in a file named bible.sh, the following would generate a single text file bible (using pathnames derived from your posts):

sh bible.sh ~/Desktop/Old\ Testament ~/Desktop/New\ Testament > bible.txt

Regards,
Alister

1 Like

I knew that, eventually, someone reading this thread would get frustrated and whip up a script to solve all my problems. It must be the same feeling I get when I meet someone who can barely read or write. Script writing is so far beyond my capabilities that it feels like cheating, somehow. ; )
The way they have the files set up might be a problem for your script. Indeed, it is thee problem.

-bash $ cat OTfilelistsorted.txt
.....
./01_Old Testament/01_Genesis/Genesis8.txt
./01_Old Testament/01_Genesis/Genesis9.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt
.....

Each book constitutes a directory, while each chapter constitutes a numbered file. Genesis, for instance, is broken up into fifty separate files. Correct me if I'm wrong, but it seems like your script is expecting each book to be one file. Could your embedded list contain a wildcard character? Even so, it seems to me we still have the problem of sorting. As you can see, they used single digit enumeration. But I'm going to try to fix the actual filenames, first.

---------- Post updated at 03:04 PM ---------- Previous update was at 02:36 PM ----------

OK, found out that rename is a perl script someone made up. Downloaded the code, et voila!

-bash $ ls
Genesis1.txt	Genesis19.txt	Genesis28.txt	Genesis37.txt	Genesis46.txt
Genesis10.txt	Genesis2.txt	Genesis29.txt	Genesis38.txt	Genesis47.txt
Genesis11.txt	Genesis20.txt	Genesis3.txt	Genesis39.txt	Genesis48.txt
Genesis12.txt	Genesis21.txt	Genesis30.txt	Genesis4.txt	Genesis49.txt
Genesis13.txt	Genesis22.txt	Genesis31.txt	Genesis40.txt	Genesis5.txt
Genesis14.txt	Genesis23.txt	Genesis32.txt	Genesis41.txt	Genesis50.txt
Genesis15.txt	Genesis24.txt	Genesis33.txt	Genesis42.txt	Genesis6.txt
Genesis16.txt	Genesis25.txt	Genesis34.txt	Genesis43.txt	Genesis7.txt
Genesis17.txt	Genesis26.txt	Genesis35.txt	Genesis44.txt	Genesis8.txt
Genesis18.txt	Genesis27.txt	Genesis36.txt	Genesis45.txt	Genesis9.txt
-bash $ ls | rename 's/(?<=[a-z])(?=[0-9]\.txt)/0/g'
-bash $ ls
Genesis01.txt	Genesis11.txt	Genesis21.txt	Genesis31.txt	Genesis41.txt
Genesis02.txt	Genesis12.txt	Genesis22.txt	Genesis32.txt	Genesis42.txt
Genesis03.txt	Genesis13.txt	Genesis23.txt	Genesis33.txt	Genesis43.txt
Genesis04.txt	Genesis14.txt	Genesis24.txt	Genesis34.txt	Genesis44.txt
Genesis05.txt	Genesis15.txt	Genesis25.txt	Genesis35.txt	Genesis45.txt
Genesis06.txt	Genesis16.txt	Genesis26.txt	Genesis36.txt	Genesis46.txt
Genesis07.txt	Genesis17.txt	Genesis27.txt	Genesis37.txt	Genesis47.txt
Genesis08.txt	Genesis18.txt	Genesis28.txt	Genesis38.txt	Genesis48.txt
Genesis09.txt	Genesis19.txt	Genesis29.txt	Genesis39.txt	Genesis49.txt
Genesis10.txt	Genesis20.txt	Genesis30.txt	Genesis40.txt	Genesis50.txt
-bash $

This should give us properly sorted lists. Now, a little find/replace to eliminate spaces.... I am now having fun.

Understood. That is exactly what my script expects.

You are wrong. Wildcards are not necessary.

My script does not require filenames to be modified, even though they do not sort properly because the numeric indices do not all have the same number of digits. The inner while-loop generates the filenames itself.

My script is intended to work with the original filenames, unmodified.

Regards,
Alister

1 Like

OK, I just want to be clear. In your example, it looks like you are pointing the script directly to the files, rather than a filelist. Is this correct?

---------- Post updated at 05:44 PM ---------- Previous update was at 05:24 PM ----------

OK, I fixed the file names, which is something I wanted to do anyway:

find ./02_New\ Testament -name "*txt" -type f -depth 2 |rename 's/(?<=[a-z])(?=[0-9]\.txt)/0/g'

Then I generated new file lists, and edited the lists so that they were properly escaped:

perl -pi -e 's/ /\\ /g' NTfilelist.txt

Now we have a perfectly formed listfile! Let's concatenate using Bakunin's solution!

-bash $ rm resultfile ; while read num file ; do cat $file >> resultfile ; done <./OTlistfile.txt
rm: resultfile: No such file or directory
-bash: ./OTlistfile.txt: No such file or directory

OK, clearly I'm missing something. I don't know why it can't find my listfile, OTlistfile.txt. I tried it a couple of different ways, with a ./, etc.
Also, I thought resultfile would be created. Do I need to create an empty file for it to work with? You guys gotta remember, I really know nothing. You must explain as to an overgrown child.
A word to Bakunin as to my use of perl as a regex engine:
Possibly, I am killing flies with cannons, but when I began learning regex, I found out, much to my dismay, that every app understood regexes differently. So, I learned it for grep, (which also has the -P flag), and for perl. In other words, I use perl because I know how to write regexes for it, and I'm never clear what other apps will understand.

No. alister is creating a list of values which he supposes to correspond to directory- and filenames. This is because of the way you laid out the problem in your previous posts.

First off, how to find your listfile:

-bash: ./OTlistfile.txt: No such file or directory

The same way you searched for all the other files:

find ~ -type f -name "OTlistfile.txt" -print

It will not really matter where it is stored. You could use full paths:

rm resultfile ; while read num file ; do cat $file spacerfile >> resultfile ; done </full/path/to/where/you/found/listfile.sorted

Second, here are some general tips, some of them digressing from the problem at hand to some more generalized angle:

Present your problem as concisely as possible.

Your description of the problem (the directory layout, how the files are organised, etc.) changed somewhat over the course of the thread. You didn't contradict yourself directly, but you left out critical information in your first description(s) which you gave out one at a time in your later posts.

Problems in shell scripting - and what you are attempting is shell-scripting, despite your claims it is way above your capabilities - are, like any other programming problem, mostly a matter of a clear and precise definition. Once you have precisely defined what you want to do and how you want it to be done, the solution is in most cases obvious and easy to implement. Have a look in the "Shell Programming and Scripting" forum and compare threads with many answers to the ones with few answers. One would expect the threads with many answers to be more interesting, but the opposite is the case: the ones with many answers are the ones which usually go like this:

Q: i need to produce X
A: do THIS
Q: ah, yes, fine, but i need the Xs to be different, more like Ys
A: modify THIS to be THAT to produce Ys
Q: many thanks, but my Ys should have a special quality of Z
A: *sigh* do THAT, but modify the FOO part to BAR
... rinse and repeat ad nauseam

The fifth answer was not at all more "complex" or "hard" to give than the first - it was just the realization of having come up with 4 answers completely unnecessarily that caused the sigh.

So, analyze the problem you have as exactly and meticulously as possible and you will be on the fast lane to programmer's ascension. What we do is not an arcane art, but just this skill of defining problems precisely and abstractly, mixed with some common sense - trust me, i'm bakunin! ;-))

Second, you sure might want to know how alister's script works (which is, btw., based on a better idea than my own solution, so you should go with it).

Here is the short version of "Introduction to programming logic 101":

The core part is a loop, into which a "here-document" is fed. "Here-documents" are shell constructs which are similar to files but have fixed contents, so that they can be incorporated into scripts directly. Consider the following line:

cat x > y

A file "x" is read by "cat" and its contents are dumped into file "y". What exactly ends up in "y" depends on what was in "x" in first place. But if you want "y" to have a fixed content you could create a here-document replacing the file:

cat <<EOF > y
foo
bar
foobar
EOF

This says: treat everything you read until a line that reads "EOF" as the content of a (virtual) file. We could have put the three lines in file "x" and used the first command to the same effect.

So let's see the relevant part of alister's script:

while IFS= read -r b; do
     .....
done <<'END_OF_DAYS'
Genesis
Exodus
...
Zechariah
Malachi

Matthew
Mark
...
Jude
Revelation
END_OF_DAYS

The core part into which this here-document is fed is this:

while read b ; do
     .....
done

This takes one line at a time, fills it into a variable named "b" and does whatever is between "do" and "done". Because the content of "b" changes with every iteration, we can use it for our purposes. Let us say we want to surround the name with equal signs. We could do this:

while read b; do
     echo == $b ==
done <<'END_OF_DAYS'
Genesis
Exodus
...
Zechariah
Malachi

Matthew
Mark
...
Jude
Revelation
END_OF_DAYS

The command echo == $b == does nothing else than print "==", then the content of variable b (this is what "$b" stands for), then "==" again. Now, alister does something more sophisticated with "$b", but basically this is it. Let us see what he does:

    [ -z "$b" ] && t=$nt && continue

He is abbreviating here, so it is not that obvious. Let us write it in the long form and it will become clearer:

    if [ -z "$b" ] ; then
        t=$nt
        continue
    fi

The -z "$b" means: if "$b" is empty. This is true exactly one time: when the loop reads in the empty line in the middle of the document. In this case the variable "t" is filled with "$nt" (the contents of variable "nt") and the enclosing while-loop is immediately started over again ("continue").
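
You can try the test yourself on the command line (a throwaway illustration):

[ -z "" ] && echo "empty"            # prints "empty"
[ -z "Genesis" ] && echo "empty"     # prints nothing - the string is not empty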

Now, take stock: what are the various variables filled with:

$ot=path to the old testament books, probably "./01_Old Testament"
$nt=path to the new testament books, probably "./02_New Testament"
$t=either $ot (at start) or $nt (after the blank line is processed)

OK, on we go. What else does the while-loop do:

    i=1
    while cat "$t/$b/$b$i.txt" 2>/dev/null; do
        i=$((i+1))
    done

First, a variable "i" is set to "1". Then, there is another loop:

    while cat "$t/$b/$b$i.txt" 2>/dev/null; do
        i=$((i+1))
    done

I have to explain something about while-loops here: the general form is

while <command> ; do
     ......
done

This loop will run "<command>" and if this returns 0 (=TRUE) it will run the body of the loop. The same was true when we used:

while read b ; do
    ....
done

"read" is a command and it returns TRUE when there is something to read and FALSE if not - this is why the loop stops at the end of the list we feed into.

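Here is a throwaway example you can paste into your shell to see the principle - the body keeps running as long as the test command returns TRUE:

i=1
while [ "$i" -le 3 ] ; do
    echo "round $i"
    i=$((i+1))
done
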
Inside this loop there is nothing special done, except for incrementing "i" by 1. This is simply counting: 1, 2, 3, 4, 5, .... Every time the command

cat "$t/$b/$b$i.txt"

is issued. Replacing the various variables with their content (see the list above), this is:

cat "./01_Old Testament/Genesis/Genesis1.txt"

(after incrementing i by 1)

cat "./01_Old Testament/Genesis/Genesis2.txt"

etc. at some point, this will give us a filename which doesn't exist. If Genesis has 51 chapters (haven't bothered to look), this would be:

cat "./01_Old Testament/Genesis/Genesis52.txt"

This time, "cat" would return a non-zero return value, meaning "FALSE" and the loop would stop.
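
You can watch these return values yourself: the special variable "$?" holds the return value of the last command (throwaway example):

cat /no/such/file 2>/dev/null
echo $?        # something non-zero, i.e. FALSE
echo hello
echo $?        # 0, i.e. TRUE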

Now there is one last question left: without redirection "cat" will display the content of the file to the screen (try it!). We haven't used any redirection, so why does the output not land on the screen?

sh bible.sh ~/Desktop/Old\ Testament ~/Desktop/New\ Testament > bible.txt

This is why: it is not only possible to redirect individual commands but also whole scripts. Without the last "> bible.txt" the text would indeed land on the screen. You could replace the redirection with a pipeline:

sh bible.sh ~/Desktop/Old\ Testament ~/Desktop/New\ Testament | more

This will send the output to "more", which will display it on screen, but pagewise (hit any key to display another page, CTRL-C to end), or

sh bible.sh ~/Desktop/Old\ Testament ~/Desktop/New\ Testament | grep someword

to filter only for lines with "someword" in them.

I hope this helps.

bakunin

/PS: i refrained from giving you any practical solution because i figured you are here to learn foremost and to solve your problem at hand second. I hope to have served your intentions best in enabling you to understand and write scripts yourself instead of just throwing something miraculously working at your feet.

Once you overcome your reservations i am sure you will find neverending joy in programming the shell. Don't be shy, there may be only a few chosen, but an awful lot are invited. :wink:

4 Likes

Boy, I've avoided scripting because it's a whole other language to learn. I barely have a grasp on the command line. But you guys have tossed me into the deep end - I may as well learn to swim.
Two things occur to me:
1.) Possibly, I have malformed, (if that's the word), text files. It is likely they were created on a Windows machine, so I will check that out, and convert them if necessary. Perhaps that is the problem? Then, I think I have a way to test whether each file ends in a newline.
2.) Forgive my ignorance, but are the variables defined correctly? How does the script know, for instance, that "ot" equals "./01_Old Testament"?

Scripting is nothing else than the command line. You could cut and paste every program written here to the command line and it would work. On the other hand you could paste any command line content to a file, make it executable and you have a script. So, again: you are already scripting, like it or not.

Yes, this is always a problem. When you edit files intended to be used on a Unix system, better stay away from Windows editors, "notepad" foremost. Use an editor under Unix instead; it has an awful lot of them.

This is actually a good question, because i left it out in my explanation of the script. Let me correct that error now. When we have a look at the start of the script we see:

ot=$1
nt=$2

t=$ot

The first two lines use special variables which are filled by the shell automatically. When you provide command line arguments to a script the shell will use the variables "1", "2", etc., and fill these with the first, second, third, ... argument. Example:

./script one two three

now inside "script", "$1" would be "one", "$2" would be "two" and "$3" would be "three". Have a look how alister has suggested to start his script and you will know how these two variables were filled.
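
If you want to see this for yourself, put the following two lines into a throwaway file, say "args.sh" (the name is made up):

echo "first argument:  $1"
echo "second argument: $2"

Calling it as "sh args.sh one two" will print "one" and "two" respectively.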

A proper description of how alister's script is to be called would be:

./bible.sh /path/to/old/testament /path/to/new/testament > bible.txt

The third line fills variable "t" with the first of these two values (you remember, it will be filled with the other upon encountering the empty line).

1 Like

I think I get it now. The first argument, (the path to the target dir), automatically becomes the first variable, and so on?

Here is the difference - to me - between scripting and the command line. With the CL, you have an application, maybe a flag or two, an argument, a target file. Very simple, and you can pipe that output to another app, etc...
With a script, there are all these strange symbols whose meanings I don't understand, and formatting whose purpose I don't understand. Why is there sometimes a bracket sitting on a line by itself? Why are some lines indented, and some not? I have no clue. Scripting is an area where I'll really have to start at the beginning. Not that I don't want to learn - I just haven't, yet. ; )

Here is where I'm at now. Turns out they were dos files. I checked this by opening a couple in vim. So I used find to pipe into dos2unix:

find ./01_Old\ Testament -name "*txt" -type f |dos2unix
find ./02_New\ Testament -name "*txt" -type f |dos2unix

I again used vi to check a couple of the files, and it no longer says [dos] on the bottom.
Then I wanted to see if all had an EOF newline. I'm not sure how to do this, but I opened a couple files in TextWrangler, showing invisibles, and they do seem to have a last newline. Anyway, the last lines of the files I looked at have that 'capital L lying on its side' symbol, and the cursor will travel one line below that last line. How's that for scientific?
The way I invoked Alister's script is a little different than you show because I packed it away where I thought it should go:

which bible.sh
/usr/local/bin/bible.sh

But I think it's correct:

sh bible.sh ~/Sandbox/01_Old\ Testament ~/Sandbox/02_New\ Testament >Bible.txt

I even used the full paths, but I still end up with an empty file:

ls -l
total 120
drwxr-xr-x@ 41 rick  staff   1394 Oct 28 13:48 01_Old Testament
drwxr-xr-x@ 26 rick  staff    884 Oct 29 16:43 02_New Testament
-rw-r--r--   1 rick  staff      0 Oct 30 18:06 Bible.txt
-rw-r--r--   1 rick  staff  13054 Oct 29 16:55 NTfilelist.txt
-rw-r--r--   1 rick  staff  41377 Oct 29 16:47 OTfilelist.txt

It shouldn't be a problem with the eof newline - we would simply end up with some lines stuck together. Did the fact that I changed directories affect the script? It doesn't seem like that should be the case. If I understand correctly, the script expects that input from the CL. I'm not sure how to proceed.

---------- Post updated 10-31-12 at 12:02 AM ---------- Previous update was 10-30-12 at 06:38 PM ----------

I also thought that, since we have proper list files, why not go back and try earlier solutions?

Elixir Sinari's solution:

rm resultfile ; while read num file ; do cat "$file" >> resultfile ; done <OTfilelist.txt
cat: : No such file or directory
cat: : No such file or directory
cat: : No such file or directory
. . .

And so on, all the way through.

And Bakunin's again:

rm resultfile ; while read num file ; do cat $file >> resultfile ; done <OTfilelist.txt

What happened here was that the first time, I had a blinking cursor as if it was working. Finally! - I thought. Then I noticed I wasn't using any processor. I looked, but cat did not seem to be a running process, but I let it go for about eight minutes, then killed it. I ended up with a 4 k empty file named 'resultfile'.
I ran it again, but it stopped itself after a couple moments, leaving me again with a 4 k file.

It occurred to me - why not just pipe find's output directly to cat? It worked with dos2unix, (and other programs), so why not?

find ./01_Old\ Testament -name "*txt" -type f |cat > OT.txt

But it only printed another file list named OT.txt

I don't know if this is useful, but here is what's in the directory, and where it's at. The two directories containing the files, and the two perfectly ordered and escaped file lists. I tossed the worthless files that were created by various attempts.

pwd
/Users/rick/Sandbox
ls -l
total 120
drwxr-xr-x@ 41 rick  staff   1394 Oct 28 13:48 01_Old Testament
drwxr-xr-x@ 26 rick  staff    884 Oct 29 16:43 02_New Testament
-rw-r--r--   1 rick  staff  13054 Oct 29 16:55 NTfilelist.txt
-rw-r--r--   1 rick  staff  42672 Oct 30 23:06 OTfilelist.txt

And here's a sample of what the filelist - OTfilelist.txt - looks like, in case there is a problem with it:

./01_Old\ Testament/21_Ecclesiastes/Ecclesiastes11.txt
./01_Old\ Testament/21_Ecclesiastes/Ecclesiastes12.txt
./01_Old\ Testament/22_Song\ of\ Solomon/Song\ of\ Solomon01.txt
./01_Old\ Testament/22_Song\ of\ Solomon/Song\ of\ Solomon02.txt

---------- Post updated at 11:43 AM ---------- Previous update was at 12:02 AM ----------

After much wailing and gnashing of teeth, I was struck with the inspiration that a big problem was with the file names, namely spaces. I replaced all spaces with underscores:

find ./01_Old_Testament -name "*txt" -type f |rename 's/ /_/g'

Then I tossed in xargs:

find ./01_Old_Testament -name "*txt" -type f |xargs cat > OT.txt

Success! So we do the same with the New Testament directory, then concatenate those two files:

cat OT.txt NT.txt > TheeBible.txt

If it warms up a bit, I will make a burnt offering on the grill.
I thank everyone for their help and guidance and hope I didn't completely wear out my welcome. I really learned a lot!

---------- Post updated at 12:41 PM ---------- Previous update was at 11:43 AM ----------

When you added double quotes to $file, was that an attempt to deal with the spaces in the file names?

Correct.

Again: all these devices in scripts work (at least in principle) on the command line too, and everything you write on the command line could come from a script. In fact every shell is a "command language" which you either type in by hand or have typed in from a script file. A script file is basically just a file with stored commands you don't have to type again should you need them a second time.

You mean "{"? It is for command grouping. Basically everything between "{ ... }" works like a single command from the outside. It is like you could write a script, give it a name and then use this script in another script like any other command. This works the same, just inside scripts.
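
A small throwaway example - both "echo"s are grouped, so a single redirection catches the output of both:

{ echo "first line" ; echo "second line" ; } > both.txt
cat both.txt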

Simple: to be more easily readable (for humans). If you have a loop (for ... done or while ... done) or a branch (if ... else ... fi or case ... esac), you indent the body so you can easily see what belongs to the construct. For the shell executing the command it makes no difference at all. You can write:

 while <condition> ; do
     command1
     command2
     command3
done

and the definition of the loop as well as its body will immediately stand out. This:

 while <condition> ; do command1; command2; command3; done

is syntactically the same, but it is a lot harder to see what is the loop definition and what is its body.

Very well. There is a "-exec" flag for find, which you could use. It takes a "command template" in which the filename "find" has found is represented by "{}". Suppose you have a command

cmd /path/to/myfile.txt

and you want to execute this command with every txt-file in all subdirectories of "/there/too". You would write:

find /there/too -type f -name "*txt" -exec cmd {} \;

"find" will find all the filenames and for each filename found that way execute the command up to "\;" (this signifies the end of the template command) with "{}" replaced by the actual filename. You might find this device extremely useful and - i can promise - once you got the hang of it you can easily outperform every filemanager there might be. Unix aficionados are not command line fetishists because they are masochists, but because they can type in seconds what you can do with a mouse in hours.

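For instance, the dos2unix conversion from earlier could be done this way (assuming your dos2unix converts files in place when given filenames as arguments, which most current versions do):

find ./01_Old_Testament -type f -name "*txt" -exec dos2unix {} \;
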
We have several "vi" resources here, threads where people tried to explain its usage in general or specific aspects of it. "vi" is most of the time a "love at third sight". First, you think it is overly complicated; the first half year you use it you think it's giving you the third degree; and after this time you start missing its features in every other program you type more than two keystrokes in.

Here is how to make the invisible characters visible in "vi":

Open the file. Hit "<ESC>" (maybe repeatedly, it can't hurt) to make sure you are in command mode. Press ":". A line with ":" at the beginning will appear in the last line of the screen. Enter "set list" and press <ENTER>. The line will disappear and you will see the non-printing characters now. Press ":" again and enter "set nolist" to switch that mode off, or simply leave the editor. These characters will appear and have the following meaning:

^I     tab character
$      line end, Unix style
^M$    line end, DOS/Windows style

Notice that "^I" and "^M" are ONE character, as you can see when you pass over them with the cursor.

To remove the DOS line ends, you can do the following magic:

Again, from the command mode (<ESC>, you remember, will always take you there) press ":" and type:

1,$ s/^M$//

Enter the "^M" by pressing "<CTRL>-<V>" (take the next input character verbatim) and then either "<CTRL>-<M>" or "<ENTER>". You will notice that the "^M" appears.

You can do this with "list" mode set and you will see the "^M"s at the line ends (ALL line ends!) disappear. The command says: from the first to the last line ("1,$") substitute ("s") a ^M character, followed by a line end ("/^M$/") with nothing ("//").

Very good. In Unix every directory has its rationale and purpose and this is even (informally) standardized (we use the word "canonical" for this type of informal standard). Indeed "/usr/local/bin" is the directory where executable files which do not belong to the OS go (OS executables go to "/usr/bin"). "/usr", btw., is for "Unix Software Resources", even if it is usually pronounced like "user".

LOL! Well, i think christians have already burned enough things in history, so it might even work if it doesn't warm up enough. Bishop Theophilus of Alexandria is quoted as having said, while burning down the library of Alexandria, that the books in there either agree with the bible and are superfluous or disagree with the bible and are rightfully burned. I am not divinely inspired to the same degree as this venerable bishop, but i can suggest a book about (Korn) shell and shell programming. It helps best, btw., if read instead of burned:

Barry Rosenberg, Korn Shell Programming Tutorial

You will find the read informative as well as very entertaining.

Exactly. This is the common method of dealing with blanks, because blanks are the shell's default way of separating things. If one doesn't want the blank (or any other) character to do that, one uses quoting.
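
A tiny illustration of what the quotes change (the path is made up):

file="01_Old Testament/01_Genesis/Genesis01.txt"
cat $file      # cat receives TWO arguments: "01_Old" and "Testament/01_Genesis/Genesis01.txt"
cat "$file"    # cat receives ONE argument: the complete path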

I hope this helps.

bakunin

1 Like