Dear all,
I have a huge txt file containing the input files for some setup_code. However, to run my setup_code, I need txt files with a maximum of 1000 input files each.
Please suggest a way to break this big txt file down into small txt files of 1000 entries each.
If each entry is a line and you want 10,000 lines per file, try using the
head and tail commands to chop the file into 10,000-line chunks.
You can probably put it in a for loop, using the wc command to see
how many lines are in the file, hence the maximum number to extract.
head -10000 file_name.txt > file1.txt
head -20000 file_name.txt | tail -10000 > file2.txt
head -30000 file_name.txt | tail -10000 > file3.txt
head -40000 file_name.txt | tail -10000 > file4.txt
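The loop described above, with wc supplying the line count, might look like this. It is only a sketch: the 25-line sample input, the chunk size of 10, and the file names are all made up for the demo.

```shell
#!/bin/sh
# Demo: chop a file into fixed-size chunks with head/tail in a loop.
chunk=10
seq 1 25 > big.txt                 # sample input: 25 lines
total=$(wc -l < big.txt)           # total line count from wc
i=1
start=1
while [ "$start" -le "$total" ]
do
    end=$((start + chunk - 1))
    # lines start..end: keep the first $end lines, drop the first start-1
    head -n "$end" big.txt | tail -n +"$start" > "file$i.txt"
    start=$((end + 1))
    i=$((i + 1))
done
```

With this sample, file1.txt and file2.txt get 10 lines each and file3.txt gets the remaining 5. Note that each pass re-reads the file from the top, which is exactly the wasted IO the next reply complains about.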
Well, gandolf989, that appears to be a poor suggestion, given that split will do this all in a single operation. You are wasting IO and CPU resources, so it will be quite slow. Also, how would you know when to stop writing code or running your loop? There is no point reinventing a process that already works very well and lets you customise the beginning and end of the output files if you need to.
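For reference, the whole job with split is a single command. The 2500-line sample file and the part_ prefix here are just examples:

```shell
#!/bin/sh
# Split a 2500-line sample into files of at most 1000 lines each.
seq 1 2500 > input.txt
split -l 1000 input.txt part_      # produces part_aa, part_ab, part_ac
wc -l part_a?                      # 1000, 1000, 500 lines
```

GNU split additionally offers numeric suffixes (-d) and --additional-suffix=.txt if you want names ending in .txt, but those options are not portable to every Unix.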
If the splitting up is dependent on any condition other than record count, then csplit may be the tool for you. However, without some sample data and rules to follow, it's impossible to really know what you need.
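To illustrate the pattern-based case with csplit, here is a sketch. The SECTION markers, the data, and the chunk_ prefix are invented, since no sample data was posted:

```shell
#!/bin/sh
# Sample data: three records, each introduced by a "SECTION" header line.
printf 'SECTION 1\na\nb\nSECTION 2\nc\nSECTION 3\nd\ne\n' > data.txt
# Cut before every line matching /^SECTION/. '{*}' repeats the pattern
# until input runs out (a GNU extension; POSIX wants a count like {5}).
csplit -s -f chunk_ data.txt '/^SECTION/' '{*}'
```

This produces chunk_00 (the text before the first match, empty here), then chunk_01, chunk_02, chunk_03 with one record apiece.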
Please wrap all code, file, input & output/errors in CODE tags as it makes them far easier to read and preserves multiple spaces and long lines in case these are important.
There seems to be a magic awk command for almost every problem.
---------- Post updated at 12:46 PM ---------- Previous update was at 12:44 PM ----------
csplit might be a far better tool, I just haven't used it. There is certainly a file size where head/tail just won't work, but for many files it might work well enough.
The awk variable FILENAME is provided by awk and contains the name of the input file that is currently being processed. Redefining it is not a good idea. Try something like this instead:
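One way to write such a splitter (a sketch, not the original poster's script; the sample input and out*.txt names are assumptions) is to track the output name in a plain variable so awk's built-in FILENAME is left alone:

```shell
#!/bin/sh
seq 1 2500 > input.txt             # 2500-line sample input
# Start a new output file every 1000 records; outfile is our own
# variable, so the built-in FILENAME is never touched.
awk '
    (NR - 1) % 1000 == 0 { outfile = sprintf("out%d.txt", ++n) }
    { print > outfile }
' input.txt
```

With the 2500-line sample this yields out1.txt and out2.txt with 1000 lines each and out3.txt with 500.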
Note, however, that both your script and the above script consume a file descriptor for each output file created and don't free any file descriptors until awk exits. If you need to create several files, you may have to close files when you're done writing to them to avoid a "too many open files" error. Even if you don't "have to", it is usually a good habit to close files you no longer need open. And, if you have a lot of files with numbers in them that might be more than one digit, you may want to add some leading zeros so the files will appear in numeric order when output by ls ...
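A variant applying both of those suggestions, closing each finished file and zero-padding the numbers so ls lists them in order (the sample input and names are again assumptions):

```shell
#!/bin/sh
seq 1 2500 > input.txt             # 2500-line sample input
awk '
    (NR - 1) % 1000 == 0 {
        if (outfile != "") close(outfile)        # free the file descriptor
        outfile = sprintf("out%03d.txt", n++)    # out000.txt, out001.txt, ...
    }
    { print > outfile }
' input.txt
```

Because close() is called before each new file is opened, awk holds at most one output descriptor at a time, no matter how many pieces are produced.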
And, just out of curiosity, why does your script bother defining:
PATHNAME=$1
CONSTANT=rfio:
GREP=$2
OUTPUT=$3
when none of them are ever referenced in your script?
Note that I also changed the print >> outfile to print > outfile . If you ever need to regenerate the split files after an update to the base file, you will want to overwrite the old files instead of appending to the end of them. (Note, however, that this won't remove any trailing files that may no longer be needed if your updated base file is smaller than it was before.) If that is a concern, you could add a line to your script before invoking awk :
# Remove any earlier versions of the split output files.
rm -f ${3}[0-9][0-9][0-9].txt
In awk , FILENAME is only defined after the first file has been opened, which happens after the BEGIN section has finished. Within the BEGIN section, FILENAME is empty.
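A quick demonstration with a throwaway test file:

```shell
#!/bin/sh
printf 'hello\n' > sample.txt
awk '
    BEGIN    { printf "in BEGIN:   [%s]\n", FILENAME }  # prints []: no file open yet
    FNR == 1 { printf "first line: [%s]\n", FILENAME }  # prints [sample.txt]
' sample.txt
```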