Breaking large file into small files

Dear all,
I have a huge txt file listing the input files for some setup_code. However, to run my setup_code I need txt files with a maximum of 1000 input files each.
Please suggest a way to break this big txt file down into smaller txt files of at most 1000 entries each.

Thanks and greetings,
Emily

man split .
Did you consider the hints at the lower page border?
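
For anyone reading along, here is a minimal sketch of what split could look like for this case. The file name big_list.txt, the prefix InputFile_ and the 2500-line demo input are made up for illustration; only the -l and -a options (both in POSIX split) matter.

```shell
# Hypothetical demo input: 2500 lines standing in for the real list of input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 2500; i++) print "entry" i }' > big_list.txt

# -l 1000: at most 1000 lines per piece
# -a 3:    three-letter suffixes (aaa, aab, aac, ...)
split -l 1000 -a 3 big_list.txt InputFile_

# Show how many lines ended up in each piece.
wc -l InputFile_*
```

The last piece simply holds whatever is left over, so nothing has to be counted beforehand.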


If each entry is a line and you want 10,000 lines try using the
head and tail commands to chop up the file in 10,000 line chunks.
You can probably put it in a for loop with the wc command to see
how many lines are in the file, hence the maximum number to extract.

head -10000 file_name.txt               > file1.txt
head -20000 file_name.txt | tail -10000 > file2.txt
head -30000 file_name.txt | tail -10000 > file3.txt
head -40000 file_name.txt | tail -10000 > file4.txt
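
The loop with wc mentioned above could be sketched roughly like this. The chunk size of 10 and the 25-line demo file are assumptions chosen to keep the example small, and tail -n +N is used to jump to the start of each chunk instead of nesting head and tail.

```shell
# Hypothetical demo input: 25 lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 25; i++) print "entry" i }' > file_name.txt

chunk=10                                    # lines per output file
total=$(wc -l < file_name.txt)              # total number of lines in the input
pieces=$(( (total + chunk - 1) / chunk ))   # round up so a partial last chunk is kept

i=1
while [ "$i" -le "$pieces" ]; do
    start=$(( (i - 1) * chunk + 1 ))
    # tail -n +N prints from line N onward; head keeps the first $chunk of those.
    tail -n +"$start" file_name.txt | head -n "$chunk" > "file$i.txt"
    i=$(( i + 1 ))
done

wc -l file[0-9]*.txt
```

Note that this rereads the input once per chunk, so it gets slower as the file grows.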

Well, gandolf989, that appears to be a poor suggestion given that split will do this all in a single operation. You are wasting IO and CPU resources so it will be quite slow. Also, how would you know when to stop writing code or running your loop? There is no point reinventing a process that works very well already and lets you customise the beginning and end of the output files if you need to.

If the splitting depends on any condition other than record count, then csplit may be the tool for you; however, without some sample data and rules to follow it's impossible to really know what you need.
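
Purely as an illustration of splitting on a condition, here is a sketch with made-up data where every line starting with "# section" begins a new piece. The file name, the pattern and the part_ prefix are assumptions, and the '{*}' repeat count is a GNU csplit extension (POSIX csplit wants a number there).

```shell
# Hypothetical demo input with two marked sections after a header line.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'header' '# section A' 'a1' '# section B' 'b1' 'b2' > input.txt

# -s: do not print byte counts; -f: output prefix; -n 2: two-digit suffixes.
# Each piece ends just before the next line matching the pattern.
csplit -s -f part_ -n 2 input.txt '/^# section/' '{*}'

# Print each piece with its name so the boundaries are visible.
for f in part_*; do
    echo "== $f"
    cat "$f"
done
```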

Please wrap all code, file, input & output/errors in CODE tags as it makes them far easier to read and preserves multiple spaces and long lines in case these are important.

Thanks, in advance,
Robin


Hello emily/gandolf989,

If we need to split a file by line count, say 10,000 lines per file, then the following may help.

awk '{FILENAME="file"int((NR-1)/10000);print >> FILENAME}' Input_file

Thanks,
R. Singh


There seems to be a magic awk command for almost every problem.


csplit might be a far better tool; I just haven't used it. There is certainly a file size at which head/tail just won't work, but for many files it might work well enough.


Thanks all for the useful input, it works fine. :)

Greetings,


Hello,
I am not able to pass an external parameter, $3, into the output file names in this line :(

awk '{FILENAME="$3_"int((NR-1)/200)".txt";print >> FILENAME}' $3

#!/bin/bash
#usage ./copyTextFromCastor.sh $PATH $GREP $OUTPUTFILE

PATHNAME=$1
CONSTANT=rfio:
GREP=$2
OUTPUT=$3

echo "Copying fileName \"$1 | grep $2\" to $3"
srmls "$PATHNAME" --count 99999 --offset 2 | grep "$2" | awk -F'tier2' '{print string path $2}' string="" path=""  > "$3"

echo "progressing ... please be patient..."

## split $3 into small size files, name InputFileN.txt                                                                       
awk '{FILENAME="$3_"int((NR-1)/200)".txt";print >> FILENAME}' $3
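
A note on the parameter question: inside the single quotes the shell never expands $3, and within awk the text "$3_" in double quotes is just a literal string, so the pieces get names beginning with a literal $3_. One common way around this is to hand the shell value to awk with -v. The sketch below assumes that approach; the variable name prefix, the 450-line demo file and the set -- line that fakes the script arguments are all made up for the example.

```shell
# Hypothetical demo input: 450 lines in a file called InputFile.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 450; i++) print "file" i ".root" }' > InputFile

# Stand-ins for the script arguments $1 $2 $3 (assumption for this demo only).
set -- unused_path unused_pattern InputFile

# -v copies the shell value into the awk variable "prefix" before input is read.
awk -v prefix="$3" '{
    outfile = prefix "_" int((NR - 1) / 200) ".txt"
    print > outfile
}' "$3"

ls
```

Within a single awk run, print > outfile truncates the file only when it is first opened, so later lines are still added to the same piece.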

Hello emily,

I'm not sure about your complete requirement; could you please try the following and let me know if it helps?

echo $3 | awk '{FILENAME=$3"_"int((NR-1)/200)".txt";print >> FILENAME}'

You can replace this command with the shown one.

Thanks,
R. Singh


The awk variable FILENAME is provided by awk and contains the name of the input file that is currently being processed. Redefining it is not a good idea. Try something like this instead:

awk '{outfile=FILENAME int((NR-1)/200) ".txt";print >> outfile}' $3

Note, however, that both your script and the above script consume a file descriptor for each output file created and don't free any file descriptors until awk exits. If you need to create several files, you may have to close files when you're done writing to them to avoid a "too many open files" error. Even if you don't "have to", it is usually a good habit to close files you no longer need open. And, if you have a lot of files with numbers in them that might be more than one digit, you may want to add some leading zeros so the files will appear in numeric order when output by ls ...

awk '
BEGIN {	outfile = sprintf("%s%03d.txt", FILENAME, 0)
}
{	print > outfile
}
(NR % 200) == 0 {
	close(outfile)
	outfile = sprintf("%s%03d.txt", FILENAME, int(NR/200))
}' $3

And, just out of curiosity, why does your script bother defining:

PATHNAME=$1
CONSTANT=rfio:
GREP=$2
OUTPUT=$3

when none of them are ever referenced in your script?

Note that I also changed the print >> outfile to print > outfile . If you ever need to update the split files due to an update in a base file, you will want to overwrite the old files instead of appending to the end of them. (Note, however, that this won't remove any trailing files that may no longer be needed if your updated base file is smaller than it was before.) If that is a concern, you could add a line to your script before invoking awk :

# Remove any earlier versions of the split output files.
rm -f "$3"[0-9][0-9][0-9].txt

Hello Ravinder and Don,
Here is my modified script [1] and the output. Why am I getting a filename like:

-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 _0.txt

[1]

#!/bin/bash                                                                                                                  

OUTPUT=InputFile_
GREP=root
EOSPATH="srm://dcache-se-cms.desy.de:8443SingleMuonGun/SingleMuMinus_Fall14_FlatPt-0to200_MCRUN2_72_V3_GEN_SIM_DIGI_RECO_L1/150127_084421/"
FILEPATH[1]=$EOSPATH/0001
FILEPATH[2]=$EOSPATH/0002
#FILEPATH[3]=$EOSPATH/0003                                                                                                   
#FILEPATH[4]=$EOSPATH/0004                                                                                                   

## copy the FileName from eos to $3                                                                                          
for FileNameIndx in "${FILEPATH[@]}"
  do
    if [[ ! -e "dest_path/$FileNameIndx" ]]; then
        echo "Copying fileName \"$FileNameIndx  | grep root\" to $OUTPUT"
        Index=$(echo $FileNameIndx | awk '{split($FileNameIndx, a, "000"); print "000"a[2]}')
        srmls $FileNameIndx --count 99999 --offset 2 | grep $GREP | awk -F'tier2' '{print string path $GREP}' string="" path\
=""  > $OUTPUT$Index
        FINALFILE=$OUTPUT$Index
        echo $FINALFILE
        echo "progressing ... please be patient..."

        awk '
        BEGIN {outfile = sprintf("%s_%01d.txt", FILENAME, 0)
        }
        {print > outfile
        }
        (NR % 200) == 0 {
            close(outfile)
            outfile = sprintf("%s_%01d.txt", FILENAME, int(NR/200))
        }'  $FINALFILE

    fi
done


It is working, but giving the output like:

-rwxr-xr-x 1 emily af-cms   1820  6. Mr 11:18 copyTextFromCastor.sh
-rw-r--r-- 1 emily af-cms 271184  6. Mr 11:18 InputFile_0001
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 InputFile_0001_1.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 InputFile_0001_2.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 InputFile_0001_3.txt
-rw-r--r-- 1 emily af-cms  53584  6. Mr 11:18 InputFile_0001_4.txt
-rw-r--r-- 1 emily af-cms 271456  6. Mr 11:18 InputFile_0002
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 _0.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 InputFile_0002_1.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 InputFile_0002_2.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mr 11:18 InputFile_0002_3.txt
-rw-r--r-- 1 emily af-cms  53856  6. Mr 11:18 InputFile_0002_4.txt

Sorry. My mistake. FILENAME isn't defined yet in the BEGIN clause...

Change:

        BEGIN {outfile = sprintf("%s_%01d.txt", FILENAME, 0)

to:

        NR==1 {outfile = sprintf("%s_%01d.txt", FILENAME, 0)
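
Put together, the corrected splitter might look like the sketch below. It uses the three-digit padding suggested earlier rather than %01d, and the 450-line demo file named InputFile_0001 is made up for the example.

```shell
# Hypothetical demo input: 450 lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 450; i++) print "line " i }' > InputFile_0001

awk '
NR == 1 { outfile = sprintf("%s_%03d.txt", FILENAME, 0) }   # FILENAME is set once input has started
{ print > outfile }
NR % 200 == 0 {
    close(outfile)                                          # release the descriptor before moving on
    outfile = sprintf("%s_%03d.txt", FILENAME, NR / 200)
}' InputFile_0001

wc -l InputFile_0001_*.txt
```

Because an output file is only created when the first line is printed to it, an input whose length is an exact multiple of 200 does not leave an empty trailing piece.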

In awk , FILENAME is only defined after the first file has been opened, which is after the BEGIN section has been finished. Within the BEGIN section FILENAME is empty.
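
A quick way to see this for yourself is the small sketch below; sample.txt is a made-up two-line file. POSIX leaves the value of FILENAME in BEGIN undefined, so what appears between the first pair of brackets may depend on the awk in use, while the second line should show the file name.

```shell
# Hypothetical two-line demo file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\n' > sample.txt

# Print FILENAME once before any input is read and once at the first record.
awk 'BEGIN   { print "BEGIN: [" FILENAME "]" }
     NR == 1 { print "first record: [" FILENAME "]" }' sample.txt
```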


Working fine. :)

thanks everyone for your useful suggestions