I have a few very large files (~2 billion rows, 130 columns each, CDR data) in a folder, and I have written a shell script that reads each file in the folder and creates a new file based on some logic.
The problem is that creating a new file takes a long time because of the size, and I don't want the data to be corrupted if a concurrent run reads a file that another process is already working on. So I need some mechanism to tell the second process that the file is already being processed, i.e. that it should take the next file instead.
I was thinking of flock, but I'm not sure whether it will help me.
Can someone suggest an idea for implementing this?
The shell script reads a file and creates a new file containing only the required columns. For example, if file1.csv contains 120 columns, my requirement is to create a new file with only 30 of them.
For this process I create temp files while restructuring the data.
In this case, if someone runs the script for the same file at the same time, the data gets corrupted.
So I need to implement a file-level lock or flag mechanism.
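Roughly, the per-file step looks something like the sketch below (the paths, column list, and use of cut are simplified placeholders; the real script has more logic):

    # Simplified sketch: keep only the wanted columns, writing to a temp
    # file first and renaming it once the whole file has been processed.
    for f in /data/cdr/*.csv; do
        tmp="$f.tmp"
        cut -d, -f1-30 "$f" > "$tmp" && mv "$tmp" "${f%.csv}_subset.csv"
    done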
You need to make up your mind about what you want to achieve and which restrictions you have to obey.
Can you suppress parallel execution of your script altogether, e.g. with a lock file in the /var/run directory (which would be easiest; see the sketch below)?
Can you (temporarily) move the files being worked on to a different directory?
Can you log which files have already been processed and have your script respect that information?
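For the first approach, a minimal sketch using flock(1), if it is available, could look like this. The lock path and messages are placeholders, and /var/run usually requires root, so you may need a user-writable directory instead:

    #!/bin/sh
    # Take a global lock so only one instance of the script runs at a time.
    LOCKFILE=/var/run/cdr-restructure.lock   # adjust to a writable location if not root

    exec 9>"$LOCKFILE"
    if ! flock -n 9; then
        echo "Another instance is already running, exiting." >&2
        exit 0
    fi

    # ... process the files here; the lock is released when the script exits ...

A per-file variant is the same idea with one lock file per input file, so a second run skips locked files and moves on to the next one.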
One technique for parallelizing processing of large files is to use GNU parallel. If you want to re-arrange or select fields in a file, say with cut or awk, parallel can split the input into chunks and run a separate process on each chunk.
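If the transformation is just a field selection, a sketch of that approach might look like this (the block size, delimiter, and field list are assumptions, and this only works for simple CSV with no quoted fields containing commas or newlines):

    # Keep only fields 1-30 of a comma-separated file, processing ~10 MB
    # blocks in parallel; -k keeps the output lines in input order.
    cat file1.csv | parallel --pipe --block 10M -k cut -d, -f1-30 > file1_30col.csv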
Here's an example from the man page:
EXAMPLE: Processing a big file using more cores
To process a big file or some output you can use --pipe to split up the
data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to gzip -9 in
parallel. One gzip will be run per CPU core. The output of gzip -9 will
be kept in order and saved to bigfile.gz
gzip works fine if the output is appended, but some processing does not
work like that - for example sorting. For this GNU parallel can put the
output of each command into a file. This will sort a big file in
parallel:
cat bigfile | parallel --pipe --files sort | parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
Here bigfile is split into blocks of around 1MB, each block ending in
'\n' (which is the default for --recend). Each block is passed to sort
and the output from sort is saved into files. These files are passed to
the second parallel that runs sort -m on the files before it removes
the files. The output is saved to bigfile.sort.
Note that the man page example covers two cases, the second being where the processing cannot simply append its output (sorting, in this case).