Merge files in a directory

Hey Guys,

I want to merge all the files (Apache Tomcat access logs) for a particular date, say "Aug 24", into a single file.

Is there a quick hack for that?

[tomcat@localhost logs]$ ls -alrth access_log2016-08-*|grep "Aug 24"
-rw-rw-r--. 1 tomcat        tomcat          16M Aug 24 00:00 access_log2016-08-23.23.log
-rw-rw-r--. 1 tomcat        tomcat         6.8M Aug 24 01:00 access_log2016-08-24.00.log
-rw-rw-r--. 1 tomcat        tomcat         4.7M Aug 24 02:00 access_log2016-08-24.01.log
-rw-rw-r--. 1 tomcat        tomcat          14M Aug 24 03:00 access_log2016-08-24.02.log
-rw-rw-r--. 1 tomcat        tomcat          18M Aug 24 04:00 access_log2016-08-24.03.log
-rw-rw-r--. 1 tomcat        tomcat          15M Aug 24 05:00 access_log2016-08-24.04.log
-rw-rw-r--. 1 tomcat        tomcat         5.6M Aug 24 06:00 access_log2016-08-24.05.log
-rw-rw-r--. 1 tomcat        tomcat         8.9M Aug 24 07:00 access_log2016-08-24.06.log
-rw-rw-r--. 1 tomcat        tomcat          19M Aug 24 08:00 access_log2016-08-24.07.log
-rw-rw-r--. 1 tomcat        tomcat          32M Aug 24 09:00 access_log2016-08-24.08.log
-rw-rw-r--. 1 tomcat        tomcat          45M Aug 24 10:00 access_log2016-08-24.09.log
-rw-rw-r--. 1 tomcat        tomcat          44M Aug 24 11:00 access_log2016-08-24.10.log
-rw-rw-r--. 1 tomcat        tomcat          49M Aug 24 12:00 access_log2016-08-24.11.log
-rw-rw-r--. 1 tomcat        tomcat          51M Aug 24 13:00 access_log2016-08-24.12.log
-rw-rw-r--. 1 tomcat        tomcat          53M Aug 24 14:00 access_log2016-08-24.13.log
-rw-rw-r--. 1 tomcat        tomcat          52M Aug 24 15:00 access_log2016-08-24.14.log
-rw-rw-r--. 1 tomcat        tomcat          84M Aug 24 16:00 access_log2016-08-24.15.log
-rw-rw-r--. 1 tomcat        tomcat          57M Aug 24 17:00 access_log2016-08-24.16.log
-rw-rw-r--. 1 tomcat        tomcat          48M Aug 24 18:00 access_log2016-08-24.17.log
-rw-rw-r--. 1 tomcat        tomcat          37M Aug 24 19:00 access_log2016-08-24.18.log
-rw-rw-r--. 1 tomcat        tomcat          38M Aug 24 20:00 access_log2016-08-24.19.log
-rw-rw-r--. 1 tomcat        tomcat          40M Aug 24 21:00 access_log2016-08-24.20.log
-rw-rw-r--. 1 tomcat        tomcat          37M Aug 24 22:00 access_log2016-08-24.21.log
-rw-rw-r--. 1 tomcat        tomcat          26M Aug 24 23:00 access_log2016-08-24.22.log
[tomcat@localhost logs]$

They should be merged into a single file in ascending date-time order: access_log2016-08-23.23.log, then access_log2016-08-24.00.log, and so on.

We can do it manually by running cat on each individual file, but I want to explore a smarter way. I heard from a colleague about a command called plus that concatenates files into a single file, but I'm not 100% sure.

Please assist.

What's wrong with

cat access_log2016-08-24???.log access_log2016-08-24.log

?

EDIT: itkamaraj is right - use redirection as shown in his/her post

cat access_log2016-08-24???.log > access_log2016-08-24.log
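
A minimal sketch of the same one-liner wrapped in a function, in case it helps to reuse it for other dates. The name merge_day and its argument are made up for illustration; it assumes the access_logYYYY-MM-DD.HH.log naming from post #1. The shell expands the pattern in sorted order, so the hourly files reach cat in ascending hour order.

```shell
# merge_day is a hypothetical helper name, not something from the thread.
# Usage: merge_day 2016-08-24
merge_day() {
  day=$1                                  # date part of the file names
  out="access_log${day}.log"              # merged output file
  # ??? matches exactly three characters (".00" to ".23"), so the
  # output file itself never matches the input pattern.
  cat access_log"${day}"???.log > "$out"
}
```

Running merge_day 2016-08-24 in the logs directory would write access_log2016-08-24.log. Running it again simply rewrites that file, because > truncates the output before cat starts.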

Can we automate this? I don't believe I can. If you can, then I would love to learn how.

What exactly do you mean?

for i in access*; do filename=$(echo "$i" | sed "s/.*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/"); cat "$i" >> "$filename"; done

If you run the same command multiple times, it will append to the existing file again (>> never truncates), so each run duplicates the data.

The suggested loop seems odd to me. The sed expression is difficult to read, and sed is a large process to start, so the loop will be slow for many files. It also assumes that the target output files do not already exist.

Would you be better off with:

filenames="$(for i in access*
do
  echo "${i%%.*}"                   # Parameter expansion removes everything from the first dot onward
done | sort -u)"                    # Building a list of required output files

for filename in $filenames          # Unquoted on purpose: split the list into words
do
   cat "${filename}".* > "$filename"   # Write all matching files in one step
done

I know it is a two-step process, but I've kept I/O to a minimum and used shell built-ins rather than calling external processes.

Because of the trailing .* on the cat, this will not match the output file as an input file if it already exists. If you run this and then tidy away the input files, a re-run will pick up the output files as candidate names, but then the cat will complain about missing input. Be aware that by then the > redirection has already emptied the existing output file.

I hope that this helps,
Robin
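
On the re-run case described above: the > redirection is set up before cat runs, so an existing merged file is emptied even when cat then fails for lack of input. A minimal sketch of one way around that, assuming the merged files keep the dot-free names used above (access_log2016-08-24) and the hourly files always contain a dot. The function name merge_all_days is made up for illustration.

```shell
# merge_all_days is a hypothetical name; the logic follows the two-step loop above.
merge_all_days() {
  # Only names containing a dot are hourly inputs, so an existing
  # merged file (no dot in its name) is never picked up as a candidate.
  for i in access*.*
  do
    [ -e "$i" ] || continue           # pattern matched nothing: skip the literal
    printf '%s\n' "${i%%.*}"          # strip everything from the first dot onward
  done | sort -u |
  while IFS= read -r filename
  do
    cat "$filename".* > "$filename"   # inputs are known to exist at this point
  done
}
```

With this variant, running it a second time after the hourly files have been tidied away does nothing, and the merged files are left as they were.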

Hi bluemind2005,
I note that in post #1 in this thread you say you want files from the last hour of one day and the first 23 hours of the next day. I also note that all of the solutions suggested so far would process 24 hours from a single day (i.e., access_log2016-08-24.00.log through access_log2016-08-24.23.log) instead of what you requested. (And I must say that the suggested solutions make a lot more sense to me.)

  1. Do you want the files 00 through 23 for a single day (as in the suggested solutions)? Or, do you want file 23 from one day and files 00 through 22 from the target day (as in your sample in post #1)?
  2. Do you just want to create a combined log file for one day? Or, do you want to create combined log files for all complete days in a directory that do not have combined log files?
  3. Do you want to remove the day's hourly logs after you successfully create the complete daily log file for that day? And, if you do, do you want to create a combined log file (without removing hourly logs) for the current day when there isn't an hourly log for hour 23 yet?
  4. What shell (including version) and operating system are you using?
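
If the answer to question 1 turns out to be the shifted window from post #1 (hour 23 of the previous day, then hours 00 through 22 of the target day), a rough sketch could look like the following. The function name merge_shifted_day, its two arguments and the .merged output suffix are all assumptions made for illustration; nothing here has been confirmed by the original poster.

```shell
# merge_shifted_day is a hypothetical helper.
# Usage: merge_shifted_day 2016-08-23 2016-08-24
merge_shifted_day() {
  prev=$1                                   # previous day, supplies hour 23
  day=$2                                    # target day, supplies hours 00-22
  set -- "access_log${prev}.23.log"         # start the input list with hour 23
  for f in access_log"${day}".[0-2][0-9].log
  do
    case $f in
      *.23.log) ;;                          # belongs to the next day's window
      *) if [ -e "$f" ]; then set -- "$@" "$f"; fi ;;
    esac
  done
  # Assumed output name; it does not match the hourly pattern above.
  cat "$@" > "access_log${day}.merged"
}
```

If the previous day's hour 23 file is missing, cat will report it and return a non-zero status, which is probably what you want to know about before removing any hourly logs.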

In bash 4, an associative array can eliminate the duplicates:

declare -A files
for i in access*
do
  files[${i%%.*}]=
done

for filename in "${!files[@]}"
do
  cat "$filename".* > "$filename"
done