Need help with merging files

I have a total of 100 files (each of variable size) with a total size of 328950 bytes. I want to merge those 100 files into 4 files, each close to equal size (328950/4 ≈ 82238 bytes), but I do not want to break any file. Any Unix shell script help would be really appreciated.

Try:

size=$(cat file* | wc -l)
cat file* | split -l $(( (size + 3) / 4 )) -   # round up so split creates at most 4 pieces

That will create 4 files with roughly the same number of lines..

--
You can split into equal byte sizes using the wc -c option and split -b, but that may break multibyte characters..
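For reference, that byte-based variant would look something like this (it can cut a line, or a multibyte character, in the middle, so it would not keep whole files intact):

size=$(cat file* | wc -c)
cat file* | split -b $(( (size + 3) / 4 )) -   # round up so split creates at most 4 pieces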

Thanks for the reply. I tried the same, but the problem is that it breaks the files. For example, I have 100 files which I have split by customer number. When I merge those 100 files into 4 files, I don't want a customer number to be present in multiple files. For example:

in file1 the data is present as below, where 12345 is my customer number:

12|12345|09876
.
.
12|12345|78901

When merged, it should not be present in both merged files, say xaa and xab; it can only be present in one file.

OK I see, that is what you mean by "break a file" ...

If there are not too many files and if your file names do not contain spaces you could try this crude approach, which may be good enough for your application:

awk '
  BEGIN {
    n=4
    m=1
  }

  FNR==1 {
    if(NR>1) {
      A[m]+=sz
      f="outfile" m
      printf "%s",s>f
      m=1
      for(i=2; i<=n; i++)
        if(A[m]>A[i])
          m=i
      s=x
      sz=0
    }
  }

  {
    s=s $0 ORS
    sz+=length
  }

  END {
    if(NR>1)printf "%s",s>f
  }
' $(ls -drS file*)

It reverse sorts the files on size first and then for each file tries to put it in the emptiest bucket.
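Afterwards you can check how evenly the buckets came out with, for example:

wc -c outfile[1-4]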

Can you please explain the script you have posted so that I can adapt it to my logic?

Sure:

awk '
  BEGIN {
    n=4                      # set number of buckets
    m=1                      # initialize emptiest bucket
  }

  FNR==1 {                   # if a new file is starting to be read (FNR is line number per file)
    if(NR>1) {               # if it is not the very first file (NR is total line number for all files)
      A[m]+=sz               # add the size of the previous file to the emptiest bucket
      f="outfile" m          # specify the bucket output file name
      printf "%s",s>f        # print to last file from memory to emptiest bucket file
      m=1                    # set emptiest bucket to 1
      for(i=2; i<=n; i++)    # for every other bucket
        if(A[m]>A[i])        # if the current minimum bucket is fuller than that bucket
          m=i                # make that bucket the new minimum
      s=x                    # clear the file memory (x is an unset, hence empty, variable)
      sz=0                   # clear its size
    }
  }

  {
    s=s $0 ORS               # Add next line of file to memory
    sz+=length               # Add the number of characters on that line to the size
  }

  END {
    if(NR>1)printf "%s",s>f  # print the last file in memory to the emptiest bucket file
  }
' $(ls -drS file*)           # read the files in reverse sorted size order

Hope this helps


I am getting an error like below:

awk: Cannot find or open file 16.
 The source line number is 4

I have taken a sample of 5 files with names like test1 to test5. test1 looks like below:

12|09876|12345 78907|111.46|A|1234567
12|09876|12345 12345|111.46|A|1234567

Can you please let me know about the error and how to rectify it? The permissions are also good.

Five files? Did you call it like this:

...
' $(ls -drS test[1-5])

What does

ls -drS test[1-5]

produce?

I have called it with $(ls -drS test*). Though with a modification I have been able to pull it off. But a new problem has occurred now; it is showing the below error:

awk: Format item %s cannot be longer than 3,000 bytes.

I know awk cannot read more than 3000 bytes. Can this be rectified somehow?

Please get used to posting the context of the error-producing code.

Without seeing what script/code you ran we can only guess that there's a quote or a comma missing in a printf statement.

This is what I am using:

awk '
  BEGIN {
    n=4
    m=1
  }
  FNR==1 {
    if(NR>1) {
      A[m]+=sz
      f="outfile" m
      printf "%s",s>f
      m=1
      for(i=2; i<=n; i++)
        if(A[m]>A[i])
          m=i
      s=x
      sz=0
    }
  }
  {
    s=s $0 ORS 
    sz+=length
  }
  END {
    if(NR>1)printf "%s",s>f
  }
' $(ls -dr sample*)

Can anyone help?

Scrutinizer's code (and your modification of it) reads input files into memory line by line but writes each entire input file with one printf statement. For any input file larger than 3000 bytes (or whatever your system's size limit is, if it has one), that write will fail.

That can be fixed by copying a line at a time from each input file to the chosen output file as you read it, instead of copying entire input files in single printf calls.
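A rough, untested sketch of that line-at-a-time approach, keeping Scrutinizer's bucket logic, the outfile1..outfile4 names, and the original file-list expression, might look like this:

awk '
  BEGIN { n=4; m=1 }
  FNR==1 {                         # a new input file is starting
    if(NR>1) {
      A[m]+=sz                     # account for the file that just finished
      m=1
      for(i=2; i<=n; i++)          # pick the emptiest bucket for the new file
        if(A[m]>A[i]) m=i
    }
    sz=0
    f="outfile" m                  # every line of this file goes to that bucket
  }
  { print > f; sz+=length($0)+1 }  # copy line by line; +1 for the newline
' $(ls -drS file*)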

But, to even out the four output file sizes (which you said was a primary goal), Scrutinizer's code depends on copying the largest remaining input file to an output file first. Your modification to Scrutinizer's code (sorting files by reverse alphabetical filename instead of sorting by decreasing file size) will not meet your goal of producing similar sized files unless all of your input files are about the same size. You said that with a modification you were able to "pull it off".

How did changing:

$(ls -drS file*)

to:

$(ls -dr file*)

enable you to "pull it off"? I don't see why this change would have any effect on getting the error you're seeing. Did I miss some other change you made to Scrutinizer's suggested code?

Changing Scrutinizer's code to copy a line at a time instead of a file at a time is pretty straightforward. But, doing that without understanding why you found sorting by name instead of sorting by size necessary would be a waste of time. What additional constraint is there that you haven't explained to us that requires a different order of input files in the merged output files?

In addition to what Don Cragun said, try this modification:

awk '
  BEGIN {
    n=4
    m=1
  }
  FNR==1 {
    if(NR>1) {
      A[m]+=sz
      f="outfile" m
      print s>f
      m=1
      for(i=2; i<=n; i++)
        if(A[m]>A[i])
          m=i
      s=x
      sz=0
    }
  }
  {
    s=s (FNR>1?ORS:x) $0 
    sz+=length
  }
  END {
    if(NR>1)print s>f
  }
' $(ls -drS file*)

I think the 3000 byte limit is specific to the printf statement in certain awk implementations (HP-UX?).


Thanks to both of you. It is working now, and I am extremely sorry for delaying the response, as I was trying to do it from SQL. I am using HP-UX, and maybe printf has a 3000 byte limitation there.

-S was not working in my case, therefore I have changed the code in a different way. The updated version is:

awk '
  BEGIN {
    n=4
    m=1
  }
  FNR==1 {
    if(NR>1) {
      A[m]+=sz
      f="outfile" m
      print s>f
      m=1
      for(i=2; i<=n; i++)
        if(A[m]>A[i])
          m=i
      s=x
      sz=0
    }
  }
  {
    s=s (FNR>1?ORS:x) $0 
    sz+=length
  }
  END {
    if(NR>1)print s>f
  }
' $(ls -1 file* | sort -r -n -k5) # intended to sort the files by size

Ah thanks, so print is working, but printf is not. Something to keep in mind..

Yes, the -S, -A and -k options for the ls command only made it into the POSIX standard in Issue 7 (POSIX.1-2008). So OSes that adhere to earlier POSIX standards may not understand these flags..

The command you provided will not work correctly. Note that it uses the -1 option (one name per line) and not the -l (long format) option. Therefore there is no column 5 and the sort will not work..

Try this alternative instead:

[..]
' $(ls -nd * | sort -rn -k5 | awk '{print $NF}') # sort the files by decreasing size

Note that this also only works with files without spaces...

Also note that I used the -n option here instead of the -l option; sometimes ls gets into trouble when user/group names are too long, and that can mess up the ls output columns. This can be avoided by printing the UID and GID rather than the names..

Will keep that in mind. Now, if I have fewer than 4 files in total, it will not produce 4 buckets. Can this be achieved?

Meaning: if I have a total of 3 files of the same size, it produces 3 outfiles. Can a 4th outfile be produced with 0 bytes?

The simple thing would be to add the command:

tee outfile1 outfile2 outfile3 outfile4 < /dev/null

to your script before the awk command to be sure that all of your output files exist and that those that the awk script doesn't write to will be of size zero.

Have you tried Scrutinizer's updated script with the final line being:

' $(ls -nd file* | sort -rn -k5 | awk '{print $NF}') # list files in decreasing file size order

with any files larger than 3000 bytes yet? I know that there are some versions of awk where the maximum string size that can be processed is LINE_MAX (probably 2048 on HP-UX) bytes or a little more (maybe 3000 in your case).

In the future, it would help all of us if instead of saying things like "-S was not working" you would say something more like "When I tried using the -S option, I got the following diagnostic:" and then show us the actual diagnostic that was produced (in CODE tags).
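Once you have a run you are happy with, a quick sanity check that no customer number ended up in more than one output file (assuming the customer number is the second pipe-delimited field, as in your first sample, and the outputs are named outfile1 to outfile4) could be:

awk -F'|' '
  !($2 in seen) { seen[$2]=FILENAME }     # remember the first outfile each customer appears in
  seen[$2]!=FILENAME { print "customer " $2 " appears in both " seen[$2] " and " FILENAME }
' outfile[1-4]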

Thanks a lot. Everything working as expected.