Merge two files with different lengths

ClaraW · April 19, 2013, 9:19pm

Hi there,

I have two very long files like:

file1: two fields

file2: 6 fields

1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1
...

The numbers of lines in file1 and file2 are not equal.

I want to get an output file like

file3: 6 fields, and the first two fields are exactly the same as the first two fields in file1. For example, the line with the first two field "1 123" has a match in file2: "1 123 0 1 0 0", then print the whole line in file2:"1 123 0 1 0 0" to file3. If one line in file1 does not have a match in file2, e.g. "1 125", then print "1 125 0 0 0 0" to file3.

1 123 0 1 0 0
1 125 0 0 0 0 
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10  0 0 0 0
3 215 1 1 2 1
4 56  0 0 0 0
...

I am wondering if this can be done using awk or join or any other in linux? Since the files are very large, I really want it to be fast. Thanks a lot~~~

Note: Field 2 in both file1 and file2 has only number values, but field 1 in both files may have characters too. The two fields are sorted. And in both files, this kind of situation will not happen, no duplicates.

1 123
1 123
...

Also we do not need to consider the lines in file2 which do not have any match in file1, for example "1 126 2 1 0 0" (no match in file1), then this line should not be added to file3.

Scrutinizer · April 19, 2013, 9:44pm

Try something like:

awk '{p=$0; getline<f} $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1 > file3

RudiC · April 20, 2013, 5:20am

Where's your file2's lines 1 126 and 2 138 in your output file? Does your specification imply lines in file2 that don't exist in file1 are to be suppressed? What happens to two or more consecutive lines in file1 missing in file2?

While scrutinizer's proposal may be very fast, it's difficult to get the two files back in sync if more than one line is missing in either.

After adding line 1 127 and 1 128 to file1, try this one:

sort file1 file2 |
awk     'tmp == $1 FS $2        {tmp = ""}
         tmp && 
           tmp != $1 FS $2      {print tmp, "0 0 0 0"}
         NF == 2                {tmp = $0; next} 
         NF == 6                {print; tmp = ""}
         END                    {if (tmp) print tmp, "0 0 0 0"}
        '
1 123 0 1 0 0
1 125 0 0 0 0
1 126 2 1 0 0
1 127 0 0 0 0
1 128 0 0 0 0
1 234 2 3 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 234 0 0 0 0

Scrutinizer · April 20, 2013, 5:52am

Thanks, RudiC, I may have misread the requirement. If so, this adaptation might fix this:

awk '{p=$1 FS $2; while( getline<f && $1 FS $2 < p) } $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1

or

awk '{p=$1; q=$2; while(getline<f && ( $1<p || $2<q ) )} $1!=p || $2!=q { $0 = p FS q FS "0 0 0 0" }1' f=file2 file1

ClaraW · April 22, 2013, 3:05pm

Thank you so much, Scrutinizer and RudiC! And yes RudiC, such lines are to be suppressed. I only want to get the information for all lines in file1, if the lines exits in file2, then add the other fields 3, 4, 5 and 6 in file2 to file1, if one line in file1 is not in file2, then add "0 0 0 0" for fields 3, 4, 5 and 6. But if one line in field 2 does not exit in file1, then just ignore the line.

Don_Cragun · April 22, 2013, 8:56pm

Here are two awk scripts that I think do what you need. Use the 1st script if the field separator in either file is a mixture of spaces and tabs. Use the 2nd script if the field separator in both files is always a single space. In both cases, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk :

echo 1st form:
awk '
function gf1() {
        if((getline file1line < file1) != 1) exit(0)
        split(file1line, field, /[ \t]+/)
        key1 = field[1] FS field[2]
        return(1)
}
{       while(key1 < $1 FS $2) {
                if(key1) print key1 " 0 0 0 0"
                gf1()
        }
}
key1 == $1 FS $2 {
        print key1, $3, $4, $5, $6
        gf1()
}
END {   while(1) {
                print key1 " 0 0 0 0"
                gf1()
        }
}' file1=file1 file2 > file3

echo 2nd form:

awk '
function gf1() {
        if((getline key1 < file1) != 1) exit(0)
        return(1)
}
{       while(key1 < $1 FS $2) {
                if(key1) print key1 " 0 0 0 0"
                gf1()
        }
}
key1 == $1 FS $2 {
        print
        gf1()
}
END {   while(1) {
                print key1 " 0 0 0 0"
                gf1()
        }
}' file1=file1 file2 > file4

Note that the 2nd form sends output to file4 instead of file3 (so you can run both forms and compare the output).

ClaraW · April 26, 2013, 7:40pm

Hi all,

I think the problem is maybe I did not state my question clearly. I've changed the statement and made it more easy to understand.

Thanks,

Lu

hanson44 · April 26, 2013, 9:42pm

$ cat file1
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56

$ cat file2
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1

$ cat test.awk
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }

$ sort file1 file2 | awk -f test.awk
1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10 0 0 0 0
3 215 1 1 2 1
4 56 0 0 0 0

alister · April 26, 2013, 10:04pm

Combining both files into a single stream is a very clever approach, hanson44. Great idea. However, the implementation needs a bit more work. If the last line of the sorted stream belongs to file1, its "a b 0 0 0 0" line is not generated.

Regards,
Alister

hanson44 · April 26, 2013, 10:13pm

Ah, the special case not considered. Try again...

$ cat file1
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
4 80

$ cat file2
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1

$ cat test.awk
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }
END {
  if (looking == 1) { print a, b, "0 0 0 0" }
  }

$ sort file1 file2 | awk -f test.awk
1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10 0 0 0 0
3 215 1 1 2 1
4 56 0 0 0 0
4 80 0 0 0 0

alister · April 26, 2013, 10:38pm

Assuming that a - (hyphen) does not occur in either of the first two fields:

awk '{print $1"-"$2 "\t" $0}' file1 | sort -k1,1 > file1.tmp
awk '{print $1"-"$2 "\t" $0}' file2 | sort -k1,1 > file2.tmp
join -e 0 -a 1 -o 1.2,1.3,2.4,2.5,2.6,2.7 file1.tmp file2.tmp > file3

Note: join without the -t option requires sort to use -b. However, since file?.tmp files can never have whitespace before the first field (because awk's default field splitting precludes whitespace in the field variables, $1 and $2, which constitute the new first field), I am able to dispense with sort's -b.

Regards,
Alister

Don_Cragun · April 26, 2013, 11:06pm

Nothing has been said about the sizes of these files. If they are "large", the sort could be relatively expensive. Here is a slightly modified version of one of the awk scripts I proposed earlier (correcting a bug that duplicated the final line of output in some cases) with additional comments added. It avoids the need to sort since the OP has stated that both input files are sorted and produces the desired output except for three lines. The following script doesn't add a trailing space to the output line:

1 125 0 0 0 0

and it produces the output lines:

3 10 0 0 0 0
    and
4 56 0 0 0 0

instead of the output lines:

3 10  0 0 0 0
    and
4 56  0 0 0 0

(note the extra spaces before the "0 0 0 0").

The awk script is:

awk '
# gf1() -  Get a line from file1.
# Description:
#       Set key1 to the 1st two fields from that line separated by a single
#       space.  It is assumed that the input fields are separated by a
#       combination of one or more spaces and tabs.
# Exit Code:
#       0 EOF or error reading from file1.
# Return Value:
#       1 Successful completion.
function gf1() {
        if((getline file1line < file1) != 1) exit(0)
        split(file1line, field, /[ \t]+/)
        key1 = field[1] FS field[2]
        return(1)
}
{       while(key1 < $1 FS $2) {
                # We are here because either this is the 1st line from file2
                # and we have not read a line from file1 yet or the line from
                # file1 does not have a match in file2 and we are looking at a
                # key from file2 that is greater than the key from the current
                # line in file1...
                if(key1) {
                        # Create a enw line for an unmatched key from file1.
                        print key1 " 0 0 0 0"
                        key1 = ""
                }
                gf1()   # Get another line from file1.
        }
}
key1 == $1 FS $2 {
        # We have a matching key, print the line from file2 and get a new line
        # from both files.
        print key1, $3, $4, $5, $6
        gf1()
}
END {   while(key1) {
                # Create new lines for any remaining unmatched keys from file1.
                print key1 " 0 0 0 0"
                gf1()   # We will exit this loop when we hit EOF on file1.
        }
}' file1=file1 file2 > file3

As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk .

Hope this helps,
Don