Deduping file

clueless181 · April 13, 2008, 2:03am

I could really use some help!

I need to dedup a file (let's call it - Test_File.dat) by the first field and then concatenate one value within it....I guess you could say I need to normalize the file.

Test_File.day looks something like:
aaa|bbb|ccc|123|ddd
abc|def|ghi|123|jkm
aaa|bbb|ccc|456|ddd
abc|def|ghi|456|jkm
aaa|bbb|ccc|789|ddd
abc|def|ghi|789|jkm

I need it to be:
aaa|bbb|ccc|123;456;789|ddd
abc|def|ghi|123;456;789|jkm
I've been looking into commands like "awk" and "sort", but I'm above my pay grade.

Thanks,
-Clueless

era · April 13, 2008, 2:13am

Do you need to preserve sort order, or could you start by sorting the file? That would make it a little bit less challenging, and reduce memory needs. The other way is also doable, but not quite as elegant.

sort file |
awk 'BEGIN { FS = OFS = "|" }
{ key=$1 OFS $2 OFS $3 OFS $5;
  if (key == prev) { four=four (four ? ";" : "") $4 }
  else { if (four) print prev, four; four = $4 }
  prev = key }
END { if (four) print prev, four }'

I cheated and rearranged the field order for simplicity. Maybe you can figure out how to move the fourth field back to the fourth position.

Tested on Ubuntu mawk; if you have another awk, minor changes may be required.

matrixmadhan · April 13, 2008, 4:01am

I don't quite understand this ...

here is something in perl

#! /opt/third-party/bin/perl

open(FILE, "<", "sample.txt") or die "Unable to open file <$!>\n";

while(<FILE>) {
  chomp;
  my @arr = split(/\|/);
  my $tmp = "$arr[0]#$arr[1]#$arr[2]#$arr[4]";
  if( defined $fileHash{$tmp} ) {
    $fileHash{$tmp} .= ( $arr[3] . "#" );
  }
  else {
    $fileHash{$tmp} = ($arr[3] . "#");
  }
}

close(FILE);

foreach my $k (keys %fileHash) {
  my @arr = split(/#/, $k);
  print "$arr[0]|$arr[1]|$arr[2]|";
  my $val = $fileHash{$k};
  $val =~ s/#/\;/g;
  $val =~ s/\;$//;
  print "$val|$arr[3]\n";
}

exit(0);

radoulov · April 13, 2008, 6:30am

If the order doesn't matter:

awk 'END { for (k in x) 
print k, x[k], y[k] }
{ x[$1FS$2FS$3] = x[$1FS$2FS$3] ? x[$1FS$2FS$3] ";" $4 : $4
y[$1FS$2FS$3] = $5 } ' FS=\| OFS=\| Test_File.dat

Otherwise:

awk 'END { for (i=1; i<=NR; i++)
if (w) 
  print  w, x[w], y[w] } 
!z[$1FS$2FS$3]++ { w[NR] = $1FS$2FS$3 }
{ x[$1FS$2FS$3] = x[$1FS$2FS$3] ? x[$1FS$2FS$3] ";" $4 : $4
y[$1FS$2FS$3] = $5 }' FS=\| OFS=\| Test_File.dat

Use nawk or /usr/xpg4/bin/awk on Solaris.

clueless181 · April 13, 2008, 9:55am

Thanks guys, I'm going to try some of these solutions right now!

The file layout order matters, but not the record order; meaning the end product could be:

aaa|bbb|ccc|123;456;789|ddd
abc|def|ghi|123;456;789|jkm

or

abc|def|ghi|123;456;789|jkm
aaa|bbb|ccc|123;456;789|ddd

...and the "pay grade" things is just a joke. It means that this type of coding is out of my league (i.e. I'm just a low level coder).

Neo · April 13, 2008, 10:16am

... or a student with homework

clueless181 · April 13, 2008, 10:39am

era/radoulov

Thanks! They both work great.

Would it be too much trouble to get a layman's description of what you did?

radoulov · April 13, 2008, 12:19pm

awk 'END { for (k in x) 
print k, x[k], y[k] }
{ x[$1FS$2FS$3] = x[$1FS$2FS$3] ? x[$1FS$2FS$3] ";" $4 : $4
y[$1FS$2FS$3] = $5 } ' FS=\| OFS=\| Test_File.dat

x[$1FS$2FS$3] = x[$1FS$2FS$3] ? x[$1FS$2FS$3] ";" $4 : $4

x is an associative array with keys based on your logical primary key $1, $2 and $3 and values/elements based on the value of the fourth filed.
The if expression ? then : else statement is composing your elements in the desired format (if array[key] exists -> then append $4 to its value).

y[$1FS$2FS$3] = $5

Another associative array with the same keys and elements based on the last field.

END { for (k in x) 
print k, x[k], y[k] }

After reading the entire input for every key in the x array,
print the desired values: the x key, the x value and the y value.

Hope this helps.