Reducing text file using similar lines

shekhar2010us · March 17, 2013, 10:23pm

Hello,

I am a java programmer but want to try unix for a purpose where I need to reduce a file using its first field.. Here is the sample data:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;0;4;[];[shekhar, raj]
cook;2;0;[lew];[]
far;0;3;[];[venk]
far;1;5;[shekhar];[venk, raj]

I am explaining the dataset first. There are five fields separated by ";". First field is the main id, second and third are numerics, and 4th and 5th fields are list.

I need to combine all lines in the file where the first field matches. By Combination means, The 2nd and 3rd fields should be added and 4th anf 5th fields (lists) should be combined uniquely.

So, the desired output should be:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk, raj]

I could have done this java but want to use the power of unix to get it done fast, since I have tonnes of such very large files.

Thanks a lot.
Shekhar

---------- Post updated at 10:23 PM ---------- Previous update was at 07:16 PM ----------

Please help.

Yoda · March 17, 2013, 10:46pm

Please do not bump up questions if they are not answered promptly.

Here is an awk program based on some assumptions:

awk -F';' ' {
                if ( F1 == $1 && i > 1 )
                {
                        F2 += $2
                        F3 += $3
                        gsub (/\[|\]/, X, $4)
                        gsub (/\[|\]/, X, $5)
                        if ( F4 == "" )
                                F4 = $4
                        else if ( $4 != "" )
                        {
                                F4 = F4 "," $4
                                n = split (F4, V, ",")
                                F4 = ""
                                for ( k = 1; k <= n; k++ )
                                {
                                        if ( k > 1 && V[k] !~ V[k-1] )
                                                F4 = F4 "," V[k]
                                        else if ( k == 1 )
                                                F4 = V[k]
                                }
                        }
                        if ( F5 == "" )
                                F5 = $5
                        else if ( $5 != "" )
                        {
                                F5 = F5 "," $5
                                n = split (F5, V, ",")
                                F5 = ""
                                for ( k = 1; k <= n; k++ )
                                {
                                        if ( k > 1 && V[k] !~ V[k-1] )
                                                F5 = F5 "," V[k]
                                        else if ( k == 1 )
                                                F5 = V[k]
                                }
                        }
                }
                else
                {
                        ++i
                        F1 = $1
                        F2 = $2
                        F3 = $3
                        gsub (/\[|\]/, X, $4)
                        gsub (/\[|\]/, X, $5)
                        F4 = $4
                        F5 = $5
                }
} END {
                for ( j = 1; j <= i; j++ )
                        print F1[j], F2[j], F3[j], "[" F4[j] "]", "[" F5[j] "]"
} ' OFS=';' file

Don_Cragun · March 18, 2013, 5:42am

For the given input, Yoda's script produces the desired output. But, when combining F4 and F5 with the current line's $4 and $5, respectively, it will produce duplicate entries in the output list unless the newly added entries are adjacent to the entry with the same value in the previous line. And, the space after a comma is treated as part of a name. The specification given isn't clear if this is intended, but it seemed that the separator in the lists in the sample input was a comma followed by a space rather than just a comma.

As an example, if the last line in the input file is changed from:

far;1;5;[shekhar];[venk, raj]

to:

far;1;5;[shekhar];[raj, venk]

the last line of the output will be:

far;1;8;[shekhar];[venk,raj, venk]

instead of:

far;1;8;[shekhar];[raj, venk]

And, if the following lines are in the input file:

plus;1;2;[u1, u2];[g1,g2]
plus;1;2;[u2];[g1]

it produces:

plus;2;4;[u1, u2,u2];[g1,g2,g1]

while I would have thought the desired output was:

plus;2;4;[u1, u2];[g1, g2]

Yoda's code also assumes that all lines that need to be combined will be adjacent in the input file. That is true in the sample input, but the specification doesn't specify that this will be true.

Here is an alternative awk script that you may want to consider:

awk '
function combine(ins, LOCAL, a, i, j, n, os) {
        n = split(ins, a, /, */)
        os = a[1]
        for(i = 2; i <= n; i++) {
                for(j = 1; j < i; j++)
                        if(a == a[j]) break
                if(j >= i) os = os ", " a[j]
        }
        return os
} 
BEGIN { FS = OFS = ";" }
{       if($1 in order) i = order[$1]
        else            F1[i = order[$1] = ++oc] = $1
        F2 += $2
        F3 += $3
        gsub (/[][]/, "", $4) 
        if(F4 == "")         F4 = $4
        else if($4 != "")       F4 = combine(F4 "," $4)
        gsub (/[][]/, "", $5) 
        if(F5 == "")         F5 = $5
        else if($5 != "")       F5 = combine(F5 "," $5)
}
END {   for(i = 1; i <= oc; i++)
                print F1, F2, F3, "[" F4 "]", "[" F5 "]"
}' file

With the input file:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;0;4;[];[shekhar, raj]
cook;2;0;[lew];[]
far;0;3;[];[venk]
far;1;5;[shekhar];[raj, venk]
plus;1;2;[u1, u2];[g1,g2]
plus1;1;1;[u1];[]
plus2;0;3;[u2];[raj, venk, g3]
plus2;1;5;[shekhar];[venk, raj]
plus1;1;1;[u2];[g1, g2, g3]
plus1;1;1;[u1];[g2, g4]
plus2;0;3;[u1];[g1, g2, g3]
plus;1;2;[u2];[g1]

this script produces:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk, raj]
plus;2;4;[u1, u2];[g1, g2]
plus1;3;3;[u1, u2];[g1, g2, g3, g4]
plus2;1;11;[u2, shekhar, u1];[raj, venk, g3, g1, g2]

With this same input, Yoda's script produces:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk,raj, venk]
plus;1;2;[u1, u2];[g1,g2]
plus1;1;1;[u1];[]
plus2;1;8;[u2,shekhar];[raj, venk, g3,venk, raj]
plus1;2;2;[u2,u1];[g1, g2, g3,g2, g4]
plus2;0;3;[u1];[g1, g2, g3]
plus;1;2;[u2];[g1]

shekhar2010us · March 18, 2013, 10:24am

Many Thanks Yoda and Don.
Don, you are correct.
1) The input might not be same only in adjacent lines
2) The list is comma separated and when I checked with some tweaks in the inputs , it produces duplicates in the output list.. Which I mentioned in the first post that it should add uniquely. So, I dont need duplicates in the list.

Many thanks once again.

Don_Cragun · March 18, 2013, 2:57pm

Note that if you want the lists combined by my script to use just a comma to separate entries in fields 4 and 5 instead of a comma followed by a space, change the following line:

                if(j >= i) os = os ", " a[j]

to:

                if(j >= i) os = os "," a[j]

Note, however, that the script won't change the lists unless two non-empty lists are found for the same field for the same value in the first field. It could be made to normalize all lists found, but it would run slower.

alister · March 18, 2013, 10:54pm

Finally, a high performance contender has arrived. Step back and eat my dust.

#!/bin/sh

while IFS=\; read -r f1 f2 f3 f4 f5; do
    dir=temp/$f1
    if ! [ -d "$dir" ]; then
        mkdir "$dir"
        touch "$dir/2" "$dir/3" "$dir/4" "$dir/5"
    fi
    printf '%s\n' "$f2" >> "$dir/2"
    printf '%s\n' "$f3" >> "$dir/3"
    printf '%s\n' "$f4" | tr '[], ' '[\n*]' | sed '/./!d' >> "$dir/4"
    printf '%s\n' "$f5" | tr '[], ' '[\n*]' | sed '/./!d' >> "$dir/5"
done

ls -ct temp |
while IFS= read -r dir; do
(
    printf '%s\n' "$dir"
    dir=temp/$dir
    paste -sd+ "$dir/2" | bc
    paste -sd+ "$dir/3" | bc
    sort -u "$dir/4" | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
    sort -u "$dir/5" | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
) | paste -sd\; -
done

Obviously, that was created purely for amusement. It is not seriously recommended over an AWK (or perl, or ...) solution.

It assumes an empty directory named "temp" in the current working directory and reads data via stdin.

Regards,
Alister