Search, and add if present

loba · February 7, 2014, 8:03am

Dear All,

I have to find a way to reorganize a table file according to the last column. The input file looks like this:

cat Input1.txt:
ID:12:23:00Q    EU232    2342    234    123    231    aa1;ab2
ID:11:22:00E    EU112    1232    211    112    233    ab2;ac3
ID:19:24:00S    EU121    569    100    101    244    aa1;ac3
ID:11:33:00S    EU456    332    120    99    221    ac3

My output file should contain the information of the last column in newly created columns like this:

cat Output:
ID:12:23:00Q    EU232    2342    234    123    231    aa1    ab2    na aa1;ab2
ID:11:22:00E    EU112    1232    211    112    233    na    ab2    ac3    ab2;ac3
ID:19:24:00S    EU121    569    100    101    244    aa1    na    ac3    aa1;ac3
ID:11:33:00S    EU456    332    120    99    221    na    na    ac3    ac3

My solution:
In a first step I introduced three new columns containing "na" values.

awk '{ print $1,$2,$3,$4,$5,$6,"na","na","na",$7}' input1.txt > input2.txt

This resulted in the following output:

cat Input2.txt
ID:12:23:00Q    EU232    2342    234    123    231    na    na    na    aa1;ab2
ID:11:22:00E    EU112    1232    211    112    233    na    na    na    ab2;ac3
ID:19:24:00S    EU121    569    100    101    244    na    na    na    aa1;ac3
ID:11:33:00S    EU456    332    120    99    221    na    na    na    ac3

Now, I replaced the "na" if present in the last column.

awk '/aa1/{gsub($7, "aa1")};{print}' input2.txt > output_aa1.txt
awk '/ab2/{gsub($8, "ab2")};{print}' output_aa1.txt > output_aa1_ab2.txt
awk '/ac3/{gsub($9, "ac3")};{print}' output_aa1_ab2.txt >

This kind of works but is only feasible for a limited number of items (in my example three). Is there a way to upscale it? Something along the lines:

for ITEM in `cat item.list`
do
   ???
done

Thanks for your help!

tukuyomi · February 7, 2014, 8:58am

Nothing fancy in the below script, just an all-in-one task:

awk '{$10=$7;$7=$8=$9="na"}
$10~/aa1/{$7="aa1"}
$10~/ab2/{$8="ab2"}
$10~/ac3/{$9="ac3"}
1' input_file > output_file

loba · February 8, 2014, 1:52am

Dear tukuyomi,

thank you for you help. The problem is that my item list contains a few hundred entries for each position. Meaning at position $7 I have over 400, at position $8 there are about 100 and $9 has > 800. Any idea how I could do this?

tukuyomi · February 8, 2014, 2:14am

Just to be clear : input_file's 7th column is what you call 'item list', isn't it?
Do you also mean that you might have aa1 as well as foo, bar, and etc entries only for $7? ab2, fool, bars, etcs for $8, ...?
Thanks for clarifying.

Don_Cragun · February 8, 2014, 2:18am

When we see one of your ~1300 values, how do we know whether it is supposed to go into field 7, 8, or 9?

loba · February 8, 2014, 6:37am

Dear tukuyomi,

yes, position $7 can have a list of possible values not just aa1. The same is true for $8 and $9. I was thinking I could first work on position $7. Save the file and continue with the next one. Your suggestion works the problem is I would have to create a text file for all possible values, safe it and run it. I was wondering if you know a better way? I tired a for loop (and an array) to get all the items for e.g. $7 but it did not work.

---------- Post updated at 12:37 PM ---------- Previous update was at 12:31 PM ----------

Dear Don Cragun,

You are right. I was thinking of doing one list after the other. Like I tried to describe in my last post.

for ITEMS in 'cat item_at_position_7.list'
do
   ???
done < in.txt > out7.txt

I know the for loop does not work

Don_Cragun · February 8, 2014, 3:22pm

Assuming that you have files item_at_position[789].list containing valid values for each of those fields (one value per line), such as:
item_at_position7.list :

aa1
aa2
aa3
aa4

item_at_position8.list :

ab1
ab2
ab3
ab4

and item_at_position9.list :

ac1
ac2
ac3
ac4

and an input file ( Input1.txt ) containing:

ID:12:23:00Q    EU232    2342    234    123    231    aa1;ab2
ID:11:22:00E    EU112    1232    211    112    233    ab2;ac3
ID:19:24:00S    EU121    569    100    101    244    aa1;ac3
ID:11:33:00S    EU456    332    120    99    221    ac3
ID:12:34:00D    DWC11    1    2    3    4    aa1;aa2;abc;ac1
ID:23:45:00D    DWC22    5    6    7    8    ad1;aa1;ab2;ac3
ID:23:59:00D    DWC33    9    10    11    12    aa1;aa2;aa3;ab2;ab3;ab4;ac1;ac3;ac4

then the following awk script:

awk -v outf="Output.txt" '
BEGIN { OFS="    " }
# Replace "na" in field fieldnum with value.
# With alternative else clause, print diagnostic if the field has already been set.
function add(value, fieldnum) {
        if($fieldnum == "na")   $fieldnum = value
        else                    $fieldnum = $fieldnum ";" value
# Replace above line with the following four lines if only one value is allowed per output field.
#       else {  printf("Line %d, multiple value %s for field %d dropped\n",
#                       FNR, value, fieldnum)
#               exitcode = 1
#       }
}
FNR == 1 {      # Increment input file number...
        file++
}
file <= 3 {     # Read values from one of the list files...
        list[file,$1]
        next
}
                # Process main input file...
{       # Split the last field into list values.
        n = split($7, values, /;/)
        # Initialize last four fields.
        $10 = $7
        $7 = $8 = $9 = "na"
        # Process values found on this line.
        for(i = 1; i <= n; i++)
                if((1,values) in list) add(values, 7)
                else if((2,values) in list) add(values, 8)
                else if((3,values) in list) add(values, 9)
                else {  printf("Line %d: value: %s not recognized\n",
                                FNR, values)
                        exitcode = 1
                }
        # Print the updated line.
        print > outf
}
END {   exit exitcode
}' item_at_position_[789].list Input1.txt >&2

prints the following diagnostic to the standard error output:

Line 5: value: abc not recognized
Line 6: value: ad1 not recognized

and stores the following output in Output.txt :

ID:12:23:00Q    EU232    2342    234    123    231    aa1    ab2    na    aa1;ab2
ID:11:22:00E    EU112    1232    211    112    233    na    ab2    ac3    ab2;ac3
ID:19:24:00S    EU121    569    100    101    244    aa1    na    ac3    aa1;ac3
ID:11:33:00S    EU456    332    120    99    221    na    na    ac3    ac3
ID:12:34:00D    DWC11    1    2    3    4    aa1;aa2    na    ac1    aa1;aa2;abc;ac1
ID:23:45:00D    DWC22    5    6    7    8    aa1    ab2    ac3    ad1;aa1;ab2;ac3
ID:23:59:00D    DWC33    9    10    11    12    aa1;aa2;aa3    ab2;ab3;ab4    ac1;ac3;ac4    aa1;aa2;aa3;ab2;ab3;ab4;ac1;ac3;ac4

and, with your original sample input file, stores the foliowing output in Output.txt :

ID:12:23:00Q    EU232    2342    234    123    231    aa1    ab2    na    aa1;ab2
ID:11:22:00E    EU112    1232    211    112    233    na    ab2    ac3    ab2;ac3
ID:19:24:00S    EU121    569    100    101    244    aa1    na    ac3    aa1;ac3
ID:11:33:00S    EU456    332    120    99    221    na    na    ac3    ac3

without producing any diagnostic messages.

Is this what you wanted to do?

If you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of the default /usr/bin/awk .

loba · February 10, 2014, 2:11am

Dear Don Cragun,

thank you very much for your help. Your code works like a charm and I learned a few things!