How to remove mth and nth column from a file?

Hi,

I need to remove the mth and nth columns from a CSV file. Here m and n are not specific numbers; they are variables.

ex.

m=2
n=5

Now I need to remove the 2nd and 5th columns. Please help me do that.

Thanks!!!

awk '{$a="";$b=""} 1' a=$m b=$n

Assigning an empty string does not remove a column. The number of fields in the record will remain unchanged.

Regards,
Alister
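A quick sketch of Alister's point, on made-up comma-separated data: assigning the empty string empties the field but leaves the field count, and both surrounding separators, intact.

```shell
# Emptying field 2 does not remove it: NF stays 5 and ",," remains.
printf 'a,b,c,d,e\n' | awk -F, -v OFS=, '{$2=""; print NF; print}'
# → 5
# → a,,c,d,e
```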

Thanks a lot Jotne
But I guess we should give the file name too after the command, as follows, where we have a file named df_test_removing_column from which we want to remove the 3rd and 4th columns.

$ m=3
$ n=4
$ awk '{$a="";$b=""} 1' a=$m b=$n df_test_removing_column

The output should be as follows. It is just an example using df output (disk space of an AIX server).

Filesystem 512-blocks   Iused %Iused Mounted on
/dev/hd4 262144   5812 9% /
/dev/hd2 9175040   79431 7% /usr
/dev/hd9var 3014656   3968 2% /var
/dev/hd3 2097152   113 1% /tmp
/proc -   - - /proc
/dev/hd10opt 1048576   1741 2% /opt
/dev/lv_home 720896   853 1% /home
/dev/lv_hd 524288   55 1% /opt/hd

Thanks,
R. Singh

Alister is correct.
This can be cleaned up with

awk '{$a=$b="";gsub(FS "+",FS)} 1' a=$m b=$n df_test_removing_column

That is nice and quick. Thanks. I am using the command below on my CSV file:

awk -F "," 'OFS="," {$a="";$b=""} 1' a=$m b=$n output.csv

the output is like below...

AED03852181,AED03852181,AED03852181,04/08/2013,04/08/2013,1239,,3,3,3,3,,3
AED09020382,AED09020382,AED09020382,04/08/2013,04/08/2013,1239,,3,3,3,3,,3

You can see that the 7th and 12th columns are set to blank, since m=7 and n=12. Can you modify the above command to remove the blank columns? I don't actually want to replace ",," with ",": that would cause problems wherever the original file genuinely contains a blank field.

This should do it

awk -F, '{$a=$b="";gsub(FS "+",FS)} 1' a=$m b=$n OFS=, df_test_removing_column
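A caveat worth demonstrating on made-up data: collapsing runs of separators with gsub(FS "+", FS) also swallows fields that were genuinely empty in the input.

```shell
# Removing columns 2 and 4 from a,b,,d,e should give a,,e (field 3 is
# legitimately empty), but the gsub collapses that empty field away too.
printf 'a,b,,d,e\n' | awk -F, -v OFS=, '{$2=$4="";gsub(FS "+",FS)} 1'
# → a,e
```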

Try this:

awk -F, '{sub($a FS,x)} sub($(b-1) FS,x)' a=7 b=12 file

It should work without the braces, too:
awk -F, 'sub($a FS,x) sub($(b-1) FS,x)' a=7 b=12 file

This works by replacing the column and its field separator ($a FS) with x.
Since x is not defined, it will be blank.
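On benign, purely numeric sample data (made up for illustration) this behaves as intended. Note that it relies on the deleted text being harmless when interpreted as a regular expression:

```shell
printf '1,2,3,4,5,6,7,8,9,10,11,12,13\n' | awk -F, 'sub($a FS,x) sub($(b-1) FS,x)' a=7 b=12
# → 1,2,3,4,5,6,8,9,10,11,13
```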

This is done. Thanks everybody for your help!

Or you can use cut

m=7
n=12
cut -d, --complement -s -f$m,$n file
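For example, on a made-up 13-column line (note that --complement is a GNU cut extension and is not portable):

```shell
# Drop columns 7 and 12; all other fields, in order, pass through.
printf '1,2,3,4,5,6,7,8,9,10,11,12,13\n' | cut -d, --complement -f7,12
# → 1,2,3,4,5,6,8,9,10,11,13
```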

If the first or last column is deleted, there is only a single adjacent comma rather than a pair, so the FS "+" cleanup leaves a stray leading or trailing separator. Also, if a column was originally empty, that column would be deleted as well. It's better to just use split and array looping instead:

awk -v m=3 -v n=9 -- '{
    last = split($0, a, /,/)
    append = 0
    for (i = 1; i < last; ++i) {
        if (i != m && i != n) {
            if (append) {
                printf "," a[i]
            } else {
                printf a[i]
                append = 1
            }
        }
    }
    print ""
}' file

Or

awk -v m=3 -v n=9 -- '{ last = split($0, a, /,/); append = 0; for (i = 1; i < last; ++i) { if (i != m && i != n) { if (append) { printf "," a[i]; } else { printf a[i]; append = 1; }; }; } print ""; }' file

If we use bash for the shell it could be simpler:

#!/bin/bash

function remove_columns {
    local A LINE IFS=,
    while read -ra LINE; do
        for A; do
            unset "LINE[A-1]"    # columns are 1-based; bash array indices are 0-based
        done
        echo "${LINE[*]}"
    done
}

remove_columns "$m" "$n" < file  ## could be more than two columns specified
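A self-contained check of the idea on made-up data (assuming bash; the function is repeated here so the snippet runs on its own, with the 1-based-column to 0-based-index shift handled in the unset):

```shell
#!/bin/bash
remove_columns() {
    local A LINE IFS=,            # IFS=, makes read split and "${LINE[*]}" join on commas
    while read -ra LINE; do
        for A; do                 # iterate over the column numbers passed as arguments
            unset "LINE[A-1]"     # columns are 1-based; array indices are 0-based
        done
        echo "${LINE[*]}"
    done
}
printf 'a,b,c,d,e\n' | remove_columns 2 4
# → a,c,e
```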

These suggestions should not be trusted for a moment. Blindly treating literal text data as regular expressions is asking for trouble. Both sub($a FS, x) and sub($(b-1) FS, x) are problematic in multiple respects.

If that text contains a regular expression metacharacter, who knows where in $0 and how much of $0 it will match.

Even if there are no metacharacters, neither substitution is guaranteed to occur at the correct field; if a preceding field matches the regular expression, that earlier field takes precedence. For example, if a is 5, nothing prevents $5 FS from matching at any point between $1 and $5 inclusive.

If there are metacharacters, the situation is worse, because then a literal string treated as a regular expression may not even match itself. For example, both of the following expressions are false: "(1)" FS ~ "(1)" FS and "[a]" FS ~ "[a]" FS. This means that for a=5, $5 FS may not match itself but could match at some other location, either before or after $5.

In Franklin52's code, since the numeric return value of sub($(b-1) FS,x) evaluated in a boolean context controls printing, and since (as just pointed out) literal text as a regular expression may not match itself, entire records could be silently deleted if the controlling sub makes no substitutions.

In Jotne's version, the two numeric return values are converted to strings, concatenated, and the result is evaluated in a boolean context. Since there will always be a numeric return value, and since no number converts to a null string, the string which is evaluated in a boolean context is never empty and so is always true. Even if there were no other problems with the code, I would recommend against this because of the subtlety involved. The chances are extremely high that whoever inherits this code will not fully understand it.

Finally, there's also the possibility that the text is an undefined regular expression, which could produce different output on different awk implementations given identical input.
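Both failure modes are easy to reproduce on contrived data:

```shell
# Wrong field: a=3 targets the second "b", but sub() replaces the first match.
printf 'b,x,b,y\n' | awk -F, '{sub($a FS,x)} 1' a=3
# → x,b,y

# Self-non-match: "[a]," treated as a regex is a character class plus a comma,
# which does not match the literal text "[a],".
printf '[a],x,y\n' | awk -F, '{print (sub($1 FS,"") ? $0 : "no substitution")}'
# → no substitution
```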

Great solution, so long as portability isn't a concern.

If portability is a constraint, --complement is disallowed. In which case using cut would require constructing the -f option-argument from $m and $n, yielding something similar to 1-($m-1),($m+1)-($n-1),($n+1)- . However, the logic required to properly handle all boundary conditions isn't worth the trouble when there's a simple, portable AWK solution available (presented below).

I agree with you that the best (only?) AWK solution is to iterate over the fields, excluding the undesirables. It is a robust approach that's immune to all the problems arising from treating literal text as regular expressions.

I did not test your code, but looking at it there appears to be an off-by-one bug at i < last: last corresponds to the final field, and it is never printed. It should be i <= last.

Aside from that, your implementation is also a bit overcomplicated. There is no need to explicitly split the record into an array when AWK has already split it into field variables for your convenience.

For portability, simplicity, and flexibility, I recommend:

{
    for (i=1; i<=NF; i++)
        if (i != m  &&  i != n)
            s = s OFS $i
    print substr(s, length(OFS)+1)
    s=""
}

Obviously, FS and OFS must be set to the appropriate values.
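For instance, on made-up comma-separated data with FS and OFS set on the command line:

```shell
# Remove columns 2 and 4 by rebuilding the record field by field,
# then strip the leading OFS with substr().
printf 'a,b,c,d,e\n' | awk -F, -v OFS=, -v m=2 -v n=4 '{
    for (i=1; i<=NF; i++)
        if (i != m && i != n)
            s = s OFS $i
    print substr(s, length(OFS)+1)
    s=""
}'
# → a,c,e
```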

For the sake of those who follow in your footsteps, seeking a solution to an identical or similar problem, the least you can do is state how you solved your problem. This is especially true if all the suggestions provided to you were inadequate and you either crafted your own approach or found one elsewhere.

Regards,
Alister


Another formatting trick

{
  sep=""
  for (i=1; i<=NF; i++)
    if (i != m  &&  i != n) {
      printf sep"%s", $i
      sep=OFS
    }
  print ""
}

That would be a problem if OFS were set to something that's special in a printf format string. For example, if OFS were %, then every field after the first would be printed as the literal text %s.

I suspect it was just an oversight on your part, but for those that don't get it, here's the correct way:

    printf "%s%s", sep, $i

Regards,
Alister
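To see the failure mode, set OFS to % (a contrived but legal value) on a two-field sample line:

```shell
# The unquoted format sep"%s" becomes "%%s" once sep=OFS, printing literal %s.
printf 'a b\n' | awk -v OFS=% '{sep=""; for(i=1;i<=NF;i++){printf sep"%s", $i; sep=OFS} print ""}'
# → a%s

# The corrected form keeps OFS as data, not as format text.
printf 'a b\n' | awk -v OFS=% '{sep=""; for(i=1;i<=NF;i++){printf "%s%s", sep, $i; sep=OFS} print ""}'
# → a%b
```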

I remember finding this bug, but my modification probably didn't go through.

In the world of awk, yes, but there's no way I would do that in C, and somehow that makes me reluctant to do it in awk as well. If you're careful about speed, you'll naturally avoid that method despite it appearing simpler. One could argue it might be better, but I wouldn't.

Sometimes we think we can simplify things by minimizing our code, but sometimes it just gets more bloated. Mine may not have been in its most optimized form, but at least it strikes a balance. I know speed isn't really crucial here, but we all have our opinions.

Edit: You could actually be right about appending strings instead of calling printf repeatedly, since that could mean multiple write() system calls depending on awk's buffering, but I still wouldn't use print substr(s, length(OFS)+1).


Or

printf "%s", sep $i

---------- Post updated at 03:15 AM ---------- Previous update was at 02:40 AM ----------

Alister addressed your explicit split(), which ignores the built-in auto-split and is unnecessary overhead. The formatting method does not really matter, but why not present an alternative? I was even inspired to present a third method.

Considering the use of FS and OFS I now have this version:

awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    j = 0
    for (i = 1; i <= NF; ++i) {
        if (i == m || i == n) {
            ++j
            continue
        }
        $(i - j) = $i
    }
    NF -= j
    print
}'
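A quick check of this field-shifting version on made-up data. Decrementing NF to drop trailing fields is widely supported (gawk, mawk, BusyBox awk), though historically it was not guaranteed by every implementation:

```shell
# Shift surviving fields left over the removed ones, then truncate via NF.
printf 'a,b,c,d,e\n' | awk -v m=2 -v n=4 -v FS=, -v OFS=, '{
    j = 0
    for (i = 1; i <= NF; ++i) {
        if (i == m || i == n) { ++j; continue }
        $(i - j) = $i
    }
    NF -= j
    print
}'
# → a,c,e
```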

And adding Alister's suggestion this is the best change I could have for my original code:

awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) {
                s = s OFS $i
            } else {
                s = $i
                append = 1
            }
        }
    }
    print s
    s = ""
}' file

Given the dearth of detail, any optimization efforts would be aimless.

For an average implementation, on average hardware, processing an average text file, under average user expectations, the performance discrepancy between the AWK scripts will be insignificant, and there has been no indication by the OP that this situation is anything but average.

For an extraordinary situation, the details which we do not have (awk implementation? data set characteristics?) are crucial.

Testing with gawk, mawk, and busybox and two types of data, one with modest lines (100 columns, 292 bytes each) and the other with much wider lines (32,765 columns, 185,484 bytes each), yielded highly inconsistent results.

My original suggestion was sometimes the fastest, but only when lines were modestly-sized. As you correctly pointed out, my code does not scale; performance degrades drastically with increasing line length.

Casual testing suggests that you're using gawk, because otherwise the performance of your more recent suggestions regresses greatly compared to your original contribution.

Gawk running your field-shifting script (the one that decrements NF) was the fastest of all implementation/script combinations that I tested.

However, that very same script under Busybox was also the slowest of all interpreter/script combinations (slower even than any run of my original sloth). This script was also the slowest of all under mawk.

The statements that assign to the fields and to NF trigger recomputation of $0 in all three implementations, but only gawk implements an optimization that lazily avoids that overhead until $0 itself (not its fields) is referenced. For the details, follow field0_valid in gawk's field.c.

There are a lot of systems out there that do not use gawk by default. Even among Linux installations, most embedded systems and most Debian derivatives (including most Ubuntu and Ubuntu-derivative versions) do not use it. For all of them, this revision is a setback.

In the absence of any specifics, in my judgement, your original solution exhibits the best balance of scalability and predictable performance across implementations. Minus the redundant split, the off-by-one in the loop condition, and the printf format string bugs:

{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) {
                printf "%s%s", OFS, $i
            } else {
                printf "%s", $i
                append = 1
            }
        }
    }
    print ""
}
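As a sanity check on made-up data that genuinely empty fields survive (which the gsub-based cleanups earlier in the thread lose):

```shell
# Remove columns 4 and 5; the legitimately empty field 2 must be preserved.
printf 'a,,c,d,e\n' | awk -F, -v OFS=, -v m=4 -v n=5 '{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) { printf "%s%s", OFS, $i }
            else { printf "%s", $i; append = 1 }
        }
    }
    print ""
}'
# → a,,c
```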

In this specific case, though, since there is no constraint requiring AWK, and since any cut implementation would outperform any AWK implementation running any of these scripts by a significant margin, the performance debate is academic.

Regards,
Alister