Make copy of text file with columns removed (based on header)

LMHmedchem · January 14, 2015, 1:31am

Hello,

I have some tab-delimited text files with three header rows. The headers look like this (sorry the tabs look so messy):

index	group	Name	input	input	input	input	input	input	input	input	input	input	input
int	char	string	double	double	double	double	double	double	double	double	double	double	double
id	group	Name	AtR_Ptb_L	flatness	inv_dx2	rvalHyd	sumLip	xv0	dxv1	Gmax	k2	Spyridin_N	Salph_N

The files could have any number of columns. What I need to do is simple. I just need to copy the file with one or more columns removed. The columns to be removed would be specified by the value in the third row. For example, I could want the files with columns "dxv1" and "k2" removed.

index	group	Name	input	input	input	input	input	input	input	input	input
int	char	string	double	double	double	double	double	double	double	double	double
id	group	Name	AtR_Ptb_L	flatness	inv_dx2	rvalHyd	sumLip	xv0	Gmax	Spyridin_N	Salph_N

The order of the remaining columns should be the same. It doesn't matter how the list of columns to be removed is formatted. It can be any kind of list.

I have read posts about how to copy specific columns with cut or awk, but not how to skip specific cols and copy everything else. One thing to do would be to find the position of the cols to be removed and use cut, but how to set that up to work in a general implementation is a bit unclear to me. I also suspect that awk would be more efficient.

Any suggestions?

LMHmedchem

RavinderSingh13 · January 14, 2015, 1:50am

Hello LMHmedchem,

The following may help.
1st: For any line:

awk -vs1="dxv1" -vs2="k2" '{for(i=1;i<=NF;i++){if($i==s1 || $i==s2){$i=""}}} 1' Input_file

Output will be as follows.

index   group   Name    input   input   input   input   input   input   input   input   input   input   input
int     char    string  double  double  double  double  double  double  double  double  double  double  double
id group Name AtR_Ptb_L flatness inv_dx2 rvalHyd sumLip xv0  Gmax  Spyridin_N Salph_N

2nd: For the 3rd line only:

awk -vs1="dxv1" -vs2="k2" 'NR==3{for(i=1;i<=NF;i++){if($i==s1 || $i==s2){$i=""}}} 1'  Input_file

Output will be as follows.

index   group   Name    input   input   input   input   input   input   input   input   input   input   input
int     char    string  double  double  double  double  double  double  double  double  double  double  double
id group Name AtR_Ptb_L flatness inv_dx2 rvalHyd sumLip xv0  Gmax  Spyridin_N Salph_N

Thanks,
R. Singh

RudiC · January 14, 2015, 4:18am

If you want those fields removed in every record, not just the header, try:

awk     'NR==3          {MX=split (RM, T, " ")
                         for (i=1; i<=NF; i++)
                             for (n=1; n<=MX; n++)
                                 if ($i==T[n]) T[n]=i
                        }
         !(NR%3)        {for (n=1; n<=MX; n++) $(T[n])=""
                         $0=$0; $1=$1
                        }
         1
        ' FS="\t+" OFS="\t" RM="dxv1 k2" file

LMHmedchem · January 15, 2015, 7:10pm

rudic:

If you want those fields removed in every record, not just the header, try:

awk     'NR==3          {MX=split (RM, T, " ")
   for (i=1; i<=NF; i++)
   for (n=1; n<=MX; n++)
   if ($i==T[n]) T[n]=i
   }
   !(NR%3)        {for (n=1; n<=MX; n++) $(T[n])=""
   $0=$0; $1=$1
   }
   1
   ' FS="\t+" OFS="\t" RM="dxv1 k2" file

This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.

I have attached the original file,
original_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

the file as modified by the code above,
modified_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

and the output I was trying to create,
intended_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

The method posted by RavinderSingh13 modifies the third row, but not the rest of the file.
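A note on why blanking a field is not the same as removing it: assigning an empty string to a field in awk empties the value but keeps the separators around it, so the column count does not change. A minimal sketch on a made-up three-column line (the sample data and the function names `blank_field` and `drop_field` are illustrative and not from the thread; column 2 is hard-coded to keep it short):

```shell
# Blank field 2: the value is emptied, but both tabs remain (still 3 columns).
blank_field() {
    awk 'BEGIN { FS = OFS = "\t" } { $2 = ""; print }'
}

# Drop field 2: rebuild the line from only the fields that are kept.
drop_field() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             out = ""; kept = 0
             for (i = 1; i <= NF; i++)
                 if (i != 2) {
                     sep = (kept++ ? OFS : "")
                     out = out sep $i
                 }
             print out
         }'
}

printf 'a\tb\tc\n' | blank_field | awk -F'\t' '{ print NF }'   # prints 3
printf 'a\tb\tc\n' | drop_field  | awk -F'\t' '{ print NF }'   # prints 2
```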

This code does what I want,

# assign value of header for column to be removed
REMOVE='dxv1'
# assign data input file for $FOLD
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'$FOLD'_'*'_'$SET'_'*'.txt'))
# assign modified input file directory
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')
echo $MOD_INPUT_FILE_DIR

for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   echo $BASE_INPUT_FILE
   # change path to filename
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
   echo $REVISED_FILE

   # find the location of the column to be removed
   HEADER_ROW_LIST=($(cat $BASE_INPUT_FILE | sed -n '3p'))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0'

   # loop through headers
   for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
   do
      # incrementer counter
      (( ELEMENT_COUNTER++ ))
      echo $HEADER_ROW
      if [ "$HEADER_ROW" == "$REMOVE" ]; then
         echo "found remove at position" $ELEMENT_COUNTER
         HEADER_POSITION=$ELEMENT_COUNTER
      fi
   done
   echo $REMOVE "was found at position" $HEADER_POSITION

   # create values before and after position to be removed
   let "REMOVE_m1=$HEADER_POSITION-1";  let "REMOVE_p1=$HEADER_POSITION+1";

   echo "REMOVE_m1" $REMOVE_m1
   echo "REMOVE_p1" $REMOVE_p1

   # remove column from file
   cut --output-delimiter=$'\t' -f"1-$REMOVE_m1,$REMOVE_p1-" "$BASE_INPUT_FILE" > "$REVISED_FILE"
done

This does not currently allow for more than one column to be removed, though the code could be called separately for each column.

Thanks,

LMHmedchem

ongoto · January 16, 2015, 7:17am

Edit:
Code removed: not working right.

LMHmedchem · January 17, 2015, 9:15pm

This is what I have at the moment,

#!/bin/bash

# file with list of columns to remove (by header name)
REMOVE_LIST_FILE='./remove_list.txt'
SET='A'

## 1) read list of columns to remove into array, skip comment lines
# clear array
unset LIST_TO_REMOVE
echo ""

# read from file
while IFS= read -r line
do
    # skip lines whose first non-blank character is #
    if [[ "$line" =~ ^[[:space:]]*\# ]];then
        echo "skipping comment line:" $line
    else
        echo "adding column header:" $line
        LIST_TO_REMOVE=("${LIST_TO_REMOVE[@]}" "$line")
    fi
done < $REMOVE_LIST_FILE

## 2) translate list of headers to list of column numbers
echo ""
echo "creating files without columns" ${LIST_TO_REMOVE[@]}
echo ""

# find position of column to remove based on the first input file
COLUMN_CHECK_FILE=$(ls './'$SET'/input_data/base/f0_'*'_'$SET'_'*'.txt')

# clear columns to be removed
unset COLS_TO_REMOVE

# find column number for column to remove
for CURRENT_REMOVE in "${LIST_TO_REMOVE[@]}"
do
   # grab header row (row 3) from column check file
   HEADER_ROW_LIST=($(cat $COLUMN_CHECK_FILE | sed -n '3p'))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0';

   # loop through headers
   for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
   do
      # incrementer counter
      (( ELEMENT_COUNTER++ ))
      if [ "$HEADER_ROW" == "$CURRENT_REMOVE" ]; then
         HEADER_POSITION=$ELEMENT_COUNTER
         COLS_TO_REMOVE=("${COLS_TO_REMOVE[@]}" "$HEADER_POSITION")
      fi
   done
done

# sort list of columns to remove (one number per line so sort -n can order them)
SORT_COLS_TO_REMOVE=($(printf '%s\n' "${COLS_TO_REMOVE[@]}" | sort -n))
echo "SORT_COLS_TO_REMOVE" ${SORT_COLS_TO_REMOVE[@]}
echo ""

## 3) create -f field string to pass to cut

# the cut string always starts with "-f1-" because the first column will never be cut
# the code works by taking the column number to remove and creating the integer +1 and -1
# the +/-1 values are separated by a comma and added to the cut -f string
# if the column to remove is 4, then 3,5- is added to the cut string to make -f1-3,5-

# additionally, the +1 value is checked to make sure it is not also on the list of
# columns to remove

# initialize
CUT_STRING='-f1-'; CUT_COUNTER='0'; SERIES='0'

# loop on list of columns to remove
for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}"
do
   # set the position of the next item in the remove list
   let "POS_CHECK=$CUT_COUNTER+1"
   # set the column number of the next column
   let "NEXT_VALUE=$CUT_COLUMN+1"
   # find the column number of the next column in the remove list
   CHECK_NEXT=${SORT_COLS_TO_REMOVE[$POS_CHECK]}

   # if the next col in sequence is not the next col on the remove list
   if [ "$NEXT_VALUE" != "$CHECK_NEXT" ]; then

      # if we are not in a series, create col number before and after
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         let "REMOVE_p1=$CUT_COLUMN+1"
         # add col before and after to cut string with comma separation
         # if col 4 is to be cut, creates 3,5
         CUT_STRING=$CUT_STRING$REMOVE_m1','$REMOVE_p1'-'

      # if we are in a series, we arrive here when the end of the series
      # has been identified, create series end bracket by adding 1 to current col position
      else
         let "SERIES_END_BRACKET=$CUT_COLUMN+1"
         # add col before and after to cut string with comma separation
         # if series to cut is 13 14 15, creates 12,16-
         # the trailing - opens the next range, as in the non-series branch
         CUT_STRING=$CUT_STRING$SERIES_START_BRACKET','$SERIES_END_BRACKET'-'
         # series is complete, so deactivate series
         SERIES='0'
      fi

   # if the next col in sequence is the next col on the remove list, in a series
   else
      # if series is not active, start series by storing current position -1
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         SERIES_START_BRACKET=$REMOVE_m1
         # set marker that series is active
         SERIES='1'
      fi
   fi

   # increment position
   ((CUT_COUNTER++))

# for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}" done
done

# if the last character is not -, add
# "${CUT_STRING: -1}" is the last character (the space before -1 is required)
if [ "${CUT_STRING: -1}" != "-" ]; then
   # add trailing - to cut string as a safeguard
   CUT_STRING=$CUT_STRING'-'
fi
echo "CUT_STRING" $CUT_STRING
echo ""

# at this point, a -f cut argument string has been created, if the cut list was 4 13 14 16,
# the cut string is -f1-3,5-12,15-15,17- which gives the desired outcome


## 4) create modified files with the cut columns removed
# remove columns from base input files based on LIST_TO_REMOVE

# list of input files to modify
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'*'_'$SET'_'*'.txt'))
# directory to write modified files
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')

echo "creating modified input files"
# create copies of base input with identified columns removed
for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   # create path for revised file
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
    # remove columns from file as specified in CUT_STRING
   cut --output-delimiter=$'\t' $CUT_STRING  $BASE_INPUT_FILE > $REVISED_FILE
done

Sorry for the very long script. This seems to work and is reasonably fast.

real    0m0.437s
user    0m0.253s
sys     0m0.011s

It's hard to know that step 3 has been coded in a manner that will work for all possibilities. That is the kind of algorithm that is nothing but a box of bear traps.
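One way to sidestep the series bookkeeping in step 3, offered as a sketch rather than a drop-in replacement: list the columns to keep instead of computing ranges around the ones to drop. `cut -f` accepts a plain comma-separated list of fields, so no range logic is needed. The function name `keep_list` and its arguments (the total column count, followed by the column numbers to remove) are assumptions made for illustration.

```shell
# keep_list TOTAL REMOVE...
# Print a comma-separated list of the column numbers 1..TOTAL that are not
# among the REMOVE arguments, suitable for passing to cut -f.
keep_list() {
    total=$1
    shift
    list=''
    col=1
    while [ "$col" -le "$total" ]; do
        skip=0
        for r in "$@"; do
            [ "$r" -eq "$col" ] && skip=1
        done
        if [ "$skip" -eq 0 ]; then
            # append with a comma unless the list is still empty
            list="${list:+$list,}$col"
        fi
        col=$((col + 1))
    done
    printf '%s\n' "$list"
}

keep_list 6 2 4 5      # prints 1,3,6
```

The total could be taken from the header row, for example with `awk -F'\t' 'NR==3 {print NF; exit}' file`, and the result used as `cut -f "$(keep_list "$total" "${COLS_TO_REMOVE[@]}")" file`. The remove list does not need to be sorted, and removing the first or last column needs no special handling.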

I have attached a set of test files in case anyone would rather test than read the above. Just extract the tar.gz and run cut_columns.sh. You may have to make the script executable. You can play around with the files that are removed by modifying remove_list.txt.

LMHmedchem

RudiC · January 18, 2015, 6:54am

Since no sample data was posted, I assumed that every record is spread over three lines, so the relevant values had to be removed from every third line...

Just remove the !(NR%3) to remove the columns in every line.
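To see which records a pattern such as !(NR%3) selects, it can be run against numbered lines. A minimal sketch (the `seq` input and the variable names are only for illustration):

```shell
# NR%3 is 0 only on lines 3, 6, 9, ..., so !(NR%3) is true on every third line.
every_third=$(seq 1 7 | awk '!(NR%3)')
printf '%s\n' "$every_third"     # prints 3 and 6, one per line

# An action with no pattern in front of it runs on every line.
all_lines=$(seq 1 7 | awk '{ print }')
printf '%s\n' "$all_lines" | wc -l
```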

LMHmedchem · January 21, 2015, 9:44pm

Yes, I didn't provide quite enough information there.

I have tested a bit and this doesn't appear to remove the column from the first two rows. From row 3 to the end, it looks fine. I am trying to remove the entire column based on the value of row 3. Removal also includes the corresponding column in the first two rows.

For this example,

awk 'NR==3          {MX=split (RM, T, " ")
                     for (i=1; i<=NF; i++)
                         for (n=1; n<=MX; n++)
                            if ($i==T[n]) T[n]=i
                    }
                    {for (n=1; n<=MX; n++) $(T[n])=""
                     $0=$0; $1=$1
                    }
     1
    ' FS="\t+" OFS="\t" RM="AtR_Ptb_L" $BASE_INPUT_FILE > $REVISED_FILE

The input file $BASE_INPUT_FILE has 58 columns. The output file $REVISED_FILE has 57 columns for row 3 to the end, but the first two rows still have 58 columns. Is the issue that there is no match found until the 3rd row, so the first two rows are printed as is?
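In the script above, T[] and MX are only set when row 3 is read, so for rows 1 and 2 the removal loop has nothing to do and those rows are printed with all of their columns. One way around this, sketched here under the assumption that the input is a regular file that can be read twice, is to give awk the file name twice and use the first pass only to look up the column numbers. The function name `remove_cols` and the awk variable names are illustrative, and the sketch assumes single tabs between fields.

```shell
# remove_cols "name1 name2" file
# Pass 1 (NR == FNR): look only at row 3 and record which column numbers to drop.
# Pass 2: print every row, including rows 1 and 2, without those columns.
remove_cols() {
    awk -v names="$1" '
    BEGIN { FS = OFS = "\t"; n = split(names, want, " ") }
    NR == FNR {
        if (FNR == 3)
            for (i = 1; i <= NF; i++)
                for (k = 1; k <= n; k++)
                    if ($i == want[k]) drop[i] = 1
        next
    }
    {
        out = ""; kept = 0
        for (i = 1; i <= NF; i++)
            if (!(i in drop)) {
                sep = (kept++ ? OFS : "")
                out = out sep $i
            }
        print out
    }' "$2" "$2"
}
```

It would be called as, for example, `remove_cols "dxv1 k2" "$BASE_INPUT_FILE" > "$REVISED_FILE"`.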

I will need to use a bash variable to pass in a value for RM. This is looping and the value of RM will be changing. It could be a single value or several. Passing a bash array there like RM="${LIST_TO_REMOVE[@]}" seems to work for a single element, but seems to be a problem with more than one element on the list. If I convert the array to a space delimited string, then it works for one variable or more than one. What do you think the best method is here?
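On the array question: in bash, "${LIST_TO_REMOVE[@]}" expands to one word per element, so with two or more elements only the first stays attached to RM= and the rest become separate arguments. "${LIST_TO_REMOVE[*]}" joins the elements into a single word using the first character of IFS (a space by default), which is the form a space-split list inside awk expects. A minimal sketch, assuming bash and header names that contain no spaces (`count_args` is a hypothetical helper, and the array contents are made up):

```shell
#!/bin/bash
LIST_TO_REMOVE=(dxv1 k2 Gmax)

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

count_args "RM=${LIST_TO_REMOVE[@]}"   # prints 3: the array became separate words
count_args "RM=${LIST_TO_REMOVE[*]}"   # prints 1: one space-joined word

# The joined form can be handed to awk and split there.
RM="${LIST_TO_REMOVE[*]}"
awk -v RM="$RM" 'BEGIN { n = split(RM, T, " "); print n, T[1], T[n] }'
```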

LMHmedchem

Don_Cragun · January 22, 2015, 3:35am

On a recent Apple MacBook Pro, the following script:

#!/bin/ksh
if [ $# -gt 0 ]
then	rem_list="$@"
else	rem_list="dxv1 k2"
fi
awk -v del="$rem_list" '
BEGIN {	# Split del string into remhdr[] (indexed 1..nrem)...
	nrem = split(del, remhdr)
	# Create rem[] (indexed by titles of fields to be removed).
	for(i = 1; i <= nrem; i++) {
		rem[remhdr[i]]
		delete remhdr[i]
	}
	# Set OFS
	OFS = "\t"
}
NR <= 3 {
	# Read header lines into hdr[]...
	for(i = 1; i <= NF; i++)
		hdr[NR, i] = $i
	if(NR == 3) {
		# Create array of output fields to delete: od[]...
		for(i = 1; i <= NF; i++)
			if(hdr[3, i] in rem) {
				od[i]
				odc++
				delete rem[hdr[3,i]]
				nrem--
			}
		if(nrem) {
			for(i in rem)
				printf("*Field heading \"%s\" not found.*\n", i)
			printf("** Processing aborted. **\n");
			exit 1
		}
		# Print updated headers...
		for(i = 1; i <= 3; i++) {
			oc = NF - odc
			for(j = 1; j <= NF; j++)
				if(!(j in od))
					printf("%s%s", hdr[i, j],
						(--oc) ? OFS : ORS)
		}
	}
	next
}
{	# Print data lines...
	for(i in od)
		$i = ""
	$0 = $0
	$1 = $1
	print
}' original_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

when invoked with no arguments or with the arguments dxv1 and k2 (in either order) with the files you uploaded in post #4 in this thread, produces output identical to the contents of the file intended_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt. Timing the script ten times with its output redirected to a regular file, the longest runtime was:

real	0m0.07s
user	0m0.07s
sys	0m0.01s

and the fastest was:

real	0m0.06s
user	0m0.06s
sys	0m0.00s

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

If this is part of a larger script that puts the header strings to be deleted in an array as you did in your last post, change the first few lines:

if [ $# -gt 0 ]
then	rem_list="$@"
else	rem_list="dxv1 k2"
fi
awk -v del="$rem_list" '

to:

awk -v del="${LIST_TO_REMOVE[*]}" '

or modify the above awk script to read your list file and your data file.

Although tested using ksh (a version of ksh93 on OS X), this will work with recent versions of both bash and ksh .

RudiC · January 22, 2015, 5:19am

Adapted to your revised spec:

RM="RIexp dxv1 k2 THBint5"
awk     'function prep()
                        {for (n=1; n<=MX; n++) $(T[n])=""
                         $0=$0; $1=$1
                         print
                        }
         NR<3           {TMP[NR]=$0; next}
         NR==3          {MX=split (RM, T, " ")
                         for (i=1; i<=NF; i++)
                             for (n=1; n<=MX; n++)
                                 if ($i==T[n]) T[n]=i
                         SV=$0
                         for (j=1; j<3; j++)
                                {$0=TMP[j]
                                 prep()
                                }
                         $0=SV
                        }
                        {prep()}
        ' FS="\t+" OFS="\t" RM="$RM" /tmp/o.txt

Should work with arrays as well.