How to delete all lines with fewer than 32 characters from a text file?

anna428 · February 25, 2010, 8:14am

I need to delete all lines with fewer than 32 characters from a text file.

pludi · February 25, 2010, 8:40am

perl -nl -i.bak -e 'print if length >= 32' yourfile

It will save the original file as a copy with the extension .bak

alister · February 25, 2010, 11:04am

What if Clint Eastwood was a UNIX sysadmin?

(rm data; sed '/.\{32\}/!d' > data) < data

To be clear, even if this works fine for you, it is a dangerous way of doing business. A power failure at just the right moment could leave you with no directory link to your data.

A safe, non-Dirty Harry approved version:

mv data data.bak; sed '/.\{32\}/!d' < data.bak > data
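
A variation on the safe version, as a rough sketch: write the filtered output to a temporary file in the same directory and rename it over the original only if sed succeeded. This assumes mktemp is available (it is not part of POSIX, though most systems ship it). The working directory and the sample contents of data are made up for the demonstration.

```shell
#!/bin/sh
# Demo input: one short line and one line of exactly 32 characters.
workdir=$(mktemp -d) || exit 1
data="$workdir/data"
printf '%s\n' 'too short' '12345678901234567890123456789012' > "$data"

# Filter into a temporary file next to the original, then rename.
# The original stays in place until the new file is complete.
tmp=$(mktemp "$workdir/data.XXXXXX") || exit 1
if sed '/.\{32\}/!d' < "$data" > "$tmp"; then
    mv "$tmp" "$data"
else
    rm -f "$tmp"
fi

cat "$data"
```

Because the new file replaces the old one by rename, it gets the permissions of the temporary file and any hard links to the old file keep pointing at the old contents. If that matters, the mv data data.bak approach above may be the better fit.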

kshji · February 25, 2010, 11:19am

Awk and shell version.

awk 'length($0) >=32 { print } '  infile > outfile

#!/someposixshell
while read line 
do
      [ ${#line} -lt 32 ] && continue
      echo $line
done < infile > outfile

alister · February 25, 2010, 11:21am

Hi, kshji

At the cost of readability for AWK newbies, you can minimize that to:

awk 'length()>31' infile > outfile

And the shell version to:

#!/someposixshell
while IFS="" read -r line 
do
      [ ${#line} -gt 31 ] && echo "$line"
done < infile > outfile

It is necessary to disable IFS field splitting to prevent losing leading/trailing whitespace. The -r raw option is required to prevent joining lines that end in backslash with the line that follows. And the $line argument to echo needs to be quoted to prevent the replacing of runs of IFS whitespace with a single space.
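
The three points above can be seen side by side in a small sketch. The sample input (leading spaces, a run of inner spaces, a trailing backslash) and the variable names are made up for the demonstration, and a default IFS is assumed.

```shell
#!/bin/sh
# One input line with leading spaces, an internal run of spaces,
# and a trailing backslash, followed by a second line.
input=$(mktemp) || exit 1
printf '%s\n' '   a    b \' 'next' > "$input"

# Naive loop: IFS trimming, backslash processing, unquoted expansion.
naive=$(while read line; do echo $line; done < "$input")

# Careful loop: each line is passed through unchanged.
careful=$(while IFS= read -r line; do printf '%s\n' "$line"; done < "$input")

printf 'naive:   [%s]\n' "$naive"
printf 'careful: [%s]\n' "$careful"
rm -f "$input"
```

The naive loop should collapse the two input lines into the single line "a b next", while the careful loop should reproduce both lines as they were written.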

Cheers,
Alister

anbu23 · February 25, 2010, 12:24pm

awk 'length>31' infile > outfile

kcoder24 · February 25, 2010, 1:08pm

My AWK version

'/.{32,}/'

alister · February 25, 2010, 1:57pm

@anbu23 and kcoder24:

heheh. nicely done.

The AWK implementation on the machine in front of me doesn't support brace intervals, but I assume you can shorten that just a bit more to:

'/.{32}/'

Cheers.

vivekraj · February 25, 2010, 10:13pm

Lines with fewer than 32 characters can also be removed from a file with the following command (GNU sed). The options have to be given separately, because -ir would be read as -i with the backup suffix r.

sed -i -r '/^.{0,31}$/d' inputfilename

Ultrix · February 25, 2010, 10:58pm

Try:

sed "/^.\{0,31\}$/d" file

EDIT: Just noticed that the above poster beat me to that answer. Oh, well, it was a good exercise.
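
For anyone who wants to check the boundary, here is a small sketch that runs three of the filters from this thread over lines of 31, 32 and 33 characters. The sample file and the variable names are made up for the test.

```shell
#!/bin/sh
# Three test lines of 31, 32 and 33 characters (all zeros).
sample=$(mktemp) || exit 1
printf '%031d\n%032d\n%033d\n' 0 0 0 > "$sample"

# Run each filter and record the lengths of the lines that survive.
by_sed_delete=$(sed '/^.\{0,31\}$/d' "$sample" | awk '{ print length($0) }' | tr '\n' ' ')
by_sed_keep=$(sed '/.\{32\}/!d' "$sample" | awk '{ print length($0) }' | tr '\n' ' ')
by_awk=$(awk 'length($0) >= 32' "$sample" | awk '{ print length($0) }' | tr '\n' ' ')

echo "sed delete: $by_sed_delete"
echo "sed keep:   $by_sed_keep"
echo "awk:        $by_awk"
rm -f "$sample"
```

All three filters should drop only the 31-character line, so each report should list the lengths 32 and 33.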

thillai_selvan · February 25, 2010, 11:16pm

use strict;
use warnings;
open FH, "<", "file" or die "Cannot open file: $!";        #this is your input file
open RFH, ">", "output" or die "Cannot open output: $!";   #this will be the output file, which receives the lines of 32 or more characters

while ( <FH> )  #reading the data from the file
{
           chomp;  #remove the newline so that it is not counted
           if ( length $_ >= 32 ) #if the line has at least 32 characters
           {
                   print RFH "$_\n"; #writing to the output file
           }
}

This is a Perl script.
The file named "file" holds the input data.
After the script has run, the file "output" will contain all lines that are at least 32 characters long.

Ultrix · February 26, 2010, 12:57am

BTW, what is the practical value of what you're trying to do? I really want to know, because I just wrote a shell script that does what you described, only way better, and I'm wondering how I can use it.

---------- Post updated 02-26-10 at 12:57 AM ---------- Previous update was 02-25-10 at 11:54 PM ----------

thillai_selvan:

use strict;
use warnings;
open FH,"file"; #this is your input script
my @array = <FH>; #reading the file content in the input file
open RFH,"> output"; #this will be the output file which will have the lines from the input file longer than 32 characters
foreach ( @array )
{
   if ( length $_ > 32 ) #if the line length is greater than 32
   {
   print RFH $_; #writing to the output file
   }
}

This is a Perl script.
The file will contain the input data.
After executing the above code the output file will contain the lines which are all having length greater than 32 characters.

Why use an 11 line Perl script when you can use a 1 line shell script? I tested my method and it works perfectly. I also modified it so that it takes an arbitrary range of line lengths, as well as allowing the user to decide whether to write to the file:

#!/bin/bash
# Deletes lines of a certain range length from a file.
# Writing the result back to the file is optional.

write="no";
for opt in $*
do
    case $opt in
        -w ) write="yes";
             shift;;
        -* ) shift;;
        *  );;
    esac
done
if [ $write == "yes" ]
then
    sed -e "/^.\{$1,$2\}$/d" $3 > tempfile.txt;
    cat tempfile.txt > $3;
    rm tempfile.txt;
else
    sed -e "/^.\{$1,$2\}$/d" $3;
fi

thillai_selvan · February 26, 2010, 1:03am

I answered this for the original poster. It was not intended as a comment on your script.
Perl is also an efficient scripting language. That's why I wanted to share my approach with the questioner.

alister · February 26, 2010, 9:40am

Did you test it with filenames containing IFS characters? Assuming a default IFS value, your option handling, sed invocations, and cat statement will all barf if a filename contains whitespace.

for opt in $*

should be

for opt in "$@"

All instances of $3 need to be double-quoted.
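
A minimal sketch of why the quotes matter, assuming a default IFS. The helper count_args and the file name 'my file.txt' are made up for the illustration. The helper only reports how many arguments it received.

```shell
#!/bin/sh
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# Simulate a call such as: script 0 31 'my file.txt'
set -- 0 31 'my file.txt'

unquoted=$(count_args $3)    # word splitting turns one name into two words
quoted=$(count_args "$3")    # quoting keeps the name as a single word

echo "unquoted \$3 -> $unquoted argument(s)"
echo "quoted \"\$3\" -> $quoted argument(s)"
```

With the unquoted form, sed and cat would each be handed two file names, my and file.txt, neither of which is the intended file.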

Alister

Ultrix · February 26, 2010, 11:45am

Okay, I'll change that then.

EDIT: I fixed the script. It now looks like this:

#!/bin/bash
# deletes lines of a certain range length from a file
# Writing the result to the file is optional.

write="no";
for opt in "$@"
do
	case "$opt" in
		-w ) write="yes";
		     shift;;
		-* ) shift;;
		*  );;
	esac
done
least=$1;
great=$2;
shift;
shift;
filname="$*";
if [ $write == "yes" ]
then
	sed -e "/^.\{$least,$great\}$/d" "$filname" > tempfile.txt;
	cat tempfile.txt > "$filname";
	rm tempfile.txt;
else
	sed -e "/^.\{$least,$great\}$/d" "$filname";
fi
unset write

All I did was put quotation marks around the filename. I was also able to fix another script that I wrote a while back which was having the same problem.

Actually I've found I don't have to use "$@". It works whether I use that or $*. The quotation marks were the only problem.

alister · February 27, 2010, 9:21am

If you aren't using "$@" in that situation, then your script has a bug. There is no doubt about it.

Using $@ without quotes or $* with or without quotes will not expand to each individual command line argument (positional parameter in sh man page lingo). If unquoted, $@ and $* behave identically; they will expand to a list of words and then (this is the problem) each word (a positional parameter at this point) will be split according to the current setting of IFS (after splitting, the words may no longer correspond to the positional parameters). If you quote $*, you end up with one word containing all your positional parameters, regardless of how many parameters there are.

In that for loop, "$@" is the only correct option. If you don't believe me, try $* or $@ with a file name containing a space (assuming default IFS value) followed by one of your program's valid options, such as "infile -w". Even if the -w option isn't passed, such a filename will trigger it because "infile -w" will be split into two words, "infile" and "-w". That would be a bug. I realize that's a contrived and unlikely filename, but the point is that the option handling is behaving erroneously.

If you don't see it, read the sh man page carefully, with particular emphasis on the special parameters $@ and $*, quoting, and word splitting.

Here's some example code:

$ cat o.sh 
#!/bin/bash

printf '==================================================\n'
printf '$@: INCORRECT: Word splitting after $@ expansion yields 3 words.\n'
for opt in $@
do
        case "$opt" in
                *  )echo "$opt";;
        esac
done

printf '==================================================\n'
printf '$*: INCORRECT: Word splitting after $* expansion yields 3 words.\n'
for opt in $*
do
        case "$opt" in
                *  )echo "$opt";;
        esac
done

printf '==================================================\n'
printf '"$*": INCORRECT: Always expands to one word, regardless of $# positional parameter count.\n'
for opt in "$*"
do
        case "$opt" in
                *  )echo "$opt";;
        esac
done

printf '==================================================\n'
printf '"$@": CORRECT: Expands to one word per positional parameter without subsequent word splitting.\n'
for opt in "$@"
do
        case "$opt" in
                *  )echo "$opt";;
        esac
done

# Let's call the script with two positional parameters.
# Only "$@" will expand to the two correct words, while the others result in 1 or 3.

$ ./o.sh -v 'input -w'
==================================================
$@: INCORRECT: Word splitting after $@ expansion yields 3 words.
-v
input
-w
==================================================
$*: INCORRECT: Word splitting after $* expansion yields 3 words.
-v
input
-w
==================================================
"$*": INCORRECT: Always expands to one word, regardless of $# positional parameter count.
-v input -w
==================================================
"$@": CORRECT: Expands to one word per positional parameter without subsequent word splitting.
-v
input -w

I hope this helped.

Regards,
Alister