help: Awk to control number of characters per line

Hello all,

I have the following problem:

My input is two sorted files:

file1

>1_19_130_F3
T01220131330230213311013000000110000
>1_23_69_F3
T01200211300200200010000001000000
>1_24_124_F3
T010203113002002111111200002010

file2

>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9 
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4 7 4 4 
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4 4 12 10 4 5 

Now I want to create a file like file2, but in which each data line has as many fields as the corresponding line in file1 has digits:
Output:

>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9 
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4 
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4 

I'm pretty sure there must be an easy solution to this, but I can't figure it out yet. Do you have any idea how to do this with awk?

Thanks for your help,
Seb


Can you explain your request more clearly?

Now I want to create a file like file2, but in which each data line has as many fields as the corresponding line in file1 has digits:

Sure. My input is file1 and file2 above.

They are both sorted and have the same number of lines.

Now I want to create an output file that is the same as file2, except that each data line has only as many fields as the same line in file1 has digits.

Basically, if line "n" has 5 digits in file1:

T01210

I want to change the line "n" in file2 from:

26 23 54 4 22 4 6 3 6 8 66

to

26 23 54 4 22

By the way, in file1 line "n" always has a T as its first character, followed only by digits.

I hope this helps!

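 awk '# First file (file1): remember the length of every line ("T" plus digits).
      NR==FNR{T[NR]=length($1);next} 
      # Second file (file2): odd lines are headers, print them unchanged;
      # even lines keep the first T[FNR]-1 fields, one per digit.
      {if (FNR%2) {print $0} 
      else {{for (i=1;i<T[FNR];i++) printf "%s ",$i}; printf "\n"}
      }' file1 file2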

Wow, thanks for the quick response and your program.

However, I still have some problems. First, it seems to take a lot of memory (the input files are about 4 GB each), but I got it running by assigning enough memory to the job.
After that, however, it crashed with the message "Wrong placed ()."

Could that be caused by my submitting this job to an SGE cluster?

Thanks again,
Sebastian

You can split the files into smaller pieces first.
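
If you go that way, something like the following might do as a rough sketch. It assumes a `split` that supports `-l` and a system that has `mktemp -d`. The chunk size has to be even, otherwise a header line and its data line can end up in different pieces. The function name `trim_in_chunks` and its arguments are made up for illustration; the awk part is the two-file program posted earlier in this thread.

```shell
# trim_in_chunks FILE1 FILE2 LINES_PER_CHUNK OUTFILE   (hypothetical helper)
trim_in_chunks() {
  f1=$1 f2=$2 n=$3 out=$4
  # An odd chunk size would separate a header from its data line.
  [ $((n % 2)) -eq 0 ] || { echo "chunk size must be even" >&2; return 1; }
  tmp=$(mktemp -d) || return 1
  # Same line count for both files, so chunk k of file1 matches chunk k of file2.
  split -l "$n" "$f1" "$tmp/a."
  split -l "$n" "$f2" "$tmp/b."
  : > "$out"
  # The glob expands in sorted order, which is also the order split wrote the chunks.
  for a in "$tmp"/a.*; do
    b="$tmp/b.${a##*/a.}"
    awk 'NR==FNR{T[NR]=length($1);next}
         {if (FNR%2) {print $0}
         else {{for (i=1;i<T[FNR];i++) printf "%s ",$i}; printf "\n"}
         }' "$a" "$b" >> "$out"
  done
  rm -rf "$tmp"
}
```

For example, `trim_in_chunks file1 file2 1000000 out.txt` would keep only one million line lengths in memory at a time.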

If memory usage is a problem, you can try this:

awk '{
  getline line < "file1"
  gsub("[A-Za-z]","",line)
  n=length(line)
  $(n+1)="_"
  sub(" _.*","")
}1' file2
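
A slightly more defensive variant of the same line-by-line idea, as a sketch only: it leaves header lines alone, counts the digits directly via the return value of `gsub`, and stops if the control file runs out before the data file. The wrapper name `trim_fields` and the variable `ctl` are made up for this example, and it assumes an awk that knows `gsub` and `getline var < file` (nawk, gawk, mawk or a POSIX awk).

```shell
# trim_fields CONTROL DATA   (hypothetical wrapper; prints the trimmed DATA to stdout)
trim_fields() {
  awk -v ctl="$1" '
    # Read the matching line of the control file; give up if it is too short.
    { if ((getline line < ctl) <= 0) exit 1 }
    # Odd lines are headers: copy them unchanged.
    FNR % 2 { print; next }
    {
      n = gsub(/[0-9]/, "", line)        # gsub returns how many digits it removed
      if (n > NF) n = NF                 # never ask for more fields than exist
      out = (n > 0 ? $1 : "")
      for (i = 2; i <= n; i++) out = out " " $i
      print out
    }
  ' "$2"
}
```

Called as `trim_fields file1 file2 > output`, this holds only one line of each file in memory. Unlike the sample output above, it does not leave a trailing blank at the end of the data lines.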

Hi.

I think Franklin52's code is more compact than this. This post shows the results and compares with your desired results:

#!/usr/bin/env bash

# @(#) s1	Demonstrate adjust count of fields according to digit count.

# Uncomment to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
# Infrastructure details, environment, commands for forum posts. 
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
pe "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p specimen awk cmp diff
set -o nounset

FILE1=${1-data1}
shift
FILE2=${1-data2}

# Sample data files with head / tail if specimen fails.
pe
specimen $FILE1 $FILE2 \
|| { pe "(head/tail)"; head -n 5 $FILE1 $FILE2; pe " ||" ;\
     tail -n 5 $FILE1 $FILE2; }

pl " Results:"
awk -v f2="$FILE2" '
# first lines of pairs: print, skip
NR % 2 != 0 { print ; getline unused < f2 ; next }
            { digits = split($0,junk,"") - 1
              # print " Found", digits, " fields in line", NR
              getline < f2
              for (i = 1 ; i <= digits-1 ; i++) {
                printf("%s%s",$i,FS)
              }
              printf("%s",$digits)
              printf("\n")
            }
' $FILE1 |
tee t1

# Check results.

pl " Comparison with desired results:"
if cmp expected-output.txt t1
then
  pe " Passed -- files have same content."
else
  pe " Failed -- files differ -- details:"
  diff expected-output.txt t1
fi

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
specimen (local) 1.17
GNU Awk 3.1.5
cmp (GNU diffutils) 2.8.1
diff (GNU diffutils) 2.8.1

Whole: 5:0:5 of 6 lines in file "data1"
>1_19_130_F3
T01220131330230213311013000000110000
>1_23_69_F3
T01200211300200200010000001000000
>1_24_124_F3
T010203113002002111111200002010

Whole: 5:0:5 of 6 lines in file "data2"
>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9 
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4 7 4 4 
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4 4 12 10 4 5

-----
 Results:
>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4

-----
 Comparison with desired results:
 Passed -- files have same content.

The core awk script processes pairs of lines sequentially, one pair at a time. It does not keep any extra data in memory. The line from the "control" file is broken into single-character strings, but only the count is important. The main data file is read and that count of fields is written.
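
One portability note: `split($0,junk,"")` relies on splitting on an empty string, which POSIX leaves undefined, so it may not count characters in every awk. Below is a sketch of the same pairwise loop that uses `length` instead, assuming the control line is always one leading letter followed only by digits. The function name `trim_pairs` and the variable names are made up for this illustration.

```shell
# trim_pairs CONTROL DATA   (hypothetical wrapper; prints the result to stdout)
trim_pairs() {
  awk -v f2="$2" '
    # Advance the data file in lockstep with the control file.
    { if ((getline data < f2) <= 0) exit 1 }
    # First line of each pair: print the header, nothing else to do.
    NR % 2 { print; next }
    {
      digits = length($0) - 1            # everything after the leading "T"
      nf = split(data, fld, " ")         # split the data line on blanks
      if (digits > nf) digits = nf
      line = ""
      for (i = 1; i <= digits; i++) line = line fld[i] (i < digits ? " " : "")
      print line
    }
  ' "$1"
}
```

As in the script above, only the current pair of lines is held in memory, so the size of the input files should not matter.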

Best wishes ... cheers, drl

Thanks for your replies. Using

  awk 'NR==FNR{T[NR]=length($1);next} 
      {if (FNR%2) {print $0} 
      else {{for (i=1;i<T[FNR];i++) printf "%s ",$i}; printf "\n"}
      }' file1 file2

runs on small files, but the output is identical to file2.

Using

awk '{
  getline line < "file1"
  gsub("[A-Za-z]","",line)
  n=length(line)
  $(n+1)="_"
  sub(" _.*","")
}1' file2 

yields a syntax error at

n=length(line)

and

$(n+1)="_"

at the "=" signs.

Try nawk or /usr/xpg4/bin/awk on Solaris.
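
If the same job script has to run on different machines, a small sketch like this could pick a suitable awk first. It assumes a POSIX shell with `command -v`. The variable name `AWK` and the list of candidates are only an example; adjust them to what is installed on your cluster nodes.

```shell
# Pick the first newer-style awk that exists; fall back to plain awk.
AWK=awk
for candidate in /usr/xpg4/bin/awk nawk gawk; do
  if command -v "$candidate" >/dev/null 2>&1; then
    AWK=$candidate
    break
  fi
done

# Quick check that the chosen awk understands gsub(): should print 5.
echo T01210 | "$AWK" '{ gsub(/[A-Za-z]/, ""); print length($0) }'
```

After that, replace `awk` with `"$AWK"` in the commands above.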