DerSeb
June 10, 2010, 8:20am
1
Hello all,
I have the following problem:
My input is two sorted files:
file1
>1_19_130_F3
T01220131330230213311013000000110000
>1_23_69_F3
T01200211300200200010000001000000
>1_24_124_F3
T010203113002002111111200002010
file2
>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4 7 4 4
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4 4 12 10 4 5
Now I want to create a file like file2, but with each data line trimmed to as many fields as the corresponding line in file1 has digits:
Output:
>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4
I'm pretty sure there must be an easy solution to this, but I can't figure it out yet. Do you have any idea how to do this with awk?
Thanks for your help,
Seb
edit: typo
Can you explain your request more clearly?
Now I want to create a file like file2, but with each data line trimmed to as many fields as the corresponding line in file1 has digits:
DerSeb
June 10, 2010, 8:50am
3
Sure. My input is file1 and file2 above.
They are both sorted and have the same number of lines.
Now I want to create an output file that is the same as file2, except that each line has only as many fields as the same line in file1 has digits.
Basically, if line "n" has 5 digits in file1:
T01210
I want to change the line "n" in file2 from:
26 23 54 4 22 4 6 3 6 8 66
to
26 23 54 4 22
By the way, in file1 line "n" always has a T as its first character, followed only by digits.
I hope this helps!
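To make the rule concrete, here is a minimal sketch for a single pair of lines. The function name truncate_fields is made up for illustration, and it assumes the control string is exactly one leading T followed by digits:

```shell
# Hypothetical helper: keep only as many fields of "quals" as "control" has digits.
truncate_fields() {
    control=$1   # e.g. T01210 (assumed: one leading T, then digits only)
    quals=$2     # e.g. "26 23 54 4 22 4 6 3 6 8 66"
    printf '%s\n' "$quals" | awk -v ctl="$control" '{
        n = length(ctl) - 1          # digit count = length minus the leading T
        if (n > NF) n = NF           # never ask for more fields than exist
        out = ""
        for (i = 1; i <= n; i++) out = out (i > 1 ? " " : "") $i
        print out
    }'
}

truncate_fields T01210 "26 23 54 4 22 4 6 3 6 8 66"
```

With the example values from this post, this should print the first five values only.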
awk 'NR==FNR{T[NR]=length($1);next}
{if (FNR%2) {print $0}
else {{for (i=1;i<T[FNR];i++) printf "%s ",$i}; printf "\n"}
}' file1 file2
DerSeb
June 11, 2010, 8:28am
5
Wow, thanks for the quick response and your program.
However, I still have some problems. First, it seems to take a lot of memory (the input files are about 4 GB each), but I got it running by assigning enough memory to the job.
Then it crashed with the message "Wrong placed ()."
Could that be caused by submitting this task to an SGE cluster?
Thanks again!
Sebastian
You can split the files into smaller pieces first.
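A rough sketch of that splitting step, assuming both files have the same number of lines and that each record is exactly two lines (header plus data). The function name split_pairs and the ".part." suffix are only illustrative:

```shell
# Sketch: split every given file into chunks of the same line count.
# The count must be even so a header line is never separated from its data line,
# and identical for both files so chunk k of file1 lines up with chunk k of file2.
split_pairs() {
    lines=$1; shift
    if [ $((lines % 2)) -ne 0 ]; then
        echo "line count must be even" >&2
        return 1
    fi
    for f in "$@"; do
        split -l "$lines" "$f" "$f.part."   # produces f.part.aa, f.part.ab, ...
    done
}
```

It could be called as split_pairs 2000000 file1 file2, after which each pair of matching chunks can be processed separately and the results concatenated in order.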
If you have problems with memory usage, you can try this:
awk '{
getline line < "file1"
gsub("[A-Za-z]","",line)
n=length(line)
$(n+1)="_"
sub(" _.*","")
}1' file2
drl
June 12, 2010, 7:19am
8
Hi.
I think Franklin52's code is more compact than this. This post shows the results and compares with your desired results:
#!/usr/bin/env bash
# @(#) s1 Demonstrate adjust count of fields according to digit count.
# Uncomment to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
# Infrastructure details, environment, commands for forum posts.
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
pe "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p specimen awk cmp diff
set -o nounset
FILE1=${1-data1}
shift
FILE2=${1-data2}
# Sample data files with head / tail if specimen fails.
pe
specimen $FILE1 $FILE2 \
|| { pe "(head/tail)"; head -n 5 $FILE1 $FILE2; pe " ||" ;\
tail -n 5 $FILE1 $FILE2; }
pl " Results:"
awk -v f2="$FILE2" '
# first lines of pairs: print, skip
NR % 2 != 0 { print ; getline unused < f2 ; next }
{
  digits = split($0,junk,"") - 1
  # print " Found", digits, " fields in line", NR
  getline < f2
  for (i = 1 ; i <= digits-1 ; i++) {
    printf("%s%s",$i,FS)
  }
  printf("%s",$digits)
  printf("\n")
}
' $FILE1 |
tee t1
# Check results.
pl " Comparison with desired results:"
if cmp expected-output.txt t1
then
pe " Passed -- files have same content."
else
pe " Failed -- files differ -- details:"
diff expected-output.txt t1
fi
exit 0
producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0
GNU bash 3.2.39
specimen (local) 1.17
GNU Awk 3.1.5
cmp (GNU diffutils) 2.8.1
diff (GNU diffutils) 2.8.1
Whole: 5:0:5 of 6 lines in file "data1"
>1_19_130_F3
T01220131330230213311013000000110000
>1_23_69_F3
T01200211300200200010000001000000
>1_24_124_F3
T010203113002002111111200002010
Whole: 5:0:5 of 6 lines in file "data2"
>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4 7 4 4
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4 4 12 10 4 5
-----
Results:
>1_19_130_F3
24 18 9 18 23 4 11 4 5 9 5 8 15 20 4 4 7 4 7 4 4 4 4 4 4 4 4 7 4 4 4 4 8 4 9
>1_23_69_F3
26 4 7 4 4 17 5 23 4 4 5 6 5 5 4 4 4 5 4 4 4 4 4 4 4 4 4 8 4 4 4 4
>1_24_124_F3
32 27 24 18 29 22 23 17 18 19 24 19 15 29 12 9 16 6 26 4 4 4 4 4 4 4 4 7 4 4
-----
Comparison with desired results:
Passed -- files have same content.
The core awk script processes pairs of lines sequentially, one pair at a time. It does not keep any extra data in memory. The line from the "control" file is broken into single-character strings, but only the count is important. The main data file is read and that count of fields is written.
Best wishes ... cheers, drl
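The pair-at-a-time idea described above can also be sketched with paste doing the line matching, so awk never reads a second file itself. This is only an illustration, not the script from this post: the function name trim_by_control is invented, and it assumes neither file contains tab characters and that headers sit on odd-numbered lines.

```shell
# Sketch: paste joins line k of file1 and line k of file2 with a tab;
# awk then handles one joined line at a time, so memory use stays constant.
trim_by_control() {
    paste "$1" "$2" | awk -F '\t' '
        NR % 2 == 1 { print $2; next }       # header line: copy it from file2
        {
            digits = length($1) - 1          # file1 part: leading T plus digits
            n = split($2, q, " ")            # file2 part: space-separated values
            if (digits > n) digits = n       # do not run past the available values
            out = ""
            for (i = 1; i <= digits; i++) out = out (i > 1 ? " " : "") q[i]
            print out
        }'
}
```

It would be run as trim_by_control file1 file2 > output. Using length() rather than split with an empty separator avoids relying on behaviour that not every awk implements.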
DerSeb
June 12, 2010, 8:20am
9
Thanks for your replies. Using
awk 'NR==FNR{T[NR]=length($1);next}
{if (FNR%2) {print $0}
else {{for (i=1;i<T[FNR];i++) printf "%s ",$i}; printf "\n"}
}' file1 file2
with small files works, but the output is identical to file2.
Using
awk '{
getline line < "file1"
gsub("[A-Za-z]","",line)
n=length(line)
$(n+1)="_"
sub(" _.*","")
}1' file2
yields a syntax error at
n=length(line)
and
$(n+1)="_"
at the "=" signs.
Try nawk or /usr/xpg4/bin/awk on Solaris.
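One way to apply that advice in a script is to probe for a suitable awk first. A small sketch, where the function name pick_awk and the order of the candidates are my own assumptions:

```shell
# Sketch: print the first awk from the list that exists on this system.
# /usr/xpg4/bin/awk and nawk are the Solaris alternatives mentioned above;
# plain awk is the fallback everywhere else.
pick_awk() {
    for candidate in /usr/xpg4/bin/awk nawk awk; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

AWK=$(pick_awk) || { echo "no awk found" >&2; exit 1; }
# The same kind of statement that failed above: assignment from length().
"$AWK" 'BEGIN { n = length("01210"); print n }'
```

The last line is just a smoke test; the real script would be run as "$AWK" '...' file2 in the same way.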