awk to clean up input file, printing both fields

In the f1 file below, I am trying to clean it up by removing lines that have _tn_ in them. Next, I remove the characters in $2 before the ninth / . Then I remove the ID_ prefix and its digits (always 4). Finally, I remove the characters after and including the first _ . It is currently doing most of it, but the cut is removing $1, and I'm sure there is a better way. Thank you :).

f1

1112233  /xxxx/xxxx/xxxx/xxxx/yyy_yyyy_yy-yyyy-yyy-yyy_yyyy_yyyy_yyyy_yyyy_yyy_yyy_yyy_000_000/yyy/yyy/ID_1234_000000-Control_z_zzzz_zz_zz_zz_zz_zz_zzz_zz-zzzz-zzz-zzz_zzzz_zzzz_zzz_zzz_zzz_zzz_zzz.txt
1112231  /xxxx/xxxx/xxxx/xxxx/yyy_yyyy_yy-yyyy-yyy-yyy_yyyy_yyyy_yyyy_yyyy_yyy_yyy_yyy_000_000/yyy_tn_yyy/yyy/ID_1234_000000-Control_z_zzzz_zz_zz_zz_zz_zz_zzz_zz-zzzz-zzz-zzz_zzzz_zzzz_zzz_zzz_zzz_zzz_zzz.txt

current

000000-Control_z_zzzz_zz_zz_zz_zz_zz_zzz_zz-zzzz-zzz-zzz_zzzz_zzzz_zzz_zzz_zzz_zzz_zzz.txt

desired

1112231  000000-Control

sed '/_tn_/d' f1 | cut -d/ -f9 | awk '{ gsub(/ID_[0-9][0-9][0-9][0-9]_/, "", $2); print }' | cut -d_ -f1- > out

You can include field #1 in cut

sed '/_tn_/d' f1 | cut -d/ -f1,9
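
To see why this keeps the first column, here is a minimal sketch. It assumes a shortened sample line (the path names a..g are placeholders, not from f1) and one possible sed step to finish the cleanup; the variable names are only for illustration.

```shell
# Sample line modelled on f1, with shortened placeholder directory names (assumption).
line='1112233  /a/b/c/d/e/f/g/ID_1234_000000-Control_z_zz.txt'

# Split on "/": field 1 is "1112233  " and field 9 is the file name.
# cut joins the selected fields with the delimiter again.
kept=$(printf '%s\n' "$line" | cut -d/ -f1,9)
printf '%s\n' "$kept"

# One possible way to finish: drop "/ID_nnnn_", then everything from the first "_".
result=$(printf '%s\n' "$kept" | sed -E 's#/ID_[0-9]{4}_##; s/_.*//')
printf '%s\n' "$result"
```

The original pipeline lost $1 because -f9 alone selects only the text after the eighth / on each line.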

This example is wrong (thanks, Rudi). See two posts down for a revised version. I tried to use a regex to simplify the code, but it does not accomplish much.
Using the sample, this code

awk -F "[/ \-]" '{
               
                  printf("%s %s\n", $1, substr( $(15),1, index($(15),"_") -1 ) ) 
                  
               }' filename

Outputs:

1112233 Control
1112231 Control

Please check your post #1 for specification errors.
Try

awk '/_tn_/ {next} gsub ("^.*/|_.*$|ID_...._", "", $2)' file
1112233 000000-Control
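
One way to read the one-liner: the three alternatives in the gsub regex each strip one part of $2, and because gsub returns the number of substitutions (non-zero here), the pattern is true and awk prints the rebuilt line. A spelled-out sketch of the same idea, assuming shortened sample lines with placeholder path names:

```shell
# Two sample lines modelled on f1, with shortened placeholder path names (assumption).
input='1112233  /a/b/c/d/e/f/g/ID_1234_000000-Control_z_zz.txt
1112231  /a/b/c/d/e/f_tn_f/g/ID_1234_000000-Control_z_zz.txt'

out=$(printf '%s\n' "$input" | awk '
/_tn_/ { next }                 # drop lines containing _tn_
{
    sub(/^.*\//, "", $2)        # remove everything up to the last "/"
    sub(/^ID_...._/, "", $2)    # remove the ID_ prefix and its four characters
    sub(/_.*$/, "", $2)         # remove from the first remaining "_" to the end
    print                       # $0 was rebuilt with a single space between fields
}')
printf '%s\n' "$out"
```

Modifying $2 makes awk rebuild $0 with the default output separator, which is why the two spaces between the columns become one.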

Corrected version

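awk -F '[/ -]' '
!/_tn_/ {                                  # skip lines containing _tn_
    tmp = $14                              # e.g. ID_1234_000000
    gsub(/[A-Z]{2}_[0-9]{4}_/, "", tmp)    # drop the ID_nnnn_ prefix
    # $15 starts with "Control_..."; keep the part before the first "_"
    printf("%s %s-%s\n", $1, tmp, substr($15, 1, index($15, "_") - 1))
}' filename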
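
The fixed field numbers above depend on how many /, space and - characters come before the file name. A sketch that does not rely on that count, assuming the file name is always the last /-separated part of $2 (the sample lines, the 5678/Sample values and the array name part are hypothetical):

```shell
# Sample lines with different path depths; names are placeholders (assumption).
input='1112233  /a/b/ID_1234_000000-Control_z_zz.txt
1112231  /a/b_tn_b/ID_1234_000000-Control_z_zz.txt
1112235  /a/b/c/d-e/ID_5678_111111-Sample_z.txt'

out=$(printf '%s\n' "$input" | awk '
!/_tn_/ {
    n = split($2, part, "/")                     # part[n] is the file name
    name = part[n]
    sub(/^ID_[0-9][0-9][0-9][0-9]_/, "", name)   # drop ID_ plus four digits
    sub(/_.*/, "", name)                         # drop from the first "_" on
    print $1, name
}')
printf '%s\n' "$out"
```

Spelling out the four digits instead of using {4} avoids interval expressions, which some older awk implementations do not support.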

Thank you all :)

Moderator comments were removed during original forum migration.