Find matched pattern and print all based on certain conditions

Hi,

I am trying to extract data based on certain conditions. My sample input file as below:-

lnc-2:1	OnePiece	tra_law	500	688	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# g_line "2.711647"# KM "8.723820"#
lnc-2:1	OnePiece	room	500	510	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "1"# g_line "2.711647"#
lnc-2:1	OnePiece	room	540	588	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "2"# g_line "2.711647"#
lnc-2:1	OnePiece	room	620	650	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "3"# g_line "2.711647"#
lnc-2:1	OnePiece	room	660	688	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "4"# g_line "2.711647"#
lnc-1:3	OnePiece	tra_law	1	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# g_line "36.370155"# KM "117.008842"#
lnc-1:3	OnePiece	room	1	601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "1"# g_line "36.370155"#
lnc-1:3	OnePiece	room	1020	3001	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "2"# g_line "36.370155"#
lnc-1:3	OnePiece	room	3400	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "3"# g_line "36.370155"#
lnc-9:1	OnePiece	tra_law	1743	2314	1	.	.	g_id "R792.8419"# tra_law_id "R792.8419.1"# g_line "27.213287"# KM "87.549683"#
lnc-9:1	OnePiece	room	1743	2314	1	.	.	g_id "R792.8419"# tra_law_id "R792.8419.1"# room_number "1"# g_line "27.213287"#
lnc-16:4	OnePiece	tra_law	25408	63025	1	-	.	g_id "R792.8420"# tra_law_id "R792.8420.1"# g_line "357.721802"# KM "1150.850586"#
lnc-16:4	OnePiece	room	25408	25528	1	-	.	g_id "R792.8420"# tra_law_id "R792.8420.1"# room_number "1"# g_line "765.276733"#
lnc-16:4	OnePiece	room	62888	63025	1	-	.	g_id "R792.8420"# tra_law_id "R792.8420.1"# room_number "2"# g_line "0.372920"#

I want to get an output where when all conditions are met, it should print every lines with the same name in $1. The conditions as follows:-

1) "tra_law" is found in $3 && the results of $5 - $4 (of tra_law) is > 200. It should print all the following lines associated with it.
2) Then, it should check for the room number in last column, where only room_number with min of 3 counts will be taken into consideration.

The output should be like below:-

lnc-1:3	OnePiece	tra_law	1	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# g_line "36.370155"# KM "117.008842"#
lnc-1:3	OnePiece	room	1	601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "1"# g_line "36.370155"#
lnc-1:3	OnePiece	room	1020	3001	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "2"# g_line "36.370155"#
lnc-1:3	OnePiece	room	3400	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "3"# g_line "36.370155"#

as you can see, only lnc-1:3 met the conditions. for lnc-2:1, the tra_law value is less than 200 (688 - 500 = 180), therefore, it is omitted. As for lnc-9:1 and lnc-16:4, though the tra_law value > 200, both are omitted too as the room_number counts are less than 3.

I tried to use awk to work on it. My codes as below:-

awk -F"\t" 'NR>2 {$20=$5-$4; if ($20>200 && $3 ~/tra_law/) print $0}' inputfile | awk '{NF--NF};1' > outputfile

I got the results of the conditions no 1. But, it did not print the following lines associated to it. Also, I do not know how to check for condition no 2. I would prefer for the condition no 2 to be put in separate awk command as I might need to use them separately in different situation. Tried it many times but failed. Appreciate your kind help. Thanks.

Help me out: what does {NF--NF} do in your second awk script do? An uncommon construct, at least to me...

For your problem, how about

awk -F"\t" '
NR > 2           &&
($5-$4) > 200    &&
$3 ~ /tra_law/  {if (CNT > 3) print substr (OUT, 2)
                 SRC = $1
                 OUT = ""
                 CNT = 0
                }
$1 == SRC       {OUT = OUT ORS $0 
                 CNT ++
                }
END             {if (CNT > 3) print substr (OUT, 2)
                }

' file

lnc-1:3    OnePiece    tra_law    1    3601    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# g_line "36.370155"# KM "117.008842"#
lnc-1:3    OnePiece    room       1     601    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "1"# g_line "36.370155"#
lnc-1:3    OnePiece    room    1020    3001    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "2"# g_line "36.370155"#
lnc-1:3    OnePiece    room    3400    3601    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "3"# g_line "36.370155"#
1 Like

Hi RudyC,

Thanks so much...your codes work great on my real data.

If you notice, I created $20 to put in the total of $5 - $4. I need to remove $20 (last column) from being printed out in the final output. Therefore, i used {NF--NF} for that purpose. :smiley: It was just my silly way I guess.

Well, yes, I noticed the necessary removal of $20. So NF-- would remove it (if it is the last field). What is the trailing NF for?

And, why $20 when your entire file has only 9 fields per line consistently?
And - as I see now - why the NR>2 condition? You normally use this to eliminate e.g. headers. But your file seems to have valid data in lines 1 and 2?

-> I do have 2 lines header in my real data that need to be omitted. :slight_smile:

Hi,

I noticed that the code did not work for the below condition.

CL-AS1:4	OnePiece	tra_law	4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; g_line "6.980716"; KM "4.794062"; PM"4.235367";
CL-AS1:4	OnePiece	room		4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; room_number "1"; g_line "6.980716";
CL-AS1:4	OnePiece	tra_law	5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; g_line "4.355471"; KM "2.991154"; PM"2.642568";
CL-AS1:4	OnePiece	room		5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; room_number "1"; g_line "4.355471";
CL-AS1:4	OnePiece	tra_law	6800	24864	1	-	.	g_id "R06794.16436"; tra_law_id "R06794.16436.1"; g_line "5.995821"; KM "4.117677"; PM"3.637807";
CL-AS1:4	OnePiece	room		6800	7033	1	-	.	g_id "R06794.16436"; tra_law_id "R06794.16436.1"; room_number "1"; g_line "6.462393";
CL-AS1:4	OnePiece	room		24831	24864	1	-	.	g_id "R06794.16436"; tra_law_id "R06794.16436.1"; room_number "2"; g_line "2.784706";
CL-AS1:4	OnePiece	tra_law	8440	8785	1	.	.	g_id "R06794.16437"; tra_law_id "R06794.16437.1"; g_line "7.209587"; KM "4.951241"; PM"4.374228";
CL-AS1:4	OnePiece	room		8440	8785	1	.	.	g_id "R06794.16437"; tra_law_id "R06794.16437.1"; room_number "1"; g_line "7.209587";

I got the below output. By right, all the data above should be thrown out.

CL-AS1:4	OnePiece	tra_law	4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; g_line "6.980716"; KM "4.794062"; PM"4.235367";
CL-AS1:4	OnePiece	room		4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; room_number "1"; g_line "6.980716";
CL-AS1:4	OnePiece	tra_law	5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; g_line "4.355471"; KM "2.991154"; PM"2.642568";
CL-AS1:4	OnePiece	room		5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; room_number "1"; g_line "4.355471";

Tried to play around with the given codes but it didnt work. Need your kind help again. Thanks

Well, looks like any tra_law line should reset the counters, regardless of the $5 - $4 delta value. Try this small adaption.

awk -F"\t" '
NR > 2           &&
$3 ~ /tra_law/  {if ((CNT > 3) && (DELTA > 200)) print substr (OUT, 2)
                 DELTA = $5 - $4
                 SRC = $1
                 OUT = ""
                 CNT = 0
                }
$1 == SRC       {OUT = OUT ORS $0 
                 CNT ++
                }
END             {if  ((CNT > 3) && (DELTA > 200))  print substr (OUT, 2)
                }

' file

Your new file seems to have multiple <TAB> chars as field separators?

1 Like

Hi RudiC,

The multiple <TAB> chars were my typo mistakes. Sorry about that.

It worked now. Thanks a million. I am trying to understand your codes especially the "OUT" part. If you don't mind, can you pls help explain about it. thanks