Help with awk script to get missing numbers in column 1

Ophiuchus · November 5, 2013, 12:25pm

Hello to all,

I have show below a file separated by commas. In first column has numbers where the last number is 13.

1,4
2,6
3,7
5,2
6,5
7,5
8,65
9,10
11,78
13,2

What I want to know is which numbers are missing from 1 to 13 (in this case 13 is last number in column 1). My real file
has more than 5 million lines.

The output expected is as below:

4
10
12

Thanks in advance for any help.

Akshay_Hegde · November 5, 2013, 12:42pm

Try

$ awk -F, '{while(++x<$1)print x}' file
4
10
12

in2nix4life · November 5, 2013, 1:15pm

More awk... More awk please...

awk -F, '$1>a+1{print NR}{a=$1}' file
4
9
10

Akshay_Hegde · November 5, 2013, 1:18pm

9 is there in file

Ophiuchus · November 5, 2013, 1:25pm

Hello Akshay Hegde,

Thank you for your help.

It works with sample file but is not working with real file.

Real file has 5,440,177 lines and the last number in column 1 is 5440255. So, substracting both there are 78 numbers that are missing in column 1.
But trying your script I get more than 33 million of lines and I stopped since it seems enters in an infinite loop.

Is there a way to preload an array from 1 to N (N=13 in this case, in real file N=5440255)? in order to compare array which values from column 1 are not in array?

Thanks again

in2nix4life · November 5, 2013, 1:30pm

Oops, thanks Akshay. Here's the remix:

awk '($1!=a+1){print a+1}{a=$1}' file
4
10
12

Ophiuchus · November 5, 2013, 1:50pm

Hello in2nix4life,

Thanks for the help.

I've tried your script, the expected number of missing values in real file are 78 but is printing 108 numbers.

Is there a way to compare column 1 with a preloaded array that contains elements from 1 to N? in order to print values of array that are not present in column1?

Thanks again for the help.

Akshay_Hegde · November 5, 2013, 1:53pm

@ Ophiuchus I didn't get you...It should work if your real file is same like sample file

for example if I take sample file it's giving output..it would be better if you show us real input

$ cat file 
1,4
25,6
50,7
75,2
100,2

$ awk -F, '{while(++x<$1)print x}' file
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99

rdrtx1 · November 5, 2013, 2:09pm

try also:

awk -F, '{for (i=a+1; i<$1; i++) print i; a=$1}' file

Ophiuchus · November 5, 2013, 2:46pm

Hello Akshay Hegde,

I'm attaching the real column1, without any more columns and you can se that is being printed more lines than expected.

Hello rdrtx1,

It doesn't work with real attached file.

Thanks in advance for the help.

rdrtx1 · November 5, 2013, 3:19pm

in file.txt attached there are bad records, example:

 
(line #, value)
916084 62953736678539288
1092645 75087122368954400
1637717 7034045921763328
1722185 118349468948103173
3090566 13274083647037441
3321967 14267956259201025
4216020 289727116554141730
4807811 330395302886506496
4860863 334041077285781539
5051127 347116125965910039

Ophiuchus · November 5, 2013, 4:29pm

Hello rdrtx1,

Yes, there are those wrong values, then due to that I'd like to pre load an array of N consecutive elements to compare with column 1 to print only those that are missing, but I don't know how to pre load an array in that way.

rdrtx1 · November 5, 2013, 5:36pm

try something like:

awk -F, 'length($1)<8{for (i=a+1; i<$1; i++) print i; a=$1}' file.txt

Ophiuchus · November 5, 2013, 6:12pm

Hello rdrtx1,

Thanks for the help. It seems to work fine. Is giving me 93 values and the counts it seems to be fine since:
NL = Number of lines = 5440172
LN = Last number in column 1 = 5440255
Wl = Wrong lines = 10

Then, (LN-NL)+Wl=83 + 10 = 93.

PS: May you explain me the logic of your program please.

Many thanks for your help!

rdrtx1 · November 5, 2013, 6:29pm

Since the first example was coma delimeted then -F, (field delimiter) was used but not needed for the file posted.

length($1)<8 use only records that have field 1 length < 8

for (i=a+1; i<$1; i++) print i; a=$1 for value of a + 1 (stored from last record) to value of first field print the value of i list; store field 1 in a variable

Ophiuchus · November 5, 2013, 7:40pm

Thank you, I think I get the idea :)

Akshay_Hegde · November 5, 2013, 11:16pm

Try this since you have not supplied real input and not even mentioned that $1 length should not exceed more than 7. You were getting wrong result, it does not mean that it enters in an infinite loop. And in #1 you shown that your input is comma separated, but in real input it's not.

Missing and Count is shown below, change print x,++n to print x once test is done

$ awk  'length($1)<8{while(++x<$1)print x,++n}' file.txt 
65330 1
130866 2
196402 3
261938 4
327474 5
393010 6
458546 7
524082 8
589618 9
655154 10
720690 11
786226 12
851762 13
916097 14
917298 15
982834 16
1048370 17
1092661 18
1113906 19
1179442 20
1244978 21
1310514 22
1376050 23
1441586 24
1507122 25
1572658 26
1637741 27
1638194 28
1703730 29
1722211 30
1769266 31
1834802 32
1900338 33
1965874 34
2031410 35
2096946 36
2162482 37
2228018 38
2293554 39
2359090 40
2424626 41
2490162 42
2555698 43
2621234 44
2686770 45
2752306 46
2817842 47
2883378 48
2948914 49
3014450 50
3079986 51
3090613 52
3145522 53
3211058 54
3276594 55
3322017 56
3342130 57
3407666 58
3473202 59
3538738 60
3604274 61
3669810 62
3735346 63
3800882 64
3866418 65
3931954 66
3997490 67
4063026 68
4128562 69
4194098 70
4216084 71
4259634 72
4325170 73
4390706 74
4456242 75
4521778 76
4587314 77
4652850 78
4718386 79
4783922 80
4807884 81
4849458 82
4860937 83
4914994 84
4980530 85
5046066 86
5051204 87
5111602 88
5177138 89
5242674 90
5308210 91
5373746 92
5439282 93

Ophiuchus · November 5, 2013, 11:49pm

Hello Akshay,

Thanks for your help. I provided a simple sample since the logic should work for a small sample and in general. The handling of length of 7 for column1 was introduce by rdrtx1 since he found 10 wrong records that I didn't know about their existence.

The real file is comma delimited, I only upload the first column since is to big with more columns and the script would be the same only needed to remove the field separator.

Your last code it seems to work fine with real file now and is great the addition of count.

Many thanks