awk remove/grab lines from file with pattern from other file

Sorry for the weird title, but I have the following problem.

We have several files with between 10,000 and roughly 500,000 lines each. From these files we want to remove the lines that contain a pattern listed in another file (around 20,000 lines, all EAN codes). We also want the removed lines in a separate file so we can check whether lines are removed that shouldn't be (this has nothing to do with the matching itself).

pattern file:

0018208944262
4016432428011
7290006780829
4021121468858
5025232434084
4021121338540
4021121435638

main file

0018208944262;A 562381;VNA750E1;50;4999.14;Nikon
4242004181811;A 582194;B55CR22N0;2;939.46;Neff
4242004181439;A 582193;B45CS24N0;1;895.04;Neff
4716123314882;A 552806;NH-L9A;0;39.90;Noctua
4716123314875;A 548120;NH-L9I;1;39.01;Noctua

With both of the above I should get one file that only has "0018208944262;A 562381;VNA750E1;50;4999.14;Nikon" in it and another file with the rest.

I tried the following awk code

awk -F ';' 'NR==FNR {id[$1]; next} $1 in id' filter.csv main.csv

but it only prints the matching lines; it does not remove them from the main file or put them in another file. I also tried grep, but that only works when the filter file has around 100 lines or so.

Does anyone know how I can get those two result files as described above?

Hello SDohmen,

Could you please try the following and let me know if it helps:

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' pattern_Input_file FS=";" main_Input_file

The output will be as follows:

0018208944262;A 562381;VNA750E1;50;4999.14;Nikon

Thanks,
R. Singh

How about

awk -F ';' 'NR==FNR {id[$1]; next} $1 in id {print > "Positive"; next} {print > "Negative"}' file1 file2

Hi,

thank you for the quick reply, but it does not seem to work. The pattern lines are still in the output and not in a separate file.

This one seems to create no file at all; I get no output.

Hello SDohmen,

You could then redirect the output to a file as follows:

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' pattern_file FS=";" main_file  >  Output_file

EDIT: In case you need output files for both matches and non-matches, the following may help:

 awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "Output_match_found_file"} !($1 in A){print >> "Output_match_NOT_found_file"}' pattern_file FS=";" main_file
 

Thanks,
R. Singh


I know I can redirect the output, but it still did not remove the lines from the main file. Sorry if that was unclear before.

Omg, you are fast with editing :D. I tested the revised code and it seems to work fine. I now have two files with different output in each. I will test it some more with other files to be sure, but it seems to work. Thank you again for the extremely fast help.

Hello SDohmen,

Here is exactly what you may be looking for.

cat script.ksh
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "Output_match_found_file"} !($1 in A){print >> "Output_match_NOT_found_file"}' pattern_file FS=";" main_file
if [[ $? == 0 ]]
then
        mv  main_file main_file_Original
        mv  Output_match_NOT_found_file main_file
else
        echo "Please check there seems to be an issue with awk command."
fi
 

The above code will create a backup of main_file named main_file_Original and remove the non-matching lines from main_file too; let me know if this helps you.
EDIT: You could also try the following. Here, if the first awk statement fails, the second statement renaming the Input_file will of course not be executed.

 awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "Output_match_found_file"} !($1 in A){print >> "Output_match_NOT_found_file"}' pattern_file FS=";" match_file && mv Output_match_NOT_found_file match_file
 

Thanks,
R. Singh

I just tested the code as follows (adapted to my environment):

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> $PAD/raw4.csv} !($1 in A){print >> $PAD/removed.csv}' $VOEG/niet_gebruiken_ean.csv FS=";" $PAD/raw3.csv

but it does not seem to create the files. The error I am getting is:

awk: 1: unexpected character '.'
awk: 1: unexpected character '.'
awk: cannot open /srv/prijslijst/lev/raw4.csv (No such file or directory)

I tried changing the $PAD variable to the directory itself, but it does not change the output. From the shell itself it works fine.

Hello SDohmen,

In awk, variables are not assigned the way shell variables are; shell variables have to be passed in explicitly. Could you please try the following and let me know if it helps:

 awk -vpad="$PAD" -vvoeg="$VOEG" 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> pad"/raw4.csv"} !($1 in A){print >> pad"/removed.csv"}'  $VOEG"/niet_gebruiken_ean.csv" FS=";"  $PAD"/raw3.csv"

Thanks,
R. Singh
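The point about `-v` can be seen in isolation with a minimal, self-contained sketch (directory and file names illustrative): the `-v` assignment happens before the program runs, so the shell value becomes an ordinary awk variable that can be concatenated into a redirection target.

```shell
# A shell variable passed via -v becomes a plain awk variable;
# concatenate it with a string literal to build the output path.
OUTDIR=$(mktemp -d)                      # illustrative directory
echo "a;b;c" | awk -v dir="$OUTDIR" -F ';' '{print $1 > (dir "/first.txt")}'
cat "$OUTDIR/first.txt"                  # prints: a
```

Note the parentheses around the redirection target: without them, the concatenation after `>` is ambiguous in some awk implementations.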

Did you look for two files called "Positive" and "Negative" in your pwd?


I think I figured it out. The code above did not create the files either, so I googled a bit and found that the changed code below works:

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "'"$PAD/raw4.csv"'"} !($1 in A){print >> "'"$PAD/removed.csv"'"}' $VOEG/niet_gebruiken_ean.csv FS=";" $PAD/raw3.csv

Thank you for helping with the code.

I overlooked those two because I thought they were text inline with the files :D. Later on I noticed it created two files named Positive and Negative. Sorry for overlooking that :slight_smile:

---------- Post updated 11-02-16 at 09:47 AM ---------- Previous update was 10-02-16 at 03:16 PM ----------

Here I am again. I tested the code with one of the things I needed to filter and it works just fine.

Now I am struggling with the other filter, though. Same as before, I need to filter lines from one file into two others. The difference here is that the filter file contains words, some of which have spaces, like below:

All In   One PC       
Asus PC       
Bandsaege       
CI Module       
Desktop PC   

And the other problem is that these words can be anywhere in the line. I tried using the awk line from before and letting it run on each column separately, but it seems to match each word separately, creating false negatives. Any idea how best to solve this?

I know I can do this with grep, but it is awfully slow and has problems with too many pattern lines.
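For what it's worth, much of grep's slowness here may come from treating every filter line as a regular expression; with `-F` (fixed strings) combined with `-f` (read patterns from a file) it usually copes with thousands of patterns. A sketch with illustrative file names:

```shell
# -F: literal fixed-string matching, -f: read the patterns from a file.
grep -Ff word_filter.txt main.txt  > removed_lines.txt   # lines containing a filter phrase
grep -vFf word_filter.txt main.txt > kept_lines.txt      # everything else (-v inverts)
```

This does require two passes over the main file, unlike the single-pass awk variants in this thread.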

---------- Post updated at 09:57 AM ---------- Previous update was at 09:47 AM ----------

Sorry, I forgot to add some samples.

Main file

37760   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Orange Sirup 500ml   3.1682242990654 EUR     7.00    SodaStream      7290002793335   1020103490      >10     4.99    0.699   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37761   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Zitrone-Limette Sirup 500ml  2.5046728971963 EUR     7.00    SodaStream      7290002793328   1020110490      0       4.99    0.600   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37762   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Apfel-Mix Sirup 500ml        3.5046728971963 EUR     7.00    SodaStream      7290002793229   1020108491      3       4.99    0.600   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37765   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Isotonic Sirup 375ml 3.7289719626168 EUR     7.00    SodaStream      7290010498574   5140013 >10     4.99    0.400   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37773   Haushalt & Küche > Elektro Kleingeräte > Eierkocher     Gastroback 42801 Design Eierkocher Silber       33      EUR     19.00   Gastroback      4016432428011   0580178 1       0       1.000   0       Haushalt & Kueche > Elektro Kleingeraete > Eierkocher
54164           Logitech R400 Wireless Presenter        29.327731092437 EUR     19.00   Logitech        5099206018129   910-001357      2       4.99    0.210   0
68132   Computer & Zubehör > Eingabegeräte > Mäuse      Logitech MK710 Wireless Desktop 64.621848739496 EUR     19.00   Logitech        5099206020948   920-002420      10      4.99    1.390   0       Computer & Zubehoer > Eingabegeraete > Maeuse

This main file differs in most cases, but that is mainly the delimiter, which I should be able to handle.

filter file

4K Fernseher
Acer Aspire
Acer PC
Acer Veriton
All In One PC
Asus PC
Bandsaege
Wireless Desktop

With the above samples it should filter the last line from the main file (and put it in a separate file) because of the phrase "Wireless Desktop", but not the line above it, which only contains "Wireless".

I hope this makes it clearer.

Not sure how to distinguish this new problem from the other. Does it come on top of it, or in its place? Some decent samples would help.

I am not 100% sure what you mean, but I added sample data.

In total I have two filters which run over different stages of the same file.

The word filter runs first, as that is when I still have all the data to look for those words. After that the file is cut and processed into a near-finished file, which is then filtered against the EAN filter to remove the last lines that aren't needed.

Technically it would be fine if they were in one filter file (with a loop at the beginning), but that would probably make the code really complicated, I guess.

This might solve your new problem:

awk -F ';' '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if ($0 ~ SP)    {print > "Positive"    
                                                 next
                                                }
                }
                {print > "Negative"    
                }
' file1 file2
cf *ive
Negative:
37760   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Orange Sirup 500ml   3.1682242990654 EUR     7.00    SodaStream      7290002793335   1020103490      >10     4.99    0.699   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37761   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Zitrone-Limette Sirup 500ml  2.5046728971963 EUR     7.00    SodaStream      7290002793328   1020110490      0       4.99    0.600   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37762   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Apfel-Mix Sirup 500ml        3.5046728971963 EUR     7.00    SodaStream      7290002793229   1020108491      3       4.99    0.600   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37765   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Isotonic Sirup 375ml 3.7289719626168 EUR     7.00    SodaStream      7290010498574   5140013 >10     4.99    0.400   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37773   Haushalt & Küche > Elektro Kleingeräte > Eierkocher     Gastroback 42801 Design Eierkocher Silber       33      EUR     19.00   Gastroback      4016432428011   0580178 1       0       1.000   0       Haushalt & Kueche > Elektro Kleingeraete > Eierkocher
54164           Logitech R400 Wireless Presenter        29.327731092437 EUR     19.00   Logitech        5099206018129   910-001357      2       4.99    0.210   0
Positive:
68132   Computer & Zubehör > Eingabegeräte > Mäuse      Logitech MK710 Wireless Desktop 64.621848739496 EUR     19.00   Logitech        5099206020948   920-002420      10      4.99    1.390   0       Computer & Zubehoer > Eingabegeraete > Maeuse
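One caveat with the `$0 ~ SP` test above: each filter phrase is interpreted as a regular expression, so entries containing characters such as `(`, `+`, or `*` can misfire. awk's `index()` performs a literal substring test instead; a sketch along the same lines (file names illustrative):

```shell
awk '
NR == FNR { words[$0]; next }          # first file: filter phrases
{
    for (w in words)
        if (index($0, w)) {            # literal substring match, no regex
            print > "Positive"; next
        }
    print > "Negative"
}
' filter.txt main.txt
```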

I am not sure what I am doing wrong with your code, but it does not seem to work like your example.

awk -F '\t' '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if ($0 ~ SP)
                  {print > "'"$PAD/removed_woord.csv"'" #positive
                    next
                  }
                }
                {print > "'"$PAD/raw1.csv"'" #negative
                }
' filters/woord_COM.csv $PAD/raw.csv

I get the positive file completely filled with all lines, even though it should only have several in there, while the negative file does not get a single line. I added the "positive" and "negative" comments in the code above to show which is which.

I just noticed I had two small errors in my copied code. I removed them, but the same problem is still present: the positive file with the matches contains all lines and the negative one is not even created.

---------- Post updated at 04:32 PM ---------- Previous update was at 12:59 PM ----------

I did some more fiddling and I think I know where the error is and how to solve it.

I took the original code:

awk -F ';' '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if ($0 ~ SP)    {print > "Positive"
                                                 next
                                                }
                }
                {print > "Negative"
                }
' file1 file2

The only thing I changed was the names of file1 and file2. To be sure, I used the example data I provided earlier. When I executed it, the code worked fine and created both files.

Then I changed file1 to the original filter file. Now it did not work anymore: I only received the Positive file, with all lines. To be sure I also tested it the other way around, and that worked fine. So the problem was in the filter file. After some thinking I remembered that I had copied the data straight from an Excel file, and using Notepad as an intermediate step solved the problem.

So with a new filter file it works fine. Thank you for helping.

The desired two-filter filtering can be done in one go. Does your data file have space field separators or e.g. <TAB> separators?

That is the annoying part. Some actually have TAB and some have semicolons, etc. What is strange, though, is that for another file I was testing I tried setting the delimiter to ";", but it just ignored the whole code and threw everything into the positive file. After I changed the file to TAB and also made the delimiter TAB, it worked just fine.

It is a bit annoying, but I can work around it, at least until I actually start to understand the code some more. I would even go as far as saying that the current code could probably be done in about half the lines, but at least I understand what it does at the moment :smiley:

I am always open to updated code if you happen to know a better way, but it is no biggie to do it in two runs :).
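Regarding the varying delimiters, one option is to probe the first line of each file instead of hard-coding the separator per supplier; a sketch assuming the only candidates are semicolon and TAB (file name illustrative):

```shell
file="main.csv"                              # illustrative name
# Use ";" if the first line contains one, otherwise fall back to TAB.
if head -n1 "$file" | grep -q ';'; then SEP=';'; else SEP=$'\t'; fi
awk -F "$SEP" '{print $1}' "$file"
```

Note that `$'\t'` is bash/ksh syntax; a plain POSIX sh would need a literal tab instead.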

If you provide samples that match the requirements (word filter, EAN filter, and data) and post the expected results, we can give it a go.

Well, to make it actually understandable I will post the whole file, although it has been shortened a bit.

## Download the new list
curl -o $PAD/pricing.csv http://website/pricing.csv

## Backup original file
cp $PAD/pricing.csv $ARCHIEF/origineel.$TIJDDATUM.csv

## Remove all lines with 7.00 (7% VAT) in column 6
awk -F"\t" '$6 != "7.00"' $PAD/pricing.csv > $PAD/raw.csv

## Put all lines that match the word filter file into a separate file and the rest into raw1 for further processing
awk -F"\t" '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if ($0 ~ SP)    {print > "'"$PAD/removed_woord.csv"'"
                                                 next
                                                }
                }
                {print > "'"$PAD/raw1.csv"'"
                }
' $VOEG/filters/woord_COM.csv $PAD/raw.csv

## Remove all columns except the ones we need.
cut -f1,4,7-10 $PAD/raw1.csv > $PAD/raw2.csv

## Put all columns in the correct order and use TAB as delimiter
awk 'BEGIN { FS="\t"; OFS="\t"; } {print $4,$1,$5,$6,$2,$3}' $PAD/raw2.csv > $PAD/raw3.csv

## Remove all lines with a blank first column
awk '!/^\t/' $PAD/raw3.csv > $PAD/raw4.csv

## Remove all > in the complete file
sed 's/>//g' $PAD/raw4.csv > $PAD/raw5.csv

## Remove the first line
awk '{if (NR!=1) {print}}' $PAD/raw5.csv > $PAD/raw6.csv

## Limit the number of digits after the decimal point (only needed for column 5)
awk 'BEGIN { FS="\t"; OFS="\t"; } {printf("%s\t%s\t%s\t%s\t%.2f\t%s\n",$1,$2,$3,$4,$5,$6)}' $PAD/raw6.csv > $PAD/raw7.csv

## Make the delimiter ;
awk 'BEGIN { FS="\t"; OFS=";"; } {print $1,$2,$3,$4,$5,$6}' $PAD/raw7.csv > $PAD/raw8.csv

## Put all lines that match the EAN filter file into a separate file and the rest into raw9 for further processing
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "'"$PAD/removed_EAN.csv"'"} !($1 in A){print >> "'"$PAD/raw9.csv"'"}' $VOEG/filters/niet_gebruiken_ean.csv FS=";" $PAD/raw8.csv

## Remove line if column 5 contains -
awk -F';' '!($5 ~ "-")' $PAD/raw9.csv > $PAD/raw10.csv

## Add code as last column
awk '$0=$0" ;CMT"' $PAD/raw10.csv > $PAD/clean.csv

## Create backup of the cleaned file
cp $PAD/clean.csv $ARCHIEF/clean.$TIJDDATUM.csv
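Incidentally, the `$0=$0" ;CMT"` step works because the pattern is an assignment whose value is the new, non-empty record: awk treats that as true and applies its default action, printing the line. A minimal sketch:

```shell
# The pattern is an assignment; its value (the modified, non-empty
# record) is true, so awk's default action prints each line.
printf 'a\nb\n' | awk '$0 = $0 " ;CMT"'
# prints:
# a ;CMT
# b ;CMT
```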

As mentioned before, the code works fine, but it is probably not the best code.

Main sample data (the original file is far bigger)

sku     category        title   price   currency        tax     manufacturer    EAN     suppliernumber  instock shipp_moneyorder        weight  Sped    catSort EEK
37760   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Orange Sirup 500ml   3.1682242990654 EUR     7.00    SodaStream      7290002793335   1020103490      0       4.99    0.699   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37761   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Zitrone-Limette Sirup 500ml  2.5046728971963 EUR     7.00    SodaStream      7290002793328   1020110490      >10     4.99    0.600   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37762   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Apfel-Mix Sirup 500ml        3.5046728971963 EUR     7.00    SodaStream      7290002793229   1020108491      3       4.99    0.600   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37765   Haushalt & Küche > SodaStream & Wassermaxx > Sirup      SodaStream Isotonic Sirup 375ml 3.7289719626168 EUR     7.00    SodaStream      7290010498574   5140013 >10     4.99    0.400   0       Haushalt & Kueche > SodaStream & Wassermaxx > Sirup
37773   Haushalt & Küche > Elektro Kleingeräte > Eierkocher     Gastroback 42801 Design Eierkocher Silber       29.327731092437 EUR     19.00   Gastroback      4016432428011   0580178 1       0       1.000   0       Haushalt & Kueche > Elektro Kleingeraete > Eierkocher
37816   Haushalt & Küche > SodaStream & Wassermaxx > Ersatzflaschen     SodaStream PET Flasche 1 ltr. Boden Edelstahl   7.2521008403361 EUR     19.00   SodaStream      7290006780829   1041191490      0       3.99    1.000   0       Haushalt & Kueche > SodaStream & Wassermaxx > Ersatzflaschen
37898   Haushalt & Küche > Kaffee & Tee > Kaffeebohnen  Melitta Bella Crema LaCrema Kaffeebohnen 1kg    11.598130841121 EUR     7.00    Melitta 4002720008102   008102  8       4.99    1.000   0       Haushalt & Kueche > Kaffee & Tee > Kaffeebohnen
38616   Receiver & SAT-Anlagen > SAT Zubehör > SAT LNB  Kathrein UAS 584 Quatro LNB     75.756302521008 EUR     19.00   Kathrein        4021121468858   20110019        4       4.99    0.300   0       Receiver & SAT-Anlagen > SAT Zubehoer > SAT LNB
38909   Gesundheit & Wellness > Körperpflege > Rasierer Zubehör Panasonic WES 035 K503 Reinigungskartusche      7.9831932773109 EUR     19.00   Panasonic       5025232434084   WES035K503      5       3.99    0.100   0       Gesundheit & Wellness > Koerperpflege > Rasierer Zubehoer
39101   Receiver & SAT-Anlagen > SAT Zubehör > SAT Stecker & Adapter    Kathrein ESD 84 Antennendose    5.9579831932773 EUR     19.00   Kathrein        4021121338540   274425  5       4.99    0.200   0       Receiver & SAT-Anlagen > SAT Zubehoer > SAT Stecker & Adapter
39417   Receiver & SAT-Anlagen > SAT Zubehör > SAT Stecker & Adapter    Kathrein EBC 10 Zweifachverteiler SAT-Verteiler 6.6386554621849 EUR     19.00   Kathrein        4021121435638   272859  6       4.99    0.100   0       Receiver & SAT-Anlagen > SAT Zubehoer > SAT Stecker & Adapter
39837   Haushalt & Küche > Elektro Kleingeräte > Folienschweißgerät     Rommelsbacher VRS 2060 Vakuumier Rollen 20x600cm 2er Set        8.3193277310924 EUR     19.00   Rommelsbacher   8018294008659   8018294008659   1       3.99    0.500   0       Haushalt & Kueche > Elektro Kleingeraete > Folienschweissgeraet
39840   Computer & Zubehör > Monitore > Monitor-Zubehör Dell AX510 Lautsprecher für UltraSharp  20.084033613445 EUR     19.00   Dell    0000000039840   520-10703       >10     4.99    1.000   0       Computer & Zubehoer > Monitore > Monitor-Zubehoer
39843   Haushalt & Küche > Elektro Kleingeräte > Zitruspresse   Gastroback 41138 Home Culture Zitruspresse      39      EUR     19.00   Gastroback      4016432411389   41138   3       0       1.000   0       Haushalt & Kueche > Elektro Kleingeraete > Zitruspresse
39934   Haushalt & Küche > Elektro Kleingeräte > Folienschweißgerät     Rommelsbacher VRS 3060 Vakuumier Rollen 30x600cm 2er Set        11.680672268908 EUR     19.00   Rommelsbacher   4001797824004   VRS3060 7       4.99    0.500   0       Haushalt & Kueche > Elektro Kleingeraete > Folienschweissgeraet

Word filter file

Schweissgeraete
Sharp Fernseher
Sirup
Software
Solo-Mikrowelle
Sony Fernseher
Standherde
Stromerzeuger

EAN filter file

0085126300272
4960759025241
0018208021444
0182080214444
4960759023858
0024066553065
0240665530652
4001797824004

All files are snippets of the full files, as they are too big to post here. With the code above it should be understandable what happens.

Each filter file produces a separate output file and a raw one which continues the processing. At the end I have a tar command which packs those files together (not the raw ones), sends them by email, and also uploads them to the correct server.

The only real disadvantage is that every supplier has its own layout, but the above is one of the longest scripts for this. Most others are smaller, but no less important. I understand most of the code I copied here, so I should be able to adapt it for those if needed.

Wellllll, not sure if I met ALL of your requirements that I tried to infer from your code, but this might be at least a starting point:

awk -F"\t" '
FNR == 1        {FC++
                }
FC == 1         {FILTWORD[$0]
                 next
                }
FC == 2         {FILTEAN[$0]
                 next
                }

FNR == 1 || 
$6 == 7         {next
                }

                {for (SP in FILTWORD) if ($0 ~ SP)      {print > "removed_woord"
                                                         next
                                                        }
                 gsub (/>/, "")
                }

!$4  ||
$9 ~ "-"        {next
                }

$8 in FILTEAN   {print $8, $1, $9, $10, $4+0, $7  > "removed_EAN" 
                 next
                }

                {print $8, $1, $9, $10, $4+0, $7, "CMT"  > "clean"
                }
' OFS=";" OFMT="%.2f" file2 file3 file1
cf removed_EAN clean 
removed_EAN:
4001797824004;39934;VRS3060;7;11.68;Rommelsbacher
clean:
4016432428011;37773;0580178;1;29.33;Gastroback;CMT
7290006780829;37816;1041191490;0;7.25;SodaStream;CMT
4021121468858;38616;20110019;4;75.76;Kathrein;CMT
5025232434084;38909;WES035K503;5;7.98;Panasonic;CMT
4021121338540;39101;274425;5;5.96;Kathrein;CMT
4021121435638;39417;272859;6;6.64;Kathrein;CMT
8018294008659;39837;8018294008659;1;8.32;Rommelsbacher;CMT
4016432411389;39843;41138;3;39;Gastroback;CMT

Not sure if the fields selected and their order are correctly inferred; an output sample to compare against is missing.
removed_woord is empty as all eligible lines were already discarded by the 7.00% VAT criterion.