Sorry for the vague title, but I have the following problem.
We have several files with between 10,000 and roughly 500,000 lines each. From these files we want to remove every line that contains a pattern listed in another file (around 20,000 lines, all EAN codes). We also want the removed lines written to a separate file, so we can check whether lines are removed that shouldn't be (this has nothing to do with the matching itself).
With both of the above I should end up with one file that contains only "0018208944262;A 562381;VNA750E1;50;4999.14;Nikon" and another file with the rest.
I tried the following awk code:
awk -F ';' 'NR==FNR {id[$1]; next} $1 in id' filter.csv main.csv, but it neither removes the lines nor puts them in another file. I also tried grep, but that only works when the filter file has around 100 lines or so.
Does anyone know a way to get those two results as described above?
I know I can redirect the output, but that still did not remove the lines from the main file. Sorry if that was unclear before.
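For what it's worth, grep's slowness with a large pattern file is usually because every pattern is treated as a regular expression; a fixed-string match may already be fast enough. A minimal sketch (file names are assumptions, and note that this matches the EAN anywhere on the line, not only in the first field):

```shell
# Sample data for demonstration; in practice these files already exist.
printf '0018208944262;A 562381;VNA750E1;50;4999.14;Nikon\n1111111111111;B 1;X;1;2.00;Foo\n' > main.csv
printf '1111111111111\n' > filter.csv

# -F  treat patterns as fixed strings (much faster than regex matching)
# -f  read the patterns from a file
# -v  invert the match
grep -F -f filter.csv main.csv  > removed.csv   # lines containing a filtered EAN
grep -vF -f filter.csv main.csv > cleaned.csv   # everything else
```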
Omg, you are fast with editing :D. I tested the revised code and it seems to work fine: I now have two files with different output in each. I will test it some more with other files to be sure, but it seems to work. Thank you again for the extremely fast help.
cat script.ksh
awk 'FNR==NR { A[$1]; next }
     ($1 in A)  { print >> "Output_match_found_file" }
     !($1 in A) { print >> "Output_match_NOT_found_file" }' pattern_file FS=";" main_file
if [[ $? == 0 ]]
then
mv main_file main_file_Original
mv Output_match_NOT_found_file main_file
else
echo "Please check: there seems to be an issue with the awk command."
fi
The code above creates a backup of main_file named main_file_Original and also removes the matching lines from main_file; let me know if this helps you.
EDIT: You could also try the following. Here, if the first awk command fails, then of course the second command, which renames the Input_file, will not be executed.
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "Output_match_found_file"} !($1 in A){print >> "Output_match_NOT_found_file"}' pattern_file FS=";" match_file && mv Output_match_NOT_found_file match_file
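A note on the FS=";" sitting between the two file names: awk processes its command-line operands in order, and a var=value operand takes effect only when awk reaches it. So the pattern file is read with the default whitespace separator, and only main_file is split on ";". A small demonstration with made-up data:

```shell
# FS=";" is evaluated after pattern_file has been read, so pattern_file
# is split on whitespace and main_file on semicolons.
printf 'AAA\nBBB\n'     > pattern_file
printf 'AAA;1\nCCC;2\n' > main_file
awk 'FNR==NR{A[$1];next} ($1 in A){print "match:", $0}' pattern_file FS=";" main_file
# prints: match: AAA;1
```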
I think I figured it out. The code above did not create the files either, so I googled a bit and found that the modified code below works.
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print >> "'"$PAD/raw4.csv"'"} !($1 in A){print >> "'"$PAD/removed.csv"'"}' $VOEG/niet_gebruiken_ean.csv FS=";" $PAD/raw3.csv
Thank you for helping with the code.
I overlooked those two because I thought they were inline text belonging to the files :D. Later on I noticed it created two files named Positive and Negative. Sorry for overlooking that.
---------- Post updated 11-02-16 at 09:47 AM ---------- Previous update was 10-02-16 at 03:16 PM ----------
Here I am again. I tested the code with one of the things I needed filtered, and it works just fine.
Now I am struggling with the other filter, though. As before, I need to filter lines from one file into two others. The difference here is that the filter file contains words, some of which include spaces, like below:
All In One PC
Asus PC
Bandsaege
CI Module
Desktop PC
And the other problem is that these words can appear anywhere in the line. I tried the awk line from before, running it on each column separately, but it seems to look for each word separately, creating false negatives. Any idea how best to solve this?
I know I can do this with grep, but it is awfully slow and runs into problems with too many lines.
---------- Post updated at 09:57 AM ---------- Previous update was at 09:47 AM ----------
The main file is different in most cases, but mainly in the delimiter, which I should be able to handle.
filter file
4K Fernseher
Acer Aspire
Acer PC
Acer Veriton
All In One PC
Asus PC
Bandsaege
Wireless Desktop
With the samples above it should filter the last line from the main file (and put it in a separate file) because of the phrase "Wireless Desktop", but not the line above it, which only contains "Wireless".
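One way to match whole phrases anywhere on the line is awk's index() function, which does a plain substring search: "Wireless Desktop" only matches where the full phrase appears, a lone "Wireless" does not trigger it, and any regex metacharacters in the filter words are taken literally. A sketch with made-up file names and sample data:

```shell
# Sample data for demonstration
printf '4K Fernseher\nWireless Desktop\n' > words.txt
printf 'Logitech Wireless Desktop MK270;10\nLogitech Wireless Mouse;5\n' > main.csv

awk '
NR==FNR { if ($0 != "") words[$0]; next }   # load phrases, skip blank lines
{
    for (w in words)
        if (index($0, w)) { print > "removed.csv"; next }
    print > "kept.csv"
}' words.txt main.csv
```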
I am not 100% sure what you mean, but I added sample data.
In total I have two filters which run over different stages of the same file.
The word filter runs first, as that is when I still have all the data to search. After that the file is cut down and processed into a near-finished file, which is then filtered against the other EAN filter to remove the last lines that aren't needed.
Technically it would be fine if they were in one filter file (with a loop at the beginning), but that would probably make the code really complicated, I guess.
I am not sure what I am doing wrong with your code, but it does not seem to work like your example.
awk -F '\t' '
NR==FNR { id[$0]; next }
{
    for (SP in id)
        if ($0 ~ SP) {
            print > "'"$PAD/removed_woord.csv"'"   # positive
            next
        }
}
{ print > "'"$PAD/raw1.csv"'" }                    # negative
' filters/woord_COM.csv $PAD/raw.csv
I get the positive file completely filled with all the lines, even though only a few should end up there, while the negative file does not get a single line. I marked which file is which with the "positive" and "negative" comments in the code above.
I just noticed I had two small errors in my copied code. I removed them, but the problem remains: the positive file with all the matches contains every line, and the negative one is not even created.
---------- Post updated at 04:32 PM ---------- Previous update was at 12:59 PM ----------
I did some more fiddling, and I think I know where the error is and how to solve it.
I took the original code:
awk -F ';' '
NR==FNR { id[$0]; next }
{
    for (SP in id)
        if ($0 ~ SP) { print > "Positive"; next }
}
{ print > "Negative" }
' file1 file2
The only thing I changed was the names of file1 and file2. To be sure, I used the example data I provided earlier. When I executed it, the code worked fine and created both files.
Then I changed file1 to the original filter file, and it stopped working: I only received the Positive file, containing all the lines. To be sure I also tested it the other way around, and that worked fine, so the problem had to be in the filter file. After some thinking I remembered that I had copied the data straight from an Excel file; passing it through Notepad in between solved the problem.
So with a new filter file it works fine. Thank you for helping.
That is the annoying part. Some files actually use TAB and some use semicolons, etc. What is strange, though, is that for another file I was testing, I tried setting the delimiter to ";" and it just ignored the whole code and threw everything into the positive file. After I changed the file to TAB and also made the delimiter TAB, it worked just fine.
It is a bit annoying, but I can work around it, at least until I actually start to understand the code some more. I would even go so far as to say that the current code could probably be done in about half the lines, but at least I understand what it does at the moment.
I am always open to updated code if you happen to know it, but it is no biggy to do it in two runs :).
Well, to make it actually understandable, I will post the whole file, although it has been shortened a bit.
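For the record, a plausible explanation for the all-positive symptom: data copied out of Excel on Windows often carries \r\n line endings and sometimes a trailing blank line, and a blank filter entry becomes an empty regex, which matches every record. Loading the filter defensively would avoid the round trip through Notepad; a sketch with the same Positive/Negative names and made-up sample data:

```shell
# Sample data: a filter file with a Windows line ending and a blank line
printf 'Asus PC\r\n\r\n'                    > file1
printf 'Asus PC 123\nSomething else\n'      > file2

awk '
NR==FNR {
    sub(/\r$/, "")           # strip a Windows carriage return
    if ($0 != "") id[$0]     # skip blanks: an empty regex matches everything
    next
}
{
    for (SP in id)
        if ($0 ~ SP) { print > "Positive"; next }
    print > "Negative"
}' file1 file2
```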
## Download the new list
curl -o $PAD/pricing.csv http://website/pricing.csv
## Backup original file
cp $PAD/pricing.csv $ARCHIEF/origineel.$TIJDDATUM.csv
## Remove all lines with 7.00 in column 6
awk -F"\t" '$6 != "7.00"' $PAD/pricing.csv > $PAD/raw.csv
## Put all lines that match the word_filter_file in a separate file, and the rest in raw1 for further processing
awk -F"\t" '
NR==FNR {id[$0]
next
}
{for (SP in id) if ($0 ~ SP) {print > "'"$PAD/removed_woord.csv"'"
next
}
}
{print > "'"$PAD/raw1.csv"'"
}
' $VOEG/filters/woord_COM.csv $PAD/raw.csv
## Remove all columns except the ones we need.
cut -f1,4,7-10 $PAD/raw1.csv > $PAD/raw2.csv
## Put all columns in the correct order and use TAB as delimiter
awk 'BEGIN { FS="\t"; OFS="\t"; } {print $4,$1,$5,$6,$2,$3}' $PAD/raw2.csv > $PAD/raw3.csv
## Remove all lines with a blank first column
awk '!/^\t/' $PAD/raw3.csv > $PAD/raw4.csv
## Remove all > in the complete file
sed 's/>//g' $PAD/raw4.csv > $PAD/raw5.csv
## Remove the first line
awk 'NR != 1' $PAD/raw5.csv > $PAD/raw6.csv
## Limit the number of digits after the . (only needed for column 5)
awk 'BEGIN { FS="\t"; OFS="\t"; } {printf("%s\t%s\t%s\t%s\t%.2f\t%s\n",$1,$2,$3,$4,$5,$6)}' $PAD/raw6.csv > $PAD/raw7.csv
## Make the delimiter ;
awk 'BEGIN { FS="\t"; OFS=";"; } {print $1,$2,$3,$4,$5,$6}' $PAD/raw7.csv > $PAD/raw8.csv
## Put all lines that match the ean_filter_file in a separate file, and the rest in raw9 for further processing
awk 'FNR==NR{A[$1];next} ($1 in A){print >> "'"$PAD/removed_EAN.csv"'"} !($1 in A){print >> "'"$PAD/raw9.csv"'"}' $VOEG/filters/niet_gebruiken_ean.csv FS=";" $PAD/raw8.csv
## Remove line if column 5 contains - (this must read raw9, not raw8, otherwise the EAN filtering above is discarded)
awk -F';' '!($5 ~ "-")' $PAD/raw9.csv > $PAD/raw10.csv
## Add code as last column
awk '$0=$0" ;CMT"' $PAD/raw10.csv > $PAD/clean.csv
## Create backup of the cleaned file
cp $PAD/clean.csv $ARCHIEF/clean.$TIJDDATUM.csv
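Since you mentioned the script could probably be half as long: the cut/reorder/cleanup passes between raw1 and raw8 can be folded into a single awk pass. This is only a sketch under the assumption that the column mapping above is right; field numbers refer to the columns of raw1.csv, sample input is generated here for demonstration, and edge cases (e.g. a header row with a blank eighth column) may behave slightly differently than the original chain:

```shell
PAD=${PAD:-.}   # assume $PAD as in the script above
# Sample 10-column input standing in for raw1.csv
printf 'h1\th2\th3\th4\th5\th6\th7\th8\th9\th10\n'     >  "$PAD/raw1.csv"
printf 'a1\ta2\ta3\t12.5\ta5\ta6\ta7\t>a8\ta9\ta10\n'  >> "$PAD/raw1.csv"
printf 'b1\tb2\tb3\t9.0\tb5\tb6\tb7\t\tb9\tb10\n'      >> "$PAD/raw1.csv"

awk -F '\t' '
NR > 1 && $8 != "" {          # skip the header line; drop rows with a blank first output column
    gsub(/>/, "")             # remove all ">" characters
    printf "%s;%s;%s;%s;%.2f;%s\n", $8, $1, $9, $10, $4, $7
}' "$PAD/raw1.csv" > "$PAD/raw8.csv"
# raw8.csv now holds: a8;a1;a9;a10;12.50;a7
```

The two filter steps (word filter and EAN filter) and the final " ;CMT" column would stay as separate passes around this one.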
As mentioned before, the code works fine, but it is probably not the best code.
Main sample data (the original file is far bigger)
All files are snippets of the full files, as they are too big to post here. With the code from above it should be understandable what happens.
Each filter file produces a separate output file, plus a raw one that continues through the processing. At the end I have a tar command which packs those files together (not the raw ones), sends them by email, and also uploads them to the correct server.
The only real disadvantage I have is that every supplier uses its own layout, but the script above is one of the longest for this. Most others are smaller, but no less important. I understand most of the code I copied here, so I should be able to adapt it for those if needed.
I am not sure the selected fields and their order are inferred correctly; an output sample to compare against is missing. Also note that removed_woord ends up empty, as all eligible lines were already discarded by the 7.00% VAT criterion.