I found a solution on Stack Overflow for a very similar problem here,
but none of the solutions given there work for me. They just print the input unchanged.
Any ideas on what's wrong and how I can make it work?
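#!/usr/bin/awk -f
# First argument: space-separated list of field numbers to deduplicate, e.g. "1 2"
BEGIN {
    FS = "\t"; OFS = FS
    deduplist = ARGV[1]
    ARGV[1] = ""                        # keep awk from treating the list as an input file
    split(deduplist, tmp, " ")
    for (i in tmp) dedup[tmp[i]] = 1    # index by the field number stored in tmp[i]
}
{
    for (i = 1; i <= NF; i++)
        if (i in dedup) {
            if ($i == prev[i])          # compare with the previous value of the same field
                $i = ""
            else
                prev[i] = $i
        }
    # prevent printing lines that are completely blank because
    # they are exact duplicates of the preceding line and all fields
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/)
        print
}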
As the order in which the array elements are retrieved is unspecified, you may need to sort the output. And, you may want to define a different output field separator.
Those two requests are different: the first aggregates the columns with the unique key $1, the second with the key $1 $2.
What about the other columns (indicated by "..." in your sample)? Which contents should be retained?
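The difference between the two keys can be sketched as follows. The sample rows and the use of a sum as the aggregate are assumptions for illustration only; the thread does not say which aggregation is wanted.

```shell
# Hypothetical sample: two key columns and one numeric column (tab-separated).
input=$(printf 'x\tp\t1\nx\tq\t2\nx\tp\t3\n')

# Key on $1 only: all three rows fall into the same group.
by_one=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = "\t" }
       { sum[$1] += $3 }
       END { for (k in sum) print k "\t" sum[k] }' |
  sort)

# Key on $1 and $2 together: rows group only when both columns match.
by_two=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = "\t" }
       { sum[$1 FS $2] += $3 }
       END { for (k in sum) print k "\t" sum[k] }' |
  sort)

printf 'key $1:\n%s\n' "$by_one"
printf 'key $1 $2:\n%s\n' "$by_two"
```

The first variant yields a single line for x, the second one line each for x/p and x/q.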
This is brilliant. Thanks @RudiC. This has reduced a lot of manual work.
@pravin27 I think that for the current situation it's OK for me to list every column as either a key or an aggregate field, as there are not too many columns.
Your solution works great with tab as the delimiter, but I am using | and it fails with the error below:
awk: 0602-521 There is a regular expression error.
*?+ not preceded by valid expression
The input line number is 24. The file is /tmp/abc.del.
The source line number is 1.
The script I posted has 23 lines, so it is difficult to track down an error in line 24 without seeing a) the script you ran and b) the input file (or a representative extract of it).
You could run it with <TAB>s and then tr '\t' '|' the result.
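A rough sketch of that round trip, assuming the data itself contains no literal tabs. The awk program here is only a stand-in (it uppercases the first field); the real tab-based script would take its place.

```shell
# Hypothetical pipe-delimited sample.
input=$(printf 'a|1\nb|2\n')

# Convert | to tab, run a tab-based awk program, convert back to |.
# The awk body is a placeholder for the actual script.
result=$(printf '%s\n' "$input" |
  tr '|' '\t' |
  awk 'BEGIN { FS = OFS = "\t" } { $1 = toupper($1); print }' |
  tr '\t' '|')

printf '%s\n' "$result"
```

Because everything runs in one pipeline, the data is streamed and no intermediate file is written.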
To clarify a little: 24 is the last line of my input file, and the source line is 1 because I ran your script after removing all the newlines and replacing them with semicolons.
When I run it the way you have given it, the error is on line 18, which is:
gsub (OFS OFS "*", OFS, OP)
Actually, I changed the input file only to test the code; it comes with | by default.
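A possible explanation for the error on that gsub line: with OFS set to |, the expression OFS OFS "*" evaluates to the string ||*, and in an extended regular expression | is the alternation operator, so the * has nothing valid to repeat. One way around it, sketched below, is to wrap the separator in a bracket expression so it is taken literally. This is an untested-against-the-original suggestion: the rest of the script is not shown here, and OP is assumed to hold a line of output.

```shell
# Collapse runs of the separator into one, with | as the separator.
# "[" OFS "]+" builds the regex [|]+, in which | is a literal character.
collapsed=$(printf 'a||b|||c\n' |
  awk 'BEGIN { OFS = "|" }
       { OP = $0                      # OP stands in for the script variable
         gsub("[" OFS "]+", OFS, OP)
         print OP }')

printf '%s\n' "$collapsed"
```

The bracket form still works when OFS is a tab. It would need adjusting for separators that are special inside brackets, such as ^, ] or a backslash.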
Of course there is, and RudiC has already mentioned it to you (in post #13): use tr (or any other text filter that changes single characters) to change the separator to something else, then change it back afterwards if needed. Which character is safe to use is something you will have to decide after analysing your data. I suggest trying "@", which is fairly uncommon in normal text as long as no email addresses are mentioned; otherwise use a rarely used non-ASCII character or the like.
Well, yes, I read it, but as I said, the file comes with | by default, and I would prefer not to replace it with something else, as the files are huge and making copies of them is not advised.
But maybe that's the only thing possible right now. Cheers.