Remove duplicate lines from a file

sudhakar_T · January 6, 2014, 9:50am

Hi,

I have a CSV file containing several million lines.
The first line (the header) repeats at every 50000th line. I want to remove every repeated header from the second occurrence onward (the first line must stay).

I don't want to match on a pattern taken from the header, because I have hundreds of such files. Instead, I want to compare the first line of each file with the remaining lines and remove the duplicates from the second occurrence onward.

I can't use 'uniq' because the files contain other duplicate (non-header) lines that must be kept.

Thanks in advance,
Sudhakar

bartus11 · January 6, 2014, 9:53am

Try:

nawk 'NR==1{x=$0;print}$0!=x' file
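To see what the one-liner does, here is a minimal sketch on a made-up sample file (the rows are hypothetical, and plain awk is used on the assumption that it is a POSIX awk; on Solaris that is nawk). The first line is stored in x and printed once; every later line is printed only if it differs from x, so other duplicate rows survive.

```shell
# Hypothetical sample: the header repeats and one data row is duplicated on purpose.
tmp=$(mktemp)
printf '%s\n' 'id,name' '1,a' '1,a' 'id,name' '2,b' 'id,name' > "$tmp"

# NR==1: remember line 1 in x and print it.
# $0!=x: print any line that is not identical to the remembered header.
deduped=$(awk 'NR==1{x=$0;print}$0!=x' "$tmp")
printf '%s\n' "$deduped"

rm -f "$tmp"
```

The repeated 1,a row is kept, and only the second and third copies of the header are dropped.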

sudhakar_T · January 7, 2014, 10:48am

It worked.
Thank you very much.

---------- Post updated 01-07-14 at 09:18 PM ---------- Previous update was 01-06-14 at 09:27 PM ----------

Hi bartus11,

The same command works fine from the command line. However, I have a script, Run_queries.sh, shown below:

/*

#!/bin/ksh
if [ "$4" == "FieldTickets" ]; then
export INTERNAL_CODE="Field Tickets";
echo $INTERNAL_CODE
else
export INTERNAL_CODE=$4;
fi
export DATE=`date +"%d%b%Y"`
cd $HOME/Prod_report_queries/reports/$DATE
sqlplus xyz/xxxx@xxx << EOF
set pagesize      50000
set heading        on
set feedback      off
set trimspool     on
set trim          on
set linesize    32767
set termout       off
set verify        off
set underline     off
set colsep '","'
set headsep '","'
define date_from = '$2'
define date_to = '$3'
define company_id = '$1'
define  internal_code='$INTERNAL_CODE'
define user_list = '$7'
SPOOL $5$DATE.csv
@$HOME/Prod_report_queries/queries/$6
SPOOL OFF
!sed '/^$/d' $5$DATE.csv > temp2.csv
!sed '/SQL>/d' temp2.csv > temp3.csv
!sed 's/^.*$/"&"/' temp3.csv > temp4.csv
!sed 's/[ ]*","/","/g' temp4.csv > temp5.csv
!sed 's/","[ ]*/","/g' temp5.csv > $5$DATE.csv
!awk 'NR==1{x=$0;print}$0!=x' temp6.csv > $5$DATE.csv.csv
!rm temp*.csv
EXIT;

This script takes some parameters from the command line and produces a formatted CSV file. I run it like this:

sh Run_queries.sh 753807 NULL NULL NULL Hess-RedPOAlertReport Hess-RedPOAlertReport.sql NULL

But the awk step fails with the following error:

awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: bailing out near line 1

Is there a way to run awk on the CSV file from within Run_queries.sh?

Thanks,
Sudhakar

bartus11 · January 7, 2014, 11:37am

Try replacing "awk" with "nawk".

sudhakar_T · January 8, 2014, 3:40am

Hi,

I replaced awk with nawk.

Now I am getting a syntax error:

SQL> nawk: syntax error at source line 1
 context is
         >>> NR==1{x=Run_queries. <<< sh;print}Run_queries.sh!=x
nawk: illegal statement at source line 1

--Sudhakar

bartus11 · January 8, 2014, 3:56am

Try using it like this:

!nawk 'NR==1{x=\$0;print}\$0!=x' temp6.csv > $5$DATE.csv.csv
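The context line in the error (NR==1{x=Run_queries.sh;print}...) suggests why the backslashes help: the sqlplus input is an unquoted here-document, so the shell expands $0 to the script name before nawk ever sees it. A minimal sketch of that behaviour, with cat standing in for sqlplus and in.csv as a hypothetical file name:

```shell
# cat stands in for sqlplus; in.csv is a hypothetical file name.

# Unquoted delimiter: the shell expands $0 inside the body.
unquoted() {
cat << EOF
!nawk 'NR==1{x=$0;print}$0!=x' in.csv
EOF
}

# Same unquoted delimiter, but \$ keeps each dollar sign literal.
escaped() {
cat << EOF
!nawk 'NR==1{x=\$0;print}\$0!=x' in.csv
EOF
}

# Quoted delimiter: nothing in the body is expanded at all.
quoted() {
cat << 'EOF'
!nawk 'NR==1{x=$0;print}$0!=x' in.csv
EOF
}

expanded=$(unquoted)
kept_by_escape=$(escaped)
kept_by_quote=$(quoted)
printf '%s\n' "$expanded" "$kept_by_escape" "$kept_by_quote"
```

Quoting the delimiter would also stop $5, $DATE and the other parameters in the script from expanding, so escaping only the awk dollars fits that script better. Note also that the posted script never writes temp6.csv (the last sed writes to $5$DATE.csv), so the input file name probably needs adjusting as well.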

Aditya_001 · January 8, 2014, 4:23am

Why don't you try using

sort -u temp6.csv > $5$DATE.csv

Cheers,
Adi

PikK45 · January 8, 2014, 4:34am

Adi, sort -u would take a long time for a huge file. Depending on the situation, I guess awk or sed would perform better.
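Apart from speed, sort -u also changes the result: it reorders the file and collapses every duplicate line, not only the repeated header, which conflicts with the requirement to keep the non-header duplicates. A small sketch on hypothetical sample data (LC_ALL=C is set only to make the sort order predictable):

```shell
# Hypothetical sample: repeated header plus a data row that is duplicated on purpose.
tmp=$(mktemp)
printf '%s\n' 'id,name' '2,b' '2,b' 'id,name' '1,a' > "$tmp"

# sort -u: output is sorted, and the duplicated 2,b row is collapsed to one.
sorted=$(LC_ALL=C sort -u "$tmp")

# awk: original order is kept, and only repeats of line 1 are dropped.
filtered=$(awk 'NR==1{x=$0;print}$0!=x' "$tmp")

printf 'sort -u:\n%s\n\nawk:\n%s\n' "$sorted" "$filtered"
rm -f "$tmp"
```

With sort -u the header is no longer the first line and one of the 2,b rows is gone, while the awk filter keeps both.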