Compare 2 files, awk maybe?

phaethon · May 18, 2014, 6:59pm

I have 2 files,
file1:

alfa     numbers numbers 
vita     numbers numbers
gama   numbers numbers
delta    numbers numbers
epsilon numbers numbers
zita      numbers numbers
...

file2:

'zita'    keepnumbers keepnumbers keepnumbers
'gama' keepnumbers keepnumbers keepnumbers
'misc'  keepnumbers keepnumbers keepnumbers
'alfa'    keepnumbers keepnumbers keepnumbers
...

and I want to print the lines of file2 whose first word (first column) matches the first word of a line in file1 (first column), BUT keep the order of the first file.
The output should look like

'alfa'    keepnumbers keepnumbers keepnumbers
'gama' keepnumbers keepnumbers keepnumbers
'zita'    keepnumbers keepnumbers keepnumbers

I have already tried with

awk 'NR==FNR{a[$1]++;next}a[$1]' file1 file2 > file3

but the order in file3 is like file2.
Moreover, awk trips on the quote symbol '. Is there a way to ignore it and read only the name inside the quotes?
Thanks in advance for the help and time

Aia · May 18, 2014, 7:57pm

Would that work?

awk 'NR==FNR{gsub("\47", "", $1); a[$1]=$0;next} {if( $1 in a) {print a[$1]}}' file2 file1 > file3

Don_Cragun · May 18, 2014, 8:26pm

To exactly match the requested output, you could try:

awk -v sq="'" '
FNR == NR {
	d[$1] = $0
	next
}
(sq $1 sq) in d {
	print d[sq $1 sq]
}' file2 file1 > file3
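
For anyone who wants to try this without the real data, here is a minimal sketch of the same two-pass idea. The file contents are made-up placeholders (1 2, k1 k2 k3, ...) standing in for the "numbers" and "keepnumbers" columns of the question, and the scratch directory is only an assumption for the demo. file2 is read first and stored by its quoted key; file1 is read second, so matches are printed in file1's order.

```shell
#!/bin/sh
# Recreate the sample layout from the question in a scratch directory.
# The numeric values are placeholders, not the poster's real data.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1" <<'EOF'
alfa     1 2
vita     3 4
gama   5 6
delta    7 8
epsilon 9 10
zita      11 12
EOF
cat > "$tmpdir/file2" <<'EOF'
'zita'    k1 k2 k3
'gama' k4 k5 k6
'misc'  k7 k8 k9
'alfa'    k10 k11 k12
EOF

# Pass 1 (file2, while FNR == NR): remember each whole line under its quoted key.
# Pass 2 (file1): lines arrive in file1 order, so printing on a hit keeps that order.
result=$(awk -v sq="'" '
FNR == NR { d[$1] = $0; next }
(sq $1 sq) in d { print d[sq $1 sq] }
' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Lines of file2 with no partner in file1 (here 'misc') are simply never printed, and the original spacing of each file2 line is kept because $0 is stored untouched.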

phaethon · May 18, 2014, 8:37pm

It works, but I have a problem because the first word in some lines has a space at the end, like

 'alfa '   keepnumbers keepnumbers

Do you know a way to overcome it? Thank you very much, for the recommendations too!

Aia · May 18, 2014, 9:23pm

Assuming that you want to keep 'alfa ' in the output.

awk 'NR==FNR{t=$1; gsub("\47", "", t); a[t]=$0; next} {if( $1 in a) {print a[$1]}}' file2 file1 > file3

Don_Cragun · May 18, 2014, 9:24pm

Or you could try:

awk -F " *'" '
FNR == NR {
	d[$2] = $0
	next
}
$1 in d {
	print d[$1]
}' file2 FS=' ' file1 > file3
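
To see why $2 is the bare name here, a small sketch of the field splitting may help. The sample lines are invented for illustration. With -F " *'" the separator is "zero or more spaces followed by a single quote", so the opening quote ends an empty $1, and the optional space plus closing quote ends $2.

```shell
#!/bin/sh
# Hypothetical sample lines: one key with a trailing space inside the quotes, one without.
padded="'alfa '   k1 k2"
plain="'zita'    k3 k4"

# $1 is empty (text before the opening quote); $2 is the name without quotes
# and without the trailing space, because " '" is consumed by the separator.
key_padded=$(printf '%s\n' "$padded" | awk -F " *'" '{ print $2 }')
key_plain=$(printf '%s\n' "$plain" | awk -F " *'" '{ print $2 }')

printf '[%s] [%s]\n' "$key_padded" "$key_plain"
```

The FS=' ' operand placed between file2 and file1 in the command above is an ordinary awk command-line assignment: it takes effect when awk reaches it in the argument list, so file1 is split on whitespace as usual.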

phaethon · May 18, 2014, 10:12pm

One last question: what can I do if I want to remove a space whenever it is followed by a single quote, wherever it occurs in the file?
The point is to keep the single quotes around the previous and next words of a column.
e.g.

 'numbers1' 'te1 ' text 
 'numbers2' 'te2 ' text 
...

should produce the output:

 'numbers1' 'te1' text
 'numbers2' 'te2' text
...

Note that the problematic quoted fields contain only 4 characters (like 'tes '), including the space.

Aia · May 18, 2014, 11:03pm

Wonder if that would work!

awk 'NR==FNR{t=$1; gsub("\47", "", t); a[t]=$0; next} {if( $1 in a) {gsub(" \47", "\47", a[$1]);print a[$1]}}' file2 file1

Don_Cragun · May 19, 2014, 12:39am

If we keep file1 as it was in the original question and change file2 to contain:

'zita'    'keepnumbers '   'keepnumbers '   'keepnumbers'
'gama' keepnumbers  'keepnumbers '	'keepnumbers'
'misc'  'keepnumbers ' keepnumbers  'keepnumbers'
'alfa '    'keepnumbers '  'keepnumbers '   'keepnumbers'

(note that there is a tab between the last two fields instead of spaces on the line containing "gama"), Aia's code in message #8 in this thread produces:

'alfa'   'keepnumbers' 'keepnumbers'  'keepnumbers'
'gama' keepnumbers 'keepnumbers'	'keepnumbers'
'zita'   'keepnumbers''keepnumbers''keepnumbers'

which I think got rid of too many spaces.

If I understand the third set of requirements properly (only remove a single space at the end of fields between pairs of single quotes; keep spaces between fields as they were), I think this does what you want:

awk -v sq="'" '
BEGIN {	FS = OFS = sq
}
FNR == NR {
	for(i = 2; i <= NF; i +=2)
		if(substr($i, length($i)) == " ")
			$i = substr($i, 1, length($i) - 1)
	d[$2] = $0
	next
}
$1 in d {
	print d[$1]
}' file2 FS=" " file1 > file3

which produces the following output:

'alfa'    'keepnumbers'  'keepnumbers'   'keepnumbers'
'gama' keepnumbers  'keepnumbers'	'keepnumbers'
'zita'    'keepnumbers' 'keepnumbers' 'keepnumbers'
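
As a rough illustration of why this leaves the spacing between fields alone, here is the trimming loop run on a single made-up line. With FS set to a single quote, even-numbered fields are the text inside a pair of quotes and odd-numbered fields are the gaps between them, so only the even ones are touched. The sample line is an assumption for the demo, not data from the poster.

```shell
#!/bin/sh
# Hypothetical input: two quoted fields with a trailing space and one empty quoted field.
line="'alfa '    'keepnumbers '  ''"

cleaned=$(printf '%s\n' "$line" | awk -v sq="'" '
BEGIN { FS = OFS = sq }
{
	# Fields 2, 4, 6, ... are the contents of the quoted strings.
	for (i = 2; i <= NF; i += 2)
		if (substr($i, length($i)) == " ")
			$i = substr($i, 1, length($i) - 1)
	# Assigning a field rebuilds $0 with OFS (a single quote), so quotes are restored.
	print
}')

printf '%s\n' "$cleaned"
```

The runs of spaces between fields live in the odd-numbered fields and pass through unchanged, and the empty '' field stays empty.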

Aia · May 19, 2014, 2:51am

Due to FS.
For the sake of keeping the gsub() saga.

awk 'NR==FNR {gsub(" \47", "\47");gsub("\47\47", "\47 \47"); t=$1; gsub("\47", "", t); s[t]=$0; next} $1 in s {print s[$1]}' file2 file1

Don_Cragun · May 19, 2014, 3:29pm

Aia wrote:

Due to FS.
For the sake of keeping the gsub() saga.

awk 'NR==FNR {gsub(" \47", "\47");gsub("\47\47", "\47 \47"); t=$1; gsub("\47", "", t); s[t]=$0; next} $1 in s {print s[$1]}' file2 file1

If you have an empty quoted field in the input such as in:

'alfa '    'keepnumbers '  'keepnumbers '   ''

this will remove spaces at the ends of the quoted fields that have spaces and add a space to the empty quoted field. It also still removes a space between fields when there are multiple spaces between fields in the input, as in:

'alfa'   'keepnumbers' 'keepnumbers'  ' '

instead of:

'alfa'    'keepnumbers'  'keepnumbers'   ''

But, of course, we have no way of knowing whether or not this matters to the OP since requirements for these cases were not specified.
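
The difference can be reproduced on that one line with a short sketch. Only the two gsub() calls from the quoted command are used, applied directly to the example line above, so the file lookup part is left out.

```shell
#!/bin/sh
# The example line with an empty quoted field at the end.
line="'alfa '    'keepnumbers '  'keepnumbers '   ''"

# First gsub: drop a space before every single quote, opening quotes included.
# Second gsub: put a space back between any two adjacent quotes.
gsub_out=$(printf '%s\n' "$line" | awk '{
	gsub(" \47", "\47")
	gsub("\47\47", "\47 \47")
	print
}')

printf '%s\n' "$gsub_out"
```

Because the first gsub() cannot tell an opening quote from a closing one, each gap between fields loses one space, and the second gsub() then turns the empty '' into ' '.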

phaethon · May 21, 2014, 9:58pm

Thank you both very much! Both of your solutions worked out fine! You are an inspiration!