Add new column from another file

freeroute · August 2, 2017, 7:09pm

Hi,

I have 2 files.

file1 contains by lines: hash:salt:id
file2 contains by lines: username:hash:salt

I would like to add a new coloumn (id) form file1 to file2.

The new file should contains: username:hash:salt:id
Note: file1 contains less rows than file2.
I tried

paste -d':' file1 file2

, but this solution not good for me.

Thank you

Don_Cragun · August 2, 2017, 7:58pm

What operating system and shell are you using?

Unless both file1 and file2 are sorted by hash and salt and there is a one-to-one correspondence between lines in those two files, there is no chance that the command:

paste -d':' file1 file2

should be expected to work (and even in that case, those two fields would appear in the output twice).

You could consider using the join utility, but you would need to preprocess both input files so the key field used for joining is a single field (not two fields). If that won't work, awk would be the obvious choice. But, before we can suggest any real solution to your problem, you need to clearly define what should happen if:

there is no match in file1 for a pair of key fields found in file2 , or
there a more than one id field value in file1 for a single pair of key fields.

Please show us a representative pair of sample input files and the output that you want to produce from those sample inputs (each in CODE tags). (Please be sure that any cases that require special handling are included in your sample inputs.)

freeroute · August 3, 2017, 1:39am

I use Debian GNU Linux, and bash shell.

I tried this command

join  <(sort -t":" -k 1 file1.txt) <(sort -t":" -k 2 file2.txt) > file3.txt

For your question:
1.) there should be match. (if not, file3.txt should be empty)
2.) there will not be more than 1 id

file1.txt example:

e0113c5bccc164169af0bd68e5ecbd88:ca.!f+w;9[_2j^u<e+i6U=R:biscuit
5356defbc28859a7433a289a3ed7755d:0f#!LWxvi&FnHJdDn?<c{>M74W^[8;:blizzard
558f783143e78fd8df9d8006c993cf5a:v'>#\=c?BESaS^)_j9a7oDh:amigo
f0495d258b379363f560c7aaef658228:SD%!7vo1N;}L,,@]QA"_+-[J1,GBn5:tester
97d3d3227c40e10ea1b8af3f5156051c:4q1!w({=NIuwI32W28$&P["8g8;{bS:gregory
d448d9c1eda2368854a8a6b97d9732b9:6i=!YJ=>?Pza`K+LlY5eu@*ag5=&6O:griffin

file2.txt example:

Mercenary14:e0113c5bccc164169af0bd68e5ecbd88:ca.!f+w;9[_2j^u<e+i6U=R
thiago:8b8eab21b047d9ab19193d90ce0a98e7:=Ug"\^6mVwP*['fy2RGAJs2
uncleslam666:864ea0693e4b69ef8c3abe53fd71f913:k,x"{qla-iPRPR*ome*(!#v
rcmonster112:558f783143e78fd8df9d8006c993cf5a:v'>#\=c?BESaS^)_j9a7oDh

I need this format (so where the hash:salt match, should append username and id coloumns):

Mercenary14:e0113c5bccc164169af0bd68e5ecbd88:ca.!f+w;9[_2j^u<e+i6U=R:biscuit

Don_Cragun · August 3, 2017, 2:35am

With the sample input files you provided, why is there only one line of output? Why shouldn't the output from those two input files be:

Mercenary14:e0113c5bccc164169af0bd68e5ecbd88:ca.!f+w;9[_2j^u<e+i6U=R:biscuit
rcmonster112:558f783143e78fd8df9d8006c993cf5a:v'>#\=c?BESaS^)_j9a7oDh:amigo

?

Note that the join command:

join  -1 1 -2 2 -t: -o2.1,2.2,2.3,1.3  <(sort -t":" -k 1,2 file1.txt) <(sort -t":" -k 2,3 file2.txt) > file3.txt

produces the output above from your sample input, but you ignored my note about preprocessing both input files to produce a single key field. The above command only compares the hash fields in the two files and completely ignores the salt fields when looking for matches in the two input files.

If the above output is acceptable but you need to match both the hash and salt fields, you might try using awk :

awk -F: '
NR == FNR {
	id[$1, $2] = $3
	next
}
($2, $3) in id {
	print $0, id[$2, $3]
}' OFS=: file1.txt file2.txt > file3.txt

which also produces the output shown above from your sample input files.

If someone else wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

freeroute · August 3, 2017, 2:51am

With the sample input files you provided, why is there only one line of output?

Sorry, I forget to post other matching output lines.

Thanks. I tried, here is the bash error message:

awk -F: 'NR == FNR {id[$1, $2] = $3 next}($2, $3) in id {print $0, id[$2, $3]}' OFS=: file1.txt file2.txt > file3.txt

awk: line 1: syntax error at or near next

join  -1 1 -2 2 -t: -o2.1,2.2,2.3,1.3  <(sort -t":" -k 1,2 file1.txt) <(sort -t":" -k 2,3 file2.txt) > file3.txt

In this command the outfile (file3.txt) is empty.

Don_Cragun · August 3, 2017, 2:55am

Try using the script I gave you instead of randomly joining lines and introducing syntax errors!

freeroute · August 3, 2017, 3:23am

Okay. Thanks for your help. I try with no success:mad:. I have to read man pages of join and awk again...

Edit: I tried with file1.txt, file2.txt. It works, but with the original big found_2711.txt left_txt40k. files not.

found_2711.txt is ASCII
left_txt40k.txt is data

RudiC · August 3, 2017, 4:20am

How about posting the (curtailed) original files so people get the chance to analyse what might be going wrong? Not everybody's crystal ball can open and read your files remotely...

rbatte1 · August 3, 2017, 4:49am

Can you show us (each block wrapped in CODE tags):-

The (some representative example) of the input
The actual code you are running
The output & errors you are getting

Hopefully then we can spot the issue and guide you to a correction.

If you want to compress code into a single line, you will have to ensure it is still valid. New lines are important. Consider the two scripts below:-

printf "Hello "
printf "world\n"

printf "Hello " printf "world\n"

Would you expect to get the same output?

Kind regards,
Robin

freeroute · August 3, 2017, 6:45am

Thank you for your help. Working.
Every solution you said works (awk and join).

freeroute · August 5, 2017, 1:48am

rbatte1:

Can you show us (each block wrapped in CODE tags):-

The (some representative example) of the input

The actual code you are running

The output & errors you are getting

Hopefully then we can spot the issue and guide you to a correction.

If you want to compress code into a single line, you will have to ensure it is still valid. New lines are important. Consider the two scripts below:-
printf "Hello "
printf "world\n"
printf "Hello " printf "world\n"
Would you expect to get the same output?

Kind regards,
Robin

Tried. Not the same output. Thanks for the support.

freeroute · August 13, 2017, 3:22pm

don cragun:

With the sample input files you provided, why is there only one line of output? Why shouldn't the output from those two input files be:
Mercenary14:e0113c5bccc164169af0bd68e5ecbd88:ca.!f+w;9[_2j^u<e+i6U=R:biscuit
rcmonster112:558f783143e78fd8df9d8006c993cf5a:v'>#\=c?BESaS^)_j9a7oDh:amigo
?

Note that the join command:
join  -1 1 -2 2 -t: -o2.1,2.2,2.3,1.3  <(sort -t":" -k 1,2 file1.txt) <(sort -t":" -k 2,3 file2.txt) > file3.txt
produces the output above from your sample input, but you ignored my note about preprocessing both input files to produce a single key field. The above command only compares the hash fields in the two files and completely ignores the salt fields when looking for matches in the two input files.

If the above output is acceptable but you need to match both the hash and salt fields, you might try using awk :
awk -F: '
NR == FNR {
   id[$1, $2] = $3
   next
}
($2, $3) in id {
   print $0, id[$2, $3]
}' OFS=: file1.txt file2.txt > file3.txt
which also produces the output shown above from your sample input files.

If someone else wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

I have a little problem with the above join command:

If a hash contain semicolon ":", than the output will be wrong.

file1.txt format: hash:salt:ID
0276104677c86b442dd63126de7ab5b9:;i:=5:sakura
note: the hash contains ':'

than the output file3.txt (mail:ID)will be wrong:

icyfury75@yahoo.com:=5

It should be:

icyfury75@yahoo.com:sakura

file2.txt format: hash:salt:ID 
icyfury75@yahoo.com:0276104677c86b442dd63126de7ab5b9:;i5

Thanks

Don_Cragun · August 13, 2017, 3:51pm

freeroute:

I have a little problem with the above join command:

If a hash contain semicolon ":", than the output will be wrong.
file1.txt format: hash:salt:ID
0276104677c86b442dd63126de7ab5b9:;i:=5:sakura
note: the hash contains ':'
than the output file3.txt (mail:ID)will be wrong:
icyfury75@yahoo.com:=5
It should be:
icyfury75@yahoo.com:sakura
file2.txt format: hash:salt:ID 
icyfury75@yahoo.com:0276104677c86b442dd63126de7ab5b9:;i5
Thanks

It is clear that join won't work for you with the data you are feeding it.

If you choose a field separator that also appears as data in a field, how is join supposed to guess at which field separator characters are field separators and which ones aren't?

Your sample input doesn't include any field separators in any of the sample hash fields. Your problem statement doesn't say anything about having to ignore some field separators nor does it describe any mechanism that could be used to determine the lengths of the various fields in your input lines.

It is not clear whether or not you will provide a clear enough description of your data for anyone to use any other tools to process your data. We clearly can't do so with the specifications you have provided us so far.