Unicode equivalencies

It has been hard to work with Spanish characters (accented vowels and ñ) when scraping websites and working with a MySQL database. These are some of the Unicode equivalencies for different characters:

\xc3 seems to belong to the UTF-8 encoding: it is the first of the two bytes that UTF-8 uses for each of these characters. The code points of the characters themselves are:

\xe1 á
\xe9 é
\xed í
\xf3 ó
\xfa ú
\xf1 ñ
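
The relationship between these codes and the \xc3 prefix can be checked directly in Python. This is a minimal sketch, assuming Python 3 (these notes do not say which version was used); the helper name describe is made up for illustration.

```python
# For one character, return its code point, its single-byte Latin-1 form,
# and its two-byte UTF-8 form.
def describe(ch):
    return (hex(ord(ch)), ch.encode("latin-1"), ch.encode("utf-8"))

for ch in "áéíóúñ":
    print(ch, describe(ch))
```

For each of the six characters the UTF-8 form starts with \xc3, and the Latin-1 byte has the same value as the code point.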

When printing these characters to the screen, Python handles them correctly. However, when writing a .csv file it is not able to handle the actual characters, so I need to check whether MySQL can convert those codes to the actual characters.
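
For comparison, here is a rough sketch of writing such rows to a .csv file with an explicit encoding. It assumes Python 3, where open() accepts an encoding argument; the function name write_rows and the sample values are hypothetical, not taken from the scraper.

```python
import csv

def write_rows(path, rows):
    # Encode the text as UTF-8 on the way out. newline="" leaves line
    # endings to the csv module, as its documentation recommends.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)

# Hypothetical usage:
# write_rows("output.csv", [["nombre", "año"], ["José", "1999"]])
```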

It seems that the best approach is to export the file as UTF-8 and then run a script over it. That way I will be able to populate the database with the proper encoding.

I have done more research into the encodings and found this behaviour in the data coming from the two databases:

mus_pop : character : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1

It should be noted that the mus_pop data was encoded as UTF-8, but I don't know which encoding Excel uses for BDMC.
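
One way to probe data of unknown encoding is to test whether its bytes decode cleanly as UTF-8. This is only a sketch, assuming Python 3; the helper name is_valid_utf8 is made up for illustration. The check can rule UTF-8 out, but it cannot identify which other encoding is in use.

```python
def is_valid_utf8(raw):
    # Return True if the byte string decodes cleanly as UTF-8.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Byte patterns from the table above, embedded in the word "año":
print(is_valid_utf8(b"a\xc3\xb1o"))  # mus_pop-style bytes
print(is_valid_utf8(b"a\xf1o"))      # BDMC-style bytes
```

The single bytes in the BDMC column do not form valid UTF-8 on their own, so the second check fails while the first succeeds.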

After a lot of digging, I found the iconv library for converting between encodings, and figured out that the tab-delimited, Windows-formatted files exported by Excel are encoded as CSISOLATIN1 (an alias for ISO-8859-1, also known as Latin-1). So the command line for converting the files exported by Excel is:

iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt
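
The same conversion can be sketched in Python in case iconv is not available. This assumes Python 3 and that the input really is ISO-8859-1; the function name latin1_to_utf8 and the file names in the usage line are placeholders.

```python
def latin1_to_utf8(src_path, dst_path):
    # Read the file as ISO-8859-1 (Latin-1) and write it back as UTF-8.
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, "r", encoding="latin-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Hypothetical usage, mirroring the iconv command:
# latin1_to_utf8("./input_file.txt", "./output_file.txt")
```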