Encoding scheme

I have been trying different encodings across the applications that I am using, and the one that works best is Western (MacRoman). With Spanish text, it correctly opens all accents, special characters and the ñ. Hopefully I will be able to use it in the databases as well.

 

It has also been hard to figure out how to properly export files from the spreadsheet. So far, the file format and encoding scheme that has worked best is ‘windows_formatted’.

Unicode equivalencies

It has been hard to work with Spanish characters (such as accents and ñ) when scraping the websites and working with a MySQL database. These are some of the Unicode equivalencies for different characters:

\xc3 appears to be the first byte of the two-byte UTF-8 sequence for each of these characters. The single-byte codes, which match the Unicode code points, are:

\xe1 á
\xe9 é
\xed í
\xf3 ó
\xfa ú
\xf1 ñ

When printing these characters to the screen, Python does the job. However, when writing a .csv file it cannot handle the actual characters, so I need to check whether MySQL can convert those codes to the actual characters.
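As a point of comparison, here is a minimal sketch of writing accented text to a .csv file, assuming Python 3, where the encoding is chosen when the file is opened. The rows and the file name are illustrative only and do not come from the real data.

```python
import csv
import os
import tempfile

# Illustrative rows only; not taken from the real databases.
rows = [["name", "city"], ["Begoña", "Málaga"], ["José", "Córdoba"]]

# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# encoding="utf-8" decides which bytes are written for á, ñ, etc.
# newline="" lets the csv module control the line endings itself.
with open(path, "w", encoding="utf-8", newline="") as fh:
    csv.writer(fh).writerows(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(path, "r", encoding="utf-8", newline="") as fh:
    read_back = list(csv.reader(fh))

print(read_back == rows)
```

If the file is later opened by a program that assumes a different encoding, the accents will still look wrong, so the encoding has to match on both sides.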

It seems that the best thing to do will be to export the UTF-8 file and then run a script over it. By doing this I will be able to populate the database with the proper encoding.

I have been doing more research into the encodings and found this behaviour in the data coming from both databases:

mus_pop : character : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1

It should be noted that the mus_pop data was encoded as UTF-8, but I don’t know which encoding Excel is using for BDMC.
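One way to test a candidate encoding is to decode both byte columns from the table and compare the results. The sketch below assumes Python 3 and tries ISO-8859-1 (Latin-1) for the BDMC column, which is only a guess at this point.

```python
# Byte sequences copied from the table: one list per database.
mus_pop = [b"\xc3\xa1", b"\xc3\xa9", b"\xc3\xad", b"\xc3\xb3", b"\xc3\xba", b"\xc3\xb1"]
bdmc = [b"\xe1", b"\xe9", b"\xed", b"\xf3", b"\xfa", b"\xf1"]

# mus_pop is known to be UTF-8; ISO-8859-1 for BDMC is the assumption being tested.
from_mus_pop = [raw.decode("utf-8") for raw in mus_pop]
from_bdmc = [raw.decode("iso-8859-1") for raw in bdmc]

# Print the raw bytes side by side and whether they decode to the same character.
for raw_a, raw_b, a, b in zip(mus_pop, bdmc, from_mus_pop, from_bdmc):
    print(raw_a, raw_b, a == b)
```

If every row compares equal, the guess is at least consistent with these six characters. It does not rule out other single-byte encodings that agree with Latin-1 on them.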

After lots of digging, I found iconv, a tool for converting between different encodings, and figured out that the tab-delimited, Windows-formatted files exported by Excel are encoded as CSISOLATIN1 (an iconv alias for ISO-8859-1, also known as Latin-1). So the command line for converting the files exported by Excel is:

iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt
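The same conversion can be sketched in Python 3 for cases where iconv is not available. The helper name latin1_to_utf8 and the sample file contents are hypothetical; only the source and target encodings come from the command above.

```python
import os
import tempfile


def latin1_to_utf8(src_path, dst_path):
    """Re-encode a text file from ISO-8859-1 to UTF-8, line by line."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, "r", encoding="iso-8859-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Small demonstration with a made-up tab-delimited line.
workdir = tempfile.mkdtemp()
input_file = os.path.join(workdir, "input_file.txt")
output_file = os.path.join(workdir, "output_file.txt")

with open(input_file, "wb") as fh:
    fh.write(b"a\xf1o\tse\xf1al\r\n")  # "año<TAB>señal" in Latin-1

latin1_to_utf8(input_file, output_file)

with open(output_file, "rb") as fh:
    converted = fh.read()

print(converted)
```

Each ñ goes in as the single byte \xf1 and comes out as the two bytes \xc3\xb1, which is the same pattern as in the table of both databases.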

Recursion problems using Python

For a couple of days I have been dealing with recursion problems when dumping data using pickle. The solution that I have used is pretty simple but effective: just raise the recursion limit by doing:

import sys
sys.setrecursionlimit(10000)

Since I set this new recursion limit, I have not had any recursion problems.
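A possible refinement is to raise the limit only around the code that needs it and restore it afterwards. This is a minimal sketch; recursion_limit and depth are hypothetical names, and the recursive function merely stands in for a deeply nested pickle dump.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring the old value on exit."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(limit, previous))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def depth(n):
    """Stand-in for deeply recursive work: recurses n levels and returns n."""
    return 0 if n == 0 else 1 + depth(n - 1)


before = sys.getrecursionlimit()
with recursion_limit(10000):
    result = depth(2000)  # deeper than Python's usual default limit of 1000

print(result, sys.getrecursionlimit() == before)
```

The pickle call would go inside the with block in place of depth(2000). A very high limit can still crash the interpreter if the real stack runs out, so the value is worth keeping as low as the data allows.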