\xc3 it seems that this one refers to the utf-8 encoding, so part of the conversion scheme is:
When printing these characters to screen, Python does the job, however, when writing a .csv file, it is not able to handle the actual character, so I need to have a look if the mySQL is able to convert those codes to the actual characters.
It seems that the best thing to do will be to export the utf-8 file and then run a script over the file. By doing this I will be able to populate the database with the proper encoding.
I have been doing more research in the encodings and found this behaviour with the data coming from both databases:
mus_pop : : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1
It should be noticed that mus_pop data was encoded as UTF-8, but I don’t know what encoder is being used for BDMC by Excel.
After lots of digging, I found the iconv library for converting to and from different encodings, and figured out that the encoding of Excel for the tab-delimited, Windows-Formatted files are of type CSISOLATIN-1. So the command-line for converting the files exported by Excel is
iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt
I have been dealing for a couple of days with recursion problems when dumping data using pickle. The solution that I have used has been pretty simple but effective: just to increment the recursion limit by doing:
Since I have set up this new recursion limit, I have not had any recursion problems again.
I was granted with access to the BDMC (“La Base de Datos de la Música Chilena”, compiled by the SCD, the Chilean Copyright Society). Here are some numbers related to the amount of information that this database have:
- 40132 total songs
- 32569 different songs (so, 7563 cover songs or with same name?)
- 3342 different artists
- 3085 different albums (some noise, though, as in the case of “Obras Sinfónicas en Vivo CD1″ and “Obras Sinfónicas en Vivo CD2″, and some possible identical names between releases)
- 79 different genres (tags)
- 432 different record labels
It is interesting to see how the BDMC has a different scope when comparing it with other sources of Chilean music information, as in musicapopular.cl, mus.cl, portaldisc.cl, and vccl.tv. BDMC has in it only songs that already have generated some copyrights for its authors, so most of the songs have been air played.
I have already scraped the data from all other sites, preliminary numbers are:
- 502 album reviews
- 332 interviews
- 564 concert review
- 3353 artist biographies (I still need to extract the full discographies)
- 3634 album reviews (although there is some noise because there are some non-Chilean artists)
- 1661 videoclips
The ‘Base de datos de la música chilena‘ (BDCH, the Chilean-music database) is a project developed by the Fundación Música de Chile (FMC, the Music of Chile Foundation) that allows radio stations with a secure, easy and fast access to the biggest online repository of chilean music.
The scope of genres of the collection is wide, ranging from rock, pop and hip-hop, to classical and folklore, and are structured in 52 genres and sub genres. For an easy navigation, BDCH provides with search methods that allow users to query for artist, song, album, publication year or genre.
The songs in the collection belong to artists that are already members of the ‘Sociedad Chilena del Derecho de Autor’ (SCD, the Chilean copyright society), so most of the time they have been air played in commercial radio stations across Chile.
The website provides access to the statistics about the contents in the database:
Total songs :: 39958
Total directories :: 3140
Total albums and directorios :: 3140
Total album artist :: 1818
Total song artists :: 3163
Average song length :: 03:36
Standard deviation of the song length :: 01:50
Longest song :: 01:00:46
Shortest song :: 00:00
Database total length :: 2393:36:04
Avg. file size :: 4.53 Mb
Total file size :: 176.69 Gb
Avg. file bitrate :: 176.02 Kbps