Database design brainstorming

I’ve been working in the design of the tables for representing all the entities that I would like to make available in my database. For the moment, what I can see for the repository are the following tables and the relations between them:
However, after a conversation with Corina MacDonald, she suggested me that all the album reviews, interviews, pictures, and any possible future material could be considered as a RESOURCE. Also, a composer, photographer, director, journalist and any other person could be part of a PEOPLE table. I have been trying to put these ideas into one schema, and the result, still fuzzy is:
After parsing some people linked to artists, I have been refining a little bit the DB structure. In the PEOPLE_ARTIST table I will be storing an array with the instruments that a person played in a specific artist, also the type of relations (e.g., one-to-many, many-to-many, etc) is now declared.

Unicode equivalencies

It has been hard to work with spanish characters (such as accents and ñ) when scrapping the websites and working with a mySQL database. These are some of the equivalencies from Unicode for different characters:

\xc3 it seems that this one refers to the utf-8 encoding, so part of the conversion scheme is:

\xe1 á
\xe9 é
\xed í
\xf3 ó
\xfa ú
\xf1 ñ

When printing these characters to screen, Python does the job, however, when writing a .csv file, it is not able to handle the actual character, so I need to have a look if the mySQL is able to convert those codes to the actual characters.

It seems that the best thing to do will be to export the utf-8 file and then run a script over the file. By doing this I will be able to populate the database with the proper encoding.

I have been doing more research in the encodings and found this behaviour with the data coming from both databases:

mus_pop : : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1

It should be noticed that mus_pop data was encoded as UTF-8, but I don’t know what encoder is being used for BDMC by Excel.

After lots of digging, I found the iconv library for converting to and from different encodings, and figured out that the encoding of Excel for the tab-delimited, Windows-Formatted files are of type CSISOLATIN-1. So the command-line for converting the files exported by Excel is

iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt

Recursion problems using Python

I have been dealing for a couple of days with recursion problems when dumping data using pickle. The solution that I have used has been pretty simple but effective: just to increment the recursion limit by doing:

sys.setrecursionlimit(10000)

Since I have set up this new recursion limit, I have not had any recursion problems again.

Very first numbers…

I was granted with access to the BDMC (“La Base de Datos de la Música Chilena”, compiled by the SCD, the Chilean Copyright Society). Here are some numbers related to the amount of information that this database have:

bdmc

  • 40132 total songs
  • 32569 different songs (so, 7563 cover songs or with same name?)
  • 3342 different artists
  • 3085 different albums (some noise, though, as in the case of “Obras Sinfónicas en Vivo CD1″ and “Obras Sinfónicas en Vivo CD2″, and some possible identical names between releases)
  • 79 different genres (tags)
  • 432 different record labels
However, there is some noise in this data because entries with different styles appear as different things (e.g.,  “DJ Méndez y Yoan Amor” and “DJ Méndez – Yoan Amor”, “A ti”, “A Ti”, and “A tí”). A process of normalization of the data is required for further processing!

It is interesting to see how the BDMC has a different scope when comparing it with other sources of Chilean music information, as in musicapopular.cl, mus.cl, portaldisc.cl, and vccl.tv. BDMC has in it only songs that already have generated some copyrights for its authors, so most of the songs have been air played.

I have already scraped the data from all other sites, preliminary numbers are:

mus.cl

  • 502 album reviews
  • 332 interviews
  • 564 concert review

musicapopular.cl

  • 3353 artist biographies (I still need to extract the full discographies)

portaldisc.cl

  • 3634 album reviews (although there is some noise because there are some non-Chilean artists)

vccl.tv

  • 1661 videoclips

Base de Datos de la Música Chilena

The ‘Base de datos de la música chilena‘ (BDCH, the Chilean-music database) is a project developed by the Fundación Música de Chile (FMC, the Music of Chile Foundation) that allows radio stations with a secure, easy and fast access to the biggest online repository of chilean music.

The scope of genres of the collection is wide, ranging from rock, pop and hip-hop, to classical and folklore, and are structured in 52 genres and sub genres. For an easy navigation, BDCH provides with search methods that allow users to query for artist, song, album, publication year or genre.

The songs in the collection belong to artists that are already members of the ‘Sociedad Chilena del Derecho de Autor’ (SCD, the Chilean copyright society), so most of the time they have been air played in commercial radio stations across Chile.

The website provides access to the statistics about the contents in the database:

Total songs :: 39958
Total directories :: 3140
Total albums and directorios :: 3140
Total album artist :: 1818
Total song artists :: 3163

Average song length :: 03:36
Standard deviation of the song length :: 01:50
Longest song :: 01:00:46
Shortest song :: 00:00
Database total length :: 2393:36:04

Avg. file size :: 4.53 Mb
Total file size :: 176.69 Gb

Avg. file bitrate :: 176.02 Kbps