Correcting the BDMC data

GENRES

I’ve been correcting of the musical genres that appear on the BDCH, as follows.

  • Hip Hop to Hip-Hop
  • ‘Heavy Metal’ to ‘Metal’
  • ‘Electronica’ to ‘Electronic’
  • ‘Folclor’ to ‘Folklore’
  • ‘Rock & Roll’ to ‘Rock and Roll’
  • ‘Bluegras’ to ‘Bluegrass’

However, it is necessary to group genres in meta-genres, to have ‘Metal’ and ‘Rock’ under the same label. This labour should be done by a musicologist, though.

Anyway, here is the distribution of genres according to the tags for songs in the BDMC database. It should be noticed, however, that the labels tag a whole album instead of a song.

it is very interesting to note the coverage of some genres in this database. 26% of the songs are reported as part of ‘Folklore’, while ‘Rock’ + ‘Pop’ sums a total of only 19%. ‘Jazz’ and ‘Classical’, on the other hand, report only a 3% of the total songs. This data could be interesting for the artist in these genres to see how they are being represented in the database.

Files:

  • BDCH-BMAT_27012012_WORKING.xlsx (data)
  • genres_from_BDCH-BMAT_27012012_WORKING.xlsx (plot)

SONG TITLES, ALBUM TITLES, AND ARTIST NAMES

To provide the best possible data to MB, I need to normalize the data to meet the MB Spanish style guide.

 

Matching BDMC and MusicBrainz

 

I have been querying MusicBrainz with the data from the BDMC, as a first outcome:

  • In the BDMC there is a total of:
    • 40132 entries
    • 3343 different artists
    • 3085 different albums
    • 32570 songs with different names

From that total, there are

  • 457 artist names (with the EXACT spelling that can be found in MusicBrainz)
  • 2886 artists that can not be found

This is only the 14% of the total amount. However, there are some artist names that are not properly spelled, but are close to the original, in the databases (e.g., ‘DJ Mendez’ instead of ‘DJ Méndez’, or ‘Alvaro Henriquez’ instead of ‘Álvaro Henríquez’), and those should be considered as found artistsAlso, some of the artist have the same name with other artist, such as ‘Mito’. The Chilean ‘Mito’ appears as the third entry in MB, without an explicit country, only with a disambiguation (‘Chilean’).

After running the script again considering if the entry in the BDCH matches some of the aliases for each artist in MB, the numbers are a bit better:

  • 565 (17%) artists were recognized
  • 56 (2%) have CL as the country (2%)
  • 72 (2%) have another country as the country type
So, if we extract this last number of artists from the database, which are very likely to not be chilean, we end up with 493 recognized artists.

 

I’ve been also correcting the many inconsistencies of the BDCH: renaming artist with different spellings and entering accents for artists without them. I have done 25% of it (10^4 entries) and the new numbers I got are:

  • 3308 different artists
  • 551 artists were recognized (17% of the total)
    • 466 possibly Chilean (14%)
      • 56 Chilean (explicitly declared)
      • 410 undeclared country
      • 177 groups (38% of the recognized possibly Chilean artists)
      • 142 people (30% of the recognized possibly Chilean artists)
      • 147 undefined (32% of the recognized possibly Chilean artists)
    • 75 non-chilean artists (should be discarded from the database)

Our idea is to provide MB with a big file with all data in our database with the corresponding MBIDs for artist, title, and album (if any).

  • From the 551 recognized artists using the out_correct file, there are:
    •  9454 titles (out_BDMC_w_artist_MBID)

During the last days I’ve been trying to solve the following problem: for the Chilean artist Dogma there are 8 different entries with the same score (100):

Score Name Sort Name Type Begin End
100 Døgma Døgma
100 Dogma (German trance artist) Dogma
100 Dogma (portuguese band) Dogma Group 1996 2003
100 Dogma (Brazilian progressive rock band) Dogma Group 1996
100 Dogma (Swiss trance duo Robin Mandrysch & Guido Walter) Dogma Group
100 Dogma (goa trance duo Damir Ludvig & Goran Stetic) Dogma Group
100 Dogma (Chilean artist) Dogma
100 Dogma (Italo-dance artist) Dogma

It seems that I need to take a look to the disambiguation field and look for the ‘Chile’ word (or a derivative) to consider it as the artist we are looking for.

Encoding scheme

I have been trying different encoding across the application that I am using and the one that works better is Western (MacRoman). Using Spanish, it opens all accents, special characters and the ñ. Hopefully I will be able to use in the DDBB.

 

It has been also hard to figure out how to properly export files from the spreadsheet, until now, the best file format and encoding scheme has been ‘windows_formatted’

musicapopular.cl parsing outcome

I just finished parsing the http://musicapopular.cl website. As a first outcome, I can see the following numbers:

  • There are 1547 bands in their database
  • There are 1826 people. This value does not mean 1825 soloist because in the database there is some people, such as managers, writers, journalists and music producers working for the Chilean music industry. These people should be part of the PEOPLE table and linked to a resource, if necessary.
  • Here is the list of the genres and the number of artists associated with it:
    • Balada 252
    • Bolero 113
    • Canción melódica 119
    • Vals 29
    • Nueva Ola 128
    • Neofolklore 34
    • TV pop 180
    • Canto 150
    • Trova 150
    • Fusión latinoamericana 439
    • Pop 833
    • Funk 112
    • Jazz 565
    • Tango 28
    • Ranchera 48
    • Corrido 48
    • Tropical 204
    • Folclor 270
    • Música orquestada 48
    • Canto a lo poeta 21
    • Canto Nuevo 57
    • Música andina 22
    • Música infantil 37
    • Nueva Canción Chilena 61
    • Rock 789
    • Cueca 112
    • Tonada 43
    • Electr?nica 224
    • Hiphop 145
    • Música experimental 208
    • Música típica 45
    • Foxtrot 13
    • Fusión étnica 55
    • Música clásica 38
    • Música contempor?nea 101
    • Música incidental 25
    • Rock progresivo 39
    • Música chilota 9
    • Proyección folclórica 62
    • Metal 78
    • Punk 55

 

 

As n idea, there is information about birth date and dead death for many musicians, it would be great to create a memorial with the dates.

Linking tables and finding duplicates

I have been working in parsing the already scrapped websites and I figured out that I need to know in advance the structure of the file that I want to generate. This is important because it must be delineated by the structure of the tables in the database.

The parsing outcome of the websites will be a .csv file all the data for a specific website. This is the lower-level moment with the data structured, so it is a good moment for assigning an id for the entities. If the id is assigned later, it will be harder to solve problems because we will be already in the structure of the database.

We could also use a script to ensure that there is no duplicates in the table, and if it finds duplicates we could take two options:

  • to write a second csv file with all the duplicate entries, while in the first one there are only the unique ones
  • to write a unique csv file where all the duplicate entries can be marked in the id column by a special character that can be fixed manually (e.g., id = *1234)

Scrapping and parsing entities

It is hard to know how to parse names with more than two names, especially considering the different structures that names can have. See the example:

  • Juan Pablo González is two names (‘Juan’ and ‘Pablo’) and a last name (‘González’)
  • Sergio del Río is one name (‘Sergio’) and a compound last name (‘del’ and ‘Río’) that must be sorted by ‘Río’ according to the Musicbrainz Style guide.
As in most cases the structure of the name is one name plus one last name I will use this structure, but the database must be manually fixed afterwards.
It is practically impossible to think in advance in all possible human errors when parsing the scrapped data from the database, so the developed methods should be as general as possible when parsing the data, otherwise errors could be rise. This kind of problems can be interpretations about how to enter the data, such as using <em> or <it> or <span class …> for emphasizing a certain text. As a consequence of this fact, time for manual data correction should be considered in the actual database afterwards.

Database design brainstorming

I’ve been working in the design of the tables for representing all the entities that I would like to make available in my database. For the moment, what I can see for the repository are the following tables and the relations between them:
However, after a conversation with Corina MacDonald, she suggested me that all the album reviews, interviews, pictures, and any possible future material could be considered as a RESOURCE. Also, a composer, photographer, director, journalist and any other person could be part of a PEOPLE table. I have been trying to put these ideas into one schema, and the result, still fuzzy is:
After parsing some people linked to artists, I have been refining a little bit the DB structure. In the PEOPLE_ARTIST table I will be storing an array with the instruments that a person played in a specific artist, also the type of relations (e.g., one-to-many, many-to-many, etc) is now declared.

Unicode equivalencies

It has been hard to work with spanish characters (such as accents and ñ) when scrapping the websites and working with a mySQL database. These are some of the equivalencies from Unicode for different characters:

\xc3 it seems that this one refers to the utf-8 encoding, so part of the conversion scheme is:

\xe1 á
\xe9 é
\xed í
\xf3 ó
\xfa ú
\xf1 ñ

When printing these characters to screen, Python does the job, however, when writing a .csv file, it is not able to handle the actual character, so I need to have a look if the mySQL is able to convert those codes to the actual characters.

It seems that the best thing to do will be to export the utf-8 file and then run a script over the file. By doing this I will be able to populate the database with the proper encoding.

I have been doing more research in the encodings and found this behaviour with the data coming from both databases:

mus_pop : : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1

It should be noticed that mus_pop data was encoded as UTF-8, but I don’t know what encoder is being used for BDMC by Excel.

After lots of digging, I found the iconv library for converting to and from different encodings, and figured out that the encoding of Excel for the tab-delimited, Windows-Formatted files are of type CSISOLATIN-1. So the command-line for converting the files exported by Excel is

iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt

Recursion problems using Python

I have been dealing for a couple of days with recursion problems when dumping data using pickle. The solution that I have used has been pretty simple but effective: just to increment the recursion limit by doing:

sys.setrecursionlimit(10000)

Since I have set up this new recursion limit, I have not had any recursion problems again.