Merging artist names between BDMC and musicapopular

I have done some preliminary testing on merging my artist name data from the BDMC with MB, with 17% of the artists recognized. I also search the MB aliases of each artist, and I compare lowercase versions of the names to avoid mismatches caused by different capitalization styles.

Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I have been comparing the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST I obtained:

  • There is a total of 4313 artists
  • 468 artist names are recognized in both databases
  • 3845 appear in one or the other database

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in either one of the two lists
  • 5712 total artists

What is the difference between the data in PEOPLE_ARTIST and ARTIST_INFO? Comparing both lists, there are:

  • 1636 artists in both lists
  • 1723 in either one of the two lists
Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but not in PEOPLE_ARTIST.
I checked the methods and remembered that PEOPLE_ARTIST was built to extract all the people who have worked over the years in a specific group, which is why ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file.
In other words, for future work comparing files, the ARTIST_INFO .txt file must be used.

 

Also, no artist with accents in the name appears in the intersection, so there is something wrong with the way the two lists, from musicapopular and the BDMC, are encoded.

I tested the BDMC file, exported as a Windows-formatted, tab-delimited txt file, and the ARTIST_INFO txt file. The two files used different encodings: the BDMC file was encoded as CSISOLATIN1 (ISO-8859-1), and ARTIST_INFO as UTF-8 (see previous post). Now that the two files have the same encoding scheme, we obtain much better results:

  • 973 artists in both lists
  • 4556 in either one of the two lists
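
The comparison above can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes each export has been reduced to one artist name per line, and the sample names and the helper `load_names` are hypothetical. Python's "latin-1" codec corresponds to ISO-8859-1.

```python
def load_names(raw: bytes, encoding: str) -> set[str]:
    """Decode a one-name-per-line export and return lowercased, stripped names."""
    text = raw.decode(encoding)
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


# Hypothetical sample data standing in for the two exported files.
bdmc_raw = "Víctor Jara\nLos Jaivas\nDJ Méndez\n".encode("latin-1")
info_raw = "víctor jara\nLos Jaivas\nMarco Aurelio\n".encode("utf-8")

# Each file is decoded with its own encoding before any comparison.
bdmc = load_names(bdmc_raw, "latin-1")
info = load_names(info_raw, "utf-8")

in_both = bdmc & info        # names present in both lists
in_only_one = bdmc ^ info    # names present in exactly one list
print(len(in_both), len(in_only_one))
```

Decoding both files into Unicode strings first is what lets accented names such as ‘Víctor Jara’ end up in the intersection.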

MusicBrainz Spanish Style Guide

I need to follow the MusicBrainz Style Guide for Spanish, which differs from the guides for other languages, so here is the list of rules taken from their website:

As a rule, we will use lower case for all words in a sentence with the exception of:

  • The first word of the title, or the word that follows a period.
  • Every proper noun: Dios, Jehová, Jesús, Luzbel, Platón, Pedro, María, Álvarez, Pantoja, Apolo, Calíope, Amadís de Gaula; Europa, España, Castilla, Valencia, Oviedo, Plaza Mayor; Cáucaso, Himalaya, Oriente, Occidente, Adriático, Estrella Polar, Támesis, el Ebro, la ciudad de México, la cordillera de los Andes.
  • For place names that incorporate the article, the article must be capitalized: Los Ángeles, La Haya, Las Palmas, La Habana, El Cairo.
  • “Tierra” is capitalized only when it refers to the planet: Madre Tierra, Mi tierra, El avión tomó tierra. “Sol” and “luna” are capitalized only in scientific texts, so in titles they are normally written in lower case: Bajo la luna llena, El sol de mediodía.
  • Divine attributes and titles of nobility, such as Creador, Redentor; Sumo Pontífice, Duquesa de Alva, Marqués de Osuna.
  • Names and nicknames of people, such as el Gran Capitán, Alfonso el Sabio, el Drogas (the article is capitalized only if it meets rule 1).
  • The nouns and adjectives that make up the name of an institution, body, or establishment: Real Academia de Música, Colegio Naval, Museo de Bellas Artes.
  • The names of religious or civil holidays, such as Epifanía, Navidad, Año Nuevo. The names of saints and the like: Virgen de Guadalupe, San Antonio.
  • Names of streets and places, but only the proper-noun part: calle Recoletos, plaza de Ríos
  • Roman numerals, such as I, X, MCMXC.

COMPLEMENTARY RULES

  • Capital letters must always carry their accent when the rules mentioned require one, for example Álvaro.
  • In words that begin with the letters Ch and Ll, only the first letter is capitalized. CHile or LLorente is incorrect: it must be Chile and Llorente.
  • After a colon (:), use lower case (Bailes tradicionales: la jota), except when the colon introduces a quotation: Juan dijo: Ten cuidado.
  • After an ellipsis (…), use a capital letter only if the ellipsis closes a statement: Fuimos al cine… pero llegamos tarde, or Fuimos al cine… La película estuvo muy bien.
  • After an opening exclamation or question mark, capitalize only at the beginning of the title or after a period (¿Quién dijo miedo?, ¿Quiénes somos? ¿de dónde venimos? ¿a dónde vamos?).
  • After an opening parenthesis, capitalize only if one of the preceding rules applies; as a general rule, start in lower case: Al otro lado del río (en directo) or PCCh (Partido Comunista de China).

CASES WHERE CAPITALS ARE NOT USED

  • The names of the days of the week, months, and seasons: lunes, abril, verano. They are capitalized only when they form part of historical dates, holidays, or proper nouns: Primero de Mayo, Primavera de Praga, Viernes Santo, Hospital Doce de Octubre.
  • The names of the musical notes: do, re, mi, fa, sol, la, si.
  • Personal proper names that have come to designate generically anyone who has the most characteristic or notable trait of the original: Mi tía Petra es una auténtica celestina; Siempre vas de quijote por la vida; Mi padre, de joven, era un donjuán.
  • Many objects, products, etc. that are called by the name of their inventor, discoverer, or the brand that made or popularized them (zepelín, braille, quevedos, rebeca, napoleón), or by the place where they are produced or originate (cabrales, rioja, damasco, fez). By contrast, authors' names applied to their works keep the initial capital (El Quijote de Cervantes).
  • Names of commercial brands, when referring not to a product of the brand but, generically, to any object with similar characteristics (zippo, wambas, moulinex, mobylette, derbi, minipimer).

The Chilean record labels contribution, demystified

Using the data from the BDMC, we can see that there are 37335 songs with an assigned record label (93% of the total BDMC entries), and that the number of labels is 334.

The distribution of released songs per record label is shown here:

We can see in this distribution the ‘short-head’ and the ‘long-tail’ of music.

If we zoom in on the ‘short-head’ and select labels with more than 100 published songs, we can see:

The first column, with 26% of the total published songs, corresponds to ‘Independent’ editions; these are isolated publications that cannot be counted as a record label. The figure shows some interesting results. Emi Odeón and Warner Music are the two biggest record labels (9% and 6% respectively), but Oveja Negra, the label of the SCD (the Chilean music copyright society), also has a 6% share of published songs after only 10 years of work. ‘Sello Azul’, a second SCD label devoted to new artists, has 3%, so the SCD is responsible for almost the same amount of Chilean music as the biggest record label. However, this effect may also arise because the SCD takes great care over the distribution of its own products, so most of the artists it represents are in its database.

If we do a per-year analysis, we can see:

The songs in the BDMC database belong mostly to the 1990s and 2000s, with 23% and 69% of the total respectively. Songs from before that period are only 1% of the total. The peak of published songs was in 2007, and the number decreased rapidly in the following years, reaching in 2011 the same number of songs as in 1993.

The merging problem (by Thierry Bertin-Mahieux)

  • When you integrate different sources of data, you start to add error, because you are matching data from sources that differ by some percentage (e.g., 10% error from Jamendo to MusicBrainz, 30% from MB to DBpedia)
  • MB by Thierry: “MB is a database of music knowledge”
  • for matching artists from different sources:
    • lowercase everything, remove spaces, and create all possible comparisons
    • Aerosmith – Run D.M.C -> aerosmith run dmc -> aerosmithrundmc, etc.
  • “Matching is imperfect, period” (Thierry Bertin-Mahieux), so we have big issues to solve on big databases:
    • Improve the matching algorithms
    • Deal with the noise
  • Also, matching is a trade-off, so start merging!
  • Under-merge: we might miss connections with other resources
  • Over-merge: we don’t know which artist is which anymore
  • Maybe something to try would be to use fingerprinting in order to find matches (G. Tzanetakis)
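
The normalization described in the notes above (lowercase, strip punctuation, remove spaces) can be sketched as below. The function name `match_keys` is hypothetical, and treating every non-alphanumeric character as punctuation is an assumption of this sketch, not something stated in the talk.

```python
import re


def match_keys(name: str) -> list[str]:
    """Return progressively more aggressive comparison keys for an artist name."""
    lowered = name.lower()
    # Drop punctuation: anything that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Collapse the runs of whitespace left behind by the removed punctuation.
    spaced = " ".join(no_punct.split())
    # The second key also removes the spaces between words.
    return [spaced, spaced.replace(" ", "")]


print(match_keys("Aerosmith – Run D.M.C"))
```

Two names from different sources can then be treated as candidates for the same artist when any of their keys coincide. The more aggressive the key, the higher the risk of over-merging.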

Correcting the BDMC data

GENRES

I have been correcting the musical genres that appear in the BDCH, as follows:

  • ‘Hip Hop’ to ‘Hip-Hop’
  • ‘Heavy Metal’ to ‘Metal’
  • ‘Electronica’ to ‘Electronic’
  • ‘Folclor’ to ‘Folklore’
  • ‘Rock & Roll’ to ‘Rock and Roll’
  • ‘Bluegras’ to ‘Bluegrass’
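
These corrections can be kept in a single lookup table so that they are applied consistently. A minimal sketch follows; the names `GENRE_FIXES` and `fix_genre` are hypothetical, and only the six corrections listed above are included.

```python
# Corrections listed above: original spelling -> normalized spelling.
GENRE_FIXES = {
    "Hip Hop": "Hip-Hop",
    "Heavy Metal": "Metal",
    "Electronica": "Electronic",
    "Folclor": "Folklore",
    "Rock & Roll": "Rock and Roll",
    "Bluegras": "Bluegrass",
}


def fix_genre(genre: str) -> str:
    """Return the corrected genre label, or the input unchanged if no fix applies."""
    cleaned = genre.strip()
    return GENRE_FIXES.get(cleaned, cleaned)


print(fix_genre("Folclor"), fix_genre("Jazz"))
```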

However, it is necessary to group genres into meta-genres, for example to have ‘Metal’ and ‘Rock’ under the same label. This work should be done by a musicologist, though.

Anyway, here is the distribution of genres according to the tags for songs in the BDMC database. It should be noted, however, that the labels tag a whole album rather than a single song.

It is very interesting to note the coverage of some genres in this database. 26% of the songs are reported as ‘Folklore’, while ‘Rock’ and ‘Pop’ together add up to only 19%. ‘Jazz’ and ‘Classical’, on the other hand, account for only 3% of the total songs. This data could be of interest to artists in these genres, to see how they are represented in the database.

Files:

  • BDCH-BMAT_27012012_WORKING.xlsx (data)
  • genres_from_BDCH-BMAT_27012012_WORKING.xlsx (plot)

SONG TITLES, ALBUM TITLES, AND ARTIST NAMES

To provide the best possible data to MB, I need to normalize the data to meet the MB Spanish style guide.
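
As a rough starting point, the basic capitalization rule (lower case everywhere except the first word and proper nouns) could be sketched as below. This is only an assumption about how the normalization might be automated: the proper-noun table `PROPER_NOUNS` is a hypothetical, hand-maintained list, and the sketch ignores punctuation, Roman numerals, and the other special cases in the guide.

```python
# Hypothetical, hand-maintained table of proper nouns that keep their capital.
PROPER_NOUNS = {"maría": "María", "chile": "Chile", "valparaíso": "Valparaíso"}


def spanish_title_case(title: str) -> str:
    """Lowercase a title, then capitalize the first word and known proper nouns."""
    words = title.lower().split()
    result = []
    for index, word in enumerate(words):
        if word in PROPER_NOUNS:
            result.append(PROPER_NOUNS[word])
        elif index == 0:
            # Only the first letter of the first word is capitalized.
            result.append(word[:1].upper() + word[1:])
        else:
            result.append(word)
    return " ".join(result)


print(spanish_title_case("Gracias A La Vida"))
print(spanish_title_case("CARTAS A MARÍA"))
```

Titles that trip any of the remaining rules would still need manual review.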

 

Matching BDMC and MusicBrainz

 

I have been querying MusicBrainz with the data from the BDMC. As a first outcome:

  • In the BDMC there is a total of:
    • 40132 entries
    • 3343 different artists
    • 3085 different albums
    • 32570 songs with different names

Of that total, there are:

  • 457 artist names that can be found in MusicBrainz with the EXACT spelling
  • 2886 artists that cannot be found

This is only 14% of the total. However, some artist names in the databases are not spelled properly but are close to the original (e.g., ‘DJ Mendez’ instead of ‘DJ Méndez’, or ‘Alvaro Henriquez’ instead of ‘Álvaro Henríquez’), and those should be counted as found artists. Also, some artists share their name with other artists, such as ‘Mito’. The Chilean ‘Mito’ appears as the third entry in MB, without an explicit country, only with a disambiguation (‘Chilean’).
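
One way to count near-misses such as ‘DJ Mendez’ as found is to compare accent-insensitive keys. The sketch below uses Python's standard `unicodedata` module; the helper names `strip_accents` and `loose_key` are hypothetical, and this is not the script that produced the numbers above.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (é -> e, ñ -> n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_key(name: str) -> str:
    """Accent-insensitive, case-insensitive comparison key."""
    return strip_accents(name).lower().strip()


print(loose_key("DJ Mendez") == loose_key("DJ Méndez"))
print(loose_key("Alvaro Henriquez") == loose_key("Álvaro Henríquez"))
```

Note that this also folds ñ into n, so two genuinely different names that differ only in that letter would collide. Any match found this way should be treated as a candidate, not a certainty.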

After running the script again, this time also checking whether the BDCH entry matches any of the aliases of each artist in MB, the numbers are a bit better:

  • 565 (17%) artists were recognized
  • 56 (2%) have CL as the country
  • 72 (2%) have another country as the country
So, if we remove this last group of artists, who are very likely not Chilean, from the database, we end up with 493 recognized artists.

 

I have also been correcting the many inconsistencies in the BDCH: unifying artists that appear with different spellings and adding missing accents to artist names. I have done 25% of it (10^4 entries), and the new numbers are:

  • 3308 different artists
  • 551 artists were recognized (17% of the total)
    • 466 possibly Chilean (14%)
      • 56 Chilean (explicitly declared)
      • 410 undeclared country
      • 177 groups (38% of the recognized possibly Chilean artists)
      • 142 people (30% of the recognized possibly Chilean artists)
      • 147 undefined (32% of the recognized possibly Chilean artists)
    • 75 non-Chilean artists (should be discarded from the database)

Our idea is to provide MB with one big file containing all the data in our database, with the corresponding MBIDs for artist, title, and album (if any).

  • Of the 551 recognized artists using the out_correct file, there are:
    •  9454 titles (out_BDMC_w_artist_MBID)

Over the last few days I have been trying to solve the following problem: for the Chilean artist Dogma, there are 8 different entries with the same score (100):

Score  Name                                                     Sort Name  Type   Begin  End
100    Døgma                                                    Døgma
100    Dogma (German trance artist)                             Dogma
100    Dogma (portuguese band)                                  Dogma      Group  1996   2003
100    Dogma (Brazilian progressive rock band)                  Dogma      Group  1996
100    Dogma (Swiss trance duo Robin Mandrysch & Guido Walter)  Dogma      Group
100    Dogma (goa trance duo Damir Ludvig & Goran Stetic)       Dogma      Group
100    Dogma (Chilean artist)                                   Dogma
100    Dogma (Italo-dance artist)                               Dogma

It seems that I need to look at the disambiguation field and search for the word ‘Chile’ (or a derivative) to identify the artist we are looking for.
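
A minimal sketch of that filter is shown below. The list of dictionaries is a hypothetical stand-in for the search results, not the actual MusicBrainz response format, and `pick_chilean` is an illustrative name.

```python
# Hypothetical representation of the tied search results listed above.
candidates = [
    {"name": "Døgma", "disambiguation": ""},
    {"name": "Dogma", "disambiguation": "German trance artist"},
    {"name": "Dogma", "disambiguation": "portuguese band"},
    {"name": "Dogma", "disambiguation": "Brazilian progressive rock band"},
    {"name": "Dogma", "disambiguation": "Chilean artist"},
    {"name": "Dogma", "disambiguation": "Italo-dance artist"},
]


def pick_chilean(results: list[dict]) -> list[dict]:
    """Keep the results whose disambiguation mentions 'Chile' or a derivative."""
    # A case-insensitive substring test matches 'Chile', 'Chilean', 'chileno', etc.
    return [r for r in results if "chile" in r["disambiguation"].lower()]


matches = pick_chilean(candidates)
print([m["disambiguation"] for m in matches])
```

If the filter returns exactly one result, it can be accepted automatically. Zero or several results would still need a manual check.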

Encoding scheme

I have been trying different encodings across the applications that I am using, and the one that works best is Western (MacRoman). With Spanish text, it displays all accents, special characters, and the ñ correctly. Hopefully I will be able to use it in the database.

 

It has also been hard to figure out how to properly export files from the spreadsheet. So far, the best file format and encoding scheme has been ‘windows_formatted’.

musicapopular.cl parsing outcome

I just finished parsing the http://musicapopular.cl website. As a first outcome, I can see the following numbers:

  • There are 1547 bands in their database
  • There are 1826 people. This value does not mean 1826 soloists, because the database also includes people such as managers, writers, journalists, and music producers working for the Chilean music industry. These people should be part of the PEOPLE table and linked to a resource, if necessary.
  • Here is the list of the genres and the number of artists associated with each:
    • Balada 252
    • Bolero 113
    • Canción melódica 119
    • Vals 29
    • Nueva Ola 128
    • Neofolklore 34
    • TV pop 180
    • Canto 150
    • Trova 150
    • Fusión latinoamericana 439
    • Pop 833
    • Funk 112
    • Jazz 565
    • Tango 28
    • Ranchera 48
    • Corrido 48
    • Tropical 204
    • Folclor 270
    • Música orquestada 48
    • Canto a lo poeta 21
    • Canto Nuevo 57
    • Música andina 22
    • Música infantil 37
    • Nueva Canción Chilena 61
    • Rock 789
    • Cueca 112
    • Tonada 43
    • Electrónica 224
    • Hiphop 145
    • Música experimental 208
    • Música típica 45
    • Foxtrot 13
    • Fusión étnica 55
    • Música clásica 38
    • Música contemporánea 101
    • Música incidental 25
    • Rock progresivo 39
    • Música chilota 9
    • Proyección folclórica 62
    • Metal 78
    • Punk 55

 

 

As an idea: there is information about the birth and death dates of many musicians, so it would be great to create a memorial with those dates.

Linking tables and finding duplicates

I have been working on parsing the already scraped websites, and I figured out that I need to know in advance the structure of the file that I want to generate. This is important because it must be determined by the structure of the tables in the database.

The parsing outcome for each website will be a .csv file with all the data for that website. This is the lowest level at which the data is structured, so it is a good moment to assign an id to each entity. If the id is assigned later, problems will be harder to solve because we will already be working within the structure of the database.

We could also use a script to ensure that there are no duplicates in the table. If it finds duplicates, we have two options:

  • write a second csv file with all the duplicate entries, keeping only the unique ones in the first file
  • write a single csv file in which every duplicate entry is marked in the id column with a special character, so that it can be fixed manually (e.g., id = *1234)
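
Both options can be sketched in a few lines. The column names `id` and `name`, the sample rows, and the rule that two rows are duplicates when their lowercased names coincide are all assumptions made for illustration.

```python
import csv
import io


def split_duplicates(rows, key="name"):
    """Option 1: return (unique rows, duplicate rows) as two separate lists."""
    seen, unique, duplicates = set(), [], []
    for row in rows:
        k = row[key].strip().lower()
        if k in seen:
            duplicates.append(row)
        else:
            seen.add(k)
            unique.append(row)
    return unique, duplicates


def mark_duplicates(rows, key="name"):
    """Option 2: keep every row, prefixing the id of repeated entries with '*'."""
    seen, marked = set(), []
    for row in rows:
        k = row[key].strip().lower()
        new_row = dict(row)
        if k in seen:
            new_row["id"] = "*" + str(row["id"])
        seen.add(k)
        marked.append(new_row)
    return marked


# Hypothetical parsed rows.
rows = [
    {"id": "1", "name": "Los Jaivas"},
    {"id": "2", "name": "Congreso"},
    {"id": "3", "name": "los jaivas"},
]
unique, duplicates = split_duplicates(rows)
marked = mark_duplicates(rows)

# Write the marked rows as csv (to an in-memory buffer here, instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(marked)
print(buffer.getvalue())
```

The second option keeps everything in one file, which makes manual fixing easier, at the cost of an id column that is no longer purely numeric until the marks are resolved.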

Scraping and parsing entities

It is hard to know how to parse names with more than two parts, especially considering the different structures that names can have. See these examples:

  • Juan Pablo González is two given names (‘Juan’ and ‘Pablo’) and a last name (‘González’)
  • Sergio del Río is one given name (‘Sergio’) and a compound last name (‘del’ and ‘Río’) that must be sorted by ‘Río’ according to the MusicBrainz Style Guide.
Since in most cases the structure is one given name plus one last name, I will use this structure, but the database must be manually fixed afterwards.
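
A deliberately naive sketch of such a parser is shown below. It assumes that the last word is the surname used for sorting and that everything before it goes after the comma; the function name `sort_name` and the exact placement of particles like ‘del’ are assumptions of this sketch, not rules confirmed by the style guide.

```python
def sort_name(full_name: str) -> str:
    """Build a 'Surname, rest' sort name, taking the last word as the surname."""
    parts = full_name.split()
    if len(parts) < 2:
        # Single-word names (e.g., band names or mononyms) are left unchanged.
        return full_name
    surname = parts[-1]
    rest = " ".join(parts[:-1])
    return f"{surname}, {rest}"


print(sort_name("Juan Pablo González"))
print(sort_name("Sergio del Río"))
```

Names with two surnames would be sorted by the second one under this rule, which is one of the cases that would have to be fixed manually afterwards.
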
It is practically impossible to anticipate every possible human error when parsing the scraped data, so the methods developed should be as general as possible; otherwise errors could arise. These problems often come from different interpretations of how to enter the data, such as using <em> or <it> or <span class …> to emphasize a certain text. As a consequence, time for manual correction of the data in the actual database should be planned for afterwards.