Looking for MBIDs for the whole database

After the ISMIR submission, I am working on finding MBIDs for all entries in the consolidated database. Since we already know which artists are on MusicBrainz, we compare the artist names of the more than 40K entries against those names. The numbers we get are:

  • There are 335 Chilean artists on MusicBrainz
  • There are 10546 songs by those 335 artists

We are going to iterate over the albums and songs of those artists to extract the MBIDs for their albums and songs.

Extracting and parsing Spanish-formatted dates

We have collected birth and death dates for many of the Chilean artists that belong to our databases. Most of this data comes from musicapopular.cl. However, this data doesn’t follow any formatting rules, and it seems that different people entered it with different criteria. In other words, there are several styles of Spanish dates.

As MB supports date fields in the form YYYY-MM-DD, we developed a script that uses regular expressions to parse all dates into this format. Done.
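
As an illustration of the kind of parsing involved, here is a minimal sketch. It is not the actual script: the date styles it handles (‘12 de marzo de 1965’, ‘12/03/1965’, ‘marzo de 1965’, ‘1965’) are assumed examples rather than the styles really found on musicapopular.cl, and the function name is hypothetical. Returning partial dates as YYYY-MM or YYYY is also an assumption of this sketch.

```python
import re

MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "setiembre": 9,
    "octubre": 10, "noviembre": 11, "diciembre": 12,
}

# "12 de marzo de 1965", "marzo de 1965", "12 marzo 1965"
TEXT_DATE = re.compile(
    r"(?:(\d{1,2})\s+(?:de\s+)?)?([a-záéíóú]+)\s+(?:de\s+|del\s+)?(\d{4})"
)
# "12/03/1965" or "12-3-1965" (assumed day-first order)
NUMERIC_DATE = re.compile(r"(\d{1,2})[/-](\d{1,2})[/-](\d{4})")
YEAR_ONLY = re.compile(r"\b(\d{4})\b")


def parse_spanish_date(text):
    """Return 'YYYY-MM-DD', 'YYYY-MM', 'YYYY', or None if nothing matches."""
    text = text.strip().lower()
    m = NUMERIC_DATE.search(text)
    if m:
        day, month, year = (int(g) for g in m.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"
    m = TEXT_DATE.search(text)
    if m and m.group(2) in MONTHS:
        month, year = MONTHS[m.group(2)], int(m.group(3))
        if m.group(1):
            return f"{year:04d}-{month:02d}-{int(m.group(1)):02d}"
        return f"{year:04d}-{month:02d}"
    # Fall back to a bare four-digit year anywhere in the text.
    m = YEAR_ONLY.search(text)
    return m.group(1) if m else None
```

Entries that return None would still need to be checked by hand.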

String comparison metrics

We are comparing 6 different metrics based on the Levenshtein distance for the string comparison between the BDMC and MB. These metrics are:

  • Original strings, Levenshtein ratio
  • Original strings, Jaro
  • ASCII strings, Levenshtein ratio
  • ASCII strings, Jaro
  • Lowercase, no-spaces, ASCII strings, Levenshtein ratio
  • Lowercase, no-spaces, ASCII strings, Jaro
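
The three string forms behind these six metrics could be produced roughly as follows. This is a sketch only: the function names are hypothetical, and stripping accents through Unicode NFKD decomposition is an assumption about how the ASCII versions are obtained.

```python
import unicodedata


def to_ascii(name):
    """Strip accents and drop any character with no ASCII decomposition."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def compact(name):
    """Lowercase, ASCII-only, with all whitespace removed."""
    return "".join(to_ascii(name).lower().split())


def variants(name):
    """The three string forms that each similarity measure is applied to."""
    return {"original": name, "ascii": to_ascii(name), "compact": compact(name)}
```

Note that characters without a decomposition (for example ‘ø’) are simply dropped by this approach, which may or may not be what we want.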

For the actual comparison, I will create a known dataset and measure precision and recall for the six metrics. But how large should this dataset be? Rule of thumb: (100/x)^2, where x is the margin of error, in percent, that you want. However, this holds for an infinite population, so we should apply a ‘finite population correction’ (FPC):

FPC = sqrt((N - n)/(N - 1)), where N is the population size and n is the sample size.
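
A small sketch of this arithmetic follows. The function names are hypothetical, and the corrected sample size n0 / (1 + (n0 - 1)/N) is the usual textbook adjustment for a finite population; it is added here as an assumption about how the correction would be applied to the sample size.

```python
import math


def rule_of_thumb_sample_size(margin_pct):
    """(100 / x)^2, with x the margin of error in percentage points."""
    return math.ceil((100.0 / margin_pct) ** 2)


def fpc(population, sample):
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return math.sqrt((population - sample) / (population - 1))


def corrected_sample_size(n0, population):
    """Sample size adjusted for a finite population: n0 / (1 + (n0 - 1) / N)."""
    return math.ceil(n0 / (1 + (n0 - 1) / population))
```

For example, a 5% margin of error gives 400 entries and a 10% margin gives 100; with a hypothetical population of 3000 names, the 400 entries shrink to 354.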

There are three interesting things that we should look at:

  1. How many artists have an exact match (i.e., they are already in the database)
  2. How many artists do not match (i.e., they are not in the database)
  3. How many artists match partially. Among these we need to see what is the best threshold to obtain the best precision and recall, and after that using the bootstrapping technique, create error bars for both metrics.

Although we have found that the best threshold for the string matching lies around 0.88 (for the lowercase, no-spaces, ASCII strings Levenshtein ratio), we are running a test query with a more *generous* threshold. Later, we will draw from those results the subset used to calculate the threshold that gives the best precision and recall values (or the best combined ‘f-score’).

The experiment we are thinking about has the following steps:

  1. Create a ground truth dataset. This subset will be randomly chosen among all artist names that actually exist in the BDMC and MB. As mentioned before, the size of this sample should be somewhere between 100 and 400 entries.
  2. Manually look for the correct MBID for these entries.
  3. Create a script for calculating precision and recall using the six metrics for all these entries.
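
Step 3 could look roughly like the sketch below. It assumes that each ground-truth entry is reduced to a (similarity, is_true_match) pair for one metric, and that an entry counts as retrieved when its similarity reaches the threshold; both the data layout and the function names are hypothetical.

```python
def precision_recall_f1(scored, threshold):
    """scored: list of (similarity, is_true_match) pairs from the ground truth."""
    tp = sum(1 for s, ok in scored if s >= threshold and ok)
    fp = sum(1 for s, ok in scored if s >= threshold and not ok)
    fn = sum(1 for s, ok in scored if s < threshold and ok)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scored, thresholds):
    """Candidate threshold with the highest F-score."""
    return max(thresholds, key=lambda t: precision_recall_f1(scored, t)[2])
```

Running this once per metric, over a grid of thresholds, would give the six precision/recall curves to compare.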

 

STEPS

  1. We created a script that takes random samples from those entries with distance values between 0.75 and 1.0, and which belong to ‘Chile’ or ‘None’. We randomly chose 400 entries in order to be able to discard those that are not in MB (in fact, this is the maximum number of entries with those constraints).
  2. We are marking as RED those BDMC entries that are not in MB, as GREEN those that are already in MB, and as YELLOW those that have a wrong entry in MB (false positives: same name but a different artist, so they should be considered as RED). To check false positives and negatives we use the data from musicapopular and the BDMC. The numbers we got are:
    • GREEN: 179
    • YELLOW: 98
    • RED: 123

We have realized several interesting facts:

  • There is a large number of artists within the Folklore genre. Most of these entries belong to compilation releases by Various Artists. Hence, most of these artists have just one or a few works associated with them.
  • There is a large number of false positives among artists with very common names such as Rachel, Quorum, Criminal, Polter, Rock Hudson, Trilogía, Twilight, and many others. The only way to determine whether a match is a true or a false positive is to research the releases and works of the artist. Fortunately, we have plenty of information from several websites to determine whether the artist already has an entry, and we can also check whether there is any reference to Chile in the MB entry, in the release country of a release, or in the relationship field.

After some trial and error, we have changed our script, and we are now running another one over all artists and without any threshold for the six metrics we are comparing. Also, the queries are now properly built, so an artist like ‘Gonzalo Yáñez’ is properly formatted in the URL query to MB. We think that with this approach we will be able to compare all the metrics at once. Once this was done for all 3142 different artists, we filtered again all entries with values in [0.75, 1[ but we left wrong countries in the set (we can’t mix pears and apples).

The settings for this latest approach gave us 335 artists in the range [0.75, 1[, and 464 artists with a value of [1]. Also, there are 2344 in the range [0., 0.75[. We counted correctly retrieved artists as ‘true positives’, those with the same name but referring to another artist as ‘false positives’, and those wrongly retrieved as ‘false negatives’. This selection should be discussed. The first plots are as follows:

It is strange that in plots 3, 4, 5, and 6 the recall keeps growing while the precision decreases only slightly. We think there is something wrong with our choice of true and false positives, and true and false negatives.

We have been designing a third approach for analyzing this data. The first part of this approach has to do with how many Chilean artists are in the database and how well the algorithm performs on them. Things to calculate:

1. Recall on just the ones for the different thresholds

2. Recall on the ‘ones’ and ‘twos’

But for other ‘real-world’ applications, where string matching could be used, we will:

3. Calculate precision and recall considering the ‘twos’ as correct matches (the string matching algorithm did its job),

4. Calculate precision and recall considering the ‘twos’ as incorrect matches.

Moreover, to estimate the error we will use the bootstrapping technique: creating a number of new populations from our sample population. In other words, if the sample population has 380 entries, we will create 1000 populations of the same size by drawing from it with replacement (this means that a new population can contain duplicate entries; otherwise we would get the same population again and again). We then compute the metric on each population, sort the 1000 values, and discard the 25 lowest and the 25 highest; the remaining range gives our error bounds for a 95% confidence interval.
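
A minimal sketch of this procedure, assuming the metric can be computed from a plain list of per-entry values (the function name and the fixed seed are hypothetical choices). Resampling is done with replacement, which is what allows duplicate entries in each new population.

```python
import random


def bootstrap_interval(sample, statistic, n_resamples=1000, discard=25, seed=0):
    """Percentile bootstrap: resample with replacement, then trim both tails."""
    rng = random.Random(seed)
    stats = sorted(
        statistic(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    # Dropping 25 values at each end of 1000 leaves the central 95%.
    trimmed = stats[discard:n_resamples - discard]
    return trimmed[0], trimmed[-1]
```

For instance, with a sample of 380 booleans marking whether each entry was matched correctly, `bootstrap_interval(sample, lambda xs: sum(xs) / len(xs))` returns the lower and upper bounds for the proportion of correct matches.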

 

Retrieving already assigned MBID from MB

We are comparing our database of songs (recordings), artists (artists), and albums (releases) with MusicBrainz. I am querying MB with an advanced search syntax for a specific recording, like:

recording+name artist:artist+name release:release+title comment:Chile* country:CL

A result list is retrieved, and in this round I am only considering the first entry as the potential response. Then, I compare the recording titles from the BDMC and MB using the Levenshtein distance and the total number of letters in the query and the response.

After some trial and error, and after leaving the script running for 30 hours, the results we obtained are:

  • 10615 entries with an MB artist ID
  • 9074 entries with an MB release ID
  • 8436 entries with an MB recording ID

However, looking at the resulting file I can see some things that I need to fix to get better results:

  1. After receiving a query response, I should only consider artist names with country=[‘CL’, ‘Chile’, ‘’]. I should also apply a string comparison, using the Levenshtein distance, between the names from the BDMC and MB.
  2. When a response comes back, we should iterate over a number of songs to see which of them is the proper match (sometimes the true positive is not the first option).

For the next round I will use the Levenshtein ratio instead of the distance. This approach returns a normalized value between 0 and 1 instead of the number of edits needed to go from one word to the other.

Big question to solve: which threshold values work best when comparing Levenshtein distances? By trying and comparing by hand I have arrived at 0.75 as a *nice* threshold, but this value should be revised (suggestion: make plots of different thresholds).

I implemented the iteration over the retrieved songs to find the (hopefully) true positive among all items with a 100% score (number 2 above). The amount of noise is decreasing, but I don’t have all the results yet.

About the artists’ names, it seems that the proper approach would be to query the MB database only by artist in order to refine the results. The query and filtering should be:

1) firstname+lastname country:CL comment:Chile*

2) Filter out those artist names whose country is not in [CL, ‘’]

3) Iterate over a number of names and retrieve the one with the highest Levenshtein ratio or Jaro distance.
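
Steps 2 and 3 could be sketched as below. The candidate layout (dicts with ‘name’ and ‘country’ keys) is hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in similarity from the standard library; it is a different algorithm from the Levenshtein ratio or Jaro, which would be passed in through the `score` parameter.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in score in [0, 1]; NOT the Levenshtein ratio or Jaro.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_candidate(name, candidates, allowed_countries=("CL", ""), score=similarity):
    """candidates: list of dicts with hypothetical 'name' and 'country' keys."""
    # Step 2: discard candidates that declare a country other than CL.
    kept = [c for c in candidates if c.get("country", "") in allowed_countries]
    if not kept:
        return None
    # Step 3: keep the candidate whose name is most similar to the query.
    return max(kept, key=lambda c: score(name, c["name"]))
```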

Just as a note, when asking MB for all Chilean artists (comment:Chile* country:CL), it returns 208 artists.

New approach for discovering already entered songs on MB

I am trying to develop a new, faster, cleaner approach to see which songs (recordings), albums (releases), and artists (artists) are already on MusicBrainz, and I’ve noticed that creating an advanced query like this:

Intro release:Polvo+de+Estrellas artist:Alberto+Plaza

or this equivalent:

http://musicbrainz.org/search?query=Intro+release%3APolvo%2Bde%2BEstrellas+artist%3AAlberto%2BPlaza&type=recording&limit=25&advanced=1

generates a good output.
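
For reference, the encoded URL above can be rebuilt from the plain query with the standard library. This is a sketch with a hypothetical helper name; `urlencode` percent-encodes the colons and the literal ‘+’ signs, and turns spaces into ‘+’. It also encodes accented characters as UTF-8, which is what a name like ‘Gonzalo Yáñez’ needs.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://musicbrainz.org/search"


def build_search_url(query, entity="recording", limit=25):
    """Percent-encode an advanced-search query into a search URL."""
    params = {"query": query, "type": entity, "limit": limit, "advanced": 1}
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("Intro release:Polvo+de+Estrellas artist:Alberto+Plaza")
```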

Then, parsing and comparing the output with BeautifulSoup


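from urllib.request import urlopen
from bs4 import BeautifulSoup
# BeautifulSoup parses markup, not URLs, so fetch the page first
soup = BeautifulSoup(urlopen('http://www.the.url').read(), 'html.parser')
out = soup.findAll('tbody')[0].findAll('a')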

we can easily obtain the links for the recording, release, and artist:

[Intro,
Alberto Plaza,
Polvo de Estrellas,
Intro,
Alberto Plaza,
Polvo de estrellas,
Milagro de Abril,
Alberto Plaza,
Polvo de Estrellas,
No Seas Cruel,
Alberto Plaza…]

However, I still need to figure out how to filter a recording, release, or artist without a perfect score, like this one:
Sol+Luminoso release:Indi artist:Indi or
http://musicbrainz.org/search?query=Sol%2BLuminoso+release%3AIndi+artist%3AIndi&type=recording&limit=5&advanced=1
it returns:

If we are *too* strict with the three fields we will lose some of the songs already in the database, so we need to allow some flexibility. For instance, the artist can be retrieved correctly, and also the release, but the recording can be wrong.

An approach could be to calculate the Levenshtein distance for each field and relate it to the number of letters in that field (more letters can imply a larger distance).
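
One way to sketch this idea is to divide the edit distance by the length of the longer string and to give each field its own tolerance. The tolerances, field names, and helper names below are hypothetical, and the pure-Python distance is only for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def relative_distance(a, b):
    """Edit distance divided by the length of the longer string (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def fields_match(query, response, tolerance):
    """True if every field stays within its own (hypothetical) tolerance."""
    return all(
        relative_distance(query[f].lower(), response[f].lower()) <= tolerance[f]
        for f in tolerance
    )
```

A stricter tolerance for the artist than for the release and recording, e.g. `{"artist": 0.1, "release": 0.25, "recording": 0.25}`, would be one starting point to test.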

(Preliminary tests on the level of *strictness* of each field indicate that artist is stricter (it doesn’t find anything for ‘viglienponi’), while release and recording are more relaxed (it retrieves the correct release for ‘anything artist:vigliensoni recording:twist and shout’).)
 

 

Merging artist names between BDMC and musicapopular

I have done preliminary tests merging my artist name data from the BDMC with MB, and 17% of the artists were recognized. I am also searching in the MB aliases of the artists, and I am comparing lowercase versions of the names in order to avoid differences in capitalization style.

Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I have been comparing the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST I obtained:

  • There is a total of 4313 artists
  • 468 artist names are recognized in both databases
  • 3845 appear in one or the other database

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in either one of the two lists
  • 5712 total artists

What is the difference of the data in PEOPLE_ARTIST and ARTIST_INFO? When comparing both lists, there are:

  • 1636 artists in both lists
  • 1723 in either one of the two lists
Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but it does not appear in PEOPLE_ARTIST.
I checked the methods and realized/remembered that PEOPLE_ARTIST was built to extract all the people who have worked over the years in a specific group, so that is why ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file.
In other words, for future work comparing files, the ARTIST_INFO .txt file must be used.

 

Also, there are no intersected artists with accents, so there is something wrong with the way both lists, from musicapopular and BDMC, are encoded.

I tested the BDMC file, exported as a Windows-formatted, tab-delimited txt file, and the ARTIST_INFO txt file. The two files used different encodings: I figured out that the BDMC file was encoded as CSISOLATIN1, and ARTIST_INFO as UTF-8 (see previous post). Now that the two files have the same encoding, we obtain a much better result:

  • 973 artists in both lists
  • 4556 in either one of the two lists
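
The effect can be reproduced with a toy example. The sketch below uses made-up file contents and a hypothetical helper; ‘latin-1’ is the Python codec name for ISO-8859-1 (CSISOLATIN1). Decoding each file with its own encoding is what lets accented names intersect.

```python
def read_names(raw_bytes, encoding):
    """Decode a tab-delimited export and return the first column as a set."""
    text = raw_bytes.decode(encoding)
    return {line.split("\t")[0].strip() for line in text.splitlines() if line.strip()}


# The same two names, as each file would store them on disk (toy data).
bdmc_bytes = "Gonzalo Yáñez\tRock\nLos Tres\tRock\n".encode("latin-1")
info_bytes = "Gonzalo Yáñez\nLos Tres\n".encode("utf-8")

# Decoding both with one codec silently corrupts the accented name ...
wrong = read_names(bdmc_bytes, "latin-1") & read_names(info_bytes, "latin-1")
# ... while decoding each with its own codec lets the accented name intersect.
right = read_names(bdmc_bytes, "latin-1") & read_names(info_bytes, "utf-8")
```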

MusicBrainz Spanish Style Guide

I need to follow the MusicBrainz Style Guide for Spanish, which differs from that of other languages, so here is the list of rules taken from their website:

As a rule, we will use lower case for all words in a sentence with the exception of:

  • The first word of the title, or the word that follows a full stop.
  • All proper nouns: Dios, Jehová, Jesús, Luzbel, Platón, Pedro, María, Álvarez, Pantoja, Apolo, Calíope, Amadís de Gaula; Europa, España, Castilla, Valencia, Oviedo, Plaza Mayor; Cáucaso, Himalaya, Oriente, Occidente, Adriático, Estrella Polar, Támesis, el Ebro, la ciudad de México, la cordillera de los Andes.
  • When a name (a place name, for example) incorporates the article, the article must be capitalized: Los Ángeles, La Haya, Las Palmas, La Habana, El Cairo.
  • “Tierra” is capitalized only when it refers to the planet: Madre Tierra, Mi tierra, El avión tomó tierra. “sol” and “luna” are capitalized only in scientific texts, so in titles they are normally written in lower case: Bajo la luna llena, El sol de mediodía.
  • Divine attributes and titles of nobility, such as Creador, Redentor; Sumo Pontífice, Duquesa de Alva, Marqués de Osuna.
  • Names and nicknames of people, such as el Gran Capitán, Alfonso el Sabio, el Drogas (the article is capitalized only if rule 1 applies).
  • The nouns and adjectives that make up the name of an institution, body, or establishment: Real Academia de Música, Colegio Naval, Museo de Bellas Artes.
  • The names of religious or civil holidays, such as Epifanía, Navidad, Año Nuevo. The names of saints and the like: Virgen de Guadalupe, San Antonio.
  • Names of streets and public spaces, but only the proper-noun part: calle Recoletos, plaza de Ríos.
  • Roman numerals, such as I, X, MCMXC.

ADDITIONAL RULES

  • Capital letters must always carry their accent when the accentuation rules require one, for example Álvaro.
  • In words that begin with the letters Ch and Ll, only the first letter is capitalized. CHile or LLorente is incorrect: it must be Chile and Llorente.
  • After a colon (:), lower case is used (Bailes tradicionales: la jota), except when the colon introduces a quotation: Juan dijo: Ten cuidado.
  • If an ellipsis (…) closes a sentence, the next word is capitalized. Fuimos al cine… pero llegamos tarde, or Fuimos al cine… La película estuvo muy bien.
  • After an opening exclamation or question mark, a capital is used only at the beginning of the title or after a full stop (¿Quién dijo miedo?, ¿Quiénes somos? ¿de dónde venimos? ¿a dónde vamos?).
  • After an opening parenthesis, a capital is used only if one of the previous rules applies; as a general rule we start with lower case: Al otro lado del río (en directo) or PCCh (Partido Comunista de China).

CASES WHERE CAPITALS ARE NOT USED

  • The names of the days of the week, months, and seasons: lunes, abril, verano. They are capitalized only when they are part of historical dates, holidays, or proper names: Primero de Mayo, Primavera de Praga, Viernes Santo, Hospital Doce de Octubre.
  • The names of the musical notes: do, re, mi, fa, sol, la, si.
  • Personal proper names that come to designate generically those who share the most characteristic or notable trait of the original: Mi tía Petra es una auténtica celestina; Siempre vas de quijote por la vida; Mi padre, de joven, era un donjuán.
  • Many objects, products, etc. that are named after their inventor, discoverer, or the brand that manufactured or popularized them (zepelín, braille, quevedos, rebeca, napoleón), or after the place where they are produced or originate (cabrales, rioja, damasco, fez). By contrast, authors’ names applied to their works keep the initial capital (El Quijote de Cervantes).
  • Names of trademarks, when they refer not to a product of the brand but, generically, to any object with similar characteristics (zippo, wambas, moulinex, mobylette, derbi, minipimer).

The Chilean record labels contribution, demystified

Using the data from the BDMC, we can see that there are 37335 songs with an assigned record label (93% of the BDMC total entries), and that the number of labels is 334.

The distribution of released songs per record label is shown here:

We can see in this distribution the ‘short-head’ and the ‘long-tail’ of music.

If we zoom in on the ‘short-head’ and select labels with more than 100 published songs, we can see:

We can see that the first column, with 26% of the total published songs, corresponds to ‘Independent’ editions; these are isolated publications that cannot be counted as a record label. The figure shows some interesting results. Emi Odeón and Warner Music are the two biggest record labels (9% and 6% respectively), but Oveja Negra, the label of the SCD (the Chilean music copyright society), also has a 6% share of the published songs after only 10 years of work. We can also see that ‘Sello Azul’, a second label belonging to the SCD and devoted to new artists, has 3%, so the SCD is responsible for almost the same amount of Chilean music as the biggest record label. However, this effect may also arise because the SCD takes great care over the distribution of its own products, so most of the artists it represents are in its database.

If we do a per-year analysis, we can see:

We can see that the songs in the BDMC database belong mostly to the 90’s and 00’s, with 23% and 69% of the total, respectively. Songs before that period are only 1% of the total. We can also see that the number of published songs peaked in 2007 and decreased rapidly in the following years, reaching in 2011 the same number of songs as in 1993.

Correcting the BDMC data

GENRES

I’ve been correcting the musical genres that appear in the BDCH, as follows.

  • ‘Hip Hop’ to ‘Hip-Hop’
  • ‘Heavy Metal’ to ‘Metal’
  • ‘Electronica’ to ‘Electronic’
  • ‘Folclor’ to ‘Folklore’
  • ‘Rock & Roll’ to ‘Rock and Roll’
  • ‘Bluegras’ to ‘Bluegrass’
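
These corrections can be kept as a simple lookup table so that they are applied the same way every time. The dictionary below only restates the list above; the function name is hypothetical.

```python
GENRE_FIXES = {
    "Hip Hop": "Hip-Hop",
    "Heavy Metal": "Metal",
    "Electronica": "Electronic",
    "Folclor": "Folklore",
    "Rock & Roll": "Rock and Roll",
    "Bluegras": "Bluegrass",
}


def normalize_genre(genre):
    """Return the corrected genre label, or the input unchanged if no fix applies."""
    genre = genre.strip()
    return GENRE_FIXES.get(genre, genre)
```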

However, it is necessary to group genres into meta-genres, to have ‘Metal’ and ‘Rock’ under the same label. This work should be done by a musicologist, though.

Anyway, here is the distribution of genres according to the tags for songs in the BDMC database. It should be noticed, however, that the labels tag a whole album instead of a song.

It is very interesting to note the coverage of some genres in this database. 26% of the songs are reported as ‘Folklore’, while ‘Rock’ + ‘Pop’ add up to only 19%. ‘Jazz’ and ‘Classical’, on the other hand, account for only 3% of the total songs. This data could be interesting for the artists in these genres to see how they are represented in the database.

Files:

  • BDCH-BMAT_27012012_WORKING.xlsx (data)
  • genres_from_BDCH-BMAT_27012012_WORKING.xlsx (plot)

SONG TITLES, ALBUM TITLES, AND ARTIST NAMES

To provide the best possible data to MB, I need to normalize the data to meet the MB Spanish style guide.

 

Matching BDMC and MusicBrainz

 

I have been querying MusicBrainz with the data from the BDMC. As a first outcome:

  • In the BDMC there is a total of:
    • 40132 entries
    • 3343 different artists
    • 3085 different albums
    • 32570 songs with different names

From that total, there are:

  • 457 artist names found in MusicBrainz with the EXACT spelling
  • 2886 artists that cannot be found

This is only 14% of the total. However, some artist names in the databases are not spelled properly but are close to the original (e.g., ‘DJ Mendez’ instead of ‘DJ Méndez’, or ‘Alvaro Henriquez’ instead of ‘Álvaro Henríquez’), and those should be counted as found artists. Also, some artists share their name with other artists, such as ‘Mito’. The Chilean ‘Mito’ appears as the third entry in MB, without an explicit country, only with a disambiguation (‘Chilean’).

After running the script again considering if the entry in the BDCH matches some of the aliases for each artist in MB, the numbers are a bit better:

  • 565 (17%) artists were recognized
  • 56 (2%) have CL as the country
  • 72 (2%) have another country
So, if we remove this last group of artists from the database, since they are very likely not Chilean, we end up with 493 recognized artists.

 

I’ve also been correcting the many inconsistencies of the BDCH: unifying artists with different spellings and adding accents to artist names that lack them. I have done 25% of it (10^4 entries) and the new numbers I got are:

  • 3308 different artists
  • 551 artists were recognized (17% of the total)
    • 466 possibly Chilean (14%)
      • 56 Chilean (explicitly declared)
      • 410 undeclared country
      • 177 groups (38% of the recognized possibly Chilean artists)
      • 142 people (30% of the recognized possibly Chilean artists)
      • 147 undefined (32% of the recognized possibly Chilean artists)
    • 75 non-Chilean artists (should be discarded from the database)

Our idea is to provide MB with a big file with all data in our database with the corresponding MBIDs for artist, title, and album (if any).

  • From the 551 recognized artists using the out_correct file, there are:
    •  9454 titles (out_BDMC_w_artist_MBID)

Over the last few days I’ve been trying to solve the following problem: for the Chilean artist Dogma there are 8 different entries with the same score (100):

Score | Name | Sort Name | Type | Begin | End
100 | Døgma | Døgma | | |
100 | Dogma (German trance artist) | Dogma | | |
100 | Dogma (portuguese band) | Dogma | Group | 1996 | 2003
100 | Dogma (Brazilian progressive rock band) | Dogma | Group | 1996 |
100 | Dogma (Swiss trance duo Robin Mandrysch & Guido Walter) | Dogma | Group | |
100 | Dogma (goa trance duo Damir Ludvig & Goran Stetic) | Dogma | Group | |
100 | Dogma (Chilean artist) | Dogma | | |
100 | Dogma (Italo-dance artist) | Dogma | | |

It seems that I need to look at the disambiguation field and search for the word ‘Chile’ (or a derivative) to identify the artist we are looking for.
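
A minimal sketch of that filter follows, using the result list above as (name, disambiguation) pairs. The pair layout and the function name are hypothetical; the pattern matches any word starting with ‘chile’, regardless of case.

```python
import re

CHILE = re.compile(r"\bchile", re.IGNORECASE)  # Chile, Chilean, chileno, ...

# (name, disambiguation) pairs copied from the result list above.
results = [
    ("Døgma", ""),
    ("Dogma", "German trance artist"),
    ("Dogma", "portuguese band"),
    ("Dogma", "Brazilian progressive rock band"),
    ("Dogma", "Swiss trance duo Robin Mandrysch & Guido Walter"),
    ("Dogma", "goa trance duo Damir Ludvig & Goran Stetic"),
    ("Dogma", "Chilean artist"),
    ("Dogma", "Italo-dance artist"),
]


def chilean_entries(entries):
    """Keep only the entries whose disambiguation mentions Chile."""
    return [e for e in entries if CHILE.search(e[1])]
```

If more than one entry survives the filter, the string comparison on the names would still be needed to choose among them.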