Looking for MBIDs for the whole database

After the ISMIR submission, I am working on finding MBIDs for all entries in the consolidated database. Since we already know which artists are on MusicBrainz, we are comparing the artist names of our more than 40K entries against those names. The numbers we get are:

  • There are 335 Chilean artists on MusicBrainz
  • There are 10546 songs by those 335 artists

We are going to iterate over the albums and songs of those artists to extract the MBIDs of their albums and songs, as sketched below.
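A minimal sketch of that iteration, assuming the musicbrainzngs bindings and their browse calls; artist_mbids is a hypothetical list holding the 335 MBIDs, and pagination and error handling are omitted:

import musicbrainzngs

musicbrainzngs.set_useragent('bdmc-matcher', '0.1')  # the MB API requires a user agent

artist_mbids = []  # hypothetical: the MBIDs of the 335 Chilean artists

for mbid in artist_mbids:
    # all releases (albums) credited to this artist
    releases = musicbrainzngs.browse_releases(artist=mbid, limit=100)
    # all recordings (songs) credited to this artist
    recordings = musicbrainzngs.browse_recordings(artist=mbid, limit=100)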

Extracting and parsing Spanish-formatted dates

We have collected birth and death dates for many of the Chilean artists in our database. Most of this data comes from musicapopular.cl. However, the data doesn't follow any formatting rules, and it seems that different people entered it with different criteria. In other words, there are several different styles of Spanish dates.

As MB supports fields for date periods in the form YYYY-MM-DD, we developed a script using regular expressions to parse all dates into this format. Done.
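A minimal sketch of the idea, assuming dates written like '5 de abril de 1973' (the real script handles more variants):

import re

# month table; the real script covers more spellings and partial dates
MONTHS = {'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4,
          'mayo': 5, 'junio': 6, 'julio': 7, 'agosto': 8,
          'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12}

def parse_spanish_date(text):
    """Turn '5 de abril de 1973' into '1973-04-05', or return None."""
    m = re.search(r'(\d{1,2})\s+de\s+(\w+)\s+de\s+(\d{4})', text, re.IGNORECASE)
    if m and m.group(2).lower() in MONTHS:
        return '%04d-%02d-%02d' % (int(m.group(3)),
                                   MONTHS[m.group(2).lower()],
                                   int(m.group(1)))
    return None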

String comparison metrics

We are comparing six different metrics for the string comparison between the BDMC and MB: the Levenshtein ratio and the Jaro similarity, each computed over three normalizations of the strings (sketched in code after the list). These metrics are:

  • Levenshtein ratio on the original strings
  • Jaro similarity on the original strings
  • Levenshtein ratio on ASCII-folded strings
  • Jaro similarity on ASCII-folded strings
  • Levenshtein ratio on lowercase, no-spaces, ASCII-folded strings
  • Jaro similarity on lowercase, no-spaces, ASCII-folded strings
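A sketch of how the six scores can be computed, assuming the python-Levenshtein package (which provides both ratio and jaro):

import unicodedata
import Levenshtein  # python-Levenshtein package

def to_ascii(s):
    """Fold accented characters: 'Yáñez' -> 'Yanez'."""
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

def six_metrics(a, b):
    """Levenshtein ratio and Jaro similarity over the three normalizations."""
    pairs = [(a, b),
             (to_ascii(a), to_ascii(b)),
             (to_ascii(a).lower().replace(' ', ''),
              to_ascii(b).lower().replace(' ', ''))]
    return [f(x, y) for x, y in pairs for f in (Levenshtein.ratio, Levenshtein.jaro)]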

For the actual comparison, I will create a known dataset and measure precision and recall for the six metrics. But how large should this dataset be? Rule of thumb: n = (100/x)^2, where x is the margin of error (in percent) that you want. However, this holds for an infinite population, so we should apply a 'finite population correction' (FPC):

FPC = sqrt((N - n)/(N - 1)), where N is the population size and n is the sample size.
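As a worked example, a 5% margin of error gives n = (100/5)^2 = 400 entries; taking the 3142 distinct artist names of the experiment below as the population, FPC = sqrt((3142 - 400)/(3142 - 1)) ≈ 0.93.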

We should look at three interesting things:

  1. How many artists have an exact match (i.e., they are already in the database)
  2. How many artists do not match (i.e., they are not in the database)
  3. How many artists match partially. Among these, we need to find the threshold that yields the best precision and recall, and then, using the bootstrapping technique, create error bars for both metrics.

Although we have found that the threshold for the string matching sits around 0.88 (for the Levenshtein ratio on lowercase, no-spaces, ASCII-folded strings), we are running a test query with a more *generous* threshold. Later, we will extract from those results the subset used to find the threshold that yields the best precision and recall values (or the combined F-score).

The experiment we are thinking about has the following steps:

  1. Create a ground-truth dataset. This subset will be randomly chosen among all artist names that actually exist in both the BDMC and MB. As mentioned before, its size should be somewhere between 100 and 400 entries.
  2. Manually look for the correct MBID for each of these entries.
  3. Create a script that calculates precision and recall using the six metrics for all these entries (a minimal sketch follows).
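A minimal sketch of the calculation in step 3, given raw counts per metric and threshold:

def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score from raw counts."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f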

 

STEPS

  1. We created a script that takes random samples from the entries with distance values between 0.75 and 1.0 and whose country is 'Chile' or 'None'. We randomly chose 400 entries so that we can afford to discard those that are not in MB (in fact, this is the maximum number of entries under those constraints).
  2. We are marking as RED the BDMC entries that are not in MB, GREEN those that are already in MB, and YELLOW those that match a wrong entry in MB (false positives: same name but a different artist, so they should be counted as RED). To check false positives and negatives we use the data from musicapopular and the BDMC. The numbers we got are:
    • GREEN: 179
    • YELLOW: 98
    • RED: 123

We have noticed several interesting facts:

  • There is a large number of artists within the Folklore genre. Most of these entries come from compilation releases credited to Various Artists; hence, most of these artists have just one or a few works associated with them.
  • There is a large number of false positives among artists with very common names such as Rachel, Quorum, Criminal, Polter, Rock Hudson, Trilogía, Twilight, and many others. The only way to determine whether such a match is a true or false positive is to research the releases and works of the artist. Fortunately, we have plenty of information from several websites to determine whether the artist already has an entry, and we can also check for any reference to Chile in the MB entry, in the release country of a release, or in the relationship field.

After some trial and error, we changed our script and are now running another pass over all artists, without any threshold, for the six metrics we are comparing. The queries are now properly built, so an artist like 'Gonzalo Yáñez' is correctly encoded in the URL query to MB (see the sketch below). We think that with this approach we will be able to compare all the metrics at once. Once this was done for all 3142 distinct artists, we filtered again all entries with values in [0.75, 1[, but we left the wrong countries in the set (we can't mix apples and oranges).
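The encoding fix amounts to percent-encoding the UTF-8 name before it goes into the query URL; for example, in Python 3:

from urllib.parse import quote_plus

quote_plus('Gonzalo Yáñez')  # -> 'Gonzalo+Y%C3%A1%C3%B1ez'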

The settings for this latest approach gave us 335 artists in the range [0.75, 1[ and 464 artists with a value of exactly 1. There are also 2344 in the range [0, 0.75[. We considered correctly retrieved artists as true positives, those with the same name but referring to another artist as false positives, and those wrongly retrieved as false negatives. This selection should be discussed. The first plots are as follows:

It is strange that in plots 3, 4, 5, and 6 the recall keeps growing while the precision decreases only slightly. We think something is wrong with our choice of true and false positives and negatives.

We have been designing a third approach for analyzing this data. The first part of this approach concerns how many Chilean artists are in the database and how well the algorithm performs on them. Things to calculate:

1. Recall on just the 'ones' for the different thresholds

2. Recall on the 'ones' and 'twos'

But for other 'real-world' applications where string matching could be used, we will:

3. Calculate precision and recall considering 'twos' as correct matches (the string matching algorithm did its job),

4. Calculate precision and recall considering 'twos' as incorrect matches.

Moreover, to estimate the error we will use the bootstrapping technique: creating a number of new populations starting from the sample population. In other words, if our sample population has 380 entries, we will create 1000 populations by resampling it with replacement (this means we can have duplicate entries in a new population; otherwise we would get the same population again and again). Then, for each metric, we can discard the 25 lowest and 25 highest values, and we will have our error boundaries for a 95% confidence interval.
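A minimal sketch of that bootstrap, where stat is any metric computed on a resampled population (precision, recall, etc.):

import random

def bootstrap_ci(sample, stat, n_boot=1000):
    """Resample WITH replacement n_boot times, compute the statistic each
    time, then drop the 25 lowest and 25 highest values (95% CI for n_boot=1000)."""
    values = sorted(stat([random.choice(sample) for _ in sample])
                    for _ in range(n_boot))
    return values[25], values[-26]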

 

Retrieving already assigned MBIDs from MB

We are comparing our database of songs (recordings), artists (artists), and albums (releases) with MusicBrainz. I am querying MB with the advanced search syntax for a specific recording, like:

recording+name artist:artist+name release:release+title comment:Chile* country:CL

A result list is retrieved, and in this round I am only considering the first entry as the potential response. Then I compare the recording titles from the BDMC and MB using the Levenshtein distance and the total number of letters in the query and the response.
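A sketch of how such a query can be built, with hypothetical field values; the Lucene string is percent-encoded into the search URL of the web service:

from urllib.parse import quote_plus

def recording_search_url(title, artist, release):
    """Build the advanced-search URL for one BDMC entry (hypothetical helper)."""
    lucene = '%s artist:%s release:%s comment:Chile* country:CL' % (title, artist, release)
    return 'http://musicbrainz.org/ws/2/recording/?query=%s&limit=5' % quote_plus(lucene)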

After some trial and error, and leaving the script running for 30 hours, the results we obtained are:

  • 10615 entries with an MB artist ID
  • 9074 entries with an MB release ID
  • 8436 entries with an MB recording ID

However, taking a look at the resulting file, I can see some things I need to fix to get better results:

  1. After receiving a query response, I should only consider artist names with country in ['CL', 'Chile', '']. I should also apply a string comparison using the Levenshtein distance between the names from the BDMC and MB.
  2. When a response comes back, we should iterate over a number of songs to see which one of them is the proper match (sometimes the true positive is not the first option).

For the next round I will use the Levenshtein ratio instead of the distance. The ratio returns a normalized value between 0 and 1 instead of the number of edits needed to go from one word to the other.
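The difference, on a hypothetical pair of names:

import Levenshtein

Levenshtein.distance('Los Jaivas', 'Los Jaibas')  # 1 (number of edits)
Levenshtein.ratio('Los Jaivas', 'Los Jaibas')     # 0.9 (normalized to [0, 1])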

Big question to solve: which values are best when comparing Levenshtein distances? Trying and comparing by hand, I have arrived at 0.75 as a *nice* threshold, but this value should be revised (suggestion: make plots for different thresholds).

I implemented the iteration over the retrieved songs to match the (hopefully) true positive among all items with a 100% score (number 2 above). The amount of noise is decreasing, but I don't have all the results yet.

Regarding the artists' names, it seems that the proper approach would be to query the MB database by artist only in order to refine the results. The query and filtering should be:

1) firstname+lastname country:CL comment:Chile*

2) Filter out artist names whose country is not in [CL, '']

3) Iterate over a number of names and retrieve the one with the highest Levenshtein ratio or Jaro similarity (sketched below).
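A sketch of steps 2 and 3, assuming each candidate is a dict as returned by the MB artist search (the helper name is hypothetical):

import Levenshtein

def best_artist(name, candidates):
    """Drop candidates with a foreign country, then keep the best string match."""
    kept = [c for c in candidates if c.get('country') in (None, '', 'CL')]
    if not kept:
        return None
    return max(kept, key=lambda c: Levenshtein.ratio(name.lower(), c['name'].lower()))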

Just as a note: when asking MB for all Chilean artists (comment:Chile* country:CL), it returns 208 artists.

New approach for discovering already entered songs on MB

I am trying to develop a new, faster, cleaner approach to see which songs (recordings), albums (releases), and artists (artists) are already on MusicBrainz, and I've noticed that an advanced query like this:

Intro release:Polvo+de+Estrellas artist:Alberto+Plaza

or its URL equivalent:

http://musicbrainz.org/search?query=Intro+release%3APolvo%2Bde%2BEstrellas+artist%3AAlberto%2BPlaza&type=recording&limit=25&advanced=1

generates a good output.

Then, parsing and comparing the output with BeautifulSoup:


import requests
from bs4 import BeautifulSoup

# fetch the page first; BeautifulSoup parses markup, not URLs
soup = BeautifulSoup(requests.get('http://www.the.url').text, 'html.parser')
out = soup.findAll('tbody')[0].findAll('a')

we can easily obtain the links for the recording, release, and artist:

[Intro,
Alberto Plaza,
Polvo de Estrellas,
Intro,
Alberto Plaza,
Polvo de estrellas,
Milagro de Abril,
Alberto Plaza,
Polvo de Estrellas,
No Seas Cruel,
Alberto Plaza…]

However, I still need to figure out how to filter a recording, release, or artist without a perfect score, as with this query:

Sol+Luminoso release:Indi artist:Indi

or

http://musicbrainz.org/search?query=Sol%2BLuminoso+release%3AIndi+artist%3AIndi&type=recording&limit=5&advanced=1

If we are *too* strict with the three fields, we will lose some of the songs already in the database, so we need to allow some flexibility. For instance, the artist and the release can be retrieved correctly while the recording is wrong.

One approach could be to calculate the Levenshtein distance for each field and relate it to the number of letters in that field (more letters can imply a larger distance).

(Preliminary tests on the level of *strictness* of each field indicate that while the artist field is stricter (it doesn't find anything for 'viglienponi'), the release and recording fields are more relaxed (it retrieves the correct release for 'anything artist:vigliensoni recording:twist and shout').)
 

 

Digging into MusicBrainz NGS webservices

I have been dealing with the following problem: when searching for the ‘Dogma’ artist on the MB website, I obtain many artists named ‘Dogma’.

For our project I am only interested in the Chilean artist, which is the one that has a disambiguation comment field with the ‘Chilean artist’ note.

However, when searching for the ‘Dogma’ artist using the musicbrainzngs.search_artists method, it outputs:

{'artist-list': [{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

Hence, the MB NGS Python bindings do not provide, by default, a way to look into this field, so I modded the distribution in order to retrieve it.
Now, when I query MB for ‘Dogma’:

m.search_artists('Dogma', limit = 1, offset = 2)
http://musicbrainz.org/ws/2/artist/?query=Dogma&limit=1&offset=2

I obtain

{'artist-list': [{'alias-list': ['Dogma'],
'disambiguation': 'Chilean artist',
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'}]}

, which is what I am looking for. From this point, I just need to iterate over a number of artists and see if any of them has (see the sketch after the list):

  • a ‘country’:’CL’ value
  • or ‘chile’ within the value of any key:value pair (re.search('chile', value))
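A minimal sketch of that check over the artist dicts returned by the (modded) bindings:

import re

def looks_chilean(artist):
    """True if the entry has country CL or mentions 'chile' in any string value."""
    if artist.get('country') == 'CL':
        return True
    return any(re.search('chile', v, re.IGNORECASE)
               for v in artist.values() if isinstance(v, str))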

However, a second problem that I have had is that when I search using the same search_artists method:

search_artists(query='', limit=None, offset=None, **fields)

Specifying these key:values for the **fields:
{'tags':'uk', 'tags':'england', 'country':'GB'}
and doing this query
m.search_artists('Dogma', {'tags':'uk', 'tags':'england', 'country':'GB'})
I get the same list as before, so these extra fields are not narrowing the search. In hindsight this makes sense: a Python dict literal cannot hold the key 'tags' twice (the second value overwrites the first), and a dict passed positionally binds to the limit parameter rather than expanding into **fields, which would require keyword arguments (e.g. country='GB'). The unnarrowed list:

{'artist-list': [{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

A third problem is that if I do:
m.search_artists('Dogma', limit = 1, {'tags':'uk', 'tags':'england', 'country':'GB'})
I obtain this error:
SyntaxError: non-keyword arg after keyword arg (, line 1)
This, however, is not a bug in the module but a Python language rule: positional arguments cannot follow keyword arguments. The extra fields have to be passed as keyword arguments instead, e.g. m.search_artists('Dogma', limit=1, country='GB').

I’ve been taking a closer look at the syntax for advanced queries in MB, and it is possible to create complex queries such as:

Advanced query syntax: dogma (comment:chile*) (country:CL)

or in the web-browser:

http://musicbrainz.org/search?query=dogma+%28comment%3Achile*%29+%28country%3ACL%29&type=artist&limit=25&advanced=1

This returns the Chilean ‘Dogma’ entry, so I will try to replicate this syntax in the queries within my scripts:

http://musicbrainz.org/search?query=supernova+(comment:chile*)+(country:CL)&type=artist&limit=5&advanced=1
 

Merging artist names between BDMC and musicapopular

I have done some preliminary testing on merging my artist name data from the BDMC and MB, with 17% of artists recognized. I am also searching in the MB aliases of the artists, and I am comparing the lowercase versions of the names in order to avoid differences in capitalization style.

Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I have been comparing the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST I obtained:

  • There is a total of 4313 artists
  • 468 artist names are recognized in both databases
  • 3845 appear in only one of the two databases

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in only one of the two lists
  • 5712 artists in total

What is the difference between the data in PEOPLE_ARTIST and ARTIST_INFO? When comparing both lists, there are:

  • 1636 artists in both lists
  • 1723 in only one of the two lists
Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but does not appear in PEOPLE_ARTIST.
I checked the methods and realized/remembered that PEOPLE_ARTIST was built to extract all the people who have worked over the years in a specific group, which is why ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file.
In other words, for future work comparing files, the ARTIST_INFO .txt file must be used.

 

Also, there are no matched artists with accented names, so there is something wrong with the way the two lists, from musicapopular and the BDMC, are encoded.

I tested the BDMC file, exported as a Windows-formatted, tab-delimited txt file, against the ARTIST_INFO txt file. The encodings were different: I figured out that the BDMC file was encoded as CSISOLATIN1 (Latin-1) while ARTIST_INFO was UTF-8 (see previous post). Now that the two files share the same encoding (see the sketch after these counts), we obtained a much better result:

  • 973 artists in both lists
  • 4556 in either one of the two lists
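The conversion itself is a one-off; a sketch with hypothetical filenames:

# decode the Latin-1 (CSISOLATIN1) export and write it back as UTF-8
with open('bdmc_export.txt', 'rb') as f:
    text = f.read().decode('latin-1')
with open('bdmc_export_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)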

MusicBrainz Spanish Style Guide

I need to follow the MusicBrainz Style Guide for Spanish, which differs from that of other languages, so here is the list of rules taken from their website:

As a rule, we will use lower case for all words in a sentence with the exception of:

  • The first word of the title, or any word that follows a period.
  • Every proper noun: Dios, Jehová, Jesús, Luzbel, Platón, Pedro, María, Álvarez, Pantoja, Apolo, Calíope, Amadís de Gaula; Europa, España, Castilla, Valencia, Oviedo, Plaza Mayor; Cáucaso, Himalaya, Oriente, Occidente, Adriático, Estrella Polar, Támesis, el Ebro, la ciudad de México, la cordillera de los Andes.
  • For place names that incorporate the article, the article must be capitalized: Los Ángeles, La Haya, Las Palmas, La Habana, El Cairo.
  • ‘Tierra’ is capitalized only when we speak of the planet: Madre Tierra, Mi tierra, El avión tomó tierra. ‘Sol’ and ‘luna’ are capitalized only in scientific texts; in titles they are normally written in lower case: Bajo la luna llena, El sol de mediodía.
  • Divine attributes and titles of nobility, such as Creador, Redentor; Sumo Pontífice, Duquesa de Alva, Marqués de Osuna.
  • Names and nicknames of people, such as el Gran Capitán, Alfonso el Sabio, el Drogas (the article is capitalized only if rule 1 applies).
  • The nouns and adjectives that make up the name of an institution, body, or establishment: Real Academia de Música, Colegio Naval, Museo de Bellas Artes.
  • The names of religious or civil festivities, such as Epifanía, Navidad, Año Nuevo, and the names of saints and the like: Virgen de Guadalupe, San Antonio.
  • Names of streets and public spaces, but only the proper noun: calle Recoletos, plaza de Ríos.
  • Roman numerals, such as I, X, MCMXC.

COMPLEMENTARY RULES

  • Capital letters must always carry an accent when the accentuation rules require it, for example Álvaro.
  • In words beginning with the digraphs Ch and Ll, only the first letter is capitalized. CHile or LLorente is incorrect: it must be Chile and Llorente.
  • After a colon (:) we use lower case (Bailes tradicionales: la jota), except when the colon introduces a quotation: Juan dijo: Ten cuidado.
  • If a statement closes after an ellipsis (…), the next word is capitalized: Fuimos al cine… pero llegamos tarde, or Fuimos al cine… La película estuvo muy bien.
  • After an opening exclamation or question mark, a capital is used only at the beginning of the title or after a period (¿Quién dijo miedo?, ¿Quiénes somos? ¿de dónde venimos? ¿a dónde vamos?).
  • After an opening parenthesis we capitalize only if one of the rules above applies; as a general rule we start with lower case: Al otro lado del río (en directo) or PCCh (Partido Comunista de China).

CASES WHERE WE WILL NOT USE CAPITALS

  • The names of the days of the week, months, and seasons: lunes, abril, verano. They are capitalized only when they form part of historical dates, festivities, or proper names: Primero de Mayo, Primavera de Praga, Viernes Santo, Hospital Doce de Octubre.
  • The names of the musical notes: do, re, mi, fa, sol, la, si.
  • Proper names of people that have come to designate, generically, those who share the most characteristic or noteworthy trait of the original: Mi tía Petra es una auténtica celestina; Siempre vas de quijote por la vida; Mi padre, de joven, era un donjuán.
  • Many objects and products that are called by the name of their inventor, discoverer, or the brand that manufactured or popularized them (zepelín, braille, quevedos, rebeca, napoleón), or by the place where they are produced or from which they originate (cabrales, rioja, damasco, fez). By contrast, authors' names applied to their works keep the initial capital (El Quijote de Cervantes).
  • Names of commercial brands, when we refer not to a product of the brand but, generically, to any object of similar characteristics (zippo, wambas, moulinex, mobylette, derbi, minipimer).

The Chilean record labels' contribution, demystified

Using the data from the BDMC, we can see that there are 37335 songs with an assigned record label (93% of the BDMC's total entries), and that the number of labels is 334.

The distribution of songs published by each record label is shown here:

In this distribution we can see the ‘short head’ and the ‘long tail’ of music.

If we zoom in on the ‘short head’ and select the labels with more than 100 published songs, we can see:

The first column, with 26% of the total published songs, corresponds to ‘Independent’ editions: isolated publications that cannot be counted as record labels. Interesting results can be seen in the figure. EMI Odeón and Warner Music are the two biggest record labels (9% and 6%, respectively), but Oveja Negra, the label of the SCD (the Chilean music copyright society), also has a 6% share of published songs after only 10 years of work. We can also see that Sello Azul, a second SCD label devoted to new artists, has 3%, so the SCD is responsible for almost the same amount of Chilean music as the biggest record label. However, this effect may also reflect the fact that the SCD takes great care with the documentation of its own products, and most of the artists it represents are in its database.

If we do a per-year analysis, we can see:

We can see that the songs in the BDMC belong mostly to the ’90s and ’00s, with 23% and 69% of the total, respectively. Songs before that period make up only 1% of the total. We can also see that the peak of published songs was in 2007 and that this number decreased rapidly in the following years, reaching in 2011 the same number of songs as in 1993.

The merging problem (by Thierry Bertin-Mahieux)

  • When you integrate different sources of data you start to add error, because you match data from different sources that disagree by some percentage (from Jamendo to MusicBrainz, e.g., 10% error; from MB to DBpedia, e.g., 30%)
  • MB by Thierry: “MB is a database of music knowledge”
  • for matching artists from different sources:
    • everything lowercase, remove spaces, and create all possible comparisons
    • Aerosmith – Run D.M.C. -> aerosmith run dmc -> aerosmithrundmc, etc.
  • “Matching is imperfect, period” (Thierry Bertin-Mahieux), so we have big issues to solve on big databases:
    • Improve the matching algorithms
    • Deal with the noise
  • Also, matching is a trade-off, so start merging!
  • Under-merge: we might miss connections with other resources
  • Over-merge: we don’t know which artist is which anymore
  • Maybe something to try would be to use fingerprinting in order to find matches (G. Tzanetakis)