String comparison metrics

We are comparing six string-similarity metrics for matching names between the BDMC and MusicBrainz (MB). Two measures, the Levenshtein ratio and the Jaro similarity, are each applied to three versions of the strings. These metrics are:

  • Original strings, Levenshtein ratio
  • Original strings, Levenshtein jaro
  • ASCII strings, Levenshtein ratio
  • ASCII strings, Levenshtein jaro
  • Lowercase, no-spaces, ASCII strings, Levenshtein ratio
  • Lowercase, no-spaces, ASCII strings, Levenshtein jaro
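As a rough illustration of the three string versions listed above, here is a minimal Python sketch. The helper names (to_ascii, to_key, variants) are made up for this example and are not necessarily what our scripts use; stripping accents through Unicode NFKD decomposition is only one possible choice.

```python
import unicodedata


def to_ascii(text):
    """Strip accents and drop any character that has no ASCII equivalent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_key(text):
    """Lowercase, no-spaces, ASCII version of the string."""
    return "".join(to_ascii(text).lower().split())


def variants(text):
    """Return the three versions of a name that the six metrics are applied to."""
    return {"original": text, "ascii": to_ascii(text), "key": to_key(text)}


print(variants("Álvaro Henríquez"))
```

One caveat of this approach: letters that do not decompose into a base letter plus an accent, such as the ‘ø’ in ‘Døgma’, are simply dropped instead of being mapped to ‘o’.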

For the actual comparison, I will create a known (ground-truth) dataset and measure precision and recall for the six metrics. But how large should this dataset be? Rule of thumb: n = (100/x)^2, where x is the desired margin of error in percent; for example, x = 5 gives 400 entries and x = 10 gives 100. However, this assumes an infinite population, so we should apply a ‘finite population correction’ (FPC):

FPC = sqrt((N-n)/(N-1)), where N is the population size and n is the sample size.
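To make the arithmetic concrete, here is a small sketch of the rule of thumb and the FPC. The function names are hypothetical, and sample_size_finite uses the standard textbook adjustment n = n0 / (1 + (n0 - 1) / N), which is an assumption on my side and is not spelled out above.

```python
import math


def sample_size_infinite(margin_pct):
    """Rule of thumb (100/x)^2 for a margin of error of x percent."""
    return (100.0 / margin_pct) ** 2


def fpc(population, sample):
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return math.sqrt((population - sample) / (population - 1))


def sample_size_finite(margin_pct, population):
    """Sample size adjusted for a finite population (assumed textbook formula)."""
    n0 = sample_size_infinite(margin_pct)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


# Example with a 5% margin and a population of 3142 artists.
print(sample_size_infinite(5), sample_size_finite(5, 3142))
```

For a population of a few thousand artists the correction only trims the sample slightly, so working with 400 entries stays on the safe side.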

There are three things we should look at:

  1. How many artists have an exact match (i.e., they are already in the database)
  2. How many artists do not match (i.e., they are not in the database)
  3. How many artists match partially. Among these, we need to find the threshold that gives the best precision and recall and then, using the bootstrapping technique, create error bars for both metrics.

Although we have found that the best threshold for the string matching lies around 0.88 (for the lowercase, no-spaces, ascii strings Levenshtein ratio), we are running a test query with a more *generous* threshold. Later, we will extract from those results the subset used to calculate the threshold that gives the best precision and recall values (or the best combined ‘f-score’).

The experiment we are thinking about has the following steps:

  1. Create a ground-truth dataset. This subset will be randomly chosen among all artist names that actually exist in both the BDMC and MB. As mentioned before, the size of this sample should be somewhere between 100 and 400 entries.
  2. Manually look up the correct MBID for these entries.
  3. Create a script for calculating precision and recall using the six metrics for all these entries.
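The third step could look roughly like the sketch below. It assumes each ground-truth entry has been reduced to a (score, is_correct) pair for one metric, and that an entry counts as retrieved when its score reaches the threshold; both the data layout and the function names are assumptions for illustration, and the toy values are invented.

```python
def precision_recall_f1(scored, threshold):
    """scored: list of (score, is_correct) pairs for one metric.

    An entry is treated as a predicted match when score >= threshold.
    """
    tp = sum(1 for s, ok in scored if s >= threshold and ok)
    fp = sum(1 for s, ok in scored if s >= threshold and not ok)
    fn = sum(1 for s, ok in scored if s < threshold and ok)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scored, thresholds):
    """Pick the threshold with the highest f-score."""
    return max(thresholds, key=lambda t: precision_recall_f1(scored, t)[2])


# Invented toy data: (metric value, whether the MB entry is the right artist).
toy = [(1.0, True), (0.95, True), (0.9, False), (0.8, True), (0.7, False)]
for t in (0.65, 0.75, 0.85, 0.95):
    print(t, precision_recall_f1(toy, t))
```

Running the same sweep once per metric gives six precision/recall curves that can be plotted against the threshold.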

 

STEPS

  1. We created a script that takes random samples from the entries with metric values between 0.75 and 1.0 and whose country is ‘Chile’ or ‘None’. We randomly chose 400 entries so that we could discard those that are not in MB (in fact, this is the maximum number of entries satisfying those constraints).
  2. We mark as RED those BDMC entries that are not in MB, as GREEN those that are already in MB, and as YELLOW those that match a wrong entry in MB (false positives: same name but a different artist, so they should be treated as RED). To check false positives and negatives we use the data from musicapopular and the BDMC. The numbers we got are:
    • GREEN: 179
    • YELLOW: 98
    • RED: 123

We have noticed several interesting facts:

  • There is a large number of artists within the Folklore genre. Most of these entries come from compilation releases by Various Artists. Hence, most of these artists have just one or a few works associated with them.
  • There is a large number of false positives among artists with very common names such as Rachel, Quorum, Criminal, Polter, Rock Hudson, Trilogía, Twilight, and many others. The only way to determine whether a match is a true or a false positive is to research the releases and works of the artist. Fortunately, we have plenty of information from several websites to determine whether the artist already has an entry; we can also check whether there is any reference to Chile in the MB entry, in the release country of a release, or in the relationship field.

After some trial and error, we changed our script and are now running another one over all artists, without any threshold, for the six metrics we are comparing. The queries are now also built properly, so an artist name like ‘Gonzalo Yáñez’ is correctly encoded in the URL query to MB. We think that this approach will let us compare all the metrics at once. Once this was done for all 3142 different artists, we filtered again all entries with values between [0.75, 1[ but we left wrong countries in the set (we can’t mix pears and apples).

The settings for this latest approach gave us 335 artists in the range [0.75, 1[ and 464 artists with a value of exactly 1. There are also 2344 in the range [0, 0.75[. We counted correctly retrieved artists as ‘true positives’, those with the same name but referring to a different artist as ‘false positives’, and those wrongly retrieved as ‘false negatives’. This selection should be discussed. The first plots are as follows:

It is strange that in plots 3, 4, 5, and 6 the recall keeps growing while the precision decreases only slightly. We think there is something wrong with our choice of true and false positives and true and false negatives.

We have been designing a third approach for analyzing this data. The first part of this approach concerns how many Chilean artists are in the database and how well the algorithm performs on them. Things to calculate:

1. Recall on just the ones for the different thresholds

2. Recall on the ‘ones’ and ‘twos’

But for other ‘real-world’ applications, where string matching could be used, we will:

3. Calculate precision and recall considering the ‘twos’ as correct matches (the string matching algorithm did its job),

4. Calculate precision and recall considering the ‘twos’ as incorrect matches.

Moreover, to estimate the error we will use the bootstrapping technique: creating a number of new populations from the sample population. In other words, if the sample population has 380 entries, we will create 1000 populations of the same size by drawing from it with replacement (this means that a new population can contain duplicate entries; without replacement we would get the same population again and again). We then compute the metric on each one, sort the 1000 values, and discard the 25 lowest and the 25 highest; the remaining range gives the error bounds for a 95% confidence interval.
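A minimal sketch of this percentile bootstrap follows. The function names, the fixed seed, and the example sample (380 invented true/false outcomes) are assumptions made only for illustration; any statistic that takes a list of entries, such as precision or recall, can be passed in.

```python
import random


def proportion_true(flags):
    """Example statistic: fraction of entries that are correct."""
    return sum(flags) / len(flags)


def bootstrap_interval(sample, statistic, n_resamples=1000, seed=0):
    """95% percentile interval from resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    cut = n_resamples // 40  # 2.5% per tail: 25 values when n_resamples is 1000
    return stats[cut], stats[-cut - 1]


# Invented sample: 300 correct and 80 incorrect matches (380 entries).
sample = [True] * 300 + [False] * 80
low, high = bootstrap_interval(sample, proportion_true)
print(low, proportion_true(sample), high)
```

The two returned values can be drawn directly as the lower and upper error bars around the value measured on the original sample.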

 

Retrieving already assigned MBID from MB

We are comparing our database of songs (recordings), artists (artists), and albums (releases) with MusicBrainz. I am querying MB with the advanced search syntax for a specific recording, like:

recording+name artist:artist+name release:release+title comment:Chile* country:CL

A result list is retrieved, and in this round I am only considering the first entry as the potential answer. Then I compare the recording titles from the BDMC and MB using the Levenshtein distance and the total number of letters of the query and the response.

After some trial and error, and after leaving the script running for 30 hours, the results we obtained are:

  • 10615 entries with an MB artist ID
  • 9074 entries with an MB release ID
  • 8436 entries with an MB recording ID

However, looking at the resulting file I can see some things that I need to fix to get better results:

  1. After receiving a query response, I should only consider artist names with country=[‘CL’, ‘Chile’, ‘’]. I should also apply a string comparison using the Levenshtein distance between the names from the BDMC and MB.
  2. When a response comes back, we should iterate over a number of songs to see which of them is the proper match (sometimes the true positive is not the first option).

For the next round I will use the Levenshtein ratio instead of the distance. This returns a normalized value between 0 and 1 instead of the number of edits needed to go from one word to the other.
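To show the difference between the two quantities, here is a self-contained sketch in pure Python. The normalization used in ratio (substitutions cost 2, divided by the summed lengths) is one common convention and is an assumption here; the exact formula of the library used in our scripts may differ.

```python
def levenshtein(a, b, sub_cost=1):
    """Edit distance between a and b using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # substitute
            ))
        prev = curr
    return prev[-1]


def ratio(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b, sub_cost=2)) / total


print(levenshtein("DJ Mendez", "DJ Méndez"), ratio("DJ Mendez", "DJ Méndez"))
```

With this convention a single missing accent in a nine-letter name costs 2 out of 18, so the pair ‘DJ Mendez’ / ‘DJ Méndez’ scores about 0.89, while the raw distance of 1 says nothing about how long the names are.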

Big question to solve: which threshold works best when comparing Levenshtein ratios? Trying and comparing by hand, I have arrived at 0.75 as a *nice* threshold, but this value should be revised (suggestion: make plots for different thresholds).

I implemented the iteration over the retrieved songs to find the (hopefully) true positive among all items with a 100% score (number 2 above). The amount of noise has decreased, but I don’t have all the results yet.

As for the artists’ names, it seems that the proper approach would be to query the MB database by artist only in order to refine the results. The query and filtering should be:

1) firstname+lastname country:CL comment:Chile*

2) Filter out those artist names with a country other than [CL, ‘’]

3) Iterate over a number of names and retrieve the one with the highest Levenshtein ratio or Jaro similarity.
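Steps 2 and 3 can be sketched as follows. To keep the example self-contained it uses difflib.SequenceMatcher from the standard library as a stand-in for the Levenshtein ratio, so the scores will not be identical to the ones in our experiments; the function names and the candidate list (a few names and countries taken from a ‘Dogma’ search) are illustrative only.

```python
from difflib import SequenceMatcher

ALLOWED_COUNTRIES = {"CL", ""}  # Chilean or undeclared


def similarity(a, b):
    """Stand-in similarity in [0, 1]; not the Levenshtein ratio itself."""
    return SequenceMatcher(None, a, b).ratio()


def best_candidate(query_name, candidates, threshold=0.75):
    """Return (artist, score) for the best allowed candidate, or (None, score)."""
    best, best_score = None, 0.0
    for cand in candidates:
        if cand.get("country", "") not in ALLOWED_COUNTRIES:
            continue  # step 2: discard artists from other countries
        score = similarity(query_name.lower(), cand["name"].lower())
        if score > best_score:  # step 3: keep the highest score
            best, best_score = cand, score
    if best_score < threshold:
        return None, best_score
    return best, best_score


candidates = [
    {"name": "Dogma Crew", "country": "ES"},
    {"name": "The Dogma", "country": "IT"},
    {"name": "Dogma"},
    {"name": "Dogma 3000"},
]
print(best_candidate("Dogma", candidates))
```

Filtering before scoring means that a perfect name match from another country can never win over a slightly worse match with an allowed country.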

Just as a note, when asking MB for all Chilean artists (comment:Chile* country:CL), it returns 208 artists.

Digging into MusicBrainz NGS webservices

I have been dealing with the following problem: when searching for the artist ‘Dogma’ on the MB website, I obtain a list of several artists with that name.

For our project I am only interested in the Chilean artist, which is the one whose disambiguation comment field contains the ‘Chilean artist’ note.

However, when searching for ‘Dogma’ using the musicbrainzngs.search_artists method, the output is:

{'artist-list': [{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

As the output shows, the MB NGS Python module does not return the disambiguation field by default, so I modified my copy of the module in order to retrieve it.
Now, when I query MB for ‘Dogma’:

m.search_artists('Dogma', limit = 1, offset = 2)
http://musicbrainz.org/ws/2/artist/?query=Dogma&limit=1&offset=2

I obtain

{'artist-list': [{'alias-list': ['Dogma'],
'disambiguation': 'Chilean artist',
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'}]}

, which is what I am looking for. From this point on, I just need to iterate over a number of artists and see if any of them has:

  • a ‘country’: ‘CL’ value
  • or ‘chile’ within the value of any key:value pair (re.search('chile', value))
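The two checks above could be written as the small helper below. The function name looks_chilean is hypothetical, and the sketch only inspects top-level string values of each artist dictionary (nested lists such as 'alias-list' are ignored), which is a simplifying assumption.

```python
import re

CHILE = re.compile(r"chile", re.IGNORECASE)


def looks_chilean(artist):
    """True if the artist has country CL or mentions 'chile' in a string field."""
    if artist.get("country") == "CL":
        return True
    return any(
        isinstance(value, str) and CHILE.search(value)
        for value in artist.values()
    )


results = [
    {"alias-list": ["Dogma", "Dogma Crew"], "country": "ES",
     "id": "c98ecf9f-5572-4317-b15a-79cde78698ac", "name": "Dogma Crew",
     "sort-name": "Dogma Crew", "type": "Group"},
    {"alias-list": ["Dogma"], "disambiguation": "Chilean artist",
     "id": "66b7aa34-3117-42d7-b108-942ba99ba30b", "name": "Dogma",
     "sort-name": "Dogma"},
]
print([a["id"] for a in results if looks_chilean(a)])
```

Because the pattern is searched in every string field, an artist whose name merely contains ‘Chile’ would also pass, so the result still needs a manual check.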

However, a second problem is that when I search using the same search_artists method:

search_artists(query='', limit=None, offset=None, **fields)

specifying these key:value pairs for **fields:
{'tags':'uk', 'tags':'england', 'country':'GB'}
and running this query:
m.search_artists('Dogma', {'tags':'uk', 'tags':'england', 'country':'GB'})
I get the same list as before, so these extra fields are not narrowing the search. Note that the dictionary is passed positionally here, so Python binds it to the limit parameter rather than to **fields, and the repeated 'tags' key keeps only its last value:

{'artist-list': [{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

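A third problem is that if I do:
m.search_artists('Dogma', limit = 1, {'tags':'uk', 'tags':'england', 'country':'GB'})
I obtain this error:
SyntaxError: non-keyword arg after keyword arg (, line 1)
This error is raised by Python itself rather than by the module: a positional argument (the dictionary) cannot follow a keyword argument (limit = 1). The search fields have to be passed as keyword arguments instead, for example by unpacking the dictionary with **.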
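The behaviour can be reproduced without the network by using a stand-in function with the same signature as the one quoted above. The stub below is not the real module; it only reports how Python binds the arguments. The field names 'tag' and 'country' are assumptions and should be checked against the module's list of valid search fields.

```python
def search_artists(query='', limit=None, offset=None, **fields):
    """Stand-in with the quoted signature; returns how the arguments were bound."""
    return {'query': query, 'limit': limit, 'offset': offset, 'fields': fields}


# A dict literal with a repeated key silently keeps only the last value.
dupes = {'tags': 'uk', 'tags': 'england', 'country': 'GB'}

filters = {'tag': 'uk', 'country': 'GB'}

# Passed positionally, the dict lands in `limit` and `fields` stays empty.
positional = search_artists('Dogma', filters)

# Unpacked with **, each key becomes a keyword argument collected in `fields`.
keyword = search_artists('Dogma', limit=1, **filters)

print(dupes)
print(positional)
print(keyword)
```

Writing the fields directly as keyword arguments, as in search_artists('Dogma', limit=1, country='GB'), is equivalent to the unpacked call.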

I’ve been taking a closer look at the syntax for advanced queries in MB, and it is possible to create complex queries such as:

Advanced query syntax : dogma (comment:chile*) (country:CL)

or in the web-browser:

http://musicbrainz.org/search?query=dogma+%28comment%3Achile*%29+%28country%3ACL%29&type=artist&limit=25&advanced=1

This returns:

So I will try to replicate this syntax in my queries within my scripts:

http://musicbrainz.org/search?query=supernova+(comment:chile*)+(country:CL)&type=artist&limit=5&advanced=1
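A possible way to build such URLs from a script is sketched below with the standard library only. The helper name artist_search_url is made up, and names containing characters that are special in the query syntax (parentheses, colons, quotes) would need extra escaping that this sketch does not attempt.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_URL = "http://musicbrainz.org/search"


def artist_search_url(name, limit=5):
    """Build an advanced artist search URL restricted to Chilean entries."""
    query = f"{name} (comment:chile*) (country:CL)"
    params = {"query": query, "type": "artist", "limit": limit, "advanced": 1}
    # urlencode percent-encodes spaces, parentheses and accented letters (UTF-8).
    return SEARCH_URL + "?" + urlencode(params)


url = artist_search_url("Gonzalo Yáñez")
print(url)
print(parse_qs(urlsplit(url).query)["query"])
```

Decoding the URL again with parse_qs is a cheap way to confirm that accented names such as ‘Gonzalo Yáñez’ survive the round trip unchanged.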
 

Matching BDMC and MusicBrainz

 

I have been querying MusicBrainz with the data from the BDMC. As a first outcome:

  • In the BDMC there is a total of:
    • 40132 entries
    • 3343 different artists
    • 3085 different albums
    • 32570 songs with different names

From that total, there are

  • 457 artist names that can be found in MusicBrainz with the EXACT spelling
  • 2886 artists that cannot be found

The exact matches are only 14% of the total. However, there are some artist names that are not properly spelled in the databases but are close to the original (e.g., ‘DJ Mendez’ instead of ‘DJ Méndez’, or ‘Alvaro Henriquez’ instead of ‘Álvaro Henríquez’), and those should be considered as found artists. Also, some artists share their name with other artists, such as ‘Mito’. The Chilean ‘Mito’ appears as the third entry in MB, without an explicit country, only with a disambiguation (‘Chilean’).

After running the script again, this time also checking whether the BDMC entry matches any of the aliases of each artist in MB, the numbers are a bit better:

  • 565 (17%) artists were recognized
  • 56 (2%) have CL as the country
  • 72 (2%) have another country
So, if we remove this last group of artists, which are very likely not Chilean, we end up with 493 recognized artists.

 

I’ve also been correcting the many inconsistencies of the BDMC: unifying artists entered with different spellings and adding accents to artist names that lacked them. I have done 25% of it (10^4 entries), and the new numbers I got are:

  • 3308 different artists
  • 551 artists were recognized (17% of the total)
    • 466 possibly Chilean (14%)
      • 56 Chilean (explicitly declared)
      • 410 undeclared country
      • 177 groups (38% of the recognized possibly Chilean artists)
      • 142 people (30% of the recognized possibly Chilean artists)
      • 147 undefined (32% of the recognized possibly Chilean artists)
    • 75 non-Chilean artists (should be discarded from the database)

Our idea is to provide MB with a big file containing all the data in our database together with the corresponding MBIDs for artist, title, and album (if any).

  • From the 551 recognized artists using the out_correct file, there are:
    •  9454 titles (out_BDMC_w_artist_MBID)

Over the last few days I’ve been trying to solve the following problem: for the Chilean artist Dogma there are 8 different entries with the same score (100):

Score Name Sort Name Type Begin End
100 Døgma Døgma
100 Dogma (German trance artist) Dogma
100 Dogma (portuguese band) Dogma Group 1996 2003
100 Dogma (Brazilian progressive rock band) Dogma Group 1996
100 Dogma (Swiss trance duo Robin Mandrysch & Guido Walter) Dogma Group
100 Dogma (goa trance duo Damir Ludvig & Goran Stetic) Dogma Group
100 Dogma (Chilean artist) Dogma
100 Dogma (Italo-dance artist) Dogma

It seems that I need to look at the disambiguation field and search for the word ‘Chile’ (or a derivative) to identify the artist we are looking for.