Retrieving already assigned MBIDs from MB

We are comparing our database of songs (recordings), artists (artists), and albums (releases) with MusicBrainz (MB). I am querying MB with its advanced search syntax for a specific recording, like:

recording+name artist:artist+name release:release+title comment:Chile* country:CL
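
As a minimal sketch, a query of this shape could be composed with a small helper. The function name and its parameters are my own placeholders, and the helper does not escape any special characters of the search syntax:

```python
def build_recording_query(recording, artist, release,
                          comment="Chile*", country="CL"):
    """Compose a query string in the same shape as the example above."""
    def plus(text):
        # Join the words with '+', as in the example query.
        return "+".join(text.split())

    return "{} artist:{} release:{} comment:{} country:{}".format(
        plus(recording), plus(artist), plus(release), comment, country)


print(build_recording_query("recording name", "artist name", "release title"))
```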

A result list is retrieved, and in this round I only consider the first entry as the potential match. I then compare the recording titles from the BDMC and MB using the Levenshtein distance and the total number of letters in the query and the response.

After some trial and error, and after leaving the script running for 30 hours, the results we obtained are:

  • 10615 entries with an MB artist ID
  • 9074 entries with an MB release ID
  • 8436 entries with an MB recording ID

However, looking at the resulting file, I can see some things that I need to fix to get better results:

  1. After receiving a query response, I should only consider artist names with country in ['CL', 'Chile', '']. I should also compare the names from the BDMC and MB using the Levenshtein distance.
  2. When a response comes back, we should iterate over a number of songs to see which of them is the proper match (sometimes the true positive is not the first option).

For the next round I will use a Levenshtein ratio instead of the distance. This approach returns a normalized value between 0 and 1 instead of the number of edits needed to go from one word to the other.
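
To make the difference concrete, here is a minimal pure-Python sketch of both measures. The normalization used (one minus the distance divided by the length of the longer string) is an assumption on my side; Levenshtein libraries may normalize differently, so their values are not necessarily comparable with these.

```python
def levenshtein_distance(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / float(longest)


print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_ratio("kitten", "sitting"))
```

The distance grows with the length of the titles, while the ratio stays on the same scale for short and long strings, which is what makes a single threshold usable.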

Big question to solve: which threshold works best when comparing Levenshtein ratios? Trying and comparing by hand, I have arrived at 0.75 as a *nice* threshold, but this value should be revised (suggestion: make plots for different thresholds).
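
One way to revise the threshold, sketched under the assumption that a sample of query/response pairs has been checked by hand, is to compute precision and recall for several thresholds and plot them. The function name and the example pairs below are hypothetical, not results from the BDMC.

```python
def precision_recall(labelled, threshold):
    """labelled: list of (ratio, is_true_match) pairs checked by hand."""
    accepted = [is_match for ratio, is_match in labelled if ratio >= threshold]
    true_positives = sum(1 for is_match in accepted if is_match)
    total_matches = sum(1 for _, is_match in labelled if is_match)
    precision = true_positives / float(len(accepted)) if accepted else 1.0
    recall = true_positives / float(total_matches) if total_matches else 1.0
    return precision, recall


# Hypothetical hand-labelled pairs: (ratio, was it really the same song?)
sample = [(1.0, True), (0.92, True), (0.80, False),
          (0.74, True), (0.60, False), (0.31, False)]

for threshold in (0.5, 0.75, 0.9):
    print(threshold, precision_recall(sample, threshold))
```

Raising the threshold trades recall for precision, so the plot should show where that trade stops paying off.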

I implemented the iteration over the retrieved songs to match the hopefully true positive among all items with a 100% score (number 2 above). The amount of noise is being diminished, but I don’t have all results yet.

About the artists’ names, it seems that the proper approach would be to query the MB database only by artist in order to refine the results. The query and filtering should be:

1) firstname+lastname country:CL comment:Chile*

2) Filter out those artist names with a country other than ['CL', '']

3) Iterate over a number of names and retrieve the one with the highest Levenshtein ratio or Jaro distance.
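
Steps 2 and 3 can be sketched as follows. Two assumptions are mine: the candidates are dictionaries with a 'name' key and an optional 'country' key, and difflib's SequenceMatcher is used as a stand-in similarity, which is neither a Levenshtein ratio nor a Jaro distance.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; swap in a Levenshtein ratio or Jaro here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_artist(name, candidates, allowed_countries=("CL", "")):
    """Return (candidate, score) for the best allowed candidate, or None."""
    # Step 2: drop candidates whose country is set to something else.
    kept = [c for c in candidates if c.get("country", "") in allowed_countries]
    if not kept:
        return None
    # Step 3: keep the candidate whose name is most similar to ours.
    scored = [(similarity(name, c["name"]), c) for c in kept]
    score, best = max(scored, key=lambda pair: pair[0])
    return best, score


# Hypothetical candidates, in the order a search might return them.
candidates = [{"name": "Dogma Crew", "country": "ES"},
              {"name": "Dogma 3000"},
              {"name": "Dogma"}]
print(best_artist("Dogma", candidates))
```

A candidate without a country is kept on purpose, because many MB artists have that field empty.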

Just as a note: when asking MB for all Chilean artists (comment:Chile* country:CL), it returns 208 artists.

Digging into MusicBrainz NGS webservices

I have been dealing with the following problem: when searching for the artist ‘Dogma’ on the MB website, I obtain a list with several artists of that name.

For our project I am only interested in the Chilean artist, which is the one that has a disambiguation comment field with the ‘Chilean artist’ note.

However, when searching for the ‘Dogma’ artist using the musicbrainzngs.search_artists method, the output is:

{'artist-list': [{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

As the output shows, the MB NGS Python module does not return the disambiguation comment by default, so I modified my copy of the module in order to retrieve this field.
Now, when I query MB for ‘Dogma’:

m.search_artists('Dogma', limit = 1, offset = 2)
http://musicbrainz.org/ws/2/artist/?query=Dogma&limit=1&offset=2

I obtain

{'artist-list': [{'alias-list': ['Dogma'],
'disambiguation': 'Chilean artist',
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'}]}

which is what I am looking for. From this point on, I just need to iterate over a number of artists and check whether any of them has:

  • a 'country': 'CL' value
  • or 'chile' within the value of one of its key:value pairs, ignoring case because the comment reads 'Chilean artist' (re.search('chile', value, re.IGNORECASE))
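
A minimal sketch of this check, assuming Python 3 and artist dictionaries shaped like the output above. The function name is hypothetical. Note that the search has to ignore case, because the comment reads 'Chilean artist' with a capital C.

```python
import re


def looks_chilean(artist):
    """True if the dict has country 'CL' or mentions 'chile' in a text value."""
    if artist.get("country") == "CL":
        return True
    # Only plain string values are inspected; lists such as 'alias-list'
    # and nested dicts such as 'life-span' are skipped.
    return any(isinstance(value, str)
               and re.search("chile", value, re.IGNORECASE) is not None
               for value in artist.values())


chilean_dogma = {"alias-list": ["Dogma"],
                 "disambiguation": "Chilean artist",
                 "id": "66b7aa34-3117-42d7-b108-942ba99ba30b",
                 "name": "Dogma",
                 "sort-name": "Dogma"}
spanish_crew = {"alias-list": ["Dogma", "Dogma Crew"],
                "country": "ES",
                "name": "Dogma Crew"}

print(looks_chilean(chilean_dogma), looks_chilean(spanish_crew))
```

Because every string value is searched, an artist whose name merely contains "chile" would also pass; restricting the search to the 'disambiguation' key would be stricter.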

However, a second problem that I have had is that when I search using the same search_artists method:

search_artists(query='', limit=None, offset=None, **fields)

Specifying these key:values for the **fields:
{'tags':'uk', 'tags':'england', 'country':'GB'}
and doing this query
m.search_artists('Dogma', {'tags':'uk', 'tags':'england', 'country':'GB'})
I get the same list as before, so these extra fields are not narrowing the search:

{'artist-list': [{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

A third problem is that if I do:
m.search_artists('Dogma', limit = 1, {'tags':'uk', 'tags':'england', 'country':'GB'})
I obtain this error:
SyntaxError: non-keyword arg after keyword arg (, line 1)
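, which looks like an error of the Python module but is in fact raised by Python itself: a positional argument (the dictionary) cannot follow a keyword argument (limit = 1). The extra fields have to be given as keyword arguments, or the dictionary has to be unpacked with **.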
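
The second and third problems can be examined without touching the network by using a stand-in function that has the signature quoted above and only reports how its arguments are bound. This is not the real module, and whether the real module accepts 'tags' or 'country' as field names is a separate question that this sketch does not answer.

```python
def search_artists(query='', limit=None, offset=None, **fields):
    """Stand-in with the quoted signature; it only reports argument binding."""
    return {'query': query, 'limit': limit, 'offset': offset, 'fields': fields}


# A dict literal keeps only the last value of a repeated key,
# so 'tags': 'uk' is silently lost here.
extra = {'tags': 'uk', 'tags': 'england', 'country': 'GB'}
print(extra)

# Passed positionally, the dict is bound to `limit` and `fields` stays empty.
positional = search_artists('Dogma', extra)
print(positional)

# Unpacked with **, every key becomes a keyword argument inside `fields`.
unpacked = search_artists('Dogma', limit=1, **extra)
print(unpacked)
```

This would explain why the positional call returned the unfiltered list: the search fields never reached the **fields parameter.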

I’ve been taking a closer look at the syntax for advanced queries in MB, and it is possible to create complex queries such as:

Advanced query syntax : dogma (comment:chile*) (country:CL)

or in the web-browser:

http://musicbrainz.org/search?query=dogma+%28comment%3Achile*%29+%28country%3ACL%29&type=artist&limit=25&advanced=1

This returns:

So I will try to replicate this syntax in my queries within my scripts:

http://musicbrainz.org/search?query=supernova+(comment:chile*)+(country:CL)&type=artist&limit=5&advanced=1
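
A minimal sketch of how such a URL could be generated in a script, assuming Python 3. The function name and its default values are placeholders of mine. urlencode percent-encodes the parentheses, the colons and the asterisk, so the result looks different from the URL above but decodes to the same query.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def advanced_artist_search_url(name, comment="chile*", country="CL", limit=5):
    """Build an advanced artist search URL like the one shown above."""
    query = "{} (comment:{}) (country:{})".format(name, comment, country)
    params = [("query", query), ("type", "artist"),
              ("limit", limit), ("advanced", 1)]
    return "http://musicbrainz.org/search?" + urlencode(params)


url = advanced_artist_search_url("supernova")
print(url)
# Decode it again to check that the query survived the encoding.
print(parse_qs(urlparse(url).query)["query"][0])
```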
 

Very first numbers…

I was granted access to the BDMC (“La Base de Datos de la Música Chilena”, compiled by the SCD, the Chilean Copyright Society). Here are some numbers on the amount of information that this database has:

bdmc

  • 40132 total songs
  • 32569 different songs (so, 7563 are cover songs or songs with the same name?)
  • 3342 different artists
  • 3085 different albums (some noise, though, as in the case of “Obras Sinfónicas en Vivo CD1” and “Obras Sinfónicas en Vivo CD2”, and some possible identical names between releases)
  • 79 different genres (tags)
  • 432 different record labels

However, there is some noise in this data because the same entity written in different styles appears as different entries (e.g., “DJ Méndez y Yoan Amor” and “DJ Méndez – Yoan Amor”; “A ti”, “A Ti”, and “A tí”). The data needs to be normalized before further processing!
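
A minimal normalization sketch for the cases listed above, assuming Python 3. The function name and the rules are my own assumptions: lowercasing, dropping accents, and optionally treating “ y ”, “&” and dashes between names as the same separator. Dropping accents loses information, and the separator rule only makes sense for artist credits, since a title such as “Tú y yo” legitimately contains “ y ”.

```python
import re
import unicodedata


def normalize(text, unify_separators=False):
    """Return a comparison key for a title or an artist credit."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    key = "".join(c for c in decomposed if not unicodedata.combining(c))
    key = key.casefold()
    if unify_separators:
        # " y ", " & ", " - " and dash variants all become " / ".
        key = re.sub(r"\s+(?:y|&|[-\u2013\u2014])\s+", " / ", key)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", key).strip()


print({normalize(title) for title in ("A ti", "A Ti", "A tí")})
print(normalize("DJ Méndez y Yoan Amor", unify_separators=True))
print(normalize("DJ Méndez \u2013 Yoan Amor", unify_separators=True))
```

The key is only meant for grouping and comparing entries; the original spelling should be kept for display.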

It is interesting to see how the scope of the BDMC differs from that of other sources of Chilean music information, such as musicapopular.cl, mus.cl, portaldisc.cl, and vccl.tv. The BDMC contains only songs that have already generated copyright income for their authors, so most of them have received airplay.

I have already scraped the data from all the other sites. Preliminary numbers are:

mus.cl

  • 502 album reviews
  • 332 interviews
  • 564 concert reviews

musicapopular.cl

  • 3353 artist biographies (I still need to extract the full discographies)

portaldisc.cl

  • 3634 album reviews (although there is some noise because there are some non-Chilean artists)

vccl.tv

  • 1661 videoclips