We are comparing our database of songs (recordings), artists (artists), and albums (releases) with MusicBrainz (MB). I am querying MB with its advanced search syntax for a specific recording, like:
recording+name artist:artist+name release:release+title comment:Chile* country:CL
A result list is retrieved, and in this round I am only considering the first entry as the potential match. Then I compare the recording titles from the BDMC and MB using the Levenshtein distance and the total number of letters in the query and the response.
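As a rough illustration of the comparison described above, here is a minimal pure-Python sketch of the Levenshtein distance. The titles are hypothetical, and lower-casing before comparing is an assumption, not necessarily what the script does.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical titles, for illustration only.
bdmc_title = "Gracias a la vida"
mb_title = "Gracias a la Vida"

# Assumption: case differences should not count as edits.
distance = levenshtein(bdmc_title.lower(), mb_title.lower())
total_letters = len(bdmc_title) + len(mb_title)
print(distance, total_letters)
```

The raw distance depends on how long the titles are, which is why the total number of letters is kept alongside it.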
After some trial and error, and leaving the script running for 30 hours, the results we obtained are:
- 10615 entries with an MB artist ID
- 9074 entries with an MB release ID
- 8436 entries with an MB recording ID
However, looking at the resulting file, I can see some things I need to fix to get better results:
- After receiving a query response, I should only consider artist names with country in ['CL', 'Chile', '']. I should also apply a string comparison using Levenshtein distance between the names from the BDMC and MB.
- When a response comes back, we should iterate over a number of songs to see which of them is the proper match (sometimes the true positive is not the first option).
For the next round I will use a Levenshtein ratio instead of the distance. This approach returns a normalized value between 0 and 1 instead of the number of edits needed to go from one word to the other.
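One possible way to normalize the distance is sketched below. Dividing by the length of the longer string is an assumption made here for simplicity; some libraries normalize by the sum of the two lengths instead, so their values can differ from these.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 when every
    character of the longer string has to be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein_ratio("kitten", "sitting"))
```

Unlike the raw distance, this value can be compared across short and long titles with a single threshold.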
Big question to solve: which threshold works best when comparing Levenshtein ratios? Trying and comparing by hand, I have arrived at 0.75 as a *nice* threshold, but this value should be revised (suggestion: make plots for different thresholds).
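A threshold sweep along these lines could supply the numbers for such plots. This is only a sketch: the labelled pairs are hypothetical stand-ins for a hand-checked sample, `sweep` is an illustrative name, and the ratio is normalized by the longer string, which is an assumption.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1], normalized by the longer string (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def sweep(labelled_pairs, thresholds):
    """For each threshold, count accepted pairs that are true or false matches."""
    rows = []
    for t in thresholds:
        true_matches = false_matches = 0
        for left, right, is_match in labelled_pairs:
            if levenshtein_ratio(left.lower(), right.lower()) >= t:
                if is_match:
                    true_matches += 1
                else:
                    false_matches += 1
        rows.append((t, true_matches, false_matches))
    return rows


# Hypothetical hand-labelled sample: (BDMC title, MB title, same recording?).
labelled_pairs = [
    ("Gracias a la vida", "Gracias a la vida", True),
    ("Te recuerdo Amanda", "Te recuerdo, Amanda", True),
    ("La partida", "La paloma", False),
    ("El aparecido", "El derecho de vivir en paz", False),
]
thresholds = [0.0, 0.5, 0.75, 0.9, 1.0]

results = sweep(labelled_pairs, thresholds)
for t, tp, fp in results:
    print(f"threshold={t:.2f} true_matches={tp} false_matches={fp}")
```

The resulting rows can be passed to any plotting tool to see where false matches start to drop faster than true ones.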
I implemented the iteration over the retrieved songs to find the hopefully true positive among all items with a 100% score (the second item above). The amount of noise is decreasing, but I don’t have all the results yet.
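The iteration might look roughly like the sketch below. The candidate dictionaries with `id`, `title` and `score` keys are an assumed shape for the parsed search results, `best_match` is a hypothetical helper name, and the 0.75 default is the provisional threshold mentioned above.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1], normalized by the longer string (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(query_title, candidates, threshold=0.75):
    """Among candidates with a 100% search score, return the one whose title
    is closest to query_title, or None if none reaches the threshold."""
    best, best_ratio = None, 0.0
    for candidate in candidates:
        if candidate.get("score") != 100:
            continue  # only consider items with a 100% score
        ratio = levenshtein_ratio(query_title.lower(), candidate["title"].lower())
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best


# Hypothetical search results, for illustration only.
candidates = [
    {"id": "a", "title": "La paloma", "score": 100},
    {"id": "b", "title": "La partida", "score": 100},
    {"id": "c", "title": "La partida", "score": 80},
]
match = best_match("La partida", candidates)
print(match)
```

Here the first entry is not the right one, so taking only the top result would have produced a false match.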
About the artists’ names, it seems that the proper approach would be to query the MB database only by artist in order to refine the results. The query and filtering should be:
1) firstname+lastname country:CL comment:Chile*
2) Filter out those artist names whose country is not in ['CL', '']
3) Iterate over a number of names and retrieve the one with the highest Levenshtein ratio or Jaro similarity.
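Steps 2 and 3 could be sketched as follows, once the search results have been parsed. The `name` and `country` keys are an assumed shape for each candidate, `pick_artist` is a hypothetical helper, and only the Levenshtein ratio (normalized by the longer name, an assumption) is shown; a Jaro measure could be swapped in for `levenshtein_ratio`.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1], normalized by the longer string (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Step 2: keep artists from Chile or with no country set.
ALLOWED_COUNTRIES = {"CL", ""}


def pick_artist(bdmc_name, candidates, threshold=0.75):
    """Return the allowed candidate whose name is closest to bdmc_name,
    or None if no candidate reaches the threshold."""
    best, best_ratio = None, 0.0
    for candidate in candidates:
        country = candidate.get("country") or ""  # a missing country counts as empty
        if country not in ALLOWED_COUNTRIES:
            continue
        # Step 3: keep the name with the highest ratio.
        ratio = levenshtein_ratio(bdmc_name.lower(), candidate["name"].lower())
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best


# Hypothetical candidates, for illustration only.
candidates = [
    {"name": "Victor Jara", "country": "US"},
    {"name": "Víctor Jara", "country": "CL"},
    {"name": "Victor Parra"},
]
artist = pick_artist("Victor Jara", candidates)
print(artist)
```

Note that the exact-name candidate is rejected because of its country, which is the behaviour step 2 asks for.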
Just as a note, when asking MB for all Chilean artists (comment:Chile* country:CL), it returns 208 artists.