String comparison metrics

We are comparing six different metrics for the string comparison between the BDMC and MB: the Levenshtein ratio and the Jaro similarity, each computed over three versions of the strings (a sketch of how we compute them follows the list). These metrics are:

  • Levenshtein ratio on the original strings
  • Jaro similarity on the original strings
  • Levenshtein ratio on the ASCII strings
  • Jaro similarity on the ASCII strings
  • Levenshtein ratio on the lowercase, no-spaces, ASCII strings
  • Jaro similarity on the lowercase, no-spaces, ASCII strings
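
Here is a minimal sketch of how the six metrics can be computed, assuming the python-Levenshtein package (which provides both Levenshtein.ratio and Levenshtein.jaro); the helper names to_ascii and squash are ours:

import unicodedata
import Levenshtein

def to_ascii(s):
    # Decompose accented characters (NFKD) and drop the non-ASCII marks.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

def squash(s):
    # lowercase, no-spaces, ASCII version of the string
    return to_ascii(s).lower().replace(' ', '')

def six_metrics(a, b):
    pairs = [(a, b), (to_ascii(a), to_ascii(b)), (squash(a), squash(b))]
    # ratio then jaro for each of the three string versions, in list order
    return [f(x, y) for x, y in pairs for f in (Levenshtein.ratio, Levenshtein.jaro)]

print(six_metrics(u'Gonzalo Yáñez', u'Gonzalo Yañez'))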

For the actual comparison, I will create a known dataset and measure precision and recall for the six metrics. But how large should this dataset be? Rule of thumb: n ≈ (100/x)², where x is the margin of error (in percent) that you want. However, this holds for an infinite population, so we should apply a ‘finite population correction’ (FPC):

FPC = sqrt((N - n)/(N - 1)), where N is the population size and n is the sample size.
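
As a quick sanity check, the rule of thumb and the correction are easy to compute (the 3142 below is the number of distinct artist names we deal with later):

import math

def rule_of_thumb(x):
    # n ~ (100 / x)^2, with x the desired margin of error in percent
    return (100.0 / x) ** 2

def fpc(N, n):
    # finite population correction: sqrt((N - n) / (N - 1))
    return math.sqrt((N - n) / (N - 1.0))

n = rule_of_thumb(5)    # 400 entries for a 5% margin of error
print(n, fpc(3142, n))  # the correction shrinks the error for small N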

There are three interesting things we should look at:

  1. How many artists have an exact match (i.e., they are already in the database)
  2. How many artists do not match (i.e., they are not in the database)
  3. How many artists match partially. Among these, we need to find the threshold that gives the best precision and recall, and then use the bootstrapping technique to put error bars on both metrics.

Although we have found that the threshold for the string matching lies around 0.88 (for the Levenshtein ratio on the lowercase, no-spaces, ASCII strings), we are running a query test with a more *generous* threshold. Later, we will extract from there our subset to calculate the threshold that gives the best precision and recall values (or the best combined ‘F-score’).

The experiment we are thinking about has the following steps:

  1. Create a ground-truth dataset. This subset will be randomly chosen among the artist names that actually exist in the BDMC and MB. As mentioned before, the sample should contain somewhere between 100 and 400 entries.
  2. Manually look for the correct MBID for these entries
  3. Create a script that calculates precision and recall over all these entries using the six metrics (a sketch follows this list).
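
A minimal sketch of step 3, assuming candidates maps each BDMC name to its best MB hit as an (mbid, score) pair for one metric, and truth maps each name to the manually found MBID (or None when the artist is not in MB); all names here are hypothetical:

def precision_recall(candidates, truth, threshold):
    tp = fp = fn = 0
    for name, (mbid, score) in candidates.items():
        if score >= threshold:
            if truth.get(name) == mbid:
                tp += 1      # right artist retrieved
            else:
                fp += 1      # wrong artist (or not in MB at all)
        elif truth.get(name) is not None:
            fn += 1          # a real match that the threshold rejected
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall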


STEPS

  1. We created a script that takes random samples of the entries with distance values between 0.75 and 1.0 and whose country is ‘Chile’ or ‘None’. We randomly chose 400 entries so that we can afford to discard those that are not in MB (in fact, this is the maximum number of entries satisfying those constraints).
  2. We are marking as RED the entries from the BDMC that are not in MB, GREEN those that are already in MB, and YELLOW those that have a wrong entry in MB (false positives: same name but a different artist, so they should be counted as RED). To check false positives and negatives we use the data from musicapopular and the BDMC. The numbers we got are:
    • GREEN: 179
    • YELLOW: 98
    • RED: 123

We have realized several interesting facts:

  • There is a large number of artists within the Folklore genre. Most of these entries come from compilation releases by Various Artists; hence, most of these artists have just one or a few works associated with them.
  • There is a large number of false positives among artists with very common names such as Rachel, Quorum, Criminal, Polter, Rock Hudson, Trilogía, Twilight, and many others. The only way to determine whether a hit is a true or a false positive is to research the releases and works of the artist. Fortunately, we have plenty of information from several websites to determine whether the artist already has an entry, e.g., by looking for any reference to Chile in the MB entry, in the release country of a release, or in the relationship field.

After some trial and error, we changed our script and are now running another one over all artists and without any threshold for the six metrics we are comparing. Also, the queries are now properly done: an artist like ‘Gonzalo Yáñez’ is correctly encoded in the URL query to MB (see the sketch below). We think that with this approach we will be able to compare all the metrics at once. Once this was done for all 3142 different artists, we filtered again all entries with values in [0.75, 1[, but we left the entries with wrong countries in the set (we can't mix apples and oranges).
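
For reference, this is roughly what the proper URL encoding of a non-ASCII name looks like (Python 3's urllib.parse; the exact query building in our script may differ):

from urllib.parse import quote

name = u'Gonzalo Yáñez'
url = 'http://musicbrainz.org/ws/2/artist/?query=' + quote(name)
print(url)  # ...?query=Gonzalo%20Y%C3%A1%C3%B1ez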

The settings for this latest approach gave us 335 artists in the range [0.75, 1[ and 464 artists with a value of exactly 1. Also, there are 2344 in the range [0, 0.75[. We considered correctly retrieved artists as ‘true positives’, artists with the same name but referring to someone else as ‘false positives’, and wrongly retrieved artists as ‘false negatives’. This selection should be discussed. The first plots are as follows:

It is strange that in plots 3, 4, 5, and 6 the recall keeps growing while the precision decreases only slightly. We think there is something wrong with our choice of true and false positives, and true and false negatives.

We have been designing a third approach for analyzing this data. The first part of this approach has to do with how many Chilean artists are in the database and how well the algorithm performs on them. Things to calculate:

1. Recall on just the ‘ones’ for the different thresholds

2. Recall on the ‘ones’ and ‘twos’

But for other ‘real-world’ applications, where string matching could be used, we will:

3. Calculate precision and recall considering ‘twos’ as correct matches (the string matching algorithm did its job),

4. Calculate precision and recall considering ‘twos’ as incorrect matches.

Moreover, to calculate the error we will use the bootstrapping technique: creating a number of new populations starting from our sample population. In other words, if our sample population has 380 entries, we will create 1000 populations by resampling from it with replacement (this means that we can have duplicate entries in the new populations; sampling without replacement would just give us the same population again and again). Then, for each metric, we discard the 25 lowest and 25 highest values, and we have our error boundaries for a 95% confidence interval. A sketch follows.
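
A minimal sketch of that bootstrap, where metric is any function from a resampled list of labeled entries to a number (e.g., recall); the names are illustrative:

import random

def bootstrap_ci(sample, metric, n_boot=1000):
    values = []
    for _ in range(n_boot):
        # resample with replacement: duplicates are allowed
        resampled = [random.choice(sample) for _ in sample]
        values.append(metric(resampled))
    values.sort()
    # drop the 25 lowest and 25 highest of the 1000 values
    return values[25], values[-26]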


Digging into the MusicBrainz NGS web service

I have been dealing with the following problem: when searching for the ‘Dogma’ artist on the MB website, I obtain a long list of artists with that name.

For our project I am only interested in the Chilean artist, which is the one whose disambiguation comment field contains the ‘Chilean artist’ note.

However, when searching for the ‘Dogma’ artist using the musicbrainzngs.search_artists method, it outputs:

{'artist-list': [{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

Hence, the MB NGS Python bindings do not expose this field by default, so I modded the distribution in order to retrieve it.
Now, when I query MB for ‘Dogma’:

m.search_artists('Dogma', limit = 1, offset = 2)
http://musicbrainz.org/ws/2/artist/?query=Dogma&limit=1&offset=2

I obtain

{'artist-list': [{'alias-list': ['Dogma'],
'disambiguation': 'Chilean artist',
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'}]}

, which is what I am looking for. From this point, I just need to iterate over a number of artists and see if any of them has (a sketch follows this list):

  • a ‘country’:’CL’ value
  • or ‘chile’ within the value of any key:value pair (re.search('chile', value))
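
A minimal sketch of that check (is_chilean is our own helper; note the IGNORECASE flag, since the disambiguation comment reads ‘Chilean artist’ with a capital C):

import re
import musicbrainzngs as m

m.set_useragent('bdmc-merge', '0.1')  # placeholder app name; MB requires one

def is_chilean(artist):
    # `artist` is one dict from the 'artist-list' shown above
    if artist.get('country') == 'CL':
        return True
    for value in artist.values():
        if isinstance(value, str) and re.search('chile', value, re.IGNORECASE):
            return True
    return False

result = m.search_artists('Dogma')
chilean = [a for a in result['artist-list'] if is_chilean(a)]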

However, a second problem I have had is that when I search using the same search_artists method:

search_artists(query='', limit=None, offset=None, **fields)

Specifying these key:values for the **fields:
{'tags':'uk', 'tags':'england', 'country':'GB'}
and doing this query
m.search_artists('Dogma', {'tags':'uk', 'tags':'england', 'country':'GB'})
I get the same list as before, so these extra fields are not narrowing the search. (Two things go wrong here: a dict literal with a duplicated 'tags' key silently keeps only the last value, and a dict passed as the second positional argument binds to the limit parameter rather than to **fields, so the fields are never sent as search fields.)

{'artist-list': [{'alias-list': [u'D\xf8gma'],
'id': '02a66935-f631-43cf-9788-15ef1e19f28a',
'name': u'D\xf8gma',
'sort-name': u'D\xf8gma'},
{'alias-list': ['Dogma'],
'id': '5839ff7d-88af-45c6-be93-a8f29b276f70',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': 'a6746c54-bdbc-4691-b8f5-8dabfab788cd',
'life-span': {'begin': '1996', 'end': '2003'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': 'dfd4ed8a-5626-4826-97ba-22905a9e22ba',
'life-span': {'end': '1996'},
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '87373e74-74ca-4a0e-af24-2e17ab83f6f5',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '30fad333-2d95-4650-b27e-7c3147254105',
'name': 'Dogma',
'sort-name': 'Dogma',
'type': 'Group'},
{'alias-list': ['Dogma'],
'id': '66b7aa34-3117-42d7-b108-942ba99ba30b',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma'],
'id': '2b582ed9-2776-4f9f-9895-3ee0e9962f8e',
'name': 'Dogma',
'sort-name': 'Dogma'},
{'alias-list': ['Dogma', 'Dogma Crew'],
'country': 'ES',
'id': 'c98ecf9f-5572-4317-b15a-79cde78698ac',
'name': 'Dogma Crew',
'sort-name': 'Dogma Crew',
'type': 'Group'},
{'alias-list': ['Dogma 3000'],
'id': '8712f8c3-8f82-4f5a-a1f1-5702651f497a',
'name': 'Dogma 3000',
'sort-name': 'Dogma 3000'},
{'alias-list': ['Dogma Cats'],
'country': 'GB',
'id': '236df439-6f5c-4280-bf8a-40ac44448350',
'name': 'Dogma Cats',
'sort-name': 'Dogma Cats',
'tag-list': [{'count': '1', 'name': 'uk'},
{'count': '1', 'name': 'england'},
{'count': '1', 'name': 'cambridge'}],
'type': 'Group'},
{'alias-list': ['Hot Dogma'],
'id': 'f41d67c5-a2e5-4a25-af96-39a91b72693b',
'life-span': {'begin': '2010'},
'name': 'Hot Dogma',
'sort-name': 'Hot Dogma',
'type': 'Group'},
{'alias-list': ['Dogma 1'],
'id': '1a328d22-a2d4-43c6-9a92-489c23e2e042',
'name': 'Dogma 1',
'sort-name': 'Dogma 1'},
{'alias-list': ['The Dogma'],
'country': 'IT',
'id': '87067d59-89cf-4549-8d8e-28f503a563fe',
'life-span': {'begin': '1999'},
'name': 'The Dogma',
'sort-name': 'Dogma, The',
'type': 'Group'},
{'alias-list': ['Dogma Hollow'],
'id': 'ba8833ab-56bc-4c1f-8a60-a077c30d8a51',
'name': 'Dogma Hollow',
'sort-name': 'Dogma Hollow'},
{'alias-list': ['Dogma and The Afro-Cubans Rhythms',
'Dogma & The Afro-Cuban Rhythms'],
'id': 'cf264d63-a810-4ce0-8357-3b6a513cd7a2',
'name': 'Dogma & The Afro-Cuban Rhythms',
'sort-name': 'Dogma & The Afro-Cuban Rhythms',
'tag-list': [{'count': '1', 'name': 'splitme'}],
'type': 'Group'},
{'id': '5a73a61e-a9bc-4dfe-83e1-756e842c616b',
'name': 'Falso Dogma',
'sort-name': 'Falso Dogma',
'type': 'Group'}]}

A third problem is that if I do:
m.search_artists('Dogma', limit = 1, {'tags':'uk', 'tags':'england', 'country':'GB'})
I obtain this error:
SyntaxError: non-keyword arg after keyword arg (, line 1)
This turns out to be standard Python, not a bug in the module: positional arguments cannot follow keyword arguments, and the extra fields collected by **fields must be passed as keyword arguments.
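
So the correct call passes the fields as keyword arguments. A minimal sketch, assuming the field names from the MB artist search documentation ('tag' in the singular, and 'country'); since a dict key can hold only one value, only one tag is passed here:

m.search_artists('Dogma', limit=1, tag='england', country='GB')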

I’ve been taking a closer look at the syntax for doing advanced queries in MB, and it is possible to create complex queries such as:

Advanced query syntax : dogma (comment:chile*) (country:CL)

or in the web-browser:

http://musicbrainz.org/search?query=dogma+%28comment%3Achile*%29+%28country%3ACL%29&type=artist&limit=25&advanced=1

This returns the Chilean ‘Dogma’ entry.

So I will try to replicate this syntax in the queries within my scripts; a sketch follows the example URL:

http://musicbrainz.org/search?query=supernova+(comment:chile*)+(country:CL)&type=artist&limit=5&advanced=1
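
A minimal sketch of replicating this from a script by building the ws/2 URL directly (the User-Agent string is a placeholder; the bindings could also be used by passing the whole Lucene expression as the query argument):

from urllib.parse import quote
from urllib.request import Request, urlopen

query = 'supernova (comment:chile*) (country:CL)'
url = 'http://musicbrainz.org/ws/2/artist/?query=%s&limit=5' % quote(query)
xml = urlopen(Request(url, headers={'User-Agent': 'bdmc-merge/0.1'})).read()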

Merging artist names between BDMC and musicapopular

I have done preliminary testing on merging my artist name data from the BDMC and MB, with 17% of artists recognized. I am also searching the MB aliases of the artists, and I am comparing the lowercase versions of the names in order to avoid differences in capitalization.

Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I have been comparing the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST I obtained:

  • There is a total of 4313 artists
  • 468 artist names are recognized in both databases
  • 3845 appear in only one of the two databases

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in only one of the two lists
  • 5712 artists in total

What is the difference between the data in PEOPLE_ARTIST and ARTIST_INFO? When comparing both lists, there are:

  • 1636 artists in both lists
  • 1723 in only one of the two lists

Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but does not appear in PEOPLE_ARTIST. I checked the methods and realized/remembered that PEOPLE_ARTIST was built to extract all the people who have worked over the years in a specific group, which is why ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file. In other words, for future work comparing files, the ARTIST_INFO .txt file must be used.

 
Also, there are no matched artists with accented characters, so there is something wrong with the way the two lists, from musicapopular and the BDMC, are encoded.

I tested the BDMC file, exported as a Windows-formatted, tab-delimited .txt file, and the ARTIST_INFO .txt file. Their encodings were different: I figured out that the BDMC file was encoded as CSISOLATIN1, and ARTIST_INFO as UTF-8 (see previous post). Now that the two files share the same encoding, we obtained much better numbers:

  • 973 artists in both lists
  • 4556 in either one of the two lists

The merging problem (by Thierry Bertin-Mahieux)

  • When you integrate different sources of data you start to add error, because the data you match across sources differs by some percentage (from Jamendo to MusicBrainz, e.g., 10% error; from MB to DBpedia, e.g., 30%).
  • MB by Thierry: “MB is a database of music knowledge”
  • for matching artists from different sources (sketched after this list):
    • make everything lowercase, remove spaces, and create all possible comparisons
    • Aerosmith – Run D.M.C. -> aerosmith run dmc -> aerosmithrundmc, etc.
  • “Matching is imperfect, period” (Thierry Bertin-Mahieux), so we have big issues to solve on big databases:
    • Improve the matching algorithms
    • Deal with the noise
  • Also, matching is a trade-off, so start merging!
  • Under-merge: we might miss connections with other resources
  • Over-merge: we no longer know which artist is which
  • Maybe something to try would be to use fingerprinting in order to find matches (G. Tzanetakis)
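
A minimal sketch of the normalization Bertin-Mahieux describes, with a helper that also generates the individual parts of a joint credit (function names are ours):

import re

def squash_name(name):
    # lowercase and keep only letters and digits
    return re.sub(r'[^a-z0-9]', '', name.lower())

def variants(credit):
    # split joint credits on dashes, ampersands, commas, and slashes
    parts = [p.strip() for p in re.split(r'[–&,/]| - ', credit) if p.strip()]
    forms = {squash_name(credit)}
    forms.update(squash_name(p) for p in parts)
    return forms

print(variants(u'Aerosmith – Run D.M.C.'))
# {'aerosmithrundmc', 'aerosmith', 'rundmc'}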

Unicode equivalencies

It has been hard to work with Spanish characters (such as accented vowels and ñ) when scraping the websites and working with a MySQL database. These are some of the Unicode equivalences for different characters:

\xc3: it seems that this one refers to the UTF-8 encoding (it is the lead byte of a two-byte sequence), so part of the conversion scheme is:

\xe1 á
\xe9 é
\xed í
\xf3 ó
\xfa ú
\xf1 ñ

When printing these characters to the screen, Python does the job; however, when writing a .csv file it is not able to handle the actual characters, so I need to check whether MySQL is able to convert those codes to the actual characters.

It seems that the best thing to do will be to export the UTF-8 file and then run a script over it. By doing this I will be able to populate the database with the proper encoding (a sketch follows).
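
A minimal sketch of writing the names out with an explicit encoding (io.open works the same on Python 2 and 3; the file name is hypothetical):

import io

with io.open('artists_utf8.csv', 'w', encoding='utf-8') as f:
    f.write(u'Gonzalo Yáñez\n')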

I have been doing more research in the encodings and found this behaviour with the data coming from both databases:

mus_pop : character : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1

It should be noted that the mus_pop data was encoded as UTF-8, but I did not know what encoding Excel was using for the BDMC file.

After lots of digging, I found the iconv tool for converting to and from different encodings, and figured out that the encoding Excel uses for tab-delimited, Windows-formatted files is CSISOLATIN1 (an alias of ISO-8859-1). So the command line for converting the files exported by Excel is:

iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt
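
The same conversion can be done in Python, for reference (Python calls this encoding 'latin-1'; the file names are placeholders):

import io

with io.open('input_file.txt', encoding='latin-1') as src, \
     io.open('output_file.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())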