String comparison metrics

We are comparing six different metrics, based on the Levenshtein ratio and the Jaro similarity, for the string comparison between the BDMC and MB. These metrics are (a sketch of how they can be computed follows the list):

  • Levenshtein ratio on the original strings
  • Jaro similarity on the original strings
  • Levenshtein ratio on the ASCII-transliterated strings
  • Jaro similarity on the ASCII-transliterated strings
  • Levenshtein ratio on the lowercase, no-spaces, ASCII strings
  • Jaro similarity on the lowercase, no-spaces, ASCII strings
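A minimal sketch of how these six values could be computed, assuming the python-Levenshtein package (which provides `ratio` and `jaro`) and the `unidecode` package for the ASCII transliteration; the actual scripts may use other tools:

```python
# Sketch: the six similarity values for a pair of artist names.
# Assumes python-Levenshtein (ratio, jaro) and unidecode for ASCII transliteration.
import Levenshtein
from unidecode import unidecode

def normalize(name, ascii_only=False, fold=False):
    """Return the name as-is, ASCII-transliterated, or lowercased/space-stripped ASCII."""
    if ascii_only or fold:
        name = unidecode(name)
    if fold:
        name = name.lower().replace(' ', '')
    return name

def six_metrics(a, b):
    """The six similarity values between two artist names, in the order listed above."""
    pairs = [
        (a, b),                                                           # original strings
        (normalize(a, ascii_only=True), normalize(b, ascii_only=True)),   # ASCII strings
        (normalize(a, fold=True), normalize(b, fold=True)),               # lowercase, no-spaces, ASCII
    ]
    values = []
    for x, y in pairs:
        values.append(Levenshtein.ratio(x, y))
        values.append(Levenshtein.jaro(x, y))
    return values

print(six_metrics('Gonzalo Yáñez', 'Gonzalo Yanez'))
```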

For the actual comparison, we will create a known (ground-truth) dataset and measure precision and recall for the six metrics. But how large should this dataset be? A rule of thumb is n = (100/x)^2, where x is the desired margin of error in percent. However, this holds for an infinite population, so we should apply a ‘finite population correction’ (FPC),

FPC = sqrt((N − n)/(N − 1)), where N is the population size and n is the sample size.
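A minimal sketch of these two formulas, assuming as example inputs the N = 3142 distinct artists mentioned later in these notes and a 5% margin of error:

```python
# Sketch: rule-of-thumb sample size plus the finite population correction.
from math import sqrt

N = 3142          # population size (distinct artists), taken from later in these notes
x = 5.0           # desired margin of error, in percent (example value)

n0 = (100.0 / x) ** 2                 # rule of thumb for an infinite population -> 400
fpc = sqrt((N - n0) / (N - 1))        # FPC factor as defined above
n_adj = n0 / (1 + (n0 - 1) / N)       # one common way to fold the correction into n

print(n0, round(fpc, 3), round(n_adj))   # 400.0 0.934 355
```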

There are three interesting things we should look at:

  1. How many artists have an exact match (i.e., they are already in the database)
  2. How many artists do not match (i.e., they are not in the database)
  3. How many artists match partially. Among these, we need to find the threshold that yields the best precision and recall, and then, using the bootstrapping technique, create error bars for both metrics.

Although we have realized that the threshold for the string matching is located around 0.88 (for the Levenshtein ratio on the lowercase, no-spaces, ASCII strings), we are running a query test with a more *generous* threshold. Later, we will extract from there the subset used to calculate the threshold that gives the best precision and recall values (or the best combined F-score, F = 2·P·R / (P + R)).

The experiment we are thinking about has the following steps:

  1. Create a ground-truth dataset. This subset will be randomly chosen among all artist names that actually exist in the BDMC and MB. As mentioned before, the size of this sample should be somewhere between 100 and 400 entries.
  2. Manually look up the correct MBID for each of these entries.
  3. Write a script that calculates precision and recall for the six metrics over all these entries (a sketch follows this list).
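A minimal sketch of such a script for one metric at one threshold, assuming each manually checked entry carries the metric score, a hand-labelled flag saying whether the retrieved MB artist is really the same artist, and a flag saying whether the artist has an MB entry at all (these field names are hypothetical):

```python
# Sketch: precision, recall, and F-score for one metric at one threshold.
# Each entry is assumed to look like
#   {"score": 0.91, "correct_artist": True, "in_mb": True}
# where "correct_artist" is the manual judgement and "in_mb" says whether the
# artist really has an MB entry; these field names are hypothetical.

def precision_recall(entries, threshold):
    tp = sum(1 for e in entries if e["score"] >= threshold and e["correct_artist"])
    fp = sum(1 for e in entries if e["score"] >= threshold and not e["correct_artist"])
    fn = sum(1 for e in entries if e["score"] < threshold and e["in_mb"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score
```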

 

STEPS

  1. We created a script that takes random samples from the entries with distance values between 0.75 and 1.0 and whose country is ‘Chile’ or ‘None’. We randomly chose 400 entries so that we can later discard those that are not in MB (in fact, this is the maximum number of entries satisfying those constraints). A sampling sketch follows this list.
  2. We are marking as RED those entries from the BDMC that are not in MB, GREEN those that are already in MB, and YELLOW those that have a wrong entry in MB (false positives: same name but a different artist, so they should be counted as RED). To check false positives and negatives we use the data from musicapopular and the BDMC. The numbers we got are:
    • GREEN: 179
    • YELLOW: 98
    • RED: 123
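A minimal sketch of the sampling step, assuming the per-artist results live in a CSV file with hypothetical `distance` and `country` columns:

```python
# Sketch: draw the 400-entry sample used for the manual RED/GREEN/YELLOW check.
# Assumes a results file with (hypothetical) columns "artist", "distance", "country".
import csv
import random

with open("bdmc_mb_matches.csv", newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))

candidates = [
    r for r in rows
    if 0.75 <= float(r["distance"]) <= 1.0 and r["country"] in ("Chile", "None")
]

random.seed(0)                                    # reproducible sample
sample = random.sample(candidates, min(400, len(candidates)))
```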

We have noticed several interesting facts:

  • There is a large number of artists within the Folklore genre. Most of these entries belong to compilation releases from Various Artists; hence, most of these artists have just one or a few works associated with them.
  • There is a large number of false positives among artists with very common names such as Rachel, Quorum, Criminal, Polter, Rock Hudson, Trilogía, Twilight, and many others. The only way to determine whether a match is a true or false positive is to research the releases and works of the artist. Fortunately, we have plenty of information from several websites to determine whether the artist already has an entry, and we can also check whether there is any reference to Chile in the MB entry, in the release country of a release, or in the relationship fields.

After some trial and error, we have changed our script, and we are now running another pass over all artists, without any threshold, for the six metrics we are comparing. The queries are also now properly done: an artist like ‘Gonzalo Yáñez’ is correctly encoded in the URL of the query to MB. We think that with this approach we will be able to compare all the metrics at once. Once this was done for all 3142 distinct artists, we again filtered the entries with values in [0.75, 1[, but we left the entries with wrong countries in the set (we cannot mix apples and oranges).
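A minimal sketch of the query encoding, using Python's standard urllib; the endpoint shown is the public MusicBrainz /ws/2 artist search, and the exact query our script builds may differ:

```python
# Sketch: building a properly percent-encoded MusicBrainz artist search URL
# for a name with non-ASCII characters such as 'Gonzalo Yáñez'.
from urllib.parse import quote

def mb_artist_search_url(name):
    # Lucene-style artist query against the public MB web service.
    return "https://musicbrainz.org/ws/2/artist/?query=artist:" + quote(name)

print(mb_artist_search_url("Gonzalo Yáñez"))
# https://musicbrainz.org/ws/2/artist/?query=artist:Gonzalo%20Y%C3%A1%C3%B1ez
```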

The settings for this latest approach gave us 335 artists in the range [0.75, 1[, 464 artists with a value of exactly 1, and 2344 in the range [0, 0.75[. We considered artists correctly retrieved as ‘true positives’, those with the same name but referring to another artist as ‘false positives’, and those wrongly retrieved as ‘false negatives’. This selection should be discussed. The first plots are as follows:

It is strange that in plots 3, 4, 5, and 6 the recall keeps growing while the precision decreases only slightly. We think there is something wrong with the choice of the true and false positives, and of the true and false negatives.
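As a point of comparison, here is a sketch of a threshold sweep using the textbook labelling (TP: above threshold and the correct artist; FP: above threshold but a different artist; FN: the artist has an MB entry but is not matched above the threshold), which is only one possible choice and not necessarily the one our script should use. It reuses the hypothetical entry format and precision_recall() from the earlier sketch; under this labelling, raising the threshold moves matches out of TP and into FN, so recall should fall as the threshold grows.

```python
# Sketch: sweep thresholds and collect precision/recall curves for one metric,
# reusing precision_recall() and the hypothetical entry format from above.

def sweep(entries, steps=26):
    thresholds = [0.75 + i * (1.0 - 0.75) / (steps - 1) for i in range(steps)]
    return [(t, *precision_recall(entries, t)) for t in thresholds]

# for t, p, r, f in sweep(entries):
#     print(f"{t:.2f}  precision={p:.3f}  recall={r:.3f}  f={f:.3f}")
```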

We have been designing a third approach for analyzing these data. The first part of this approach has to do with how many Chilean artists are in the database and how well the algorithm performs there. Things to calculate:

1. Recall on just the ‘ones’ for the different thresholds

2. Recall on the ‘ones’ and ‘twos’

But for other ‘real-world’ applications, where string matching could be used, we will:

3. Calculate precision and recall considering ‘twos’ as correct matches (the string matching algorithm did the job),

4. Calculate precision and recall considering ‘twos’ as incorrect matches.

Moreover, to calculate the error we will use the bootstrapping technique: we create a number of new populations by resampling from our sample population. In other words, if our sample population has 380 entries, we will create 1000 populations by sampling from it with replacement (this means each new population can contain duplicate entries; otherwise we would get the same population again and again). For each bootstrap population we compute the metric, then we discard the 25 lowest and 25 highest values, and the remaining range gives us our error bounds for a 95% confidence interval.
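A minimal sketch of this bootstrap, reusing the hypothetical entry format and precision_recall() from above and taking recall as the metric of interest:

```python
# Sketch: percentile bootstrap for the error bars on recall.
# Resamples the labelled sample WITH replacement, recomputes recall each time,
# and discards the 25 lowest and 25 highest of the 1000 values (95% interval).
import random

def bootstrap_recall(entries, threshold, n_boot=1000, seed=0):
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        resample = [rng.choice(entries) for _ in range(len(entries))]
        _, recall, _ = precision_recall(resample, threshold)
        values.append(recall)
    values.sort()
    return values[25], values[-26]   # lower and upper bounds of the 95% interval

# lo, hi = bootstrap_recall(entries, threshold=0.88)
```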