Merging artist names between BDMC and musicapopular

I have been done a preliminary testing in merging my artist name data from the BDMC and MB with a 17% of recognized artists. I am also searching in the MB alias of the artists, and I am comparing the lowercase version of the names in order to avoid different capitalization styles.

Before continuing this research on name-entities, which is a hot one in the [Music-IR] list, I have been comparing the artist name data between musicapopular and the BDMC. Using the scrapped data in PEOPLE_ARTIST I obtained:

  • There is a total of 4313 artists
  • 468 artist names are recognized in both databases
  • 3845 appear in one or the other database

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in either one of the two lists
  • 5712 total artist

What is the difference of the data in PEOPLE_ARTIST and ARTIST_INFO? When comparing both lists, there are:

  • 1636 artist in both lists
  • 1723 in either one of the two lists
Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but it does not appear in PEOPLE_ARTIST.
I checked the methods and realized/remembered than PEOPLE_ARTIST was done for extracting all the people that has worked over the years in an specific group, so that is way ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file.
In other words, for future work comparing files the ARTIST_INFO .txt file must be used.

 

Also, there are no intersected artists with accents, so there is something wrong with the way both lists, from musicapopular and BDMC, are encoded.

I tested the BDMC file, exported as a windows-formatted, tab-delimited txt file, and the ARTIST_INFO txt file. The encoders for each one of them was different. I figured out that the encoder for the BDMC file was CSISOLATIN1, and for ARTIST_INFO was UTF-8 (see previous post). So now, the two files have the some coding scheme and we obtained a much better:

  • 973 artists in both lists
  • 4556 in either one of the two lists

Leave a Reply

Your email address will not be published. Required fields are marked *