I have done some preliminary testing on merging my artist name data from the BDMC and MB, and 17% of the artists were recognized. I am also searching the MB aliases of each artist, and I compare lowercase versions of the names so that different capitalization styles do not prevent a match.
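The matching step described above can be sketched roughly as follows. This is only a minimal illustration, not the script I actually ran: the function names, the shape of the MB data (a mapping from canonical name to a list of aliases), and the sample names are all made up for the example.

```python
def normalize(name: str) -> str:
    """Lowercase and trim a name so capitalization differences do not matter."""
    return name.strip().lower()


def build_index(mb_artists: dict) -> dict:
    """Map every normalized MB name and alias to its canonical MB name."""
    index = {}
    for canonical, aliases in mb_artists.items():
        for variant in (canonical, *aliases):
            # setdefault keeps the first canonical name seen for a variant
            index.setdefault(normalize(variant), canonical)
    return index


def match_names(names: list, mb_artists: dict) -> dict:
    """Return {local name: canonical MB name} for every recognized name."""
    index = build_index(mb_artists)
    return {n: index[normalize(n)] for n in names if normalize(n) in index}


# Hypothetical sample data, not taken from the BDMC or MB.
sample_bdmc = ["LOS EJEMPLOS", "Grupo Prueba", "Nombre Desconocido"]
sample_mb = {"Los Ejemplos": [], "Conjunto Prueba": ["Grupo Prueba"]}

matches = match_names(sample_bdmc, sample_mb)
print(f"{len(matches)} of {len(sample_bdmc)} names recognized")
```

In this toy data the first name matches only because of lowercasing, and the second only because it appears as an alias.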
Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I compared the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST, I obtained:
- There is a total of 4313 artists
- 468 artist names are recognized in both databases
- 3845 appear in only one of the two databases
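These three counts correspond to plain set operations. The sketch below assumes that the last figure counts names found in exactly one database, which is consistent with 468 + 3845 = 4313; the function name and the sample names are hypothetical.

```python
def compare_name_lists(list_a, list_b):
    """Count names found in both lists, in exactly one list, and overall."""
    a = {name.strip().lower() for name in list_a}
    b = {name.strip().lower() for name in list_b}
    return {
        "both": len(a & b),      # intersection
        "only_one": len(a ^ b),  # symmetric difference
        "total": len(a | b),     # union
    }


# Hypothetical sample data for illustration only.
counts = compare_name_lists(
    ["Artista Uno", "Artista Dos", "Artista Tres"],
    ["artista dos", "Artista Cuatro"],
)
print(counts)
```

With this definition, "both" plus "only_one" always equals "total", which is a quick sanity check on any set of reported counts.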
However, when using ARTIST_INFO:
- 764 in both databases
- 4975 in either one of the two lists
- 5712 artists in total
What is the difference between the data in PEOPLE_ARTIST and ARTIST_INFO? Comparing the two lists gives:
- 1636 artists in both lists
- 1723 in either one of the two lists
Also, none of the matched artists have accented names, so something is wrong with the way the two lists, from musicapopular and the BDMC, are encoded.
I tested the BDMC file, exported as a Windows-formatted, tab-delimited txt file, and the ARTIST_INFO txt file. The two files used different encodings: the BDMC file was in CSISOLATIN1, and ARTIST_INFO was in UTF-8 (see previous post). Now that both files use the same encoding, we obtain a much better result:
- 973 artists in both lists
- 4556 in either one of the two lists
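The encoding mismatch explains why no accented names matched: the same accented character is stored as different bytes in ISO-8859-1 (of which CSISOLATIN1 is an alias) and in UTF-8, so a byte-level comparison fails. Below is a small sketch of the conversion, assuming the input really is ISO-8859-1; the function names and the file paths passed to `convert_file` are placeholders, not the ones I used.

```python
def to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src_path: str, dst_path: str,
                 source_encoding: str = "iso-8859-1") -> None:
    """Rewrite a text file as UTF-8 (paths are supplied by the caller)."""
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# The same accented letter has different byte representations.
latin1_bytes = "é".encode("iso-8859-1")  # b'\xe9'
utf8_bytes = "é".encode("utf-8")         # b'\xc3\xa9'

print(latin1_bytes == utf8_bytes)           # False: raw bytes differ
print(to_utf8(latin1_bytes) == utf8_bytes)  # True: equal after conversion
```

Names without accents are pure ASCII, which is encoded identically in both schemes, so those names could match even before the conversion.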