Merging artist names between BDMC and musicapopular

I have done some preliminary testing on merging my artist name data from the BDMC with MusicBrainz (MB): 17% of the artists were recognized. I also search the MB aliases of each artist, and I compare the lowercase versions of the names to avoid mismatches caused by different capitalization styles.
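As a rough illustration of this kind of lookup, here is a minimal Python sketch. The record layout (a `name` plus a list of `aliases`) and the sample names are assumptions made for the example; they are not the real MB schema nor the script I actually ran.

```python
def build_index(mb_artists):
    """Map every lowercase name and alias to the canonical MB name."""
    index = {}
    for artist in mb_artists:
        for label in [artist["name"], *artist.get("aliases", [])]:
            # setdefault keeps the first artist seen for a given label
            index.setdefault(label.strip().lower(), artist["name"])
    return index


def match_artists(bdmc_names, mb_artists):
    """Return {bdmc_name: mb_name} for every BDMC name found in MB."""
    index = build_index(mb_artists)
    return {
        name: index[name.strip().lower()]
        for name in bdmc_names
        if name.strip().lower() in index
    }


# Hypothetical sample data, only to show the shape of the input.
mb_sample = [
    {"name": "Los Jaivas", "aliases": ["Jaivas"]},
    {"name": "Violeta Parra", "aliases": []},
]
bdmc_sample = ["LOS JAIVAS", "jaivas", "Inti-Illimani"]

matches = match_artists(bdmc_sample, mb_sample)
recognized = len(matches) / len(bdmc_sample)  # share of recognized names
```

Indexing the aliases once keeps the lookup cheap, at the cost of silently keeping only the first artist when two artists share a label.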

Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I compared the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST I obtained:

  • There are 4313 artists in total
  • 468 artist names are recognized in both databases
  • 3845 appear in only one of the two databases
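These three numbers are plain set operations: the intersection, the symmetric difference, and the union of the two name lists, so the first two always add up to the third (468 + 3845 = 4313). A minimal sketch, assuming each source is just a list of name strings (the function name and the sample names are mine, for illustration only):

```python
def compare_name_lists(names_a, names_b):
    """Count names in both lists, in only one of them, and overall."""
    set_a = {n.strip().lower() for n in names_a}
    set_b = {n.strip().lower() for n in names_b}
    return {
        "both": len(set_a & set_b),      # intersection
        "only_one": len(set_a ^ set_b),  # symmetric difference
        "total": len(set_a | set_b),     # union
    }


# Hypothetical sample lists standing in for the two databases.
counts = compare_name_lists(
    ["Los Prisioneros", "Los Tres", "La Ley"],
    ["los tres", "La Ley", "Congreso"],
)
```

Checking that `both + only_one == total` is a cheap sanity test for any such comparison.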

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in either one of the two lists
  • 5712 total artists

What is the difference between the data in PEOPLE_ARTIST and ARTIST_INFO? Comparing both lists, there are:

  • 1636 artists in both lists
  • 1723 in either one of the two lists
Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but not in PEOPLE_ARTIST.
I checked the methods and realized (or rather remembered) that PEOPLE_ARTIST was built to extract all the people who have worked in a specific group over the years, which is why ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file.
In other words, future work comparing files must use the ARTIST_INFO .txt file.

 

Also, no artist with accents in the name appears in the intersection, so there is something wrong with the way both lists, from musicapopular and BDMC, are encoded.

I tested the BDMC file, exported as a Windows-formatted, tab-delimited txt file, and the ARTIST_INFO txt file. The two files used different encodings: I figured out that the BDMC file was CSISOLATIN1 (that is, ISO-8859-1) and ARTIST_INFO was UTF-8 (see previous post). So now the two files have the same encoding scheme and I obtained much better numbers:

  • 973 artists in both lists
  • 4556 in either one of the two lists
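To show why the accented names never matched, here is a small self-contained Python sketch. The two temporary files and their contents are made up for the example, and I assume one name per line rather than the real tab-delimited layout.

```python
import tempfile
from pathlib import Path


def read_names(path, encoding):
    """Read one artist name per line, decoded with the given encoding."""
    text = Path(path).read_text(encoding=encoding)
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


# Two tiny files with hypothetical content that mimic the situation:
# the same names, stored as ISO-8859-1 in one file and UTF-8 in the other.
tmp = Path(tempfile.mkdtemp())
bdmc_file = tmp / "bdmc.txt"
info_file = tmp / "artist_info.txt"
bdmc_file.write_bytes("Víctor Jara\r\nLos Tres\r\n".encode("iso-8859-1"))
info_file.write_bytes("Víctor Jara\nLos Tres\n".encode("utf-8"))

# Wrong: reading both files with one codec mangles the accented name.
wrong = read_names(bdmc_file, "iso-8859-1") & read_names(info_file, "iso-8859-1")

# Right: decode each file with its own encoding, then compare.
right = read_names(bdmc_file, "iso-8859-1") & read_names(info_file, "utf-8")
```

With a single codec only the unaccented name survives the intersection; decoding each file with its own encoding recovers both.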

The merging problem (by Thierry B Mahieux)

  • When you integrate different sources of data you start adding error, because the records you match across sources differ by some percentage (e.g., 10% error from Jamendo to MusicBrainz, 30% from MB to DBpedia)
  • MB by Thierry: “MB is a database of music knowledge”
  • for matching artists from different sources:
    • make everything lowercase, remove spaces, and create all possible comparisons
    • Aerosmith – Run D.M.C -> aerosmith run dmc -> aerosmithrundmc, etc.
  • “Matching is imperfect, period” (Thierry B Mahieux), so we have big issues to solve on big databases:
    • Improve the matching algorithms
    • Deal with the noise
  • Also, matching is a trade-off, so start merging!
  • Under-merge: we might miss connections with other resources
  • Over-merge: we don’t know which artist is which anymore
  • Maybe something to try would be to use fingerprinting in order to find matches (G. Tzanetakis)
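The lowercase-and-squash idea from the list above can be sketched in a few lines of Python. This is my own reading of it, not the code used in the talk: the function names are hypothetical, and I also drop punctuation, which the Aerosmith example implies.

```python
import re
from collections import defaultdict


def match_key(name):
    """Lowercase, then drop everything that is not a letter or digit."""
    return re.sub(r"[\W_]+", "", name.lower())


def group_by_key(names):
    """Group spellings that collapse to the same key."""
    groups = defaultdict(set)
    for name in names:
        groups[match_key(name)].add(name)
    return groups


# Hypothetical spellings coming from different sources.
candidates = [
    "Aerosmith – Run D.M.C",
    "aerosmith run dmc",
    "Aerosmith/Run-DMC",
    "Run DMC",
]
groups = group_by_key(candidates)
key = match_key("Aerosmith – Run D.M.C")
```

Grouping by key avoids comparing every pair of names, but the more aggressive the key, the more the result drifts toward over-merging.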

Very first numbers…

I was granted access to the BDMC (“La Base de Datos de la Música Chilena”, compiled by the SCD, the Chilean Copyright Society). Here are some numbers on the amount of information this database has:

bdmc

  • 40132 total songs
  • 32569 different songs (so, 7563 cover songs, or songs with the same name?)
  • 3342 different artists
  • 3085 different albums (some noise, though, as in the case of “Obras Sinfónicas en Vivo CD1” and “Obras Sinfónicas en Vivo CD2”, and some possible identical names between releases)
  • 79 different genres (tags)
  • 432 different record labels
However, there is some noise in this data because entries written in different styles appear as different items (e.g., “DJ Méndez y Yoan Amor” and “DJ Méndez – Yoan Amor”; “A ti”, “A Ti”, and “A tí”). The data needs to be normalized before further processing!
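One possible first normalization step, sketched in Python, is to fold case and strip accents so that the three spellings of “A ti” collapse into one. This is only an assumption about how the cleanup could start, and the function name `fold` is mine. It does not handle the “y” versus “–” separator case, which would need its own rule.

```python
import unicodedata


def fold(text):
    """Case-fold, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split())


titles = ["A ti", "A Ti", "A tí"]
variants = {fold(title) for title in titles}
```

A caveat: stripping every combining mark also turns “ñ” into “n”, so this folding should only be used as a comparison key, never to overwrite the stored names.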

It is interesting to see how the scope of the BDMC differs from that of other sources of Chilean music information, such as musicapopular.cl, mus.cl, portaldisc.cl, and vccl.tv. The BDMC only contains songs that have already generated copyright royalties for their authors, so most of the songs have received airplay.

I have already scraped the data from all the other sites; preliminary numbers are:

mus.cl

  • 502 album reviews
  • 332 interviews
  • 564 concert reviews

musicapopular.cl

  • 3353 artist biographies (I still need to extract the full discographies)

portaldisc.cl

  • 3634 album reviews (although there is some noise because there are some non-Chilean artists)

vccl.tv

  • 1661 videoclips