The merging problem (by Thierry Bertin-Mahieux)

  • When you integrate different data sources, every matching step adds error, because the sources only partly agree (e.g., 10% error from Jamendo to MusicBrainz (MB), and 30% from MB to DBpedia)
  • MusicBrainz, according to Thierry: “MB is a database of music knowledge”
  • To match artists from different sources:
    • lowercase everything, remove punctuation and spaces, and compare all the resulting variants
    • e.g. Aerosmith – Run D.M.C -> aerosmith run dmc -> aerosmithrundmc, etc.
  • “Matching is imperfect, period” (Thierry Bertin-Mahieux), so big databases leave us with two big issues to solve:
    • Improve the matching algorithms
    • Deal with the noise
  • Also, matching is a trade-off, so start merging!
    • Under-merge: we might miss connections with other resources
    • Over-merge: we don’t know which artist is which anymore
  • One thing to try might be audio fingerprinting as a way to find matches (suggested by G. Tzanetakis)
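
The name normalization described in the notes can be sketched roughly as follows. This is only an illustration, not the code used in the talk: the function names (`candidate_keys`, `match_artists`), the accent stripping, and the choice of exactly two key variants (spaced and squashed) are all assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict


def candidate_keys(name: str) -> set[str]:
    """Return normalized variants of an artist name (hypothetical helper).

    Assumption: two variants are enough for illustration, a spaced form
    ("aerosmith run dmc") and a squashed form ("aerosmithrundmc").
    """
    # Strip accents and other non-ASCII characters (an added assumption).
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then drop everything except letters, digits and spaces.
    cleaned = re.sub(r"[^a-z0-9 ]", "", ascii_only.lower())
    spaced = " ".join(cleaned.split())
    squashed = spaced.replace(" ", "")
    return {spaced, squashed}


def match_artists(source_a: list[str], source_b: list[str]) -> dict[str, list[str]]:
    """Map each name in source_a to the names in source_b sharing a key."""
    index = defaultdict(set)
    for name_b in source_b:
        for key in candidate_keys(name_b):
            index[key].add(name_b)

    matches = {}
    for name_a in source_a:
        found = set()
        for key in candidate_keys(name_a):
            found |= index.get(key, set())
        if found:
            matches[name_a] = sorted(found)
    return matches


if __name__ == "__main__":
    print(candidate_keys("Aerosmith – Run D.M.C"))
    print(match_artists(["Run-D.M.C.", "Björk"], ["Run DMC", "Bjork", "Aerosmith"]))
```

The trade-off from the notes shows up directly in a scheme like this: the more aggressively names are normalized, the more variants are linked (less under-merging), but two different artists whose names collapse to the same key become indistinguishable (over-merging).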

