Merging artist names between BDMC and musicapopular

I have done some preliminary testing on merging my artist name data from the BDMC with MB, and 17% of the artists were recognized. I am also searching the MB aliases of the artists, and I compare the lowercase versions of the names to avoid mismatches caused by different capitalization styles.
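
The lowercase-and-alias comparison described above can be sketched roughly as follows. This is only an illustration: the `normalize` and `match_artists` helpers, the dictionary layout for the MB data and the sample names are all assumptions made for the example, not the code actually used.

```python
def normalize(name):
    """Lowercase and trim a name so capitalization differences do not matter."""
    return name.strip().lower()


def match_artists(local_names, mb_artists):
    """Return {local name: matched MB canonical name} for recognized artists.

    mb_artists is assumed to be a dict of canonical name -> list of aliases
    (a hypothetical layout chosen for this sketch).
    """
    lookup = {}
    for canonical, aliases in mb_artists.items():
        for candidate in [canonical, *aliases]:
            # Keep the first canonical name seen for each normalized form.
            lookup.setdefault(normalize(candidate), canonical)
    return {name: lookup[normalize(name)]
            for name in local_names if normalize(name) in lookup}


# Made-up sample data, for illustration only.
mb_sample = {"Los Ejemplos": ["Ejemplos", "LOS EJEMPLOS DE PRUEBA"]}
local_sample = ["los ejemplos", "EJEMPLOS", "Grupo Desconocido"]

matches = match_artists(local_sample, mb_sample)
recognized_ratio = len(matches) / len(local_sample)
```

Here two of the three made-up names are recognized, one through the canonical name and one through an alias.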

Before continuing this research on named entities, which is a hot topic on the [Music-IR] list, I have been comparing the artist name data between musicapopular and the BDMC. Using the scraped data in PEOPLE_ARTIST I obtained:

  • There is a total of 4313 artists
  • 468 artist names are recognized in both databases
  • 3845 appear in only one of the two databases
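
Since 468 + 3845 = 4313, the three figures above behave like the intersection, symmetric difference and union of two name sets. A minimal sketch of that kind of count, assuming each database is reduced to a plain list of names (the `compare_name_sets` helper and the sample names are hypothetical):

```python
def compare_name_sets(names_a, names_b):
    """Count names in both lists, in only one list, and in total."""
    a = {name.strip().lower() for name in names_a}
    b = {name.strip().lower() for name in names_b}
    return {
        "total": len(a | b),     # union: every distinct name
        "both": len(a & b),      # intersection: recognized in both
        "only_one": len(a ^ b),  # symmetric difference: in one or the other
    }


# Made-up sample data, for illustration only.
counts = compare_name_sets(
    ["Artista Uno", "Artista Dos", "artista tres"],
    ["ARTISTA DOS", "Artista Tres", "Artista Cuatro"],
)
```

With sets, `total` is always `both + only_one`, which gives a quick sanity check on any reported counts.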

However, when using ARTIST_INFO:

  • 764 in both databases
  • 4975 in either one of the two lists
  • 5712 artists in total

What is the difference between the data in PEOPLE_ARTIST and ARTIST_INFO? When comparing both lists, there are:

  • 1636 artists in both lists
  • 1723 in either one of the two lists
Also, an artist like ‘Marco Aurelio’ appears in ARTIST_INFO but does not appear in PEOPLE_ARTIST.
I checked the methods and realized (or rather remembered) that PEOPLE_ARTIST was built to extract all the people who have worked over the years in a specific group, which is why ‘Marco Aurelio’ does not appear in the PEOPLE_ARTIST file.
In other words, for future work comparing files, the ARTIST_INFO .txt file must be used.

Also, no artist whose name contains an accent appears in the intersection, so there is something wrong with the way the two lists, from musicapopular and BDMC, are encoded.

I tested the BDMC file, exported as a Windows-formatted, tab-delimited txt file, and the ARTIST_INFO txt file. The two files turned out to use different encodings: CSISOLATIN1 for the BDMC file and UTF-8 for ARTIST_INFO (see previous post). Now that the two files share the same encoding scheme, we obtain a much better result:

  • 973 artists in both lists
  • 4556 in either one of the two lists
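
The effect of the mismatched encodings can be reproduced on a small scale. The sketch below assumes one name per line (the real files are tab-delimited) and uses Python's `latin-1` codec, which is ISO-8859-1, the character set that CSISOLATIN1 names. The `read_names` helper, the file names and the sample name are made up for the example.

```python
import os
import tempfile


def read_names(path, encoding):
    """Read one artist name per line, decoding with the given encoding."""
    with open(path, encoding=encoding) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Two tiny sample files holding the same made-up name in different encodings.
tmp_dir = tempfile.mkdtemp()
bdmc_path = os.path.join(tmp_dir, "bdmc_sample.txt")
info_path = os.path.join(tmp_dir, "artist_info_sample.txt")
with open(bdmc_path, "wb") as handle:
    handle.write("Señor Ejemplo\n".encode("latin-1"))
with open(info_path, "wb") as handle:
    handle.write("señor ejemplo\n".encode("utf-8"))

# Compared as raw bytes, the accented name never matches.
with open(bdmc_path, "rb") as a, open(info_path, "rb") as b:
    raw_match = a.read().lower() == b.read().lower()

# Decoding each file with its own encoding makes the names comparable.
shared = read_names(bdmc_path, "latin-1") & read_names(info_path, "utf-8")
```

Here `raw_match` is false while `shared` contains the name, which mirrors the jump in intersected artists once both files were brought to the same encoding.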

Encoding scheme

I have been trying different encodings across the applications that I am using, and the one that works best is Western (MacRoman). With Spanish text, it displays all the accents, special characters and the ñ correctly. Hopefully I will be able to use it in the database.

It has also been hard to figure out how to properly export files from the spreadsheet. So far, the best file format and encoding option has been ‘windows_formatted’.

Unicode equivalencies

It has been hard to work with Spanish characters (such as accented vowels and ñ) when scraping the websites and working with a MySQL database. These are some of the Unicode equivalences for different characters:

\xc3 seems to come from the UTF-8 encoding, where it is the first byte of the two-byte sequence for each of these characters, so part of the conversion scheme is:

\xe1 á
\xe9 é
\xed í
\xf3 ó
\xfa ú
\xf1 ñ

When printing these characters to the screen, Python does the job. However, when writing a .csv file it is not able to handle the actual characters, so I need to check whether MySQL is able to convert those codes to the actual characters.
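
One possible way around the .csv problem is to state the encoding explicitly when opening the file. This is a minimal sketch that assumes Python 3, which may not be the version used at the time; the file name and the row contents are made up.

```python
import csv
import os
import tempfile

rows = [["name"], ["Señor Ejemplo"]]  # made-up data, for illustration only

csv_path = os.path.join(tempfile.mkdtemp(), "artists_sample.csv")

# Declare the encoding so accented characters are written as UTF-8 bytes.
# newline="" is what the csv module documentation recommends for its files.
with open(csv_path, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(csv_path, encoding="utf-8", newline="") as handle:
    round_trip = list(csv.reader(handle))

with open(csv_path, "rb") as handle:
    raw_bytes = handle.read()
```

After this, `raw_bytes` holds ñ as the two bytes `\xc3\xb1`, and `round_trip` gives back the original rows.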

It seems that the best thing to do will be to export the file as UTF-8 and then run a script over it. By doing this I will be able to populate the database with the proper encoding.

I have been doing more research into the encodings and found this behaviour in the data coming from the two databases:

mus_pop : char : BDMC
\xc3\xa1 : á : \xe1
\xc3\xa9 : é : \xe9
\xc3\xad : í : \xed
\xc3\xb3 : ó : \xf3
\xc3\xba : ú : \xfa
\xc3\xb1 : ñ : \xf1
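
The two byte columns above can be reproduced by encoding each character as UTF-8 and as ISO-8859-1 (`latin-1` in Python). A short sketch, assuming Python 3; the `table` variable name is just for the example.

```python
# Each entry: (bytes as UTF-8, the character, bytes as ISO-8859-1).
table = [(char.encode("utf-8"), char, char.encode("latin-1"))
         for char in "áéíóúñ"]

for utf8_bytes, char, latin1_bytes in table:
    print(utf8_bytes, ":", char, ":", latin1_bytes)

# In ISO-8859-1 the single byte equals the Unicode code point of the character.
single_byte_is_code_point = all(
    latin1_bytes[0] == ord(char) for _, char, latin1_bytes in table
)
```

The single-byte column is therefore just the Unicode code point, while the UTF-8 column spends two bytes per character, starting with `\xc3`.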

It should be noted that the mus_pop data was encoded as UTF-8, but I don’t know which encoding Excel is using for the BDMC data.

After lots of digging, I found the iconv library for converting between different encodings, and figured out that the tab-delimited, Windows-formatted files exported by Excel are encoded as CSISOLATIN1. So the command line for converting the files exported by Excel is:

iconv -t UTF8 -f CSISOLATIN1 < ./input_file.txt > ./output_file.txt
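
The same conversion could also be done from a script. The following is a rough Python 3 equivalent of the command above, not a replacement for it: the `convert_encoding` helper and the sample file contents are hypothetical, and `latin-1` is used as Python's name for ISO-8859-1.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path,
                     src_encoding="latin-1", dst_encoding="utf-8"):
    """Re-encode a text file line by line, leaving line endings untouched."""
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Made-up sample: one tab-delimited line with Windows line endings.
tmp_dir = tempfile.mkdtemp()
input_path = os.path.join(tmp_dir, "input_file.txt")
output_path = os.path.join(tmp_dir, "output_file.txt")
with open(input_path, "wb") as handle:
    handle.write(b"Jos\xe9\tEjemplo\r\n")

convert_encoding(input_path, output_path)

with open(output_path, "rb") as handle:
    converted = handle.read()
```

Passing `newline=""` on both sides keeps the Windows line endings exactly as they were, so only the character encoding changes.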