Extracting and parsing spanish-formatted dates

We have collected birth and death dates for many of the Chilean artists that belong to our databases. Most of this data comes from musicapopular.cl. However, this data doesn’t have any formatting rules and it seems that different people entered data with different criteria. In other words, there are different styles of spanish dates.

As MB supports fields for date periods in the form YYYY-MM-DD, we developed a script using regular expression to parse all dates to this format. Done.

Linking tables and finding duplicates

I have been working in parsing the already scrapped websites and I figured out that I need to know in advance the structure of the file that I want to generate. This is important because it must be delineated by the structure of the tables in the database.

The parsing outcome of the websites will be a .csv file all the data for a specific website. This is the lower-level moment with the data structured, so it is a good moment for assigning an id for the entities. If the id is assigned later, it will be harder to solve problems because we will be already in the structure of the database.

We could also use a script to ensure that there is no duplicates in the table, and if it finds duplicates we could take two options:

  • to write a second csv file with all the duplicate entries, while in the first one there are only the unique ones
  • to write a unique csv file where all the duplicate entries can be marked in the id column by a special character that can be fixed manually (e.g., id = *1234)

Recursion problems using Python

I have been dealing for a couple of days with recursion problems when dumping data using pickle. The solution that I have used has been pretty simple but effective: just to increment the recursion limit by doing:

sys.setrecursionlimit(10000)

Since I have set up this new recursion limit, I have not had any recursion problems again.