Linking tables and finding duplicates

I have been working on parsing the already scraped websites, and I realized that I need to know in advance the structure of the file I want to generate. This is important because it must mirror the structure of the tables in the database.

The parsing outcome for each website will be a .csv file with all of that website's data. This is the lowest-level point at which the data is already structured, so it is a good moment to assign an id to each entity. If the id is assigned later, problems will be harder to fix because we will already be working inside the database structure.
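As a minimal sketch of this step, assuming the parser yields rows as dictionaries (the field names and sample rows below are hypothetical), ids could be assigned sequentially while writing the .csv file:

```python
import csv
from itertools import count

# Hypothetical rows as they might come out of the parser for one website.
rows = [
    {"name": "Artist A", "country": "CA"},
    {"name": "Artist B", "country": "FR"},
]

ids = count(1)  # sequential ids, assigned before the data reaches the database
with open("artists.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "country"])
    writer.writeheader()
    for row in rows:
        # Merge the generated id into the parsed row before writing it out.
        writer.writerow({"id": next(ids), **row})
```

Because the ids exist in the .csv file itself, any id-related problem can be fixed by editing the file, before the database is involved at all.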

We could also use a script to ensure that there are no duplicates in the table; if it finds duplicates, there are two options:

  • to write a second csv file with all the duplicate entries, keeping only the unique ones in the first file
  • to write a single csv file where every duplicate entry is marked in the id column with a special character, to be fixed manually (e.g., id = *1234)
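Both options above could be sketched as small helper functions; the key column used to detect duplicates and the row layout are assumptions here, not decisions from the actual pipeline:

```python
def split_duplicates(rows, key):
    """Option 1: return (unique, duplicates) split on a key column."""
    seen, unique, dupes = set(), [], []
    for row in rows:
        if row[key] in seen:
            dupes.append(row)       # goes to the second csv file
        else:
            seen.add(row[key])
            unique.append(row)      # stays in the first csv file
    return unique, dupes


def mark_duplicates(rows, key):
    """Option 2: keep one file, prefixing the id of later duplicates
    with '*' so they can be located and fixed manually."""
    seen, marked = set(), []
    for row in rows:
        row = dict(row)             # copy so the input rows are untouched
        if row[key] in seen:
            row["id"] = "*" + row["id"]
        else:
            seen.add(row[key])
        marked.append(row)
    return marked
```

Either output can then be written with `csv.DictWriter` exactly as the unique rows are.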

Database design brainstorming

I’ve been working in the design of the tables for representing all the entities that I would like to make available in my database. For the moment, what I can see for the repository are the following tables and the relations between them:
However, after a conversation with Corina MacDonald, she suggested that all the album reviews, interviews, pictures, and any possible future material could be treated as a RESOURCE. Likewise, a composer, photographer, director, journalist, or any other person could be part of a PEOPLE table. I have been trying to combine these ideas into one schema, and the result, still fuzzy, is:
After parsing some people linked to artists, I have been refining the DB structure a little. In the PEOPLE_ARTIST table I will store an array with the instruments a person played with a specific artist; the type of each relation (e.g., one-to-many, many-to-many) is now declared as well.
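A minimal sketch of that many-to-many piece of the schema, using an in-memory SQLite database for illustration (the table and column names beyond PEOPLE, ARTIST, and PEOPLE_ARTIST are assumptions; SQLite has no array type, so the instruments array is stored as a JSON string here):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);

-- Join table expressing the many-to-many relation between people and artists.
CREATE TABLE people_artist (
    people_id   INTEGER REFERENCES people(id),
    artist_id   INTEGER REFERENCES artist(id),
    instruments TEXT,  -- the instruments array, serialized as JSON
    PRIMARY KEY (people_id, artist_id)
);
""")

# Hypothetical sample data: one person playing two instruments in one artist.
conn.execute("INSERT INTO people VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO artist VALUES (1, 'Some Band')")
conn.execute(
    "INSERT INTO people_artist VALUES (?, ?, ?)",
    (1, 1, json.dumps(["guitar", "vocals"])),
)

row = conn.execute("SELECT instruments FROM people_artist").fetchone()
print(json.loads(row[0]))  # ['guitar', 'vocals']
```

In a database with native array support (e.g., PostgreSQL), the instruments column could be a real array type instead of serialized JSON.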