I have been working on parsing the websites we already scraped, and I realized that I need to know in advance the structure of the file I want to generate. This is important because it must match the structure of the tables in the database.
The outcome of parsing will be a .csv file with all the data for a specific website. This is the lowest-level point at which the data is already structured, so it is a good moment to assign an id to each entity. If ids are assigned later, problems will be harder to fix because we will already be working inside the database structure.
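As a minimal sketch of assigning ids at csv-writing time (the field names `name` and `url` and the file name `site_data.csv` are placeholders; the real columns will depend on the site being parsed):

```python
import csv

def write_with_ids(records, path, start_id=1):
    """Write parsed records to a csv, assigning a sequential id to each entity."""
    fieldnames = ["id"] + list(records[0].keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for i, rec in enumerate(records, start=start_id):
            writer.writerow({"id": i, **rec})

# Hypothetical parsed output for one website
records = [
    {"name": "Alice", "url": "https://example.com/a"},
    {"name": "Bob", "url": "https://example.com/b"},
]
write_with_ids(records, "site_data.csv")
```

Doing this here means every entity already carries a stable id before anything touches the database.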
We could also use a script to ensure that there are no duplicates in the table. If it finds duplicates, we have two options:
- write a second csv file containing all the duplicate entries, keeping only the unique ones in the first file
- write a single csv file where every duplicate entry is marked in the id column with a special character, so it can be fixed manually (e.g., id = *1234)
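Both options above can be sketched as small helpers; which column counts as the duplicate key (here a hypothetical `url` field) is an assumption and would depend on the actual site data:

```python
def split_duplicates(rows, key):
    """Option 1: separate unique rows from duplicates, judged by `key`."""
    seen, unique, dupes = set(), [], []
    for row in rows:
        if row[key] in seen:
            dupes.append(row)       # goes to the second csv
        else:
            seen.add(row[key])
            unique.append(row)      # stays in the main csv
    return unique, dupes

def mark_duplicates(rows, key, marker="*"):
    """Option 2: keep one csv, but prefix the id of every duplicate
    with a marker so it can be fixed manually (e.g. id = *1234)."""
    seen = set()
    for row in rows:
        if row[key] in seen:
            row["id"] = marker + str(row["id"])
        else:
            seen.add(row[key])
    return rows
```

Either list of rows can then be written out with `csv.DictWriter` exactly as in the id-assignment step.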