- After discovering that the old CSV file doesn't have a good index, we focused on processing the `.list` files from [[../../../../BayesSD/code/pipeline/data/others/frankfrut_datab/fr17 imdb overall]], which differ from both [[../../../../BayesSD/code/pipeline/data/imdb22/imdb22]] and the old CSV file Johan shared.
- I used the [imdb-data-parser](https://github.com/dedeler/imdb-data-parser) library to convert the `.list` files to `.tsv` files in the output folder, as shown below:
- ![[Pasted image 20230228141853.png]]
- Johan requested the following:
```
1) Create a template for what the final processed data files (TARGET) would look like: Create TSV files, showing what the cleaned data files should look like. Include a hundred lines from the data properly cleaned.
2) Create a list of the .list files you will use (SOURCE) to create these TARGET files
3) Create a conceptual list of steps showing how you will process SOURCE files into TARGET files; Indicate which steps are already done, how long you think each of the other steps will take, and any obstacles you see for executing each step. Also show any INTERMEDIATE file format examples, where relevant.
4) Email these to me before the meeting. In particular, please email me the TARGET templates as soon as you can, and I will look at them and adjust them and send you corrections before our meeting if possible
All of the above can be at the conceptual level. But if you have time, you can start experimenting with code to do some of the steps/create TARGET templates.
```
to which I replied:
```
1. are you ok with constructing (76 by 3m) matrix described below?
2. is dropbox IMDB 2017 data you shared the same one as [https://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/](https://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/)?
3. do you want me to separate movie (feature film) and tv-series? (I recall the movie started with " "?) If so can you share how the modeling should be done with these?
> please email me the TARGET templates as soon as you can, and I will look at them and adjust them and send you corrections before our meeting if possible
Below is my draft.
> 1) Create a template for what the final processed data files (TARGET) would look like: Create TSV files, showing what the cleaned data files should look like. Include a hundred lines from the data properly cleaned.
`age` (`yr_since_debut`), n_lead_roles, n_supporting_roles (separated by feature film and series)
There are 3m persons and 1m titles, and I'm thinking of constructing a matrix (76 by 3m). There are 76 rows because the longest career `age` is [Norman Lloyd](https://streaklinks.com/BXjG3_DFxXlyXB5vIgSzlzOI/https%3A%2F%2Fwww.imdb.com%2Fname%2Fnm0516093%2F?email=hyunji.moonb%40gmail.com), 2015-1939 = 76 yrs; the 3m columns are one per actor.
JUST FOR the visual image of the matrix, refer to the attached age_actor_event_matrix.png, with row as age and column as person_id. The higher the number of events (person-title at each year), the brighter the color. (IGNORE 1,2..,5 +; for our data, every colored square should start from age 0.)
> 2) Create a list of the .list files you will use (SOURCE) to create these TARGET files
actor, actress.list + some list file that might differentiate series and film (hopefully)
```
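
The age-by-person matrix proposed in the reply could be prototyped roughly as follows. This is only a sketch under assumptions: the input is taken to be `(person_id, year)` pairs, one per person-title credit, and the function names and toy data are hypothetical, not taken from the actual pipeline. Counts are kept sparse because a dense 76 by 3m table would be mostly zeros.

```python
from collections import Counter


def build_age_event_counts(credits):
    """Count credits per (age, person_id), where age = year - debut year.

    `credits` is an iterable of (person_id, year) pairs (assumed input shape).
    Returns (counts, max_age).
    """
    credits = list(credits)
    # Debut year = earliest credited year for each person.
    debut = {}
    for person_id, year in credits:
        debut[person_id] = min(year, debut.get(person_id, year))
    counts = Counter()
    for person_id, year in credits:
        counts[(year - debut[person_id], person_id)] += 1
    max_age = max((age for age, _ in counts), default=0)
    return counts, max_age


def to_dense(counts, max_age, person_ids):
    """Expand to a dense (max_age + 1) x len(person_ids) list of lists.

    Only sensible for small samples, e.g. for plotting a subset of actors.
    """
    col = {pid: j for j, pid in enumerate(person_ids)}
    matrix = [[0] * len(person_ids) for _ in range(max_age + 1)]
    for (age, pid), n in counts.items():
        matrix[age][col[pid]] = n
    return matrix


# Toy, made-up credits: p1 debuts in 1939, p2 in 2000.
sample = [("p1", 1939), ("p1", 1939), ("p1", 1941), ("p2", 2000), ("p2", 2001)]
counts, max_age = build_age_event_counts(sample)
matrix = to_dense(counts, max_age, ["p1", "p2"])
```

By construction every person has at least one event at age 0, which matches the note that every colored square should start from age 0.
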
I asked:
> i assume you would want the role; but how would you differentiate between leading vs supporting role? e.g. do you know "the drug-addict" (from the picture without credit number < >) is leading or not?
![[Pasted image 20230228141737.png]]

Johan answered:
> Again, take a step back. Looking at that particular file, I’d start with just putting all the information in that file into a nice data structure. So looking at the snippet you sent, T1 could look like: Name, Title, TitleType, Year, Season, Episode, Role, RoleType, CreditRank
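
A minimal sketch of what one T1 row could look like as a data structure, using the fields Johan lists. The parsing part rests on assumptions about the `.list` layout that are not confirmed in this note: a credit is taken to look like `Title (Year) {Episode (#season.episode)} [Role] <CreditRank>`, and a title wrapped in double quotes is taken to mean a series. `CreditRow`, `parse_credit`, and the sample lines are hypothetical. `RoleType` is left empty because how to tell leading from supporting roles is still an open question above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreditRow:
    name: str
    title: str
    title_type: str            # "series" or "film" (assumed labels)
    year: Optional[int]
    season: Optional[int]
    episode: Optional[int]
    role: Optional[str]
    role_type: Optional[str]   # lead vs supporting: not derivable yet
    credit_rank: Optional[int]


# Assumed layout: Title (Year) {Episode info} [Role] <Rank>
CREDIT_RE = re.compile(
    r'^(?P<title>.+?)\s+\((?P<year>\d{4}|\?{4})[^)]*\)'
    r'(?:\s+\{(?P<episode_info>[^}]*)\})?'
    r'(?:\s+\[(?P<role>[^\]]*)\])?'
    r'(?:\s+<(?P<rank>\d+)>)?$'
)
SEASON_EPISODE_RE = re.compile(r'\(#(\d+)\.(\d+)\)')


def parse_credit(name: str, credit: str) -> Optional[CreditRow]:
    """Parse one credit string; return None if it does not fit the assumed layout."""
    m = CREDIT_RE.match(credit.strip())
    if m is None:
        return None
    raw_title = m.group("title")
    # Assumption: series titles are wrapped in double quotes.
    is_series = raw_title.startswith('"') and raw_title.endswith('"')
    season = episode = None
    if m.group("episode_info"):
        se = SEASON_EPISODE_RE.search(m.group("episode_info"))
        if se:
            season, episode = int(se.group(1)), int(se.group(2))
    year, rank = m.group("year"), m.group("rank")
    return CreditRow(
        name=name,
        title=raw_title.strip('"'),
        title_type="series" if is_series else "film",
        year=int(year) if year.isdigit() else None,
        season=season,
        episode=episode,
        role=m.group("role"),
        role_type=None,
        credit_rank=int(rank) if rank else None,
    )


# Made-up sample lines, not taken from the real files.
series_row = parse_credit(
    "Doe, Jane", '"Some Show" (2005) {Pilot (#1.3)}  [The Drug Addict]  <12>'
)
film_row = parse_credit("Doe, Jane", "Some Film (1999)  [Nurse]")
```

Real lines may carry extra markers that this pattern does not handle; those would come back as `None` and can be collected for inspection rather than silently dropped.
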