8 Commits

Author SHA1 Message Date
James Taylor
f706689a56 extract_item_info: Don't extract author, author_id, etc. for channel items
Philosophically, a channel doesn't create itself.
2019-12-24 13:11:21 -08:00
James Taylor
3200d66d88 Fix extract_approx_int not working for non-approx ints, make extract_int more robust
For example, "354 subscribers" wasn't being extracted correctly be extract_approx_int.
Make extract_approx_int and extract_int only extract integers that are words.
So e.g. 342 will not be extracted from internetuser342
2019-12-24 13:07:12 -08:00
James Taylor
7a6bcb6128 Rewrite channel extraction with proper error handling and new extraction names. Extract subscriber_count correctly.
Don't just shove english strings into info['stats']. Actually give semantic names for the stats.
2019-12-21 15:45:01 -08:00
James Taylor
3936310e7e Fix extract_approx_int. Fixes incorrect subscriber count on channels.
It wasn't working because decimals such as 15.1M weren't considered, so it was extracting "1M"
2019-12-21 15:44:03 -08:00
James Taylor
a9f67d4630 Fix regression: date extraction broken. Move constants to correct file in yt_data_extract 2019-12-20 18:48:40 -08:00
James Taylor
4a3529df95 Extraction: Move stuff around in files and put underscores in front of internal helper function names
Move get_captions_url in watch_extraction to bottom next to other exported, public functions
2019-12-19 20:12:37 -08:00
James Taylor
d1d908d5b1 Extraction: Move html post processing stuff from yt_data_extract to util 2019-12-19 19:48:53 -08:00
James Taylor
76376b29a0 Extraction: Split yt_data_extract.py into multiple files 2019-12-19 19:29:47 -08:00