Scraping without scraping: making my own manga recommender system

news
code
analysis
Author

Espen Jütte

Published

December 18, 2022

With an abundance of excellent manga available, ranging from stories about a bartender using the art of mixology to solve his patrons’ problems to tales of a tournament centered around overclocking, the challenge lies in discovering these treasures. As you wade through the countless isekai manga featuring yet another pharmacy in a parallel world, you may find yourself wondering: is there a more efficient way to locate the gems amidst the vast sea of options?

A robot in a room full of manga (by StableDiffusion)

As a data scientist, it is natural to consider building a recommendation engine. Some manga enthusiasts have attempted this using techniques such as TF-IDF on manga descriptions. However, the key data needed for a more effective recommendation system is information on which manga specific users have read, liked, and favorited. While this type of data is often closely guarded, the manga community tends to be more open, with multiple sites where users can publicly log their favorite manga. For the purpose of our recommender system, we will be using MyAnimeList as a basis. Our goal is to provide the system with either a single manga or a set of manga and generate a list of recommended reading based on recommender system principles.

Scraping without scraping (Common Crawl)

While web scraping is extremely useful for gathering data, it tends to be very unpopular with big sites whose business is built around maintaining a database. So, to minimize the number of queries we send to MyAnimeList, we will instead use a wonderful service called Common Crawl that has already done the hard scraping work for us.

This is a three-step process: first we find which pages are in Common Crawl's index, then we fetch those pages, and lastly we parse them into a user × manga dataframe.

Getting the index

First we look up which pages are in the Common Crawl index: 1

# Download the list of all available Common Crawl indexes
cdx_indexs_df = getCCIndexes()

# Keep only crawls from 2017 through 2022
cdx_selectd = cdx_indexs_df.query("year in ['2017', '2018', '2019', '2020', '2021', '2022']")
index_list = cdx_selectd["yearweek"].tolist()

# URL-encoded pattern matching myanimelist.net/mangalist/<user> pages
search_pattern = "myanimelist.net%2Fmangalist%2F*"
records = search_domain(search_pattern, index_list)

The function getCCIndexes() downloads a list of all available indexes. We then add a search pattern so we only match myanimelist.net manga lists, and finally we run search_domain() to fetch a list of all matching records.
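If you want to implement these two helpers yourself, a minimal sketch could look like the following. It assumes the collinfo.json listing and the CDX API at index.commoncrawl.org, skips pagination and error handling, and is adapted from the Bellingcat approach cited in the footnote; the function names simply mirror the ones used above.

import json

import pandas as pd
import requests


def getCCIndexes() -> pd.DataFrame:
    # collinfo.json lists every crawl with an id like "CC-MAIN-2022-49";
    # split that into year and yearweek columns so we can filter on them.
    crawls = requests.get("https://index.commoncrawl.org/collinfo.json").json()
    df = pd.DataFrame(crawls)
    df["yearweek"] = df["id"].str.replace("CC-MAIN-", "", regex=False)
    df["year"] = df["yearweek"].str.split("-").str[0]
    return df


def search_domain(search_pattern, index_list):
    # Query each selected CDX index for URLs matching the pattern.
    # With output=json the API returns one JSON record per line.
    records = []
    for yearweek in index_list:
        url = (
            f"https://index.commoncrawl.org/CC-MAIN-{yearweek}-index"
            f"?url={search_pattern}&output=json"
        )
        resp = requests.get(url)
        if resp.status_code == 200:
            records += [json.loads(line) for line in resp.text.splitlines()]
    return records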

Next we process the records:

records_df = pd.DataFrame(records)
records_df = records_df.query("status == '200'")  # keep only pages fetched with HTTP 200 OK
records_df["mal_status"] = records_df.url.str.extract(r"status=([1-9])")
records_df["mal_user"] = records_df.url.str.extract(r"mangalist/([^?]+)")
# keep only full lists: URLs with no status filter (NaN != NaN) or status=7
records_df = records_df.query("mal_status != mal_status | mal_status == '7'")
# keep only the newest capture per user
records_df = records_df.sort_values("timestamp").groupby("mal_user").tail(1)
records_df = records_df.reset_index()

Fetching the pages - some background

Now that we know where all the captured pages live, we need to download them from Common Crawl, which exposes a publicly accessible HTTPS interface for this. The important detail is that pages are stored in WARC files: large, gzip-compressed archives that each contain many captured pages. The index gives us the byte offset and length of the record we are interested in, so we can request only that byte range, decompress the gzip data, and then parse the HTML as usual.
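As a concrete illustration, here is a minimal sketch of such a byte-range fetch. It assumes the records come from the CDX query above (with filename, offset and length fields), uses the data.commoncrawl.org endpoint, and does no retrying or proper WARC parsing; a library like warcio would be more robust in practice.

import gzip

import requests


def fetch_page(record):
    # Ask Common Crawl for just the bytes of this one WARC record.
    offset, length = int(record["offset"]), int(record["length"])
    byte_range = f"bytes={offset}-{offset + length - 1}"
    resp = requests.get(
        f"https://data.commoncrawl.org/{record['filename']}",
        headers={"Range": byte_range},
    )
    # The slice is a standalone gzip member holding one WARC record:
    # WARC headers, HTTP headers and the HTML payload, separated by blank lines.
    raw = gzip.decompress(resp.content)
    parts = raw.decode("utf-8", errors="replace").split("\r\n\r\n", 2)
    return parts[2] if len(parts) == 3 else ""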

Parsing myanimelist

Parsing MyAnimeList is an absolute horror, as there are many ways for users to customize their manga lists. There are three main types of pages: JSON data embedded in the HTML table element, a “new” version of the mangalist, and an “old” version that allows full customization by the user. Each one requires a different parser.

The first step is to build a system that can detect what kind of page was fetched. Then we dispatch to one of three parsing functions that all return the same data structure.
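A dispatcher along these lines is sketched below. The selectors are assumptions about how the three layouts can be told apart (modern lists embed the whole list as JSON in a data-items attribute on the table element), and parse_json_list, parse_new_list and parse_old_list are hypothetical placeholders for the three parsers, each returning the same user × manga dataframe.

import json

import pandas as pd
from bs4 import BeautifulSoup


def parse_mangalist(html: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "html.parser")

    # Modern lists ship the entire list as JSON on the table element.
    table = soup.find("table", attrs={"data-items": True})
    if table is not None:
        return parse_json_list(json.loads(table["data-items"]))

    # Otherwise fall back to a heuristic for the "new" template,
    # and treat everything else as the fully customizable "old" one.
    if soup.find("div", class_="list-block") is not None:
        return parse_new_list(soup)
    return parse_old_list(soup)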

(Post in progress)

Footnotes

  1. search_domain code initially from https://www.bellingcat.com/resources/2015/08/13/using-python-to-mine-common-crawl/↩︎