July 2, 2019 07:34 pm PDT

Using machine learning to pull Krazy Kat comics out of giant public domain newspaper archives

Joël Franusic became obsessed with Krazy Kat but was frustrated by the limited availability and high cost of the books anthologizing the strip, some of which were going for $600 or more on Amazon. So he wrote a scraper to pull down thumbnails from massive archives of pre-1923 newspapers, then identified 100 pages containing Krazy Kat strips to use as training data for a machine-learning model.

After a couple of false starts, which Franusic documents, he was able to train a model by feeding the 100 "krazy"-containing thumbnails, along with a set of thumbnails without Krazy Kat that he labeled "negative," to a Microsoft Custom Vision algorithm. He shelled out $180 for Microsoft's "Advanced Training" to be applied to his data, then set the model it produced loose on the remaining thumbnails.
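As a rough illustration of the two-label training set described above, here is a minimal Python sketch. It is not Franusic's code: the function name, the file names, and the choice to pick the "negative" examples by random sampling are all assumptions made for the example, and the step of uploading the labeled images to Custom Vision is left out entirely.

```python
import random
from typing import Iterable, List, Tuple


def build_training_manifest(
    all_thumbs: Iterable[str],
    krazy_thumbs: Iterable[str],
    negatives_wanted: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Pair each thumbnail with a 'krazy' or 'negative' label.

    Assumption: negatives are drawn at random from thumbnails that were
    not hand-identified as containing Krazy Kat. The article does not say
    how the negative set was actually chosen.
    """
    krazy = set(krazy_thumbs)
    # Sort so the sample is reproducible for a given seed.
    candidates = sorted(t for t in all_thumbs if t not in krazy)
    rng = random.Random(seed)
    negatives = rng.sample(candidates, min(negatives_wanted, len(candidates)))
    labeled = [(t, "krazy") for t in sorted(krazy)]
    labeled += [(t, "negative") for t in negatives]
    return labeled


# Hypothetical file names, for illustration only.
thumbs = [f"thumb-{i:03d}.jpg" for i in range(10)]
manifest = build_training_manifest(thumbs, ["thumb-002.jpg", "thumb-007.jpg"], 3)
```

A manifest like this keeps the two classes explicit, so the same list can be reviewed by hand before any images are sent to a training service.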

Once the model had crunched through the remaining thumbnails, Franusic automated the download of full-sized scans of the pages it identified as likely to contain a Krazy Kat comic. When the dust settled, he had hundreds of Krazy Kat comics in a folder, including one strip that does not appear in any published book that Franusic was able to find.
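The "identified as likely" step can be pictured as a simple threshold over per-page scores. The Python sketch below assumes the classifier yields a probability between 0 and 1 for each thumbnail; the page IDs, the scores, and the 0.5 cutoff are hypothetical, and the article does not say what cutoff Franusic used.

```python
from typing import Dict, List


def select_likely_pages(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return page IDs whose 'krazy' score meets the threshold, highest first."""
    hits = [(page, score) for page, score in scores.items() if score >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [page for page, _ in hits]


# Hypothetical classifier output: page ID -> probability of a Krazy Kat strip.
example_scores = {"page-001": 0.97, "page-002": 0.12, "page-003": 0.64}
likely = select_likely_pages(example_scores, threshold=0.5)
```

Only the pages in `likely` would then be queued for a full-sized download, which keeps the expensive requests to a small fraction of the archive.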

Franusic has done an excellent job of summarizing his process notes, including source code, and has offered to share a complete set of notes with anyone who wants to build on his work. He's also produced a set of recommendations for people trying this kind of work in the future, as well as a wishlist for newspaper archivists who are hoping that projects like this will surface interesting things in their archives.


Original Link: http://feeds.boingboing.net/~r/boingboing/iBag/~3/Zx2wrOBllc4/roll-over-george-herriman.html
