# Huginn

Published on Monday, December 26th, 2016

Yahoo! announced the shutdown of Yahoo! Pipes on June 4, 2015. It breaks my heart to see another good service die. However, I recently found another project with similar capabilities: Huginn.

Huginn actually seems to be more robust than Yahoo! Pipes. The downside is that it is much more difficult to set up. Also, it’s just an app, so I need to either run it on a local server or host it somewhere. Since I’m too lazy to maintain a local server, I hosted it on Heroku. The free plan comes with a lot of restrictions, but it should be enough for the basic tasks I want to do.

Here is an example:

I am following a Thai manga named EXEcutional. For other manga, I would need to wait until I get back to Thailand to read them. Luckily, EXEcutional has an e-book version, so I can just buy it online. However, I want to be notified whenever a new issue is released.

First, I curl the webpage to see if there is any data I could scrape. To my disappointment, there is no data at all. This means the data is likely loaded later by JavaScript, so I try to find the relevant code.

https://www.mebmarket.com/index.php?action=SeriesDetail&series_id=563&page_no=1

Cool! Next, I curl this JavaScript file and find:

https://www.mebmarket.com/web/Assets/Scripts/Templates/series_details.js
```javascript
$.ajax({
    type: "POST",
    url: "Ajax.php?action=CallWrapper",
    data: ({
        api_call : 'Store',
        method_call : 'userGetCacheBooks',
        token : token,
        Option :{
            value: series_id,
            condition:'all',
        },
        pageCache : 'series',
        app_platform : 'WEB',
        page_no : page_no,
        result_per_page : result_per_page,
        app_id : app_id
    }),
```

Even though the request type is POST, perhaps GET will work too? Let’s try...

```console
$ curl "https://www.mebmarket.com/Ajax.php?action=CallWrapper&api_call=Store&method_call=userGetCacheBooks&Option%5Bvalue%5D=563&Option%5Bcondition%5D=all&pageCache=series&page_no=1"
{"status":{"success":true,"message":"STOREUsGetCachSuccessGetCache"},"data":{"page_count":2,"count":33,"result_per_page":20,"books":[{"book_id":"49151","series_id":"563","book_publisher":"Siam Inter Comics","category_id":"12","category_name":"\u0e01\u0e32\u0e23\u0e4c\u0e15\u0e39\u0e19","book_name":"EXEcutional \u0e21\u0e2b\u0e32\u0e2a\u0e07\u0e04\u0e23\u0e32\u0e21\u0e2d\u0e2d\u0e19\u0e44\u0e25\u0e19\u0e4c\u0e16\u0e25\u0e48\u0e21\u0e08\u0e31\u0e01\u0e23\u0e27\u0e32\u0e25 \u0e40\u0e25\u0e48\u0e21 32"...
```
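The GET URL used with curl flattens the nested `Option` object from the POST payload into bracketed keys (`Option[value]`, `Option[condition]`), with the brackets percent-encoded as `%5B` and `%5D`. Here is a minimal Python sketch of how such a URL could be built. The names `BASE_URL` and `build_series_url` are my own for illustration, and only the parameters that appear in the curl command are included:

```python
from urllib.parse import urlencode

# Endpoint taken from the curl command; the helper name is hypothetical.
BASE_URL = "https://www.mebmarket.com/Ajax.php"


def build_series_url(series_id, page_no=1):
    """Flatten the POST payload into a GET query string."""
    params = {
        "action": "CallWrapper",
        "api_call": "Store",
        "method_call": "userGetCacheBooks",
        # The nested Option object becomes bracketed keys.
        "Option[value]": series_id,
        "Option[condition]": "all",
        "pageCache": "series",
        "page_no": page_no,
    }
    # urlencode percent-encodes "[" and "]" as %5B and %5D.
    return BASE_URL + "?" + urlencode(params)


url = build_series_url(563)
print(url)
```

Changing `series_id` or `page_no` gives the URL for another series or for the next page of results.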

Exactly what I want! Now we will turn this into an RSS feed using Huginn.

To scrape this data, create a “Website Agent” with the following configuration:

```json
{
  "expected_update_period_in_days": "2",
  "url": "https://www.mebmarket.com/Ajax.php?action=CallWrapper&api_call=Store&method_call=userGetCacheBooks&Option%5Bvalue%5D=563&Option%5Bcondition%5D=all&pageCache=series&page_no=1",
  "type": "json",
  "mode": "on_change",
  "extract": {
    "name": {
      "path": "$..book_name"
    }
  }
}
```

This uses a query language named JSONPath to extract every book_name value from the response.

Next, we will export this as a feed, so create a “Data Output Agent” with the following configuration:
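To show what the `$..book_name` path selects, here is a small Python sketch of the recursive-descent idea: walk the whole JSON tree and collect every value stored under the given key. This is not how Huginn evaluates JSONPath internally. The `find_all` helper and the sample payload (including the volume names) are made up for illustration and only mimic the shape of the real response:

```python
import json


def find_all(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical payload shaped like the response above; names are placeholders.
sample = json.loads("""
{
  "status": {"success": true},
  "data": {
    "books": [
      {"book_id": "2", "book_name": "Example Vol. 2"},
      {"book_id": "1", "book_name": "Example Vol. 1"}
    ]
  }
}
""")

names = find_all(sample, "book_name")
print(names)
```

Each collected value corresponds to one event emitted by the agent, with the value stored under the `name` key defined in `extract`.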

```json
{
  "secrets": [
    "THIS_IS_SUPER_SECRET"
  ],
  "expected_receive_period_in_days": "2",
  "template": {
    "title": "EXEcutional",
    "description": "EXEcutional",
    "item": {
      "title": "{{name}}",
      "link": "https://www.mebmarket.com/index.php?action=SeriesDetail&series_id=563&page_no=1"
    }
  }
}
```
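The `{{name}}` placeholder in the item template is filled in from each incoming event, so every scraped book name becomes the title of one feed item. Huginn uses Liquid templates for this. The Python sketch below is only a toy stand-in for that substitution step, with a hypothetical `render` helper and a made-up event payload:

```python
import re


def render(template, event):
    """Replace each {{key}} placeholder with the matching event value."""
    def substitute(match):
        # Unknown keys render as an empty string in this sketch.
        return str(event.get(match.group(1), ""))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# The "item" part of the Data Output Agent template above.
item_template = {
    "title": "{{name}}",
    "link": "https://www.mebmarket.com/index.php?action=SeriesDetail&series_id=563&page_no=1",
}

# Hypothetical event, shaped like what the Website Agent would emit.
event = {"name": "EXEcutional 32"}

item = {field: render(text, event) for field, text in item_template.items()}
print(item)
```

Note that the link is a fixed string, so every feed item points to the same series page rather than to the individual issue.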

Then set it to use the first agent as its source. This completes the setup in Huginn.

Huginn then provides a URL for accessing the data. The XML one in particular is in RSS format (it’s actually Atom, but I will use the terms interchangeably), so copy that one and paste it into an RSS reader.