Use Google cache and curl

Google's cache often has a full copy of an article.

We use curl to check whether a cached copy exists for a given URL, since not every URL is cached.

This is not an exact science. Sometimes the cache is needed and sometimes it is not, but the more information we bring in (such as whether a cached copy exists), the more informed the decision.

Make the curl-firefox script. We call this script from Emacs.

Vanilla curl always returns 404 from Google’s cache.

We need to send a browser user agent, which curl sets with -A.

#!/bin/bash

/usr/bin/curl -A "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0" "$@"

Write the test functions in emacs-lisp

;; See this article for ~sh-notty~
;; https://mullikine.github.io/posts/macro-tutorial/

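(defmacro sh-notty-true (cmd &rest sh-notty-args)
  "Return t if the shell command CMD exits with status 0."
  ;; `b_exit_code' is expected to be set by `sh-notty' (see the article above)
  `(let ((result (sh-notty ,cmd ,@sh-notty-args)))
     (string-equal b_exit_code "0")))

(defun url-is-404 (url)
  "Return t if a HEAD request for URL reports 404 Not Found."
  (sh-notty-true (concat "curl-firefox -s -I " (q url) " | grep -q \"404 Not Found\"")))

(defun url-cache-is-404 (url)
  "Return t if Google's cache has no copy of URL.
URL is the plain address; the cache prefix is added here."
  (url-is-404 (concat "http://webcache.googleusercontent.com/search?q=cache:" url)))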
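
Matching on the numeric status code can be less brittle than matching on the reason phrase, because HTTP/2 responses carry no reason phrase (curl prints a status line such as "HTTP/2 404"). Here is a minimal sketch of that idea in shell. The function names http_status_from_headers and url_is_404 are hypothetical and not part of the original setup; only curl-firefox comes from the script above.

```shell
#!/bin/bash

# Hypothetical helper: read response headers on stdin and print the
# status code of the last status line seen. With curl -L there is one
# header block per redirect hop, so the last one wins.
http_status_from_headers() {
  awk '/^HTTP\/[0-9.]+ [0-9][0-9][0-9]/ { sub(/\r$/, ""); code = $2 }
       END { if (code != "") print code; else exit 1 }'
}

# Hypothetical wrapper: succeed (exit 0) only if the URL answers 404.
# Assumes the curl-firefox script is on PATH.
url_is_404() {
  [ "$(curl-firefox -s -I "$1" | http_status_from_headers)" = "404" ]
}
```

From Emacs, the same sh-notty-true call could then run url_is_404 instead of the grep pipeline, provided the functions are made available to the shell it spawns.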

Try them out

(url-is-404 "https://medium.com/riselab/functional-rl-with-keras-and-tensorflow-eager-7973f81d6345")
(url-cache-is-404 "https://medium.com/riselab/functional-rl-with-keras-and-tensorflow-eager-7973f81d6345")
(url-is-404 "https://www.cryptocompare.com/coins/nas/overview")
(url-is-404 "https://news.ycombinator.com/")
(url-cache-is-404 "https://news.ycombinator.com/")

Add some advice to the eww command which expands URLs just before they are loaded

;; For certain URLs, load the Google cache copy instead, if one exists
(defun eww--dwim-expand-url-around-advice (proc &rest args)
  (let* ((url (car args))
         (cached-url (concat "http://webcache.googleusercontent.com/search?q=cache:" url)))
    (when (and (or (string-match-p "towardsdatascience" url)
                   (string-match-p "medium\\.com" url))
               (not (string-match-p "webcache\\.google" url))
               ;; url-cache-is-404 adds the cache prefix itself,
               ;; so pass the plain URL, not cached-url
               (not (url-cache-is-404 url)))
      (setq url cached-url))
    (funcall proc url)))
(advice-add 'eww--dwim-expand-url :around #'eww--dwim-expand-url-around-advice)
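
To try the matching rules outside Emacs, the rewriting part of the advice can be sketched as a small shell function. This is only an illustration: cache_url_for and CACHE_PREFIX are hypothetical names, and the sketch deliberately leaves out the check that a cached copy exists, so it covers the URL matching alone.

```shell
#!/bin/bash

# Hypothetical shell counterpart of the advice's URL rewriting.
CACHE_PREFIX="http://webcache.googleusercontent.com/search?q=cache:"

# Print the URL that would be loaded for $1.
cache_url_for() {
  case "$1" in
    # Already a cache URL: leave it alone
    *webcache.google*) printf '%s\n' "$1" ;;
    # Sites we prefer to read from the cache
    *towardsdatascience*|*medium.com*) printf '%s\n' "${CACHE_PREFIX}$1" ;;
    # Everything else loads directly
    *) printf '%s\n' "$1" ;;
  esac
}
```

For example, cache_url_for "https://news.ycombinator.com/" prints the URL unchanged, while a medium.com URL comes back with the cache prefix in front.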