I have had issues with using Python libraries for scraping Google search results.

The pip libraries for this seem to keep breaking.

Therefore, I switched to scraping with Emacs.

I have come to trust the eww browser because of its reliability.

Create the elisp function which uses xurls to scrape URLs from text

(defun google-scrape-after-loaded ()
  ;; (new-buffer-from-string (sh/ptw/uniqnosort (sh/ptw/xurls (format "%S" (buffer-string)))) "*google-results*")
  (let ((results (sh/ptw/uniqnosort (sh/ptw/xurls (format "%S" (buffer-string))))))
    (write-string-to-file results "/tmp/eww-scrape-output.txt")
    (new-buffer-from-string results)))
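
The sh/ptw helpers and the buffer/file utilities used above are not defined in this post. Roughly, they behave like the sketches below, assuming xurls here is the mvdan/xurls command-line URL extractor; the real implementations may differ.

(require 'subr-x)  ;; for string-join

;; Hypothetical sketches of the helpers used above.

(defun sh/ptw/xurls (text)
  "Pipe TEXT through the xurls command-line tool and return its output."
  (with-temp-buffer
    (insert text)
    (shell-command-on-region (point-min) (point-max) "xurls" nil t)
    (buffer-string)))

(defun sh/ptw/uniqnosort (text)
  "Remove duplicate lines from TEXT while preserving their order."
  (string-join (delete-dups (split-string text "\n" t)) "\n"))

(defun write-string-to-file (string file)
  "Write STRING to FILE, replacing any existing contents."
  (with-temp-file file
    (insert string)))

(defun new-buffer-from-string (string &optional name)
  "Display STRING in a fresh buffer named NAME (or a generated name)."
  (let ((buf (generate-new-buffer (or name "*scrape*"))))
    (with-current-buffer buf
      (insert string))
    (switch-to-buffer buf)))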

Hook into eww's post-render hook

(defun eww-browse-url-then (url thenproc)
  (setq eww-after-render-hook (list
                               thenproc
                               (lm
                                (setq eww-after-render-hook '())
                                ;; This doesn't work for some reason
                                ;; (setq eww-after-render-hook ,oldhook)
                                ;; Unfortunately, this still runs
                                (add-hook 'eww-after-render-hook 'finished-loading-page))))
  (eww-browse-url url))
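
The lm macro used above is not built in to Emacs; I am treating it as a shorthand for a zero-argument lambda, roughly:

;; Assumed definition of the `lm' shorthand: wrap BODY in a closure of no arguments.
(defmacro lm (&rest body)
  `(lambda () ,@body))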

Create the function for browsing and scraping Google

(defun google-scrape-results (query &rest runafter)
  (interactive (list (read-string "Query: ")))

  (if (not (string-empty-p query))
      (let* ((encodedquery (sh/ptw/urlencode query))
             (url (concat "http://www.google.com/search?ie=utf-8&oe=utf-8&q=" encodedquery))
             ;; (oldhook eww-after-render-hook)
             )

        ;; Splice the url and any extra cleanup forms into the callback before evaluating.
        (eval `(eww-browse-url-then ,url (lm (google-scrape-after-loaded) ,@runafter))))))
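
sh/ptw/urlencode is another helper that is not shown here. A minimal sketch, assuming it only needs to percent-encode the query string:

(require 'url-util)

;; Assumed definition: percent-encode STRING for use as a URL query parameter.
(defun sh/ptw/urlencode (string)
  (url-hexify-string string))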

Create the shell script for interfacing with Emacs seamlessly

Do some post-processing at the end to remove URLs we don’t need.

#!/bin/bash
export TTY

( hs "$(basename "$0")" "$@" "#" "<==" "$(ps -o comm= $PPID)" 0</dev/null ) &>/dev/null

CMD="$(cmd "$@")"
: ${CMD:="$(cmd "$@")"}

rm -f /tmp/eww-scrape-output.txt
unbuffer e -e "(google-scrape-results $(aqf "$CMD") (kill-buffer) (delete-frame))" &>/dev/null &
(
    spid=$(sh -c 'echo $PPID')

    # Wait (at most 20 seconds) for Emacs to create the output file,
    # then kill this watcher subshell so the script can continue.
    timeout 20 inotifywait -m /tmp/ -e create 2>/dev/null | while read path action file; do
        if [[ "$file" =~ eww-scrape-output\.txt$ ]]; then
            kill -KILL "$spid"
            exit 0
        fi
    done
)

if test -f /tmp/eww-scrape-output.txt; then
    cat /tmp/eww-scrape-output.txt |
        sed '0,/advanced_search/d' |
        sed '/accounts.google.com/,$ d' |
        sed -n '/\.com\/url\?/p' |
        sed 's=^http://www.google.com/url?q\===' |
        sed 's=&.*=='
fi
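
If you would rather do that filtering inside Emacs instead of the shell, a rough elisp equivalent of the sed pipeline could look like the sketch below (the function name is mine); it keeps only the google.com/url?q= redirect links and strips them down to the target URL.

(defun google-scrape-clean-urls (urls)
  "Extract result targets from the raw list of scraped URLS."
  (delq nil
        (mapcar (lambda (url)
                  ;; Result links are Google redirects of the form
                  ;; http://www.google.com/url?q=<target>&sa=...
                  (when (string-match "^https?://www\\.google\\.com/url\\?q=\\([^&]+\\)" url)
                    (match-string 1 url)))
                urls)))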
cd "$NOTES/ws/blog/posts"; ci emacs-google-scrape-backend site:github.com linear algebra awesome | v

And then a command that calls the above but caches the results

#!/bin/bash
export TTY

( hs "$(basename "$0")" "$@" "#" "<==" "$(ps -o comm= $PPID)" 0</dev/null ) &>/dev/null

oci emacs-google-scrape-backend "$@"
Sample output:

https://github.com/rossant/awesome-math
https://github.com/zslucky/awesome-AI-books
https://github.com/Empia/awesome-math-1
https://github.com/shenwei356/awesome/blob/master/math.md
https://github.com/wx-chevalier/Awesome-CS-Books/blob/master/DataScienceAI/Mathematics/2017-Fundamentals%2520of%2520Linear%2520Algebra%2520and%2520Optimization.pdf
https://github.com/wx-chevalier/Awesome-CS-Books/blob/master/DataScienceAI/Mathematics/2017-Fundamentals%20of%20Linear%20Algebra%20and%20Optimization.pdf
https://github.com/FrankLoud/awesome-math
https://github.com/nschloe/awesome-scientific-computing
https://github.com/marcosgomesborges/awesome-maths
https://github.com/krishnakumarsekar/awesome-machine-learning-deep-learning-mathematics
https://github.com/zslucky/awesome-AI-books/issues/1

Demonstration

asciinema recording