Tremendous Task: Searching for code on GitHub with BigQuery and GHTorrent
Searching GitHub for regular expression matches in code is a tremendous task.
Prerequisites
Set up the bq command: https://cloud.google.com/bigquery/docs/bq-command-line-tool
Lots of money
Sample search
About US$5 per search. This is cheaper than searching all files.
shell variable | function |
---|---|
$query | a regular expression that searches the contents of files |
$path_re | a regex that matches on the file path |
$path_re_exclude | a regex that matches on the file path for pruning results |
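A minimal sketch of such a sample search, assuming the `sample_contents` table of `bigquery-public-data.github_repos` and its `sample_repo_name`, `sample_path`, and `content` columns. The function names and the example values of the three shell variables are illustrative only. The variables are passed as query parameters so that the regexes need no SQL escaping.

```shell
# Example values; all three are placeholders chosen for illustration.
query='import\s+tensorflow'
path_re='\.py$'
path_re_exclude='(^|/)(vendor|third_party)/'

# Print the standard SQL for the sample search.
# Table and column names are assumptions about the public dataset.
sample_search_sql() {
  cat <<'SQL'
SELECT sample_repo_name, sample_path
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE REGEXP_CONTAINS(sample_path, @path_re)
  AND NOT REGEXP_CONTAINS(sample_path, @path_re_exclude)
  AND REGEXP_CONTAINS(content, @query)
ORDER BY sample_repo_name, sample_path
SQL
}

# Run the search with bq; extra flags (for example --dry_run) can be
# passed as arguments. This call is billed, so it is not run automatically.
run_sample_search() {
  bq query --use_legacy_sql=false --format=csv --max_rows=1000 "$@" \
    --parameter="query:STRING:$query" \
    --parameter="path_re:STRING:$path_re" \
    --parameter="path_re_exclude:STRING:$path_re_exclude" \
    "$(sample_search_sql)"
}
```

Calling `run_sample_search --dry_run` should report how many bytes the query would process without running it, which is a cheap way to check the cost first. `run_sample_search > results.csv` runs the search.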
Full contents search – Search all files
About US$20 per search.
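As a sketch of what a full search might look like, the query below joins the `files` and `contents` tables of `bigquery-public-data.github_repos` on the blob `id`. The table and column names are assumptions about the public dataset, and the function names and example values are hypothetical.

```shell
# Example values; placeholders chosen for illustration.
query='import\s+tensorflow'
path_re='\.py$'
path_re_exclude='(^|/)(vendor|third_party)/'

# Print the standard SQL for a search over all files.
# files holds one row per (repo, path); contents holds one row per blob.
full_search_sql() {
  cat <<'SQL'
SELECT f.repo_name, f.path
FROM `bigquery-public-data.github_repos.files` AS f
JOIN `bigquery-public-data.github_repos.contents` AS c
  ON c.id = f.id
WHERE REGEXP_CONTAINS(f.path, @path_re)
  AND NOT REGEXP_CONTAINS(f.path, @path_re_exclude)
  AND REGEXP_CONTAINS(c.content, @query)
ORDER BY f.repo_name, f.path
SQL
}

# Run the full search; pass --dry_run to estimate the bytes scanned first.
# This call is billed, so it is only defined here, not executed.
run_full_search() {
  bq query --use_legacy_sql=false --format=csv --max_rows=100000 "$@" \
    --parameter="query:STRING:$query" \
    --parameter="path_re:STRING:$path_re" \
    --parameter="path_re_exclude:STRING:$path_re_exclude" \
    "$(full_search_sql)"
}
```

Filtering on the path does not reduce the bytes scanned in `contents`, because the regex on `content` still reads the whole column. That is why the full search costs more than the sample search.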
Using GHTorrent
Using GHTorrent, we can sort repositories by their number of stars.
We can also reduce the number of repositories searched by regex by restricting the search to repositories written in a specific language.
This makes the search cheaper and the results more relevant.
shell variable | function |
---|---|
$language | the programming language to restrict the search to |
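A hedged sketch of ranking repositories of one language by stars. It assumes a GHTorrent snapshot published on BigQuery as `ghtorrent-bq.ght_2018_04_01`, with a `projects` table (`id`, `url`, `language`) and a `watchers` table (`repo_id`, `user_id`) in which one watcher row corresponds to one star. Substitute whichever snapshot you actually use. The function names are hypothetical.

```shell
# Example value; a placeholder chosen for illustration.
language='Python'

# Print the standard SQL that ranks repositories of one language by stars.
# Dataset, table, and column names are assumptions about GHTorrent on BigQuery.
stars_by_language_sql() {
  cat <<'SQL'
SELECT p.url, COUNT(w.user_id) AS stars
FROM `ghtorrent-bq.ght_2018_04_01.projects` AS p
JOIN `ghtorrent-bq.ght_2018_04_01.watchers` AS w
  ON w.repo_id = p.id
WHERE LOWER(p.language) = LOWER(@language)
GROUP BY p.url
ORDER BY stars DESC
LIMIT 1000
SQL
}

# Run the ranking query; billed, so only defined here.
run_stars_by_language() {
  bq query --use_legacy_sql=false --format=csv --max_rows=1000 "$@" \
    --parameter="language:STRING:$language" \
    "$(stars_by_language_sql)"
}
```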
Join two BigQuery tables with #standardSQL
We will do something like this, but for ghtorrent and bigquery-public-data.github_repos.
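One way such a join could look, as a sketch under several assumptions: the GHTorrent snapshot is `ghtorrent-bq.ght_2018_04_01`, its `projects.url` has the form `https://api.github.com/repos/<owner>/<name>`, and both datasets are in the same BigQuery location so they can be joined. The function names and example values are hypothetical.

```shell
# Example values; placeholders chosen for illustration.
query='import\s+tensorflow'
path_re='\.py$'
path_re_exclude='(^|/)(vendor|third_party)/'
language='Python'

# Print standard SQL that restricts the regex search to repositories of one
# language and sorts the matches by star count.
joined_search_sql() {
  cat <<'SQL'
WITH starred AS (
  SELECT
    -- Assumed URL form: https://api.github.com/repos/<owner>/<name>
    REGEXP_EXTRACT(p.url, r'^https://api\.github\.com/repos/(.+)$') AS repo_name,
    COUNT(w.user_id) AS stars
  FROM `ghtorrent-bq.ght_2018_04_01.projects` AS p
  JOIN `ghtorrent-bq.ght_2018_04_01.watchers` AS w
    ON w.repo_id = p.id
  WHERE LOWER(p.language) = LOWER(@language)
  GROUP BY repo_name
)
SELECT s.stars, f.repo_name, f.path
FROM starred AS s
JOIN `bigquery-public-data.github_repos.files` AS f
  ON f.repo_name = s.repo_name
JOIN `bigquery-public-data.github_repos.contents` AS c
  ON c.id = f.id
WHERE REGEXP_CONTAINS(f.path, @path_re)
  AND NOT REGEXP_CONTAINS(f.path, @path_re_exclude)
  AND REGEXP_CONTAINS(c.content, @query)
ORDER BY s.stars DESC, f.repo_name, f.path
SQL
}

# Run the joined search; billed, so only defined here.
run_joined_search() {
  bq query --use_legacy_sql=false --format=csv --max_rows=100000 "$@" \
    --parameter="query:STRING:$query" \
    --parameter="path_re:STRING:$path_re" \
    --parameter="path_re_exclude:STRING:$path_re_exclude" \
    --parameter="language:STRING:$language" \
    "$(joined_search_sql)"
}
```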
Results
Generate the org-mode document with all results
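A minimal sketch of turning search results into an org-mode document. It assumes the results are CSV with a header row and two columns, repository name then file path, already sorted by repository. The function name `csv_to_org` and the choice to link to `HEAD` on github.com are illustrative assumptions.

```shell
# Convert "repo_name,path" CSV on stdin into org-mode on stdout:
# one top-level heading per repository, one link per matching file.
csv_to_org() {
  awk '
    NR == 1 { next }                      # skip the CSV header row
    {
      i = index($0, ",")                  # split on the first comma only
      repo = substr($0, 1, i - 1)
      path = substr($0, i + 1)
      if (repo != last) {                 # new repository: emit a heading
        printf "* %s\n", repo
        last = repo
      }
      printf "- [[https://github.com/%s/blob/HEAD/%s][%s]]\n", repo, path, path
    }
  '
}
```

For example, `csv_to_org < results.csv > results.org` writes the document. Paths that the CSV writer wrapped in double quotes keep their quotes in this sketch, so such rows would need extra handling.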
Annex
Tables used
Download the schema
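A sketch of downloading table schemas with `bq show --schema`. The table list is an assumption (the two `github_repos` tables plus two tables from an assumed GHTorrent snapshot), and the helper names and output file naming are hypothetical.

```shell
# Tables whose schemas to fetch, in bq's project:dataset.table notation.
# The list is an assumption; adjust it to the tables you actually query.
tables='bigquery-public-data:github_repos.files
bigquery-public-data:github_repos.contents
ghtorrent-bq:ght_2018_04_01.projects
ghtorrent-bq:ght_2018_04_01.watchers'

# Map a table reference to a local file name by replacing ":" and "." with "_".
schema_file_for() {
  printf '%s.schema.json\n' "$(printf '%s' "$1" | tr ':.' '__')"
}

# Write one JSON schema file per table into the current directory.
download_schemas() {
  for t in $tables; do
    bq show --schema --format=prettyjson "$t" > "$(schema_file_for "$t")"
  done
}
```

Running `download_schemas` then leaves files such as `bigquery-public-data_github_repos_files.schema.json` next to the script.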
Thanks for reading!