This post originally appeared on the BikeSpace blog.

At BikeSpace, we have a couple of datasets that need to be regularly updated and published for our bike parking map. These are classic “data pipeline” tasks, which can be handled by solutions ranging in complexity from a Python script run by a cron job up to a fully-fledged data workflow platform like Apache Airflow.

We’ve found that using a git scraper is a great way to manage data updates (our repository for this is here). Simon Willison has written a whole series of blog posts about the concept, but the key ingredients we use are:

  • A git repository with some Python scripts to extract, transform, and save the data to files
  • The git repository also holds the data (ideally in text formats that diff well)
  • A GitHub Action runs the scripts on a schedule and commits the new/updated files to the repository

When our front-end code needs the data, it just requests the relevant file(s) from GitHub.
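
For illustration, a minimal Python sketch of that request might look like the following. The owner/repo names are placeholders and the file path assumes the data-branch layout described later in this post; our actual front end makes the equivalent request in its own code.

# Minimal sketch: fetch a data file directly from the data branch on GitHub.
# <owner> and <repo> are placeholders, and the file path assumes the layout
# described later in this post.
import requests

RAW_URL = (
    "https://raw.githubusercontent.com/"
    "<owner>/<repo>/data/bicycle_parking/data.json"
)

response = requests.get(RAW_URL, timeout=30)
response.raise_for_status()
parking_data = response.json()
# assuming the file is GeoJSON-like, count its features
print(len(parking_data.get("features", [])))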

This set-up works pretty well: the code for updating the data can be as simple or complex as you need, you can run the whole thing for free, and you get the benefits of a pretty good task runner in GitHub Actions. Since the data is stored in git, you also have a built-in way to see how your data is changing and to traverse back to earlier versions. If you’re already familiar with using GitHub Actions for CI/CD tasks, then you also have the bonus of not having to learn a new tool.

If you want to try something like this, here are some tips we’ve found to be helpful:

Keeping your scripts and your data separate

Most git scrapers I’ve seen keep both the data and the code to update it in the same branch. For simple situations this works fine, but there are some drawbacks:

  • Hard to implement branch protection (e.g. requiring code review to make changes to the scripts)
  • If you update your data often, your git history is full of data update commits, making it harder to see when the script code has changed
  • More likely to mix code/development documentation with data documentation

To keep our scripts and data separate, we use two different branches: the main branch holds the scripts, and the data branch holds… the data. You could also do it the other way around with the data kept in main, though there might be some additional steps required since many workflows assume your production code is on the main branch.

For us, the (simplified) file organization works like this:

# main branch - scripts go here, e.g.
src/bikespace_data/bicycle_parking/update_data.py
 
# data branch - data goes here, e.g.
bicycle_parking/data.json

To update the data, we do a couple things in our workflow file to commit the changes and bring them from the main branch over to data:

name: Run Update (Sensor)
on:
  schedule:
    - cron: "00 9 * * *" # all days at 9:00am UTC (4:00am ST or 5:00am DT in Toronto)
permissions:
  contents: write
jobs:
  update-bicycle-parking-data:
    runs-on: ubuntu-latest
    steps:
      # check out repo, run scripts, set up git, etc.
      - name: Build commit on main (not pushed)
        run: |
          git add bicycle_parking
          git commit -m "BOT - ran data pipeline update (bicycle_parking)"
      - name: Cherry pick and push commit to data branch
        run: |
          # go over to the data branch
          git switch data
          # get up to date with the data branch to prevent any conflicts
          git pull origin data  
          # bring over the commit you made earlier and prefer its values for any merge conflicts
          git cherry-pick main --strategy-option=theirs
          # push the update to the data branch
          git push origin data

With this kind of workflow, what will happen is:

  • Code on the main branch updates the data and saves it with the location/filename expected by the data branch
  • Commit the changes, but don’t push to main (since the action runner is ephemeral, it’ll just get thrown away once the workflow is complete)
  • Switch to the data branch and make sure it’s up to date
  • Cherry pick the commit with the updated data from main and use --strategy-option=theirs to resolve any merge conflicts. This tells git to assume the cherry-picked commit is correct in any case where it’s overwriting the existing data.
  • Push the update to the data branch – now your changes are live!

One other key thing – earlier in the git setup part of the workflow, we run git fetch --depth=1 to make sure that the workflow has access to all the branches, and not just main. The workflow file is here, if you want to see a full example.

Skip committing if nothing has changed

Some of our scripts are set up so that no files are written if the source data has not been updated. In these cases, we want to skip the workflow steps that would commit and push the new files, since there is nothing to commit, and git would otherwise exit with an error.

To do this, we add the following to our workflow:

# before trying to commit
- name: Check if there are changes that need to be committed
  run: |
    if [ -z "$(git status --porcelain)" ]; then
      echo "should_commit=False" >> "$GITHUB_ENV"
    else
      echo "should_commit=True" >> "$GITHUB_ENV"
    fi
 
# add to the subsequent steps
if: ${{ env.should_commit == 'True' }}

This takes advantage of the variables and conditional logic available in GitHub Actions to check whether there is anything to commit and to skip those steps if there are no changes. The -z test in bash checks whether the git status output (with --porcelain to make it more script-friendly) is blank (think -z = zero characters).

If you wanted, you could also make the commit conditional on a certain file or files having changed by using git diff to compare the main and data branches, though I haven’t played around much with this option yet.

What to save as outputs

In addition to the data file we want to use for our app, we also save:

Input files used by the script, e.g. files from the open data portal, query results from OpenStreetMap, etc. This helps with debugging and also gives you the option of re-running your data transformation on inputs from any point in time as you improve your code.

Intermediate outputs that may be helpful for debugging – e.g. for a key filtering step, we add attributes that indicate whether each data point passed or failed certain logical tests. If you wanted to dig into which data points got through the filter, you could load up that file and look through the results to see exactly why each point was or wasn’t included in the final dataset.
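
As a rough illustration, an intermediate output like that might be produced along these lines (a sketch only; the paths and column names here are made up, not our actual schema):

# Hedged sketch: record the result of each filtering test as its own column,
# save the full table (flags included) as an intermediate output, then write
# the filtered final dataset. Paths and column names are hypothetical.
import pandas as pd

df = pd.read_csv("bicycle_parking/parking_raw.csv")

# each test gets its own boolean attribute so you can see why a row was kept or dropped
df["test_has_coordinates"] = df["latitude"].notna() & df["longitude"].notna()
df["test_is_public"] = df["access"].eq("public")
df["passed_all_tests"] = df["test_has_coordinates"] & df["test_is_public"]

# intermediate output: every row with its pass/fail flags, for debugging
df.to_csv("bicycle_parking/filter_results.csv", index=False)

# final output: only the rows that passed every test
df[df["passed_all_tests"]].to_csv("bicycle_parking/parking_filtered.csv", index=False)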

Weekly “archive” copies of the data files. You can also use a tool like git-history, but saving archive copies makes it simpler to analyze the data from different points in time. We save our archive files in .parquet format to try to save some space.
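
The archive step itself can be very small; something along these lines, with hypothetical paths and columns:

# Hedged sketch of a date-stamped parquet archive (paths and columns are hypothetical).
from datetime import date
from pathlib import Path
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "capacity": [8, 4]})  # stand-in for the pipeline output
archive_dir = Path("bicycle_parking/archive")
archive_dir.mkdir(parents=True, exist_ok=True)
# writing parquet requires pyarrow or fastparquet to be installed
df.to_parquet(archive_dir / f"data_{date.today().isoformat()}.parquet")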

Updates to a status log tracking key stats (e.g. number of data points in input datasets). We use a CSV file for this, and additional rows are appended each time the data is updated. It’s a nice, simple way to track metadata for updates and it’s easy to see how key stats change over time by just looking through the status file on the data branch.
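
In code, the append can be as simple as something like this (the path and column names here are hypothetical):

# Hedged sketch: append one row of run stats to a CSV status log.
from datetime import datetime, timezone
from pathlib import Path
import pandas as pd

log_path = Path("bicycle_parking/status_log.csv")
new_row = pd.DataFrame([{
    "run_timestamp": datetime.now(timezone.utc).isoformat(),
    "input_point_count": 1234,   # e.g. rows in the open data extract
    "output_point_count": 1180,  # e.g. rows kept after filtering
}])
# append to the log, writing the header only if the file doesn't exist yet
new_row.to_csv(log_path, mode="a", header=not log_path.exists(), index=False)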

Keeping outputs diff-friendly

To make the most of saving files to git, you want to make sure they diff well. This might mean making some tweaks to your scripts to format your output files. Some things we pay attention to:

  • Consistent line breaks and indentation (e.g. we use .to_json() for GeoDataFrames instead of .to_file() to have more control over indentation and to ensure diff-friendly line breaks)
  • Remove arbitrary indexes (e.g. for a lot of pandas outputs, there’s an index=False argument you can add)
  • Sort output data (especially CSV outputs) to prevent arbitrary row reordering from showing up in the diff as changes (a short sketch covering all three points follows this list)
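
Putting those three together, the output step of a script might look roughly like this (a sketch only, assuming geopandas and pandas, with hypothetical paths and column names):

# Hedged sketch of writing diff-friendly outputs (paths and columns are hypothetical).
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("bicycle_parking/parking_input.geojson")

# sort so that unchanged rows stay on the same lines from one run to the next
gdf = gdf.sort_values("id").reset_index(drop=True)

# .to_json() gives more control over indentation and line breaks than .to_file()
with open("bicycle_parking/data.json", "w") as f:
    f.write(gdf.to_json(indent=2))

# for tabular outputs, drop the arbitrary pandas index and sort before writing
summary = pd.DataFrame(gdf.drop(columns="geometry"))
summary.sort_values("id").to_csv("bicycle_parking/summary.csv", index=False)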

In Summary

If you have data updates you want to run regularly, but want to keep things simple and don’t want to worry about managing a database, using the git scraper approach can work really well. If you have your own tips to share, or suggestions for ways we could improve our set-up, let us know!