May 6, 2019

Confessions of a DataHoarder

Building tools to manage my digital habit of saving everything.

I’ve been fond of computers for about as long as I can remember, but I remember the exact date and time when I became a data hoarder, or to be specific, someone obsessed with backing up and archiving most of my digital existence and experience.

Sometime in March of 2012, during my junior year of high school, I was finishing my formal research paper for Intel ISEF (the project was figuring out how to harvest energy from server hardware immersed in Novec 7000, a dielectric phase-change coolant). I’d compiled the entire report and had nearly 400 GB of data and media relevant to the project, all sitting on a single 3 TB hard disk I’d shucked from an external drive. Lo and behold, when I returned from a break to go downstairs and have dinner, I noticed my machine had gone to sleep.

The components of my machine were splayed across my glass desk, since I was too cheap to actually buy a test bench. When I went to wake up the machine, a puff of blue smoke immediately erupted from the 3 TB drive. In about 2 seconds I’d lost months of work and a lot of cool media, not to mention the entire extent of my personal media archive at the time (wow, did 3 TB feel like a lot of space in 2012).


From that point forward, I never again trusted a single drive without a few copies to keep my data safe. More specifically, I never purchased another Seagate drive (even Backblaze can back this up).

This began my deep dive into software RAID, ZFS, server hardware and a number of other related topics, but the implementation of my data hoarding is for another entry.


By far, movies and other cinematic works take up the most space on my 28 TB file server. This has basically always been the case, since I like to find the highest quality Blu-rays and then store them digitally on my server in a lossless format. The issue is, since I’ve been archiving Blu-rays and other sources of media (some open source or CC), I’ve started to lose track of what I actually have. Sometimes I’ll come across a listing for a Blu-ray or file I’m not sure I have, and I waste money or time because it’s already on my server. Or when friends ask if I have a copy of something, I won’t really be able to give them a straight answer (it’s basically been that way since I surpassed 300 titles).

Now here’s the thing: I understand that plenty of tools probably exist to organize files, probably some specific to movies. Yes, I also host all of these files on Plex, and their search tool is great. But Plex doesn’t let you hit their search service with an API, and it’s brittle, especially when your media is hosted over an NFS share.

So I decided to build a tool that I’ve named Spook, after the naval term for an intelligence officer.

This tool is written in Elixir, mostly since I’ve just been using the language for a lot of my current projects.

Its core functions are to watch a number of NFS shares, keeping track of what I add, when I add it and, more importantly, granular details about everything I store.
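To give a rough idea of what "watching" a share can look like, here’s a minimal sketch in Python. To be clear, Spook itself is written in Elixir and this is not its actual code; the names `FileRecord`, `scan_share` and `diff_snapshots` are made up for illustration. It assumes a simple polling approach: walk the mounted share, take a snapshot, and compare it with the previous one.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    # Hypothetical record type; a real tool would likely track far more detail.
    path: str
    size_bytes: int
    modified_at: float  # mtime as a Unix timestamp


def scan_share(root):
    """Walk one mounted share and return {path: FileRecord}."""
    records = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.stat(full_path)
            except OSError:
                # The file vanished or the share hiccuped mid-scan; skip it.
                continue
            records[full_path] = FileRecord(full_path, info.st_size, info.st_mtime)
    return records


def diff_snapshots(previous, current):
    """Compare two scans and return (added, removed, changed) path lists."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return added, removed, changed
```

Running `scan_share` on a schedule and feeding consecutive snapshots to `diff_snapshots` gives a list of what was added and roughly when. Polling is slower than filesystem events, but it doesn’t depend on change notifications making it across the network mount.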

The intent here is to eventually have a tool that not only keeps track of my files and ingests their metadata into a Postgres DB I can query with Elasticsearch, but also lets me know exactly what I have and what I don’t.
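As a toy illustration of the "do I already have this?" question, here’s a small Python sketch. It is not how Spook works: SQLite stands in for Postgres so the example runs anywhere, the search side is left out entirely, and the `files` table and the helper names are all hypothetical.

```python
import re
import sqlite3


def normalize_title(title):
    """Lowercase and drop punctuation so different spellings of a title compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog; the schema here is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return conn


def add_file(conn, path, title, size_bytes):
    """Record one file, storing its title in normalized form."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, title, size_bytes) VALUES (?, ?, ?)",
        (path, normalize_title(title), size_bytes),
    )
    conn.commit()


def have_title(conn, title):
    """Return True if any cataloged file matches the normalized title."""
    row = conn.execute(
        "SELECT 1 FROM files WHERE title = ? LIMIT 1",
        (normalize_title(title),),
    ).fetchone()
    return row is not None
```

With something like this, a file cataloged as `blade.runner.1982` would match a lookup for "Blade Runner (1982)", which is the kind of straight answer I’d like to be able to give before buying another disc.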

I’ve been hacking on this a few nights a week, and a few people from /r/datahoarder on Reddit have given me the support and feedback to keep spending my free time building this monstrosity.

Here’s the source