August 31, 2014

rsync for file sync and backup - storing my personal media

You do back up all of your important files, right?

TLDR: Dropbox + NAS + Local Storage + rsync = my backup strategy.

I have a lot of data that is important to me. Mostly: documents, code, images, music, and videos. I currently have about 1.2TB stored. This is an overview of how I store and back up my data.


About rsync:

rsync is a great file synchronization and file transfer program. Since being announced in 1996, it has become a standard Linux utility, included in all popular Linux distributions (and other Unix-like systems).


I have a Dropbox with an 11GB quota. This suffices for storing my documents, code, and images. It nicely syncs them to all my computers. I can easily access this data from outside my LAN on any device I own.

Then I have a NAS as my main storage within my LAN. It is a 2-bay enclosure holding two 2TB hard drives in a RAID configuration. I can easily access all of my media from any computer or device on my home network. I also have a secondary NAS attached to my network, purely for backup/disaster situations.

To complete the storage picture... on my main workstation, I have a 2TB hard disk mounted as a secondary data drive.

Storage Overview:

  • Dropbox
  • 2 NAS servers mounted as drives on my main workstation (using SMB)
  • Main workstation with 2TB Hard disk mounted internally
  • Gigabit Ethernet LAN

So, that's the given hardware setup. I use a shell script to actually run my backup. It is initiated from the main workstation (scheduled from cron).

The script follows this workflow:

  • local Dropbox gets rsync'ed to my primary NAS
  • primary NAS gets rsync'ed to my local workstation's data drive
  • primary NAS gets rsync'ed to my backup NAS
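A minimal sketch of the same three-step workflow, driven from Python instead of a shell script. The mount points, the `dropbox/` subfolder, and the `-av` flags are assumptions for illustration; they are not the paths or options from the actual script.

```python
import subprocess

# Hypothetical mount points; substitute your own.
DROPBOX = "/home/user/Dropbox/"
PRIMARY_NAS = "/mnt/nas-primary/"
LOCAL_DATA = "/mnt/data/"
BACKUP_NAS = "/mnt/nas-backup/"

# (source, destination) pairs, in the order of the workflow above.
STEPS = [
    (DROPBOX, PRIMARY_NAS + "dropbox/"),
    (PRIMARY_NAS, LOCAL_DATA),
    (PRIMARY_NAS, BACKUP_NAS),
]


def build_command(source, destination):
    """Return the rsync argument list for one sync step."""
    # -a (archive) recurses and keeps modification times; -v lists files.
    return ["rsync", "-av", source, destination]


def run_backup(dry_run=True):
    """Run each step in order, stopping at the first failure."""
    for source, destination in STEPS:
        command = build_command(source, destination)
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if rsync exits non-zero.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_backup(dry_run=True)
```

Running it with `dry_run=True` only prints the three rsync commands, which is a cheap way to check the paths before scheduling the real job from cron.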

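I have ~20k files (~1.2TB) in my archive. The first time running the backup job was slow (several hours), but subsequent backups are extremely fast. By default, rsync skips any file whose size and modification time already match at the destination, so only new or changed files are copied; over a network connection, its delta-transfer algorithm and compression reduce the transfer further. If not many files changed since my last backup/sync, differential backups take only a few seconds or minutes.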

Besides not having an offsite copy for disaster recovery, I like this system, and it makes me sleep well knowing my data is safe and recoverable.

Any flaws?

January 11, 2014

Python - Fixing My Photo Library Dates (Exif Metadata)

I have a large image library of photos I've taken or downloaded over the years. They are from various cameras and sources, many with missing or incomplete Exif metadata.

This is problematic because some image viewing programs and galleries use metadata to sort images into timelines. For example, when I view my library in Dropbox Photos timeline, images with missing Exif date tags are not displayed.

To remedy this, I wrote a Python script to fix the dates in my photo library. It uses gexiv2, which is a wrapper around the Exiv2 photo metadata library.

The script will:

  • recursively scan a directory tree for jpg and png files
  • get each file's creation time
  • convert it to a timestamp string
  • set Exif.Image.DateTime tag to timestamp
  • set Exif.Photo.DateTimeDigitized tag to timestamp
  • set Exif.Photo.DateTimeOriginal tag to timestamp
  • save file with modified metadata
  • set file access and modified times to file creation time

* Note: it does modifications in-place.
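A minimal sketch of the file-handling steps from the list above, using only the standard library. The function names are hypothetical, and writing the Exif tags themselves is left out because it needs gexiv2; the tag names are listed only so the timestamp can be paired with them.

```python
import os
import time

IMAGE_EXTENSIONS = (".jpg", ".png")

# Tags the script sets, as listed above.
EXIF_DATE_TAGS = (
    "Exif.Image.DateTime",
    "Exif.Photo.DateTimeDigitized",
    "Exif.Photo.DateTimeOriginal",
)


def find_images(root):
    """Yield paths of jpg and png files under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(IMAGE_EXTENSIONS):
                yield os.path.join(dirpath, name)


def exif_timestamp(epoch_seconds):
    """Format a Unix time in the Exif layout 'YYYY:MM:DD HH:MM:SS'."""
    return time.strftime("%Y:%m:%d %H:%M:%S", time.localtime(epoch_seconds))


def planned_tags(path):
    """Return {tag name: timestamp string} for one file, without writing it."""
    # Caveat: getctime is the creation time on Windows, but on Unix it is
    # the time of the last metadata change.
    stamp = exif_timestamp(os.path.getctime(path))
    return {tag: stamp for tag in EXIF_DATE_TAGS}


def reset_file_times(path, epoch_seconds):
    """Set the file's access and modified times to the given Unix time."""
    os.utime(path, (epoch_seconds, epoch_seconds))
```

Printing `planned_tags(path)` for each path from `find_images(root)` gives a preview of what would be written before any file is modified in place.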

The Code:

November 17, 2013

Gource Visualization of Ubuntu Touch Core Apps Development

TLDR: I made a cool version control visualization of all the Ubuntu Touch Core Apps.

The video: https://www.youtube.com/watch?v=nAmKAgRS0tw

* Warning: abrasive techno music
* To be watched in HD, preferably at maximum volume


Making Gource visualizations of complex software projects is awesome. I love seeing a VCS commit log come to life as blooming trees and swarming workers. Normally, I do a visualization video of a single repository. But in this case, I used a bash script to create a visualization of multiple source code repositories, because I wanted to see the progress of the entire stack of Ubuntu Touch Core Apps. Ubuntu Touch Core Apps is an umbrella project for all 17 of the core apps that are available in Ubuntu on mobile devices.

The Ubuntu Touch Core Apps:

  • Dropping Letters
  • Evernote Online Accounts plugin
  • QtDeclarative bindings for the Grilo media scanner
  • Stock Ticker App
  • Sudoku App
  • Ubuntu Calculator App
  • Ubuntu Calendar App
  • Ubuntu Clock App
  • Ubuntu Document Viewer App
  • Ubuntu E-mail App
  • Ubuntu Facebook App
  • Ubuntu File Manager App
  • Ubuntu Music App
  • Ubuntu Phone Commons
  • Ubuntu RSS Feed Reader App
  • Ubuntu Terminal App
  • Ubuntu Weather App

Making the visualization:

Assuming you have a bunch of source code repositories already branched/cloned locally, here is a general version of the script to generate visualization videos of multiple projects/repositories: https://gist.github.com/cgoldberg/7488521

The script I used to create the Ubuntu Touch Core Apps video: https://gist.github.com/cgoldberg/7516510

October 22, 2013

deadsnakes - Using Old Versions of Python on Ubuntu

How do you install an older version of Python on Ubuntu without building it yourself?

The Python packages in the official Ubuntu archives generally don't go back all that far, but people might still need to develop and test against these old Python interpreters. Felix Krull maintains a PPA (package archive) of older Python versions that are easy to install on Ubuntu.

see: https://launchpad.net/~fkrull/+archive/deadsnakes

Currently supported Python releases: 2.4, 2.5, 2.6, 2.7, 3.1, 3.2, 3.3


Instructions:

Add the deadsnakes repository:

$ sudo add-apt-repository ppa:fkrull/deadsnakes

Run Update:

$ sudo apt-get update

Install an older version of Python:

$ sudo apt-get install python2.6 python2.6-dev

June 22, 2013

Generating Audio Spectrograms in Python

A spectrogram is a visual representation of the spectrum of frequencies in a sound sample.

more info: wikipedia spectrogram

Spectrogram code in Python, using Matplotlib:
(source on GitHub)

"""Generate a Spectrogram image for a given WAV audio sample.

A spectrogram, or sonogram, is a visual representation of the spectrum
of frequencies in a sound.  Horizontal axis represents time, Vertical axis
represents frequency, and color represents amplitude.
"""


import wave

import numpy
import pylab


def graph_spectrogram(wav_file):
    sound_info, frame_rate = get_wav_info(wav_file)
    pylab.figure(num=None, figsize=(19, 12))
    pylab.subplot(111)
    pylab.title('spectrogram of %r' % wav_file)
    pylab.specgram(sound_info, Fs=frame_rate)
    pylab.savefig('spectrogram.png')


def get_wav_info(wav_file):
    wav = wave.open(wav_file, 'r')
    frames = wav.readframes(-1)
    # interpret the raw bytes as 16-bit samples (assumes a 16-bit mono WAV)
    sound_info = numpy.frombuffer(frames, dtype=numpy.int16)
    frame_rate = wav.getframerate()
    wav.close()
    return sound_info, frame_rate


if __name__ == '__main__':
    wav_file = 'sample.wav'
    graph_spectrogram(wav_file)

Spectrogram code in Python, using timeside:
(source on GitHub)

"""Generate a Spectrogram image for a given audio sample.

Compatible with several audio formats: wav, flac, mp3, etc.
Requires: https://code.google.com/p/timeside/

A spectrogram, or sonogram, is a visual representation of the spectrum
of frequencies in a sound.  Horizontal axis represents time, Vertical axis
represents frequency, and color represents amplitude.
"""


import timeside


audio_file = 'sample.wav'

decoder = timeside.decoder.FileDecoder(audio_file)
grapher = timeside.grapher.Spectrogram(width=1920, height=1080)
(decoder | grapher).run()
grapher.render('spectrogram.png')

happy audio hacking.

June 10, 2013

Python - concurrencytest: Running Concurrent Tests

Add parallel testing to your unit test framework.

In my previous post, I described running concurrent tests using nose as a loader and runner.

On a similar note, let's look at building concurrency into your own test framework built on Python's unittest.

Have a look at this module: concurrencytest

(Thanks to bits and concepts taken from testtools and bzrlib)


An Example:

Say you have a 'TestSuite' of tests loaded. You could run them with the standard 'TextTestRunner' like this:

runner = unittest.TextTestRunner()
runner.run(suite)

That would run the tests in your suite sequentially in a single process.

By adding the concurrencytest module, you can use a 'ConcurrentTestSuite' instead, by adding:

from concurrencytest import ConcurrentTestSuite, fork_for_tests

concurrent_suite = ConcurrentTestSuite(suite, fork_for_tests(4))
runner.run(concurrent_suite)

That would run the same tests split across 4 processes (workers).

Note: this relies on 'os.fork()' which only works on Unix systems.


There's no way to understand this better than looking at some contrived examples!

This first example is totally unrealistic, but shows off concurrency perfectly. The test cases it loads each sleep for 0.5 seconds and then exit.
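A minimal sketch of what such a suite might look like, using only unittest. The class and helper names are hypothetical, and the delay is a parameter so the sketch can be tried with a shorter sleep.

```python
import time
import unittest


class SleepyTestCase(unittest.TestCase):
    """Hypothetical test that only waits, standing in for an I/O-bound test."""

    delay = 0.5  # seconds, as described above

    def test_sleep(self):
        time.sleep(self.delay)


def make_suite(count, delay=0.5):
    """Build a suite holding `count` copies of the sleeping test."""
    suite = unittest.TestSuite()
    for _ in range(count):
        case = SleepyTestCase("test_sleep")
        case.delay = delay  # per-instance override of the class default
        suite.addTest(case)
    return suite


def run_sequentially(suite):
    """Run the suite in this process; return (result, elapsed seconds)."""
    runner = unittest.TextTestRunner(verbosity=0)
    start = time.time()
    result = runner.run(suite)
    return result, time.time() - start
```

To run the same suite concurrently, it would be wrapped as shown earlier, for example `ConcurrentTestSuite(make_suite(50), fork_for_tests(50))`, and passed to the same runner.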

The Code:

Output:

Loaded 50 test cases...

Run tests sequentially:
..................................................
----------------------------------------------------------------------
Ran 50 tests in 25.031s

OK

Run same tests across 50 processes:
..................................................
----------------------------------------------------------------------
Ran 50 tests in 0.525s

OK

nice!

Now another example that shows concurrency with CPU-bound test cases. The test cases it loads each calculate Fibonacci of 31 (recursively!) and then exit. We can see how it performs on my machine with 8 logical cores (quad-core i7, hyperthreaded).

The Code:

Output:

Loaded 50 test cases...

Run tests sequentially:
..................................................
----------------------------------------------------------------------
Ran 50 tests in 21.941s

OK

Run same tests with 2 processes:
..................................................
----------------------------------------------------------------------
Ran 50 tests in 11.081s

OK

Run same tests with 4 processes:
..................................................
----------------------------------------------------------------------
Ran 50 tests in 5.862s

OK

Run same tests with 8 processes:
..................................................
----------------------------------------------------------------------
Ran 50 tests in 4.743s

OK
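As a quick worked calculation, dividing the sequential time by each parallel time gives the speedup, and dividing that by the number of processes gives the per-process efficiency. The timings are copied from the output above; the function names are only for illustration.

```python
# Wall-clock times in seconds, copied from the output above.
SEQUENTIAL = 21.941
PARALLEL = {2: 11.081, 4: 5.862, 8: 4.743}


def speedup(processes):
    """How many times faster than the sequential run."""
    return SEQUENTIAL / PARALLEL[processes]


def efficiency(processes):
    """Speedup per process; 1.0 would be perfect linear scaling."""
    return speedup(processes) / processes


for n in sorted(PARALLEL):
    print("%d processes: %.2fx speedup, %.0f%% efficiency"
          % (n, speedup(n), 100 * efficiency(n)))
```

That works out to about 1.98x, 3.74x, and 4.63x. Scaling is close to linear up to 4 processes and flattens at 8, which is plausibly because hyperthreading exposes 8 logical cores on 4 physical ones.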

happy hacking.

June 9, 2013

Python - Nose: Running Concurrent Tests

TLDR:
To enable multiprocessing with N workers,
run nose with:

$ nosetests --processes=N

When writing tests in Python, I start with test cases derived from unittest.TestCase, and standard test discovery. When I need more complex test discovery/loading or output reports, I often use nose and its assortment of plugins as my test loader/runner.

One nice feature of nose is the multiprocess plugin. It allows you to run your test suites concurrently rather than sequentially, spread across a number of worker processes. Running tests in parallel like this can potentially give you a large speedup in your test run times.

from the nose multiprocess docs:

"You can parallelize a test run across a configurable number of worker processes. While this can speed up CPU-bound test runs, it is mainly useful for IO-bound tests that spend most of their time waiting for data to arrive from someplace else and can benefit from parallelization."

Normally, you run tests from nose with:

$ nosetests

To run the same tests split across 4 processes (workers), you would just do:

$ nosetests --processes=4

Assuming your tests are properly isolated, everything should run normally, and you can benefit from a speedup on a multiprocessor machine.
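As a minimal sketch of what "properly isolated" can mean in practice: each test below creates its own temporary directory instead of writing to a shared path, so two workers never touch the same file. The class and method names are hypothetical.

```python
import os
import shutil
import tempfile
import unittest


class IsolatedFileTest(unittest.TestCase):
    """Each test writes into its own temporary directory."""

    def setUp(self):
        # A fresh directory per test, so parallel workers never share files.
        self.workdir = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.workdir)

    def write_and_read(self, text):
        """Helper: write text to a file in this test's directory, read it back."""
        path = os.path.join(self.workdir, "data.txt")
        with open(path, "w") as f:
            f.write(text)
        with open(path) as f:
            return f.read()

    def test_first(self):
        self.assertEqual(self.write_and_read("first"), "first")

    def test_second(self):
        self.assertEqual(self.write_and_read("second"), "second")
```

If both tests instead wrote to one fixed path such as a shared `data.txt`, they could pass when run sequentially and fail intermittently when split across processes.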

However, beware.

"Not all test suites will benefit from, or even operate correctly using, this plugin. For example, CPU-bound tests will run more slowly if you don't have multiple processors."
"But the biggest issue you will face is probably concurrency. Unless you have kept your tests as religiously pure unit tests, with no side-effects, no ordering issues, and no external dependencies, chances are you will experience odd, intermittent and unexplainable failures and errors when using this plugin. This doesn't necessarily mean the plugin is broken; it may mean that your test suite is not safe for concurrency."

April 1, 2013

Squeezelite - Headless Squeezebox Emulator

Use Squeezebox, without buying a Squeezebox...

Recently, Logitech discontinued most Squeezebox streaming music players. However, the media server is Open Source, so it looks like some form of Logitech Media Server (LMS) will live on, no matter what Logitech eventually does with it.

I've been a user of the Squeezebox network music player since it was released by SlimDevices (SliMP3/SlimServer), and throughout the transfer to Logitech. I've owned 3 Squeezebox models over the years... currently enjoying the Squeezebox Touch, with music streamed from Logitech Media Server.

It works flawlessly for streaming my own music collection (FLAC/MP3/etc), and streaming radio (Pandora/Slacker/Sirius/etc), to my HiFi. I use the digital (S/PDIF) outputs, and sometimes the DAC/analog (RCA) outputs.

Now... with the release of Squeezelite, you can build your own Squeezebox, or use an existing computer/laptop with digital output as a Squeezebox.

Squeezelite is a cross-platform, headless, LMS client that supports playback synchronization, gapless playback, direct streaming, and playback at various sampling rates. It runs on Linux using ALSA audio output and other platforms using PortAudio. It is aimed at supporting high quality audio.

I gave Squeezelite 1.0 a try on Ubuntu 12.04, with S/PDIF optical output to my DAC. It worked like a charm!

Squeezelite info:
https://code.google.com/p/squeezelite/

Squeezelite download (precompiled binaries for x86/amd64/arm):
https://code.google.com/p/squeezelite/downloads/list

Enjoy the music.

March 28, 2013

Python - Re-tag FLAC Audio Files (Update Metadata)

I had a bunch of FLAC (.flac) audio files together in a directory. They are from various sources and their metadata (tags) were somewhat incomplete or incorrect.

I managed to manually get all of the files standardized in "%Artist% - %Title%.flac" file name format. However, what I really wanted was to clear their metadata and just save "Artist" and "Title" tags, pulled from file names.

I looked at a few audio tagging tools in the Ubuntu repos, and came up short finding something simple that covered my needs. (I use Audio Tag Tool for MP3s, but it has no FLAC file support.)

So, I figured the easiest way to get this done was a quick Python script.

I grabbed Mutagen, a Python module to handle audio metadata with FLAC support.

This is essentially the task I was looking to do:

#!/usr/bin/env python

import glob
import os
from mutagen.flac import FLAC

for filename in glob.glob('*.flac'):
    artist, title = os.path.splitext(filename)[0].split(' - ', 1)
    audio = FLAC(filename)
    audio.clear()
    audio['artist'] = artist
    audio['title'] = title
    audio.save()

It iterates over .flac files in the current directory, clearing the metadata and rewriting only the artist/title tags based on each file name.
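A small sketch of the file name parsing step on its own, with a check for names that do not follow the "Artist - Title" pattern. The helper name is hypothetical, and it uses only the standard library, so it can be tried without Mutagen installed.

```python
import os


def parse_artist_title(filename):
    """Split 'Artist - Title.flac' into (artist, title).

    Raises ValueError if the name has no ' - ' separator.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    # Split on the first ' - ' only, so titles may contain the separator.
    artist, separator, title = stem.partition(" - ")
    if not separator or not artist.strip() or not title.strip():
        raise ValueError("expected 'Artist - Title' in %r" % filename)
    return artist.strip(), title.strip()
```

Checking the name before touching the tags avoids clearing a file's metadata and then failing halfway because its name could not be split.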


I created a repository with a slightly more full-featured version, used to re-tag single FLAC files:
https://github.com/cgoldberg/audioscripts/blob/master/flac_retag.py

January 28, 2013

Python - verify a PNG file and get image dimensions

useful snippet for getting .png image dimensions without using an external imaging library.

#!/usr/bin/env python

import struct


def get_image_info(data):
    if not is_png(data):
        raise ValueError('not a png image')
    # IHDR stores width and height as big-endian unsigned 32-bit integers
    width, height = struct.unpack('>LL', data[16:24])
    return width, height


def is_png(data):
    # 8-byte PNG signature, then the IHDR chunk type at bytes 12-16
    return data[:8] == b'\x89PNG\r\n\x1a\n' and data[12:16] == b'IHDR'


if __name__ == '__main__':
    with open('foo.png', 'rb') as f:
        data = f.read()

    print(is_png(data))
    print(get_image_info(data))
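Since the signature and the IHDR dimensions all sit in the first 24 bytes, the whole file does not need to be read. A minimal sketch of that variation, written for Python 3; the function name is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(path):
    """Return (width, height) of a PNG file, reading only its first 24 bytes."""
    with open(path, "rb") as f:
        header = f.read(24)
    if (len(header) < 24
            or header[:8] != PNG_SIGNATURE
            or header[12:16] != b"IHDR"):
        raise ValueError("not a png image: %r" % path)
    # Width and height are big-endian unsigned 32-bit integers at bytes 16-23.
    return struct.unpack(">LL", header[16:24])
```

This keeps memory use constant regardless of image size, which matters when scanning a large library.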

/headnods:
getimageinfo.py source, Portable_Network_Graphics (Wikipedia)