May 30, 2008

Python - rrdpy - Round Robin Databases (RRDTool)

(all scripts and code referenced in this post can be found here: http://code.google.com/p/rrdpy/source/browse/trunk)
"RRDtool is the Open Source industry standard, high performance data logging and graphing system for time series data. It stores the data in a very compact way that will not expand over time, and it can create beautiful graphs."
If you are developing tools that need a data repository and graphing capabilities, RRDTool provides both. You create an RRD and then insert data values at regular intervals. You then call the graphing API to render a graph. The cool thing about this data storage is its “round robin” nature. You define various time spans and the granularity at which each one is stored. A fixed-size binary file is created, and it never grows over time. As you insert more data, it is written into each span: as results are collected, they are averaged and rolled into successive time spans. This makes for a much more efficient system than rolling your own complex data structures or using relational databases or file system storage.
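
To make the "round robin" idea concrete, here is a toy model in plain Python. It is only a sketch of the concept, not RRDTool's actual file format or API: the class name RoundRobinArchive and its parameters are made up for illustration. Each archive has a fixed number of slots, averages a fixed number of raw samples into each slot, and overwrites the oldest slot once it is full.

```python
class RoundRobinArchive(object):
    """Fixed-size ring of averaged values (toy model, not RRDTool's format)."""

    def __init__(self, rows, steps_per_row):
        self.rows = rows                    # number of slots; never grows
        self.steps_per_row = steps_per_row  # raw samples averaged into one slot
        self.slots = [None] * rows
        self.next_slot = 0
        self.pending = []

    def add(self, value):
        self.pending.append(value)
        if len(self.pending) == self.steps_per_row:
            average = sum(self.pending) / float(len(self.pending))
            self.slots[self.next_slot] = average  # overwrites the oldest slot
            self.next_slot = (self.next_slot + 1) % self.rows
            self.pending = []

    def values(self):
        """Return stored values, oldest first, skipping empty slots."""
        ordered = self.slots[self.next_slot:] + self.slots[:self.next_slot]
        return [v for v in ordered if v is not None]


fine = RoundRobinArchive(rows=6, steps_per_row=1)    # every sample, last 6 kept
coarse = RoundRobinArchive(rows=4, steps_per_row=3)  # 3-sample averages, last 4 kept

for sample in range(1, 13):  # feed samples 1..12 into both archives
    fine.add(sample)
    coarse.add(sample)

print(fine.values())    # [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
print(coarse.values())  # [2.0, 5.0, 8.0, 11.0]
```

The fine archive has already forgotten samples 1 through 6, while the coarse one still covers the whole run at lower resolution. Neither list ever gets longer, which is why the file on disk stays the same size.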

Recently, I started a small project on Google Code (rrdpy) to create a set of Python tools to make dealing with Round Robin Databases (RRDs) less painful. Setting up RRDs can be tough if you don't know what you are doing.

Below are some example ways to use these tools.


First, I will create a Round Robin Database (RRD) using the rrd_maker.py script, which uses my rrd.py class. This script creates an RRD named test.rrd that expects an update every 10 seconds:
import rrd

interval = 10
rrd_file = 'test.rrd'

my_rrd = rrd.RRD(rrd_file, vertical_label='value')
my_rrd.create_rrd(interval)
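
For readers who have not used the rrdtool command line, the sketch below builds the kind of "rrdtool create" command that a wrapper like this could issue. It is an illustration under assumptions, not the actual contents of rrd.py: the helper name build_create_args, the data source name "ds", the GAUGE type, the heartbeat of two intervals, and the 360-row archive (one hour at 10-second steps) are all chosen here for the example.

```python
def build_create_args(rrd_file, interval, rows=360, ds_name='ds'):
    """Build an argument list for 'rrdtool create' (hypothetical helper)."""
    heartbeat = interval * 2  # assumed: allow one missed update before 'unknown'
    return [
        'rrdtool', 'create', rrd_file,
        '--step', str(interval),                      # expected seconds between updates
        'DS:%s:GAUGE:%d:U:U' % (ds_name, heartbeat),  # one data source, no min/max limits
        'RRA:AVERAGE:0.5:1:%d' % rows,                # keep 'rows' unaveraged samples
    ]


print(' '.join(build_create_args('test.rrd', 10)))
# rrdtool create test.rrd --step 10 DS:ds:GAUGE:20:U:U RRA:AVERAGE:0.5:1:360
```

The list could be handed to subprocess.call() on a machine with rrdtool installed. Adding more RRA entries with a larger step count is how you would get the coarser, longer time spans.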
Now that I have my RRD created, what can I do with it? For a quick example, I can use the rrd_feeder_rand.py script to feed in random numbers and generate a graph every 10 seconds. The generated graph shows the past 60 minutes of data. The code for that looks like:
import rrd
import random
import time


interval = 10
rrd_file = 'test.rrd'
           
my_rrd = rrd.RRD(rrd_file)

while True:
    rand = random.randint(1, 100)
    my_rrd.update(rand)
    my_rrd.graph(60)
    time.sleep(interval)
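
The update and graph calls above map onto two rrdtool commands. As a hedged sketch (the helper names, the data source name "ds", the output file name, and the line color are assumptions for illustration and are not taken from rrd.py), the arguments could be assembled like this:

```python
def build_update_args(rrd_file, value):
    """Build an argument list for 'rrdtool update' (hypothetical helper)."""
    # 'N' tells rrdtool to timestamp the sample with the current time.
    return ['rrdtool', 'update', rrd_file, 'N:%s' % value]


def build_graph_args(rrd_file, png_file, minutes, ds_name='ds',
                     vertical_label='value'):
    """Build an argument list for 'rrdtool graph' (hypothetical helper)."""
    return [
        'rrdtool', 'graph', png_file,
        '--start', '-%d' % (minutes * 60),            # seconds back from now
        '--vertical-label', vertical_label,
        'DEF:v=%s:%s:AVERAGE' % (rrd_file, ds_name),  # read the averaged series
        'LINE1:v#0000FF:%s' % ds_name,                # draw it as a thin blue line
    ]


print(' '.join(build_update_args('test.rrd', 42)))
print(' '.join(build_graph_args('test.rrd', 'test.png', 60)))
```

Passing 60 for minutes produces a start of -3600, which matches the one-hour window graphed in the feeder script.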
OK, random numbers are pretty boring. How about an HTTP website monitor? I created the rrd_feeder_http.py script to show how this is done. This script sends HTTP GET requests to a specified URL. A request is sent every 10 seconds, and a graph of response times for the past hour is generated.
import rrd
import time
import httplib


host = 'www.python.org'
path = '/'
use_ssl = False

interval = 10
rrd_file = 'test.rrd'
            
            
def main():            
    my_rrd = rrd.RRD(rrd_file, 'Response Time')
    while True:   
        start_time = time.time()
        if send(host):
            end_time = time.time()
            raw_latency = end_time - start_time
            expire_time = (interval - raw_latency)
            latency = ('%.3f' % raw_latency)
            my_rrd.update(latency)
            my_rrd.graph(60)
            print latency
        else:
            expire_time = interval
        if expire_time > 0:
            time.sleep(expire_time)
                

def send(host):
    if use_ssl:
        conn = httplib.HTTPSConnection(host)
    else:
        conn = httplib.HTTPConnection(host)
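    try:
        conn.request('GET', path)
        # read the whole body so the timing covers the full response
        conn.getresponse().read()
        return True
    except Exception:
        # a bare except would also swallow Ctrl-C (KeyboardInterrupt)
        print 'Failed request'
        return False
    finally:
        conn.close()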
        
        
if __name__ == '__main__':
    main()
The output from running the HTTP website monitor script looks like this:

[image: graph of HTTP response times over the past hour]
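
The timing logic in the monitor (measure how long the request took, then sleep only for whatever is left of the interval) can be pulled out into two small helpers. This is a sketch under assumptions: timed_call and remaining_sleep are hypothetical names, not part of rrdpy. Unlike the script above, it also subtracts the elapsed time after a failed request, so a slow failure does not stretch the cycle.

```python
import time


def timed_call(func):
    """Run func() and return (succeeded, seconds_elapsed)."""
    start = time.time()
    try:
        ok = bool(func())
    except Exception:
        ok = False
    return ok, time.time() - start


def remaining_sleep(interval, elapsed):
    """Seconds left in the current interval, never negative."""
    return max(0.0, interval - elapsed)


def fake_request():
    # Stand-in for a real HTTP request; takes about 50 ms and succeeds.
    time.sleep(0.05)
    return True


ok, elapsed = timed_call(fake_request)
print('%s %.3f %.3f' % (ok, elapsed, remaining_sleep(10, elapsed)))
```

In the monitor loop, the send function would take the place of fake_request, and time.sleep(remaining_sleep(interval, elapsed)) would keep requests roughly 10 seconds apart no matter how long each one takes.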

May 29, 2008

Python - Yahoo Stock Quotes - Historical Pricing

I recently received a patch to my ystockquote.py module for retrieving historical stock prices. It takes a start and end date (YYYYMMDD) and a ticker symbol, and returns pricing data (a nested list) for the time period specified. For example:
import ystockquote

ticker = 'GOOG'
start = '20080520'
end = '20080523'

data = ystockquote.get_historical_prices(ticker, start, end)

for dat in data:
    print dat
Output:
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Clos']
['2008-05-23', '546.96', '553.00', '537.81', '544.62', '4431500', '544.6']
['2008-05-22', '551.95', '554.21', '540.25', '549.46', '5076300', '549.4']
['2008-05-21', '578.52', '581.41', '547.89', '549.99', '6468100', '549.9']
['2008-05-20', '574.63', '582.48', '572.91', '578.60', '3313600', '578.6']
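
Since the first row of the nested list is a header, the rows are easy to turn into dictionaries. Here is a minimal sketch that works on the sample output above, copied in verbatim so it runs without a network connection. The helper names rows_to_dicts and to_yyyymmdd are made up for this example and are not part of ystockquote.

```python
import datetime


def rows_to_dicts(data):
    """Pair each data row with the header row (hypothetical helper)."""
    header, rows = data[0], data[1:]
    return [dict(zip(header, row)) for row in rows]


def to_yyyymmdd(day):
    """Format a date the way the start and end arguments expect."""
    return day.strftime('%Y%m%d')


# Sample rows copied from the output above.
data = [
    ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Clos'],
    ['2008-05-23', '546.96', '553.00', '537.81', '544.62', '4431500', '544.6'],
    ['2008-05-22', '551.95', '554.21', '540.25', '549.46', '5076300', '549.4'],
    ['2008-05-21', '578.52', '581.41', '547.89', '549.99', '6468100', '549.9'],
    ['2008-05-20', '574.63', '582.48', '572.91', '578.60', '3313600', '578.6'],
]

records = rows_to_dicts(data)
closes = [float(record['Close']) for record in records]  # values arrive as strings

print(records[0]['Date'])                      # 2008-05-23
print(max(closes))                             # 578.6
print(to_yyyymmdd(datetime.date(2008, 5, 20)))  # 20080520
```

All values come back as strings, so anything numeric needs a float() or int() conversion before you do math on it.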

New Blog - Switched To Blogger - First Post

I was sick of the hassles of hosting my own blog, so I just switched to Blogger. All of my old feeds should remain the same... so if you are already subscribed, you are all set.

The old blog will remain up in read-only mode, so its info is still indexed and searchable.

Coming soon:
more Python tricks, more performance testing tips, etc.

Visit the new blog here: http://coreygoldberg.blogspot.com or Subscribe here: http://feeds.feedburner.com/goldblog