December 30, 2009

New Year’s Python Meme

What’s the coolest Python application, framework or library you have discovered in 2009?

Multiprocessing, which was included in the standard library starting with Python 2.6. It is uses the same API as the threading module, but spawns separate OS processes, rather than threads. You can use it as a basis to build some very powerful concurrent programs that fully utilize multi-core systems.


What new programming technique did you learn in 2009?

Better concurrency with threads and processes. Up until a few years ago, I mostly dealt with uniprocessor systems. 1 processor, 1 core, concurrency was simple since you didn't have to think about it beyond using threads for non-blocking IO. Now everything is multi-way and multi-core and you have to think about true concurrency. Python gives you some primitives to deal with that, and I have been poking around and trying to learn what I can so I can build concurrent and scalable systems.

I also started dabbling in GUI programming with the new ttk (Tk Themed Widgets) in Python 2.7/3.1.


What’s the name of the open source project you contributed the most in 2009? What did you do?

www.pylot.org

Pylot is a performance/load testing tool for http and web services. I am the author and maintainer.

www.goldb.org
corey-projects

I have also been releasing snippets of open source code at my new Google code repository and on my web site.


What was the Python blog or website you read the most in 2009?

Planet Python gets you most everything in the Python world. But if I had to pick, I'd say my favorite Python blogs are from Jesse Noller and Michael Foord (Voidspace). Both of them write about topics that I am very interested in. (They are both Python Core maintainers).

I also like the regular geek sites: Reddit Programming, Slashdot, Digg. Then there is a big crew of Python developers on Twitter that are fun to follow (follow me: @cgoldberg).

I also really got into StackOverflow this year. It is a really useful question/answer site. The wealth of Python information on there is staggering. I mean, where else can you ask a Python question and have people like Alex Martelli answering it 5 mins later? ... with others fighting to provide even better answers. It's great!


What are the three top things you want to learn in 2010?

- Selenium/WebDriver 2.0 - the new WebDriver browser driver and API. The Selenium guys have been doing a lot of great work and this is becoming the industry standard tool for UI testing of browser based apps. 2.0 is a big change from 1.x and is the merger of 2 projects: Selenium and WebDriver.

- Python 3.1 - I have already been writing some code in Py3k, and to be honest it's barely any different than 2.x. It feels right at home, with some minor syntactical changes and reorganization of some standard library modules.

- Better copy writing - I produce lots of code in the form of snippets or larger projects. Most of these often get lost because I don't have the words to go along with them. If I was a better writer and could express myself better, I would be able to post more often and get my ideas out there.

December 17, 2009

Python - PyQt4 - Hello World

basic "Hello World" GUI application using Python and Qt (PyQt4)


#!/usr/bin/env python
 
import sys
from PyQt4 import Qt

app = Qt.QApplication(sys.argv)
lbl = Qt.QLabel('Hello World')
lbl.show()
app.exec_()

December 16, 2009

Python - Send Email With smtplib

This is mostly just for my own reference


sending an email using Python's smtplib:


import smtplib
from email.MIMEText import MIMEText

smtp_server = 'exchangeserver.foo.com'
recipients = ['corey@foo.com',]
sender = 'corey@foo.com'
subject = 'subject goes here'
msg_text = 'message body goes here'


msg = MIMEText(msg_text)
msg['Subject'] = subject
s = smtplib.SMTP()
s.connect(smtp_server)
s.sendmail(sender, recipients, msg.as_string())
s.close()

Book Sample: Matplotlib for Python Developers - Plotting Data


Today's post is a sample from the new book: "Matplotlib for Python Developers" by Sandro Tosi. I will review the book in an upcoming post.

-Corey



Disclosure: Packt Publishing sent me a free copy of this book to review.




Matplotlib for Python Developers


http://www.packtpub.com/matplotlib-python-development/book





Matplotlib for Python Developers


In this two-part article, by Sandro Tosi you will see several examples of Matplotlib usage in real-world situations, showing some of the common cases where we can use Matplotlib to draw a plot of some values.

There is a common workflow for this kind of job:

  1. Identify the best data source for the information we want to plot
  2. Extract the data of interest and elaborate it for plotting
  3. Plot the data

Usually, the hardest part is to extract and prepare the data for plotting. Due to this, we are going to show several examples of the first two steps.


The examples are:

  • Plotting data from a database
  • Plotting data from a web page
  • Plotting the data extracted by parsing an Apache log file
  • Plotting the data read from a comma-separated values (CSV) file
  • Plotting extrapolated data using curve fitting
  • Third-party tools using Matplotlib (NetworkX and mpmath)

Let's begin

Plotting data from a database

Databases often tend to collect much more information than we can simply extract and watch in a tabular format (let's call it the "Excel sheet" report style).

Databases not only use efficient techniques to store and retrieve data, but they are also very good at aggregating it.

One suggestion we can give is to let the database do the work. For example, if we need to sum up a column, let's make the database sum the data, and not sum it up in the code. In this way, the whole process is much more efficient because:

  • There is a smaller memory footprint for the Python code, since only the aggregate value is returned, not the whole result set to generate it
  • The database has to read all the rows in any case. However, if it's smart enough, then it can sum values up as they are read
  • The database can efficiently perform such an operation on more than one column at a time

The data source we're going to query is from an open source project: the Debian distribution. Debian has an interesting project called UDD , Ultimate Debian Database, which is a relational database where a lot of information (either historical or actual) about the distribution is collected and can be analyzed.

On the project website http://udd.debian.org/, we can fi nd a full dump of the database (quite big, honestly) that can be downloaded and imported into a local PostgreSQL instance (refer to http://wiki.debian.org/UltimateDebianDatabase/CreateLocalReplica for import instructions

Now that we have a local replica of UDD, we can start querying it:

# module to access PostgreSQL databases
import psycopg2
# matplotlib pyplot module
import matplotlib.pyplot as plt

Since UDD is stored in a PostgreSQL database, we need psycopg2 to access it. psycopg2 is a third-party module available at http://initd.org/projects/psycopg

# connect to UDD database
conn = psycopg2.connect(database="udd")
# prepare a cursor
cur = conn.cursor()

We will now connect to the database server to access the udd database instance, and then open a cursor on the connection just created.

# this is the query we'll be making
query = """
select to_char(date AT TIME ZONE 'UTC', 'HH24'), count(*)
from upload_history
where to_char(date, 'YYYY') = '2008'
group by 1
order by 1"""


We have prepared the select statement to be executed on UDD. What we wish to do here is extract the number of packages uploaded to the Debian archive (per hour) in the whole year of 2008.

  • date AT TIME ZONE 'UTC': As date field is of the type timestamp with time zone, it also contains time zone information, while we want something independent from the local time. This is the way to get a date in UTC time zone.
  • group by 1: This is what we have encouraged earlier, that is, let the database do the work. We let the query return the already aggregated data, instead of coding it into the program.
# execute the query
cur.execute(query)
# retrieve the whole result set
data = cur.fetchall()

We execute the query and fetch the whole result set from it.

# close cursor and connection
cur.close()
conn.close()

Remember to always close the resources that we've acquired in order to avoid memory or resource leakage and reduce the load on the server (removing connections that aren't needed anymore).

# unpack data in hours (first column) and
# uploads (second column)
hours, uploads = zip(*data)

The query result is a list of tuples, (in this case, hour and number of uploads), but we need two separate lists—one for the hours and another with the corresponding number of uploads. zip() solves this with *data, we unpack the list, returning the sublists as separate arguments to zip(), which in return, aggregates the elements in the same position in the parameters into separated lists. Consider the following example:

In [1]: zip(['a1', 'a2'], ['b1', 'b2'])
Out[1]: [('a1', 'b1'), ('a2', 'b2')]

To complete the code:

# graph code
plt.plot(hours, uploads)
# the the x limits to the 'hours' limit
plt.xlim(0, 23)
# set the X ticks every 2 hours
plt.xticks(range(0, 23, 2))
# draw a grid
plt.grid()
# set title, X/Y labels
plt.title("Debian packages uploads per hour in 2008")
plt.xlabel("Hour (in UTC)")
plt.ylabel("No. of uploads")

The previous code snippet is the standard plotting code, which results in the following screenshot:

From this graph we can see that in 2008, the main part of Debian packages uploads came from European contributors. In fact, uploads were made mainly in the evening hours (European time), after the working days are over (as we can expect from a voluntary project).


Plotting data from the Web

Often, the information we need is not distributed in an easy-to-use format such as XML or a database export but for example only on web sites.

More and more often we find interesting data on a web page, and in that case we have to parse it to extract that information: this is called web scraping .

In this example, we will parse a Wikipedia article to extracts some data to plot. The article is at http://it.wikipedia.org/wiki/Demografia_d'Italia and contains lots of information about Italian demography (it's in Italian because the English version lacks a lot of data); in particular, we are interested in the population evolution over the years.

Probably the best known Python module for web scraping is BeautifulSoup ( http://www.crummy.com/software/BeautifulSoup/). It's a really nice library that gets the job done quickly, but there are situations (in particular with JavaScript embedded in the web page, such as for Wikipedia) that prevent it from working.

As an alternative, we find lxml quite productive (http://codespeak.net/lxml/). It's a library mainly used to work with XML (as the name suggests), but it can also be used with HTML (given their quite similar structures), and it is powerful and easy–to-use.

Let's dig into the code now:

# to get the web pages
import urllib2
# lxml submodule for html parsing
from lxml.html import parse
# regular expression module
import re
# Matplotlib module
import matplotlib.pyplot as plt

Along with the Matplotlib module, we need the following modules:

  • urllib2: This is the module (from the standard library) that is used to access resources through URL (we will download the webpage with this).
  • lxml: This is the parsing library.
  • re: Regular expressions are needed to parse the returned data to extract the information we need. re is a module from the standard library, so we don't need to install a third-party module to use it.
# general urllib2 config
user_agent = 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
url = "http://it.wikipedia.org/wiki/Demografia_d'Italia"

Here, we prepare some configuration for urllib2, in particular, the user_agent header is used to access Wikipedia and the URL of the page.

# prepare the request and open the url
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)

Then we make a request for the URL and get the HTML back.

# we parse the webpage, getroot() return the document root
doc = parse(response).getroot()

We parse the HTML using the parse() function of lxml.html and then we get the root element. XML can be seen as a tree, with a root element (the node at the top of the tree from where every other node descends), and a hierarchical structure of elements.

# find the data table, using css elements
table = doc.cssselect('table.wikitable')[0]

We leverage the structure of HTML accessing the first element of type table of class wikitable because that's the table we're interested in.

# prepare data structures, will contain actual data
years = []
people = []

Preparing the lists that will contain the parsed data.

# iterate over the rows of the table, except first and last ones
for row in table.cssselect('tr')[1:-1]:

We can start parsing the table. Since there is a header and a footer in the table, we skip the first and the last line from the lines (selected by the tr tag) to loop over.

# get the row cell (we will use only the first two)
data = row.cssselect('td')

We get the element with the td tag that stands for table data: those are the cells in an HTML table.

# the first cell is the year
tmp_years = data[0].text_content()
# cleanup for cases like 'YYYY[N]' (date + footnote link)
tmp_years = re.sub('[.]', '', tmp_years)

We take the first cell that contains the year, but we need to remove the additional characters (used by Wikipedia to link to footnotes).

# the second cell is the population count
tmp_people = data[1].text_content()
# cleanup from '.', used as separator
tmp_people = tmp_people.replace('.', '')

We also take the second cell that contains the population for a given year. It's quite common in Italy to separate thousands in number with a '.' character: we have to remove them to have an appropriate value.

# append current data to data lists, converting to integers
years.append(int(tmp_years))
people.append(int(tmp_people))

We append the parsed values to the data lists, explicitly converting them to integer values.

# plot data
plt.plot(years,people)
# ticks every 10 years
plt.xticks(range(min(years), max(years), 10))
plt.grid()
# add a note for 2001 Census
plt.annotate("2001 Census", xy=(2001, people[years.index(2001)]),
xytext=(1986, 54.5*10**6),
arrowprops=dict(arrowstyle='fancy'))

Running the example results in the following screenshot that clearly shows why the annotation is needed:

In 2001, we had a national census in Italy, and that's the reason for the drop in that year: the values released from the National Institute for Statistics (and reported in the Wikipedia article) are just an estimation of the population. However, with a census, we have a precise count of the people living in Italy.




Plotting data by parsing an Apache log file

Plotting data from a log file can be seen as the art of extracting information from it.

Every service has a log format different from the others. There are some exceptions of similar or same format (for example, for services that come from the same development teams) but then they may be customized and we're back at the beginning.

The main differences in log files are:

  • Fields orders: Some have time information at the beginning, others in the middle of the line, and so on
  • Fields types: We can find several different data types such as integers, strings, and so on
  • Fields meanings: For example, log levels can have very different meanings

From all the data contained in the log file, we need to extract the information we are interested in from the surrounding data that we don't need (and hence we skip).

In our example, we're going to analyze the log file of one of the most common services: Apache. In particular, we will parse the access.log file to extract the total number of hits and amount of data transferred per day.

Apache is highly configurable, and so is the log format. Our Apache configuration, contained in the httpd.conf file, has this log format:

"%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i""

In this way, we extract the day of the request along with the response size from every line.

# prepare dictionaries to contain the data
day_hits = {}
day_txn = {}

These dictionaries will store the parsed data.

# we open the file
with open('<location of the Apache>/access.log') as f:
# and for every line in it
for line in f:

we open the file (we have to select the proper location of access.log as it differs between operating systems and/or installation), and for every line in it

# we pass the line to regular expression
m = apa_line.match(line)
# and we get the 2 values matched back
day, call_size = m.groups()

We parse the line and take the resulting values.

# if the current day is already present
if day in day_hits:
# we add the call and the size of the request
day_hits[day] += 1
day_txn[day] += int(call_size)

else if the current day is already present in the dictionaries, then add the hit and the size to the respective dictionaries.

else:
# else we initialize the dictionaries
day_hits[day] = 1
day_txn[day] = int(call_size)

If the current day is not present, then we need to initialize the dictionaries for a new day.

# prepare a list of the keys (days)
keys = sorted(day_hits.keys())

We prepare a sorted list of dictionary keys, since we need to access it several times.

# prepare a figure and an Axes in it
fig = plt.figure()
ax1 = fig.add_subplot(111)

We prepare a Figure and an Axes in it.

# bar width
width = .4

We define the bars width.

# for each key (day) and it's position
for i, k in enumerate(keys):

Then we enumerate the item in keys so that we have a progressive number associated with each key.

# we plot a bar
ax1.bar(i - width/2, day_hits[k], width=width, color='y')

Now we can plot a bar at position i (but shifted by width/2 to center the tick) with its height proportional to the number of hits, and width set to width. We set the bar color to yellow.

# for each label for the X ticks
for label in ax1.get_xticklabels():
# we hide it
label.set_visible(False)

We hide the labels on the X-axis to avoid its superimposition with the other Axes labels (for transfer size).

# add a label to the Y axis (for the first plot)
ax1.set_ylabel('Total hits')

We set a label for the Y-axis.

# create another Axes instance, twin of the previous one
ax2 = ax1.twinx()

We now create a second Axes, sharing the X-axis with the previous one.

# plot the total requests size
ax2.plot([day_txn[k] for k in keys], 'k', linewidth=2)

We plot a line for the transferred size, using the black color and with a bigger width.

# set the Y axis to start from 0
ax2.set_ylim(ymin=0)

We let the Y-axis start from 0 so that it can be congruent with the other Y-axis.

# set the X ticks for each element of keys (days)
ax2.set_xticks(range(len(keys)))
# set the label for them to keys, rotating and align to the right
ax2.set_xticklabels(keys, rotation=25, ha='right')

We set the ticks for the X-axis (that will be shared between this and the bar plot) and the labels, rotating them by 25 degrees and aligning them to the right to better fit the plot.

# set the formatter for Y ticks labels
ax2.yaxis.set_major_formatter(FuncFormatter(megabytes))
# add a label to Y axis (for the second plot)
ax2.set_ylabel('Total transferred data (in Mb)')

Then we set the formatter for the Y-axis so that the labels are shown in megabytes (instead of bytes).

# add a title to the whole plot
plt.title('Apache hits and transferred data by day

Finally, we set the plot title.

On executing the preceding code snippet, the following screenshot is displayed:

The preceding screenshot tells us that it is not a very busy server, but it still shows what's going on.


Continue Reading Plotting data using Matplotlib: Part 2

December 14, 2009

http_rrd_profiler - HTTP Request Profiler with RRD Storage and Graphing

http_rrd_profiler is an HTTP testing/monitoring tool written in Python. It uses Python's httplib to send a GET request to a web resource and adds performance timing instrumentation between each step of the request/response. RRDTool is used for data storage and graph generation. RRDtool is an open source, high performance logging and graphing system for time-series data.

The RRD graphs it produces:



The code is located in my repository. You will need the following files:
http_rrd_profiler.py
rrd.py


Usage:

http_rrd_profiler.py <host> <interval in secs>

You invoke it from the command like like this:

> python http_rrd_profiler.py www.goldb.org 5

This will send an HTTP GET to http://www.goldb.org/ every 5 seconds.

Timing results are printed to the console and stored in a Round Robin Database (RRD). During each interval, a graph (png image) is generated.

Console output:

----------------
216 request sent
337 response received
456 content transferred (8044 bytes)
----------------
208 request sent
332 response received
453 content transferred (8044 bytes)
----------------
212 request sent
332 response received
452 content transferred (8044 bytes)
----------------
210 request sent
329 response received
448 content transferred (8044 bytes)
----------------
214 request sent
333 response received
452 content transferred (8044 bytes)

* tested on Ubuntu Karmic and WinXP with Python 2.6

December 9, 2009

Python 3 - tkinter (Tk Widgets) - BusyBar Busy Indicator

Here is another example GUI application using Python 3.1.

For this example, I used BusyBar.py, a module for creating a "busy indicator" (like the knight-rider light). To do this, I first had to port BusyBar to Python 3.1 from 2.x. The new code for this module can be found here:

BusyBar.py

It renders on Ubuntu (with Gnome) like this:

Code:

#!/usr/bin/env python
# Python 3


import tkinter
from tkinter import ttk
import BusyBar


class Application:
    def __init__(self, root):
        self.root = root
        self.root.title('BusyBar Demo')
        ttk.Frame(self.root, width=300, height=100).pack()
        
        bb = BusyBar.BusyBar(self.root, width=200)
        bb.place(x=40, y=20)
        bb.on()
    

if __name__ == '__main__':
    root = tkinter.Tk()
    Application(root)
    root.mainloop()

December 8, 2009

Python 3 - tkinter ttk (Tk Themed Widgets) - Blocking Command Example

Here is another example GUI application with ttk in Python 3.1.

This example shows usage of the Button, Text, and Scrollbar widgets, along with some basic layout management using grid().

To use it, enter a URL and click the button. It will send an HTTP GET request to the URL and insert the response headers into the Text widget. You will notice that the GUI is blocked (frozen) while the URL is fetched. This is because all of the work is done in the main event thread. I will show how to use threads for non-blocking events in a future post.

It renders on Ubuntu (with Gnome) like this:

Code:

#!/usr/bin/env python
# Python 3


import tkinter
from tkinter import ttk
import urllib.request



class Application:
    def __init__(self, root):
        self.root = root
        self.root.title('Blocking Command Demo')
        
        self.init_widgets()
            
            
    def init_widgets(self):
        self.btn = ttk.Button(self.root, command=self.get_url, text='Get Url', width=8)
        self.btn.grid(column=0, row=0, sticky='w')
        
        self.entry = ttk.Entry(self.root, width=60)
        self.entry.grid(column=0, row=0, sticky='e')
        
        self.txt = tkinter.Text(self.root, width=80, height=20)
        self.txt.grid(column=0, row=1, sticky='nwes')
        sb = ttk.Scrollbar(command=self.txt.yview, orient='vertical')
        sb.grid(column=1, row=1, sticky='ns')
        self.txt['yscrollcommand'] = sb.set
        

    def get_url(self):
        url = self.entry.get()
        headers = urllib.request.urlopen(url).info()
        self.txt.insert(tkinter.INSERT, headers)
    
    

if __name__ == '__main__':
    root = tkinter.Tk()
    Application(root)
    root.mainloop()

December 7, 2009

Python 3 - tkinter ttk (Tk Themed Widgets) - Button/Text Example

Here is some boilerplate code for setting up a basic GUI application with ttk in Python 3.1.

This example shows usage of the Frame, Button, and Text widgets. When you click the button, "Hello World" is inserted into the Text widget.

#!/usr/bin/env python
# Python 3


import tkinter
from tkinter import ttk


class Application:
    def __init__(self, root):
        self.root = root
        self.root.title('Button Demo')
        ttk.Frame(self.root, width=250, height=100).pack()
        
        self.init_widgets()
            
    def init_widgets(self):
        ttk.Button(self.root, command=self.insert_txt, text='Click Me', width='10').place(x=10, y=10)
        self.txt = tkinter.Text(self.root, width='15', height='2')
        self.txt.place(x=10, y=50)
        
    def insert_txt(self):
        self.txt.insert(tkinter.INSERT, 'Hello World\n')


if __name__ == '__main__':
    root = tkinter.Tk()
    Application(root)
    root.mainloop()

It renders on Ubuntu (with Gnome) like this:

December 6, 2009

Python 3 - tkinter ttk (Tk Themed Widgets) - GUI Hello World

Here is some boilerplate code for setting up a basic GUI application with ttk in Python 3.1:

#!/usr/bin/env python
# Python 3


import tkinter
from tkinter import ttk

class Application:
    def __init__(self, root):
        self.root = root
        self.root.title('Hello World')
        ttk.Frame(self.root, width=200, height=100).pack()
        ttk.Label(self.root, text='Hello World').place(x=10, y=10)

if __name__ == '__main__':
    root = tkinter.Tk()
    Application(root)
    root.mainloop()

This example shows usage of the Frame and Label widgets, and some basic layout management using pack() and place(). It renders on Ubuntu (with Gnome) like this:

December 3, 2009

Python - Accurate Cross Platform Timers - time.time() vs. time.clock()

In Python, you can add simple timers around any action using the clock() or time() functions from the time module. time.clock() gives the best timer accuracy on Windows, while the time.time() function gives the best accuracy on Unix/Linux. You should use whichever one is more appropriate for your system.

Inside the timeit module's source code (in Python's standard library), you can see an example of this. The timer type is specified by the platform:

if sys.platform == "win32":
    # On Windows, the best timer is time.clock()
    default_timer = time.clock
else:
    # On most other platforms, the best timer is time.time()
    default_timer = time.time

This seems like a nice way to create a timer that is accurate on either platform.

Here is an example:

import sys
import time

# choose timer to use
if sys.platform.startswith('win'):
    default_timer = time.clock
else:
    default_timer = time.time

start = default_timer()
# do something
finish = default_timer()
elapsed = (finish - start)