Category Archives: Python

Tornado Internals

Tornado is quite a piece of technology: the more you read about it, the better you understand and appreciate it. My ‘attempt’ here is to walk you through the workflow of Tornado’s internals…

Tornado is a non-blocking web server, as we all know. But how does it get this non-blocking behaviour going?

Let’s first understand that Tornado is an event-driven web server. But what does event-driven programming mean? It means your application runs on a (single-threaded) event loop that keeps polling for events, identifies the ones it is interested in, and then handles them. Tornado works on the same principle.

The Tornado web server runs an ioloop, a single-threaded main event loop that sits at the core of Tornado’s async nature. The ioloop maintains a list of file descriptors, the events to be watched on each of them, and the corresponding handler for each fd; entries are registered with ioloop.add_handler(fd, handler, events).

The ioloop is a user-space construct, so who actually listens for events on the fds? That has to be a kernel facility, and Tornado uses epoll (Linux) or kqueue (BSD), both of which provide event notifications in a non-blocking way. epoll has three main functions:

  • epoll_create – creates an epoll object
  • epoll_ctl – controls the file descriptors and events to be watched
  • epoll_wait – waits until a registered event occurs or the timeout expires

epoll thus watches file descriptors (sockets) and returns needed (READ, WRITE & ERROR) events.

As described above, Tornado’s ioloop consumes these events (for the file descriptors) and runs the associated handlers for them.
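The register, wait and dispatch cycle described above can be sketched with Python’s standard selectors module, which picks epoll, kqueue or another mechanism depending on the platform. This is only a toy illustration and not Tornado’s actual code: MiniLoop, run_once and on_readable are made-up names, and unlike the real ioloop it runs a single iteration instead of looping forever.

```python
import selectors
import socket


class MiniLoop:
    """Toy event loop (hypothetical, not Tornado's IOLoop)."""

    def __init__(self):
        # DefaultSelector chooses epoll, kqueue, etc. based on the platform
        self._selector = selectors.DefaultSelector()

    def add_handler(self, sock, handler, events):
        # Remember which handler to run when `events` fire on `sock`
        self._selector.register(sock, events, data=handler)

    def remove_handler(self, sock):
        self._selector.unregister(sock)

    def run_once(self, timeout=1.0):
        # Block until a registered event occurs or the timeout expires,
        # then dispatch every ready socket to its handler
        for key, mask in self._selector.select(timeout):
            key.data(key.fileobj, mask)


received = []


def on_readable(sock, mask):
    # Handler: the socket is readable, so recv() will not block
    received.append(sock.recv(1024))


# A connected socket pair stands in for a client connection
reader, writer = socket.socketpair()
reader.setblocking(False)

loop = MiniLoop()
loop.add_handler(reader, on_readable, selectors.EVENT_READ)

writer.sendall(b"hello")
loop.run_once()

loop.remove_handler(reader)
reader.close()
writer.close()
print(received)
```

The handler only runs once the kernel reports the socket as readable, which is why a single thread can serve many connections without blocking on any one of them.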

tornado.iostream.IOStream works as an abstraction layer on top of sockets. Its three main methods are:

  • read_until() – reads from the socket until it finds a given delimiter; the HTTP server uses the empty line that marks the end of the HTTP headers
  • read_bytes() – reads N bytes from the socket
  • write() – writes a buffer to the socket

All of these methods can call a callback when their job is done.
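To make the callback style concrete, here is a small standalone sketch of the read_until() idea: buffer whatever arrives and fire the callback once the delimiter shows up. It is not IOStream itself. BufferedReader and feed() are hypothetical names, and feed() stands in for the event loop telling the stream that the socket has data.

```python
class BufferedReader:
    """Toy stand-in for a stream (hypothetical, not tornado's IOStream)."""

    def __init__(self):
        self._buffer = b""
        self._delimiter = None
        self._callback = None

    def read_until(self, delimiter, callback):
        # Remember what to look for; the callback runs once it is seen
        self._delimiter = delimiter
        self._callback = callback
        self._try_deliver()

    def feed(self, data):
        # In a real server the event loop would trigger this on a READ event
        self._buffer += data
        self._try_deliver()

    def _try_deliver(self):
        if self._callback is None:
            return
        index = self._buffer.find(self._delimiter)
        if index == -1:
            return  # delimiter not seen yet, wait for more data
        end = index + len(self._delimiter)
        chunk, self._buffer = self._buffer[:end], self._buffer[end:]
        # Clear the pending read before calling back, so the callback
        # is free to start another read
        callback, self._callback = self._callback, None
        callback(chunk)


headers = []
stream = BufferedReader()
stream.read_until(b"\r\n\r\n", headers.append)

# The request arrives in two pieces, as it might over a real socket
stream.feed(b"GET / HTTP/1.1\r\nHost: exam")
stream.feed(b"ple.com\r\n\r\nbody-bytes")
print(headers)
```

The callback receives everything up to and including the blank line, while the bytes that follow stay in the buffer for the next read.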

tornado.httpserver is a non-blocking HTTP server that accepts connections from clients on a defined port by adding its listening sockets to the ioloop.

  • http_server = httpserver.HTTPServer(handle_request)
  • http_server.listen(8888)
  • ioloop.IOLoop.instance().start()

The handler argument mentioned for the ioloop above is a callback that accepts the new connection, creates an IOStream, and creates an HTTPConnection object (from the httpserver module) that is then responsible for handling all requests from that client.

Selenium with Python bindings

After a lot of posts on the Tornado web server and on understanding BDD, let’s get to testing our website. What better than to use Selenium? Let’s go through the setup and create our first test.

Prerequisites

1. Python bindings for Selenium – Go to the Selenium site and download the package

Install as:

  • tar xvf selenium-2.25.0.tar.gz
  • cd selenium-2.25.0
  • sudo python setup.py install

2. Java Server – Download the server from here

Run as:

  • java -jar selenium-server-standalone-2.25.0.jar

Here we discuss the usage of Selenium 2.0 Web Driver, with and without the Selenium server. Examples of each follow below.

Just a bit of history first… Web Driver aims to improve on Selenium 1.0 Remote Control. The distinguishing factors are:

  • Object Oriented APIs
  • More features
  • Web Driver uses the APIs exported by the browser for automated testing, while Selenium Remote Control injects JavaScript to run the test

Web Driver without selenium server

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://www.yahoo.com") # Load page
assert "Yahoo!" in browser.title

Web Driver with selenium server – WebDriver Remote

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
driver = webdriver.Remote(
   command_executor='http://127.0.0.1:4444/wd/hub',
   desired_capabilities=DesiredCapabilities.FIREFOX)

driver.get("http://www.python.org")
driver.close()

BDD in Python with lettuce

Behavior Driven Development, also known as BDD, is a concept developed by Dan North that builds on the popular and well-adopted TDD (Test Driven Development). In Dan’s words:

‘BDD is a second-generation, outside–in, pull-based, multiple-stakeholder, multiple-scale, high-automation, agile methodology. It describes a cycle of interactions with well-defined outputs, resulting in the delivery of working, tested software that matters.’

BDD provides a framework in which QA, business analysts and other stakeholders communicate and collaborate on software development. While TDD emphasizes developing tests for individual units of code, BDD insists on developing tests for business scenarios, use cases, or behavioral specifications of the software being developed. According to Dan, BDD tests should be written as user stories, ‘As a [role] I want [feature] so that [benefit]’, and acceptance criteria should be defined as ‘Given [initial context], when [event occurs], then [ensure some outcomes]’.

lettuce is typically used in Python to implement BDD. This blog covers the installation of lettuce on Ubuntu and its application, with a fibonacci function as the example.

Installation

buntu@ubuntu:~$ sudo pip install lettuce
[sudo] password for buntu:
Downloading/unpacking lettuce
Downloading lettuce-0.2.9.tar.gz (40Kb): 40Kb downloaded
Running setup.py egg_info for package lettuce
Downloading/unpacking sure (from lettuce)
Downloading sure-1.0.6.tar.gz
Running setup.py egg_info for package sure
Downloading/unpacking fuzzywuzzy (from lettuce)
Downloading fuzzywuzzy-0.1.tar.gz
Running setup.py egg_info for package fuzzywuzzy
Installing collected packages: fuzzywuzzy, lettuce, sure
Running setup.py install for lettuce
Installing lettuce script to /usr/local/bin
Running setup.py install for sure
Running setup.py install for fuzzywuzzy
Successfully installed lettuce

Setup

Let’s first create a directory structure that looks like this

buntu@ubuntu:~$ tree lettucetests/
lettucetests/
|-- features
|   |-- fib.feature
|   |-- test.py
`-- test.feature

1 directory, 3 files

Define Features

Write Tests
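As a rough idea of what the feature and its tests might look like for a fibonacci function, here is a standalone sketch. The scenario wording, the fib() implementation and the step function names are all assumptions made for illustration. With lettuce, each step function would be bound to its sentence using the @step decorator, and state would be shared through lettuce’s world object; plain functions and a stand-in world are used here so the sketch runs on its own.

```python
# A scenario as it might be written in features/fib.feature (hypothetical):
#
#   Feature: Compute fibonacci
#     Scenario: Fibonacci of 10
#       Given I have the number 10
#       When I compute its fibonacci
#       Then I see the number 55


def fib(n):
    """Return the n-th Fibonacci number, with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


class World(object):
    """Stand-in for lettuce's shared `world` object."""


world = World()


def given_i_have_the_number(number):
    # Step arguments arrive as strings captured from the scenario text
    world.number = int(number)


def when_i_compute_its_fibonacci():
    world.result = fib(world.number)


def then_i_see_the_number(expected):
    assert world.result == int(expected), "Got %d" % world.result


# Run the three steps in the order the scenario lists them
given_i_have_the_number("10")
when_i_compute_its_fibonacci()
then_i_see_the_number("55")
print(world.result)
```

The point of the Given/When/Then split is that the scenario reads as a business-level statement, while the Python functions underneath stay small and reusable across scenarios.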

Django setup on Ubuntu

 

Setting up a django website calls for (though not always):

  • django installation
  • configuring Apache
  • mod_wsgi
  • others like database servers, static file server etc

Now if you are developing a small-scale website, you may not want to go the Apache, mod_wsgi way. Django helps here by providing a development web server, so that you can get your website up and running rapidly.

This blog talks about setting up a Django website on Ubuntu 10.04:

Step 1: Get python-pip

buntu@ubuntu:~$ sudo apt-get install python-pip 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libtext-glob-perl libcompress-bzip2-perl libparams-util-perl libfile-chmod-perl libdata-compare-perl libfile-pushd-perl libfile-which-perl
  libcpan-inject-perl libfile-find-rule-perl libcpan-checksums-perl libnumber-compare-perl
Use 'apt-get autoremove' to remove them.
The following extra packages will be installed:
  python-setuptools
The following NEW packages will be installed:
  python-pip python-setuptools
0 upgraded, 2 newly installed, 0 to remove and 186 not upgraded.
Need to get 262kB of archives.
After this operation, 1,192kB of additional disk space will be used.
Do you want to continue [Y/n]? 
Get:1 http://us.archive.ubuntu.com/ubuntu/ lucid/main python-setuptools 0.6.10-4ubuntu1 [213kB]
Get:2 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/universe python-pip 0.3.1-1ubuntu2.1 [49.8kB]
Fetched 262kB in 5s (50.6kB/s)      
Selecting previously deselected package python-setuptools.
(Reading database ... 124327 files and directories currently installed.)
Unpacking python-setuptools (from .../python-setuptools_0.6.10-4ubuntu1_all.deb) ...
Selecting previously deselected package python-pip.
Unpacking python-pip (from .../python-pip_0.3.1-1ubuntu2.1_all.deb) ...
Processing triggers for man-db ...
Setting up python-setuptools (0.6.10-4ubuntu1) ...

Processing triggers for python-central ...
Setting up python-pip (0.3.1-1ubuntu2.1) ...

Step 2: Install django

buntu@ubuntu:~$ sudo pip install django
Downloading/unpacking django
  Downloading Django-1.4.1.tar.gz (7.7Mb): 7.7Mb downloaded
  Running setup.py egg_info for package django
Installing collected packages: django
  Running setup.py install for django
    changing mode of build/scripts-2.6/django-admin.py from 644 to 755
    changing mode of /usr/local/bin/django-admin.py to 755
Successfully installed django

Step 3: Check for django installation

buntu@ubuntu:~$ python -c "import django; print(django.get_version())"
1.4.1

Step 4: Create a project site

buntu@ubuntu:~$ django-admin.py startproject mysite
buntu@ubuntu:~$ tree mysite
mysite
|-- manage.py
`-- mysite
    |-- __init__.py
    |-- settings.py
    |-- urls.py
    `-- wsgi.py

1 directory, 5 files

Step 5: Start django development server

buntu@ubuntu:~$ cd mysite
buntu@ubuntu:~/mysite$ python manage.py runserver
Validating models...

0 errors found
Django version 1.4.1, using settings 'mysite.settings'
Development server is running at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

Step 6: Browse to http://127.0.0.1:8000/ home page

[09/Oct/2012 01:42:52] "GET / HTTP/1.1" 200 1957

 

 

Abstraction in Search

Another great event I attended this year.. PyCon India 2012, was better organized, had better talks, more audiences, job fair and more fun than ever before.. :) Not to forget the evening dinner for speakers ;) I loved every bit of it…. talking to experts, talking to Python enthusiasts, answering their Qs and wondering why I was not like the younger folks when I was younger? :P

Vishal and I delivered a talk on ‘Rapid development of website search in Python’.. We spoke about,

  • Why Search is imperative in web sites
  • How is the Schema defined and Analyzers chosen
  • How indexing, and searching works with appropriate flowcharts
  • How search can be easily integrated with your web application
  • What are the design and development considerations for implementing it

We also shared our observations on the facets of a good search solution. It should be:

  • Integral to the website development
  • Decoupled from the web framework used for website  development
  • Adaptable (scale and requirements of website)
  • And most importantly it should be rapidly developed and deployed

This talk provoked a new design concept of Abstraction in Search (never tried before, as far as we know) and contributed to the Python community at large…

Preface

We all understand that no single solution fits two different problems. The same applies to search engines as well. A search engine may have strong indexing and committing capabilities but a slower searching algorithm when compared with an equally feature-rich engine. Hence a search engine may be the best solution for one website but an utter misfit for another…

Problem

Now, developing search around one particular algorithm or engine, and then plugging it into every website you build, is no less than digging your own grave! Why the h**l would you assume that the one search solution you’ve developed for your large-scale website is suitable for small or medium-sized websites?

Solution

We propose development of customized search engines that are adaptable to the small/medium and large scaled & sized websites. Once you have the search engine implementations, develop an Abstraction Layer over these engines. Abstraction Layer would ensure:

  • Freedom to choose an engine based on applicability and adaptability to the website
  • Develop once and reuse many times
  • Call to a search engine can be decided at run time

The abstraction layer could be implemented using the well-known facade pattern!

Design

We propose a simple-to-understand SVC model (based on the MVC model). SVC stands for Search View Controller. In SVC, the Controller calls search.py with the appropriate search engine to find results for the keywords the user entered. search.py is an abstraction over the search engine implementations, which can adapt to small, medium and large websites. Which search solution gets called through the search.py abstraction is the website developer’s decision (as s/he understands the requirements of the website and the search solution that fits it). The selected search engine then generates the search results for the input query terms and passes them to the Controller via search.py. The Controller then applies the search results to the View (templates) and renders the results to the user.
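The role search.py plays in this design can be sketched as a small facade. The two engines below are deliberately naive, hypothetical stand-ins (they are not Whoosh or PyLucene), and every class and method name here is made up for illustration. The point is only that the controller talks to one Search class while the engine behind it is picked by name at run time.

```python
class SmallSiteEngine:
    """Hypothetical engine: a linear scan, fine for a handful of pages."""

    def __init__(self, pages):
        self._pages = pages  # maps page path -> page text

    def search(self, query):
        needle = query.lower()
        # Substring match against every page on each query
        return sorted(path for path, text in self._pages.items()
                      if needle in text.lower())


class LargeSiteEngine:
    """Hypothetical engine: builds an inverted index up front."""

    def __init__(self, pages):
        self._index = {}
        for path, text in pages.items():
            for word in text.lower().split():
                self._index.setdefault(word, set()).add(path)

    def search(self, query):
        # Whole-word lookup in the prebuilt index
        return sorted(self._index.get(query.lower(), set()))


ENGINES = {"small": SmallSiteEngine, "large": LargeSiteEngine}


class Search:
    """Facade: the controller uses this and never touches an engine directly."""

    def __init__(self, pages, engine="small"):
        # The engine is chosen by name, so the choice can be made at run time
        self._engine = ENGINES[engine](pages)

    def find(self, query):
        return self._engine.search(query)


pages = {
    "/tornado": "Tornado is a web server",
    "/whoosh": "Whoosh is a search engine",
}

results = {name: Search(pages, engine=name).find("tornado")
           for name in ("small", "large")}
print(results)
```

Swapping the engine means changing one argument, and the controller and templates stay untouched, which is the “develop once and reuse” property the abstraction layer is meant to provide.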

Prototype Implementation

We’ve developed a prototype for the idea discussed above (termed as fsMgr). fsMgr assumes that the webpages that need to be searched are already available (or scraped) in a tree structure.

search.py of fsMgr abstracts Whoosh and pyLucene search engines. By doing this, we demonstrate, how either of these engines can be leveraged for website search based on the website requirements.

We use Tornado Web Server of Python as Controller that provides us request handling capabilities so that we can export simple search and advanced search capabilities (such as highlighted search, didyoumean spell-checker and morelikethis document searcher) to the users.

Tornado’s template capabilities are used as Views in this prototype.

Code

Source code of this prototype implementation at fsMgr

SVC Architecture

Tornado – Whoosh – Highlighted Search

Such a common use case of a search engine. Have you noticed that Google highlights the keywords the user searches for? That is what we achieve in the example below.

Search results for highlighted search

 

 

Tornado – Whoosh – DidYouMean

Have you ever searched for a word in Google and got a response saying ‘Did You Mean’ because the word you typed was spelled incorrectly? Something like this? And do you want to implement this feature in your engine?

Well, the Whoosh search engine is capable of performing a didyoumean operation on the queries presented by the user. Didyoumean essentially presents suggestions to users for mis-typed or mis-spelled queries, based on the key terms present in the index. Whoosh currently works more as a typo checker or corrector, as it doesn’t handle phonetics well enough…

For corrections, Whoosh looks up correct words in:

  • Created Index
  • File with words list

With Whoosh, developers can define which Schema fields are used for the spell-checker. For instance, if you were to perform spell-check on contents, simply define the ‘content’ field in the Schema with spelling=True.

Here’s an example of Whoosh’s didyoumean capability with Tornado Web Server

Did You Mean input query form

Tornado Web Server handling spell-checker requests

 

In this example, if the user searches for the word ‘Torando’ he gets a suggestion for Tornado, and if he tries ‘piethon’ he gets Python.
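The idea of suggesting the closest term from the index can be tried out without Whoosh at all. The sketch below swaps in the standard library’s difflib for Whoosh’s spell-checker, so it only illustrates typo correction against a known word list and is not what Whoosh does internally. The index_terms list and the did_you_mean() helper are assumptions made for this example.

```python
import difflib

# Stand-in for the terms that would be present in the search index
index_terms = ["tornado", "python", "whoosh", "search", "server"]


def did_you_mean(word, terms, limit=1):
    """Return up to `limit` index terms that closely resemble `word`."""
    # cutoff=0.6 drops candidates that are too dissimilar to be useful
    return difflib.get_close_matches(word.lower(), terms, n=limit, cutoff=0.6)


suggestions = {query: did_you_mean(query, index_terms)
               for query in ("Torando", "piethon")}
print(suggestions)
```

Like the Whoosh behaviour described above, this works on character similarity only, so it catches typos but knows nothing about how a word sounds.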

Tornado – Whoosh – MoreLike and MoreLikeThis

Like other search engines, Whoosh provides more_like() and more_like_this() methods to find similar documents in the index. Typically, morelikethis doesn’t execute any special query to get the list of documents similar to the one specified; instead, it searches all other documents in the index relative to the content of the specified document. Here’s an example of Whoosh’s more_like() method integrated with Tornado.

User enters the document path and submits it to the index which then presents the similar morelike documents. User form code here

In the code below:

  • document_number(path=path) gets the document number of the specified document path in the index
  • more_like(docnum, ‘content’) method then finds documents *like* the specified document based on content
  • more_like_this(“content”, top=1) method searches the top 1 sub-hits

Tornado – Whoosh

 

Search functionality is often considered a “good to have” feature in website development, but search plays a crucial role in helping website visitors locate relevant information. Search capabilities can be built into a website easily with modules such as Lucene, Solr, ElasticSearch and Haystack, among others.

This blog discusses Whoosh and how it can be integrated with the Tornado web server.

Whoosh is a fast search engine developed by Matt Chaput that supports field based full text index search, storage, text analysis, posting formats and scoring algorithm. You can benefit from services like highlighted search, fuzzy search, document based search (more like this) and with spell checker (did you mean). Whoosh APIs are pythonic and are developed in pure python.. :)

Let’s take an example of blogger with the code snippet below

In the above code,

  • class Search provides searching capabilities with Whoosh
  • __init__ method accepts the indexdir (directory where the search index gets created) and searchstr (string that needs to be searched)
  • searcher() method first defines a document schema (this is how a blog would look), and creates an index in indexdir based on the schema. It then creates a writer object that is used to add blogs and commit them. Finally, search() method returns the search results, limited to a maximum of 50 results

  • On submitting the search word, GET request is sent to http://localhost:8888/search.
  • class Srch handles this request and in turn calls Search class that implements search functionality with Whoosh

When the user searches for tornado we get the output below, which shows that the word tornado was found in 1 document and that the search took about 0.0005 seconds

<1/1 Results for Term('content', u'tornado', boost=1.0) runtime=0.000482797622681>

 

Tornado – File Uploads

Quite often we need to provide a file upload mechanism on our website. Be it logs management or user profile management, support for file upload is a must. This blog describes how uploads can be achieved with the Tornado web server.

Example code:

In this code snippet:

  • When user browses to http://localhost:8888/, he is presented with file upload form (code below)
  • On browsing and selecting the appropriate file, the user clicks on upload
  • The file gets uploaded and the user gets a message with filename & the uploaded location

In the upload form, it’s important to note the usage of the attributes below for file uploads:

  • enctype="multipart/form-data"
  • input type="file"

As a side note, if you print the fileinfo variable, you will see a dictionary with the contents and metadata of the file being uploaded

fileinfo is {'body': 'This is a file upload test for Tornado!!\n', 'content_type': u'application/octet-stream', 'filename': u'fileuploadtest'}
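Given a fileinfo dictionary shaped like the one printed above, the saving step of an upload handler might look roughly like this. It is a standalone sketch and not the handler from this post: save_upload() and the temporary upload directory are assumptions, and no Tornado code is involved. Stripping the filename down to its last path component is a precaution added here, since the name comes from the client.

```python
import os
import tempfile


def save_upload(fileinfo, upload_dir):
    """Write an uploaded file's body into upload_dir and return its path."""
    # Keep only the final path component so a crafted filename
    # cannot point outside upload_dir
    filename = os.path.basename(fileinfo["filename"])
    if not filename:
        raise ValueError("upload has no usable filename")
    destination = os.path.join(upload_dir, filename)

    body = fileinfo["body"]
    if isinstance(body, str):
        # Accept text bodies too, and store them as UTF-8 bytes
        body = body.encode("utf-8")

    with open(destination, "wb") as handle:
        handle.write(body)
    return destination


# Same shape as the fileinfo dictionary shown above
fileinfo = {
    "body": b"This is a file upload test for Tornado!!\n",
    "content_type": "application/octet-stream",
    "filename": "fileuploadtest",
}

upload_dir = tempfile.mkdtemp()  # hypothetical upload location
saved_path = save_upload(fileinfo, upload_dir)
print(saved_path)
```

The returned path is what a handler could report back to the user along with the filename.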