Fading Coder

One Final Commit for the Last Sprint


Network Programming in Python

Tech | May 10

HyperText Transfer Protocol - HTTP

The HyperText Transfer Protocol is described in RFC 2616, a long and complex 176-page document with a lot of detail. If you are interested, feel free to read it all. Around page 36 of RFC 2616 you will find the syntax for the GET request. To request a document from a web server, we make a connection to the www.py4inf.com server on port 80, and then send a line of the form:

GET http://www.py4inf.com/code/romeo.txt HTTP/1.0

The second parameter is the web page we are requesting, and then we also send a blank line. The web server will respond with some header information about the document and a blank line followed by the document content.

The following program makes a connection to a web server, follows the rules of the HTTP protocol to request a document, and displays what the server sends back:

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
# send() requires bytes, so encode the request string;
# the blank line after the GET line ends the request
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\n\r\n'.encode())

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode())

mysock.close()

Output:

HTTP/1.1 200 OK
Date: Sat, 12 Dec 2015 14:22:51 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "e103c2f4-a7-526172f5b5d89"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fai
r sun and kill the envious moon
Who is already sick and pale with grief

Description:

First, the program makes a connection to port 80 on the server www.py4inf.com. Since our program is playing the role of a "web browser", the HTTP protocol requires that we send the GET command followed by a blank line. Once we send that blank line, we write a loop that receives data in 512-byte chunks from the socket and prints the data out until there is no more data to read (i.e., recv() returns zero bytes).

The output starts with headers that the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain). After the server sends us the headers, it adds a blank line to signal the end of the headers, and then sends the actual data of the file romeo.txt.
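To make the header/body split concrete, here is a small sketch that separates an HTTP response at the first blank line and parses the header lines into a dictionary. The sample response below is a shortened, hypothetical capture, not the full server output:

```python
# A shortened, hypothetical HTTP response for demonstration
response = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/plain\r\n"
            b"Content-Length: 167\r\n"
            b"\r\n"
            b"But soft what light through yonder window breaks\n")

# The first blank line (\r\n\r\n) separates the headers from the body
header_bytes, body = response.split(b"\r\n\r\n", 1)
header_lines = header_bytes.decode().split("\r\n")

status_line = header_lines[0]            # e.g. 'HTTP/1.1 200 OK'
headers = {}
for line in header_lines[1:]:
    name, _, value = line.partition(": ")
    headers[name] = value

print(status_line)
print(headers["Content-Type"])
```

This is the same splitting that the image-retrieval program below performs with picture.find().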

Retrieving an Image over HTTP

The following program accumulates the data in a string, trims off the headers, and then saves the image data to a file:

import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\r\n\r\n'.encode())

count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    # time.sleep(0.25)
    count += len(data)
    print(len(data), count)
    picture += data

mysock.close()

# Look for the blank line that ends the header
pos = picture.find(b"\r\n\r\n")
print('Header length:', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]

with open("stuff.jpg", "wb") as fhand:
    fhand.write(picture)

Output:

5120 5120
5120 10240
5120 15360
1920 17280
5120 22400
5120 27520
4160 31680
5120 36800
5120 41920
4160 46080
2880 48960
5120 54080
5120 59200
5120 64320
5120 69440
863 70303
Header length: 242
HTTP/1.1 200 OK
Date: Sat, 12 Dec 2015 15:32:56 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "b294001f-111a9-526172f5b7cc9"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg

Description:

You can see that for this URL, the Content-Type header indicates that the body of the document is an image (image/jpeg). As the program runs, you can see that we do not always get 5120 characters each time we call recv(). We get as many characters as have been transferred across the network to us by the web server at the moment we call recv(). In this run, some calls return fewer characters than requested (for example, 1920 or 2880); your results may vary depending on your network speed. Also note that the last call to recv() returns the final 863 bytes, which is the end of the stream, and the next call returns zero bytes, indicating that the server has closed its end of the socket and there is no more data forthcoming.

We can slow down our successive recv() calls by uncommenting the time.sleep(0.25) call. This way, we wait a quarter of a second after each call so that the server can "get ahead" of us and send more data before we call recv() again. With the delay, the program executes as follows:

1460 1460
5120 6580
5120 11700
...
5120 62900
5120 68020
2281 70301
Header length: 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:22:04 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg

Retrieving Web Pages with urllib

With urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve, and urllib handles all of the HTTP protocol and header details.

import urllib.request

fhand = urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    # Each line arrives as bytes, so decode it before printing
    print(line.decode().strip())

Output:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

We see only the contents of the file. The headers are still sent, but the urllib code consumes them and returns only the data to us.
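The consumed headers are not lost; they remain available on the response object. A small sketch (it needs network access to www.py4inf.com, so it guards against connection failures):

```python
import urllib.request

# The response object carries the headers that urllib consumed
try:
    fhand = urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
    print(fhand.getheader('Content-Type'))  # e.g. text/plain
    fhand.close()
except OSError as exc:
    print('Network unavailable:', exc)
```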

Here is an example that retrieves the data for romeo.txt and computes the frequency of each word:

import urllib.request

fhand = urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
word_count = {}

for line in fhand:
    # Decode each line of bytes before splitting it into words
    words = line.decode().split()
    for word in words:
        word_count[word] = word_count.get(word, 0) + 1

print(word_count)

Output:

{'and': 3, 'envious': 1, 'already': 1, 'fair': 1, 'is': 3, 'through': 1, 'pale': 1, 'yonder': 1, 'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1, 'window': 1, 'sick': 1, 'east': 1, 'breaks': 1, 'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1, 'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}
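The dictionary print order is not very useful; sorting the pairs by count shows the most frequent words first. A small sketch, using a subset of the counts from the run above:

```python
# A subset of the word counts from the run above
word_count = {'and': 3, 'is': 3, 'the': 3, 'sun': 2, 'envious': 1, 'fair': 1}

# Sort the (word, count) pairs from largest count to smallest
top = sorted(word_count.items(), key=lambda pair: pair[1], reverse=True)
for word, count in top[:4]:
    print(word, count)
```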

Parsing HTML and Scraping the Web

The following program uses regular expressions to extract all links from an HTML page:

import urllib.request
import re

url = input('Enter URL: ')
# Read the page and decode the bytes so the regex can search a string
html = urllib.request.urlopen(url).read().decode()
links = re.findall('href="(http://.+?)"', html)
for link in links:
    print(link)

Example Output:

Enter URL: http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm

Enter URL: http://www.py4inf.com/book.htm
http://amzn.to/1KkULF3
http://amzn.to/1KkULF3
http://amzn.to/1hLcoBy
http://amzn.to/1KkV42z
http://amzn.to/1fNOnbd
http://amzn.to/1N74xLt
http://do1.dr-chuck.com/py4inf/EN-us/book.pdf
http://do1.dr-chuck.com/py4inf/ES-es/book.pdf
http://www.xwmooc.net/python/
http://fanwscu.gitbooks.io/py4inf-zh-cn/
http://itunes.apple.com/us/book/python-for-informatics/id554638579?mt=13
http://www-personal.umich.edu/~csev/books/py4inf/ibooks//python_for_informatics.ibooks
http://www.py4inf.com/code
http://www.greenteapress.com/thinkpython/thinkCSpy/
http://allendowney.com/

Reading Binary Files Using urllib

Here is a simple way to download an image file:

import urllib.request

img = urllib.request.urlopen('http://www.py4inf.com/cover.jpg').read()
# Open the output file in binary mode since JPEG data is not text
with open('cover.jpg', 'wb') as fhand:
    fhand.write(img)

However, if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. To avoid running out of memory, we retrieve the data in chunks (or buffers) and then write each chunk to disk before retrieving the next chunk. This way, the program can read any size file without using up all of the memory in your computer.

import urllib.request

img = urllib.request.urlopen('http://www.py4inf.com/cover.jpg')
# Open in binary mode and copy the data one 100000-byte chunk at a time
with open('cover.jpg', 'wb') as fhand:
    size = 0
    while True:
        info = img.read(100000)
        if len(info) < 1:
            break
        size += len(info)
        fhand.write(info)
        print(size, 'characters copied.')

Using Web Services

There are two common formats that we use when exchanging data across the web. The "eXtensible Markup Language" (XML) has been in use for a very long time and is best suited for exchanging document-style data. When programs just want to exchange dictionaries, lists, or other internal information with each other, they use JavaScript Object Notation (JSON). We will look at both formats.

XML

XML looks very similar to HTML, but XML is more structured. Here is a sample XML document:

<person>
    <name>Chuck</name>
    <phone type="intl">
      +1 734 303 4456
    </phone>
    <email hide="yes"/>
</person>

Often it is helpful to think of an XML document as a tree structure, where there is a top-level tag (person), and other tags such as phone are drawn as children of their parent nodes.

Here is a simple application that parses some XML and extracts data elements:

import xml.etree.ElementTree as ET

data = '''
<person>
    <name>Chuck</name>
    <phone type="intl">
        +1 734 303 4456
    </phone>
    <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

Output:

Name: Chuck
Attr: yes

Calling fromstring converts the string representation of XML into a tree of XML nodes. Once the XML is in a tree, we have a series of methods we can call to extract portions of data from the XML.

The find function searches the XML tree and retrieves a node that matches the specified tag. Each node can have text, attributes (like hide), and child nodes. Each node can be the top of a tree of nodes.

Although the XML in this example is simple, there are many rules regarding valid XML; using a parser such as ElementTree lets us extract data without worrying about those rules.
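ElementTree also enforces those rules for us. Feeding it malformed XML (here, a hypothetical snippet with a mismatched closing tag) raises a ParseError rather than silently returning bad data:

```python
import xml.etree.ElementTree as ET

bad = '<person><name>Chuck</person>'   # <name> is never closed

try:
    tree = ET.fromstring(bad)
    error = None
except ET.ParseError as exc:
    tree = None
    error = exc

print('Parse error:', error)
```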

Often XML has multiple nodes, and we need to write a loop to process all of them:

import xml.etree.ElementTree as ET

data = '''
<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

stuff = ET.fromstring(data)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name:', item.find('name').text)
    print('Id:', item.find('id').text)
    print('Attribute:', item.get('x'))

Output:

User count: 2
Name: Chuck
Id: 001
Attribute: 2
Name: Brent
Id: 009
Attribute: 7

The findall method retrieves a Python list of subtrees that represent the user structures in the XML tree. Then we can write a for loop that looks at each user node and prints the name, id, and x attribute.
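Note that the path given to findall must name every level below the starting node. Since the user elements are children of users rather than of stuff itself, searching for 'user' directly finds nothing. A small sketch, repeating the data above in condensed form:

```python
import xml.etree.ElementTree as ET

data = '''
<stuff>
    <users>
        <user x="2"><id>001</id><name>Chuck</name></user>
        <user x="7"><id>009</id><name>Brent</name></user>
    </users>
</stuff>'''

stuff = ET.fromstring(data)

# findall matches relative paths, so each level must be named
print(len(stuff.findall('users/user')))   # 2
print(len(stuff.findall('user')))         # 0 -- user is not a direct child
```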
