Welcome to End Point’s blog

Ongoing observations by End Point people

Python string formatting and UTF-8 problems workaround

Recently I worked on a program that required me to filter hundreds of lines of blog titles. Along the way I stumbled upon a few interesting problems, some of which are outlined in the following paragraphs.

Non-Roman characters issue

During testing I missed one title. Investigating why, I found that it was simply because the title contained non-Roman characters.

Here is the code snippet that I was previously using:

for e in results:
    simple_author = e['author'].split('(')[1][:-1].strip()
    if freqs.get(simple_author, 0) < 1:
        print parse(e['published']).strftime("%Y-%m-%d"), "--", simple_author, "--", e['title']

And here is the fixed version:

for e in results:
    simple_author = e['author'].split('(')[1][:-1].strip().encode('UTF-8')
    if freqs.get(simple_author, 0) < 1:
        print parse(e['published']).strftime("%Y-%m-%d"), "--", simple_author, "--", e['title'].encode('UTF-8')

To fix the issue I faced, I added .encode('UTF-8') so that the strings are encoded as UTF-8 before being printed. Here is an example title that would otherwise have been left out:

2014-11-18 -- Unknown -- Novo website do Liquid Galaxy em Português!

Python 2.7 uses ASCII as its default encoding, but in our case that wasn't sufficient for scraping web content, which often contains non-ASCII characters. To be more precise, this program fetches an RSS feed in XML format, and the feed contains UTF-8 encoded text. When my initial code tried to output those characters with ASCII as the default encoding, Python could not encode them and raised an error.

Here is an example of the error raised when a non-Roman character was encoded with the ASCII codec:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 40: ordinal not in range(128)
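
The failure can be reproduced without the feed. The following is a minimal sketch, not the original program: it takes the example title from above as a Unicode string and encodes it explicitly, which is the step Python 2.7 performs implicitly when it prints a unicode object to an ASCII-only output. The variable names are illustrative, and the u prefix is there so the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; these names are not from the original program.
title = u"Novo website do Liquid Galaxy em Português!"

# Encoding with the ASCII codec fails on the first non-ASCII character.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    bad_pos = err.start                  # index of the offending character
    bad_char = title[err.start:err.end]  # the character that could not be encoded

# Encoding with UTF-8 succeeds and yields a byte string.
encoded_title = title.encode("UTF-8")

print(ascii_failed)
print(bad_pos)
```

The reported index matches the "position 40" in the error message above, which points at the "ê" in "Português".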

Right and Left text alignment


In addition to the error previously mentioned, I also had the chance to dig into several ways of formatting output.
The following format is the one I used as the initial output format:

print("Name".ljust(30)+"Age".rjust(30))
Name                                                     Age

Using the "ljust" and "rjust" methods


I want to improve the readability of the example above by left-justifying "Name" in a 30-character field and right-justifying "Age" in another 30-character field.

Let's try with the '*' fill character. The syntax is str.ljust(width[, fillchar])

print("Name".ljust(30,'*')+"Age".rjust(30))
Name**************************                           Age

And now let's add a fill character to .rjust as well:

print("Name".ljust(30,'*')+"Age".rjust(30,'#'))
Name**************************###########################Age

ljust pads "Name" (four characters) on the right until the string is 30 characters long, and rjust pads "Age" (three characters) on the left until it is also 30 characters long. Concatenated, they give us the desired 60-character output.
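
To reuse this layout for more than one line, the two calls can be wrapped in a small helper. This is only a sketch: the function name two_columns, its parameters, and the sample row are made up for illustration and do not come from the original program.

```python
def two_columns(left, right, width=30, left_fill=" ", right_fill=" "):
    """Left-justify `left` and right-justify `right`, each in `width` characters."""
    return str(left).ljust(width, left_fill) + str(right).rjust(width, right_fill)

header = two_columns("Name", "Age")
row = two_columns("Alice", 42)  # sample data, invented for this example
starred = two_columns("Name", "Age", left_fill="*", right_fill="#")

print(header)
print(row)
print(starred)
print(len(header))
```

Note that ljust and rjust never truncate: a value longer than the width is returned unchanged, so an overly long name would push the second column to the right.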

Using the "format" method


Alternatively, the same alignment can be achieved with the string "format" method:

print("{!s:<{fill}}{!s:>{fill}}".format("Name", "Age",fill=30))
Name                                                     Age
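
In that format string the inner {fill} is a nested replacement field: it is substituted first, so with fill=30 the spec becomes <30. Despite its name, the fill argument here supplies the field width, not the fill character. A short check of that reading, with variable names chosen for illustration:

```python
nested = "{!s:<{fill}}".format("Name", fill=30)  # width supplied as an argument
literal = "{!s:<30}".format("Name")              # width written into the spec
justified = "Name".ljust(30)                     # the str method used earlier

print(nested == literal == justified)
print(len(nested))
```

Passing the width as an argument keeps the format string unchanged when the column size needs to vary at run time.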

And with the same progression, it is also possible to do something like:

print("{!s:*<{fill}}{!s:>{fill}}".format("Name", "Age",fill=30))
Name**************************                           Age
print("{!s:*<{fill}}{!s:#>{fill}}".format("Name", "Age",fill=30))
Name**************************###########################Age

"format" can also center text. To place the string in the middle of the fill characters, simply use the ^ (caret) alignment character:
print("{!s:*^{fill}}{!s:#^{fill}}".format("Age","Name",fill=30))
*************Age**************#############Name#############
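
The str type has a matching center method, so each alignment character in the format spec has a str method counterpart. The comparison below is a small sketch to confirm that for the examples used in this post; the variable names are illustrative.

```python
width = 30
pairs = [
    # (result from format, result from the str method)
    ("{!s:*<{w}}".format("Name", w=width), "Name".ljust(width, "*")),
    ("{!s:#>{w}}".format("Age", w=width), "Age".rjust(width, "#")),
    ("{!s:*^{w}}".format("Age", w=width), "Age".center(width, "*")),
    ("{!s:#^{w}}".format("Name", w=width), "Name".center(width, "#")),
]

for formatted, method_result in pairs:
    print(formatted == method_result)
```

One caveat: when the padding cannot be split evenly, center and ^ can put the extra fill character on different sides for some widths, so it is worth checking the output if exact agreement between the two matters.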

For more detail, see Python's documentation on Unicode:
https://docs.python.org/2/howto/unicode.html

And for the "format" method, see:
https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s13.html

2 comments:

Lukáš said...

This is not the correct solution. Instead of placing .encode('utf-8') everywhere, you should be working with unicode strings in the first place. That means decoding the input to your program with .decode(your_encoding) and then you would not have any problems with non-ascii characters as they would be represented by their unicode markers.

See this: http://nedbatchelder.com/text/unipain.html

Muhammad Najmi Ahmad Zabidi said...

Thanks for your observation.

I also found out that the .encode('UTF-8') could be omitted when our locale environment is already in UTF-8 (say, our Linux shell with LC_ALL=en_US.UTF-8).

The error that happened previously can be reproduced if the environment is set to LC_ALL=C.