A way to remove special characters from a text file

Hi everyone! I have been busy for the last months with various duties related to my university classes and research. But now I have separated a little time to write another post! Soon I will bring more contents related to my university environment.

Last month, the campus party happened in my city. So my friends and I decided to participate in two hackathons that happened simultaneously during the event just for fun, which later showed to be a great challenge. The first hackathon involved data science directed to the Brazilian health, and the second one was intended to accelerate the national overall judgment process. In the picture below I am presenting my groups work related to the data science hackathon, the TV shows a correlation matrix made from various databases data.

word

Jumping now to this post purpose!

While trying to develop a solution for one of the hackathons, we had to process the text of pdf files.  To do so, we used the PDFMiner package. The next line illustrates its usage:

pdf2txt.py -o 'output file' 'pdf file'

The application seemed to extract the pdf text pretty well, with just one little peculiarity: There were special/control characters (“Ctrl+l”) inserted whenever there was a new page at the original pdf. The picture below illustrates the character with the VIM text editor:

word2

This character got in our way to process the file for our final purpose. At first, we tried to just remove it with ordinary python functions and with different inputs representing the character, without success at the beginning. So we found a pretty interesting solution!


text_file = open('text.txt','r')
text_file = text_file.read()
text_file = repr(text_file)
text_file = text_file.replace("\\x0c","")
text_file = literal_eval(text_file)

The trick consists of using the “repr()” function which returns the completely “raw” string. From there, the special character can be easily identified an removed. In the end, just return the string to the original form with the “literal_eval()” function! The relation of “\x0c” and “Ctrl+l” was found by analyzing the raw text file with the “repr()” function.

I believe this approach can be used to similarly solve other issues involving different characters and programming languages.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s