Recent Changes - Search:

Cookbook

PmWiki

pmwiki.org

ConvertTABLE

Summary: How to convert HTML pages with tables full of data into advanced tables
Version: 17 September 2005
Prerequisites:
Status:
Maintainer: Brooks Kelley
Categories: Editing, Administration

Question

How can I make it easier to convert HTML pages with tables full of data into advanced tables?

Answer

I had this problem with AzRepeaters.Net. I needed to convert about 20 pages filled with tables of data into a PmWiki's advanced table markup

Currently, my solution is a Linux one using the Bash command line and not a PmWiki cookbook recipe.

So, I ran this command line on a bash console on the HTML pages I needed to convert....

linuxmachine> cat filecontainingtabledata.html | sed '/^$/d' |
sed 's/[ \t]*$//' | tr -d [:cntrl:] | tr -s [:blank:] |
sed 's/</\n</g' | sed 's#^<[Tt][Rr].*>#(:cellnr:)#g' |
sed 's#^<[Tt][Dd].*>#(:cell:)#g' |
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr -d [:cntrl:] |
sed 's/(:cell/\n(:cell/g' | grep "^(:cell" > data_in_pmwiki_markup.txt

Let me explain what is going on with this command line.

First, be very careful with the syntax. The symbol you see above " | " is the key on your keyboard just about the enter key. It creates a pipe that streams data from one command to another until you get a final output. In other words, it is a necessary part of the command string.

I start the stream by using cat filecontainingtabledata.html. I am doing this to illustrate that it is streaming the contents of an HTML page into a series of commands.

After creating the data stream, I then pipe it into sed to delete some empty lines and pipe into sed again to delete empty trailing spaces ( sed '/^$/d' | sed 's/[ \t]*$//' ).

Then it is piped into tr where I delete all the newlines and carriage returns ( tr -d [:cntrl:] ). After that, it is piped into tr again to squeeze out repeating blanks ( tr -s [:blank:] ).

Then I pipe it into sed to get every < to begin a new line. This was the part that Patrick thought was clever. Anyway, it is done with sed 's/</\n</g'.

The reason it might be considered clever is that sed works on one line at a time. Even though a lot of data is being shipped thru the pipe, it still is parsed one line at a time.

This makes it easier to do line be line editing with a streaming editor like sed. And, as you will see later, I use it to do little but important tricks as the data streams thru each pipe.

Then I pipe it again into sed and convert every <tr> into (:cellnr:) and pipe it into sed again and covert every <td> into (:cell:) with sed 's#^<[Tt][Rr].*>#(:cellnr:)#g'=] and then sed 's#^<[Tt][Dd].*>#(:cell:)#g' .

O.K., now that is done, we can pipe it again into sed to remove all the HTML tags with sed -e :a -e 's/<[^>]*>//g;/</N;//ba'.

Then, I want to run everything together as one line. You may see why in a moment. I continue piping and use tr -d [:cntrl:] to do this step.

Now that everything is together. I can pipe it into the next step and set up each occurence of (:cell to start a new line. I do that with sed 's/(:cell/\n(:cell/g'.

Now, I finally pipe it for the last time where I just grep for for every line that starts with (:cell and use > to redirect the output into a file called data_in_pmwiki_markup.txt.

I then edited it in a text editor. I know that Patrick talked about making this into a another PmWiki recipe since most of the cookbook is that. But this will give you a start until that is done.

Neat Little Add On to make it work better

By the way, in case you do run this script, you will find that you create an extra (:cell:) just after the (:cellnr:). The way to get rid of that extra (:cell:) is to add this to the script by piping the stream into the commands tr -d [:cntrl:] | sed 's/(:cellnr:) (:cell:)/(:cellnr:)/g' | sed 's/(:cell/\n(:cell/g'.

You will have to adjust the spacing of the blanks in sed 's/(:cellnr:) (:cell:)/(:cellnr:)/g' to get it to delete the extra (:cell:).

Caveat and work around!

My script does remove HTML tags that you might want to include in the final data. I had that problem too. I adjusted the script with a sed command that took advantage of the fact that every tag has its own line.

This means you can change tags like <A HREF=http://www.somedomain.com to [[http://www.somedomain.com. Then on the next to final command, you can change the final bit of the tag left over of > to ]] .

I had to do this also because AzRepeater.Net has links showing where repeaters are on a map.

Notes and Comments

''The newest the latest. And please don't forget to end by date and name. Thanks)

Versions

date of publication : 2005-09-02 : name of the cookbook - version 00007

  • This recipe was last tested on PmWiki version:
  • This recipe requires at least PmWiki version: and (any other recipes)

See Also

Contributors

Brooks Kelley 17 September 2005


Comments

I had this same issue, but was using WinXP and SED for Windows. What's worse is, I'd only used SED once before and that was a few years ago, so forgive me if this isn't very elegant. Below is a .BAT file (you can also use .CMD as the extension). It expects a single parameter, a filename to process. The commands are a combination of SED and some DOS batch commands, so you must have SED loaded and running on the Windows path before this .BAT file will work.

The other caveat is that this process expects a "DOS CSV file" as saved by Excel. After a bit of flogging, I figured out that it was much simpler to start with a very simple format and then use some fairly simple SED commands to format. So, you can use Excel to open an HTML file, then isolate the data you want, and make sure you delete everything else on the spreadsheet you don't want, then do the "Save As..." and save the data as a DOS CSV file. This creates a very simple comma-delimited file.

At that point, just run the ConvertCSVtoPMWikiTABLE.bat program with the file name as the sole parameter. When it finishes, Notepad will open with the converted results, ready to be cut-n-pasted into PmWiki.

 if %1'==' goto err1
 goto start

 :err1
 echo off
 cls
 echo ---------------------------
 echo ConvertCSVtoPMWikiTABLE.BAT
 echo ---------------------------
 echo '
 echo You must input the name of the CSV file as the one and only parameter, e.g.
 echo '
 echo '    ConvertCSVtoPMWikiTABLE ips.csv
 echo '
 echo This process assumes you have imported a table into Excel, then exported
 echo the rows/columns that you want as a MS-DOS CSV file.
 echo '
 echo '
 echo ' written by Robert Laird, 12/20/2005
 pause
 goto end


 :start

 rem -----3 steps to this SED pipe...
 rem -----  Step 1: replace all comma's with a comma plus a HTML hard space
 rem -----  Step 2: append (:cellnr:) to the beginning of each line
 rem -----  Step 3: replace all comma's with (:cell:)
 rem -----
 sed "s/,/&\&nbsp;/g"  %1 | sed "s/^/(:cellnr:)/" | sed s/,/\n(:cell:)/g > temp1.csv 

 rem -----Add the table header
 echo (:table border=1 cellpadding=0:) > temp1hdr.txt
 copy temp1hdr.txt+temp1.csv temp2.csv

 rem -----Add the table end
 echo (:tableend:) > temp1ftr.txt
 copy temp2.csv+temp1ftr.txt temp3.csv

 rem -----Clean up
 del temp1.csv
 del temp2.csv

 rem -----Display the final CSV file for cut-n-paste into PmWiki
 notepad temp3.csv

 :end
Edit - History - Print - Recent Changes - Search
Page last modified on March 22, 2007, at 08:56 PM