[Air-L] Fwd: Re: Tool to convert Website to PDF

7 Sep 2014

      ---------- Forwarded message ----------
From: "Elijah Wright" <elijah.wright@gmail.com>
Date: Sep 7, 2014 1:57 PM
Subject: Re: [Air-L] Tool to convert Website to PDF
To: "Leonie Tanczer" <ltanczer01@qub.ac.uk>
Cc:

Something like this:

https://sites.google.com/site/torisugari/commandlineprint2

Having a browser engine print to file from the CLI is only one way to attack
this.  My first inclination was to suggest piping the output from the lynx
text browser through Ghostscript.  If you don't care about any of the
visual affordances of technology on the web, that would be a heck of a lot
quicker. (E.g., do you really just want the words?)
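
As a rough illustration of the "just the words" route, here is a minimal
Python sketch that strips markup from a page that has already been
downloaded. It is a stand-in for what a text browser dump would give you,
not a reproduction of lynx, and it uses only the standard library. The
names TextOnly and html_to_text are made up for this example. Turning the
resulting text into a PDF (the Ghostscript half of the idea) is not shown.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped elements,
        # with runs of whitespace collapsed to single spaces.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def html_to_text(html):
    """Return the words of an HTML document as one flat string (layout is lost)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling html_to_text() on the contents of each saved HTML file gives one
plain string per page, which can then be written out as a text file or fed
to whatever makes the PDF.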

You will likely want to use wget or similar to spider the site, then
extract the list of URLs, then in a second pass use a heavier renderer
(big fat Firefox...) to convert to PDF.
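
The two passes could be sketched in Python roughly as follows. Several
things here are assumptions rather than anything tested against your site:
that the spider run wrote a log file with the visited URLs in it, that a
plain regular expression is good enough to pull them back out, and that
wkhtmltopdf (swapped in here for Firefox because it takes a URL and an
output file directly on the command line) is installed. The function names
are invented for the example.

```python
import hashlib
import os
import re
import subprocess
from urllib.parse import urlsplit

# Assumption: any http(s) URL appearing in the spider log is a candidate page.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(log_text, host):
    """First pass output -> unique URLs on `host`, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(log_text):
        url = match.group(0)
        try:
            hostname = urlsplit(url).hostname
        except ValueError:
            continue  # skip anything that does not parse as a URL
        if hostname == host and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def pdf_name(url):
    """Derive a short, filesystem-safe PDF file name from a URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return digest + ".pdf"


def render_command(url, out_dir="pdf"):
    """Build the renderer command line for one page (wkhtmltopdf URL OUTPUT)."""
    return ["wkhtmltopdf", url, os.path.join(out_dir, pdf_name(url))]


def render_all(urls, out_dir="pdf"):
    """Second pass: run the renderer once per URL and return the URLs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        result = subprocess.run(render_command(url, out_dir))
        if result.returncode != 0:
            failed.append(url)
    return failed
```

A spider log could come from something like
"wget --spider --recursive --no-verbose --output-file=spider.log http://example.org/"
(example.org is a placeholder). Read that file, pass its text to
extract_urls(), and hand the result to render_all(). Keeping the list of
failed URLs lets you retry only those instead of starting the whole run
again.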

It's going to take a very long time if you really have 10k+ pages.

You might consider truncating the scope of your research to a narrower
subset - or otherwise expect to become quite practiced at solving this
style of problem.

Asking the MAXQDA folks to enhance their software would be another way to
proceed - it's not just you who runs into this perennial problem.  :-)

--e
On Sep 7, 2014 10:56 AM, "Leonie Tanczer" <ltanczer01@qub.ac.uk> wrote:
...
Dear All,
I am currently looking for software to extract the whole content of a
website and automatically convert each page of that website to a PDF.
I am aware of Acrobat XI Pro. However, after multiple attempts I
encountered the problem that it is limited to 10,000 levels (even when
told to extract the whole site) and the programme crashes before
finishing the job. As I am working with governmental websites with huge
amounts of content, this is not sufficient.
I am also aware of GNU Wget, yet it only saves the pages in HTML format.
As I would like to analyse the content in a Qualitative Data Analysis
Software, specifically MAXQDA, which does not allow the import of HTML
data, I am struggling here as well.
I was wondering if anyone has ever conducted research using a similar
technique before and if you are aware of software which could support my
data collection/extraction process.
Any advice would be greatly appreciated!
Thank you,
Leonie
___________________________________________
Leonie Maria Tanczer
PhD Candidate
School of Politics, International Studies & Philosophy
Queen's University Belfast
Twitter: @leotanczt
http://bit.ly/1d7O7kj
_______________________________________________
The Air-L@listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at:
http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers:
http://www.aoir.org/