Archiving web sites

3 Oct 2002

The canonical tool for archiving web sites is wget ("the same thing
the internet archive uses").

It's a command-line tool, which means you might do a little more work up
front, but you can save the command as a batch file and run it automatically.
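
Here is a rough sketch of the "save it as a batch file" idea, assuming a Unix-like shell rather than a Windows .bat file. The function name fetch_snapshot and the dated directory layout are made up for illustration; -P is wget's option for choosing the directory to save into.

```shell
#!/bin/sh
# Hypothetical helper: fetch_snapshot URL
# Saves URL under a directory named for today's date (archive-YYYY-MM-DD),
# so that repeated runs do not overwrite earlier copies.
fetch_snapshot() {
    wget -P "archive-$(date +%Y-%m-%d)" "$1"
}
```

Add a line such as `fetch_snapshot http://fly.cc.fer.hr/` at the end of the script, and it can be run by hand or from a scheduler.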

The site for it all is at
http://wget.sunsite.dk/

with documentation at
http://www.gnu.org/manual/wget/

In particular, it has a feature known as "recursive retrieval", in which it
follows links from one page to the next.
A few options are of particular use here:
* -r turns on recursive retrieval
* By default, a recursive retrieval stays on the one host you started from
* If you give it
        -Ddomain
  it follows links only into that domain (-D does not by itself let wget
  leave the starting host; that takes -H as well).
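
As a sketch of how these options might be combined: per the wget manual, -D only limits which domains are followed and does not turn on host spanning, so it is paired with -H here. The function name fetch_within_domain and the example.org domain are placeholders, not anything from wget itself.

```shell
# Hypothetical helper: fetch_within_domain DOMAIN URL
#   -r  follow links recursively
#   -H  allow links that lead to other hosts...
#   -D  ...but only hosts that fall within DOMAIN
fetch_within_domain() {
    wget -r -H -D "$1" "$2"
}
```

For example, `fetch_within_domain example.org http://www.example.org/` would start at www.example.org and could also follow links to other hosts under example.org.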

So, let's say you just want to grab a single web page.
        wget http://fly.cc.fer.hr/
does it.

Now, let's say you've typed up a little file with the list of all URLs you
want to get, one URL per line.
        wget -i filelist
does it.
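
A minimal sketch of what such a list might look like, written out from the shell. The two URLs are just the ones used elsewhere in this note, and the name filelist matches the example; the file is plain text with one URL per line.

```shell
# Write the list of URLs to fetch, one per line.
cat > filelist <<'EOF'
http://fly.cc.fer.hr/
http://www.aoir.org
EOF
```

With that file in place, `wget -i filelist` fetches each URL in turn.
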
Let's say you want the top three levels below http://www.aoir.org (that is,
following links at most three levels deep).
        wget -r -l3 http://www.aoir.org
does it.
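
If the point is a copy you can browse offline, a variation along these lines may be worth trying. The -k and -p options are my addition, not part of the example above: -k (--convert-links) rewrites links in the saved pages to point at the local copies, and -p (--page-requisites) also fetches the images and stylesheets the pages need. The function name archive_levels is made up.

```shell
# Hypothetical helper: archive_levels DEPTH URL
#   -r      follow links recursively
#   -l N    stop after N levels of links
#   -k      convert links in saved pages for local viewing
#   -p      also download images, stylesheets, etc. needed to show each page
archive_levels() {
    wget -r -l "$1" -k -p "$2"
}
```

Calling `archive_levels 3 http://www.aoir.org` would be the counterpart of the three-level example.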

Danyel Fisher
