wget -e robots=off
It is as simple as the above, and it is something to watch out for when using wget: you may think you have archived or downloaded content when you actually haven't, because a nofollow directive or a robots.txt rule stopped wget from fetching it.
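For example, a mirror run that ignores robots.txt might look like this (example.org is just a placeholder URL):

wget -m -p -k -e robots=off http://example.org/

Without -e robots=off, the very same command silently skips anything robots.txt disallows.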
There are a few ways of doing this, and they all basically involve Apache's reverse proxy ("ProxyPass") feature.
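As a minimal sketch of what the ProxyPass approach looks like (using the originalsite.com and newsite.com names from the example below, and assuming mod_proxy and mod_proxy_http are loaded), the new site's vhost would forward everything to the original:

<VirtualHost *:80>
    ServerName newsite.com
    # Pass every request through to the original site and fix up redirects on the way back
    ProxyPass        / http://originalsite.com/
    ProxyPassReverse / http://originalsite.com/
</VirtualHost>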
1.) Create a normal vhost and simply symlink the root directory of the site you want to mirror.
E.g. originalsite.com and newsite.com, where the original site's document root is:
/vhosts/originalsite.com/httpdocs
You would create the symlink like this:
ln -s /vhosts/originalsite.com/httpdocs vhosts/originalsite.com/........
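The end of that command is cut off above; presumably it points the new site's document root at the original's, along the lines of the following (the /vhosts/newsite.com/httpdocs target is an assumption):

ln -s /vhosts/originalsite.com/httpdocs /vhosts/newsite.com/httpdocs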
This is very handy if you're too busy to download or copy over whatever files you need.
The -D option specifies which domains are allowed. This is needed because I specified -H, which allows wget to follow links to foreign hosts; if you don't restrict the domains, you'll end up crawling the whole internet via ads and other external links, just like a search engine would.
-l 0 specifies unlimited depth, so wget recurses through as many levels as exist.
-e robots=off is important because robots.txt often says you can't view or crawl parts of a site, and wget obeys it unless told otherwise.
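Putting those options together, a full command might look like this (example.com and cdn.example.com are placeholder domains, not from the original):

wget -r -l 0 -H -Dexample.com,cdn.example.com -p -k -e robots=off http://example.com/

Here -r turns on recursion, -p also fetches page requisites such as images and CSS, and -k converts the links so the mirror can be browsed locally.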