Using wget to recursively fetch an image directory
This might be irrelevant but since there are lots of geeks here, I will just try my luck.
What I am trying to do is simple: use wget to fetch a directory on a website, along with its subdirectories. For example, headline.nycweb.io is a Joomla website, and under its document root there is an "images" directory containing lots of images. I want to use wget to fetch the whole "images" directory and its contents to my own server.
I have read this SO post: https://stackoverflow.com/questions/273743/using-wget-to-recursively-fetch-a-directory-with-arbitrary-files-in-it/273776#273776
but when I tried
wget --recursive --no-parent -e robots=off http://headline.nycweb.io/images/
I always get an index.html file. So what did I do wrong? And is what I am trying to do possible at all?
By the way, I have total control over both the source website and the destination server.
I have never tried to do this with wget before, but I thought I'd take a look to try and get the ball rolling.
I did a little surfing, and for a second I thought you might want to try adding --reject "index.html*" to your wget command before the download URL, but on further review it looks like this would just exclude index.html from the other files that are meant to be downloaded.
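For reference, here is a minimal sketch of what the command with --reject added might look like. The function name fetch_dir and the WGET override variable are made up for illustration; the flags themselves are standard wget options. As far as I know, in recursive mode wget still downloads matching index pages to extract links and then deletes them, so this only tidies the result and would not by itself fix a missing listing.

```shell
#!/bin/sh
# Hypothetical wrapper around the command from the question.
# WGET can be overridden (for example WGET=echo) to preview the
# arguments without downloading anything.
WGET="${WGET:-wget}"

fetch_dir() {
    # $1: directory URL; keep the trailing slash so --no-parent
    # treats it as a directory rather than a file.
    "$WGET" --recursive --no-parent -e robots=off \
        --reject "index.html*" \
        "$1"
}
```

Calling fetch_dir http://headline.nycweb.io/images/ would then run the same command as in the question, plus the reject pattern.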
Maybe it's something with Apache that's preventing access to the files? I'm getting 403 Forbidden when trying to access, for instance, headline.nycweb.io/images/Demo, but not when I go directly to a file such as headline.nycweb.io/images/Demo/blog/business9.jpg:
$ curl -ILl headline.nycweb.io/images/Demo
HTTP/1.1 301 Moved Permanently
Date: Wed, 03 Jul 2019 19:51:27 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: http://headline.nycweb.io/images/Demo/
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 403 Forbidden
Date: Wed, 03 Jul 2019 19:51:27 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Type: text/html; charset=iso-8859-1

$ curl -ILl headline.nycweb.io/images/Demo/blog/business9.jpg
HTTP/1.1 200 OK
Date: Wed, 03 Jul 2019 19:52:15 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Thu, 29 Sep 2016 12:45:30 GMT
ETag: "520e-53da4da4d8e80"
Accept-Ranges: bytes
Content-Length: 21006
Content-Type: image/jpeg
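Could wget be bailing when it tries to read the subdirectories under images? It might be worth looking at the permissions on those directories, and possibly at whether Apache directory listings are enabled for them, since wget can only follow links it finds in a page the server returns.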
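If it helps, the directory-versus-file comparison above can be scripted. This is only a rough sketch: the function names http_status and explain_status and the hint wording are my own, and it assumes curl is installed.

```shell
#!/bin/sh
# Print only the final HTTP status code for a URL, following redirects.
http_status() {
    curl -s -o /dev/null -L -w '%{http_code}' "$1"
}

# Map a status code to a rough hint (the wording is illustrative only).
explain_status() {
    case "$1" in
        200) echo "ok" ;;
        403) echo "forbidden: check directory permissions and listings" ;;
        404) echo "not found" ;;
        *)   echo "other" ;;
    esac
}
```

For example, explain_status "$(http_status http://headline.nycweb.io/images/Demo/)" could be compared against the same call for a single image URL to see whether only the directories are being refused.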