A Quick PDF-Downloading Script done in PHP
I love the Thirty Day Challenge, but I'm not a big fan of the way Ed Dale waffles. I'm also not a complainer, and I don't expect him to change himself for me. (He's a good bloke, after all.) So instead, I wrote a really quick PHP script that downloads all the PDF transcripts from the Thirty Day Challenge. I used these last year and, as a techie who already understands most of the concepts, they were all I needed. So here's the code for you to enjoy as well!
<?php
// make sure our download doesn't time out or get interrupted by closing the browser
set_time_limit(0);
ignore_user_abort(true);
// destination to download to
$file_dir = "sites/default/files/30dc";
// create the destination directory (and any missing parent directories) if it doesn't exist
if (!is_dir($file_dir)) mkdir($file_dir, 0777, true);
// go through each day of training (1-31)
for ($i = 1; $i <= 31; $i++) {
    // download the HTML contents of the training page for that day
    if ($page = file_get_contents("http://www.thirtydaychallenge.com/training/2009day".sprintf("%02d", $i).".php")) {
        // provide some feedback on where we are
        echo "<b>Day $i:</b><br />";
        // flush output to browser - see php.net/flush
        flush();
        // directory to download the current day's PDFs to
        $daydir = $file_dir."/day$i";
        // create the directory if it doesn't exist
        if (!is_dir($daydir)) mkdir($daydir);
        // grab all the URLs to the PDFs (regular expressions are awesome!)
        // the dots are escaped so they only match a literal "."
        preg_match_all('~(http://media\.thirtydaychallenge\.com\.s3\.amazonaws\.com/training09/([0-9A-Za-z_]+\.pdf))~', $page, $matches);
        // go through each url we grabbed above
        foreach ($matches[1] as $key => $url) {
            // local path this PDF will be stored at
            $target = "{$daydir}/{$matches[2][$key]}";
            // check if the file already exists (no use in re-downloading PDFs we have)
            if (!file_exists($target)) {
                // provide some feedback on where we are
                echo "Downloading {$matches[2][$key]}.<br />";
                // flush output to browser
                flush();
                // download the pdf and store it locally (skip it if the download fails)
                if (($pdf = file_get_contents($url)) !== false) file_put_contents($target, $pdf);
            }
        }
    }
}
?>
I've added plenty of comments so you can see what it all does, but feel free to post questions in the comments if you don't understand part of the code.
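One detail the comments skip over is how the day number ends up in the page URL. The script pads it to two digits with sprintf("%02d", ...), so day 1 becomes "01". Here's a minimal sketch of just that step; training_url() is a hypothetical helper I've made up for illustration, it isn't part of the script above.

```php
<?php
// Hypothetical helper (not in the original script): builds the training-page
// URL for a given day, zero-padding the day number to two digits.
function training_url($day) {
    return "http://www.thirtydaychallenge.com/training/2009day" . sprintf("%02d", $day) . ".php";
}

echo training_url(1), "\n";   // http://www.thirtydaychallenge.com/training/2009day01.php
echo training_url(12), "\n";  // http://www.thirtydaychallenge.com/training/2009day12.php
```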
The magic is here:
<?php
preg_match_all('~(http://media\.thirtydaychallenge\.com\.s3\.amazonaws\.com/training09/([0-9A-Za-z_]+\.pdf))~', $page, $matches);
?>
This is a regular expression (I'm using tildes (~) as delimiters to avoid having to escape the forward slashes) that grabs all the PDF URLs and puts them neatly into an array we can use later on. The first set of parentheses captures the full URL of the PDF and the second set captures just the filename, so I can download the file using file_get_contents() and then write it under its original filename using file_put_contents().
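To see what ends up in $matches, here's a small sketch that runs the pattern against a made-up HTML fragment instead of a real training page. The two PDF filenames are invented for illustration, and I've escaped the dots in the pattern so they only match literal dots.

```php
<?php
// Made-up HTML standing in for one day's training page (filenames are invented)
$page = '<a href="http://media.thirtydaychallenge.com.s3.amazonaws.com/training09/Day01_Intro.pdf">Intro</a>'
      . '<a href="http://media.thirtydaychallenge.com.s3.amazonaws.com/training09/Day01_Notes.pdf">Notes</a>';

preg_match_all('~(http://media\.thirtydaychallenge\.com\.s3\.amazonaws\.com/training09/([0-9A-Za-z_]+\.pdf))~', $page, $matches);

// $matches[0] holds the full matches, $matches[1] the first group (the whole URL)
// and $matches[2] the second group (just the filename); the keys line up across all three
foreach ($matches[1] as $key => $url) {
    echo $matches[2][$key], " <= ", $url, "\n";
}
```

Because the keys line up, the script can loop over $matches[1] and use the same key to look up the filename in $matches[2].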
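This particular regular expression only allows uppercase and lowercase letters, numbers and underscores in the filename, so a PDF whose name contains anything else (a hyphen or a space, say) won't be picked up. To learn more about PCRE (Perl-compatible regular expressions) in PHP, go to: http://www.php.net/manual/en/book.pcre.php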
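If you want to check what that character class accepts, here's a quick sketch that tests a few filenames on their own. The filenames are made up, and I've anchored the pattern with ^ and $ so it has to match the whole name.

```php
<?php
// Same character class as in the script, anchored so the whole filename must match
$pattern = '~^[0-9A-Za-z_]+\.pdf$~';

// Made-up filenames, for illustration only
var_dump(preg_match($pattern, 'Day_01_Transcript.pdf')); // int(1)
var_dump(preg_match($pattern, 'day-01-transcript.pdf')); // int(0): hyphens aren't in the class
var_dump(preg_match($pattern, 'day 01.pdf'));            // int(0): neither are spaces
```

If the filenames ever did include hyphens, adding a - at the end of the class ([0-9A-Za-z_-]) should cover them.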