PHP is an interpreted language that can be used for web scraping. Web scraping is a process in which a program fetches the pages of a website and extracts data from them. In this article we will learn how to do web scraping in PHP.
How Web Scraping Works in PHP
- Send GET requests to the URLs of the website one by one.
- Retrieve the HTML code of each web page.
- Parse the DOM (HTML document) of each URL (web page).
- Select the required HTML tags, such as meta, p, video, and div tags.
- Finally, read the content from the selected tags.
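The parsing steps in this list can be tried without any network access. The following is a minimal sketch that assumes the HTML has already been downloaded. The `$html` sample string and the helper name `extract_tag_text` are made up for illustration and are not part of the script in this article.

```php
<?php
// Hypothetical sample page standing in for HTML fetched by a GET request.
$html = '<html><head><title>Demo page</title></head>'
      . '<body><p>First paragraph</p><div><p>Second paragraph</p></div></body></html>';

// Return the text content of every element with the given tag name.
// (extract_tag_text is an illustrative helper, not a built-in function.)
function extract_tag_text($html, $tag) {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = array();
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$paragraphs = extract_tag_text($html, 'p');
$titles = extract_tag_text($html, 'title');
print_r($paragraphs);
print_r($titles);
```

Working on an inline string like this makes it easy to check the tag selection logic before pointing it at a live page.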
Here we will parse an Instagram URL and read the meta tags that carry Open Graph (og) properties such as og:type, og:video, and og:image.
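As a small illustration of reading og properties, here is a sketch that collects them from a sample HTML string using DOMXPath. The sample markup, the example.com URLs, and the helper name `read_og_tags` are assumptions for illustration only. A real page may order or name its tags differently.

```php
<?php
// Hypothetical sample markup with three og:* tags and one unrelated meta tag.
$html = '<html><head>'
      . '<meta property="og:type" content="video">'
      . '<meta property="og:image" content="https://example.com/cover.jpg">'
      . '<meta property="og:video" content="https://example.com/clip.mp4">'
      . '<meta name="description" content="not an og tag">'
      . '</head><body></body></html>';

// Collect all og:* meta tags into a property => content array.
// (read_og_tags is an illustrative helper name.)
function read_og_tags($html) {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $og = array();
    // Select only <meta> elements whose property attribute starts with "og:".
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $og[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $og;
}

$og = read_og_tags($html);
print_r($og);
```

Gathering the tags into an array first means the later decision (video or image) does not depend on the order in which the tags appear in the page.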
Create a file named scraping.php, paste in the code given below, and run it. You will see the image from the Instagram URL.
<?php
/*
* Web Scraping Using PHP - Get the Image From Instagram URL By reading the Meta tags with og properties
*/
/*set header for image output*/
header("content-type: image/jpeg");
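function checkinstaurl($urlhere){
    //remove white space (the URL is not HTML-escaped, because that would turn "&" into "&amp;" and corrupt query strings)
    $urlhere = trim($urlhere);
    //default result, also returned when the URL is not an Instagram URL or the download fails
    $out = ['mediatype' => null, 'descriptionc' => null];
    if (get_domain($urlhere) != "instagram.com") {
        return $out;
    }
    //getting the page HTML that contains the meta tags
    $html = file_get_contents_curl($urlhere);
    if (!is_string($html) || $html === '') {
        return $out;
    }
    //parsing begins here:
    $doc = new DOMDocument();
    //real-world HTML is rarely valid, so collect parser warnings instead of printing them
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    //page title, if the page has one
    $nodes = $doc->getElementsByTagName('title');
    $title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
    //collect every og:* meta tag first, so the result does not depend on the order of the tags
    $og = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            $og[$property] = $meta->getAttribute('content');
        }
    }
    $type = isset($og['og:type']) ? $og['og:type'] : null;
    if ($type == 'video') {
        //video posts: use the og:video URL
        $out['mediatype'] = 'video';
        $out['descriptionc'] = isset($og['og:video']) ? $og['og:video'] : null;
    } elseif (isset($og['og:image'])) {
        //everything else: use the og:image URL and treat it as a photo
        $out['mediatype'] = 'photo';
        $out['descriptionc'] = $og['og:image'];
    } else {
        $out['mediatype'] = $type;
    }
    return $out;
}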
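/*get file contents*/
function file_get_contents_curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    //do not include the response headers in the returned body
    curl_setopt($ch, CURLOPT_HEADER, 0);
    //return the body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //follow redirects
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    //keep TLS certificate verification on; turning it off exposes the request to man-in-the-middle attacks
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
    //the body as a string, or false on failure
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
/*end get file contents*/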
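/*get domain name*/
function get_domain($url){
    $pieces = parse_url($url);
    if ($pieces === false) {
        return FALSE;
    }
    //a URL without a scheme is parsed as a path, so fall back to the path
    $domain = isset($pieces['host']) ? $pieces['host'] : (isset($pieces['path']) ? $pieces['path'] : '');
    //keep the domain at the end of the host, e.g. "instagram.com" from "www.instagram.com"
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    }
    return FALSE;
}
/*end get domain name*/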
/********Generating Output********/
$igurl = 'https://www.instagram.com/p/CO_Dk2wD93I/';
$output = checkinstaurl($igurl);
//echo "<pre>";
//print_r($output);
//die;
if (!empty($output['descriptionc'])) {
    //stream the remote media to the browser (requires allow_url_fopen to be enabled)
    readfile($output['descriptionc']);
} else {
    //no media URL was found, so there is nothing to output
    http_response_code(404);
}
?>
- In the above code we used the DOMDocument class to parse the DOM.
- Set the Content-Type header to image/jpeg so that the browser treats the response as image data.
- Used cURL to get the HTML content from the URL.
- Used preg_match to get the domain name from the URL.
- Created the custom functions checkinstaurl, file_get_contents_curl, and get_domain.
- Used readfile to output the content of the image URL.
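The pattern used with preg_match limits the final part of the domain to six characters, so longer top-level domains will not match. As an alternative sketch, the host can be compared directly against an allowed domain without a regular expression. The helper name `is_host_of` is hypothetical, and this is one possible approach rather than the method used in the script above.

```php
<?php
// Return true when the host of $url is $domain itself or a subdomain of it.
// (is_host_of is an illustrative helper name.)
function is_host_of($url, $domain) {
    $host = parse_url(trim($url), PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // malformed URL, or no host component (e.g. missing scheme)
    }
    $host = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    // Require a leading dot so that "notinstagram.com" does not pass.
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

var_dump(is_host_of('https://www.instagram.com/p/CO_Dk2wD93I/', 'instagram.com'));
var_dump(is_host_of('https://notinstagram.com/p/abc/', 'instagram.com'));
```

Unlike get_domain, this sketch rejects URLs that lack a scheme, since parse_url reports no host for them.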