PHP Document Object Model (DOM) Parser
The PHP DOM (Document Object Model) parser handles both XML and HTML well. It parses the markup into a tree structure and exposes it through a DOMDocument object. The first step is to construct a DOMDocument object and then load the HTML content into it.
// create a new DOMDocument object
$dom = new DOMDocument();
// do not preserve redundant white space (a parse-time setting, so set it before loading)
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
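Real-world HTML is often malformed, and loadHTML reports parser problems as PHP warnings. The following is a minimal sketch of one way to keep those warnings out of the output, using libxml's internal error buffer. The sample markup in $html is hypothetical and deliberately contains a stray closing tag.

```php
<?php
// Hypothetical sample markup; the stray </span> is deliberate.
$html = '<html><head><title>Demo</title></head>'
      . '<body><p>First paragraph</span><div id="main">Hello</div></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Read what the parser complained about, then empty the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $loaded ? "Loaded\n" : "Failed to load\n";
echo count($errors) . " parser message(s)\n";
```

Each entry in $errors is a LibXMLError object, so its message and line properties can be logged if the parser diagnostics are of interest.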
Concept of DOM
Everything in a DOM document is a node. The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. The root node can have child nodes, and those child nodes can have child nodes of their own. For example, an HTML page has a root element (HTML) with two children (HEAD and BODY).
<title>The Title</title>
It has two nodes: a DOMElement with a DOMText child.
<div class="header">
It has three nodes: a DOMElement with a DOMAttr that holds a DOMText.
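To make the node types concrete, here is a small sketch that walks a document and lists the PHP class and node name of every node it meets. The markup and the helper name describeNode are made up for illustration; the helper lists attributes separately because they are not part of an element's child list.

```php
<?php
// Hypothetical markup combining the two fragments discussed above.
$html = '<html><head><title>The Title</title></head>'
      . '<body><div class="header">Hi</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Return one indented line per node: its PHP class and its node name.
// describeNode is an illustrative helper, not part of the DOM extension.
function describeNode(DOMNode $node, int $depth = 0): array
{
    $lines = [str_repeat('  ', $depth) . get_class($node) . ' ' . $node->nodeName];

    // Attributes are nodes too, but they live outside the child list.
    if ($node->attributes !== null) {
        foreach ($node->attributes as $attr) {
            $lines[] = str_repeat('  ', $depth + 1) . get_class($attr) . ' ' . $attr->nodeName;
        }
    }

    // Walk the children through firstChild / nextSibling.
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        $lines = array_merge($lines, describeNode($child, $depth + 1));
    }

    return $lines;
}

$lines = describeNode($dom->documentElement);
echo implode("\n", $lines), "\n";
```

In the listing, the title element appears as a DOMElement followed by a DOMText child, and the div shows a DOMAttr for its class attribute.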
There are two important functions that can be used to extract contents from the html structure:
- getElementsByTagName
- getElementById
1. Get Elements by Tag Name
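The function getElementsByTagName returns a DOMNodeList, a collection you can iterate with foreach, containing all the elements with a given tag name. It is not a plain PHP array, so array functions such as count() on older PHP versions or array_map() do not apply to it directly. This function is useful when you want to read the content or attributes of multiple HTML elements that share the same tag.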
Example: For Getting Tables
$tables = $dom->getElementsByTagName('table');
foreach($tables as $table)
{
echo $dom->saveHTML($table);
}
When given a node, the saveHTML function returns the HTML markup of that node, including the node's own tags and everything inside it. To get the total number of elements, you can use the length property of the list.
echo 'Found: ' . $tables->length . ' items';
Example: For Getting Links
$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $node)
{
    echo $dom->saveHTML($node);
}
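Each element node gives you four pieces of information: the tag name, the attribute names, the attribute values, and the enclosed tag content. With $node holding one of the links from the loop above, you can read or change them as follows.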
1. To get the text values of the node (enclosed tag content):
echo $node->nodeValue;
2. To check if the href attribute exists:
// hasAttribute returns a boolean; echoing false directly would print nothing
echo $node->hasAttribute('href') ? 'yes' : 'no';
3. To get the href attribute value:
echo $node->getAttribute('href');
4. To change the href attribute value:
$node->setAttribute('href', 'something else');
5. To remove the href attribute and its value:
$node->removeAttribute('href');
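The calls above can be combined into one pass over the document. The following is a minimal sketch that collects the tag name, the href value, and the text of every link into an array. The sample markup and the array keys are assumptions chosen for illustration.

```php
<?php
// Hypothetical sample markup; the second anchor has no href on purpose.
$html = '<html><body>'
      . '<a href="/home">Home</a>'
      . '<a name="top">No link here</a>'
      . '<a href="https://example.com/docs">Docs</a>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    // Skip anchors that carry no href attribute.
    if (!$node->hasAttribute('href')) {
        continue;
    }
    $links[] = [
        'tag'  => $node->nodeName,             // tag name
        'href' => $node->getAttribute('href'), // attribute value
        'text' => trim($node->nodeValue),      // enclosed tag content
    ];
}

print_r($links);
```

Checking hasAttribute first matters because getAttribute returns an empty string for a missing attribute, which would be indistinguishable from an empty href.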
2. Get Element by Id
The function getElementById returns a DOMElement object for the element with a given id, or NULL if the element is not found. This function is useful when you want to read the content or an attribute value of an HTML element with a specified id.
$element = $dom->getElementById('myid');
// getElementById returns NULL when no element has that id
if ($element !== null) {
    echo $element->nodeValue;
}
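As a quick illustration of both outcomes, the sketch below looks up one id that exists and one that does not. The markup and the id no-such-id are hypothetical.

```php
<?php
// Hypothetical sample markup with a single element carrying an id.
$html = '<html><body><div id="myid">Hello <b>world</b></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$element = $dom->getElementById('myid');       // a DOMElement
$missing = $dom->getElementById('no-such-id'); // null

if ($element !== null) {
    // nodeValue concatenates the text of the element and its descendants.
    echo $element->nodeValue, "\n";
    echo $element->nodeName, "\n";
}

var_dump($missing === null);
```

Note that nodeValue on an element flattens nested markup to text, so the b tag around "world" does not appear in the output.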
3. DOMXPath in PHP
The DOMXPath class is part of the PHP DOM extension. XPath uses path expressions to select nodes from a document, and DOMXPath evaluates those expressions against a loaded DOMDocument.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
Syntax for XPath Query
- / Selects from the root node
- // Selects nodes in the document from the current node that match the selection no matter where they are
- . Selects the current node
- .. Selects the parent of the current node
- @ Selects attributes
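The following sketch exercises each of the expressions listed above against a small document. The markup is hypothetical; the second argument to query is the context node that relative expressions such as . and .. start from.

```php
<?php
// Hypothetical sample markup.
$html = '<html><head><title>Demo</title></head><body>'
      . '<div id="box"><a href="/one">One</a><a href="/two">Two</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "/" : absolute path starting at the root node.
$title = $xpath->query('/html/head/title')->item(0);

// "//" : every matching node, wherever it sits in the document.
$anchors = $xpath->query('//a');

// "@" : select attribute nodes, here the href of every anchor.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->value;
}

// "." : path relative to the context node given as second argument.
$box = $xpath->query('//div[@id="box"]')->item(0);
$inside = $xpath->query('./a', $box);

// ".." : step up to the parent of the context node.
$parent = $xpath->query('..', $anchors->item(0))->item(0);

echo $title->nodeValue, "\n";      // Demo
echo $anchors->length, "\n";       // 2
echo implode(', ', $hrefs), "\n";  // /one, /two
echo $inside->length, "\n";        // 2
echo $parent->nodeName, "\n";      // div
```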
Parse h1 tag text
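// initialize the accumulator so the .= below does not hit an undefined variable
$heading1 = '';
$contents = $xpath->query('//h1');
// query() returns false for an invalid expression, never NULL
if ($contents !== false)
{
    foreach ($contents as $node) {
        $heading1 .= ' ' . $node->nodeValue;
    }
}
echo("h1: $heading1\n\n");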
Parse h3 and h4 tag text
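// initialize the accumulator so the .= below does not hit an undefined variable
$heading3and4 = '';
$contents = $xpath->query('//h3 | //h4');
// query() returns false for an invalid expression, never NULL
if ($contents !== false)
{
    foreach ($contents as $node) {
        $heading3and4 .= ' ' . $node->nodeValue;
    }
}
echo("h3 and h4s: $heading3and4\n\n");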
Parse meta description
$metaDescription = '';
$contents = $xpath->query('/html/head/meta[@name="description"]/@content');
if ($contents->length != 0)
{
foreach ($contents as $content) {
$metaDescription .= $content->value;
}
}
echo("Meta Description: $metaDescription\n\n");
Parse meta keywords
$metaKeywords = '';
$contents = $xpath->query('/html/head/meta[@name="keywords"]/@content');
if ($contents->length != 0)
{
foreach ($contents as $content) {
$metaKeywords .= ' ' . $content->value;
}
}
echo("Meta Keywords: $metaKeywords\n\n");
Parse Elements with class Name
$nodeList = $xpath->query("//div[@class='class_name']");
$node = $nodeList->item(0);
// To check the result (item(0) is NULL when nothing matched):
if ($node !== null) {
    echo "<p>" . $node->nodeValue . "</p>";
}
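One caveat with the query above: @class='class_name' compares the whole attribute value, so it misses elements that carry several classes. A common XPath 1.0 workaround, sketched here on hypothetical markup, pads the normalized class attribute with spaces and searches for the padded class name as a token.

```php
<?php
// Hypothetical sample markup: an exact match, a multi-class match, and a lookalike.
$html = '<html><body>'
      . '<div class="class_name">exact</div>'
      . '<div class="box class_name wide">token</div>'
      . '<div class="class_name_extra">lookalike</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Whole-attribute comparison: only class="class_name" matches.
$exact = $xpath->query("//div[@class='class_name']");

// Token comparison: matches any div whose class list contains class_name.
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' class_name ')]"
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```

The padding with spaces is what keeps class_name_extra from matching, which a plain contains(@class, 'class_name') would not prevent.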