BlogContentProcessor API Documentation
Overview
BlogContentProcessor is a high-level content processor that reads markdown files, parses them into HTML, assigns IDs to headings, and generates a table of contents automatically.
Namespace
namespace Blog\Renderer;
Class: BlogContentProcessor
Constructor
public function __construct()
Initializes the content processor with a new BlogRenderer instance.
Public Methods
processMarkdownFile(string $filePath): array
Processes a markdown file and extracts title, HTML content, and table of contents.
Parameters:
- $filePath (string): Path to the markdown file
Returns:
- (array): Processed data with keys:
  - title (string): Extracted article title
  - content (string): HTML content with heading IDs
  - toc (array): Table of contents array
Title Extraction Priority:
- YAML front matter title: field
- First H1 heading in markdown content
- Empty string if neither found
YAML Front Matter Format:
---
title: Custom Title
description: Custom description
---
# This heading will be ignored
Content goes here...
Features:
- Reads file from disk
- Extracts and removes YAML front matter
- Parses markdown to HTML
- Adds IDs to headings without them
- Generates table of contents
- Returns empty content if file doesn't exist
Example:
$processor = new BlogContentProcessor();
// Process a markdown file
$result = $processor->processMarkdownFile('articles/my-post.md');
echo $result['title']; // Article title
echo $result['content']; // HTML content
print_r($result['toc']); // Table of contents
Example YAML Front Matter:
---
title: My Custom Title
author: John Doe
date: 2024-01-15
---
# Regular Content
This content will be processed normally.
extractTableOfContents(string $html): array
Extracts table of contents from HTML by finding headings with IDs.
Parameters:
- $html (string): HTML content to extract TOC from
Returns:
- (array): TOC items, each with keys:
  - level (int): Heading level (1-6)
  - id (string): Heading ID attribute
  - text (string): Heading text content
Heading Pattern:
- Matches <h1> through <h6> tags
- Requires id attribute
- Strips HTML tags from heading text
- Decodes HTML entities
Example:
$html = '<h1 id="intro">Introduction</h1>
<h2 id="setup">Setup</h2>';
$processor = new BlogContentProcessor();
$toc = $processor->extractTableOfContents($html);
// Result:
// [
// ['level' => 1, 'id' => 'intro', 'text' => 'Introduction'],
// ['level' => 2, 'id' => 'setup', 'text' => 'Setup']
// ]
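The heading pattern above can be approximated with a single regular expression. The following is a minimal sketch, not the class's actual implementation: the function name sketchExtractToc and the exact pattern are assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for extractTableOfContents(); the real method may differ.
function sketchExtractToc(string $html): array
{
    $toc = [];
    // Match <h1>..<h6> tags that carry a quoted id attribute.
    $pattern = '/<h([1-6])\s[^>]*?(?<![\w-])id\s*=\s*(["\'])([^"\'>]*)\2[^>]*>(.*?)<\/h\1>/is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $toc[] = [
                'level' => (int) $m[1],
                'id'    => $m[3],
                // Strip inline tags, then decode entities to get plain text.
                'text'  => trim(html_entity_decode(strip_tags($m[4]), ENT_QUOTES, 'UTF-8')),
            ];
        }
    }
    return $toc;
}
```

Headings without an id attribute are skipped by this sketch, which is why IDs are added before the TOC is extracted.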
addIdsToHeadings(string $html): string
Adds ID attributes to headings that don't have them.
Parameters:
- $html (string): HTML content
Returns:
- (string): HTML with IDs added to headings
ID Generation:
- Converts text to lowercase
- Replaces non-alphanumeric characters with hyphens
- Collapses multiple hyphens to single hyphen
- Trims leading/trailing hyphens
- Uses 'section' as fallback for empty IDs
- Preserves existing IDs
Example:
$html = '<h1>Hello World</h1>';
$processor = new BlogContentProcessor();
$result = $processor->addIdsToHeadings($html);
// Result: '<h1 id="hello-world">Hello World</h1>'
Example with Existing ID:
$html = '<h1 id="custom-id">Heading</h1>';
$processor = new BlogContentProcessor();
$result = $processor->addIdsToHeadings($html);
// Result: '<h1 id="custom-id">Heading</h1>' (unchanged)
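One way to get the behavior described above is a regex callback that skips headings with an existing id and appends a generated one otherwise. This is a sketch under assumptions: sketchSlug and sketchAddIdsToHeadings are hypothetical names, and the real method may parse HTML differently.

```php
<?php
// Hypothetical helper that follows the rules listed under "ID Generation".
function sketchSlug(string $text): string
{
    $lower = function_exists('mb_strtolower') ? mb_strtolower($text, 'UTF-8') : strtolower($text);
    // Runs of non-letter, non-digit characters become one hyphen; edges are trimmed.
    $slug = trim(preg_replace('/[^\p{L}\p{N}]+/u', '-', $lower) ?? '', '-');
    return $slug === '' ? 'section' : $slug;
}

// Hypothetical stand-in for addIdsToHeadings().
function sketchAddIdsToHeadings(string $html): string
{
    $result = preg_replace_callback(
        '/<h([1-6])(\s[^>]*)?>(.*?)<\/h\1>/is',
        function (array $m): string {
            [$whole, $level, $attrs, $inner] = $m;
            // Preserve headings that already have an id attribute.
            if (preg_match('/(?<![\w-])id\s*=/i', $attrs)) {
                return $whole;
            }
            // Build the ID from the heading's plain text.
            $text = html_entity_decode(strip_tags($inner), ENT_QUOTES, 'UTF-8');
            return "<h{$level}{$attrs} id=\"" . sketchSlug($text) . "\">{$inner}</h{$level}>";
        },
        $html
    );
    return $result ?? $html;
}
```

This sketch does not deduplicate IDs when two headings share the same text; the source does not say whether the class does.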
getRenderer(): BlogRenderer
Returns the internal BlogRenderer instance.
Returns:
- (BlogRenderer): The BlogRenderer instance used by this processor
Example:
$processor = new BlogContentProcessor();
$renderer = $processor->getRenderer();
// Use renderer directly
$renderer->setConfig(['title' => 'My Blog']);
$html = $renderer->renderDocument([...]);
Private Methods
extractTitle(string $content, string $frontMatter = ''): string
Extracts article title from content or front matter.
Priority:
- title: field in YAML front matter
- First # heading in markdown content
- Empty string if neither found
Parameters:
- $content (string): Markdown content (with front matter removed)
- $frontMatter (string): YAML front matter string
Returns:
- (string): Extracted title
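The priority order can be illustrated with two regular expressions tried in sequence. This is a minimal sketch, assuming a flat title: line in the front matter; sketchExtractTitle is a hypothetical name and the private method itself may be implemented differently.

```php
<?php
// Hypothetical stand-in for the private extractTitle() method.
function sketchExtractTitle(string $content, string $frontMatter = ''): string
{
    // 1. A "title:" line in the front matter wins.
    if (preg_match('/^title:[ \t]*(.+?)\s*$/mi', $frontMatter, $m)) {
        // Drop optional surrounding quotes.
        return trim($m[1], "\"'");
    }
    // 2. Otherwise use the first level-1 ("# ") heading in the markdown.
    if (preg_match('/^#[ \t]+(.+?)\s*$/m', $content, $m)) {
        return $m[1];
    }
    // 3. Nothing found.
    return '';
}
```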
generateId(string $text): string
Generates a URL-friendly ID from heading text.
Transformation Steps:
- Convert to lowercase
- Replace characters that are not Unicode letters or digits with hyphens
- Collapse multiple hyphens to single hyphen
- Trim leading/trailing hyphens
- Return 'section' if empty
Parameters:
- $text (string): Plain text heading
Returns:
- (string): URL-friendly ID
Examples:
generateId('Hello World') // 'hello-world'
generateId(' Test ') // 'test'
generateId('Hello--World') // 'hello-world'
generateId('123') // '123'
generateId('') // 'section'
generateId('你好世界') // '你好世界'
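The transformation steps can be reproduced with a few standard string functions. The following is a sketch only: sketchGenerateId is a hypothetical name, and the use of the \p{L} and \p{N} Unicode classes is an assumption inferred from the examples above.

```php
<?php
// Hypothetical stand-in for the private generateId() method.
function sketchGenerateId(string $text): string
{
    // Step 1: lowercase (multibyte-aware when the mbstring extension is available).
    $id = function_exists('mb_strtolower') ? mb_strtolower($text, 'UTF-8') : strtolower($text);
    // Step 2: replace anything that is not a Unicode letter or digit with a hyphen.
    $id = preg_replace('/[^\p{L}\p{N}]/u', '-', $id) ?? '';
    // Step 3: collapse runs of hyphens.
    $id = preg_replace('/-+/', '-', $id) ?? '';
    // Step 4: trim leading and trailing hyphens.
    $id = trim($id, '-');
    // Step 5: fall back to 'section' when nothing is left.
    return $id === '' ? 'section' : $id;
}
```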
Usage Workflow
Basic Workflow
<?php
use Blog\Renderer\BlogContentProcessor;
$processor = new BlogContentProcessor();
// 1. Process markdown file
$result = $processor->processMarkdownFile('article.md');
// 2. Extract data
$title = $result['title'];
$content = $result['content'];
$toc = $result['toc'];
// 3. Get renderer for HTML generation
$renderer = $processor->getRenderer();
// 4. Render complete document
$html = $renderer->renderDocument([
'rootPath' => './',
'title' => $title,
'content' => $content,
'toc' => $toc,
]);
Complete Example
<?php
use Blog\Renderer\BlogContentProcessor;
$processor = new BlogContentProcessor();
// Process article with YAML front matter
$articleData = $processor->processMarkdownFile('posts/my-article.md');
// Output metadata
echo "Title: " . $articleData['title'] . "\n";
echo "TOC Items: " . count($articleData['toc']) . "\n";
// Get renderer for custom configuration
$renderer = $processor->getRenderer();
$renderer->setConfig([
'title' => 'My Tech Blog',
'description' => 'Technical articles',
]);
// Render with custom breadcrumbs
$html = $renderer->renderDocument([
'rootPath' => '../',
'title' => $articleData['title'],
'content' => $articleData['content'],
'toc' => $articleData['toc'],
'breadcrumbs' => [
['link' => '../index.php', 'name' => 'Home'],
['link' => 'index.php', 'name' => 'Posts'],
]
]);
// Output HTML
file_put_contents('posts/my-article.html', $html);
YAML Front Matter Support
Supported Fields
Currently, only title is automatically extracted from front matter, but you can include any custom metadata:
---
title: My Article Title
author: John Doe
date: 2024-01-15
category: PHP
tags: [php, markdown, tutorial]
---
# Article Content
The article content goes here.
Note: Only the title field is used by default. You can extend the class to extract other fields as needed.
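As a starting point for such an extension, the sketch below splits the front matter from the body and collects flat key: value pairs. It is an assumption-laden illustration: sketchParseFrontMatter is a hypothetical function, not part of BlogContentProcessor, and it is not a YAML parser (a list such as tags is returned as its raw string).

```php
<?php
// Hypothetical helper: returns ['fields' => array, 'body' => string].
function sketchParseFrontMatter(string $markdown): array
{
    $fields = [];
    $body = $markdown;
    // Front matter must open on the first line and be closed by a second '---' line.
    $pattern = '/\A---[ \t]*\r?\n(.*?)\r?\n---[ \t]*(?:\r?\n|\z)(.*)\z/s';
    if (preg_match($pattern, $markdown, $m)) {
        $body = $m[2];
        foreach (preg_split('/\r?\n/', $m[1]) as $line) {
            // Only simple "key: value" lines are recognized.
            if (preg_match('/^([A-Za-z0-9_-]+):[ \t]*(.*)$/', $line, $kv)) {
                $fields[$kv[1]] = trim($kv[2]);
            }
        }
    }
    return ['fields' => $fields, 'body' => $body];
}
```

For full YAML support (nested values, lists, quoting), a dedicated YAML library would be a better fit than this line-based approach.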
Integration with BlogRenderer
BlogContentProcessor uses BlogRenderer internally for markdown parsing. You can access the renderer instance to:
- Update configuration
- Render documents
- Parse markdown directly
- Use other renderer features
$processor = new BlogContentProcessor();
// Access renderer
$renderer = $processor->getRenderer();
// Configure renderer
$renderer->setConfig(['title' => 'Custom Blog']);
// Parse markdown directly
$html = $renderer->parseMarkdown('# Hello');
Error Handling
The processor handles missing files gracefully: it returns a result with empty content and an empty TOC instead of raising an error:
$result = $processor->processMarkdownFile('nonexistent.md');
// Returns:
// [
// 'title' => 'nonexistent',
// 'content' => '',
// 'toc' => []
// ]
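Because a missing file yields an ordinary result array, callers may want an explicit check before rendering. The helper below is a hypothetical sketch (sketchIsEmptyResult is not part of the class) that treats a result with no content and no TOC entries as empty.

```php
<?php
// Hypothetical helper: true when a processMarkdownFile() result carries no content.
function sketchIsEmptyResult(array $result): bool
{
    return ($result['content'] ?? '') === '' && ($result['toc'] ?? []) === [];
}
```

Note that an existing but empty markdown file would presumably look the same to this check, so calling is_file() on the path beforehand is the more direct test for a missing file.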
Table of Contents Structure
The TOC array is structured as follows:
[
[
'level' => 1, // Heading level (1-6)
'id' => 'section-id', // Heading ID attribute
'text' => 'Section Title' // Plain text heading content
],
[
'level' => 2,
'id' => 'subsection-id',
'text' => 'Subsection Title'
],
// ... more items
]
This structure is compatible with the toc parameter in BlogRenderer::renderDocument().
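If the TOC needs to be rendered outside BlogRenderer, the flat structure is easy to walk. The sketch below is an illustration only: sketchRenderToc and the toc and toc-level-N class names are assumptions, and the markup BlogRenderer actually produces may differ.

```php
<?php
// Hypothetical renderer: turns the flat TOC array into a single <ul> of anchor links.
function sketchRenderToc(array $toc): string
{
    if ($toc === []) {
        return '';
    }
    $items = '';
    foreach ($toc as $item) {
        // The level becomes a CSS class so nesting can be styled with indentation.
        $items .= sprintf(
            '<li class="toc-level-%d"><a href="#%s">%s</a></li>',
            (int) $item['level'],
            htmlspecialchars($item['id'], ENT_QUOTES, 'UTF-8'),
            htmlspecialchars($item['text'], ENT_QUOTES, 'UTF-8')
        );
    }
    return '<ul class="toc">' . $items . '</ul>';
}
```

The text is re-escaped here because the TOC stores decoded plain text rather than HTML.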
See Also
- BlogRenderer - Main rendering class
- SecurityFilter - Security filtering
- Parsedown - Markdown parsing