How do you find the number of pages in a PDF? I read the PDF specification, version 1.6, and found this:
PDF uses a "page tree node" to define the ordering of pages in the document. The tree structure allows PDF applications to open a document containing thousands of pages quickly, using little memory.
If a PDF has 63 pages, the page tree node will look like this:
2 0 obj
<< /Type /Pages
/Kids [ 4 0 R
10 0 R
]
/Count 63 <---- YES, got it
>>
endobj
[P.S.] A PDF may contain more than one page tree node. The right answer is in the root page tree node: if a node has /Count XX together with a /Parent XXX entry, it is not the root page tree node.
So you must find the node that has /Count XX and no /Parent entry, and you will get the total number of pages in the PDF.
This works for %PDF-1.0 through %PDF-1.5.
Alex from Taipei, Taiwan
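The rule above can be sketched in PHP. This is a minimal sketch, not a full parser: the function name count_pdf_pages_from_root is made up for illustration, and it assumes the page tree objects are stored as plain "N G obj ... endobj" text. Files that pack their objects into compressed object streams will not match.

```php
<?php
// Hypothetical helper: returns the /Count of the root page tree node,
// or null if no such node is found in the raw PDF text.
function count_pdf_pages_from_root(string $pdf): ?int
{
    // Split the file into indirect objects: "N G obj ... endobj".
    if (!preg_match_all('/\d+\s+\d+\s+obj(.*?)endobj/s', $pdf, $objects)) {
        return null;
    }
    foreach ($objects[1] as $body) {
        // Keep page tree nodes (/Type /Pages), skip single pages (/Type /Page).
        if (!preg_match('/\/Type\s*\/Pages\b/', $body)) {
            continue;
        }
        // A node with a /Parent entry is an intermediate node, not the root.
        if (preg_match('/\/Parent\b/', $body)) {
            continue;
        }
        if (preg_match('/\/Count\s+(\d+)/', $body, $m)) {
            return (int) $m[1];
        }
    }
    return null;
}

// Synthetic sample: a root node with 63 pages and one intermediate node.
$sample = "2 0 obj\n<< /Type /Pages /Kids [ 4 0 R 10 0 R ] /Count 63 >>\nendobj\n"
        . "4 0 obj\n<< /Type /Pages /Parent 2 0 R /Kids [ 5 0 R ] /Count 30 >>\nendobj\n";
echo count_pdf_pages_from_root($sample), "\n"; // prints 63
```

For a real file, pass in the result of file_get_contents() and treat a null result as "could not tell".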
PDF Functions
Notes on Deprecated PDFlib Functions
Starting with PHP 4.0.5, the PHP extension for PDFlib is officially supported by PDFlib GmbH. This means that all the functions described in the PDFlib Reference Manual are supported by PHP 4 with exactly the same meaning and the same parameters. However, with PDFlib Version 5.0.4 or later, all parameters have to be specified. For compatibility reasons, this binding for PDFlib still supports most of the deprecated functions, but they should be replaced by their new versions. PDFlib GmbH will not support any problems arising from the use of these deprecated functions. The documentation in this section marks the old functions as "Deprecated" and gives the replacement function to use instead.
Table of Contents
- PDF_activate_item — Activate a structure element or other content item
- PDF_add_annotation — Add an annotation [deprecated]
- PDF_add_bookmark — Add a bookmark for the current page [deprecated]
- PDF_add_launchlink — Add a launch annotation to the current page [deprecated]
- PDF_add_locallink — Add a link annotation to the current page [deprecated]
- PDF_add_nameddest — Create a named destination
- PDF_add_note — Set an annotation for the current page [deprecated]
- PDF_add_outline — Add a bookmark for the current page [deprecated]
- PDF_add_pdflink — Add a page link annotation to the current page [deprecated]
- PDF_add_table_cell — Add a cell to a new or existing table
- PDF_add_textflow — Create a Textflow or add text to an existing Textflow
- PDF_add_thumbnail — Add a thumbnail image to the current page
- PDF_add_weblink — Add a web link to the current page [deprecated]
- PDF_arc — Draw a counterclockwise circular arc segment
- PDF_arcn — Draw a clockwise circular arc segment
- PDF_attach_file — Add a file attachment to the current page [deprecated]
- PDF_begin_document — Create a new PDF file
- PDF_begin_font — Start a Type 3 font definition
- PDF_begin_glyph — Start a glyph definition for a Type 3 font
- PDF_begin_item — Open a structure element or other content item
- PDF_begin_layer — Start a layer
- PDF_begin_page_ext — Start a new page
- PDF_begin_page — Start a new page [deprecated]
- PDF_begin_pattern — Start a pattern definition
- PDF_begin_template_ext — Start a template definition
- PDF_begin_template — Start a template definition [deprecated]
- PDF_circle — Draw a circle
- PDF_clip — Clip to the current path
- PDF_close_image — Close an image
- PDF_close_pdi_page — Close a page handle
- PDF_close_pdi — Close the input PDF document [deprecated]
- PDF_close — Close a pdf resource [deprecated]
- PDF_closepath_fill_stroke — Close, fill and stroke the current path
- PDF_closepath_stroke — Close and stroke a path
- PDF_closepath — Close the current path
- PDF_concat — Concatenate a matrix to the CTM
- PDF_continue_text — Output text on the next line
- PDF_create_3dview — Create a 3D view
- PDF_create_action — Create an action for objects or events
- PDF_create_annotation — Create a rectangular annotation
- PDF_create_bookmark — Create a bookmark
- PDF_create_field — Create a form field
- PDF_create_fieldgroup — Create a form field group
- PDF_create_gstate — Create a graphics state object
- PDF_create_pvf — Create a PDFlib virtual file
- PDF_create_textflow — Create a textflow object
- PDF_curveto — Draw a Bézier curve
- PDF_define_layer — Create a layer definition
- PDF_delete_pvf — Delete a PDFlib virtual file
- PDF_delete_table — Delete a table object
- PDF_delete_textflow — Delete a textflow object
- PDF_delete — Delete a PDFlib object
- PDF_encoding_set_char — Add a glyph name and/or a Unicode value
- PDF_end_document — Close a PDF file
- PDF_end_font — Finish a Type 3 font definition
- PDF_end_glyph — Finish a glyph definition for a Type 3 font
- PDF_end_item — Close a structure element or other content item
- PDF_end_layer — Deactivate all active layers
- PDF_end_page_ext — Finish a page
- PDF_end_page — Finish a page
- PDF_end_pattern — Finish a pattern
- PDF_end_template — Finish a template
- PDF_endpath — End the current path
- PDF_fill_imageblock — Fill an image block with variable data
- PDF_fill_pdfblock — Fill a PDF block with variable data
- PDF_fill_stroke — Fill and stroke a path
- PDF_fill_textblock — Fill a text block with variable data
- PDF_fill — Fill the current path
- PDF_findfont — Prepare a font for later use [deprecated]
- PDF_fit_image — Place an image or template
- PDF_fit_pdi_page — Place an imported PDF page
- PDF_fit_table — Place a table on a page
- PDF_fit_textflow — Format a textflow in a rectangular area
- PDF_fit_textline — Place a single line of text
- PDF_get_apiname — Get the name of the API function that failed
- PDF_get_buffer — Get the PDF output buffer
- PDF_get_errmsg — Get the error text
- PDF_get_errnum — Get the error number
- PDF_get_font — Get a font [deprecated]
- PDF_get_fontname — Get the name of a font [deprecated]
- PDF_get_fontsize — Font handling [deprecated]
- PDF_get_image_height — Get the height of an image [deprecated]
- PDF_get_image_width — Get the width of an image [deprecated]
- PDF_get_majorversion — Get the major version number [deprecated]
- PDF_get_minorversion — Get the minor version number [deprecated]
- PDF_get_parameter — Get a string parameter
- PDF_get_pdi_parameter — Get a PDI string parameter [deprecated]
- PDF_get_pdi_value — Get a PDI numerical parameter [deprecated]
- PDF_get_value — Get a numerical parameter
- PDF_info_font — Query detailed information about a loaded font
- PDF_info_matchbox — Query matchbox information
- PDF_info_table — Retrieve table information
- PDF_info_textflow — Query the state of a textflow
- PDF_info_textline — Perform textline formatting and query the metrics
- PDF_initgraphics — Reset the graphics state
- PDF_lineto — Draw a line
- PDF_load_3ddata — Load a 3D model
- PDF_load_font — Search for and prepare a font
- PDF_load_iccprofile — Search for and prepare an ICC profile
- PDF_load_image — Open an image file
- PDF_makespotcolor — Make a spot color
- PDF_moveto — Set the current point
- PDF_new — Create a PDFlib object
- PDF_open_ccitt — Open a raw CCITT image [deprecated]
- PDF_open_file — Create a PDF file [deprecated]
- PDF_open_gif — Open a GIF image [deprecated]
- PDF_open_image_file — Read an image from a file [deprecated]
- PDF_open_image — Use image data [deprecated]
- PDF_open_jpeg — Open a JPEG image [deprecated]
- PDF_open_memory_image — Open an image created with PHP's image functions [not supported]
- PDF_open_pdi_document — Prepare a pdi document
- PDF_open_pdi_page — Prepare a page
- PDF_open_pdi — Open a PDF file [deprecated]
- PDF_open_tiff — Open a TIFF image [deprecated]
- PDF_pcos_get_number — Get the value of a pCOS path with type number or boolean
- PDF_pcos_get_stream — Get the contents of a pCOS path with type stream, fstream, or string
- PDF_pcos_get_string — Get the value of a pCOS path with type name, string, or boolean
- PDF_place_image — Place an image on a page [deprecated]
- PDF_place_pdi_page — Place a PDF page [deprecated]
- PDF_process_pdi — Process an imported PDF document
- PDF_rect — Draw a rectangle
- PDF_restore — Restore the graphics state
- PDF_resume_page — Resume a page
- PDF_rotate — Rotate the coordinate system
- PDF_save — Save the graphics state
- PDF_scale — Scale the coordinate system
- PDF_set_border_color — Set the border color of annotations [deprecated]
- PDF_set_border_dash — Set the border dash style of annotations [deprecated]
- PDF_set_border_style — Set the border style of annotations [deprecated]
- PDF_set_char_spacing — Set the character spacing [deprecated]
- PDF_set_duration — Set the duration between pages [deprecated]
- PDF_set_gstate — Activate a graphics state object
- PDF_set_horiz_scaling — Set the horizontal text scaling [deprecated]
- PDF_set_info_author — Fill the author document info field [deprecated]
- PDF_set_info_creator — Fill the creator document info field [deprecated]
- PDF_set_info_keywords — Fill the keywords document info field [deprecated]
- PDF_set_info_subject — Fill the subject document info field [deprecated]
- PDF_set_info_title — Fill the title document info field [deprecated]
- PDF_set_info — Fill a document info field
- PDF_set_layer_dependency — Define the relationship between layers
- PDF_set_leading — Set the distance between text lines [deprecated]
- PDF_set_parameter — Set a string parameter
- PDF_set_text_matrix — Set the text matrix [deprecated]
- PDF_set_text_pos — Set the text position
- PDF_set_text_rendering — Determine the text rendering [deprecated]
- PDF_set_text_rise — Set the text rise [deprecated]
- PDF_set_value — Set a numerical parameter
- PDF_set_word_spacing — Set the spacing between words [deprecated]
- PDF_setcolor — Set the fill and stroke color
- PDF_setdash — Set a simple dash pattern
- PDF_setdashpattern — Set a dash pattern
- PDF_setflat — Set the flatness parameter
- PDF_setfont — Set a font
- PDF_setgray_fill — Set the fill color to gray [deprecated]
- PDF_setgray_stroke — Set the stroke color to gray [deprecated]
- PDF_setgray — Set the color to gray [deprecated]
- PDF_setlinecap — Set the linecap parameter
- PDF_setlinejoin — Set the linejoin parameter
- PDF_setlinewidth — Set the line width
- PDF_setmatrix — Set the current transformation matrix
- PDF_setmiterlimit — Set the miter limit
- PDF_setpolydash — Set a complicated dash pattern [deprecated]
- PDF_setrgbcolor_fill — Set the fill rgb color values [deprecated]
- PDF_setrgbcolor_stroke — Set the stroke rgb color values [deprecated]
- PDF_setrgbcolor — Set the fill and stroke rgb color values [deprecated]
- PDF_shading_pattern — Define a shading pattern
- PDF_shading — Define a blend
- PDF_shfill — Fill an area with shading
- PDF_show_boxed — Output text in a box [deprecated]
- PDF_show_xy — Output text at a given position
- PDF_show — Output text at the current position
- PDF_skew — Skew the coordinate system
- PDF_stringwidth — Return the width of a text
- PDF_stroke — Stroke a path
- PDF_suspend_page — Suspend a page
- PDF_translate — Set the origin of the coordinate system
- PDF_utf16_to_utf8 — Convert a string from UTF-16 to UTF-8
- PDF_utf32_to_utf16 — Convert a string from UTF-32 to UTF-16
- PDF_utf8_to_utf16 — Convert a string from UTF-8 to UTF-16
<?php
// create a new PDFlib object
$pdfFile = PDF_new();
// an empty filename makes PDFlib build the document in memory
PDF_open_file($pdfFile, "");
// document info
PDF_set_info($pdfFile, "Author", "Ahmed Elbshry");
PDF_set_info($pdfFile, "Creator", "Ahmed Elbshry");
PDF_set_info($pdfFile, "Title", "PDFlib");
PDF_set_info($pdfFile, "Subject", "Using PDFlib");
// start the page and define the width and height of the document
PDF_begin_page($pdfFile, 595, 842);
// check that the Arial font is found, or exit
if ($font = PDF_findfont($pdfFile, "Arial", "winansi", 1)) {
    PDF_setfont($pdfFile, $font, 12);
} else {
    echo "Font Not Found!";
    PDF_end_page($pdfFile);
    PDF_close($pdfFile);
    PDF_delete($pdfFile);
    exit();
}
// start writing from the point 50,780
PDF_show_xy($pdfFile, "This Text In Arial Font", 50, 780);
PDF_end_page($pdfFile);
PDF_close($pdfFile);
// store the PDF document in $pdf
$pdf = PDF_get_buffer($pdfFile);
// get the length of the document (not of the PDFlib object) for the browser
$pdflen = strlen($pdf);
// tell the browser about the PDF document
header("Content-type: application/pdf");
header("Content-length: $pdflen");
header("Content-Disposition: inline; filename=phpMade.pdf");
// output the document
print($pdf);
// delete the object
PDF_delete($pdfFile);
?>
I am trying to extract the text from PDF files and use it to feed a search engine (an intranet tool). I tried several of the "PDF2TXT" functions posted below, but they do not produce the expected result. At a minimum, all words need to be separated by spaces (so they can be used as keywords), and the "junk" codes removed (for example binary data and pictures). I started by modifying the interesting function posted by Sven, and here is my current version, which is starting to work quite well (with PDF version 1.2). Sorry for the rather different programming style. Luc
<?php
// Patch for pdf2txt() posted by Sven Schuberth
// Add/replace the following code (cannot post the full program, size limitation)
// Handles PDF version 1.2
// New version of handleV2($data), only one line changed
function handleV2($data){
// initialise the chunk counter, the chunk list and the result
$j = 0;
$a_chunks = array();
$result_data = '';
// grab objects and then grab their contents (chunks)
$a_obj = getDataArray($data,"obj","endobj");
foreach($a_obj as $obj){
$a_filter = getDataArray($obj,"<<",">>");
if (is_array($a_filter)){
$j++;
$a_chunks[$j]["filter"] = $a_filter[0];
$a_data = getDataArray($obj,"stream\r\n","endstream");
if (is_array($a_data)){
$a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
}
}
}
// decode the chunks
foreach($a_chunks as $chunk){
// look at each chunk and decide how to decode it - by looking at the contents of the filter
$a_filter = explode("/",$chunk["filter"]);
if (!empty($chunk["data"])){
// look at the filter to find out which encoding has been used
// (strpos, not substr: we are searching for the filter name)
if (strpos($chunk["filter"],"FlateDecode")!==false){
$data =@ gzuncompress($chunk["data"]);
if (trim($data)!=""){
// CHANGED HERE, before: $result_data .= ps2txt($data);
$result_data .= PS2Text_New($data);
} else {
//$result_data .= "x";
}
}
}
}
return $result_data;
}
// New function - Extract text from PS codes
function ExtractPSTextElement($SourceString)
{
$CurStartPos = 0;
$Result = '';
while (($CurStartText = strpos($SourceString, '(', $CurStartPos)) !== FALSE)
{
// New text element found
if ($CurStartText - $CurStartPos > 8) $Spacing = ' ';
else {
$SpacingSize = substr($SourceString, $CurStartPos, $CurStartText - $CurStartPos);
if ($SpacingSize < -25) $Spacing = ' '; else $Spacing = '';
}
$CurStartText++;
$StartSearchEnd = $CurStartText;
while (($CurStartPos = strpos($SourceString, ')', $StartSearchEnd)) !== FALSE)
{
if (substr($SourceString, $CurStartPos - 1, 1) != '\\') break;
$StartSearchEnd = $CurStartPos + 1;
}
if ($CurStartPos === FALSE) break; // something wrong happened
// Remove ending '-'
if (substr($Result, -1, 1) == '-')
{
$Spacing = '';
$Result = substr($Result, 0, -1);
}
// Add to result
$Result .= $Spacing . substr($SourceString, $CurStartText, $CurStartPos - $CurStartText);
$CurStartPos++;
}
// Add line breaks (otherwise, result is one big line...)
return $Result . "\n";
}
// Global table for codes replacement
$TCodeReplace = array ('\(' => '(', '\)' => ')');
// New function, replacing the old "ps2txt" function
function PS2Text_New($PS_Data)
{
global $TCodeReplace;
// Catch up some codes
if (ord($PS_Data[0]) < 10) return '';
if (substr($PS_Data, 0, 8) == '/CIDInit') return '';
// Some text inside (...) can be found outside the [...] sets, then ignored
// => disable the processing of [...] is the easiest solution
$Result = ExtractPSTextElement($PS_Data);
// echo "Code=$PS_Data\nRES=$Result\n\n";
// Remove/translate some codes
return strtr($Result, $TCodeReplace);
}
?>
Hi,
here are some more fixes to Luc's pdf2text function. It works well for my tasks.
Two fixes:
1) Different platforms put different characters after the "stream" keyword, for example "stream\n", "stream\r" or "stream\r\n". So we detect which one is used first.
2) Some non-text blocks are detected as text, so we added a function "FilterNonText".
<?php
function handleV2($data){
// try detecting \n, \r or \r\n variation
$tmp = strpos($data, "stream");
$end_stream_delimiter = substr($data, $tmp+6, 2);
if($end_stream_delimiter != "\r\n") {
$end_stream_delimiter = substr($end_stream_delimiter, 0, 1);
}
//echo bin2hex($end_stream_delimiter); // - debug information
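// initialise the chunk counter, the chunk list and the result
$j = 0;
$a_chunks = array();
$result_data = '';
// grab objects and then grab their contents (chunks)
$a_obj = getDataArray($data,"obj","endobj");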
foreach($a_obj as $obj){
$a_filter = getDataArray($obj,"<<",">>");
if (is_array($a_filter)){
$j++;
$a_chunks[$j]["filter"] = $a_filter[0];
$a_data = getDataArray($obj,"stream".
$end_stream_delimiter,"endstream");
if (is_array($a_data)){
$a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream".$end_stream_delimiter),
strlen($a_data[0])-
strlen("stream".$end_stream_delimiter)-strlen("endstream"));
}
}
}
// decode the chunks
foreach($a_chunks as $chunk){
// look at each chunk and decide how to decode it - by looking at the contents of the filter
$a_filter = explode("/",$chunk["filter"]);
if (!empty($chunk["data"])){
// look at the filter to find out which encoding has been used
// (strpos, not substr: we are searching for the filter name)
if (strpos($chunk["filter"],"FlateDecode")!==false){
$data =@ gzuncompress($chunk["data"]);
if (trim($data)!=""){
// CHANGED HERE, before: $result_data .= ps2txt($data);
$result_data .= FilterNonText(PS2Text_New($data));
} else {
//$result_data .= "x";
}
}
}
}
return $result_data;
}
function FilterNonText($data) {
for($i=1;$i<9;$i++) {
if(strpos($data, chr($i)) !== false) {
return ""; // not text, something strange
}
}
return $data;
}
?>
Warning: this is only a patch to the code by "luc at phpt dot org". You must use his solution first, then replace the function with this patch.
Hi,
To find the number of pages in a PDF file, I found this:
<?php
// Shown as a plain function; add "public" back if you paste it into a class.
function getNumPagesInPDF(array $arguments = array())
{
@list($PDFPath) = $arguments;
$stream = @fopen($PDFPath, "rb");
if(!$stream)
return false;
$PDFContent = @fread($stream, filesize($PDFPath));
fclose($stream);
if(!$PDFContent)
return false;
$firstValue = 0;
$secondValue = 0;
if(preg_match("/\/N\s+([0-9]+)/", $PDFContent, $matches)) {
$firstValue = $matches[1];
}
if(preg_match_all("/\/Count\s+([0-9]+)/s", $PDFContent, $matches))
{
$secondValue = max($matches[1]);
}
return (($secondValue != 0) ? $secondValue : max($firstValue, $secondValue));
}
?>
/*
Folks, there is an excellent tutorial from Rasmus Lerdorf available at (it does not support I.E.)
http://talks.php.net/show/osconpdf/
where the father of PHP explains text, fonts, images and their attributes nicely, with working snippets.
Another tutorial can be found at
www.devshed.com/c/a/PHP/Building-PDF-Documents-with-PHP-5
Below are the sizes of some common PDF page formats.
The origin is at the lower left and the basic unit is the DTP point (pt).
1 pt = 1/72 inch = 0.35277777778 mm
Some common page sizes (in pt)
Format     Width  Height
US-Letter    612     792
US-Legal     612    1008
US-Ledger   1224     792
11x17        792    1224
A0          2380    3368
A1          1684    2380
A2          1190    1684
A3           842    1190
A4           595     842
A5           421     595
A6           297     421
B5           501     709
*/
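The unit relation in the note above (1 pt = 1/72 inch) can be turned into a pair of conversion helpers. A small sketch: mm_to_pt and pt_to_mm are names made up here and are not part of PDFlib.

```php
<?php
// 1 pt = 1/72 inch and 1 inch = 25.4 mm, so 1 pt = 25.4 / 72 mm.
const MM_PER_POINT = 25.4 / 72;

// Hypothetical helper: millimetres to DTP points.
function mm_to_pt(float $mm): float
{
    return $mm / MM_PER_POINT;
}

// Hypothetical helper: DTP points to millimetres.
function pt_to_mm(float $pt): float
{
    return $pt * MM_PER_POINT;
}

// A4 is 210 x 297 mm; rounding gives the 595 x 842 entry in the table.
printf("%d x %d\n", round(mm_to_pt(210)), round(mm_to_pt(297)));
```

The table values are rounded to whole points, so converting them back to millimetres gives results that are slightly off the nominal paper size.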
For those of us who do not want to pay for a commercial license to use PDFlib, I suggest TCPDF:
http://tcpdf.sf.net
TCPDF is an open-source PHP class for generating PDF files on the fly without requiring external extensions. This class has already been adopted by a large number of PHP projects such as phpMyAdmin, Drupal, Joomla, Xoops, TCExam, etc.
Since version 2.1, TCPDF supports UTF-8 Unicode and bidirectional languages such as Arabic and Hebrew.
To get this to work on Windows, do not use escapeshellcmd().
From the online help:
Following characters are preceded by a backslash: #&;`|*?~<>^()[]{}$\, \x0A and \xFF. ' and " are escaped only if they are not paired. In Windows, all these characters plus % are replaced by a space instead.
So you are probably passing duff paths to pdf2text.exe
Removing escapeshellcmd worked for me. Just make darned sure you are in control of what is being passed through to your system call.
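One way to stay in control of what reaches the system call is to quote each argument on its own with escapeshellarg() rather than running escapeshellcmd() over the whole command line. A minimal sketch: build_pdf2text_command is a made-up name, and the assumption that the extractor takes the input file as its only argument is mine, so check it against your tool.

```php
<?php
// Hypothetical helper: builds the command line for an external extractor.
// Assumes the tool takes the input file as its only argument.
function build_pdf2text_command(string $exe, string $pdfPath): string
{
    // escapeshellarg() quotes one argument as a single unit, so spaces in
    // the path survive. The whole line is never passed to escapeshellcmd().
    return escapeshellarg($exe) . ' ' . escapeshellarg($pdfPath);
}

$cmd = build_pdf2text_command('pdf2text.exe', 'C:\\docs\\annual report.pdf');
echo $cmd, "\n";
// $output = shell_exec($cmd); // run only with paths you control
```

Note that on Windows escapeshellarg() still replaces percent signs, exclamation marks and double quotes with spaces, so paths containing those characters need separate handling.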
Here is a function to test whether a file is a PDF without using any external library.
<?php
define('PDF_MAGIC', "\x25\x50\x44\x46\x2D"); // the five bytes "%PDF-"
function is_pdf($filename) {
// read only the first strlen(PDF_MAGIC) bytes of the file
return file_get_contents($filename, false, null, 0, strlen(PDF_MAGIC)) === PDF_MAGIC;
}
?>
It does not check whether the whole file is valid, only whether the correct header is present at the beginning of the file.
dompdf is also a great PDF creation interface. It basically takes your HTML and CSS and builds the PDF from that, with absolute positions and so on.
I was having trouble with streaming inline PDFs using PHP 5.0.2 and Apache 2.0.54.
This is my code:
<?php
header("Pragma: public");
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");
header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
header("Cache-Control: must-revalidate");
header("Content-type: application/pdf");
header("Content-Length: ".filesize($file));
header("Content-disposition: inline; filename=$file");
// Accept-Ranges takes a range unit, not a file size; this script always
// sends the whole file, so it advertises "none"
header("Accept-Ranges: none");
readfile($file);
exit();
?>
It worked fine in Mozilla Firefox (1.0.7), but IE (6.0.2800.1106) would not bring up the Adobe Reader plugin and instead asked me to save the output or open it as a PHP file.
Oddly enough, I turned off zlib.output_compression and it started working. I guess the compression was confusing IE. I tried leaving out the Content-Length header, thinking the cause might be a mismatch between the uncompressed file size and the compressed size actually received, but without it Firefox broke too.
What I ended up doing was disabling zlib compression for the PDF output pages using ini_set:
<?php
ini_set('zlib.output_compression','Off');
?>
Maybe this will help someone. Will post over in the PDF section as well.
I was searching for a low-cost/open-source option for combining static HTML files [as templates] with dynamic output from Perl or PHP routines etc. Sooner or later I found out that this was the most stable, fastest and most customizable way to produce usable PDFs with nice formatting:
1] create the HTML page output [Perl -> HTML output, direct HTML output from any app, PHP echoes etc.] [sort these HTML files locally]
2] convert all the HTML [including web image links, tables, font formatting etc.] to [E]PS files with the Perl app html2ps [as mentioned below]
http://user.it.uu.se/~jan/html2ps.html [sort all PS files by their future PDF page positions]
3] use the free ps2pdf/ps2pdfwr Linux application
http://www.ps2pdf.com/convert/index.htm [uses Ghostscript, Ghostview libs and so on]
It has great formatting options like headers, footers, numbering etc.
[sort the PDF files]
4] combine all PDF files into one PDF file with pdftk [PDF toolkit], which offers optional compression/encryption, background stamps etc.
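Steps 2 to 4 above can be sketched as a PHP helper that only builds the shell commands, so they can be inspected before anything is run. This is an illustration under assumptions: build_report_commands is a made-up name, and the flags shown (html2ps -o, ps2pdf with input and output file, pdftk ... cat output) reflect common usage of those tools and should be checked against the versions installed.

```php
<?php
// Hypothetical helper: returns the shell commands for steps 2 to 4
// without running them.
function build_report_commands(array $htmlFiles, string $finalPdf): array
{
    $commands = [];
    $pdfFiles = [];
    foreach ($htmlFiles as $html) {
        // Derive the intermediate file names from the HTML file name.
        $base = preg_replace('/\.html?$/i', '', $html);
        $ps   = $base . '.ps';
        $pdf  = $base . '.pdf';
        // Step 2: HTML to PostScript.
        $commands[] = 'html2ps -o ' . escapeshellarg($ps) . ' ' . escapeshellarg($html);
        // Step 3: PostScript to PDF (ps2pdf wraps Ghostscript).
        $commands[] = 'ps2pdf ' . escapeshellarg($ps) . ' ' . escapeshellarg($pdf);
        $pdfFiles[] = escapeshellarg($pdf);
    }
    // Step 4: concatenate the PDFs in the order the HTML files were given.
    $commands[] = 'pdftk ' . implode(' ', $pdfFiles) . ' cat output ' . escapeshellarg($finalPdf);
    return $commands;
}

foreach (build_report_commands(['01-title.html', '02-intro.html'], 'report.pdf') as $cmd) {
    echo $cmd, "\n"; // pass each one to exec() once the commands look right
}
```

Naming the HTML files with a numeric prefix, as in the example, is one simple way to get the sorting by future page position that the steps call for.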
You may ask why I use different scripts:
- the Perl/PHP combination is great: in my experience Perl is faster at some tasks, such as the conversion to PS files
- PS to PDF is quicker than going directly from PHP to PDF [in my experience!]
- I have total control over every file; whenever I change the HTML files used as templates, I only need an editor or another app for it [online or offline].
P.S. I had to build an open-source solution for creating simple analysis reports made up of things like:
- a first page [name / title / # / date]
- some static info [like an introduction, copyrights etc.]
- some dynamic info [output from PHP -> database queries] combined
with HTML tags/images etc.
All of this is mixed [and so separated into files for transparency]. The three-step approach, data -> HTML, HTML -> PS, PS -> PDF, is also easier and quicker to program or adjust at every step.
Correct me if I'm wrong [mail me]
ing. Valentijn Langendorff
Design & Technologist
After one whole day spent understanding how PDFlib works, I came to the conclusion that it is hard enough to draw using just words, and on top of that you may need something like four lines of code to draw a single line. So I wrote my own functions to make life easier and the code easier to understand and modify. I also made a function that draws a rectangle with rounded corners, with the option to fill it ;)
You can get it from http://www.deulos.com/pdf_php.php
Feel free to make suggestions or whatever you like ;o)
Yet another addition to the PDF text extraction code last posted by jorromer. That code only seemed to work for PDF 1.2 (Acrobat 3.x) or below. This pdfExtractText function uses regular expressions to cover cases I have found in PDF 1.3 and 1.4 documents. The code also handles closing brackets in the text stream, which were ignored by the previous version. My regular expression skills are somewhat lacking, so improvements may be possible by a more skilled programmer. I'm sure there are still cases that this function will not handle, but I haven't come across any yet...
<?php
function pdf2string($sourcefile) {
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
$searchstart = 'stream';
$searchend = 'endstream';
$pdfText = '';
$pos = 0;
$pos2 = 0;
$startpos = 0;
while ($pos !== false && $pos2 !== false) {
$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);
if ($pos !== false && $pos2 !== false){
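// Skip the end-of-line marker that follows the "stream" keyword.
$start = $pos + strlen($searchstart);
if (substr($content, $start, 2) === "\r\n") {
$start += 2;
} else if (substr($content, $start, 1) === "\n" || substr($content, $start, 1) === "\r") {
$start++;
}
// Drop the end-of-line marker that precedes "endstream".
$end = $pos2;
if (substr($content, $end - 2, 2) === "\r\n") {
$end -= 2;
} else if (substr($content, $end - 1, 1) === "\n" || substr($content, $end - 1, 1) === "\r") {
$end--;
}
$textsection = substr($content, $start, max(0, $end - $start));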
$data = @gzuncompress($textsection);
$pdfText .= pdfExtractText($data);
$startpos = $pos2 + strlen($searchend) - 1;
}
}
return preg_replace('/(\s)+/', ' ', $pdfText);
}
function pdfExtractText($psData){
if (!is_string($psData)) {
return '';
}
$text = '';
// Handle brackets in the text stream that could be mistaken for
// the end of a text field. I'm sure you can do this as part of the
// regular expression, but my skills aren't good enough yet.
$psData = str_replace('\)', '##ENDBRACKET##', $psData);
$psData = str_replace('\]', '##ENDSBRACKET##', $psData);
preg_match_all(
'/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
$psData,
$matches
);
for ($i = 0; $i < sizeof($matches[0]); $i++) {
if ($matches[3][$i] != '') {
// Run another match over the contents.
preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
foreach ($subMatches[1] as $subMatch) {
$text .= $subMatch;
}
} else if ($matches[4][$i] != '') {
$text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
}
}
// Translate special characters and put back brackets.
$trans = array(
'...' => '…',
'\205' => '…',
'\221' => chr(145),
'\222' => chr(146),
'\223' => chr(147),
'\224' => chr(148),
'\226' => '-',
'\267' => '•',
'\(' => '(',
'\[' => '[',
'##ENDBRACKET##' => ')',
'##ENDSBRACKET##' => ']',
chr(133) => '-',
chr(141) => chr(147),
chr(142) => chr(148),
chr(143) => chr(145),
chr(144) => chr(146),
);
$text = strtr($text, $trans);
return $text;
}
?>
If you want to display the number of pages (for example: page 1 of 3) then the following code could be helpful:
<?php
// ...
$pdf->begin_page_ext(842, 595, "");
// ... add text, images, ...
$pdf->suspend_page("");
$pdf->begin_page_ext(842, 595, "");
// ... add text, images, ...
$pdf->suspend_page("");
// ... create all pages
$pdf->resume_page("pagenumber 1");
// ... add the number of pages to page 1
$pdf->end_page_ext("");
$pdf->resume_page("pagenumber 2");
// ... add the number of pages to page 2
$pdf->end_page_ext("");
// ...
?>
I recently used mattb's code below for extracting text from PDF files. I modified it to extract only the text fields.
I hope this helps someone.
Here is the function:
<?php
$text = pdf2string("file.pdf");
echo $text;
function pdf2string($sourcefile){
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
$searchstart = 'stream';
$searchend = 'endstream';
$pdfdocument = '';
$pos = 0;
$pos2 = 0;
$startpos = 0;
while( $pos !== false && $pos2 !== false ){
$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);
if ($pos !== false && $pos2 !== false){
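// skip the end-of-line marker after "stream" and drop the one before "endstream"
$start = $pos + strlen($searchstart);
if (substr($content, $start, 2) === "\r\n") $start += 2;
else if (substr($content, $start, 1) === "\n" || substr($content, $start, 1) === "\r") $start++;
$end = $pos2;
if (substr($content, $end - 2, 2) === "\r\n") $end -= 2;
else if (substr($content, $end - 1, 1) === "\n" || substr($content, $end - 1, 1) === "\r") $end--;
$textsection = substr($content, $start, max(0, $end - $start));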
$data = @gzuncompress($textsection);
$data = ExtractText2($data);
$startpos = $pos2 + strlen($searchend) - 1;
if ($data === false){
return -1;}
$pdfdocument .= $data;}}
return $pdfdocument;}
function ExtractText2($postScriptData){
$sw = true;
$text = '';
$textStart = 0;
$len = strlen($postScriptData);
while ($sw){
$ini = strpos($postScriptData, '(', $textStart);
$end = strpos($postScriptData, ')', $textStart+1);
if (($ini>0) && ($end>$ini)){
$valtext = strpos($postScriptData,'Tj',$end+1);
if ($valtext == $end + 2)
$text .= substr($postScriptData,$ini+1,$end - $ini - 1);}
$textStart = $end + 1;
if ($len<=$textStart) $sw=false;
if (($ini == 0) && ($end == 0)) $sw=false;}
$trans = array("\\341" => "a","\\351" => "e","\\355" => "i","\\363" => "o","\\223" => "","\\224" => "");
$text = strtr($text, $trans);
return $text;
}
?>
I found this info about PDFlib scopes on a Chinese (I think) site and translated it. I was trying to call pdf_setfont and kept getting the wrong scope error. It turns out the call has to be in the Page scope, so pdf_setfont only works when called between pdf_begin_page and pdf_end_page.
#########################################
When a PDFlib API function is called, the error "Can't - IN 'document' scope" occurs.
PDFlib has a concept of "scope": every PDFlib API function may only be called within certain scopes. This error occurs when a function is called outside the scopes it is allowed in. Use the table below to check the position of your API call.
Path: started by one of PDF_moveto(), PDF_circle(), PDF_arc(), PDF_arcn(), PDF_rect() and ended by one of PDF_stroke(), PDF_closepath_stroke(), PDF_fill(), PDF_fill_stroke(), PDF_closepath_fill_stroke(), PDF_clip(), PDF_endpath()
Page: between PDF_begin_page() and PDF_end_page(), but outside path scope
Template: between PDF_begin_template() and PDF_end_template(), but outside path scope
Pattern: between PDF_begin_pattern() and PDF_end_pattern(), but outside path scope
Font: between PDF_begin_font() and PDF_end_font(), but outside glyph scope
Glyph: between PDF_begin_glyph() and PDF_end_glyph(), but outside path scope
Document: between PDF_open_*() and PDF_close(), but outside page, template and pattern scope
Object: between PDF_new() and PDF_delete(), but outside any other scope
Null: outside object scope
Any: all scopes other than null
##########################################
Hope this helps others as much as it helped me!!!
I recently tested Donatas' code below for the extraction of text from PDF files. After running into a few problems where PDF files were not being read at all, I've modified it somewhat. It still isn't perfect, but should work great for searching. Thanks Donatas.
<?php
$test = pdf2string("<pathtoPDFfile>");
echo "$test";
# Returns a -1 if uncompression failed
function pdf2string($sourcefile)
{
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
# Locate all text hidden within the stream and endstream tags
$searchstart = 'stream';
$searchend = 'endstream';
$pdfdocument = "";
$pos = 0;
$pos2 = 0;
$startpos = 0;
# Iterate through each stream block
while( $pos !== false && $pos2 !== false )
{
# Grab beginning and end tag locations if they have not yet been parsed
$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);
if( $pos !== false && $pos2 !== false )
{
# Extract compressed text from between stream tags and uncompress
$textsection = substr($content, $pos + strlen($searchstart) + 2, $pos2 - $pos - strlen($searchstart) - 1);
$data = @gzuncompress($textsection);
# Clean up text via a special function
$data = ExtractText($data);
# Increase our PDF pointer past the section we just read
$startpos = $pos2 + strlen($searchend) - 1;
if( $data === false ) { return -1; }
$pdfdocument = $pdfdocument . $data;
}
}
return $pdfdocument;
}
function ExtractText($postScriptData)
{
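$plainText = '';
$textStart = 0;
# strpos() may return 0, so compare against false explicitly
while( ($textStart = strpos($postScriptData, '(', $textStart)) !== false && ($textEnd = strpos($postScriptData, ')', $textStart + 1)) !== false )
{
# A ")" preceded by a backslash is escaped, so keep looking for the real end
while( $textEnd !== false && $postScriptData[$textEnd - 1] == '\\' ) { $textEnd = strpos($postScriptData, ')', $textEnd + 1); }
if( $textEnd === false ) { break; }
$plainText .= substr($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
if( substr($postScriptData, $textEnd + 1, 1) == ']' ) // This adds quite some additional spaces between the words
{
$plainText .= ' ';
}
$textStart = $textEnd + 1;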
}
return stripslashes($plainText);
}
?>
<?PHP
/* A small helper function to convert millimeters to points */
function calcToPt($intMillimeter) {
$intPoints = ($intMillimeter*72)/25.4;
$intPoints = round($intPoints);
return $intPoints;
}
/* For example: Create DIN A4 210x297 mm */
pdf_begin_page( $pdf, calcToPt(210), calcToPt(297)); // 595x842 pt
?>
I've been looking for a way to extract plain text from PDF documents (needed to search for text inside them). Not being able to find one, I wrote the needed functions myself. Here you go, folks.
<?php
function pdf2string ($sourceFile)
{
$textArray = array ();
$objStart = 0;
$fp = fopen ($sourceFile, 'rb');
$content = fread ($fp, filesize ($sourceFile));
fclose ($fp);
$searchTagStart = chr(13).chr(10).'stream';
$searchTagStartLength = strlen ($searchTagStart);
while ((($objStart = strpos ($content, $searchTagStart, $objStart)) && ($objEnd = strpos ($content, 'endstream', $objStart+1))))
{
$data = substr ($content, $objStart + $searchTagStartLength + 2, $objEnd - ($objStart + $searchTagStartLength) - 2);
$data = @gzuncompress ($data);
if ($data !== FALSE && strpos ($data, 'BT') !== FALSE && strpos ($data, 'ET') !== FALSE)
{
$textArray [] = ExtractText ($data);
}
$objStart = $objStart < $objEnd ? $objEnd : $objStart + 1;
}
return $textArray;
}
function ExtractText ($postScriptData)
{
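$plainText = '';
$textStart = 0;
// strpos() may return 0, so compare against false explicitly
while (($textStart = strpos ($postScriptData, '(', $textStart)) !== false && ($textEnd = strpos ($postScriptData, ')', $textStart + 1)) !== false)
{
// a ")" preceded by a backslash is escaped, so keep looking for the real end
while ($textEnd !== false && $postScriptData[$textEnd - 1] == '\\') { $textEnd = strpos ($postScriptData, ')', $textEnd + 1); }
if ($textEnd === false) { break; }
$plainText .= substr ($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
if (substr ($postScriptData, $textEnd + 1, 1) == ']') //this adds quite some additional spaces between the words
{
$plainText .= ' ';
}
$textStart = $textEnd + 1;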
}
return stripslashes ($plainText);
}
?>
Those looking for a free replacement for PDFlib may consider pslib at http://pslib.sourceforge.net, which produces PostScript that can easily be turned into PDF by Acrobat Distiller or Ghostscript. The API is very similar and even hypertext functions are supported. There is also a PHP extension for pslib in PECL, called ps.
Here is another great tutorial on basic PDF building w/ PHP:
http://hotwired.lycos.com/webmonkey/02/20/index3a.html?tw=programming
About creating a PDF document based on the content of another document (let's say a text file):
I tried to pass the name of the file I want to read from the sender page to the PDF-creator page through a link. When I called the PDF-creator page via a link like your_root/create_pdf.php?filename=$your_file_name, it did not behave well if it had a line like $filename = $_GET["filename"] before creating the PDF document.
I solved this by replacing the link on the sender page with a form and a button: the form has "create_pdf.php" as its action, "post" as its method, and a hidden field containing the "filename" value. It works this way if the PDF-creator page has a line like $filename = $_POST["filename"].
I would like to understand why it works this way and not the other way.
I hope this helps. Here are the pieces of code I used.
Sender page:
print("<form name='to_pdf' action='see_pdf_file.php' method='post'>");
print("<br/><input type='submit' value='PDF'><input type='hidden' name='filename' value='$filename'></form>");
PDF-creator page:
<?php
$filename = $_POST["filename"];
// file_get_contents() opens and closes the file itself; no fopen()/fclose() needed
$file_content = file_get_contents($filename);
//
$file_content = wordwrap($file_content,72,"|");
$a_row = explode("|",$file_content);
$i = 0;
//
$pdf = pdf_new();
pdf_open_file($pdf, "");
pdf_begin_page($pdf, 595, 842);
pdf_set_font($pdf, "Times-Roman", 16, "host");
pdf_add_outline($pdf, "Page 1");
pdf_set_value($pdf, "textrendering", 1);
pdf_show_xy($pdf, 'The content of the file:',50,700);
while (isset($a_row[$i]) && $a_row[$i] != "")
{
pdf_continue_text($pdf,$a_row[$i]);
$i++;
}
pdf_end_page($pdf);
pdf_close($pdf);
//
$data = pdf_get_buffer($pdf);
//
header("Content-type: application/pdf");
header("Content-disposition: inline; filename=test.pdf");
header("Content-length: " . strlen($data));
//
echo $data;
?>
PDFlib and PHP 4.3.1 used.
Thanks.
Load the extension, open a PDF, add a font, modify the PDF in memory and send it to the browser:
<?php
// no cache headers:
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");
header("Last-Modified: ".gmdate("D, d M Y H:i:s")." GMT");
header("Cache-Control: no-store, no-cache, must-revalidate");
header("Cache-Control: post-check=0, pre-check=0", false);
header("Pragma: no-cache");
$ext_name="libpdf_php.so";
// libpdf_php.so is the PDFLIB for SunOS by "PDFlib GmbH"
// visit http://www.pdflib.com
// if the extension is not automatically loaded by Apache
// dl() will try to load it on demand:
// extension_loaded() expects the extension name ('pdf'), not the file name
if (!extension_loaded('pdf') && !@dl($ext_name))
{
?>
<table width="100%" border="0"><tr><td align="center">
<table style="border: solid #f0f0f0 2px;"><tr>
<td valign="middle" style="padding: 20px; margin: 0px;">
<p style="font-family: arial; font-size: 12px; ">
<b>Sorry,</b><br>
<br>
A PDF can not be generated right now.<br>
The administrator has been informed and will fix this as
soon as possible.<br>
Please try again later.
</p>
</td></tr></table>
</td></tr></table>
<?php
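// "\n" is only expanded inside double quotes; $_SERVER works without register_globals
mail('admin@domain.com','Error: PDFLib not found',
"Called by script:\n ".$_SERVER['SCRIPT_FILENAME'].'?'.$_SERVER['QUERY_STRING'],
"From: warnings@domain.com\n");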
exit;
} // verify that extension is usable
// unique serial number:
srand(microtime()*10000);
$usnr= gmdate("Ymd-His-").rand(1000,9999).'-';
$pdf_file=$usnr.'result.pdf';
$src_file='source.pdf';
// create pdf object
$pdf = pdf_new();
pdf_open_file($pdf, ''); // empty file name: the PDF is generated in memory
pdf_set_parameter($pdf, 'serial', 'if-you-have-one');
// fonts to embed, they are in the folder of this file:
pdf_set_parameter($pdf, 'FontAFM', 'TradeGothic=Tg______.afm');
pdf_set_parameter($pdf, 'FontOutline', 'TradeGothic=Tg______.pfb');
pdf_set_parameter($pdf, 'FontPFM', 'TradeGothic=Tg______.pfm');
// load the source file:
$src_doc =pdf_open_pdi($pdf,$src_file,'', 0);
$src_page =pdf_open_pdi_page($pdf,$src_doc,1,'');
$src_width =pdf_get_pdi_value($pdf,'width' ,$src_doc,$src_page,0);
$src_height=pdf_get_pdi_value($pdf,'height',$src_doc,$src_page,0);
pdf_begin_page($pdf, $src_width, $src_height);
{
// place the sourcefile to the background of the actual page:
pdf_place_pdi_page($pdf,$src_page,0,0,1,1);
pdf_close_pdi_page($pdf,$src_page);
// modify the page:
pdf_set_font($pdf, 'TradeGothic', 8, 'host');
pdf_show_xy($pdf, 'Now: '.gmdate("Y-m-d H:i:s"),50,50);
}
pdf_end_page($pdf);
pdf_close($pdf);
// prepare output:
$pdfdata = pdf_get_buffer($pdf); // to echo the pdf-data
$pdfsize = strlen($pdfdata); // IE requires the datasize
// real datatype headers:
header('Content-type: application/pdf');
header('Content-disposition: attachment; filename="'.$pdf_file.'"');
header('Content-length: '.$pdfsize);
echo $pdfdata;
exit; // keep this one so no #13#10 or #32 will be written
?>
Took me some time to find how to add a centered aligned footer, here's how:
<?php
// place footer line, centered. 297.5 is half of 595 pt, the (rounded) width of an A4 page
$p->fit_textline($textline, 297.5, 35, " position=center");
?>
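A minimal sketch of where the 297.5 comes from, assuming the page width is known in millimeters. The helper name centerXForMm is made up for this example and is not part of PDFlib. Computing the value avoids hard-coding it for other page sizes.

```php
<?php
// Hypothetical helper: horizontal center of a page, in points,
// computed from the page width in millimeters.
function centerXForMm(float $widthMm): float
{
    $widthPt = $widthMm * 72 / 25.4; // 1 inch = 25.4 mm = 72 pt
    return $widthPt / 2;
}

// A4 is 210 mm wide, which is about 595.28 pt, so the center is near 297.64 pt.
$centerX = centerXForMm(210);
```

The result can be used in place of the literal 297.5 as the x argument of the fit_textline() call.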
To extend Alex's example earlier, you can use a couple of entries inside the PDF document to get the total number of pages without using any extension. I would have added the whole code, but the site keeps saying "line is too long... yadayada".
Open the document using fopen("$file", "rb"); (for reading).
Test approximately the first 1000 bytes against the following regex:
<?php
if(preg_match("/\/N\s+([0-9]+)/", $contents, $found)) {
return $found[1];
}
?>
If that doesn't return anything, you have to read the rest of the file:
<?php
preg_match_all("/\/Type\s*\/Pages\s*\/Kids\s+\[.*?\]\s*\/Count\s+([0-9]+)/s", $contents, $found);
?>
This may return more than one match, so look through them for the highest value, which is the total number of pages in your doc.
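Putting the two steps together could look roughly like this. It is only a sketch: countPdfPages is a made-up name, the function expects the raw file contents as a string, and it does not depend on the order of the keys inside the page tree node. It will not find anything in files that keep the page tree inside compressed object streams or that are encrypted.

```php
<?php
// Hypothetical helper: estimate the page count from the raw bytes of a PDF file.
// Returns null when no count can be found.
function countPdfPages(string $contents): ?int
{
    // Linearized files store the page count as /N near the start of the file.
    $head = substr($contents, 0, 1024);
    if (strpos($head, '/Linearized') !== false
        && preg_match('/\/N\s+([0-9]+)/', $head, $found)) {
        return (int) $found[1];
    }

    // Otherwise inspect every object that is a page tree node (/Type /Pages)
    // and keep the largest /Count; the root node has the highest value.
    $best = null;
    foreach (explode('endobj', $contents) as $object) {
        if (preg_match('/\/Type\s*\/Pages\b/', $object)
            && preg_match('/\/Count\s+([0-9]+)/', $object, $found)) {
            $best = max($best ?? 0, (int) $found[1]);
        }
    }
    return $best;
}
```

Call it with the result of file_get_contents() on the PDF and treat a null return as "unknown".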
The other issue with DOMpdf is that it has some pretty painful flaws.
You have to supply full paths to everything (images, includes, JavaScript files, etc.). And boy, do I mean everything.
Even then, it is not 100% sound. It cannot handle complex sites: it breaks the design and gives you about a million broken images.
Don't get me wrong, it's GREAT for lower-end, simpler sites, but if you have a site that, say, has JavaScript navigation, Flash, and a bunch of container divs, it's really not going to do the job.
The above library seems to be the best fit, as about the only way to get high-end sites to work is just to manually write it out yourself using the functions above.
Sorry to bust anyone's bubble. Good luck.
There is an XPDF Win32 binary package at SourceForge that provides a working pdftotext.
I've tried the PHP code below, but it didn't work.
domPDF is not such a great PDF creator because it doesn't support foreign characters.
I seriously tried to get PDF parsing to work to use it in the indexing for fulltext search for a document management. But none of the pdf2text functions below worked for my test cases (among them an openoffice generated pdf file and a file generated by fpdf).
But I found a REALLY WORKING SOLUTION! On linux systems, install the XPDF package. It comes with a tool called pdftotext. Use php code similar to the following to get the text content of your pdf files:
<?php
$file = "test.pdf";
$outpath = preg_replace("/\.pdf$/", "", $file).".txt";
// escapeshellarg() quotes the file name as a single argument (escapeshellcmd() does not)
system("pdftotext ".escapeshellarg($file), $ret);
if ($ret == 0)
{
$value = file_get_contents($outpath);
unlink($outpath);
print $value;
}
if ($ret == 127)
print "Could not find pdftotext tool.";
if ($ret == 1)
print "Could not find pdf file.";
?>
The solution works on all test cases and is much more powerful than any of the previous pure php functions posted here, although only available on linux.
http://www.digitaljunkies.ca/dompdf/index.php
PHP5 class that converts HTML to PDF. From the website:
"At its heart, dompdf is (mostly) CSS2.1 compliant HTML layout and rendering engine written in PHP. It is a style-driven renderer: it will download and read external stylesheets, inline style tags, and the style attributes of individual HTML elements. It also supports most presentational HTML attributes."
The easiest way to get the text of a PDF is to install xpdf (on Red Hat: yum -y install xpdf), then run pdftotext your.pdf, which will generate your.txt.
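If you would rather not deal with a temporary .txt file, pdftotext can write to stdout when "-" is given as the output file. A minimal sketch, assuming pdftotext is installed and on the PATH; the function names buildPdftotextCommand and pdfToText are made up for this example.

```php
<?php
// Hypothetical helper: build the shell command that prints a PDF's text to stdout.
function buildPdftotextCommand(string $pdfPath): string
{
    // escapeshellarg() quotes the path; "-" as the output file means stdout.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical wrapper: returns the extracted text, or null if the command failed.
function pdfToText(string $pdfPath): ?string
{
    exec(buildPdftotextCommand($pdfPath), $lines, $status);
    return $status === 0 ? implode("\n", $lines) : null;
}
```

Calling pdfToText() with the path to your PDF then gives you the text directly, or null if pdftotext is missing or cannot read the file.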
For those of us that do not want to pay for a commercial license to use PDFlib in a closed-source project, there are at least two good alternatives: FPDF and TCPDF
http://www.fpdf.org/
PHP4 and PHP5 support
http://sourceforge.net/projects/pdf-php
PHP5 support only
I've improved the code snippet for pdf2txt, which handled PDF version 1.2.
It is now possible to translate PDF versions >1.2 into plain text as well.
Sven
<?php
// Function : pdf2txt()
// Arguments : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
// their translation to plain text - returning the plain
// text at the end
// Authors : Jonathan Beckett, 2005-05-02
// : Sven Schuberth, 2007-03-29
function pdf2txt($filename){
$data = getFileData($filename);
$s=strpos($data,"%")+1;
$version=substr($data,$s,strpos($data,"%",$s)-1);
if(substr_count($version,"PDF-1.2")==0)
return handleV3($data);
else
return handleV2($data);
}
// handles version 1.2
function handleV2($data){
// grab objects and then grab their contents (chunks)
$a_obj = getDataArray($data,"obj","endobj");
$a_chunks = array();
$j = 0;
foreach($a_obj as $obj){
$a_filter = getDataArray($obj,"<<",">>");
if (is_array($a_filter)){
$j++;
$a_chunks[$j]["filter"] = $a_filter[0];
$a_data = getDataArray($obj,"stream\r\n","endstream");
if (is_array($a_data)){
$a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
}
}
}
// decode the chunks
$result_data = "";
foreach($a_chunks as $chunk){
// look at each chunk and decide how to decode it - by looking at the contents of the filter
$a_filter = explode("/",$chunk["filter"]); // split() was removed in PHP 7
if (!empty($chunk["data"])){
// look at the filter to find out which encoding has been used
if (strpos($chunk["filter"],"FlateDecode")!==false){
$data =@ gzuncompress($chunk["data"]);
if (trim($data)!=""){
$result_data .= ps2txt($data);
} else {
//$result_data .= "x";
}
}
}
}
return $result_data;
}
//handles versions >1.2
function handleV3($data){
// grab objects and then grab their contents (chunks)
$a_obj = getDataArray($data,"obj","endobj");
$result_data="";
foreach($a_obj as $obj){
//check if it a string
if(substr_count($obj,"/GS1")>0){
//the strings are between ( and )
preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
if(is_array($field))
foreach($field as $data)
$result_data.=$data[1];
}
}
return $result_data;
}
function ps2txt($ps_data){
$result = "";
$a_data = getDataArray($ps_data,"[","]");
if (is_array($a_data)){
foreach ($a_data as $ps_text){
$a_text = getDataArray($ps_text,"(",")");
if (is_array($a_text)){
foreach ($a_text as $text){
$result .= substr($text,1,strlen($text)-2);
}
}
}
} else {
// the data may just be in raw format (outside of [] tags)
$a_text = getDataArray($ps_data,"(",")");
if (is_array($a_text)){
foreach ($a_text as $text){
$result .= substr($text,1,strlen($text)-2);
}
}
}
return $result;
}
function getFileData($filename){
$handle = fopen($filename,"rb");
$data = fread($handle, filesize($filename));
fclose($handle);
return $data;
}
function getDataArray($data,$start_word,$end_word){
$start = 0;
$end = 0;
$a_result = null; // stays null when nothing is found, so the callers' is_array() checks fail
while ($start!==false && $end!==false){
$start = strpos($data,$start_word,$end);
if ($start!==false){
$end = strpos($data,$end_word,$start);
if ($end!==false){
// data is between start and end
$a_result[] = substr($data,$start,$end-$start+strlen($end_word));
}
}
}
return $a_result;
}
?>
Some code that can be very helpful for beginners.
<?php
// Declare PDF File
$pdf = pdf_new();
PDF_open_file($pdf, ""); // empty file name: the PDF is generated in memory
// Set Document Properties
PDF_set_info($pdf, "author", "Alexander Pas");
PDF_set_info($pdf, "title", "PDF by PHP Example");
PDF_set_info($pdf, "creator", "Alexander Pas");
PDF_set_info($pdf, "subject", "Testing Code");
// Get fonts to use
pdf_set_parameter($pdf, "FontOutline", "Arial=arial.ttf"); // get a custom font
$font1 = PDF_findfont($pdf, "Helvetica-Bold", "winansi", 0); // declare default font
$font2 = PDF_findfont($pdf, "Arial", "winansi", 1); // declare custom font & embed into file
/*
You can safely use the following 14 font types (the default fonts)
Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
Symbol, ZapfDingbats
*/
// make the images
$image1 = PDF_open_image_file($pdf, "gif", "image.gif"); //supported filetypes are: jpeg, tiff, gif, png.
//Make First Page
PDF_begin_page($pdf, 450, 450); // page width and height.
$bookmark = PDF_add_bookmark($pdf, "Front"); // add a top level bookmark.
PDF_setfont($pdf, $font1, 12); // use this font from now on.
PDF_show_xy($pdf, "First Page!", 5, 225); // show this text; coordinates are measured from the lower left corner.
pdf_place_image($pdf, $image1, 255, 5, 1); // the last number is the scaling factor.
PDF_end_page($pdf); // End of Page.
//Make Second Page
PDF_begin_page($pdf, 450, 225); // page width and height.
$bookmark1 = PDF_add_bookmark($pdf, "Chapter1", $bookmark); // add a nested bookmark. (can be nested multiple times.)
PDF_setfont($pdf, $font2, 12); // use this font from now on.
PDF_show_xy($pdf, "Chapter1!", 225, 5);
PDF_add_bookmark($pdf, "Chapter1.1", $bookmark1); // add a nested bookmark (already in a nested one).
PDF_setfont($pdf, $font1, 12);
PDF_show_xy($pdf, "Chapter1.1", 225, 5);
PDF_end_page($pdf);
// Finish the PDF File
PDF_close($pdf); // End Of PDF-File.
$output = PDF_get_buffer($pdf); // assemble the file in a variable.
// Output Area
header("Content-type: application/pdf"); //set filetype to pdf.
header("Content-Length: ".strlen($output)); //content length
header("Content-Disposition: attachment; filename=test.pdf"); // you can use inline or attachment.
echo $output; // actual print area!
// Cleanup
PDF_delete($pdf);
?>
RedHat 9 + Apache 2.0 + PHP 4.3.2 + Oracle 9i + PDFlib 5.0.1 (binary distribution)
It seems to be a working bundle if you do some magic with ./configure:
RedHat 9:
kernel-2.4.20-18.9
Apache 2.0.46:
./configure --enable-so --enable-rewrite=shared --enable-status --enable-mpm=prefork
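PHP 4.3.2 (the key option is --without-tsrm-pthreads; the "# !!!" marker after it is only emphasis and must be removed before running, because text after the backslash breaks the line continuation):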
./configure \
--program-prefix= \
--prefix=/usr \
--exec-prefix=/usr \
--bindir=/usr/bin \
--sbindir=/usr/sbin \
--sysconfdir=/etc \
--datadir=/usr/share \
--includedir=/usr/include \
--libdir=/usr/lib \
--libexecdir=/usr/libexec \
--localstatedir=/var \
--sharedstatedir=/usr/com \
--mandir=/usr/share/man \
--infodir=/usr/share/info \
--with-config-file-path=/etc \
--with-config-file-scan-dir=/etc/php.d \
--without-tsrm-pthreads \ # !!!!!!!!!!!!!!!!!!!!
--with-zlib \
--with-gd \
--enable-gd-native-ttf \
--with-ttf \
--without-mysql \
--with-apxs2filter=/usr/local/apache2/bin/apxs \
--with-oci8 \
--enable-sigchild \
--enable-inline-optimization
Oracle9i:
ln -s $ORACLE_HOME/rdbms/public/nzerror.h $ORACLE_HOME/rdbms/demo/nzerror.h
ln -s $ORACLE_HOME/rdbms/public/nzt.h $ORACLE_HOME/rdbms/demo/nzt.h
ln -s $ORACLE_HOME/rdbms/public/ociextp.h $ORACLE_HOME/rdbms/demo/ociextp.h
If you want to use bundled GD-library then:
1) install following packages: libjpeg, libjpeg-devel, libpng, libpng-devel, freetype, freetype-devel, libtiff, libtiff-devel, zlib, zlib-devel
2) ln -s /usr/lib/libjpeg.so.62 /usr/lib/libjpeg.so
ln -s /usr/lib/libpng.so.62 /usr/lib/libpng.so
It seems to be a working combination, because it does NOT give you the following problems:
1) error message in Apache's error_log:
Module compiled with module API=20020429, debug=0, thread-safety=0
PHP compiled with module API=20020429, debug=0, thread-safety=1
2) error message in Apache's error_log:
[notice] child pid 12345 exit signal Segmentation fault (11)
3) crashes in MS Internet Explorer: it can show PDF output from your PHP script via the Acrobat plug-in and does not crash. No confusing messages about opening "Adobe Acrobat Control for ActiveX".
Hope it will save you some time.
Good luck,
Boris
