One of my clients has the following requirement: users upload files in doc, docx, xls, pdf or txt format, and PHP has to read the contents of those files and then count the words in them.
1. Reading DOC files with PHP
PHP has no built-in class or library for reading Word files, so here we use the antiword package (http://www.winfield.demon.nl/) to read doc files.
First, how to use it on Windows:
1. Open http://www.winfield.demon.nl/ (the antiword download page), find the Windows section (http://www.winfield.demon.nl/#Windows) and download the Windows build of antiword (antiword-0_37-windows.zip);
2. Extract the downloaded archive to the root of drive C;
One more thing to note: http://www.informatik.uni-frankfurt.de/~markus/antiword/00README.WIN contains the installation notes for Windows.
Environment variables need to be set: My Computer (right-click) -> Advanced -> Environment Variables -> create a new entry under the user variables at the top:
Variable name: HOME
Variable value: c:\home. This directory should exist; if it does not, create a home folder on drive C.
Then, under the system variables, edit Path and put %HOME%\antiword at the very front of its value.
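3. Start -> Run -> CMD, then change to the antiword directory;
Type antiword -h to check that it works.
4. Next, use the antiword -t command to read the contents of a doc file: first copy a doc file into the c:\antiword directory, then run
>antiword -t filename.doc
and the contents of the Word file are printed to the screen.
You may be asking what this has to do with reading Word files from PHP. Be patient; let's look at how to use this command from PHP.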
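<?php
// Single quotes keep the backslashes literal; inside double quotes "\xa" would be read as a hex escape.
$file = 'D:\xampp\htdocs\word_count\uploads\doc-english.doc';
// escapeshellarg() quotes the path so that spaces cannot split it into several arguments.
$content = shell_exec ( 'c:\antiword\antiword -f ' . escapeshellarg ( $file ) );
?>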
This reads the contents of the Word file into $content.
To read doc files on Linux, download the Linux archive; it contains a readme.txt file, and following the installation steps described there is all that is needed.
$content = shell_exec ( "/usr/local/bin/antiword -f " . escapeshellarg ( $file ) );
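Uploaded file names can contain spaces or shell metacharacters, so it may be worth quoting every value that goes into the command line. The following is only a minimal sketch: buildAntiwordCommand() is a hypothetical helper name, and the example path is made up for illustration.

```php
<?php
/**
 * Build an antiword command line with the binary and the document both quoted.
 * The function name and its parameters are hypothetical, not part of antiword.
 */
function buildAntiwordCommand(string $binary, string $file): string
{
    // escapeshellarg() wraps each value in quotes so the shell sees it as one argument.
    return escapeshellarg($binary) . ' -f ' . escapeshellarg($file);
}

// Example path (an assumption) containing a space.
$command = buildAntiwordCommand('/usr/local/bin/antiword', '/tmp/uploads/annual report.doc');
echo $command, PHP_EOL;

// $content = shell_exec($command); // null or false means nothing could be read
```

Building the command in one place also makes it easy to log the exact string when a conversion produces no output.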
2. Reading PDF files with PHP
PHP has no dedicated library for reading PDF content either, so we use a third-party package (xpdf). Again starting with Windows: download it and extract it to the root of drive C.
Start -> Run -> cmd -> cd /d c:\xpdf
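<?php
// Single quotes keep the backslashes in the Windows path literal.
$file = 'D:\xampp\htdocs\word_count\uploads\pdf-english.pdf';
// The trailing "-" tells pdftotext to write the text to standard output.
$content = shell_exec ( "c:\\xpdf\\pdftotext " . escapeshellarg ( $file ) . " -" );
?>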
This reads the contents of the PDF file into a PHP variable.
Installation on Linux is just as simple, so the steps are not listed here.
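<?php
// $file is the path of the uploaded PDF; it is quoted before being passed to the shell.
$content = shell_exec ( "/usr/bin/pdftotext " . escapeshellarg ( $file ) . " -" );
?>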
3. Reading ZIP files with PHP
First extract the zip file with PHP's zip support, then read the files inside the archive: Word files are read with antiword and PDF files with xpdf.
<?php
/**
 * Read the supported files inside a ZIP archive.
 *
 * @param string $file path to the zip file
 * @return string combined text of all supported entries
 */
function ReadZIPFile($file = '') {
    $content = "";
    $inValidFileName = array ();
    // Entries are extracted to a temporary folder named after the archive.
    $extractDir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === TRUE) {
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            // The extension has to be the suffix of the entry name.
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) {
                $zip->extractTo ( $extractDir, array ( $entry ) );
                $content .= CheckSystemOS ( $extractDir . "/" . $entry );
            } else {
                $inValidFileName [$i] = $entry; // unsupported entries are skipped
            }
        }
        $zip->close ();
        rrmdir ( $extractDir ); // remove the temporary folder
        /*if (file_exists ( $file )) {
            unlink ( $file );
        }*/
        return $content;
    } else {
        return "";
    }
}
?>
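The extension filter is easy to get subtly wrong. In a pattern written as \.(txt)|\.(doc)|\.(docx)|\.(pdf)$ the "$" anchor belongs only to the last branch, so the other branches match anywhere in the name. The sketch below illustrates the difference; isSupportedEntry() is a hypothetical helper name and the file names are invented examples.

```php
<?php
/**
 * Return true when an archive entry name ends in one of the supported extensions.
 * isSupportedEntry() is a hypothetical helper used only for this illustration.
 */
function isSupportedEntry(string $entry): bool
{
    // One group holds all alternatives, so "$" applies to every extension.
    return preg_match('#\.(txt|docx?|pdf)$#i', $entry) === 1;
}

var_dump(isSupportedEntry('report.DOCX'));   // bool(true)
var_dump(isSupportedEntry('notes.txt.exe')); // bool(false)

// The ungrouped form accepts the same name because ".txt" appears somewhere in it.
var_dump(preg_match('#\.(txt)|\.(doc)|\.(docx)|\.(pdf)$#i', 'notes.txt.exe')); // int(1)
```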
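4. Reading DOCX files with PHP
A docx file is actually a collection of XML files, and the document text is stored in word/document.xml.
Take any docx file and open it as a zip archive (or rename the .docx extension to .zip and extract it).
The word directory contains document.xml.
The contents of the docx file are stored in document.xml, so all we need to do is read that file.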
<?php
/**
 * Read a docx file.
 *
 * @param string $file path to the docx file
 * @return string text content of the file
 */
function parseWord($file) {
    $content = "";
    // document.xml is extracted to a temporary folder named after the docx file.
    $extractDir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === TRUE) {
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            // The body text lives in word/document.xml.
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") {
                $zip->extractTo ( $extractDir, array ( $entry ) );
                $filepath = $extractDir . "/" . $entry;
                $content = strip_tags ( file_get_contents ( $filepath ) );
                break;
            }
        }
        $zip->close ();
        rrmdir ( $extractDir ); // remove the temporary folder
        return $content;
    } else {
        return "";
    }
}
?>
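Two details may matter for word counting. Stripping the tags directly glues the last word of one paragraph to the first word of the next, and extracting to disk is not strictly necessary because ZipArchive::getFromName() can return an entry as a string. The following is a sketch under those assumptions, not the method used above; docxXmlToText() and readDocxText() are hypothetical names.

```php
<?php
/**
 * Convert the XML of word/document.xml to plain text.
 * docxXmlToText() is a hypothetical helper name.
 */
function docxXmlToText(string $xml): string
{
    // Insert a space at paragraph ends, tabs and line breaks so words stay separated.
    $xml = preg_replace('#</w:p>|<w:(?:tab|br)\b[^>]*/>#', ' ', $xml);
    // Remove the remaining tags, then turn XML entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

/**
 * Read the text of a docx file without extracting it to disk (hypothetical helper).
 */
function readDocxText(string $file): string
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return '';
    }
    // getFromName() returns the entry contents, or false when the entry is missing.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToText($xml);
}

$sample = '<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>world &amp; co</w:t></w:r></w:p>';
echo docxXmlToText($sample), PHP_EOL; // prints: Hello world & co
```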
If you want to create docx files from PHP, or convert docx files to XHTML or PDF, you can use phpdocx (http://www.phpdocx.com/).
5. Reading TXT files with PHP
Simply use PHP's file_get_contents function.
<?php
$file = 'D:\xampp\htdocs\word_count\uploads\eng.txt';
$content = file_get_contents ( $file ); // returns false if the file cannot be read
?>
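6. Reading Excel files with PHP
So far we have only read the file contents; how do we count the words?
PHP has a built-in function, str_word_count, that counts words, but when it is applied to the text antiword extracts from a doc file the result is far off.
So here we use the following function, written specifically for counting words: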
<?php
/**
 * Count the words in a text.
 *
 * @param string $text text content of the file
 * @return int word count of the content
 */
function StatisticWordsCount($text = '') {
    // $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // replace digits with spaces
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more)
    // $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more)
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row
    // Collapse whitespace last, so that removed dashes cannot leave double spaces behind.
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) );
    $len = strlen ( $text );
    if (0 === $len) {
        return 0;
    }
    $words = 1; // each single space separates two words
    while ( $len -- ) {
        if (' ' === $text [$len]) {
            ++ $words;
        }
    }
    return $words;
}
?>
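The same idea can be sketched with preg_split(), which avoids the manual loop over characters. This is an illustrative variant rather than a replacement: countWordsBySplit() is a hypothetical name, it follows the same cleanup rules (drop "|" and runs of dashes), and the sample text is an invented table in the style of plain-text converter output.

```php
<?php
/**
 * Count whitespace-separated tokens after dropping pipes and runs of dashes.
 * countWordsBySplit() is a hypothetical name used for this illustration.
 */
function countWordsBySplit(string $text): int
{
    $text = str_replace('|', '', $text);         // pipe characters, e.g. table borders
    $text = preg_replace('/-{2,}/', '', $text);  // runs of two or more dashes
    // PREG_SPLIT_NO_EMPTY discards empty pieces, so extra spaces never count as words.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

$sample = "|Name  |Value|\n|-----|-----|\n|total |42   |";
echo countWordsBySplit($sample), PHP_EOL; // prints 4 (Name, Value, total, 42)
```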
The complete code is as follows:
<?php
/**
 * Read a file with the reader that matches its extension and the server OS.
 *
 * @param string $file file path including the file name
 * @return string file content
 */
function CheckSystemOS($file = '') {
    $content = "";
    // $type = substr ( $file, strrpos ( $file, '.' ) + 1 );
    $type = pathinfo ( $file, PATHINFO_EXTENSION );
    // global $UNIX_ANTIWORD_PATH, $UNIX_XPDF_PATH;
    if (strtoupper ( substr ( PHP_OS, 0, 3 ) ) === 'WIN') { // this is a server using windows
        switch (strtolower ( $type )) {
            case 'doc' :
                $content = (string) shell_exec ( "c:\\antiword\\antiword -f " . escapeshellarg ( $file ) );
                break;
            case 'docx' :
                $content = parseWord ( $file );
                break;
            case 'pdf' :
                $content = (string) shell_exec ( "c:\\xpdf\\pdftotext " . escapeshellarg ( $file ) . " -" );
                break;
            case 'zip' :
                $content = ReadZIPFile ( $file );
                break;
            case 'txt' :
                $content = file_get_contents ( $file );
                break;
        }
    } else { // this is a server not using windows
        switch (strtolower ( $type )) {
            case 'doc' :
                $content = (string) shell_exec ( "/usr/local/bin/antiword -f " . escapeshellarg ( $file ) );
                break;
            case 'docx' :
                $content = parseWord ( $file );
                break;
            case 'pdf' :
                $content = (string) shell_exec ( "/usr/bin/pdftotext " . escapeshellarg ( $file ) . " -" );
                break;
            case 'zip' :
                $content = ReadZIPFile ( $file );
                break;
            case 'txt' :
                $content = file_get_contents ( $file );
                break;
        }
    }
    /*if (file_exists ( $file )) {
        @unlink ( $file );
    }*/
    return $content;
}
/**
 * Count the words in a text.
 *
 * @param string $text text content of the file
 * @return int word count of the content
 */
function StatisticWordsCount($text = '') {
    // $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // replace digits with spaces
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more)
    // $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more)
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row
    // Collapse whitespace last, so that removed dashes cannot leave double spaces behind.
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) );
    $len = strlen ( $text );
    if (0 === $len) {
        return 0;
    }
    $words = 1; // each single space separates two words
    while ( $len -- ) {
        if (' ' === $text [$len]) {
            ++ $words;
        }
    }
    return $words;
}
/**
 * Read a docx file.
 *
 * @param string $file path to the docx file
 * @return string text content of the file
 */
function parseWord($file) {
    $content = "";
    // document.xml is extracted to a temporary folder named after the docx file.
    $extractDir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === TRUE) {
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            // The body text lives in word/document.xml.
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") {
                $zip->extractTo ( $extractDir, array ( $entry ) );
                $filepath = $extractDir . "/" . $entry;
                $content = strip_tags ( file_get_contents ( $filepath ) );
                break;
            }
        }
        $zip->close ();
        rrmdir ( $extractDir ); // remove the temporary folder
        return $content;
    } else {
        return "";
    }
}
/**
 * Read the supported files inside a ZIP archive.
 *
 * @param string $file path to the zip file
 * @return string combined text of all supported entries
 */
function ReadZIPFile($file = '') {
    $content = "";
    $inValidFileName = array ();
    // Entries are extracted to a temporary folder named after the archive.
    $extractDir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === TRUE) {
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            // The extension has to be the suffix of the entry name.
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) {
                $zip->extractTo ( $extractDir, array ( $entry ) );
                $content .= CheckSystemOS ( $extractDir . "/" . $entry );
            } else {
                $inValidFileName [$i] = $entry; // unsupported entries are skipped
            }
        }
        $zip->close ();
        rrmdir ( $extractDir ); // remove the temporary folder
        /*if (file_exists ( $file )) {
            unlink ( $file );
        }*/
        return $content;
    } else {
        return "";
    }
}
/**
 * Remove a directory and everything inside it.
 *
 * @param string $dir directory path
 */
function rrmdir($dir) {
    if (is_dir ( $dir )) {
        $objects = scandir ( $dir );
        foreach ( $objects as $object ) {
            if ($object != "." && $object != "..") {
                if (filetype ( $dir . "/" . $object ) == "dir") {
                    rrmdir ( $dir . "/" . $object );
                } else {
                    unlink ( $dir . "/" . $object );
                }
            }
        }
        reset ( $objects );
        rmdir ( $dir );
    }
}
// Usage
$file = 'D:\xampp\htdocs\word_count\uploads\pdf-german.zip';
$word_number = StatisticWordsCount ( CheckSystemOS ( $file ) );
?>
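The Windows and non-Windows branches of CheckSystemOS() differ only in where the converter binaries live, so one possible simplification is to look the paths up once and share a single switch. The sketch below shows only the lookup; converterPaths() is a hypothetical helper, and the paths are the ones used in this article.

```php
<?php
/**
 * Pick the converter binaries for a platform name such as PHP_OS.
 * converterPaths() is a hypothetical helper; the paths are those used in the article.
 */
function converterPaths(string $osName): array
{
    // Same platform test as CheckSystemOS(): names starting with "WIN" mean Windows.
    $isWindows = strtoupper(substr($osName, 0, 3)) === 'WIN';
    return $isWindows
        ? array('doc' => 'c:\\antiword\\antiword', 'pdf' => 'c:\\xpdf\\pdftotext')
        : array('doc' => '/usr/local/bin/antiword', 'pdf' => '/usr/bin/pdftotext');
}

$paths = converterPaths(PHP_OS);
echo $paths['doc'], PHP_EOL; // path of the antiword binary for the current platform
```

Keeping the paths in one array also gives a single place to change them if the tools are installed elsewhere.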