Jan 30 2014, 07:42 AM | Post #1
Newbie | Group: Members | Posts: 7 | Joined: 27-November 12 | Member No.: 1,454
I will attempt to develop a simple parsing script to accomplish this (after posting the LinkedIn script below). Any assistance or ideas from this community of users are welcome for this awesome extension. Note that the LinkedIn parsing script is designed to take email content with tables and split it into separate articles, so much of it will not be needed for my outlined purpose.

jom25\administrator\components\com_post_by_email\helpers\parsers\LinkedIn.php

CODE
<?php
/**************************************************************
Variables you may use here (no need to explain their meaning):
    $from, $subject, $content, $created
Others:
    $email: the current email object being imported
    $params: the parameters associated with it

Any change made here will be saved in the resulting article.
If you want to skip the default saving routine (if you save the parsed
info while in this file) just add the following line:
    $continue = TRUE;

If a single e-mail message is parsed into several articles, fill in the
$articles array ('title'=>..., 'content'=>...) and do not set $continue = TRUE;
Each article will be saved separately according to the current email object settings.
**************************************************************/
defined('_JEXEC') or die('Restricted access');

$articles = array();
$chunks = explode('<table', $content);
for ($z=0; $z<count($chunks); $z++) {
    if (FALSE !== ($pos = mb_strpos($chunks[$z], 'Started by '))) {
        #Author
        $end = mb_strpos($chunks[$z], '</td>', $pos+10);
        $author = html_entity_decode(strip_tags(substr(trim($chunks[$z]),$pos+10,$end-($pos+10))));

        #Title (taken from two tables back; guarded so a missing chunk is not read)
        $title = '';
        $titlechunks = isset($chunks[$z-2]) ? explode('<td', $chunks[$z-2]) : array();
        if (!empty($titlechunks[1])) {
            $title = strip_tags('<td'.$titlechunks[1], '<a><strong>');
            #get rid of the style attribute
            if (FALSE !== ($pos = mb_strpos($title, 'style="'))) {
                $end = mb_strpos($title, '"', $pos+7);
                $title = substr($title, 0, $pos).substr($title, $end+1);
            }
            $title = trim($title);
        }

        #Intro (taken from the next table; guarded the same way)
        $intro = '';
        $introchunks = isset($chunks[$z+1]) ? explode('<td', $chunks[$z+1]) : array();
        if (!empty($introchunks[2])) { #was $titlechunks[2], which tested the wrong array
            $intro = strip_tags('<td'.$introchunks[2], '<a><strong><span>');
            #get rid of the author
            if (FALSE !== ($pos = mb_strpos($intro, '<span'))) {
                $end = mb_strpos($intro, '</span>', $pos+5);
                $intro = trim(substr($intro, 0, $pos).substr($intro, $end+7));
            }
            #get rid of the style attribute
            if (FALSE !== ($pos = mb_strpos($intro, 'style="'))) {
                $end = mb_strpos($intro, '"', $pos+7);
                $intro = trim(substr($intro, 0, $pos).substr($intro, $end+1));
            }
        }

        $content = $title.'<br />Started by '.$author;
        if (!empty($from)) $content .= '<br />Posted by '.$from;
        if (!empty($intro)) $content .= '<br /><br />'.$intro;

        #Summarize the info
        if (!empty($title)) $articles[] = array(
            'title' => preg_replace("/[[:blank:]]+/", ' ', strip_tags($title)),
            'content' => preg_replace("/[[:blank:]]+/", ' ', $content),
            'created' => $created, //left unchanged
        );
    }
}
if (empty($articles)) $continue = TRUE;
?>
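The multi-article mechanism described in the header comment (fill `$articles` with `'title'` and `'content'` entries) can be tried on something simpler than LinkedIn's nested tables. The following is only a minimal sketch, assuming an email whose sections are introduced by `<h2>` headings; `splitByHeadings()` is a hypothetical helper, not part of Post By Email, and the sample `$content` and `$created` values are made up for illustration.

```php
<?php
// Hypothetical helper (not part of Post By Email): one article per <h2> section.
function splitByHeadings($content, $created)
{
    $articles = array();
    // PREG_SPLIT_DELIM_CAPTURE keeps the captured heading text in the result:
    // [text before first heading, title 1, body 1, title 2, body 2, ...]
    $parts = preg_split('#<h2[^>]*>(.*?)</h2>#is', $content, -1, PREG_SPLIT_DELIM_CAPTURE);
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        $title = trim(strip_tags($parts[$i]));
        if ($title === '') {
            continue; // skip sections without a usable title
        }
        $articles[] = array(
            'title'   => $title,
            'content' => trim($parts[$i + 1]),
            'created' => $created,
        );
    }
    return $articles;
}

// Sample values standing in for what the component would provide.
$content = '<p>intro</p><h2>One</h2><p>a</p><h2 class="x">Two</h2><p>b</p>';
$created = '2014-01-30 07:42:00';
$articles = splitByHeadings($content, $created);
```

Inside a real parser file the last line would be followed by the same `if (empty($articles)) $continue = TRUE;` check that the LinkedIn script uses.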
Jan 30 2014, 08:02 AM | Post #2
Web Design Seo | Group: Root Admin | Posts: 4,332 | Joined: 29-April 09 | From: Sofia | Member No.: 1
supamic, thank you.
-------------------- Forum Rules (Bulgarian) | Forum Rules | How to receive support. 3D Web Design: web design, SEO optimization, Web Site Extensions, Oscommerce Addons, Wordpress plugins and Joomla Extensions. Website development, search engine optimization and SEO services.
Jan 30 2014, 10:51 AM | Post #3
Newbie | Group: Members | Posts: 7 | Joined: 27-November 12 | Member No.: 1,454
Here is my first go at it. The first pattern works, the second pattern doesn't, and I haven't gotten the third (CSS) pattern through testing yet. Any help or suggestions would be much appreciated! For this post I've removed the lines at the top and bottom of the script that I use to check the content length on entry and on exit:

    echo 'startcontentlength'.strlen($content);
    echo 'endcontentlength'.strlen($content);

The first pattern works, which leads me to believe I have done something wrong in the regular expression for the second pattern and that the rest of the process is sound; if not, let me know.

CODE
<?php
/***
 * @DIY Parser for Post By Email
 */
defined('_JEXEC') or die('Restricted access');

$patterns = array();
$patterns[0] = '|(Some Specific Text)|e';
$patterns[1] = '/^<a([a-zA-Z0-9 #:;"!=_-]+)>Some Other Specific Linked Text<\/a>$/i';
$patterns[2] = '/^<style type="text/css">([a-zA-Z0-9.,;:*%@"\'\[\]{}#!=_-]+)</style>$/e';
$patterns[3] = '/^<a([[:ascii:]]+)>Some Other Specific Linked Text<\/a>$/ie';

$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
$replacements[2] = '';
$replacements[3] = '';

$content = preg_replace($patterns, $replacements, $content);
?>

My test HTML email:

CODE
<style type="text/css">
@media only screen and (max-width: 660px) {
    table[class=w0], td[class=w0] { width: 0 !important; }
    #messagebody div.rcmBody table[class=w10], #messagebody div.rcmBody td[class=w10], #messagebody div.rcmBody img[class=w10] { width:10px !important; }
    #messagebody div.rcmBody table[class*=hide], #messagebody div.rcmBody td[class*=hide], #messagebody div.rcmBody img[class*=hide], #messagebody div.rcmBody p[class*=hide], #messagebody div.rcmBody span[class*=hide] { display:none !important; }
    #messagebody div.rcmBody .ExternalClass { width: 100%; display:block !important; }
    #messagebody div.rcmBody table td, #messagebody div.rcmBody table tr { border-collapse: collapse; }
    #messagebody div.rcmBody .yshortcuts, #messagebody div.rcmBody .yshortcuts a, #messagebody div.rcmBody .yshortcuts a:link,#messagebody div.rcmBody .yshortcuts a:visited, #messagebody div.rcmBody .yshortcuts a:hover, #messagebody div.rcmBody .yshortcuts a span { color: black; text-decoration: none !important; border-bottom: none !important; background: none !important}
    #messagebody div.rcmBody, #messagebody div.rcmBody td { font-family: 'Helvetica Neue', Arial, Helvetica, Geneva, sans-serif; }
    #messagebody div.rcmBody .article-title { font-size: 18px; line-height:24px; color: #436E45; font-weight:bold; margin-top:0px; margin-bottom:18px; font-family: 'Helvetica Neue', Arial, Helvetica, Geneva, sans-serif; }
    #messagebody div.rcmBody .article-content img { max-width: 100% }
</style>
<p>Some Specific Text</p>
<p><a href="http://google.com">Some Other Specific Linked Text</a></p>

This post has been edited by supamic: Jan 31 2014, 08:23 AM
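A few likely reasons the second and third patterns in this post fail can be checked in isolation. The `^` and `$` anchors tie the match to the start and end of the whole `$content` string, the character class omits the `/` and `.` that appear in the `href`, and the `<style>` pattern contains unescaped `/` characters while also using `/` as its delimiter. Separately, the `/e` modifier was deprecated in PHP 5.5 and removed in PHP 7.0. The sketch below is an illustration under those assumptions, not the poster's final script; the shortened `$html` sample and the `#`-delimited replacement patterns are made up for the demonstration.

```php
<?php
// Shortened stand-in for the test email in the post.
$html = '<style type="text/css">p { color: red; }</style>' . "\n"
      . '<p>Some Specific Text</p>' . "\n"
      . '<p><a href="http://google.com">Some Other Specific Linked Text</a></p>';

// The second pattern from the post: anchored to the whole string, and the
// class has no "/" or "." so it cannot cover href="http://google.com".
$original = '/^<a([a-zA-Z0-9 #:;"!=_-]+)>Some Other Specific Linked Text<\/a>$/i';
$originalHits = preg_match($original, $html);

// An unescaped "/" inside a "/"-delimited pattern ends the pattern early;
// PCRE then reads the rest as modifiers and fails to compile (warning silenced).
$badDelimiter = '/^<style type="text/css">x</style>$/';
$badResult = @preg_match($badDelimiter, $html);

// Alternative sketch: "#" as delimiter, no anchors, [^>]* for the attributes,
// and a lazy .*? with the "s" flag so the style body may span several lines.
$linkPattern  = '#<a\b[^>]*>Some Other Specific Linked Text</a>#i';
$stylePattern = '#<style\b[^>]*>.*?</style>#is';
$cleaned = preg_replace(array($linkPattern, $stylePattern), '', $html);
```

Here `$originalHits` is `0` (no match), `$badResult` is `false` (the pattern does not compile), and `$cleaned` keeps only the two paragraphs, the second one now empty.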
Feb 1 2014, 11:20 AM | Post #4
Newbie | Group: Members | Posts: 7 | Joined: 27-November 12 | Member No.: 1,454
After a lot of struggling to get my PHP regex to recognize the same patterns I was getting from this online regex tester, I finally found a pattern that includes all 128 ASCII characters, which makes life so much easier. That pattern is the hexadecimal range ([\x00-\x7F]+); it matches carriage returns, semicolons and any other ASCII character someone can include in an email.

jom25\administrator\components\com_post_by_email\helpers\parsers\diy.php

CODE
<?php
/***
 * @DIY Parser for Post By Email
 * @author:Supamic
 * @v0.0.2
 */
defined('_JEXEC') or die('Restricted access');

$patterns = array();
$patterns[0] = '/(<a ([\x00-\x7F]+)>Some Other Specific Linked Text<\/a>)/i';
$patterns[1] = '|(<style ([\x00-\x7F]+)<\/style>)|i';
$patterns[2] = '/(<a ([\x00-\x7F]+)>Some less difficult Linked Text<\/a>)/i';
$patterns[3] = '|Some Specific Text|i';
$patterns[4] = '|(To stop receiving emails, <a ([\x00-\x7F]+)>click here<\/a>.)|i';
$patterns[5] = '|(This email was sent to <a ([\x00-\x7F]+)>mrtest@email.com<\/a>)|i';
$patterns[6] = '|(<p> </p>)|i';

// One empty replacement per pattern, then override the ones that insert text.
$replacements = array_fill(0, count($patterns), '');
$replacements[3] = "Specific Replacement";
$replacements[5] = "Second Replacement";

// Apply the patterns one at a time, in index order.
foreach($patterns as $k => $v)
    $content = preg_replace($patterns[$k], $replacements[$k], $content);

//$continue = TRUE;
?>

Here is my latest test file. I sent it from several email accounts, because Gmail strips out some things and leaves others, while other webmails encode or strip out different things; the more email accounts you test from, the better.

CODE
<html>
<body>
<style type="text/css" scoped="scoped">
<!--
@media only screen and(max-width: 660px) {
    table[class=w0], td[class=w0] { width: 0 !important; }
    #messagebody div.rcmBody img[class=w10] { width:10px !important; }
    #messagebody div.rcmBody td[class*=hide] { display:none !important; }
    #messagebody div.rcmBody .ExternalClass { width: 100%; display:block !important; }
    #messagebody div.rcmBody table tr { border-collapse: collapse; }
    #messagebody div.rcmBody, #messagebody div.rcmBody td { font-family:'Helvetica Neue', Arial, Helvetica, Geneva, sans-serif; }
    #messagebody div.rcmBody .article-content img { max-width: 100% }
-->
</style>
<p><strong><em>Some Specific Text</em></strong></p>
<p><a style="max-width: 100%;width:400px !important;font-family: 'Helvetica Neue';" href="http://google.ca">Some Other Specific Linked Text</a> </p>
<p><a href="http://google.ca">Some less difficult Linked Text</a> </p>
This email was sent to <a style="color: #248acf;" href="mrtest@email.com">mrtest@email.com</a>. To stop receiving emails, <a style="color: #248acf;" href="http://www.testing.ca/unsubscribe?e=c0ff&utm_campaign=crowdfund&n=8">click here</a>.
</body>
</html>

Here is what this test email was converted into for the Joomla article, from 2 sources:

CODE
FROM HOST WEBMAIL
<p><strong><em>Specific Replacement</em></strong> </p>
<p> </p>
<p> </p>
<p>Second Replacement.</p>
<p> </p>
<p> </p>
<p> </p>

FROM GMAIL
<p><span style="white-space: pre;"> </span> </p>
<p><strong><em>Specific Replacement</em></strong> </p>
<p> </p>
<p> </p>
<p>Second Replacement. <br /><br /> </p>

Some of the paragraph spacing is created by Joomla after parsing.

Here are some useful links for understanding preg_replace() better:
http://www.php.net/manual/en/ref.pcre.php
http://www.php.net/manual/en/reference.pcr...ttern.posix.php
http://ca2.php.net/manual/en/function.preg-replace.php
http://regex101.com/
http://www.regular-expressions.info/quickstart.html

This post has been edited by Web Design Seo: Apr 26 2016, 11:20 AM
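One thing to watch with `([\x00-\x7F]+)` is that it is greedy and also matches `>` and `<`, so a pattern starting at `<a ` can begin at an earlier link and swallow everything up to the text it is looking for. The comparison below is a small sketch with made-up sample HTML (the `a.example` and `b.example` links are placeholders); `[^>]*` is one common way to keep the match inside a single tag, not the only one.

```php
<?php
// Two links; only the second one should be removed.
$html = '<p><a href="http://a.example">Keep me</a></p>'
      . '<p><a href="http://b.example">Remove me</a></p>';

// Same style as the patterns in the post: any run of ASCII characters.
$greedy = '/<a ([\x00-\x7F]+)>Remove me<\/a>/i';

// Bounded alternative: attributes may not contain ">", so the match
// cannot run past the end of the opening tag it started in.
$bounded = '/<a\b[^>]*>Remove me<\/a>/i';

$greedyResult  = preg_replace($greedy, '', $html);
$boundedResult = preg_replace($bounded, '', $html);
```

With this input the greedy pattern starts at the first `<a ` and removes both links, leaving `<p></p>`, while the bounded pattern removes only the second link.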
Feb 3 2014, 05:14 AM | Post #5
Newbie | Group: Members | Posts: 7 | Joined: 27-November 12 | Member No.: 1,454
I've found that the preg_replace regex patterns are sound, although some unpredictable results occur depending on the order in which they are applied, combined with Joomla's code cleanup. To better understand what happens while the parsing script runs, without adding a bunch of testing variables to be fed to the cron log, I've added a replacement marker for each pattern. Alter your preg_replace line as below; this replaces each match with a string like r3 or r25, according to the pattern's index, for easy troubleshooting. I've also added a line to remove the annoying "Fwd:" from subject titles if it has been left in by accident; the str_ireplace() function is case-insensitive.

CODE
$content = preg_replace($patterns[$k], $replacements[$k]."r$k", $content);
$subject = str_ireplace('Fwd: ', '', $subject);

I will be setting up a separate array to remove (or alter) static strings using str_ireplace(), because preg_replace() is probably overkill for those. Doing exact string matches and replacements before the regular-expression matches will also help in understanding the order in which the $content variable is changed.

http://ca1.php.net/manual/en/function.str-ireplace.php

This post has been edited by supamic: Feb 3 2014, 06:38 AM
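The two-pass idea described in this post (exact strings first, regular expressions second, with index markers while troubleshooting) might look roughly like the sketch below. The `$static` map, the `$debug` switch and the sample `$content` and `$subject` values are assumptions made for illustration; in a real parser file the component supplies `$content` and `$subject`.

```php
<?php
// Sample values standing in for what the component would provide.
$content = '<p>Some Specific Text</p>'
         . '<p><a href="http://google.ca">Some less difficult Linked Text</a></p>';
$subject = 'FWD: Newsletter';

// Pass 1: exact, case-insensitive string replacements (no regex needed).
$static = array(
    'Some Specific Text' => 'Specific Replacement',
);
$content = str_ireplace(array_keys($static), array_values($static), $content);
$subject = str_ireplace('Fwd: ', '', $subject);

// Pass 2: regular expressions, applied in index order.
$patterns = array(
    '#<a\b[^>]*>Some less difficult Linked Text</a>#i',
);
$debug = true; // set to false once the patterns behave as expected
foreach ($patterns as $k => $pattern) {
    // While debugging, leave a marker such as "r0" where pattern $k matched.
    $marker = $debug ? "r$k" : '';
    $content = preg_replace($pattern, $marker, $content);
}
```

With `$debug` on, the article body shows exactly which pattern fired at which position; here the link is replaced by `r0` and the subject loses its `FWD: ` prefix.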
Jul 22 2014, 03:19 PM | Post #6
Newbie | Group: Members | Posts: 10 | Joined: 22-July 14 | Member No.: 2,076
Thanks supamic - I am busy testing with your script. Do you just have to create your diy.php file, upload it to the parsers directory and then select it to be used?
I am having trouble getting it to work.
Jun 25 2015, 10:54 PM | Post #7
Newbie | Group: Members | Posts: 1 | Joined: 31-October 14 | From: UK | Member No.: 2,134
I've been working on what I hope is the mother-of-all parsers. It uses an open-source library, HTML Purifier, to strip, clean and tidy the HTML and to add classes to elements. I hope someone finds this helpful.
To install:
PBE Setup: In order to make the most of the parser, set the following fields under the General tab in the mailbox setup:
I've tested with Outlook and GMail. But if anyone has any questions, comments or requests - just let me know.

CODE
<?php
/**************************************************************
Variables you may use here (no need to explain their meaning):
    $from, $subject, $content, $created
Others:
    $email: the current email object being imported
    $params: the parameters associated with it
**************************************************************/
/***
 * @DIY Parser for Post By Email
 * @author:TheITD
 * @v0.1
 */
defined('_JEXEC') or die('Restricted access');

// --------------- EDIT THE ELEMENTS BELOW ACCORDING TO WHICH ONES YOU WANT TO KEEP ------->
$allowedHtml = '
    a[href|title|target],
    strong,b,em,i,strike,u,
    p,ol,ul,li,
    h1, h2, h3, h4, h5, h6,
    hr,
    img[src|width|height|alt|title|class],
    table[border|cellspacing|cellpadding|width|height|align|summary|bgcolor|background|bordercolor],
    tr[rowspan|width|height|align|valign|bgcolor|background|bordercolor],tbody,thead,tfoot,
    td[colspan|rowspan|width|height|align|valign|bgcolor|background|bordercolor|scope],
    th[colspan|rowspan|width|height|align|valign|scope]
';

// --------------- ADD ANY CLASSES, WITH ASSOCIATED ELEMENTS TO THE FOLLOWING ARRAY ------->
$class_list = array(
    //'p' => 'small', // As an example, this would result in <p class="small">
    'img' => 'tm-article-image uk-border-circle auto-size img-left',
    'ul' => 'check',
    'ol' => 'uk-list-line'
);

// --------------- IT'S POSSIBLE THAT YOU WON'T NEED TO EDIT ANYTHING BELOW THIS LINE ------->

// Initialise the html purifier library
require_once (__DIR__ . '/htmlpurifier-4.6.0-lite/library/HTMLPurifier.auto.php');

// custom class to manipulate the element classes
if(!class_exists('HTMLPurifier_AttrTransform_AnchorClass')) {
    class HTMLPurifier_AttrTransform_AnchorClass extends HTMLPurifier_AttrTransform
    {
        protected $class; // class name(s) to add to the element

        function __construct($class) {
            $this->class = $class;
        }

        public function transform($attr, $config, $context) {
            // keep any predefined class and append ours, separated by a space
            if (isset($attr['class'])) {
                $attr['class'] .= ' ' . $this->class;
            } else {
                $attr['class'] = $this->class;
            }
            return $attr;
        }
    }
}

// set the configuration parameters
$config = HTMLPurifier_Config::createDefault();

// Configure the purifier cache
if (defined('PURIFIER_CACHE')) {
    $config->set('Cache.SerializerPath', PURIFIER_CACHE);
} else {
    // Disable the cache entirely
    $config->set('Cache.DefinitionImpl', null);
}

$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');
$config->set('HTML.Allowed', $allowedHtml);        // Use the above string to define what's acceptable and what's not
$config->set('AutoFormat.AutoParagraph', true);    // Force text to be contained within a paragraph
$config->set('Core.EscapeInvalidTags', false);     // Remove (rather than escape) invalid tags, e.g. metadata from MS Word
$config->set('AutoFormat.RemoveEmpty', true);      // Clean up a bit
$config->set('HTML.TidyLevel', 'heavy');           // heavy!
$config->set('AutoFormat.Linkify', true);          // If <a href> is allowed, turn plain-text URLs into links

// define tag transformations for bold and italic
$def = $config->getHTMLDefinition(true);
$def->info_tag_transform['b'] = new HTMLPurifier_TagTransform_Simple('strong'); // convert <b> to <strong>
$def->info_tag_transform['i'] = new HTMLPurifier_TagTransform_Simple('em');     // convert <i> to <em>
$def->info_tag_transform['br'] = new HTMLPurifier_TagTransform_Simple('p');     // convert line breaks to paragraphs

// add class to particular elements
if($class_list) {
    foreach($class_list as $elem => $class) {
        $element = $def->addBlankElement($elem);
        // register the transform that adds the classes to this element
        $element->attr_transform_post[] = new HTMLPurifier_AttrTransform_AnchorClass($class);
    }
}

// now for the good bit!
$purifier = new HTMLPurifier($config);
$content = $purifier->purify($content);
?>

Thanks for a great component - and I hope you're able to get back on the JED list soon, as this has been a fantastic feature for me for years now.

This post has been edited by theitd: Jun 25 2015, 11:07 PM
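The `transform()` method in the script above appends to an existing `class` attribute, and when doing that the old and new values need a space between them or the class names run together. The class-merging step can be tried on its own, without HTML Purifier installed. `appendClass()` below is a hypothetical standalone helper written for this illustration, not part of HTML Purifier or of the posted parser; it also drops duplicate class names, which the posted script does not attempt.

```php
<?php
// Hypothetical helper: merge new class names into an attribute array.
function appendClass(array $attr, $class)
{
    // Split both the existing and the new value on whitespace.
    $existing = isset($attr['class'])
        ? preg_split('/\s+/', trim($attr['class']), -1, PREG_SPLIT_NO_EMPTY)
        : array();
    $added = preg_split('/\s+/', trim($class), -1, PREG_SPLIT_NO_EMPTY);

    // Join with single spaces, keeping the first occurrence of each name.
    $attr['class'] = implode(' ', array_unique(array_merge($existing, $added)));
    return $attr;
}

$withClass = appendClass(array('src' => 'a.png', 'class' => 'logo'), 'img-left auto-size');
$noClass   = appendClass(array('src' => 'b.png'), 'img-left');
$repeated  = appendClass(array('class' => 'img-left'), 'img-left auto-size');
```

The same body could be used inside `transform($attr, $config, $context)`, with `$this->class` in place of the `$class` argument.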
Jun 26 2015, 09:07 AM | Post #8
Web Design Seo | Group: Root Admin | Posts: 4,332 | Joined: 29-April 09 | From: Sofia | Member No.: 1
Thank you, @theitd! I will take a look at HTML Purifier.