> Diy Parser For Post By Email, user collaborations for parsing emails in Post by Email component
supamic
post Jan 30 2014, 07:42 AM
Post #1


Newbie
*

Group: Members
Posts: 7
Joined: 27-November 12
Member No.: 1,454



The extension includes a LinkedIn parsing script that no longer works because LinkedIn changed the format of its newsletters. I am posting that script here for anyone to read and decipher. Hopefully its content can serve as a basis for more general or more specific parsing scripts. For example, I would like to write a parsing script that strips certain string segments from all emails being processed, and strips other segments only from emails sent by certain senders.

I will attempt to develop a simple parsing script (after posting the LinkedIn script) to accomplish this. Any assistance or ideas from this community of users of this awesome extension are welcome.

Note that the LinkedIn script is designed to take email content containing tables and split it into separate articles, so much of it will not be needed for the purpose outlined above.

jom25\administrator\components\com_post_by_email\helpers\parsers\LinkedIn.php
CODE

<?php
/**************************************************************

Variables you may use here (no need to explain their meaning):

$from, $subject, $content, $created

Others:
$email: the current email object being imported
$params: the parameters associated with it

Any change made here will be saved in the resulting article.
If you want to skip the default saving routine (if you save the
parsed info while in this file) just add the following line:

$continue = TRUE;

If a single e-mail message is parsed into several articles, fill in
the $articles array ('title'=>..., 'content'=>...) and do not set
$continue = TRUE;
Each article will be saved separately according to the current
email object settings.

**************************************************************/

defined('_JEXEC') or die('Restricted access');

$articles = array();

$chunks = explode('<table', $content);

for ($z=0; $z<count($chunks); $z++) {
if (FALSE !== ($pos = mb_strpos($chunks[$z], 'Started by '))) {
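#Author
#offsets come from mb_strpos() on the untrimmed chunk, so slice that same string with mb_substr()
$start = $pos + mb_strlen('Started by ');
$end = mb_strpos($chunks[$z], '</td>', $start);
$author = trim(html_entity_decode(strip_tags(mb_substr($chunks[$z], $start, $end-$start))));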

#Title
$title = '';
#the title sits two chunks back; guard against a missing chunk
$titlechunks = isset($chunks[$z-2]) ? explode('<td', $chunks[$z-2]) : array();
if (!empty($titlechunks[1])) {
$title = strip_tags('<td'.$titlechunks[1], '<a><strong>');
#get rid of the style attribute
if (FALSE !== ($pos = mb_strpos($title, 'style="'))) {
$end = mb_strpos($title, '"', $pos+7);
$title = substr($title, 0, $pos).substr($title, $end+1);
}
$title = trim($title);
}

#Intro
$intro = '';
$introchunks = isset($chunks[$z+1]) ? explode('<td', $chunks[$z+1]) : array();
if (!empty($introchunks[2])) {
$intro = strip_tags('<td'.$introchunks[2], '<a><strong><span>');
#get rid of the author
if (FALSE !== ($pos = mb_strpos($intro, '<span'))) {
$end = mb_strpos($intro, '</span>', $pos+5);
$intro = trim(substr($intro, 0, $pos).substr($intro, $end+7));
}
#get rid of the style attribute
if (FALSE !== ($pos = mb_strpos($intro, 'style="'))) {
$end = mb_strpos($intro, '"', $pos+7);
$intro = trim(substr($intro, 0, $pos).substr($intro, $end+1));
}
}

$content = $title.'<br />Started by '.$author;
if (!empty($from))
$content .= '<br />Posted by '.$from;
if (!empty($intro))
$content .= '<br /><br />'.$intro;

#Summarize the info
if (!empty($title))
$articles[] = array(
'title' => preg_replace("/[[:blank:]]+/", ' ', strip_tags($title)),
'content' => preg_replace("/[[:blank:]]+/", ' ', $content),
'created' => $created, //left unchanged
);
}
}

if (empty($articles))
$continue = TRUE;
?>
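The goal described at the top of this post (strip some segments from every email, and others only for certain senders) could be sketched roughly as follows. This is only a minimal sketch and not part of the extension: the helper name pbe_strip_segments, the rule arrays and the example addresses are hypothetical, and it assumes $from contains the sender's address as plain text.

```php
<?php
// Hypothetical helper, not part of Post by Email. The function_exists() guard
// avoids a redeclaration error if the parser file is included more than once.
if (!function_exists('pbe_strip_segments')) {
    function pbe_strip_segments($content, $from, array $globalSegments, array $senderSegments)
    {
        // Segments removed from every email (case-insensitive literal match).
        $content = str_ireplace($globalSegments, '', $content);

        // Segments removed only when the sender string contains the array key.
        foreach ($senderSegments as $sender => $segments) {
            if (stripos($from, $sender) !== false) {
                $content = str_ireplace($segments, '', $content);
            }
        }
        return $content;
    }
}

// Placeholder rules for illustration only.
$globalSegments = array('Sent from my phone');
$senderSegments = array(
    'news@example.com' => array('<p>Unsubscribe</p>'),
);

$demo = pbe_strip_segments(
    '<p>Hello</p><p>Unsubscribe</p>Sent from my phone',
    'Newsletter <news@example.com>',
    $globalSegments,
    $senderSegments
);
echo $demo, "\n";
```

Inside a parser file the call would presumably be $content = pbe_strip_segments($content, $from, $globalSegments, $senderSegments); since any change to $content is saved in the resulting article.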

Web Design Seo
post Jan 30 2014, 08:02 AM
Post #2


Web Design Seo
****

Group: Root Admin
Posts: 4,027
Joined: 29-April 09
From: Sofia
Member No.: 1



supamic, thank you.


--------------------
Forum Rules | How to receive support. 3D Web Design: web design, SEO optimization, Web Site Extensions, Oscommerce Addons, Wordpress plugins and Joomla Extensions. Website development, search engine optimization and SEO services.
supamic
post Jan 30 2014, 10:51 AM
Post #3


Newbie
*

Group: Members
Posts: 7
Joined: 27-November 12
Member No.: 1,454



Here is my first go at it. The first pattern works, the second doesn't, and I haven't yet tested the third (CSS) pattern. Any help or suggestions would be much appreciated! For this post I've removed the lines at the top and bottom of the script that I use to check the content length on entry and on exit:
echo 'startcontentlength'.strlen($content);
echo 'endcontentlength'.strlen($content);


Since the first pattern works, I believe the rest of the process is sound and the mistake is in the regular expression for the second pattern. If not, let me know.

CODE
<?php
/***
* @DIY Parser for Post By Email
*/
defined('_JEXEC') or die('Restricted access');

$patterns = array();
$patterns[0] = '|(Some Specific Text)|e';
$patterns[1] = '/^<a([a-zA-Z0-9 #:;"!=_-]+)>Some Other Specific Linked Text<\/a>$/i';
$patterns[2] = '/^<style type="text/css">([a-zA-Z0-9.,;:*%@"\'\[\]{}#!=_-]+)</style>$/e';
$patterns[3] = '/^<a([[:ascii:]]+)>Some Other Specific Linked Text<\/a>$/ie';

$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
$replacements[2] = '';
$replacements[3] = '';

$content = preg_replace($patterns, $replacements, $content);

?>




my test HTML email
CODE
<style type="text/css">
@media only screen and (max-width: 660px) {
table[class=w0], td[class=w0] { width: 0 !important;
}
#messagebody div.rcmBody table[class=w10], #messagebody div.rcmBody td[class=w10], #messagebody div.rcmBody img[class=w10] { width:10px !important;
}
#messagebody div.rcmBody table[class*=hide], #messagebody div.rcmBody td[class*=hide], #messagebody div.rcmBody img[class*=hide], #messagebody div.rcmBody p[class*=hide], #messagebody div.rcmBody span[class*=hide] { display:none !important;
}
#messagebody div.rcmBody .ExternalClass { width: 100%;
display:block !important;
}
#messagebody div.rcmBody table td, #messagebody div.rcmBody table tr { border-collapse: collapse;
}
#messagebody div.rcmBody .yshortcuts, #messagebody div.rcmBody .yshortcuts a, #messagebody div.rcmBody .yshortcuts a:link,#messagebody div.rcmBody .yshortcuts a:visited, #messagebody div.rcmBody .yshortcuts a:hover, #messagebody div.rcmBody .yshortcuts a span {
color: black;
text-decoration: none !important;
border-bottom: none !important;
background: none !important}
#messagebody div.rcmBody, #messagebody div.rcmBody td { font-family: 'Helvetica Neue', Arial, Helvetica, Geneva, sans-serif;
}
#messagebody div.rcmBody .article-title { font-size: 18px;
line-height:24px;
color: #436E45;
font-weight:bold;
margin-top:0px;
margin-bottom:18px;
font-family: 'Helvetica Neue', Arial, Helvetica, Geneva, sans-serif;
}
#messagebody div.rcmBody .article-content img { max-width: 100% }
</style>
<p>Some Specific Text</p>
<p><a href="http://google.com">Some Other Specific Linked Text</a></p>
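A few likely reasons why only the first pattern in this post matches can be checked in isolation. The sketch below uses a shortened stand-in for the test email and a simplified version of the style pattern; the variable names are made up for the example. It leaves out the "e" modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0 and is not needed for plain stripping.

```php
<?php
// Shortened stand-in for the test email in the post.
$html = "<style type=\"text/css\">\n#messagebody { width: 100%; }\n</style>\n"
      . "<p>Some Specific Text</p>\n"
      . "<p><a href=\"http://google.com\">Some Other Specific Linked Text</a></p>\n";

// 1. Original link pattern: ^ and $ require the link to be the whole message.
$anchored = '/^<a([a-zA-Z0-9 #:;"!=_-]+)>Some Other Specific Linked Text<\/a>$/i';

// 2. Anchors removed, but the class still lacks "/" and ".", which the href contains.
$narrowClass = '/<a([a-zA-Z0-9 #:;"!=_-]+)>Some Other Specific Linked Text<\/a>/i';

// 3. "Anything up to the closing >" accepts whatever attributes the tag has.
$anyAttrs = '/<a\b[^>]*>Some Other Specific Linked Text<\/a>/i';

$linkResults = array(
    preg_match($anchored, $html),
    preg_match($narrowClass, $html),
    preg_match($anyAttrs, $html),
);

// 4. Simplified style pattern: the unescaped "/" in text/css ends the pattern
//    early, compilation fails and preg_match() returns false (@ hides the warning).
$badDelimiter = '/^<style type="text/css">(.+)</style>$/';
$styleBroken = @preg_match($badDelimiter, $html);

// 5. A "|" delimiter avoids the escaping, "s" lets "." span newlines and
//    ".*?" stops at the first </style>.
$styleFixed = '|<style\b[^>]*>.*?</style>|is';
$cleaned = preg_replace(array($styleFixed, $anyAttrs), '', $html);

var_dump($linkResults, $styleBroken);
echo $cleaned;
```

When preg_replace() is given an array of patterns and one of them fails to compile, it returns NULL, so a single bad pattern can empty $content. Checking the result for NULL before assigning it back is a cheap safeguard.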


This post has been edited by supamic: Jan 31 2014, 08:23 AM
supamic
post Feb 1 2014, 11:20 AM
Post #4


Newbie
*

Group: Members
Posts: 7
Joined: 27-November 12
Member No.: 1,454



YAY, WORKING SCRIPT!!!

After a lot of struggling to get my PHP regex to recognize the same patterns that matched in an online regex tester, I finally found a pattern that covers all 128 ASCII characters, which makes life much easier. The pattern is the hexadecimal range ([\x00-\x7F]+). It includes carriage returns, semicolons and anything else in the ASCII range that someone can put in an email.

    TIPS
  • My PHP version: 5.3.28
  • Some patterns in the script start and end with / and some with |. I have been unable to discern any difference in my testing, but let me know if there is one and whether it is useful.
  • The "i" after the ending delimiter (/ or |) makes the match case-insensitive.
  • In pattern 3 the specific text is not wrapped in parentheses, while in pattern 6 it is; both work. The distinction is useful for more complex preg_replace usage.
  • The replacements array is set up to strip out every pattern match. To replace a match with something else, set the replacement at the same array index as the pattern.
  • The easiest way to test on a remote server is to modify your Post by Email cron entry so that its output is appended to a file you can refresh easily over an SFTP connection. This is where your parsing script will show errors and any test echos or var_dumps:
  • */1 * * * * /opt/php53/bin/php -q /home/.../com_post_by_email/cron.post_by_email.php >> /home/user/logs/0-cron.log
  • As you can see, I also set my cron to run every minute for testing. My host hates that and resets it frequently.
  • Also note that the allowed HTML tags in my Post by Email config do not include <div> or <table>, so there is no need to worry about those here.
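The delimiter, flag and parentheses tips can be tried out in a few lines. This is a small illustrative sketch with made-up sample text, not code from the extension.

```php
<?php
$text = '<p>SOME specific text</p>';

// The same pattern with two different delimiters behaves identically.
$slash = preg_replace('/some specific text/i', 'X', $text);
$pipe  = preg_replace('|some specific text|i', 'X', $text);

// Without the "i" flag the match is case-sensitive, so nothing is replaced.
$caseSensitive = preg_replace('/some specific text/', 'X', $text);

// With "/" as delimiter a literal slash must be escaped; with "|" it need not be.
$closeTagSlash = preg_replace('/<\/p>/', '', $text);
$closeTagPipe  = preg_replace('|</p>|', '', $text);

// Parentheses only matter when the replacement refers to the group, e.g. $1.
$withGroup = preg_replace('/(some) specific text/i', '$1 replaced', $text);

echo $slash, "\n", $caseSensitive, "\n", $closeTagPipe, "\n", $withGroup, "\n";
```

The practical difference between the delimiters is which character has to be escaped inside the pattern. With | as the delimiter, an unescaped | can no longer be used for alternation, so / (or another character such as #) is the safer choice when a pattern needs "this or that".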


jom25\administrator\components\com_post_by_email\helpers\parsers\diy.php
CODE
<?php
/***
* @DIY Parser for Post By Email
* @author:Supamic
* @v0.0.2
*/
defined('_JEXEC') or die('Restricted access');

$patterns = array();

$patterns[0] = '/(<a ([\x00-\x7F]+)>Some Other Specific Linked Text<\/a>)/i';
$patterns[1] = '|(<style ([\x00-\x7F]+)<\/style>)|i';
$patterns[2] = '/(<a ([\x00-\x7F]+)>Some less difficult Linked Text<\/a>)/i';
$patterns[3] = '|Some Specific Text|i';
// Periods are escaped so they match a literal "." only
$patterns[4] = '|(To stop receiving emails, <a ([\x00-\x7F]+)>click here<\/a>\.)|i';
$patterns[5] = '|(This email was sent to <a ([\x00-\x7F]+)>mrtest@email\.com<\/a>)|i';
$patterns[6] = '|(<p>&nbsp;</p>)|i';

// Default: strip every match; override individual indexes below
$replacements = array_fill(0, count($patterns), '');
$replacements[3] = "Specific Replacement";
$replacements[5] = "Second Replacement";

// Apply the patterns one at a time, in index order
foreach ($patterns as $k => $v) {
    $content = preg_replace($v, $replacements[$k], $content);
}

//$continue = TRUE;
?>
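One caveat with ([\x00-\x7F]+) is worth illustrating: it is greedy and can cross tag boundaries, so when another link appears earlier in the email the match can start at that earlier link. The sketch below uses made-up sample HTML; the [^>]* alternative is a suggestion, not something from the original script.

```php
<?php
$html = '<p><a href="http://a.example">Keep me</a></p>'
      . '<p><a href="http://b.example">Some Other Specific Linked Text</a></p>';

// [\x00-\x7F]+ starts at the FIRST "<a " and runs on to the target text,
// so the unrelated first link is removed as well.
$greedy = preg_replace(
    '/<a ([\x00-\x7F]+)>Some Other Specific Linked Text<\/a>/i',
    '',
    $html
);

// [^>]* cannot run past the end of the opening tag, so only the target link goes.
$bounded = preg_replace(
    '/<a\b[^>]*>Some Other Specific Linked Text<\/a>/i',
    '',
    $html
);

echo $greedy, "\n", $bounded, "\n";
```

Switching to the lazy form ([\x00-\x7F]+?) would not help here, because the match still begins at the first "<a " in the content. Also note that without the "u" modifier the range matches single bytes, so an accented or other non-ASCII character inside the tag stops the match.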



Here is my latest test file. I sent it from several email accounts, because Gmail strips out some things and leaves others, while other webmail services encode or strip out different things. The more email accounts you test from, the better.

CODE
<html>
<body>
<style type="text/css" scoped="scoped">
<!-- @media only screen and(max-width: 660px) {
table[class=w0], td[class=w0] {
width: 0 !important;
}
#messagebody div.rcmBody img[class=w10] {
width:10px !important;
}
#messagebody div.rcmBody td[class*=hide] {
display:none !important;
}
#messagebody div.rcmBody .ExternalClass {
width: 100%;
display:block !important;
}
#messagebody div.rcmBody table tr {
border-collapse: collapse;
}
#messagebody div.rcmBody, #messagebody div.rcmBody td {
font-family:'Helvetica Neue', Arial, Helvetica, Geneva, sans-serif;
}
#messagebody div.rcmBody .article-content img {
max-width: 100%
}
-->
</style>

<p><strong><em>Some Specific Text</em></strong></p>
<p><a style="max-width: 100%;width:400px !important;font-family: 'Helvetica Neue';" href="http://google.ca">Some Other Specific Linked Text</a>
</p>
<p><a href="http://google.ca">Some less difficult Linked Text</a>
</p>
This email was sent to <a style="color: #248acf;" href="mrtest@email.com">mrtest@email.com</a>.
To stop receiving emails, <a style="color: #248acf;" href="http://www.testing.ca/unsubscribe?e=c0ff&utm_campaign=crowdfund&amp;n=8">click here</a>.
</body>
</html>


Here is what this test email was converted into as a Joomla article, from two sources:
CODE

FROM HOST WEBMAIL
<p><strong><em>Specific Replacement</em></strong>
</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>Second Replacement.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>

FROM GMAIL
<p><span style="white-space: pre;"> </span>
</p>
<p><strong><em>Specific Replacement</em></strong>
</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>Second Replacement. <br /><br />
</p>


Some of the paragraph spacing is created by Joomla after parsing.

Here are some useful links for understanding preg_replace() better.
http://www.php.net/manual/en/ref.pcre.php
http://www.php.net/manual/en/reference.pcr...ttern.posix.php
http://ca2.php.net/manual/en/function.preg-replace.php
http://regex101.com/
http://www.regular-expressions.info/quickstart.html

This post has been edited by Web Design Seo: Apr 26 2016, 11:20 AM
supamic
post Feb 3 2014, 05:14 AM
Post #5


Newbie
*

Group: Members
Posts: 7
Joined: 27-November 12
Member No.: 1,454



TESTING UPDATE

I've found that the preg_replace regex patterns are sound, although some unpredictable results occur depending on the order in which the patterns are applied, combined with Joomla's code cleanup. To better understand what happens while the parsing script runs, without adding a bunch of test variables to be written to the cron log, I've added a marker to each replacement.

Alter your preg_replace line as below. Each match is then replaced with its replacement followed by a marker like r3 or r25, according to the pattern's index, for easy troubleshooting.
I've also added a line that removes the annoying "Fwd:" from subject titles if it has been left in by accident. The str_ireplace() function is case-insensitive.
CODE

$content = preg_replace($patterns[$k], $replacements[$k]."r$k", $content);

$subject = str_ireplace('Fwd: ', '', $subject);

I will be setting up a separate array to remove (or alter) static strings using str_ireplace(), because preg_replace() is probably overkill for them. Doing exact string matches and replacements before the regular-expression matches should also make it easier to understand the order in which the $content variable is changed.

http://ca1.php.net/manual/en/function.str-ireplace.php
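The two-pass idea described in this post (exact strings first, regular expressions second, with optional r<index> markers) might look roughly like this. It is a sketch under assumptions: the sample $content and $subject values, the rule arrays and the $debug switch are invented for the example.

```php
<?php
// Sample input standing in for $content and $subject (placeholder values).
$content = '<p>some specific text</p><p>&nbsp;</p><a href="#">Click Here</a>.';
$subject = 'FWD: Weekly update';

// Pass 1: static strings, matched case-insensitively by str_ireplace().
$staticSearch  = array('Some Specific Text', '<p>&nbsp;</p>');
$staticReplace = array('Specific Replacement', '');
$content = str_ireplace($staticSearch, $staticReplace, $content);

// Pass 2: markup that varies between emails, matched by preg_replace().
$patterns     = array('|<a\b[^>]*>click here</a>\.?|i');
$replacements = array('');

$debug = true; // when true, append "r<index>" so each regex hit shows in the article
foreach ($patterns as $k => $pattern) {
    $marker  = $debug ? "r$k" : '';
    $content = preg_replace($pattern, $replacements[$k] . $marker, $content);
}

// Remove a leftover forwarding prefix from the subject, ignoring case.
$subject = str_ireplace('Fwd: ', '', $subject);

echo $subject, "\n", $content, "\n";
```

Setting $debug to false once the patterns behave as expected removes the markers without touching the rules themselves.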


This post has been edited by supamic: Feb 3 2014, 06:38 AM
bossies
post Jul 22 2014, 03:19 PM
Post #6


Newbie
*

Group: Members
Posts: 9
Joined: 22-July 14
Member No.: 2,076



Thanks supamic, I am busy testing with your script. Do you just have to create your diy.php file, upload it to the directory and then select it to be used?
I am having trouble getting it to work.
theitd
post Jun 25 2015, 10:54 PM
Post #7


Newbie
*

Group: Members
Posts: 1
Joined: 31-October 14
From: UK
Member No.: 2,134



I've been working on what I hope is the mother-of-all parsers. It uses an open-source library, HTML Purifier, to strip, clean, tidy and add classes to elements. I hope someone finds this helpful.

To install:

  1. Go to the following directory: JOOMLA_ROOT/administrator/components/com_post_by_email/helpers/parsers/
  2. Download HTML Purifier Lite (http://htmlpurifier.org/download) and unzip it there
  3. Create a file called cleanup.php in the same directory and paste in the code below
  4. Make any necessary changes in the two editable sections near the top of the script ($allowedHtml and $class_list, roughly lines 19 - 42)


PBE Setup:

In order to make the most of the parser, set the following fields under the General tab in the mailbox setup:

  1. Parser: choose the cleanup.php file you just created
  2. Under Attachments, select 'Yes' to Show Attachments
  3. Default Position: Top
  4. Show attachment image: Yes


I've tested with Outlook and Gmail. If anyone has any questions, comments or requests, just let me know.

CODE
<?php
/**************************************************************

Variables you may use here (no need to explain their meaning):

$from, $subject, $content, $created

Others:
$email: the current email object being imported
$params: the parameters associated with it

/***
* @DIY Parser for Post By Email
* @author:TheITD
* @v0.1
*/
defined('_JEXEC') or die('Restricted access');

// --------------- EDIT THE ELEMENTS BELOW ACCORDING TO WHICH ONES YOU WANT TO KEEP ------->

$allowedHtml = '
a[href|title|target],
strong,b,em,i,strike,u,
p,ol,ul,li,
h1, h2, h3, h4, h5, h6, hr,
img[src|width|height|alt|title|class],
table[border|cellspacing|cellpadding|width|height|align|summary|bgcolor|background|bordercolor],
tr[rowspan|width|height|align|valign|bgcolor|background|bordercolor],tbody,thead,tfoot,
td[colspan|rowspan|width|height|align|valign|bgcolor|background|bordercolor|scope],
th[colspan|rowspan|width|height|align|valign|scope]
';

// --------------- ADD ANY CLASSES, WITH ASSOCIATED ELEMENTS TO THE FOLLOWING ARRAY ------->

$class_list = array(
//'p' => 'small', // As an example, this would result in <p class="small">
'img' => 'tm-article-image uk-border-circle auto-size img-left',
'ul' => 'check',
'ol' => 'uk-list-line'
);

// --------------- IT'S POSSIBLE THAT YOU WON'T NEED TO EDIT ANYTHING BELOW THIS LINE ------->

// Initialise the html purifier library
require_once (__DIR__ . '/htmlpurifier-4.6.0-lite/library/HTMLPurifier.auto.php');

// custom class to manipulate the element classes
if(!class_exists('HTMLPurifier_AttrTransform_AnchorClass')) {

class HTMLPurifier_AttrTransform_AnchorClass extends HTMLPurifier_AttrTransform
{
function __construct($class) {
$this->class = $class;
}

public function transform($attr, $config, $context)
{
// keep any predefined class and append the new one after a space
if (isset($attr['class'])) {
$attr['class'] .= ' ' . $this->class;
} else {
$attr['class'] = $this->class;
}
return $attr;
}
}
}

// set the configuration parameters
$config = HTMLPurifier_Config::createDefault();

// Configure the purifier cache
if (defined('PURIFIER_CACHE')) {
$config->set('Cache.SerializerPath', PURIFIER_CACHE);
} else {
// Disable the cache entirely
$config->set('Cache.DefinitionImpl', null);
}

$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');
$config->set('HTML.Allowed', $allowedHtml); // Use the above string to define what's acceptable and what's not
$config->set('AutoFormat.AutoParagraph', true); // Force text to be contained within a paragraph
$config->set('Core.EscapeInvalidTags', false); // Drop disallowed tags (e.g. MS Word metadata) instead of escaping them
$config->set('AutoFormat.RemoveEmpty', true); // Remove empty elements
$config->set('HTML.TidyLevel', 'heavy'); // Apply the most aggressive tidy fixes
$config->set('AutoFormat.Linkify', true); // Convert plain-text http, https and ftp URLs into <a href=...> links

// define tag transformations for bold and italic
$def = $config->getHTMLDefinition(true);
$def->info_tag_transform['b'] = new HTMLPurifier_TagTransform_Simple('strong'); // convert <b> to <strong>
$def->info_tag_transform['i'] = new HTMLPurifier_TagTransform_Simple('em'); // convert <i> to <em>
$def->info_tag_transform['br'] = new HTMLPurifier_TagTransform_Simple('p'); // convert line breaks to paragraphs

// add class to particular elements
if($class_list) {
foreach($class_list as $elem => $class) {
$element = $def->addBlankElement($elem);
$new_class = $class;
$element->attr_transform_post[] = new HTMLPurifier_AttrTransform_AnchorClass($new_class); // call the function and add the classes
}
}

// now for the good bit!
$purifier = new HTMLPurifier($config);
$content = $purifier->purify($content);
?>
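The class-merging step inside transform() can be tried without installing the library. The function below is a standalone stand-in written for illustration (append_class is a hypothetical name and is not part of HTML Purifier); it shows why a separating space matters when an element already carries a class.

```php
<?php
// Standalone stand-in for the transform() logic; does not use HTML Purifier.
function append_class(array $attr, $class)
{
    if (isset($attr['class']) && trim($attr['class']) !== '') {
        // Separate the existing and the added class names with a space.
        $attr['class'] = trim($attr['class']) . ' ' . $class;
    } else {
        $attr['class'] = $class;
    }
    return $attr;
}

$merged = append_class(array('src' => 'a.png', 'class' => 'photo'), 'img-left');
$fresh  = append_class(array('src' => 'a.png'), 'img-left');

echo $merged['class'], "\n", $fresh['class'], "\n";
```

Without the space, an existing class "photo" and an added "img-left" would run together as "photoimg-left", which matches neither CSS rule.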


Thanks for a great component - and I hope you're able to get back on the JED list soon, as this has been a fantastic feature for me, for years now.

This post has been edited by theitd: Jun 25 2015, 11:07 PM
Web Design Seo
post Jun 26 2015, 09:07 AM
Post #8


Web Design Seo
****

Group: Root Admin
Posts: 4,027
Joined: 29-April 09
From: Sofia
Member No.: 1



Thank you, @theitd! I will take a look at HTML Purifier.

