PHP Crawler Dev Help

Hi everyone,

I was wondering if there is a good resource for PHP development help. I've done a lot of web searching and most of what I'm finding is rather inadequate. By resource I mean something along the lines of a hire-a-coder service, or even just some hand-holding while I iron out the kinks in my code, as some of what I'm doing is still beyond my current skills.

Any ideas/suggestions would be greatly appreciated. Thanks!

10 Replies

Hello,

The manual at http://www.php.net/manual/en/, together with the full function documentation available on that site, is a great reference.

There are thousands upon thousands of resources available on the internet regarding PHP. If you cannot find what you are searching for, you are most likely not being specific enough in your search terms.

If you are a beginner, picking up a book might be an idea.

If you have specific questions, I am sure someone will be able to give a hint or two here, or in the IRC channel.

Regards,

Ovron

Don't do it in PHP. It has horrible memory management and plenty of bugs that leak memory. If you want to execute long-running processes (like crawlers), PHP is your enemy.

@mst:

> Don't do it in PHP. It has horrible memory management and plenty of bugs that leak memory. If you want to execute long-running processes (like crawlers), PHP is your enemy.

{{citation needed}}

@Ovron:

> {{citation needed}}

Years of experience and specialization in screen scraping. PHP is excellent for web applications, shell scripts and practically anything that doesn't run for a long time or require multithreading. I have written crawlers in PHP several times. The results are always horrible, and if you can't afford the performance hit of restarting the process (e.g. when batch processing the contents of 250k+ URLs), PHP isn't your friend.

@mst:

> @Ovron:
>
> > {{citation needed}}
>
> Years of experience and specialization in screen scraping. PHP is excellent for web applications, shell scripts and practically anything that doesn't run for a long time or require multithreading. I have written crawlers in PHP several times. The results are always horrible, and if you can't afford the performance hit of restarting the process (e.g. when batch processing the contents of 250k+ URLs), PHP isn't your friend.

First of all, thanks for a helpful response. Secondly, what would you recommend doing it in, as opposed to PHP? I'm not attached to the idea, but I was hoping to keep it in PHP for the simple reason that I'd like to be able to initiate a crawl from the admin section of my site. Thanks again!

As you're running a VPS, I'd personally use Python:

http://www.example-code.com/python/spider_simpleCrawler.asp

@mst:

> Don't do it in PHP. It has horrible memory management and plenty of bugs that leak memory. If you want to execute long-running processes (like crawlers), PHP is your enemy.

While this is true, I think the biggest failing here is to assume that a crawler needs to be a "long-running process." I would suggest writing a scheduler script, using a database like MySQL to build a queue, and running many short-lived PHP tasks. The result? Better resource usage, and each crawler task need only handle one small job at a time. The aggregate data can be stored in another database. There's very little reason a crawler would need cross-crawl data, and just about any reason I can think of can be handled with a queuing database.

A very simple queue structure would be:

- queueUid: primary key
- queuePriority: allows for more complex queuing
- queueData: serialized array; can include instructions, cookie data, referrer, or anything else you would want to pass on
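To make the queue idea above a bit more concrete, here is a minimal sketch, not a definitive implementation. It is written in Python only so it can be run without a server: SQLite stands in for MySQL and JSON stands in for a serialized PHP array. The column names come from the structure described above; the table name `queue`, the function names, and the rule that a higher `queuePriority` is served first are all assumptions made for illustration.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Open the queue database and create the table if it is missing."""
    # isolation_level=None puts the connection in autocommit mode, so
    # transactions are controlled explicitly with BEGIN/COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " queueUid INTEGER PRIMARY KEY AUTOINCREMENT,"
        " queuePriority INTEGER NOT NULL DEFAULT 0,"
        " queueData TEXT NOT NULL)"
    )
    return conn


def enqueue(conn, data, priority=0):
    """Add one task; `data` is any JSON-serializable dict (URL, cookies, ...)."""
    conn.execute(
        "INSERT INTO queue (queuePriority, queueData) VALUES (?, ?)",
        (priority, json.dumps(data)),
    )


def claim_next(conn):
    """Remove and return the next task, or None if the queue is empty.

    Assumption: higher queuePriority first, then oldest queueUid first.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two workers
    # cannot both select and then delete the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT queueUid, queueData FROM queue"
            " ORDER BY queuePriority DESC, queueUid ASC LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("DELETE FROM queue WHERE queueUid = ?", (row[0],))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else json.loads(row[1])


if __name__ == "__main__":
    conn = open_queue()
    enqueue(conn, {"url": "http://example.com/", "referrer": None}, priority=1)
    print(claim_next(conn))
```

Each short-lived task would call something like `claim_next` once, fetch that one URL, store its results, enqueue any newly discovered links, and exit. With MySQL, the select-then-delete step would need equivalent locking (for example a locking read such as `SELECT ... FOR UPDATE` inside a transaction) so that concurrent tasks do not claim the same row.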

This, imo, is far better than just running a single long-running process in Python. For one, you can very easily control the number of tasks running at once, which makes this approach far easier to parallelize than multi-threading a single long-running process in any language.
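As a rough illustration of controlling the number of concurrent tasks, the sketch below launches each task as its own short-lived process and caps how many run at once. It is only a sketch under stated assumptions: the worker here is a hypothetical one-line stand-in command, where a real setup might run a PHP script per URL, and `run_task`, `run_batch`, and `max_workers=4` are names and values chosen for this example.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(url):
    """Run one short-lived worker process for a single URL.

    Returns (exit code, stripped stdout). The command is a hypothetical
    stand-in; a real worker would fetch and store the page.
    """
    cmd = [sys.executable, "-c",
           "import sys; print('crawled', sys.argv[1])", url]
    done = subprocess.run(cmd, capture_output=True, text=True)
    return done.returncode, done.stdout.strip()


def run_batch(urls, max_workers=4):
    """Process all URLs with at most `max_workers` worker processes alive."""
    # The threads only wait on child processes. Each child exits after
    # one task, so any memory it leaked is returned to the OS.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, urls))


if __name__ == "__main__":
    for code, output in run_batch(["http://example.com/a",
                                   "http://example.com/b"]):
        print(code, output)
```

The design choice is that the scheduler stays tiny and does no crawling itself; changing `max_workers` is the only knob needed to raise or lower the load.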

````
import os

print("Hello from process A")

# os.fork() returns 0 in the child and the child's PID in the parent
if not os.fork():
    print("Hello from process B")
````

How much easier does it get?

@Azathoth:

> ````
> import os
>
> print("Hello from process A")
>
> if not os.fork():
>     print("Hello from process B")
> ````
>
> How much easier does it get?

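````
<?php
// pcntl_fork() returns the child's PID in the parent, 0 in the child,
// and -1 on failure (not handled in this short example)
if ($pid = pcntl_fork())
    echo "Hello from process A";
else
    echo "Hello from process B";
````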

PHP

Hello jefe78,

Although mst is right to some extent, I would still suggest you stick with something easy and comfortable. If you have some experience with PHP, stick to it. It's easier to code in something you already know and then later port it to a different language for performance.

Well buddy, I myself am working on a couple of PHP projects too. If you get stuck somewhere, just shoot me a message and I will try my best to guide you.

Take Care. :)
