
Memory topics

Last post 06-26-2008, 6:01 AM by Alexey Kukanov. 3 replies.
 06-23-2008, 2:49 AM 30257474  

Memory topics

Hi!
I am evaluating TBB (so far only parallel_for) for possible use in our application.
So far, I am very satisfied with the results, but I have a few questions:




1. Since our application runs in real time, and has high demands on reliability, we
have a general rule that memory allocation/deallocation on the heap is forbidden during
run time. I have tried to read the source code (task.cpp), and as far as I could see,
TBB obeys this rule in the sense that parallel_for can cause memory to be allocated
on the heap the first few times it is called, but that this memory is re-used, so that
after a while, no more memory is allocated on the heap. Please let me know if this
understanding is correct!
2. One step in our computations requires a memory working area. So far, we have solved
this by declaring a fairly large (25 kB) static object. I understand that this is not
feasible in connection with parallel_for, since threads may access this object
concurrently. For now, I solved this by putting the memory on the stack, declaring
it as a local variable. But I suppose it is inefficient to repeatedly put such a
large object on the stack. Is there a better way to do it, e.g. by declaring thread
specific static objects, so that only one object per thread is created during the
entire program life time?
3. Are the threads created by parallel_for equivalent to Windows threads created by
_beginthread in the sense that all synchronization commands (e.g. spin_mutex or
CriticalSection) can be used to synchronize both kinds of threads?
 
 
 06-23-2008, 3:35 PM 30257521 in reply to 30257474  

Re: Memory topics

Thanks for your interest in Intel Threading Building Blocks. Regarding your questions:

1. Since a prime philosophy driving TBB is to maximize cache reuse, early allocation and stable memory use are important design goals. Memory allocation at run time is minimized but not forbidden. For example, on its first allocation the scalable allocator carves off a big chunk of memory, which it divides into separate pools for per-thread allocation, so you’d see some heap activity right at the beginning. Likewise, data structures like concurrent_hash_map do an initial allocation sufficient to handle a nominal group of entries but use binary growth should more space be needed. The threads used to handle TBB tasks are also allocated to a pool initially and reused, minimizing allocation thrash. If your program stabilizes in its resource use, heap allocations should also be fairly stable.

2. The latest release of Intel TBB provides a means to set the thread stack size, which enables stack allocation of larger buffers such as in your current practice. It sounds, though, like you’d really like to have some Thread Local Storage allocated for each of the pool threads. One of the new Intel TBB 2.1 features, the task_scheduler_observer, provides the hooks you need to set up a per-thread storage area for the TBB worker threads. I explain how to do it in my "under the hood" blog series.

3. The parallel_for doesn’t actually create any threads. Those are created and pooled when you create the task_scheduler_init object and stay around as long as that object exists. These are native threads, spawned in whichever Intel TBB supported OS your program happens to be running. The parallel_for submits a task to the TBB scheduler, which under parallel_for and using the blocked_range can split that task into a bunch of smaller ones and allocate pool threads to execute various subranges of the original range. Because the pool threads are native threads, synchronization primitives such as spin_mutex or a critical section work on them just as they do on threads you create yourself. But just because you can, it doesn’t necessarily mean you should. Intel TBB enabled programs run most efficiently when you let its unfair scheduler maximize parallelism and minimize cache thrashing by avoiding synchronization as much as is feasible for the algorithm.
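Regarding point 2, one alternative to the observer-based approach is the C++11 thread_local keyword, which did not exist when this thread was written. The following is only a minimal sketch under that assumption, not the task_scheduler_observer technique mentioned above. Workspace, threadWorkspace and otherThreadGetsOwnWorkspace are hypothetical names, and std::thread stands in for the TBB worker threads. Where the runtime places thread-local storage is platform dependent, so this does not by itself guarantee that no heap activity occurs when a thread starts.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical 25 kB working area, matching the size mentioned in the question.
struct Workspace {
    std::array<unsigned char, 25 * 1024> bytes;
};

// Returns the calling thread's own workspace. The object is created the first
// time a given thread calls this function and lives until that thread exits.
Workspace& threadWorkspace() {
    thread_local Workspace ws{};
    return ws;
}

// Stand-in for the heavy job from the question; it uses the per-thread
// scratch area instead of a shared static or a large stack variable.
void g(double& x) {
    Workspace& ws = threadWorkspace();
    ws.bytes[0] = 1;  // placeholder for real use of the scratch memory
    x *= 2.0;
}

// Checks that a second thread, alive at the same time as the caller,
// sees a different Workspace object than the caller does.
bool otherThreadGetsOwnWorkspace() {
    const Workspace* mine = &threadWorkspace();
    assert(mine != nullptr);
    const Workspace* theirs = nullptr;
    std::thread t([&theirs] { theirs = &threadWorkspace(); });
    t.join();
    return theirs != nullptr && theirs != mine;
}
```

Within one thread, every call to threadWorkspace() returns the same object, so the buffer is set up once per thread rather than once per call to g.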

 
 06-25-2008, 7:59 AM 30257634 in reply to 30257521  

Re: Memory topics

Thanks a lot!
I am still a bit concerned about the heap activities (due to the words "should" and
"fairly" in the answer!). I guess it is very application dependent, but consider the
following code snippet, borrowed from the TBB tutorial:

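#include <cstddef>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

const int size = 1000;
void g(double& x); // Does some heavy job on x
class Worker
{
public: // the constructor and operator() must be accessible to parallel_for
   Worker(double *a) : m_a(a) {}
   // Processes one subrange [r.begin(), r.end()) of the array
   void operator() (const tbb::blocked_range<size_t>& r) const
   {
      for (size_t i = r.begin(); i != r.end(); ++i)
      {
         g(m_a[i]);
      }
   }
private:
   double* const m_a;
};
void myFunction(double *a)
{
   // Read values to a from somewhere
   // Range [0, size) with a grain size of 50
   tbb::parallel_for(tbb::blocked_range<size_t>(0, size, 50), Worker(a));
   // Do something clever with the result
}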

Assume that the only thing the program does is to repeatedly call myFunction
(with a pre-allocated array), and that the left-out parts are harmless. Then,
is there a guarantee that after the first few calls, no more memory is allocated
on the heap?

 
 06-26-2008, 6:01 AM 30257694 in reply to 30257634  

Re: Memory topics

baffe:
Then, is there a guarantee that after the first few calls, no more memory is allocated on the heap?

Formally, there is no such guarantee. In TBB, task stealing is random-based and thus task distribution between threads varies from run to run. As the pools of reusable task objects are per thread, any particular thread in any particular run repetition may fall short of available task objects and request memory allocation.
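To illustrate why a per-thread pool of reusable objects can still allocate after a warm-up, here is a minimal sketch of a free list. It is not TBB's implementation; TaskObject, TaskPool and runOnce are hypothetical names, and a single pool stands in for the pool owned by one thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reusable object; `next` links objects into a list.
struct TaskObject {
    TaskObject* next = nullptr;
    char payload[64] = {};
};

// Free list owned by one thread: released objects are kept for reuse,
// and the heap is touched only when the list is empty.
class TaskPool {
public:
    TaskPool() = default;
    TaskPool(const TaskPool&) = delete;
    TaskPool& operator=(const TaskPool&) = delete;
    ~TaskPool() {
        while (free_ != nullptr) {
            TaskObject* t = free_;
            free_ = free_->next;
            delete t;
        }
    }
    TaskObject* acquire() {
        if (free_ == nullptr) {
            ++heapAllocations_;  // pool ran short: fall back to the heap
            return new TaskObject();
        }
        TaskObject* t = free_;
        free_ = t->next;
        t->next = nullptr;
        return t;
    }
    void release(TaskObject* t) {
        assert(t != nullptr);
        t->next = free_;
        free_ = t;
    }
    std::size_t heapAllocations() const { return heapAllocations_; }
private:
    TaskObject* free_ = nullptr;
    std::size_t heapAllocations_ = 0;
};

// Simulates one run in which this thread needs `demand` objects at once.
void runOnce(TaskPool& pool, std::size_t demand) {
    TaskObject* live = nullptr;
    for (std::size_t i = 0; i < demand; ++i) {
        TaskObject* t = pool.acquire();
        t->next = live;
        live = t;
    }
    while (live != nullptr) {
        TaskObject* t = live;
        live = live->next;
        pool.release(t);
    }
}
```

Three runs with demands of 10, 10 and 13 leave heapAllocations() at 10, 10 and 13: the second run is served entirely from the pool, while the third needs three more objects because this thread happened to receive more work than before.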

Practically, after several runs I would expect memory consumption to stop increasing noticeably, though allocation of a couple more tasks can still happen occasionally, as I described above. If TBB is used together with the TBB memory allocator, allocation of additional task objects will have little overhead on average, because the memory allocator serves requests for small objects (such as parallel_for tasks) from a preallocated block of virtual memory, without any kernel calls (unless the preallocated block runs out), and by using fast algorithms. So it might not be that bad even if it happens in the middle of execution.

You might run some experiments to check memory consumption behavior. If you decide to do that, try creating a test that is close to how you would actually use parallel_for and/or other TBB constructs in your application; an example that needs about 20 task objects (such as the one above) will definitely have different memory behavior than one requiring thousands of objects.
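When sizing such a test, it can help to see how a range and grain size turn into subranges. The sketch below assumes recursive halving until a piece is no larger than the grain size, which is how blocked_range behaves with a simple partitioner; Range, splitRange and leafRanges are hypothetical names. It counts leaf subranges only and says nothing about how many task objects the scheduler keeps alive at any one time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end), a simplified stand-in for
// tbb::blocked_range<size_t>.
struct Range {
    std::size_t begin;
    std::size_t end;
    std::size_t size() const { return end - begin; }
};

// Recursively halves `r` until each piece holds at most `grainsize` indices,
// appending the resulting leaf ranges to `leaves` in index order.
void splitRange(Range r, std::size_t grainsize, std::vector<Range>& leaves) {
    assert(grainsize > 0 && r.begin <= r.end);
    if (r.size() <= grainsize) {
        leaves.push_back(r);
        return;
    }
    const std::size_t middle = r.begin + r.size() / 2;
    splitRange(Range{r.begin, middle}, grainsize, leaves);
    splitRange(Range{middle, r.end}, grainsize, leaves);
}

// Convenience wrapper returning all leaf subranges of [begin, end).
std::vector<Range> leafRanges(std::size_t begin, std::size_t end,
                              std::size_t grainsize) {
    std::vector<Range> leaves;
    splitRange(Range{begin, end}, grainsize, leaves);
    return leaves;
}
```

Under these assumptions, leafRanges(0, 1000, 50) yields 32 subranges of 31 or 32 indices each, since the pieces of 62 and 63 indices are still larger than the grain size and get halved once more.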


Alexey Kukanov
TBB developer
 