外文翻译--An efﬁcient model for investigating speciﬁc site binding of transcription factors

资源描述

《外文翻译--An efﬁcient model for investigating speciﬁc site binding of transcription factors》由会员分享，可在线阅读，更多相关《外文翻译--An efﬁcient model for investigating speciﬁc site binding of transcription factors（4页珍藏版）》请在金锄头文库上搜索。

1、An efficient model for investigating specific sitebinding of transcription factorsDavid J. Barnes and D. ChuSchool of ComputingUniversity of Kent, Canterbury, UKEmail: d.j.barnes,d.f.chukent.ac.ukAbstractWe present an individual agent-based model oftranscription factor binding to DNA. We explicitly

2、representevery single transcription factor and every single binding sitein our simulation. In order to obtain statistically significantresults from our model, it is necessary to perform a largenumber of simulations. This has previously been hampered by thecomputational demands of individual agent-ba

3、sed models. Herewe present an approach to make such simulations feasible usingapproaches related to dynamic memory allocations in computersystems.I. INTRODUCTIONInterdisciplinary approaches are becoming increasingly im-portant as a means to tackle the challenges posed by realworld problems (climate

4、change, management of resources,sustainable energy, etc.) but also by fundamental sciences.Systems and synthetic biology are the prototypes of these newsciences that amalgamate previously separate fields of enquiryinto a new discipline.One of the challenges of the new science of systems biologyis to

5、 understand the design principles of living systems: tobe able to model them accurately and to understand theirfundamental properties. Central to this is the need to under-stand gene-networks. Quite deservedly, gene networks havereceived significant attention in the past. From a biologicalpoint of v

6、iew, they are important in that they are at the heartof biological control. Therefore, they are of great interest froma synthetic biology/engineering point of view, in as much asthey are potentially components of bio-molecular computers.Gene networks are sets of interconnected genes, implement-ing t

7、he major functionality of the bio-molecular computationprocess. Each gene expresses a protein, which could thenactivate or repress one of the other genes in the network, orperform another useful function within the cell. In this sense,one can interpret genetic control of biological processes as acom

8、putational process.Both from a biological and from a computer science point ofview, it is then interesting to ask questions about fundamentalproperties of the system, such as the speed and accuracy ofcomputation in gene networks. In this contribution we willconcentrate on the question of speed. One

9、aspect that hascommanded significant attention from bioscientists, physicistsand systems biologists is the time required for regulatoryproteins transcription factors (TFs) to find their targetbinding site on the genome. The problem is as follows: Inorder to turn a gene on (or indeed repress it) the

10、TF needs tolocate a specific binding site. TFs do not only bind to theirspecific site but to some degree with all parts of the DNA.When binding to the DNA a TF actually binds to an l-longsequence of nucleotides. The binding strength depends on thematch between the bound sequence and an optimal patte

11、rnwhich represents the sequence of the specific binding site. Thecloser the match, the higher the affinity. While the bindingaffinity to specific sites is much higher than to most non-specific sites, the contribution of the latter is still significantenough to potentially distract a TF from locating

12、 its specificsite. Furthermore, there are millions of non-specific sites andonly few of the specific and active sites for each particularTF. Therefore, even though a TF spends very little time beingbound to each of the non-specific sites, it may take a significanttime to sample all of them before ev

13、entually the specific siteis found. The process of a TF finding its specific binding sitenecessarily limits the speed of a biological computation.This problem, which has been know for a long time, wasfirst addressed by Berg1, who proposed a random walkmodel of facilitated diffusion. The idea of this

14、 model is thatthe TF performs a mixed 1D and 3D random walk. The1D random walk explores a small adjacent neighborhoodof the DNA, while the 3D random walk allows the TF toexplore far-away, unconnected parts of the genome. It hasbeen suggested by Mirny and Wunderlich2, 3 that the mostefficient explora

15、tion of the genome, in the sense that it offersthe fastest location of the specific binding site, is achievedwhen the 3D and 1D components are weighted approximatelyequally.While these results are interesting, particularly because theyhave been obtained by analytic methods, they only considerone typ

16、e of TF. In reality, within each cell many different typesof TF will be performing their respective random walks. Wemight expect, therefore, that TFs get in each others way andthus affect the nature of the random walk. A particular problemis the so-called road-block effect, whereby the 1D randomwalk

17、 of a TF is blocked by another bound TF. Depending onthe density of TFs on the genome, this effect is potentiallysignificant, and has so far not been addressed.While addressing this problem is mathematically intractable,the alternative of simulation is potentially very expensivecomputationally. The

18、movement of many TFs on a realisticallysized genome has been found to limit the scalability of978-1-4244-4713-8/10/$25.00 2010 IEEEmodels4.In this article we will describe an efficient approach to build-ing a computer simulation model of TFs finding their specificbinding sites. This new approach wil

19、l allow realistically sizedsimulations, and significantly expand the scope of previousmodels. The essence of the model is an approach to memorymanagement that makes the problem scalable, regardless ofgenome occupancy. A theme is that runtime speed is obtainedthrough the use of memory. In section II

20、we will describe themodel. In section III, we describe some preliminary simulationresults to show the practicality of the approach, and we reportconclusions in section IV.II. THE MODELThe movement dynamics of TFs involves a search across adiscrete (but very high) number of spatially organised bindin

21、gsites. This suggests the potential for an individual agent-basedmodelling approach. The environment of the TF agents isa non-metric space, that is there is no measure of distancebetween the agents. Embedded in this space is the DNA itself,which is represented as a string of the symbols a,c,g,t with

22、continuous boundary condition. For all simulations reportedhere we used the genome of E.coli K125. At any given time,every agent is either bound to one of the binding sites of thegenome, or suspended in the non-metric space. We think ofthe space as a reservoir of currently unbound TFs.We define two

23、types of agents, namely focal and non-focalTFs. We are primarily interested in the former, yet the latterare important in that their presence on the DNA could interferewith the search dynamics of the focal TFs. The number of non-focal TFs is kept constant during a specific simulation run(for reasons

24、 of computational efficiency), whereas the focalTFs are created and degraded with user-defined rates, henceparticle numbers within the cell fluctuate over time.Focal TFs have a definite binding motif m that is used todetermine their binding energy and, hence, their mean bindingtime at every DNA bind

25、ing site in the model. If the length ofthe binding motif m is l then the binding free energy to aparticular sequence is calculated as follows:Fs=l?i=1imi,si(1)Here, miis the i-th entry of the motif m, sithe correspondingbase of the actual binding sequence s and ithe empiricallydetermined weighting f

26、actor of the binding motif. In contrast,non-focal TFs do not have specific binding sites; rather, theyshare low, position-dependent affinities to all sites on the DNA.The affinity values for both types of TF are pre-calculatedfor every position on the DNA and stored in arrays of thesame length as th

27、e DNA, making binding-time calculation veryefficient.The model update algorithm is event based, with threedistinct events available at each step:Create a focal TF.Bind a TF of either type to the DNA.Unbind a TF from the DNA.Unbind events can result in complete unbinding into thereservoir or short, l

28、ocal 1D movements. Essential for thereliability of the model is to design the update algorithm sothat the behaviour of the model is correct with respect tothe choice of parameters (in the sense that it reproduces thestatistics implied by the various binding and unbinding rates).To achieve this, we h

29、ave adapted the Gillespie algorithm6 toschedule events.A. Scheduling of eventsAt model initialization all non-focal TFs are created andseeded onto random locations on the DNA via bind events attime zero. If there is insufficient space then the excess endup in the reservoir. Then the creation times o

30、f all focal TFsare determined according to a user-defined rate, and creationevents scheduled accordingly. Their lifetime is also determinedat creation with a random number drawn from an exponentialdistribution with a mean of 1 over the deletion rate.When created, a focal TF will immediately attempt

31、to bindto a site on the genome with a user-defined probability; anysuch attempt is successful with a probability p = Nfree/Nrangewhere Nrangeis the total number of binding sites in rangeand Nfreeis the number of unoccupied sites in that range.We specify Nrangebecause the initial bind attempt for a f

32、ocalTF takes place within a limited, user-defined birth range onthe DNA. This models the effect that (in bacterial cells)transcription and translation are performed in one step andhence proteins are produced close to their gene.If the newly-created TF does not attempt to bind, or is notsuccessful at

33、 binding, then it is placed in the reservoir andmay have the opportunity to attempt a general bind (i.e., oneover the full range of the DNA) at a later time. The rangerestriction only applies to the initial binding attempt of a focalTF. General binding is used both to seed initial occupancy ofthe DN

34、A with non-focal TFs, and to support binding of bothtypes of TF from the reservoir. A random available bindingsite is chosen from the full length of the DNA.At the completion of every update step there is a probabilitythat an unbound TF might attempt to bind from the reservoir.The time to the bind e

35、vent is drawn from an exponentialdistribution with a mean of 1 over a value that depends uponthe number of unbound TFs Tu, the number of availablebinding sites Nfreealong the full range of the DNA, and aconstant factor k:P(bind) = (kNfreeTu)(2)A new binding event will only be scheduled if it wouldoc

36、cur before the next already scheduled event. This is becausethe binding probability depends on the current availability ofbinding sites which generally changes over time.The duration time of a DNA-protein bond depends on theaffinity of the type of TF for its binding site; specifically, forfocal TFs

37、this affinity is determined from equation 1.At every unbind event, the next state of the TF is determinedstochastically. Assuming that the TF has not reached the end ofits life (in which case it would be completely deleted), witha user-defined probability one of the following options willapply to it

38、:the TF will attempt to make a one place move left orright (an immediately scheduled bind event);the TF will attempt a short move within a user-definedrange either side of the previous binding site (an imme-diately scheduled bind event);the TF goes into the reservoir.Either move could fail, due to r

39、oadblocks, and lead to theTF going into the reservoir. It should be clear from the abovedescription that, on each iteration, the heart of the event loop isprimarily concerned with: placing a TF on the DNA; removinga TF from the DNA; or both. Therefore, identifying freesections on the DNA is a potent

40、ial performance bottleneckthat could prevent scaling of the method to realistic sizes ofboth DNA and numbers of TFs.B. The memory modelThe key to efficient implementation of binding and move-ment is the fast identification of available binding sites i.e., not just empty bases but runs of bases that

41、are at leastas long as the binding motif (see eq 1) and can thus supportbinding of a TF. A naive representation of the DNA mightbe an array of Boolean values, one for each possible site,recording whether a site is currently occupied by a boundTF or not. In this implementation, an attempt to bind wou

42、ldinvolve the generation of a random number within the desiredlocation range and a check as to whether that location is freeor not. If it is not free then options might be: abandoningthe attempt immediately; searching from that location in oneor other direction until a free site is found; or identif

43、ying afresh random location and repeating the process until a freesite is found. While simple to implement, the weakness ofthis approach is immediately clear as the time to find a freelocation is dependent upon the occupancy of the DNA. Indeed,even when there are plenty of free individual bases, the

44、re areno guarantees that a long enough consecutive run will existto allow a TF to bind, and the approach outlined above mustensure that a search in vain will ultimately terminate.Using this scheme the time to locate a free binding sitedepends on the occupancy of the DNA, and scales poorlywith the si

45、ze of the genome. In this model we thereforeuse a different approach that can find binding sites withina time independent of the occupancy. Rather than an un-structured array of Boolean status values we maintain a datastructure that records all the remaining bindable sections ofthe DNA, as (position

46、,length) pairs. Note that becausebinding and unbinding occur at irregular intervals, sectionsof binding sites are occupied and freed according to noparticular regular pattern. The resulting space managementproblem is akin to dynamic storage allocation in programruntime environments7, as opposed to s

47、tack (last-in, first-out)memory management, for instance. A significant difference,however, is that traditional allocation algorithms, such as firstfit and best fit, are inapplicable in this context, because theFig. 1.DNA section illustrating fragmentation and de-fragmentation duringTF binding and u

48、nbinding: a) Two bound TFs, fragments and an availablesection; b) A third TF binds, resulting in two new fragments; c) A TF unbinds,fragments become available again.memory manager must always allocate a particular section offree space that has been selected by the bind event, ratherthan having a fre

49、e choice. In common with dynamic memoryallocation, available space quickly becomes “fragmented”. Forinstance, consider a run of l + n unoccupied sites, where l isthe length of a TF to be bound and n = l (Fig. 1a). Thissequences offers n + 1 potential binding sites before a bindbut anywhere between 0

50、 and n l + 1 sites after the bind,depending on where the bind takes place within the run andthe size of n in comparison to l. If the TF were to bind acrossthe middle of the section then the two fragments either sidemay well be too short to support another TF (Fig. 1b). As aresult, the data structure

51、 recording bindable sections must besupplemented by a similar data structure recording unbindablefragments. For both we use the set associative container fromthe C+ STL8, which provides efficient access via its keywhich, in our case, is the binding position.Note that a fragment resulting from the bi

52、nd of one TF maybecome usable before that particular TF unbinds as a resultof the earlier bound TF occupying the adjacent section at theother end of the fragment becoming unbound (Fig. 1c). Indeed,most of the complexity of the bind-unbind cycle exists at thepoint where a TF unbinds and the section i

53、t occupies becomesavailable again. Before being returned to the set of availablesections it must be reunited with any fragments at either end.In addition, the newly freed section may now be contiguouswith another already available section, in which case the twocan be coalesced into one.III. RESULTSI

54、n previous versions of the model 4 the size of the genomeand the number of TFs have been clear limiting factors forthe simulation. The memory-model discussed above has ledto a significant streamlining of the simulation execution time,so as to make realistic model sizes feasible where they wereinfeas

55、ible before.In this example, the genome has 4,639,675 potential bind-ing sites. With up to at least 1,000,000 TFs, the simulationprocessed 100,000 events per second, independent of the actualnumber of TFs. We used a Mac Pro 3GHz quad-core IntelXeon processor with 8G memory running Mac OSX 10.4.We se

56、t up the simulation such that the TFs were createdwithin a distance of size 2, 100 and 400 bases away from 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 5 10 15 20 25 30 35 40 45DensityBinding free energyFig. 2.A histogram of the binding free energies as calculated from eq. 1.The energies are Gaussian

57、 distributed.a central location. In all the experiments we report, the TFswere produced with a rate of 0.01 and were broken down ata rate of 0.0009. The binding free energies of the bindingsites turned out to be distributed according to a Gaussiandistribution (see Fig. 2). To test the model we perfo

58、rmed anumber of experiments to understand the mean first bindingtime to the specific binding site located at 4,540,692, which isone of the specific binding sites for the protein FimB in E.coli.For each release area we performed 10,000 experiments, eachreturning the time it took from starting the sim

59、ulation to thetime one of the focal TFs bound to the specific binding site;from these 10,000 time points we calculated the mean. Doingthis for different conditions we obtained Fig. 3, which showsthe mean time to the first binding event as a function of thedistance of the centre of the drop point. Th

60、e experimental datashows a strong dependence of the mean first binding time onthe drop point, suggesting that the location of production of TFis a crucial factor for the efficiency of TF binding. However,it should be noted that in the present example, the region overwhich TFs are dropped is very sma

61、ll compared to the totalsize of the DNA. Hence, the data presented here should beinterpreted cautiously.IV. DISCUSSION ANDCONCLUSIONIn this contribution we have presented a model that supportsefficient simulation of the process of TFs finding their specificand non-specific binding sites. One of the

62、problems that wehad identified was that realistic simulations involving roadblocks are computationally extremely demanding. For thisreason, modeling of specific binding site localisation has beenrestricted to mathematically tractable but unrealistic models.Here we have made the first steps towards a

63、 computationallyfeasible implementation. One of the bottlenecks we haveidentified is the localisation of free binding sites on the DNA.By adapting approaches from dynamic memory allocationwe were able to achieve speedups with respect to a naivealgorithm of many orders of magnitude.Apart from finding

64、 an efficient simulation implementation, 100 1000 10000 100000 1e+06 4.540200e+06 4.540400e+06 4.540600e+06 4.540800e+06 4.541000e+06 4.541200e+06Mean time to find binding sitePosition of DropDrop Size 100Drop Size 1Fig. 3.Two example simulations showing the mean time required to locatethe specific

65、binding site as a function of the drop size and drop position.another challenge of modeling this system is the fact thatthe biological rules of the system are underdetermined. Forexample, on a crowded genome, it will often be the casethat one TF tries to move but is blocked by another (a roadblock).

66、 In this case, the computer modeler has to formulatea rule that stipulates what should happen in this case. Un-fortunately, despite the advance of single-molecule imagingtechnologies9, that have provided us with movies of howTFs scan non-specific binding sites in live cells, precisely whathappens wh

67、en a TF is blocked by another on the DNA isunknown. As our preliminary simulation showed, the resultsof simulations are sensitive to these assumptions. Hence, itis even more important to further drive the development ofefficient simulation algorithms to be able to compare theeffects of different ass

68、umptions systematically.REFERENCES1 R. Winter, O. Berg, and P. von Hippel, “Diffusion-driven mechanismsof protein translocation on nucleic acids. 3. The escherichia coli lacrepressor operator interaction: kinetic measurements and conclusions.”Biochemistry, vol. 20, no. 24, pp. 69616977, Nov 1981.2 Z

69、.WunderlichandL.Mirny,“Spatialeffectsonthespeedandreliabilityofprotein-DNAsearch.”NucleicAcidsResearch,vol. 36, no. 11, pp. 35703578, Jun 2008. Online. Available:http:/dx.doi.org/10.1093/nar/gkn1733 M. Slutsky, M. Kardar, and L. Mirny, “Diffusion in correlated randompotentials, with applications to

70、DNA,” Physical Review E, vol. 69, no. 6,p. 061903, Jun 2004.4 D. Chu, N. Zabet, and B. Mitavskiy, “Models of transcription factor bind-ing: Sensitivity of activation functions to model assumptions,” Journal ofTheoretical Biology, vol. 257, no. 3, pp. 419429, 2009.5 U. of Wisconsin Madison, “E. coli

71、genome project,” 2009. Online.Available: http:/www.genome.wisc.edu/6 D. T. Gillespie, “Exact stochastic simulation of coupled chemical reac-tions,” Journal of Physical Chemistry, vol. 81, no. 25, pp. 23402361,1977.7 D. E. Knuth, Art of Computer Programming, Volume 1: FundamentalAlgorithms (3rd Editi

72、on).Addison-Wesley Professional, July 1997.8 S. Meyers, Effective STL: 50 Specific Ways to Improve the Use of theStandard Template Library, ser. Addison-Wesley Professional ComputingSeries.Addison Wesley, July 2001.9 J. Elf, G. Li, and X. Xie, “Probing transcription factor dynamicsatthesingle-moleculelevelinalivingcell.”Science,vol.316,no.5828,pp.11911194,May2007.Online.Available:http:/dx.doi.org/10.1126/science.1141967

展开阅读全文

外文翻译--An efﬁcient model for investigating speciﬁc site binding of transcription factors

最新文档